IBM SPSS Modeler 15 Modeling Nodes

Note: Before using this information and the product it supports, read the general information under Notices on p. 483.

This edition applies to IBM SPSS Modeler 15 and to all subsequent releases and modifications until otherwise indicated in new editions. Adobe product screenshot(s) reprinted with permission from Adobe Systems Incorporated. Microsoft product screenshot(s) reprinted with permission from Microsoft Corporation.

Licensed Materials - Property of IBM
© Copyright IBM Corporation 1994, 2012. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Preface

IBM® SPSS® Modeler is the IBM Corp. enterprise-strength data mining workbench. SPSS Modeler helps organizations improve customer and citizen relationships through an in-depth understanding of data. Organizations use the insight gained from SPSS Modeler to retain profitable customers, identify cross-selling opportunities, attract new customers, detect fraud, reduce risk, and improve government service delivery.

SPSS Modeler's visual interface invites users to apply their specific business expertise, which leads to more powerful predictive models and shortens time-to-solution. SPSS Modeler offers many modeling techniques, such as prediction, classification, segmentation, and association detection algorithms. Once models are created, IBM® SPSS® Modeler Solution Publisher enables their delivery enterprise-wide to decision makers or to a database.

About IBM Business Analytics

IBM Business Analytics software delivers complete, consistent and accurate information that decision-makers trust to improve business performance. A comprehensive portfolio of business intelligence, predictive analytics, financial performance and strategy management, and analytic applications provides clear, immediate and actionable insights into current performance and the ability to predict future outcomes.
Combined with rich industry solutions, proven practices and professional services, organizations of every size can drive the highest productivity, confidently automate decisions and deliver better results.

As part of this portfolio, IBM SPSS Predictive Analytics software helps organizations predict future events and proactively act upon that insight to drive better business outcomes. Commercial, government and academic customers worldwide rely on IBM SPSS technology as a competitive advantage in attracting, retaining and growing customers, while reducing fraud and mitigating risk. By incorporating IBM SPSS software into their daily operations, organizations become predictive enterprises, able to direct and automate decisions to meet business goals and achieve measurable competitive advantage. For further information, or to reach a representative, visit http://www.ibm.com/spss.

Technical support

Technical support is available to maintenance customers. Customers may contact Technical Support for assistance in using IBM Corp. products or for installation help for one of the supported hardware environments. To reach Technical Support, see the IBM Corp. web site at http://www.ibm.com/support. Be prepared to identify yourself, your organization, and your support agreement when requesting assistance.

Contents

1 About IBM SPSS Modeler . . . 1

    IBM SPSS Modeler Products . . . 1
        IBM SPSS Modeler . . . 1
        IBM SPSS Modeler Server . . . 2
        IBM SPSS Modeler Administration Console . . . 2
        IBM SPSS Modeler Batch . . . 2
        IBM SPSS Modeler Solution Publisher . . . 2
        IBM SPSS Modeler Server Adapters for IBM SPSS Collaboration and Deployment Services . . . 2
    IBM SPSS Modeler Editions . . . 3
    IBM SPSS Modeler Documentation . . . 4
        SPSS Modeler Professional Documentation . . . 4
        SPSS Modeler Premium Documentation . . . 5
    Application Examples . . . 5
    Demos Folder . . . 6

2 Introduction to Modeling . . . 7

    Building the Stream . . . 9
    Browsing the Model . . . 15
    Evaluating the Model . . . 19
    Scoring Records . . . 22
    Summary . . . 23

3 Modeling Overview . . . 24

    Overview of Modeling Nodes . . . 24
    Building Split Models . . . 30
        Splitting and Partitioning . . . 33
        Modeling Nodes Supporting Split Models . . . 33
        Features Affected by Splitting . . . 34
    Modeling Node Fields Options . . . 35
        Using Frequency and Weight Fields . . . 38
    Modeling Node Analyze Options . . . 39
        Propensity Scores . . . 41
    Model Nuggets . . . 43
        Model Links . . . 43
        Replacing a Model . . . 46
        The Models Palette . . . 47
        Browsing Model Nuggets . . . 49
        Model Nugget Summary / Information . . . 50
        Predictor Importance . . . 51
        Models for Ensembles . . . 53
        Model Nuggets for Split Models . . . 61
        Using Model Nuggets in Streams . . . 63
        Regenerating a Modeling Node . . . 64
        Importing and Exporting Models as PMML . . . 65
        Publishing Models for a Scoring Adapter . . . 68
    Unrefined Models . . . 69

4 Screening Models . . . 70

    Screening Fields and Records . . . 70
    Feature Selection Node . . . 70
        Feature Selection Model Settings . . . 71
        Feature Selection Options . . . 72
    Feature Selection Model Nuggets . . . 74
        Feature Selection Model Results . . . 74
        Selecting Fields by Importance . . . 76
        Generating a Filter from a Feature Selection Model . . . 76
    Anomaly Detection Node . . . 77
        Anomaly Detection Model Options . . . 79
        Anomaly Detection Expert Options . . . 80
    Anomaly Detection Model Nuggets . . . 81
        Anomaly Detection Model Details . . . 82
        Anomaly Detection Model Summary . . . 83
        Anomaly Detection Model Settings . . . 84

5 Automated Modeling Nodes . . . 86

    Automated Modeling Node Algorithm Settings . . . 87
    Automated Modeling Node Stopping Rules . . . 88
    Auto Classifier Node . . . 89
        Auto Classifier Node Model Options . . . 91
        Auto Classifier Node Expert Options . . . 93
        Misclassification Costs . . . 95
        Auto Classifier Node Discard Options . . . 96
        Auto Classifier Node Settings Options . . . 97
    Auto Numeric Node . . . 98
        Auto Numeric Node Model Options . . . 99
        Auto Numeric Node Expert Options . . . 101
        Auto Numeric Node Settings Options . . . 104
    Auto Cluster Node . . . 104
        Auto Cluster Node Model Options . . . 105
        Auto Cluster Node Expert Options . . . 107
        Auto Cluster Node Discard Options . . . 109
    Automated Model Nuggets . . . 110
        Generating Nodes and Models . . . 112
        Generating Evaluation Charts . . . 113
        Evaluation Graphs . . . 114

6 Decision Trees . . . 116

    Decision Tree Models . . . 116
    The Interactive Tree Builder . . . 119
        Growing and Pruning the Tree . . . 120
        Defining Custom Splits . . . 121
        Split Details and Surrogates . . . 123
        Customizing the Tree View . . . 124
        Gains . . . 126
        Risks . . . 134
        Saving Tree Models and Results . . . 135
        Generating Filter and Select Nodes . . . 139
        Generating a Rule Set from a Decision Tree . . . 139
        Building a Tree Model Directly . . . 140
    Decision Tree Nodes . . . 141
        C&R Tree Node . . . 143
        CHAID Node . . . 144
        QUEST Node . . . 144
        Decision Tree Node Fields Options . . . 145
        Decision Tree Node Build Options . . . 146
        Decision Tree Node Model Options . . . 158
    C5.0 Node . . . 160
        C5.0 Node Model Options . . . 162
    Decision Tree Model Nuggets . . . 164
        Single Tree Model Nuggets . . . 165
        Model Nuggets for Boosting, Bagging and Very Large Datasets . . . 174
    Rule Set Model Nuggets . . . 175
        Rule Set Model Tab . . . 176
    Importing Projects from AnswerTree 3.0 . . . 178

7 Bayesian Network Models . . . 179

    Bayesian Network Node . . . 179
        Bayesian Network Node Model Options . . . 181
        Bayesian Network Node Expert Options . . . 183
    Bayesian Network Model Nuggets . . . 185
        Bayesian Network Model Settings . . . 186
        Bayesian Network Model Summary . . . 187

8 Neural Networks . . . 189

    The Neural Networks Model . . . 189
    Using Neural Networks with Legacy Streams . . . 190
    Objectives . . . 191
    Basics . . . 193
    Stopping Rules . . . 194
    Ensembles . . . 195
    Advanced . . . 196
    Model Options . . . 197
    Model Summary . . . 198
    Predictor Importance . . . 199
    Predicted By Observed . . . 200
    Classification . . . 201
    Network . . . 202
    Settings . . . 203

9 Decision List . . . 204

    Decision List Model Options . . . 209
    Decision List Node Expert Options . . . 211
    Decision List Model Nugget . . . 212
        Decision List Model Nugget Settings . . . 213
    Decision List Viewer . . . 213
        Working Model Pane . . . 214
        Alternatives Tab . . . 216
        Snapshots Tab . . . 218
        Working with Decision List Viewer . . . 219

10 Statistical Models . . . 238

    Linear Node . . . 239
        Linear models . . . 239
        Objectives . . . 241
        Basics . . . 242
        Model Selection . . . 244
        Ensembles . . . 245
        Advanced . . . 246
        Model Options . . . 247
        Model Summary . . . 248
        Automatic Data Preparation . . . 249
        Predictor Importance . . . 250
        Predicted By Observed . . . 251
        Residuals . . . 252
        Outliers . . . 253
        Effects . . . 254
        Coefficients . . . 255
        Estimated Means . . . 257
        Model Building Summary . . . 258
        Settings . . . 259
    Logistic Node . . . 259
        Logistic Node Model Options . . . 260
        Adding Terms to a Logistic Regression Model . . . 265
        Logistic Node Expert Options . . . 266
        Logistic Regression Convergence Options . . . 267
        Logistic Regression Advanced Output . . . 268
        Logistic Regression Stepping Options . . . 270
    Logistic Model Nugget . . . 271
        Logistic Nugget Model Details . . . 272
        Logistic Model Nugget Summary . . . 273
        Logistic Model Nugget Settings . . . 274
        Logistic Model Nugget Advanced Output . . . 276
    PCA/Factor Node . . . 277
        PCA/Factor Node Model Options . . . 278
        PCA/Factor Node Expert Options . . . 279
        PCA/Factor Node Rotation Options . . . 280
    PCA/Factor Model Nugget . . . 281
        PCA/Factor Model Nugget Equations . . . 281
        PCA/Factor Model Nugget Summary . . . 282
        PCA/Factor Model Nugget Advanced Output . . . 284
    Discriminant Node . . . 285
        Discriminant Node Model Options . . . 286
        Discriminant Node Expert Options . . . 286
        Discriminant Node Output Options . . . 288
        Discriminant Node Stepping Options . . . 290
    Discriminant Model Nugget . . . 291
        Discriminant Model Nugget Advanced Output . . . 292
        Discriminant Model Nugget Settings . . . 292
        Discriminant Model Nugget Summary . . . 293
    GenLin Node . . . 294
        GenLin Node Field Options . . . 295
        GenLin Node Model Options . . . 296
        GenLin Node Expert Options . . . 297
        Generalized Linear Models Iterations . . . 301
        Generalized Linear Models Advanced Output . . . 302
    GenLin Model Nugget . . . 303
        GenLin Model Nugget Advanced Output . . . 305
        GenLin Model Nugget Settings . . . 305
        GenLin Model Nugget Summary . . . 306
    GLMM Node . . . 307
        Generalized linear mixed models . . . 307
        Target . . . 310
        Fixed Effects . . . 313
        Random Effects . . . 316
        Weight and Offset . . . 319
        Build Options . . . 320
        General . . . 321
        Estimated Means . . . 322
        Model view . . . 323
    Cox Node . . . 336
        Cox Node Fields Options . . . 337
        Cox Node Model Options . . . 338
        Cox Node Expert Options . . . 341
        Cox Node Settings Options . . . 344
    Cox Model Nugget . . . 345
        Cox Regression Output Settings . . . 345
        Cox Regression Advanced Output . . . 345

11 Clustering Models . . . 347

    Kohonen Node . . . 348
        Kohonen Node Model Options . . . 350
        Kohonen Node Expert Options . . . 352
    Kohonen Model Nuggets . . . 353
        Kohonen Model Summary . . . 353
    K-Means Node . . . 354
        K-Means Node Model Options . . . 355
        K-Means Node Expert Options . . . 356
    K-Means Model Nuggets . . . 357
        K-Means Model Summary . . . 357
    TwoStep Cluster Node . . . 358
        TwoStep Cluster Node Model Options . . . 359
    TwoStep Cluster Model Nuggets . . . 360
        TwoStep Model Summary . . . 361
    The Cluster Viewer . . . 361
        Cluster Viewer - Model Tab . . . 362
        Navigating the Cluster Viewer . . . 372
        Generating Graphs from Cluster Models . . . 374

12 Association Rules . . . 377

    Tabular versus Transactional Data . . . 378
    Apriori Node . . . 379
        Apriori Node Model Options . . . 380
        Apriori Node Expert Options . . . 381
    CARMA Node . . . 383
        CARMA Node Fields Options . . . 383
        CARMA Node Model Options . . . 386
        CARMA Node Expert Options . . . 387
    Association Rule Model Nuggets . . . 388
        Association Rule Model Nugget Details . . . 388
        Association Rule Model Nugget Settings . . . 395
        Association Rule Model Nugget Summary . . . 397
        Generating a Rule Set from an Association Model Nugget . . . 398
        Generating a Filtered Model . . . 399
        Scoring Association Rules . . . 400
        Deploying Association Models . . . 401
    Sequence Node . . . 404
        Sequence Node Fields Options . . . 405
        Sequence Node Model Options . . . 406
        Sequence Node Expert Options . . . 407
    Sequence Model Nuggets . . . 409
        Sequence Model Nugget Details . . . 411
        Sequence Model Nugget Settings . . . 413
        Sequence Model Nugget Summary . . . 414
        Generating a Rule SuperNode from a Sequence Model Nugget . . . 415

13 Time Series Models . . . 417

    Why Forecast? . . . 417
    Time Series Data . . . 417
        Characteristics of Time Series . . . 417
        Autocorrelation and Partial Autocorrelation Functions . . . 422
        Series Transformations . . . 423
    Predictor Series . . . 423
    Time Series Modeling Node . . . 424
        Requirements . . . 425
        Time Series Model Options . . . 427
        Time Series Expert Modeler Criteria . . . 429
        Time Series Exponential Smoothing Criteria . . . 431
        Time Series ARIMA Criteria . . . 432
        Transfer Functions . . . 434
        Handling Outliers . . . 436
    Generating Time Series Models . . . 437
        Generating Multiple Models . . . 437
        Using Time Series Models in Forecasting . . . 437
        Reestimating and Forecasting . . . 437
    Time Series Model Nugget . . . 438
        Time Series Model Parameters . . . 442
        Time Series Model Residuals . . . 443
        Time Series Model Summary . . . 444
        Time Series Model Settings . . . 445

14 Self-Learning Response Node Models . . . 446

    SLRM Node . . . 446
        SLRM Node Fields Options . . . 447
        SLRM Node Model Options . . . 448
        SLRM Node Settings Options . . . 450
    SLRM Model Nuggets . . . 451
        SLRM Model Settings . . . 453

15 Support Vector Machine Models . . . 455

    About SVM . . . 455
    How SVM Works . . . 455
    Tuning an SVM Model . . . 456
    SVM Node . . .
SVM Node Model Options 458
SVM Node Expert Options 458
SVM Model Nugget 460
SVM Model Settings 461

16 Nearest Neighbor Models 462

KNN Node 462
KNN Node Objectives Options 463
KNN Node Settings 464
KNN Model Nugget 473
Model View 474
KNN Model Settings 481

Appendix A
Notices 483

Index 486

Chapter 1
About IBM SPSS Modeler

IBM® SPSS® Modeler is a set of data mining tools that enable you to quickly develop predictive models using business expertise and deploy them into business operations to improve decision making. Designed around the industry-standard CRISP-DM model, SPSS Modeler supports the entire data mining process, from data to better business results. SPSS Modeler offers a variety of modeling methods taken from machine learning, artificial intelligence, and statistics.
The methods available on the Modeling palette allow you to derive new information from your data and to develop predictive models. Each method has certain strengths and is best suited for particular types of problems.

SPSS Modeler can be purchased as a standalone product, or used as a client in combination with SPSS Modeler Server. A number of additional options are also available, as summarized in the following sections. For more information, see http://www.ibm.com/software/analytics/spss/products/modeler/.

IBM SPSS Modeler Products

The IBM® SPSS® Modeler family of products and associated software comprises the following:

- IBM SPSS Modeler
- IBM SPSS Modeler Server
- IBM SPSS Modeler Administration Console
- IBM SPSS Modeler Batch
- IBM SPSS Modeler Solution Publisher
- IBM SPSS Modeler Server adapters for IBM SPSS Collaboration and Deployment Services

IBM SPSS Modeler

SPSS Modeler is a functionally complete version of the product that you install and run on your personal computer. You can run SPSS Modeler in local mode as a standalone product, or use it in distributed mode along with IBM® SPSS® Modeler Server for improved performance on large data sets. With SPSS Modeler, you can build accurate predictive models quickly and intuitively, without programming. Using the unique visual interface, you can easily visualize the data mining process. With the support of the advanced analytics embedded in the product, you can discover previously hidden patterns and trends in your data. You can model outcomes and understand the factors that influence them, enabling you to take advantage of business opportunities and mitigate risks. SPSS Modeler is available in two editions: SPSS Modeler Professional and SPSS Modeler Premium. For more information, see the topic IBM SPSS Modeler Editions on p. 3.
IBM SPSS Modeler Server

SPSS Modeler uses a client/server architecture to distribute requests for resource-intensive operations to powerful server software, resulting in faster performance on larger data sets. SPSS Modeler Server is a separately-licensed product that runs continually in distributed analysis mode on a server host in conjunction with one or more IBM® SPSS® Modeler installations. In this way, SPSS Modeler Server provides superior performance on large data sets because memory-intensive operations can be done on the server without downloading data to the client computer. IBM® SPSS® Modeler Server also provides support for SQL optimization and in-database modeling capabilities, delivering further benefits in performance and automation.

IBM SPSS Modeler Administration Console

The Modeler Administration Console is a graphical application for managing many of the SPSS Modeler Server configuration options, which are also configurable by means of an options file. The application provides a console user interface to monitor and configure your SPSS Modeler Server installations, and is available free of charge to current SPSS Modeler Server customers. The application can be installed only on Windows computers; however, it can administer a server installed on any supported platform.

IBM SPSS Modeler Batch

While data mining is usually an interactive process, it is also possible to run SPSS Modeler from a command line, without the need for the graphical user interface. For example, you might have long-running or repetitive tasks that you want to perform with no user intervention. SPSS Modeler Batch is a special version of the product that provides support for the complete analytical capabilities of SPSS Modeler without access to the regular user interface. An SPSS Modeler Server license is required to use SPSS Modeler Batch.
IBM SPSS Modeler Solution Publisher

SPSS Modeler Solution Publisher is a tool that enables you to create a packaged version of an SPSS Modeler stream that can be run by an external runtime engine or embedded in an external application. In this way, you can publish and deploy complete SPSS Modeler streams for use in environments that do not have SPSS Modeler installed. SPSS Modeler Solution Publisher is distributed as part of the IBM SPSS Collaboration and Deployment Services - Scoring service, for which a separate license is required. With this license, you receive SPSS Modeler Solution Publisher Runtime, which enables you to execute the published streams.

IBM SPSS Modeler Server Adapters for IBM SPSS Collaboration and Deployment Services

A number of adapters for IBM® SPSS® Collaboration and Deployment Services are available that enable SPSS Modeler and SPSS Modeler Server to interact with an IBM SPSS Collaboration and Deployment Services repository. In this way, an SPSS Modeler stream deployed to the repository can be shared by multiple users, or accessed from the thin-client application IBM SPSS Modeler Advantage. You install the adapter on the system that hosts the repository.

IBM SPSS Modeler Editions

SPSS Modeler is available in the following editions.

SPSS Modeler Professional

SPSS Modeler Professional provides all the tools you need to work with most types of structured data, such as behaviors and interactions tracked in CRM systems, demographics, purchasing behavior and sales data.

SPSS Modeler Premium

SPSS Modeler Premium is a separately-licensed product that extends SPSS Modeler Professional to work with specialized data such as that used for entity analytics or social networking, and with unstructured text data. SPSS Modeler Premium comprises the following components.

IBM® SPSS® Modeler Entity Analytics adds a completely new dimension to IBM® SPSS® Modeler predictive analytics.
Whereas predictive analytics attempts to predict future behavior from past data, entity analytics focuses on improving the coherence and consistency of current data by resolving identity conflicts within the records themselves. An identity can be that of an individual, an organization, an object, or any other entity for which ambiguity might exist. Identity resolution can be vital in a number of fields, including customer relationship management, fraud detection, anti-money laundering, and national and international security.

IBM SPSS Modeler Social Network Analysis transforms information about relationships into fields that characterize the social behavior of individuals and groups. Using data describing the relationships underlying social networks, IBM® SPSS® Modeler Social Network Analysis identifies social leaders who influence the behavior of others in the network. In addition, you can determine which people are most affected by other network participants. By combining these results with other measures, you can create comprehensive profiles of individuals on which to base your predictive models. Models that include this social information will perform better than models that do not.

IBM® SPSS® Modeler Text Analytics uses advanced linguistic technologies and Natural Language Processing (NLP) to rapidly process a large variety of unstructured text data, extract and organize the key concepts, and group these concepts into categories. Extracted concepts and categories can be combined with existing structured data, such as demographics, and applied to modeling using the full suite of SPSS Modeler data mining tools to yield better and more focused decisions.

IBM SPSS Modeler Documentation

Documentation in online help format is available from the Help menu of SPSS Modeler. This includes documentation for SPSS Modeler, SPSS Modeler Server, and SPSS Modeler Solution Publisher, as well as the Applications Guide and other supporting materials.
Complete documentation for each product (including installation instructions) is available in PDF format under the \Documentation folder on each product DVD. Installation documents can also be downloaded from the web at http://www-01.ibm.com/support/docview.wss?uid=swg27023172. Documentation in both formats is also available from the SPSS Modeler Information Center at http://publib.boulder.ibm.com/infocenter/spssmodl/v15r0m0/.

SPSS Modeler Professional Documentation

The SPSS Modeler Professional documentation suite (excluding installation instructions) is as follows.

IBM SPSS Modeler User’s Guide. General introduction to using SPSS Modeler, including how to build data streams, handle missing values, build CLEM expressions, work with projects and reports, and package streams for deployment to IBM SPSS Collaboration and Deployment Services, Predictive Applications, or IBM SPSS Modeler Advantage.

IBM SPSS Modeler Source, Process, and Output Nodes. Descriptions of all the nodes used to read, process, and output data in different formats. Effectively this means all nodes other than modeling nodes.

IBM SPSS Modeler Modeling Nodes. Descriptions of all the nodes used to create data mining models. IBM® SPSS® Modeler offers a variety of modeling methods taken from machine learning, artificial intelligence, and statistics. For more information, see the topic Overview of Modeling Nodes in Chapter 3 on p. 24.

IBM SPSS Modeler Algorithms Guide. Descriptions of the mathematical foundations of the modeling methods used in SPSS Modeler. This guide is available in PDF format only.

IBM SPSS Modeler Applications Guide. The examples in this guide provide brief, targeted introductions to specific modeling methods and techniques. An online version of this guide is also available from the Help menu. For more information, see the topic Application Examples on p. 5.

IBM SPSS Modeler Scripting and Automation.
Information on automating the system through scripting, including the properties that can be used to manipulate nodes and streams.

IBM SPSS Modeler Deployment Guide. Information on running SPSS Modeler streams and scenarios as steps in processing jobs under IBM® SPSS® Collaboration and Deployment Services Deployment Manager.

IBM SPSS Modeler CLEF Developer’s Guide. CLEF provides the ability to integrate third-party programs such as data processing routines or modeling algorithms as nodes in SPSS Modeler.

IBM SPSS Modeler In-Database Mining Guide. Information on how to use the power of your database to improve performance and extend the range of analytical capabilities through third-party algorithms.

IBM SPSS Modeler Server Administration and Performance Guide. Information on how to configure and administer IBM® SPSS® Modeler Server.

IBM SPSS Modeler Administration Console User Guide. Information on installing and using the console user interface for monitoring and configuring SPSS Modeler Server. The console is implemented as a plug-in to the Deployment Manager application.

IBM SPSS Modeler Solution Publisher Guide. SPSS Modeler Solution Publisher is an add-on component that enables organizations to publish streams for use outside of the standard SPSS Modeler environment.

IBM SPSS Modeler CRISP-DM Guide. Step-by-step guide to using the CRISP-DM methodology for data mining with SPSS Modeler.

IBM SPSS Modeler Batch User’s Guide. Complete guide to using IBM SPSS Modeler in batch mode, including details of batch mode execution and command-line arguments. This guide is available in PDF format only.

SPSS Modeler Premium Documentation

The SPSS Modeler Premium documentation suite (excluding installation instructions) is as follows.

IBM SPSS Modeler Entity Analytics User Guide. Information on using entity analytics with SPSS Modeler, covering repository installation and configuration, entity analytics nodes, and administrative tasks.
IBM SPSS Modeler Social Network Analysis User Guide. A guide to performing social network analysis with SPSS Modeler, including group analysis and diffusion analysis.

SPSS Modeler Text Analytics User’s Guide. Information on using text analytics with SPSS Modeler, covering the text mining nodes, interactive workbench, templates, and other resources.

IBM SPSS Modeler Text Analytics Administration Console User Guide. Information on installing and using the console user interface for monitoring and configuring IBM® SPSS® Modeler Server for use with SPSS Modeler Text Analytics. The console is implemented as a plug-in to the Deployment Manager application.

Application Examples

While the data mining tools in SPSS Modeler can help solve a wide variety of business and organizational problems, the application examples provide brief, targeted introductions to specific modeling methods and techniques. The data sets used here are much smaller than the enormous data stores managed by some data miners, but the concepts and methods involved should be scalable to real-world applications.

You can access the examples by clicking Application Examples on the Help menu in SPSS Modeler. The data files and sample streams are installed in the Demos folder under the product installation directory. For more information, see the topic Demos Folder on p. 6.

Database modeling examples. See the examples in the IBM SPSS Modeler In-Database Mining Guide.

Scripting examples. See the examples in the IBM SPSS Modeler Scripting and Automation Guide.

Demos Folder

The data files and sample streams used with the application examples are installed in the Demos folder under the product installation directory. This folder can also be accessed from the IBM SPSS Modeler 15 program group on the Windows Start menu, or by clicking Demos on the list of recent directories in the File Open dialog box.
Figure 1-1 Selecting the Demos folder from the list of recently-used directories

Chapter 2
Introduction to Modeling

A model is a set of rules, formulas, or equations that can be used to predict an outcome based on a set of input fields or variables. For example, a financial institution might use a model to predict whether loan applicants are likely to be good or bad risks, based on information that is already known about past applicants. The ability to predict an outcome is the central goal of predictive analytics, and understanding the modeling process is the key to using IBM® SPSS® Modeler.

Figure 2-1 A simple decision tree model

This example uses a decision tree model, which classifies records (and predicts a response) using a series of decision rules, for example:

IF income = Medium AND cards < 5 THEN -> 'Good'

While this example uses a CHAID (Chi-squared Automatic Interaction Detection) model, it is intended as a general introduction, and most of the concepts apply broadly to other modeling types in SPSS Modeler. To understand any model, you first need to understand the data that go into it. The data in this example contain information about the customers of a bank. The following fields are used:

Field name      Description
Credit_rating   Credit rating: 0=Bad, 1=Good, 9=missing values
Age             Age in years
Income          Income level: 1=Low, 2=Medium, 3=High
Credit_cards    Number of credit cards held: 1=Less than five, 2=Five or more
Education       Level of education: 1=High school, 2=College
Car_loans       Number of car loans taken out: 1=None or one, 2=More than two

The bank maintains a database of historical information on customers who have taken out loans with the bank, including whether or not they repaid the loans (Credit rating = Good) or defaulted (Credit rating = Bad).
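To make the decision rule concrete, here is a minimal Python sketch (not SPSS Modeler code) that applies the single rule shown above to one record; the field names and category labels follow the table in this section, and the applicant record is made up for illustration.

```python
# Hypothetical sketch: apply the single decision rule shown above
# (IF income = Medium AND cards < 5 THEN -> 'Good') to one record.
# Field names and category labels follow the table in this section.
def apply_rule(record):
    """Return 'Good' if the rule fires, otherwise 'Bad'."""
    if (record["Income"] == "Medium"
            and record["Credit_cards"] == "Less than five"):
        return "Good"
    return "Bad"

applicant = {"Income": "Medium", "Credit_cards": "Less than five"}
print(apply_rule(applicant))  # Good
```

A real model is simply a larger collection of such rules, learned from historical records rather than written by hand.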
Using this existing data, the bank wants to build a model that will enable them to predict how likely future loan applicants are to default on the loan. Using a decision tree model, you can analyze the characteristics of the two groups of customers and predict the likelihood of loan defaults.

This example uses the stream named modelingintro.str, available in the Demos folder under the streams subfolder. The data file is tree_credit.sav. For more information, see the topic Demos Folder in Chapter 1 on p. 6. Let’s take a look at the stream.

E Choose the following from the main menu: File > Open Stream
E Click the gold nugget icon on the toolbar of the Open dialog box and choose the Demos folder.
E Double-click the streams folder.
E Double-click the file named modelingintro.str.

Building the Stream

Figure 2-2 Modeling stream

To build a stream that will create a model, we need at least three elements:

- A source node that reads in data from some external source, in this case an IBM® SPSS® Statistics data file.
- A source or Type node that specifies field properties, such as measurement level (the type of data that the field contains), and the role of each field as a target or input in modeling.
- A modeling node that generates a model nugget when the stream is run.

In this example, we’re using a CHAID modeling node. CHAID, or Chi-squared Automatic Interaction Detection, is a classification method that builds decision trees by using a particular type of statistics known as chi-square statistics to work out the best places to make the splits in the decision tree. If measurement levels are specified in the source node, the separate Type node can be eliminated. Functionally, the result is the same. This stream also has Table and Analysis nodes that will be used to view the scoring results after the model nugget has been created and added to the stream.
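The core idea behind a chi-square-driven split can be sketched in a few lines of Python. This is not the actual CHAID algorithm (which also merges categories, applies Bonferroni adjustments, and more); it is an illustrative sketch over made-up records: compute a Pearson chi-square statistic for each candidate input against the target, and split on the strongest one.

```python
# Illustrative sketch (not real CHAID): score each candidate predictor
# against the target with a Pearson chi-square statistic and pick the
# strongest. Records and field names are made up.
from collections import Counter

def chi_square(records, field, target):
    """Pearson chi-square statistic for field vs. target."""
    rows = sorted({r[field] for r in records})
    cols = sorted({r[target] for r in records})
    counts = Counter((r[field], r[target]) for r in records)
    n = len(records)
    stat = 0.0
    for a in rows:
        row_total = sum(counts[(a, b)] for b in cols)
        for b in cols:
            col_total = sum(counts[(x, b)] for x in rows)
            expected = row_total * col_total / n  # count expected if independent
            stat += (counts[(a, b)] - expected) ** 2 / expected
    return stat

records = [
    {"Income": "Low",    "Cards": "5+", "Rating": "Bad"},
    {"Income": "Low",    "Cards": "<5", "Rating": "Bad"},
    {"Income": "High",   "Cards": "<5", "Rating": "Good"},
    {"Income": "High",   "Cards": "5+", "Rating": "Good"},
    {"Income": "Medium", "Cards": "<5", "Rating": "Good"},
    {"Income": "Medium", "Cards": "5+", "Rating": "Bad"},
]
best = max(["Income", "Cards"], key=lambda f: chi_square(records, f, "Rating"))
print(best)  # Income
```

In this toy sample, Income separates Good from Bad ratings much more sharply than Cards, so it would be chosen for the first split, mirroring what the CHAID tree does later in this example.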
The Statistics File source node reads data in SPSS Statistics format from the tree_credit.sav data file, which is installed in the Demos folder. (A special variable named $CLEO_DEMOS is used to reference this folder under the current IBM® SPSS® Modeler installation. This ensures the path will be valid regardless of the current installation folder or version.)

Figure 2-3 Reading data with a Statistics File source node

The Type node specifies the measurement level for each field. The measurement level is a category that indicates the type of data in the field. Our source data file uses three different measurement levels. A Continuous field (such as the Age field) contains continuous numeric values, while a Nominal field (such as the Credit rating field) has two or more distinct values, for example Bad, Good, or No credit history. An Ordinal field (such as the Income level field) describes data with multiple distinct values that have an inherent order—in this case Low, Medium and High.

Figure 2-4 Setting the target and input fields with the Type node

For each field, the Type node also specifies a role, to indicate the part that each field plays in modeling. The role is set to Target for the field Credit rating, which is the field that indicates whether or not a given customer defaulted on the loan. This is the target, or the field for which we want to predict the value. Role is set to Input for the other fields. Input fields are sometimes known as predictors, or fields whose values are used by the modeling algorithm to predict the value of the target field.

The CHAID modeling node generates the model. On the Fields tab in the modeling node, the option Use predefined roles is selected, which means the target and inputs will be used as specified in the Type node. We could change the field roles at this point, but for this example we’ll use them as they are.

E Click the Build Options tab.
Figure 2-5 CHAID modeling node, Fields tab

Here there are several options where we could specify the kind of model we want to build. We want a brand-new model, so we’ll use the default option Build new model. We also just want a single, standard decision tree model without any enhancements, so we’ll also leave the default objective option Build a single tree. While we can optionally launch an interactive modeling session that allows us to fine-tune the model, this example simply generates a model using the default mode setting Generate model.

Figure 2-6 CHAID modeling node, Build Options tab

For this example, we want to keep the tree fairly simple, so we’ll limit the tree growth by raising the minimum number of cases for parent and child nodes.

E On the Build Options tab, select Stopping Rules from the navigator pane on the left.
E Select the Use absolute value option.
E Set Minimum records in parent branch to 400.
E Set Minimum records in child branch to 200.

Figure 2-7 Setting the stopping criteria for decision tree building

We can use all the other default options for this example, so click Run to create the model. (Alternatively, right-click on the node and choose Run from the context menu, or select the node and choose Run from the Tools menu.)

Browsing the Model

When execution completes, the model nugget is added to the Models palette in the upper right corner of the application window, and is also placed on the stream canvas with a link to the modeling node from which it was created. To view the model details, right-click on the model nugget and choose Browse (on the models palette) or Edit (on the canvas).

Figure 2-8 Models palette

In the case of the CHAID nugget, the Model tab displays the details in the form of a rule set—essentially a series of rules that can be used to assign individual records to child nodes based on the values of different input fields.
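Such a rule set can be thought of as a single function that walks a record through the splits and returns the terminal node's mode response. The sketch below is a hypothetical paraphrase of the tree discussed in this example (first split on income level, second on number of credit cards); the individual node outcomes are illustrative assumptions, not output copied from the actual nugget.

```python
# Hypothetical rule set paraphrasing the tree discussed in this example:
# first split on Income, second split on number of credit cards.
# Each terminal node returns its mode (most common) response.
# The node outcomes below are illustrative assumptions.
def score(record):
    income = record["Income"]
    few_cards = record["Credit_cards"] == "Less than five"
    if income == "Low":
        return "Bad"        # Node 2: highest share of defaulters
    if income == "High":
        return "Good"       # Node 1: the vast majority are a good risk
    # Medium income: the credit-card split decides (Nodes 6 and 7)
    return "Good" if few_cards else "Bad"

print(score({"Income": "Medium", "Credit_cards": "Less than five"}))  # Good
```

Every record falls into exactly one branch, so the function always returns a single prediction, just as each record is assigned to exactly one terminal node of the tree.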
Figure 2-9 CHAID model nugget, rule set

For each decision tree terminal node—meaning those tree nodes that are not split further—a prediction of Good or Bad is returned. In each case the prediction is determined by the mode, or most common response, for records that fall within that node.

To the right of the rule set, the Model tab displays the Predictor Importance chart, which shows the relative importance of each predictor in estimating the model. From this we can see that Income level is easily the most significant in this case, and that the only other significant factor is Number of credit cards.

Figure 2-10 Predictor Importance chart

The Viewer tab in the model nugget displays the same model in the form of a tree, with a node at each decision point. Use the Zoom controls on the toolbar to zoom in on a specific node or zoom out to see more of the tree.

Figure 2-11 Viewer tab in the model nugget, with zoom out selected

Looking at the upper part of the tree, the first node (Node 0) gives us a summary for all the records in the data set. Just over 40% of the cases in the data set are classified as a bad risk. This is quite a high proportion, so let’s see if the tree can give us any clues as to what factors might be responsible.

We can see that the first split is by Income level. Records where the income level is in the Low category are assigned to Node 2, and it’s no surprise to see that this category contains the highest percentage of loan defaulters. Clearly lending to customers in this category carries a high risk. However, 16% of the customers in this category actually didn’t default, so the prediction won’t always be correct. No model can feasibly predict every response, but a good model should allow us to predict the most likely response for each record based on the available data. In the same way, if we look at the high income customers (Node 1), we see that the vast majority (89%) are a good risk.
But more than 1 in 10 of these customers has also defaulted. Can we refine our lending criteria to minimize the risk here?

Notice how the model has divided these customers into two sub-categories (Nodes 4 and 5), based on the number of credit cards held. For high-income customers, if we lend only to those with fewer than 5 credit cards, we can increase our success rate from 89% to 97%—an even more satisfactory outcome.

Figure 2-12 Tree view of high-income customers

But what about those customers in the Medium income category (Node 3)? They’re much more evenly divided between Good and Bad ratings. Again, the sub-categories (Nodes 6 and 7 in this case) can help us. This time, lending only to those medium-income customers with fewer than 5 credit cards increases the percentage of Good ratings from 58% to 85%, a significant improvement.

Figure 2-13 Tree view of medium-income customers

So, we’ve learnt that every record that is input to this model will be assigned to a specific node, and assigned a prediction of Good or Bad based on the most common response for that node. This process of assigning predictions to individual records is known as scoring. By scoring the same records used to estimate the model, we can evaluate how accurately it performs on the training data—the data for which we know the outcome. Let’s look at how to do this.

Evaluating the Model

We’ve been browsing the model to understand how scoring works. But to evaluate how accurately it works, we need to score some records and compare the responses predicted by the model to the actual results. We’re going to score the same records that were used to estimate the model, allowing us to compare the observed and predicted responses.

Figure 2-14 Attaching the model nugget to output nodes for model evaluation

E To see the scores or predictions, attach the Table node to the model nugget, double-click the Table node and click Run.
The table displays the predicted scores in a field named $R-Credit rating, which was created by the model. We can compare these values to the original Credit rating field that contains the actual responses. By convention, the names of the fields generated during scoring are based on the target field, but with a standard prefix such as $R- for predictions or $RC- for confidence values. Different model types use different sets of prefixes. A confidence value is the model’s own estimation, on a scale from 0.0 to 1.0, of how accurate each predicted value is.

Figure 2-15 Table showing generated scores and confidence values

As expected, the predicted value matches the actual responses for many records but not all. The reason for this is that each CHAID terminal node has a mix of responses. The prediction matches the most common one, but will be wrong for all the others in that node. (Recall the 16% minority of low-income customers who did not default.) To avoid this, we could continue splitting the tree into smaller and smaller branches, until every node was 100% pure—all Good or Bad with no mixed responses. But such a model would be extremely complicated and would probably not generalize well to other datasets.

To find out exactly how many predictions are correct, we could read through the table and tally the number of records where the value of the predicted field $R-Credit rating matches the value of Credit rating. Fortunately, there’s a much easier way—we can use an Analysis node, which does this automatically.

E Connect the model nugget to the Analysis node.
E Double-click the Analysis node and click Run.

Figure 2-16 Attaching an Analysis node

The analysis shows that for 1899 out of 2464 records—over 77%—the value predicted by the model matched the actual response.
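The tally the Analysis node performs is easy to sketch: count the records where the predicted value equals the observed one and report the percentage. The small sample below is made up; the final line reproduces the figures quoted in the text.

```python
# Sketch of the tally behind the Analysis node: count records whose
# predicted value matches the observed one and report the proportion.
def analysis(actual, predicted):
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct, len(actual), correct / len(actual)

# Tiny illustrative sample (made-up values):
actual    = ["Good", "Bad", "Good", "Bad"]
predicted = ["Good", "Bad", "Bad",  "Bad"]
print(analysis(actual, predicted))  # (3, 4, 0.75)

# The figures quoted in the text: 1899 of 2464 records matched.
print(f"{1899 / 2464:.2%}")  # 77.07%
```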
Figure 2-17 Analysis results comparing observed and predicted responses

This result is limited by the fact that the records being scored are the same ones used to estimate the model. In a real situation, you could use a Partition node to split the data into separate samples for training and evaluation. By using one sample partition to generate the model and another sample to test it, you can get a much better indication of how well it will generalize to other datasets.

The Analysis node allows us to test the model against records for which we already know the actual result. The next stage illustrates how we can use the model to score records for which we don’t know the outcome. For example, this might include people who are not currently customers of the bank, but who are prospective targets for a promotional mailing.

Scoring Records

Earlier, we scored the same records used to estimate the model in order to evaluate how accurate the model was. Now we’re going to see how to score a different set of records from the ones used to create the model. This is the goal of modeling with a target field: Study records for which you know the outcome, to identify patterns that will allow you to predict outcomes you don’t yet know.

Figure 2-18 Attaching new data for scoring

You could update the Statistics File source node to point to a different data file, or you could add a new source node that reads in the data you want to score. Either way, the new dataset must contain the same input fields used by the model (Age, Income level, Education and so on) but not the target field Credit rating. Alternatively, you could add the model nugget to any stream that includes the expected input fields. Whether read from a file or a database, the source type doesn’t matter as long as the field names and types match those used by the model.
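A Partition node's behavior can be sketched as randomly flagging each record as belonging to a training or testing sample; the 50/50 ratio, the Partition field name, and the sample records below are assumptions for illustration.

```python
# Sketch of what a Partition node does: randomly flag each record as
# Training or Testing so a model can be built on one sample and
# evaluated on the other. The 50/50 ratio and field name are assumptions.
import random

def partition(records, train_fraction=0.5, seed=1):
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    return [dict(r, Partition="Training"
                 if rng.random() < train_fraction else "Testing")
            for r in records]

data = [{"Age": a} for a in (21, 35, 42, 58)]
for row in partition(data):
    print(row)
```

Training the model only on the "Training" rows and running the Analysis node only on the "Testing" rows gives an accuracy estimate that is not flattered by records the model has already seen.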
You could also save the model nugget as a separate file, export the model in PMML format for use with other applications that support this format, or store the model in an IBM® SPSS® Collaboration and Deployment Services repository, which offers enterprise-wide deployment, scoring, and management of models. Regardless of the infrastructure used, the model itself works in the same way.

Summary

This example demonstrates the basic steps for creating, evaluating, and scoring a model. The modeling node estimates the model by studying records for which the outcome is known, and creates a model nugget. This is sometimes referred to as training the model. The model nugget can be added to any stream with the expected fields to score records. By scoring the records for which you already know the outcome (such as existing customers), you can evaluate how well it performs. Once you are satisfied that the model performs acceptably well, you can score new data (such as prospective customers) to predict how they will respond. The data used to train or estimate the model may be referred to as the analytical or historical data; the scoring data may also be referred to as the operational data.

Chapter 3
Modeling Overview

Overview of Modeling Nodes

IBM® SPSS® Modeler offers a variety of modeling methods taken from machine learning, artificial intelligence, and statistics. The methods available on the Modeling palette allow you to derive new information from your data and to develop predictive models. Each method has certain strengths and is best suited for particular types of problems. The SPSS Modeler Applications Guide provides examples for many of these methods, along with a general introduction to the modeling process. This guide is available as an online tutorial, and also in PDF format. For more information, see the topic Application Examples in Chapter 1 on p. 5.
Modeling methods are divided into three categories: classification, association, and segmentation.

Classification Models

Classification models use the values of one or more input fields to predict the value of one or more output, or target, fields. Some examples of these techniques are: decision trees (C&R Tree, QUEST, CHAID and C5.0 algorithms), regression (linear, logistic, generalized linear, and Cox regression algorithms), neural networks, support vector machines, and Bayesian networks. Classification models help organizations to predict a known result, such as whether a customer will buy or leave, or whether a transaction fits a known pattern of fraud. Modeling techniques include machine learning, rule induction, subgroup identification, statistical methods, and multiple model generation.

Classification nodes

The Auto Classifier node creates and compares a number of different models for binary outcomes (yes or no, churn or do not churn, and so on), allowing you to choose the best approach for a given analysis. A number of modeling algorithms are supported, making it possible to select the methods you want to use, the specific options for each, and the criteria for comparing the results. The node generates a set of models based on the specified options and ranks the best candidates according to the criteria you specify. For more information, see the topic Auto Classifier Node in Chapter 5 on p. 89.

The Auto Numeric node estimates and compares models for continuous numeric range outcomes using a number of different methods. The node works in the same manner as the Auto Classifier node, allowing you to choose the algorithms to use and to experiment with multiple combinations of options in a single modeling pass. Supported algorithms include neural networks, C&R Tree, CHAID, linear regression, generalized linear regression, and support vector machines (SVM).
Models can be compared based on correlation, relative error, or number of variables used. For more information, see the topic Auto Numeric Node in Chapter 5 on p. 98.

The Classification and Regression (C&R) Tree node generates a decision tree that allows you to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node in the tree is considered “pure” if 100% of cases in the node fall into a specific category of the target field. Target and input fields can be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary (only two subgroups). For more information, see the topic C&R Tree Node in Chapter 6 on p. 143.

The QUEST node provides a binary classification method for building decision trees, designed to reduce the processing time required for large C&R Tree analyses while also reducing the tendency found in classification tree methods to favor inputs that allow more splits. Input fields can be numeric ranges (continuous), but the target field must be categorical. All splits are binary. For more information, see the topic QUEST Node in Chapter 6 on p. 144.

The CHAID node generates decision trees using chi-square statistics to identify optimal splits. Unlike the C&R Tree and QUEST nodes, CHAID can generate nonbinary trees, meaning that some splits have more than two branches. Target and input fields can be numeric range (continuous) or categorical. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits but takes longer to compute. For more information, see the topic CHAID Node in Chapter 6 on p. 144.

The C5.0 node builds either a decision tree or a rule set. The model works by splitting the sample based on the field that provides the maximum information gain at each level. The target field must be categorical. Multiple splits into more than two subgroups are allowed.
For more information, see the topic C5.0 Node in Chapter 6 on p. 160.

The Decision List node identifies subgroups, or segments, that show a higher or lower likelihood of a given binary outcome relative to the overall population. For example, you might look for customers who are unlikely to churn or are most likely to respond favorably to a campaign. You can incorporate your business knowledge into the model by adding your own custom segments and previewing alternative models side by side to compare the results. Decision List models consist of a list of rules in which each rule has a condition and an outcome. Rules are applied in order, and the first rule that matches determines the outcome. For more information, see the topic Decision List in Chapter 9 on p. 204.

Linear regression models predict a continuous target based on linear relationships between the target and one or more predictors. For more information, see the topic Linear models in Chapter 10 on p. 239.

The PCA/Factor node provides powerful data-reduction techniques to reduce the complexity of your data. Principal components analysis (PCA) finds linear combinations of the input fields that do the best job of capturing the variance in the entire set of fields, where the components are orthogonal (perpendicular) to each other. Factor analysis attempts to identify underlying factors that explain the pattern of correlations within a set of observed fields. For both approaches, the goal is to find a small number of derived fields that effectively summarize the information in the original set of fields. For more information, see the topic PCA/Factor Node in Chapter 10 on p. 277.

The Feature Selection node screens input fields for removal based on a set of criteria (such as the percentage of missing values); it then ranks the importance of the remaining inputs relative to a specified target.
For example, given a data set with hundreds of potential inputs, which are most likely to be useful in modeling patient outcomes? For more information, see the topic Feature Selection Node in Chapter 4 on p. 70.

Discriminant analysis makes more stringent assumptions than logistic regression, but it can be a valuable alternative or supplement to a logistic regression analysis when those assumptions are met. For more information, see the topic Discriminant Node in Chapter 10 on p. 285.

Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range. For more information, see the topic Logistic Node in Chapter 10 on p. 259.

The Generalized Linear model expands the general linear model so that the dependent variable is linearly related to the factors and covariates through a specified link function. Moreover, the model allows the dependent variable to have a non-normal distribution. It covers the functionality of a wide number of statistical models, including linear regression, logistic regression, loglinear models for count data, and interval-censored survival models. For more information, see the topic GenLin Node in Chapter 10 on p. 294.

A generalized linear mixed model (GLMM) extends the linear model so that the target can have a non-normal distribution, is linearly related to the factors and covariates via a specified link function, and so that the observations can be correlated. Generalized linear mixed models cover a wide variety of models, from simple linear regression to complex multilevel models for non-normal longitudinal data. For more information, see the topic GLMM Node in Chapter 10 on p. 307.

The Cox regression node enables you to build a survival model for time-to-event data in the presence of censored records.
The model produces a survival function that predicts the probability that the event of interest has occurred at a given time (t) for given values of the input variables. For more information, see the topic Cox Node in Chapter 10 on p. 336.

The Support Vector Machine (SVM) node enables you to classify data into one of two groups without overfitting. SVM works well with wide data sets, such as those with a very large number of input fields. For more information, see the topic SVM Node in Chapter 15 on p. 457.

The Bayesian Network node enables you to build a probability model by combining observed and recorded evidence with real-world knowledge to establish the likelihood of occurrences. The node focuses on Tree Augmented Naïve Bayes (TAN) and Markov Blanket networks that are primarily used for classification. For more information, see the topic Bayesian Network Node in Chapter 7 on p. 179.

The Self-Learning Response Model (SLRM) node enables you to build a model in which a single new case, or a small number of new cases, can be used to reestimate the model without having to retrain the model using all data. For more information, see the topic SLRM Node in Chapter 14 on p. 446.

The Time Series node estimates exponential smoothing, univariate Autoregressive Integrated Moving Average (ARIMA), and multivariate ARIMA (or transfer function) models for time series data and produces forecasts of future performance. A Time Series node must always be preceded by a Time Intervals node. For more information, see the topic Time Series Modeling Node in Chapter 13 on p. 424.

The k-Nearest Neighbor (KNN) node associates a new case with the category or value of the k objects nearest to it in the predictor space, where k is an integer. Similar cases are near each other, and dissimilar cases are distant from each other. For more information, see the topic KNN Node in Chapter 16 on p. 462.
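The k-nearest-neighbor idea described above can be sketched as follows. This is a minimal standalone illustration with invented points, not the KNN node's implementation:

```python
from collections import Counter

def knn_classify(new_case, training, k=3):
    """Assign new_case the majority category of its k nearest neighbors,
    measured by Euclidean distance in the predictor space."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(training, key=lambda rec: dist(rec[0], new_case))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy training data: (predictor values, category).
training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((8, 8), "B"), ((9, 8), "B")]
print(knn_classify((1.5, 1.5), training, k=3))  # "A" -- nearest cases are all A
```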
Association Models

Association models find patterns in your data where one or more entities (such as events, purchases, or attributes) are associated with one or more other entities. The models construct rule sets that define these relationships. Here the fields within the data can act as both inputs and targets. You could find these associations manually, but association rule algorithms do so much more quickly, and can explore more complex patterns. Apriori and CARMA models are examples of the use of such algorithms. One other type of association model is a sequence detection model, which finds sequential patterns in time-structured data. Association models are most useful when predicting multiple outcomes—for example, customers who bought product X also bought Y and Z. Association models associate a particular conclusion (such as the decision to buy something) with a set of conditions. The advantage of association rule algorithms over the more standard decision tree algorithms (C5.0 and C&R Tree) is that associations can exist between any of the attributes. A decision tree algorithm will build rules with only a single conclusion, whereas association algorithms attempt to find many rules, each of which may have a different conclusion.

Association nodes

The Apriori node extracts a set of rules from the data, pulling out the rules with the highest information content. Apriori offers five different methods of selecting rules and uses a sophisticated indexing scheme to process large data sets efficiently. For large problems, Apriori is generally faster to train; it has no arbitrary limit on the number of rules that can be retained, and it can handle rules with up to 32 preconditions. Apriori requires that input and output fields all be categorical but delivers better performance because it is optimized for this type of data. For more information, see the topic Apriori Node in Chapter 12 on p. 379.
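Association rules such as those Apriori extracts are conventionally measured by support and confidence. A minimal sketch over invented transactions (not the Apriori algorithm itself, which indexes item sets far more efficiently):

```python
# Support and confidence for the rule "razor -> shaving cream"
# over a toy set of market-basket transactions.
transactions = [
    {"razor", "shaving cream"},
    {"razor", "shaving cream", "aftershave"},
    {"razor"},
    {"aftershave"},
]

antecedent, consequent = {"razor"}, {"shaving cream"}
n = len(transactions)

# Antecedent support: fraction of transactions containing the antecedent.
antecedent_support = sum(antecedent <= t for t in transactions) / n       # 3/4
# Rule support: fraction containing both antecedent and consequent.
rule_support = sum((antecedent | consequent) <= t for t in transactions) / n  # 2/4
# Confidence: how often the rule holds when its antecedent applies.
confidence = rule_support / antecedent_support                            # 2/3
print(antecedent_support, rule_support, round(confidence, 2))
```

The distinction between antecedent support and rule support matters below, where the CARMA node's build settings are described.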
The CARMA model extracts a set of rules from the data without requiring you to specify input or target fields. In contrast to Apriori, the CARMA node offers build settings for rule support (support for both antecedent and consequent) rather than just antecedent support. This means that the rules generated can be used for a wider variety of applications—for example, to find a list of products or services (antecedents) whose consequent is the item that you want to promote this holiday season. For more information, see the topic CARMA Node in Chapter 12 on p. 383.

The Sequence node discovers association rules in sequential or time-oriented data. A sequence is a list of item sets that tends to occur in a predictable order. For example, a customer who purchases a razor and aftershave lotion may purchase shaving cream the next time he shops. The Sequence node is based on the CARMA association rules algorithm, which uses an efficient two-pass method for finding sequences. For more information, see the topic Sequence Node in Chapter 12 on p. 404.

Segmentation Models

Segmentation models divide the data into segments, or clusters, of records that have similar patterns of input fields. Because they are only interested in the input fields, segmentation models have no concept of output or target fields. Examples of segmentation models are Kohonen networks, K-Means clustering, two-step clustering, and anomaly detection. Segmentation models (also known as “clustering models”) are useful in cases where the specific result is unknown (for example, when identifying new patterns of fraud, or when identifying groups of interest in your customer base). Clustering models focus on identifying groups of similar records and labeling the records according to the group to which they belong.
This is done without the benefit of prior knowledge about the groups and their characteristics, and it distinguishes clustering models from the other modeling techniques in that there is no predefined output or target field for the model to predict. There are no right or wrong answers for these models. Their value is determined by their ability to capture interesting groupings in the data and provide useful descriptions of those groupings. Clustering models are often used to create clusters or segments that are then used as inputs in subsequent analyses (for example, by segmenting potential customers into homogeneous subgroups).

Segmentation nodes

The Auto Cluster node estimates and compares clustering models, which identify groups of records that have similar characteristics. The node works in the same manner as other automated modeling nodes, allowing you to experiment with multiple combinations of options in a single modeling pass. Models can be compared using basic measures that attempt to filter and rank the usefulness of the cluster models, including a measure based on the importance of particular fields. For more information, see the topic Auto Cluster Node in Chapter 5 on p. 104.

The K-Means node clusters the data set into distinct groups (or clusters). The method defines a fixed number of clusters, iteratively assigns records to clusters, and adjusts the cluster centers until further refinement can no longer improve the model. Instead of trying to predict an outcome, k-means uses a process known as unsupervised learning to uncover patterns in the set of input fields. For more information, see the topic K-Means Node in Chapter 11 on p. 354.

The Kohonen node generates a type of neural network that can be used to cluster the data set into distinct groups. When the network is fully trained, records that are similar should be close together on the output map, while records that are different will be far apart.
You can look at the number of observations captured by each unit in the model nugget to identify the strong units. This may give you a sense of the appropriate number of clusters. For more information, see the topic Kohonen Node in Chapter 11 on p. 348.

The TwoStep node uses a two-step clustering method. The first step makes a single pass through the data to compress the raw input data into a manageable set of subclusters. The second step uses a hierarchical clustering method to progressively merge the subclusters into larger and larger clusters. TwoStep has the advantage of automatically estimating the optimal number of clusters for the training data. It can handle mixed field types and large data sets efficiently. For more information, see the topic TwoStep Cluster Node in Chapter 11 on p. 358.

The Anomaly Detection node identifies unusual cases, or outliers, that do not conform to patterns of “normal” data. With this node, it is possible to identify outliers even if they do not fit any previously known patterns and even if you are not exactly sure what you are looking for. For more information, see the topic Anomaly Detection Node in Chapter 4 on p. 77.

In-Database Mining Models

SPSS Modeler supports integration with data mining and modeling tools that are available from database vendors, including Oracle Data Miner, IBM DB2 InfoSphere Warehouse, and Microsoft Analysis Services. You can build, score, and store models inside the database—all from within the SPSS Modeler application. For full details, see the SPSS Modeler In-Database Mining Guide, available on the product DVD.

IBM SPSS Statistics Models

If you have a copy of IBM® SPSS® Statistics installed and licensed on your computer, you can access and run certain SPSS Statistics routines from within SPSS Modeler to build and score models.

Further Information

Detailed documentation on the modeling algorithms is also available.
For more information, see the SPSS Modeler Algorithms Guide, available on the product DVD.

Building Split Models

Split modeling enables you to use a single stream to build separate models for each possible value of a flag, nominal, ordinal, or continuous input field, with the resulting models all being accessible from a single model nugget. The possible values of the input field could have very different effects on the model. With split modeling, you can easily build the best-fitting model for each possible field value in a single execution of the stream. Note that interactive modeling sessions cannot use splitting. With interactive modeling you specify each model individually, so there would be no advantage in using splitting, which builds multiple models automatically.

Split modeling works by designating a particular input field as a split field. You can do this by setting the field role to Split in the Type specification.

Figure 3-1 Designating an input field as a split field

You can designate only fields with a measurement level of Flag, Nominal, Ordinal or Continuous as split fields. You can assign more than one input field as a split field. In this case, however, the number of models created can be greatly increased. A model is built for each possible combination of the values of the selected split fields. For example, if three input fields, each having three possible values, are designated as split fields, this will result in the creation of 27 different models.

Even after you assign one or more fields as split fields, you can still choose whether to create split models or a single model, by means of a check box setting on the modeling node dialog.

Figure 3-2 Choosing to build split models

If split fields are defined but the check box is not selected, only a single model is generated. Likewise, if the check box is selected but no split field is defined, splitting is ignored and a single model is generated.
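The combinatorial growth described above is easy to verify: three split fields with three values each yield one model per combination. The field names below are hypothetical:

```python
from itertools import product

# Three hypothetical split fields, each with three possible values.
split_values = {
    "Region": ["North", "South", "West"],
    "Channel": ["Web", "Phone", "Store"],
    "Tier": ["Gold", "Silver", "Bronze"],
}

# One model is built per combination of split-field values.
combinations = list(product(*split_values.values()))
print(len(combinations))  # 27
```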
When you run the stream, separate models are built behind the scenes for each possible value of the split field or fields, but only a single model nugget is placed in the models palette and on the stream canvas. A split-model nugget is denoted by the split symbol.

Figure 3-3 Split-model nugget in a stream

When you browse the split-model nugget, you see a list of all the separate models that have been built.

Figure 3-4 Split model viewer

You can investigate an individual model from the list by double-clicking its nugget icon in the viewer. Doing so opens a standard browser window for the individual model. When the nugget is on the canvas, double-clicking a graph thumbnail opens the full-size graph. For more information, see the topic Split Model Viewer on p. 61. Once a model has been created as a split model, you cannot remove the split processing from it, nor can you undo splitting further downstream from a split-modeling node or nugget.

Example. A national retailer wants to estimate sales by product category at each of its stores around the country. Using split modeling, they designate the Store field of their input data as a split field, enabling them to build separate models for each category at each store in a single operation. They can then use the resulting information to control stock levels much more accurately than they could with only a single model.

Splitting and Partitioning

Splitting has some features in common with partitioning, but the two are used in very different ways. Partitioning divides the dataset randomly into either two or three parts: training, testing and (optionally) validation, and is used to test the performance of a single model. Splitting divides the dataset into as many parts as there are possible values for a split field, and is used to build multiple models. Partitioning and splitting operate completely independently of each other. You can choose either, both, or neither in a modeling node.
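A random partition assignment of the kind described above can be sketched as follows. The 50/30/20 proportions and the helper function are illustrative assumptions, not Modeler defaults:

```python
import random

def assign_partitions(n_records, train=0.5, test=0.3, seed=42):
    """Randomly assign each record to a training, testing, or validation
    sample -- the optional three-way split described above. Proportions
    here (50/30/20) are illustrative only."""
    rng = random.Random(seed)
    labels = []
    for _ in range(n_records):
        r = rng.random()
        labels.append("train" if r < train
                      else "test" if r < train + test
                      else "validate")
    return labels

labels = assign_partitions(1000)
print({name: labels.count(name) for name in ("train", "test", "validate")})
```

Splitting, by contrast, is deterministic: every record goes to the part matching its split-field value, so the number of parts equals the number of distinct values.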
Modeling Nodes Supporting Split Models

A number of modeling nodes can create split models. The exceptions are Auto Cluster, Time Series, PCA/Factor, Feature Selection, SLRM, the association models (Apriori, CARMA and Sequence), the clustering models (K-Means, Kohonen, TwoStep and Anomaly), Statistics Model, and the nodes used for in-database modeling. The modeling nodes that support split modeling are: C&R Tree, QUEST, CHAID, C5.0, Neural Net, Decision List, Regression, Discriminant, Bayes Net, GenLin, KNN, Cox, Auto Classifier, Auto Numeric, Logistic, and SVM.

Features Affected by Splitting

The use of split models affects a number of IBM® SPSS® Modeler features in various ways. This section provides guidance on using split models in conjunction with other nodes in a stream.

Record Ops nodes

When using split models in a stream that contains a Sample node, stratify records by the split field to achieve an even sampling of records. This option is available when you choose Complex as the sample method. If the stream contains a Balance node, note that balancing applies to the overall set of input records, not to the subset of records inside a split. When aggregating records by means of an Aggregate node, set the split fields to be key fields if you want to calculate aggregates for each split.

Field Ops nodes

The Type node is where you specify which field or fields to use as split fields. Note that, while the Ensemble node is used to combine two or more model nuggets, it cannot be used to reverse the action of splitting, as the split models are contained inside a single model nugget.

Modeling nodes

Split models do not support the calculation of predictor importance (the relative importance of the predictor input fields in estimating the model). Predictor importance settings are ignored when building split models. The KNN (nearest neighbor) node supports split models only if it is set to predict a target field.
The alternative setting (only identify nearest neighbors) does not create a model. If the option “Automatically select k” is chosen, each of the split models may have a different number of nearest neighbors. Thus the overall model will have a number of generated columns equal to the largest number of nearest neighbors found across all the split models. For those split models where the number of nearest neighbors is less than this maximum, there will be a corresponding number of columns filled with $null values. For more information, see the topic KNN Node in Chapter 16 on p. 462.

Database Modeling nodes

The in-database modeling nodes do not support split models.

Model nuggets

Export to PMML from a split model nugget is not possible, as the nugget contains multiple models and PMML does not support such a packaging. Export to text or HTML is possible, however.

Modeling Node Fields Options

All modeling nodes have a Fields tab, where you can specify the fields to be used in building the model.

Figure 3-5 Example of a modeling node Fields tab

Before you can build a model, you need to specify which fields you want to use as targets and as inputs. With a few exceptions, all modeling nodes will use field information from an upstream Type node. If you are using a Type node to select input and target fields, you don’t need to change anything on this tab. (Exceptions include the Sequence node and the Text Extraction node, which require that field settings be specified in the modeling node.)

Use type node settings. This option tells the node to use field information from an upstream Type node. This is the default.

Use custom settings. This option tells the node to use field information specified here instead of that given in any upstream Type node(s). After selecting this option, specify the fields below as required. Note: Not all fields are displayed for all nodes.

Use transactional format. (Apriori, CARMA, MS Association Rules and Oracle Apriori nodes only)
Select this check box if the source data is in transactional format. Records in this format have two fields, one for an ID and one for content. Each record represents a single transaction or item, and associated items are linked by having the same ID. Deselect this box if the data is in tabular format, in which items are represented by separate flags, where each flag field represents the presence or absence of a specific item and each record represents a complete set of associated items. For more information, see the topic Tabular versus Transactional Data in Chapter 12 on p. 378.

ID. For transactional data, select an ID field from the list. Numeric or symbolic fields can be used as the ID field. Each unique value of this field should indicate a specific unit of analysis. For example, in a market basket application, each ID might represent a single customer. For a Web log analysis application, each ID might represent a computer (by IP address) or a user (by login data).

IDs are contiguous. (Apriori and CARMA nodes only) If your data are presorted so that all records with the same ID are grouped together in the data stream, select this option to speed up processing. If your data are not presorted (or you are not sure), leave this option unselected and the node will sort the data automatically. Note: If your data are not sorted and you select this option, you may get invalid results in your model.

Content. Specify the content field(s) for the model. These fields contain the items of interest in association modeling. You can specify multiple flag fields (if data are in tabular format) or a single nominal field (if data are in transactional format).

Target. For models that require one or more target fields, select the target field or fields. This is similar to setting the field role to Target in a Type node.

Evaluation. (For Auto Cluster models only.)
No target is specified for cluster models; however, you can select an evaluation field to identify its level of importance. In addition, you can evaluate how well the clusters differentiate values of this field, which in turn indicates whether the clusters can be used to predict this field.

Inputs. Select the input field or fields. This is similar to setting the field role to Input in a Type node.

Partition. This field allows you to specify a field used to partition the data into separate samples for the training, testing, and validation stages of model building. By using one sample to generate the model and a different sample to test it, you can get a good indication of how well the model will generalize to larger datasets that are similar to the current data. If multiple partition fields have been defined by using Type or Partition nodes, a single partition field must be selected on the Fields tab in each modeling node that uses partitioning. (If only one partition is present, it is automatically used whenever partitioning is enabled.) Also note that to apply the selected partition in your analysis, partitioning must also be enabled in the Model Options tab for the node. (Deselecting this option makes it possible to disable partitioning without changing field settings.)

Splits. For split models, select the split field or fields. This is similar to setting the field role to Split in a Type node. You can designate only fields with a measurement level of Flag, Nominal, Ordinal or Continuous as split fields. Fields chosen as split fields cannot be used as target, input, partition, frequency or weight fields. For more information, see the topic Building Split Models on p. 30.

Use frequency field. This option allows you to select a field as a frequency weight. Use this if the records in your training data represent more than one unit each—for example, if you are using aggregated data. The field values should be the number of units represented by each record.
For more information, see the topic Using Frequency and Weight Fields on p. 38. Note: If you see the error message Metadata (on input/output fields) not valid, ensure that you have specified all fields that are required, such as the frequency field.

Use weight field. This option allows you to select a field as a case weight. Case weights are used to account for differences in variance across levels of the output field. For more information, see the topic Using Frequency and Weight Fields on p. 38.

Consequents. For rule induction nodes (Apriori), select the fields to be used as consequents in the resulting rule set. (This corresponds to fields with role Target or Both in a Type node.)

Antecedents. For rule induction nodes (Apriori), select the fields to be used as antecedents in the resulting rule set. (This corresponds to fields with role Input or Both in a Type node.)

Some models have a Fields tab that differs from those described in this section. For more information, see the topic Sequence Node Fields Options in Chapter 12 on p. 405. For more information, see the topic CARMA Node Fields Options in Chapter 12 on p. 383.

Using Frequency and Weight Fields

Frequency and weight fields are used to give extra importance to some records over others, for example, because you know that one section of the population is under-represented in the training data (weight) or because one record represents a number of identical cases (frequency). Values for a frequency field should be positive integers. Records with a negative or zero frequency weight are excluded from the analysis. Non-integer frequency weights are rounded to the nearest integer. Case weight values should be positive but need not be integer values. Records with a negative or zero case weight are excluded from the analysis.
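The frequency-field rules above (rounding, and exclusion of zero or negative weights) can be sketched as a small validation helper. This is an illustration only; it rounds first and then excludes, matching the rules as stated:

```python
def clean_frequency(value):
    """Apply the frequency-field rules described above: round to the
    nearest integer, and exclude records (returned as None) whose
    frequency is zero or negative after rounding."""
    rounded = round(value)
    return rounded if rounded > 0 else None

# 2.6 rounds to 3; 0.4 rounds to 0 and is excluded; -2.0 is excluded.
print([clean_frequency(v) for v in [2.6, 1.0, 0.4, -2.0]])  # [3, 1, None, None]
```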
Scoring Frequency and Weight Fields

Frequency and weight fields are used in training models, but are not used in scoring, because the score for each record is based on its characteristics regardless of how many cases it represents. For example, suppose you have the following data:

Married   Responded
Yes       Yes
Yes       Yes
Yes       Yes
Yes       No
No        Yes
No        No
No        No

Based on this, you conclude that three out of four married people respond to the promotion, and two out of three unmarried people didn't respond. So you will score any new records accordingly:

Married   $-Responded   $RP-Responded
Yes       Yes           0.75 (three/four)
No        No            0.67 (two/three)

Alternatively, you could store your training data more compactly, using a frequency field:

Married   Responded   Frequency
Yes       Yes         3
Yes       No          1
No        Yes         1
No        No          2

Since this represents exactly the same dataset, you will build the same model and predict responses based solely on marital status. If you have ten married people in your scoring data, you will predict Yes for each of them regardless of whether they are presented as ten separate records, or one with a frequency value of 10.

Weight, although generally not an integer, can be thought of as similarly indicating the importance of a record. This is why frequency and weight fields are not used when scoring records.

Evaluating and Comparing Models

Some model types support frequency fields, some support weight fields, and some support both. But in all cases where they apply, they are used only for model building and are not considered when evaluating models using an Evaluation node or Analysis node, or when ranking models using most of the methods supported by the Auto Classifier and Auto Numeric nodes. When comparing models (using evaluation charts, for example), frequency and weight values will be ignored.
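The worked example above can be reproduced in a few lines. A hedged sketch in plain Python (not SPSS Modeler code) showing that the expanded and frequency-weighted forms of the training data yield identical response rates:

```python
# The same dataset in both forms described above.
expanded = [
    ("Yes", "Yes"), ("Yes", "Yes"), ("Yes", "Yes"), ("Yes", "No"),
    ("No", "Yes"), ("No", "No"), ("No", "No"),
]  # (married, responded), one row per case

compact = [
    ("Yes", "Yes", 3), ("Yes", "No", 1), ("No", "Yes", 1), ("No", "No", 2),
]  # (married, responded, frequency)

def rate_expanded(married):
    rows = [r for m, r in expanded if m == married]
    return sum(r == "Yes" for r in rows) / len(rows)

def rate_compact(married):
    total = sum(f for m, r, f in compact if m == married)
    yes = sum(f for m, r, f in compact if m == married and r == "Yes")
    return yes / total

print(rate_expanded("Yes"), rate_compact("Yes"))  # both 0.75 (three/four)
print(rate_expanded("No"), rate_compact("No"))    # both one third respond
```

Because both representations describe the same population, a model built from either form makes the same predictions.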
Ignoring frequency and weight values in this way allows a level comparison between models that use these fields and models that don't, but it means that for an accurate evaluation, you must use a dataset that accurately represents the population without relying on a frequency or weight field. In practical terms, you can do this by making sure that models are evaluated using a testing sample in which the value of the frequency or weight field is always null or 1. (This restriction only applies when evaluating models; if frequency or weight values were always 1 for both training and testing samples, there would be no reason to use these fields in the first place.)

If you are using the Auto Classifier node, frequency can be taken into account when ranking models based on Profit, so this ranking method is recommended in that case.

If necessary, you can split the data into training and testing samples using a Partition node.

Modeling Node Analyze Options

Many modeling nodes include an Analyze tab that allows you to obtain predictor importance information along with raw and adjusted propensity scores.

Figure 3-6 Analyze tab in the modeling node

Model Evaluation

Calculate predictor importance. For models that produce an appropriate measure of importance, you can display a chart that indicates the relative importance of each predictor in estimating the model. Typically you will want to focus your modeling efforts on the predictors that matter most, and consider dropping or ignoring those that matter least. Note that predictor importance may take longer to calculate for some models, particularly when working with large datasets, and is off by default for some models as a result. Predictor importance is not available for decision list models. For more information, see the topic Predictor Importance on p. 51.

Propensity Scores

Propensity scores can be enabled in the modeling node and on the Settings tab in the model nugget. This functionality is available only when the selected target is a flag field.
For more information, see the topic Propensity Scores on p. 41.

Calculate raw propensity scores. Raw propensity scores are derived from the model based on the training data only. If the model predicts the true value (will respond), then the propensity is the same as P, where P is the probability of the prediction. If the model predicts the false value, then the propensity is calculated as (1 – P).

If you choose this option when building the model, propensity scores will be enabled in the model nugget by default. However, you can always choose to enable raw propensity scores in the model nugget whether or not you select them in the modeling node. When scoring the model, raw propensity scores will be added in a field with the letters RP appended to the standard prefix. For example, if the predictions are in a field named $R-churn, the name of the propensity score field will be $RRP-churn.

Calculate adjusted propensity scores. Raw propensities are based purely on estimates given by the model, which may be overfitted, leading to over-optimistic estimates of propensity. Adjusted propensities attempt to compensate by looking at how the model performs on the test or validation partitions and adjusting the propensities to give a better estimate accordingly. This setting requires that a valid partition field is present in the stream. Unlike raw confidence scores, adjusted propensity scores must be calculated when building the model; otherwise, they will not be available when scoring the model nugget. When scoring the model, adjusted propensity scores will be added in a field with the letters AP appended to the standard prefix. For example, if the predictions are in a field named $R-churn, the name of the propensity score field will be $RAP-churn. Adjusted propensity scores are not available for logistic regression models.
When calculating the adjusted propensity scores, the test or validation partition used for the calculation must not have been balanced. To avoid this, be sure the Only balance training data option is selected in any upstream Balance nodes. In addition, if a complex sample has been taken upstream, this will invalidate the adjusted propensity scores. Adjusted propensity scores are not available for "boosted" tree and rule set models. For more information, see the topic Boosted C5.0 Models in Chapter 6 on p. 171.

Based on. For adjusted propensity scores to be computed, a partition field must be present in the stream. You can specify whether to use the testing or validation partition for this computation. For best results, the testing or validation partition should include at least as many records as the partition used to train the original model.

Propensity Scores

For models that return a yes or no prediction, you can request propensity scores in addition to the standard prediction and confidence values. Propensity scores indicate the likelihood of a particular outcome or response. For example:

Table 3-1 Propensity scores

Customer     Propensity to respond
Joe Smith    35%
Jane Smith   15%

Propensity scores are available only for models with flag targets, and indicate the likelihood of the True value defined for the field, as specified in a source or Type node.

Propensity Scores Versus Confidence Scores

Propensity scores differ from confidence scores, which apply to the current prediction, whether yes or no. In cases where the prediction is no, for example, a high confidence actually means a high likelihood not to respond. Propensity scores sidestep this limitation to allow easier comparison across all records. For example, a no prediction with a confidence of 0.85 translates to a raw propensity of 0.15 (or 1 minus 0.85).
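The conversion from confidence to raw propensity described above can be sketched as follows (illustrative plain Python, not SPSS Modeler code):

```python
# For a flag target, raw propensity is the likelihood of the True outcome:
# equal to the confidence P when the prediction is True, and 1 - P when
# the prediction is False.
def raw_propensity(prediction, confidence):
    """prediction: True/False flag value; confidence: P of that prediction."""
    return confidence if prediction else 1.0 - confidence

print(raw_propensity(True, 0.35))            # predicted "Will respond"
print(round(raw_propensity(False, 0.85), 2)) # "Won't respond": 1 - 0.85
```

A "no" prediction with confidence 0.85 thus maps to a raw propensity of 0.15, directly comparable with the propensities of records predicted "yes".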
Table 3-2 Confidence scores

Customer     Prediction      Confidence
Joe Smith    Will respond    .35
Jane Smith   Won't respond   .85

Obtaining Propensity Scores

Propensity scores can be enabled on the Analyze tab in the modeling node or on the Settings tab in the model nugget. This functionality is available only when the selected target is a flag field. For more information, see the topic Modeling Node Analyze Options on p. 39. Propensity scores may also be calculated by the Ensemble node, depending on the ensemble method used.

Calculating Adjusted Propensity Scores

Adjusted propensity scores are calculated as part of the process of building the model, and will not be available otherwise. Once the model is built, it is then scored using data from the test or validation partition, and a new model to deliver adjusted propensity scores is constructed by analyzing the original model's performance on that partition. Depending on the type of model, one of two methods may be used to calculate the adjusted propensity scores.

For rule set and tree models, adjusted propensity scores are generated by recalculating the frequency of each category at each tree node (for tree models) or the support and confidence of each rule (for rule set models). This results in a new rule set or tree model that is stored with the original model, to be used whenever adjusted propensity scores are requested. Each time the original model is applied to new data, the new model can subsequently be applied to the raw propensity scores to generate the adjusted scores.

For other models, records produced by scoring the original model on the test or validation partition are binned by their raw propensity score. Next, a neural network model is trained that defines a non-linear function that maps from the mean raw propensity in each bin to the mean observed propensity in the same bin.
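The binning step just described can be sketched in a few lines. This is a hedged illustration only: a nearest-bin lookup stands in for the neural network that SPSS Modeler actually trains over the bins.

```python
# Illustrative sketch of binned propensity calibration: map the mean raw
# propensity in each bin to the mean observed outcome in the same bin.
def fit_binned_calibration(raw_scores, observed, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(raw_scores, observed):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    mapping = []
    for members in bins:
        if members:
            mean_raw = sum(p for p, _ in members) / len(members)
            mean_obs = sum(y for _, y in members) / len(members)
            mapping.append((mean_raw, mean_obs))
    def adjust(p):
        # nearest-bin lookup; the real implementation fits a non-linear
        # (neural network) function over these points instead
        return min(mapping, key=lambda m: abs(m[0] - p))[1]
    return adjust

# Raw scores from the test partition, with observed flag outcomes (1 = True).
adjust = fit_binned_calibration(
    [0.1, 0.15, 0.45, 0.55, 0.9, 0.95], [0, 0, 1, 0, 1, 1])
print(adjust(0.92))  # 1.0: the high-propensity bin was fully observed True
```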
As noted earlier for tree models, the resulting neural network model is stored with the original model, and can be applied to the raw propensity scores whenever adjusted propensity scores are requested.

Caution regarding missing values in the testing partition. Handling of missing input values in the testing/validation partition varies by model (see individual model scoring algorithms for details). The C5.0 model cannot compute adjusted propensities when there are missing inputs.

Model Nuggets

Figure 3-7 Model nugget

A model nugget is a container for a model, that is, the set of rules, formulas, or equations that represent the results of your model building operations in IBM® SPSS® Modeler. The main purpose of a nugget is for scoring data to generate predictions, or to allow further analysis of the model properties. Opening a model nugget on the screen enables you to see various details about the model, such as the relative importance of the input fields in creating the model. To view the predictions, you need to attach and execute a further process or output node. For more information, see the topic Using Model Nuggets in Streams on p. 63.

Figure 3-8 Model link from modeling node to model nugget

When you successfully execute a modeling node, a corresponding model nugget is placed on the stream canvas, where it is represented by a gold, diamond-shaped icon (hence the name "nugget"). On the stream canvas, the nugget is shown with a connection (solid line) to the nearest suitable node before the modeling node, and a link (dotted line) to the modeling node itself. The nugget is also placed in the Models palette in the upper right corner of the SPSS Modeler window. From either location, nuggets can be selected and browsed to view details of the model.

Nuggets are always placed in the Models palette when a modeling node is successfully executed. You can set a user option to control whether the nugget is additionally placed on the stream canvas.
The following topics provide information on using model nuggets in SPSS Modeler. For an in-depth understanding of the algorithms used, see the SPSS Modeler Algorithms Guide, available in the \Documentation folder on the DVD for IBM® SPSS® Modeler.

Model Links

By default, a nugget is shown on the canvas with a link to the modeling node that created it. This is especially useful in complex streams with several nuggets, enabling you to identify the nugget that will be updated by each modeling node. Each link contains a symbol to indicate whether the model is replaced when the modeling node is executed. For more information, see the topic Replacing a Model on p. 46.

Defining and Removing Model Links

You can define and remove links manually on the canvas. When you are defining a new link, the cursor changes to the link cursor.

Figure 3-9 Link cursor

Defining a new link (context menu)
1. Right-click on the modeling node from which you want the link to start.
2. Choose Define Model Link from the context menu.
3. Click the nugget where you want the link to end.

Defining a new link (main menu)
1. Click the modeling node from which you want the link to start.
2. From the main menu, choose: Edit > Node > Define Model Link
3. Click the nugget where you want the link to end.

Removing an existing link (context menu)
1. Right-click on the nugget at the end of the link.
2. Choose Remove Model Link from the context menu.

Alternatively:
1. Right-click on the symbol in the middle of the link.
2. Choose Remove Link from the context menu.

Removing an existing link (main menu)
1. Click the modeling node or nugget from which you want to remove the link.
2. From the main menu, choose: Edit > Node > Remove Model Link

Copying and Pasting Model Links

If you copy a linked nugget, without its modeling node, and paste it into the same stream, the nugget is pasted with a link to the modeling node.
The new link has the same model replacement status (see Replacing a Model on p. 46) as the original link.

Figure 3-10 Copying and pasting a linked nugget

If you copy and paste a nugget together with its linked modeling node, the link is retained whether the objects are pasted into the same stream or a new stream.

Figure 3-11 Copying and pasting a linked nugget

Note: If you copy a linked nugget, without its modeling node, and paste the nugget into a new stream (or into a SuperNode that does not contain the modeling node), the link is broken and only the nugget is pasted.

Model Links and SuperNodes

If you define a SuperNode to include either the modeling node or the model nugget of a linked model (but not both), the link is broken. Expanding the SuperNode does not restore the link; you can only do this by undoing creation of the SuperNode.

Replacing a Model

You can choose whether to replace (that is, update) an existing nugget on re-execution of the modeling node that created the nugget. If you turn off the replacement option, a new nugget is created when you re-execute the modeling node. Note: Replacing a model is different from refreshing a model, which refers to updating a model in a scenario.

Each link from modeling node to nugget contains a symbol to indicate whether the model is replaced when the modeling node is re-executed.

Figure 3-12 Model link with model replacement turned on

The link is initially shown with model replacement turned on, depicted by the small sunburst symbol in the link. In this state, re-executing the modeling node at one end of the link simply updates the nugget at the other end.

Figure 3-13 Model link with model replacement turned off

If model replacement is turned off, the link symbol is replaced by a gray dot. In this state, re-executing the modeling node at one end of the link adds a new, updated version of the nugget to the canvas.
In either case, in the Models palette the existing nugget is updated or a new nugget is added, depending on the setting of the Replace previous model system option.

Order of Execution

When you execute a stream with multiple branches containing model nuggets, the stream is first evaluated to make sure that a branch with model replacement turned on is executed before any branch that uses the resulting model nugget. If your requirements are more complex, you can set the order of execution manually through scripting.

Changing the Model Replacement Setting

To change the setting for model replacement:
1. Right-click on the symbol on the link.
2. Choose Turn On Model Replacement or Turn Off Model Replacement as desired.

Note: The model replacement setting on a model link overrides the setting on the Notifications tab of the User Options dialog (Tools > Options > User Options).

The Models Palette

The models palette (on the Models tab in the managers window) allows you to use, examine, and modify model nuggets in various ways.

Figure 3-14 Models palette

Right-clicking a model nugget in the models palette opens a context menu with the following options:

Figure 3-15 Model nugget context menu

Add To Stream. Adds the model nugget to the currently active stream. If there is a selected node in the stream, the model nugget will be connected to the selected node when such a connection is possible, or otherwise to the nearest possible node. The nugget is displayed with a link to the modeling node that created the model, if that node is still in the stream.

Browse. Opens the model browser for the nugget.

Rename and Annotate. Allows you to rename the model nugget and/or modify the annotation for the nugget.

Generate Modeling Node. If you have a model nugget that you want to modify or update and the stream used to create the model is not available, you can use this option to recreate a modeling node with the same options used to create the original model.
Save Model, Save Model As. Saves the model nugget to an external generated model (.gm) binary file.

Store Model. Stores the model nugget in an IBM® SPSS® Collaboration and Deployment Services Repository.

Export PMML. Exports the model nugget as predictive model markup language (PMML), which can be used for scoring new data outside of IBM® SPSS® Modeler. Export PMML is available for all generated model nodes. Note: A license for IBM® SPSS® Modeler Server is required in order to use this feature.

Add to Project. Saves the model nugget and adds it to the current project. On the Classes tab, the nugget will be added to the Generated Models folder. On the CRISP-DM tab, it will be added to the default project phase.

Delete. Deletes the model nugget from the palette.

Figure 3-16 Models palette context menu

Right-clicking an unoccupied area in the models palette opens a context menu with the following options:

Open Model. Loads a model nugget previously created in SPSS Modeler.

Retrieve Model. Retrieves a stored model from an IBM SPSS Collaboration and Deployment Services repository.

Load Palette. Loads a saved models palette from an external file.

Retrieve Palette. Retrieves a stored models palette from an IBM SPSS Collaboration and Deployment Services repository.

Save Palette. Saves the entire contents of the models palette to an external generated models palette (.gen) file.

Store Palette. Stores the entire contents of the models palette in an IBM SPSS Collaboration and Deployment Services repository.

Clear Palette. Deletes all nuggets from the palette.

Add Palette To Project. Saves the models palette and adds it to the current project. On the Classes tab, the palette will be added to the Generated Models folder. On the CRISP-DM tab, it will be added to the default project phase.

Import PMML. Loads a model from an external file. You can open, browse, and score PMML models created by IBM® SPSS® Statistics or other applications that support this format.
For more information, see the topic Importing and Exporting Models as PMML on p. 65.

Browsing Model Nuggets

The model nugget browsers allow you to examine and use the results of your models. From the browser, you can save, print, or export the generated model, examine the model summary, and view or edit annotations for the model. For some types of model nugget, you can also generate new nodes, such as Filter nodes or Rule Set nodes. For some models, you can also view model parameters, such as rules or cluster centers. For some types of models (tree-based models and cluster models), you can view a graphical representation of the structure of the model. Controls for using the model nugget browsers are described below.

Menus

File menu. All model nuggets have a File menu, containing some subset of the following options:

Save Node. Saves the model nugget to a node (.nod) file.

Store Node. Stores the model nugget in an IBM SPSS Collaboration and Deployment Services repository.

Header and Footer. Allows you to edit the page header and footer for printing from the nugget.

Page Setup. Allows you to change the page setup for printing from the nugget.

Print Preview. Displays a preview of how the nugget will look when printed. Select the information you want to preview from the submenu.

Print. Prints the contents of the nugget. Select the information you want to print from the submenu.

Print View. Prints the current view or all views.

Export Text. Exports the contents of the nugget to a text file. Select the information you want to export from the submenu.

Export HTML. Exports the contents of the nugget to an HTML file. Select the information you want to export from the submenu.

Export PMML. Exports the model as predictive model markup language (PMML), which can be used with other PMML-compatible software. For more information, see the topic Importing and Exporting Models as PMML on p.
65. Note: A license for IBM® SPSS® Modeler Server is required in order to use this feature.

Export SQL. Exports the model as structured query language (SQL), which can be edited and used with other databases. Note: SQL export is available only from the following models: C5.0, C&RT, CHAID, QUEST, Linear Regression, Logistic Regression, Neural Net, PCA/Factor, and Decision List models.

Publish for Server Scoring Adapter. Publishes the model to a database that has a scoring adapter installed, enabling model scoring to be performed within the database. For more information, see the topic Publishing Models for a Scoring Adapter on p. 68.

Generate menu. Most model nuggets also have a Generate menu, allowing you to generate new nodes based on the model nugget. The options available from this menu depend on the type of model you are browsing. See the specific model nugget type for details about what you can generate from a particular model.

View menu. On the Model tab of a nugget, this menu enables you to display or hide the various visualization toolbars that are available in the current mode. To make the full set of toolbars available, select Edit Mode (the paintbrush icon) from the General toolbar.

Preview button. Some model nuggets have a Preview button, which enables you to see a sample of the model data, including the extra fields created by the modeling process. The default number of rows displayed is 10; however, you can change this in the stream properties.

Add to Current Project button. Saves the model nugget and adds it to the current project. On the Classes tab, the nugget will be added to the Generated Models folder. On the CRISP-DM tab, it will be added to the default project phase.

Model Nugget Summary / Information

The Summary tab or Information view for a model nugget displays information about the fields, build settings, and model estimation process. Results are presented in a tree view that can be expanded or collapsed by clicking specific items.
Analysis. Displays information about the model. Specific details vary by model type, and are covered in the section for each model nugget. In addition, if you have executed an Analysis node attached to this modeling node, information from that analysis will also be displayed in this section.

Fields. Lists the fields used as the target and the inputs in building the model. For split models, also lists the fields that determined the splits.

Build Settings / Options. Contains information about the settings used in building the model.

Training Summary. Shows the type of model, the stream used to create it, the user who created it, when it was built, and the elapsed time for building the model.

Predictor Importance

Typically, you will want to focus your modeling efforts on the predictor fields that matter most and consider dropping or ignoring those that matter least. The predictor importance chart helps you do this by indicating the relative importance of each predictor in estimating the model. Since the values are relative, the sum of the values for all predictors on the display is 1.0. Predictor importance does not relate to model accuracy. It just relates to the importance of each predictor in making a prediction, not whether or not the prediction is accurate.

Figure 3-17 Predictor importance chart

Predictor importance is available for models that produce an appropriate statistical measure of importance, including neural networks, decision trees (C&R Tree, C5.0, CHAID, and QUEST), Bayesian networks, discriminant, SVM, and SLRM models, linear and logistic regression, generalized linear, and nearest neighbor (KNN) models. For most of these models, predictor importance can be enabled on the Analyze tab in the modeling node. For more information, see the topic Modeling Node Analyze Options on p. 39. For KNN models, see Neighbors on p. 467.

Note: Predictor importance is not supported for split models.
Predictor importance settings are ignored when building split models. For more information, see the topic Building Split Models on p. 30.

Calculating predictor importance may take significantly longer than model building, particularly when using large datasets. It takes longer to calculate for SVM and logistic regression than for other models, and is disabled for these models by default. If using a dataset with a large number of predictors, initial screening using a Feature Selection node may give faster results (see below).

Predictor importance is calculated from the test partition, if available. Otherwise the training data is used.

For SLRM models, predictor importance is available but is computed by the SLRM algorithm. For more information, see the topic SLRM Model Nuggets in Chapter 14 on p. 451.

You can use IBM® SPSS® Modeler's graph tools to interact with, edit, and save the graph.

Optionally, you can generate a Filter node based on the information in the predictor importance chart. For more information, see the topic Filtering Variables Based on Importance on p. 52.

Predictor Importance and Feature Selection

The predictor importance chart displayed in a model nugget may seem to give results similar to the Feature Selection node in some cases. While feature selection ranks each input field based on the strength of its relationship to the specified target, independent of other inputs, the predictor importance chart indicates the relative importance of each input for this particular model. Thus feature selection will be more conservative in screening inputs. For example, if job title and job category are both strongly related to salary, then feature selection would indicate that both are important. But in modeling, interactions and correlations are also taken into consideration. Thus you might find that only one of two inputs is used if both duplicate much of the same information.
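As noted above, a Filter node can be generated from the importance chart. A hedged sketch in plain Python (hypothetical field names, not SPSS Modeler code) of the two selection criteria such a filter applies: keeping the top N predictors, or those whose relative importance exceeds a threshold.

```python
# Relative importance values sum to 1.0, as described above.
importance = {"tenure": 0.35, "income": 0.25, "age": 0.2,
              "gender": 0.12, "region": 0.08}
assert abs(sum(importance.values()) - 1.0) < 1e-9

def top_n(imp, n):
    # Keep the N most important predictors.
    return sorted(imp, key=imp.get, reverse=True)[:n]

def above_threshold(imp, cutoff):
    # Keep predictors whose relative importance exceeds the cutoff.
    return [f for f, v in imp.items() if v > cutoff]

print(top_n(importance, 3))
print(above_threshold(importance, 0.15))
```

Both calls select tenure, income, and age in this made-up example; the generated Filter node can equally be set to exclude rather than include the selected fields.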
In practice, feature selection is most useful for preliminary screening, particularly when dealing with large datasets with large numbers of variables, and predictor importance is more useful in fine-tuning the model.

Filtering Variables Based on Importance

Optionally, you can generate a Filter node based on the information in the predictor importance chart. Mark the predictors you want to include on the chart, if applicable, and from the menus choose:
Generate > Filter Node (Predictor Importance)
OR
Generate > Field Selection (Predictor Importance)

Figure 3-18 Filtering predictors based on importance

Top number of variables. Includes or excludes the most important predictors up to the specified number.

Importance greater than. Includes or excludes all predictors with relative importance greater than the specified value.

Models for Ensembles

The model for an ensemble provides information about the component models in the ensemble and the performance of the ensemble as a whole.

Figure 3-19 Model Summary view

The main (view-independent) toolbar allows you to choose whether to use the ensemble or a reference model for scoring. If the ensemble is used for scoring you can also select the combining rule. These changes do not require model re-execution; however, these choices are saved to the model (nugget) for scoring and/or downstream model evaluation. They also affect PMML exported from the ensemble viewer.

Combining Rule. When scoring an ensemble, this is the rule used to combine the predicted values from the base models to compute the ensemble score value. Ensemble predicted values for categorical targets can be combined using voting, highest probability, or highest mean probability. Voting selects the category that has the highest probability most often across the base models. Highest probability selects the category that achieves the single highest probability across all base models.
Highest mean probability selects the category with the highest value when the category probabilities are averaged across base models. Ensemble predicted values for continuous targets can be combined using the mean or median of the predicted values from the base models. The default is taken from the specifications made during model building. Changing the combining rule recomputes the model accuracy and updates all views of model accuracy. The Predictor Importance chart also updates. This control is disabled if the reference model is selected for scoring.

Show All Combining Rules. When selected, results for all available combining rules are shown in the model quality chart. The Component Model Accuracy chart is also updated to show reference lines for each voting method.

Model Summary

Figure 3-20 Model Summary view

The Model Summary view is a snapshot, at-a-glance summary of the ensemble quality and diversity.

Quality. The chart displays the accuracy of the final model, compared to a reference model and a naive model. Accuracy is presented in larger-is-better format; the "best" model will have the highest accuracy. For a categorical target, accuracy is simply the percentage of records for which the predicted value matches the observed value. For a continuous target, accuracy is 1 minus the ratio of the mean absolute error in prediction (the average of the absolute values of the predicted values minus the observed values) to the range of predicted values (the maximum predicted value minus the minimum predicted value).

For bagging ensembles, the reference model is a standard model built on the whole training partition. For boosted ensembles, the reference model is the first component model.

The naive model represents the accuracy if no model were built, and assigns all records to the modal category. The naive model is not computed for continuous targets.

Diversity.
The chart displays the "diversity of opinion" among the component models used to build the ensemble, presented in larger-is-more-diverse format. It is a measure of how much predictions vary across the base models. Diversity is not available for boosted ensemble models, nor is it shown for continuous targets.

Predictor Importance

Figure 3-21 Predictor Importance view

Typically, you will want to focus your modeling efforts on the predictor fields that matter most and consider dropping or ignoring those that matter least. The predictor importance chart helps you do this by indicating the relative importance of each predictor in estimating the model. Since the values are relative, the sum of the values for all predictors on the display is 1.0. Predictor importance does not relate to model accuracy. It just relates to the importance of each predictor in making a prediction, not whether or not the prediction is accurate.

Predictor importance is not available for all ensemble models. The predictor set may vary across component models, but importance can be computed for predictors used in at least one component model.

Predictor Frequency

Figure 3-22 Predictor Frequency view

The predictor set can vary across component models due to the choice of modeling method or predictor selection. The Predictor Frequency plot is a dot plot that shows the distribution of predictors across component models in the ensemble. Each dot represents one or more component models containing the predictor. Predictors are plotted on the y-axis, and are sorted in descending order of frequency; thus the topmost predictor is the one that is used in the greatest number of component models and the bottommost one is the one that was used in the fewest. The top 10 predictors are shown. Predictors that appear most frequently are typically the most important. This plot is not useful for methods in which the predictor set cannot vary across component models.
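The combining rules described under Models for Ensembles above can be sketched as follows (illustrative plain Python with made-up probabilities, not SPSS Modeler code):

```python
from collections import Counter
from statistics import mean, median

# Each base model's output for a categorical target, represented here as
# a dict of category probabilities (values are hypothetical).
base_outputs = [
    {"Yes": 0.6, "No": 0.4},
    {"Yes": 0.9, "No": 0.1},
    {"Yes": 0.3, "No": 0.7},
]

def voting(outputs):
    # Each base model votes for its most probable category.
    votes = Counter(max(o, key=o.get) for o in outputs)
    return votes.most_common(1)[0][0]

def highest_probability(outputs):
    # Category achieving the single highest probability across all models.
    return max(((cat, p) for o in outputs for cat, p in o.items()),
               key=lambda cp: cp[1])[0]

def highest_mean_probability(outputs):
    # Category with the highest probability averaged across base models.
    return max(outputs[0], key=lambda c: mean(o[c] for o in outputs))

print(voting(base_outputs))               # "Yes": chosen by 2 of 3 models
print(highest_probability(base_outputs))  # "Yes": single highest P is 0.9
print(highest_mean_probability(base_outputs))

# For continuous targets, base predictions are combined by mean or median.
preds = [10.0, 12.0, 20.0]
print(mean(preds), median(preds))  # 14.0 12.0
```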
Component Model Accuracy Figure 3-23 Component Model Accuracy view The chart is a dot plot of predictive accuracy for component models. Each dot represents one or more component models with the level of accuracy plotted on the y-axis. Hover over any dot to obtain information on the corresponding individual component model. Reference lines. The plot displays color-coded lines for the ensemble as well as the reference and naive models. A checkmark appears next to the line corresponding to the model that will be used for scoring. Interactivity. The chart updates if you change the combining rule. Boosted ensembles. A line chart is displayed for boosted ensembles. Figure 3-24 Ensemble Accuracy view, boosted ensemble Component Model Details Figure 3-25 Component Model Details view The table displays information on component models, listed by row. By default, component models are sorted in ascending model number order. You can sort the rows in ascending or descending order by the values of any column. Model. A number representing the sequential order in which the component model was created. Accuracy. Overall accuracy formatted as a percentage. Method. The modeling method. Predictors. The number of predictors used in the component model. Model Size. Model size depends on the modeling method: for trees, it is the number of nodes in the tree; for linear models, it is the number of coefficients; for neural networks, it is the number of synapses. Records. The weighted number of input records in the training sample. Automatic Data Preparation Figure 3-26 Automatic Data Preparation view This view shows information about which fields were excluded and how transformed fields were derived in the automatic data preparation (ADP) step. For each field that was transformed or excluded, the table lists the field name, its role in the analysis, and the action taken by the ADP step.
Fields are sorted in ascending alphabetical order of field names. The action Trim outliers, if shown, indicates that values of continuous predictors that lie beyond a cutoff value (3 standard deviations from the mean) have been set to the cutoff value. Model Nuggets for Split Models The model nugget for a split model provides access to all the separate models created by the splits. A split-model nugget contains a list of all the split models created, together with a set of statistics about each model, and information about the overall model. From the list of split models, you can open individual models to examine them further. Split Model Viewer The Model tab lists all the models contained in the nugget, and provides statistics in various forms about the split models. It has two general forms, depending on the modeling node, as follows. Figure 3-27 Split model viewer Sort by. Use this list to choose the order in which the models are listed. You can sort the list based on the values of any of the display columns, in ascending or descending order. Alternatively, click on a column heading to sort the list by that column. The default is descending order of overall accuracy. Show/hide columns menu. Click this button to display a menu from which you can choose individual columns to show or hide. View. If you are using partitioning, you can choose to view the results for either the training data or the testing data. For each split, the details shown are as follows: Graph. A thumbnail indicating the data distribution for this model. When the nugget is on the canvas, double-click the thumbnail to open the full-size graph. Model. An icon of the model type. Double-click the icon to open the model nugget for this particular split. Split fields. The fields designated in the modeling node as split fields, with their various possible values. No. Records in Split. The number of records involved in this particular split. No. Fields Used.
Ranks split models based on the number of input fields used. Overall Accuracy (%). The percentage of records that are correctly predicted by the split model relative to the total number of records in that split. Figure 3-28 Split model viewer Split. The column heading shows the field(s) used to create splits, and the cells are the split values. Double-click on any split to open a Model Viewer for the model built for that split. Accuracy. Overall accuracy formatted as a percentage. Model Size. Model size depends on the modeling method: for trees, it is the number of nodes in the tree; for linear models, it is the number of coefficients; for neural networks, it is the number of synapses. Records. The weighted number of input records in the training sample. Using Model Nuggets in Streams Model nuggets are placed in streams to enable you to score new data and generate new nodes. Scoring data allows you to use the information gained from model building to create predictions for new records. To see the results of scoring, you need to attach a terminal node (that is, a processing or output node) to the nugget and execute the terminal node. For some models, model nuggets can also give you additional information about the quality of the prediction, such as confidence values or distances from cluster centers. Generating new nodes allows you to easily create new nodes based on the structure of the generated model. For example, most models that perform input field selection allow you to generate Filter nodes that will pass only input fields that the model identified as important. To Use a Model Nugget for Scoring Data E Connect the model nugget to a data source or stream that will pass data to it. Figure 3-29 Using a model nugget for scoring E Add or connect one or more processing or output nodes (such as a Table or Analysis node) to the model nugget. E Execute one of the nodes downstream from the model nugget.
Note: You cannot use the Unrefined Rule node for scoring data. To score data based on an association rule model, use the Unrefined Rule node to generate a Rule Set nugget, and use the Rule Set nugget for scoring. For more information, see the topic Generating a Rule Set from an Association Model Nugget in Chapter 12 on p. 398. To Use a Model Nugget for Generating Processing Nodes E On the palette, browse the model, or, on the stream canvas, edit the model. E Select the desired node type from the Generate menu of the model nugget browser window. The options available will vary, depending on the type of model nugget. See the specific model nugget type for details about what you can generate from a particular model. Regenerating a Modeling Node If you have a model nugget that you want to modify or update and the stream used to create the model is not available, you can regenerate a modeling node with the same options used to create the original model. E To rebuild a model, right-click on the model in the models palette and choose Generate Modeling Node. E Alternatively, when browsing any model, choose Generate Modeling Node from the Generate menu. In most cases, the regenerated modeling node is functionally identical to the one used to create the original model. For Decision Tree models, additional settings specified during the interactive session may also be stored with the node, and the Use tree directives option will be enabled in the regenerated modeling node. For Decision List models, the Use saved interactive session information option will be enabled. For more information, see the topic Decision List Model Options in Chapter 9 on p. 209. For Time Series models, the Continue estimation using existing model(s) option is enabled, allowing you to regenerate the previous model with current data. For more information, see the topic Time Series Model Options in Chapter 13 on p. 427.
Importing and Exporting Models as PMML PMML (Predictive Model Markup Language) is an XML format for describing data mining and statistical models, including inputs to the models, transformations used to prepare data for data mining, and the parameters that define the models themselves. IBM® SPSS® Modeler can import and export PMML, making it possible to share models with other applications that support this format, such as IBM® SPSS® Statistics. For more information about PMML, see the Data Mining Group website (http://www.dmg.org). To Export a Model PMML export is supported for most of the model types generated in SPSS Modeler. For more information, see the topic Model Types Supporting PMML on p. 67. E Right-click a model nugget on the models palette. (Alternatively, double-click a model nugget on the canvas and select the File menu.) E On the menu, click Export PMML. Figure 3-30 Exporting a model in PMML format E In the Export (or Save) dialog box, specify a target directory and a unique name for the model. Note: You can change options for PMML export in the User Options dialog box. On the main menu, click: Tools > Options > User Options and click the PMML tab. To Import a Model Saved as PMML Models exported as PMML from SPSS Modeler or another application can be imported into the models palette. For more information, see the topic Model Types Supporting PMML on p. 67. E In the models palette, right-click the palette and select Import PMML from the menu. Figure 3-31 Importing a model in PMML format E Select the file to import and specify options for variable labels as required. E Click Open. Figure 3-32 Selecting the XML file for a model saved using PMML Use variable labels if present in model. The PMML may specify both variable names and variable labels (such as Referrer ID for RefID) for variables in the data dictionary. Select this option to use variable labels if they are present in the originally exported PMML.
If you have selected the variable label option but there are no variable labels in the PMML, the variable names are used as normal. Model Types Supporting PMML PMML Export SPSS Modeler models. The following models created in IBM® SPSS® Modeler can be exported as PMML 4.0: C&R Tree, QUEST, CHAID, Linear Regression, Neural Net, C5.0, Logistic Regression, Genlin, SVM, Bayes Net, Apriori, Carma, K-Means, Kohonen, TwoStep, KNN, and Statistics Model. The following model created in SPSS Modeler can be exported as PMML 3.2: Decision List. Database native models. For models generated using database-native algorithms, PMML export is available for IBM InfoSphere Warehouse models only. Models created using Analysis Services from Microsoft or Oracle Data Miner cannot be exported. Also note that IBM models exported as PMML cannot be imported back into SPSS Modeler. PMML Import SPSS Modeler can import and score PMML models generated by current versions of all IBM® SPSS® Statistics products, including models exported from SPSS Modeler as well as model or transformation PMML generated by SPSS Statistics 17.0 or later. Essentially, this means any PMML that the scoring engine can score, with the following exceptions: Apriori, CARMA, Anomaly Detection, and Sequence models cannot be imported. PMML models may not be browsed after importing into SPSS Modeler even though they can be used in scoring. (Note that this includes models that were exported from SPSS Modeler to begin with. To avoid this limitation, export the model as a generated model file [*.gm] rather than PMML.) IBM InfoSphere Warehouse models exported as PMML cannot be imported. Limited validation occurs on import, but full validation is performed on attempting to score the model. Thus it is possible for import to succeed but scoring to fail or produce incorrect results. Publishing Models for a Scoring Adapter You can publish models to a database server that has a scoring adapter installed.
A scoring adapter enables model scoring to be performed within the database by using the user-defined function (UDF) capabilities of the database. Performing scoring in the database avoids the need to extract the data before scoring. Publishing to a scoring adapter also generates some example SQL to execute the UDF. To publish to a scoring adapter E Double-click the model nugget to open it. E From the model nugget menu, choose: File > Publish for Server Scoring Adapter E Fill in the relevant fields on the dialog box and click OK. Database connection. The connection details for the database you want to use for the model. Publish ID. (DB2 for z/OS databases only) An identifier for the model. If you rebuild the same model and use the same publish ID, the generated SQL remains the same, so it is possible to rebuild a model without having to change the application that uses the SQL previously generated. (For other databases the generated SQL is unique to the model.) Generate Example SQL. If selected, generates the example SQL into the file specified in the File field. Unrefined Models An unrefined model contains information extracted from the data but is not designed for generating predictions directly. This means that it cannot be added to streams. Unrefined models are displayed as “diamonds in the rough” on the generated models palette. Figure 3-33 Unrefined model icon To see information about the unrefined rule model, right-click the model and choose Browse from the context menu. Like other models generated in IBM® SPSS® Modeler, the various tabs provide summary and rule information about the model created. Generating nodes. The Generate menu enables you to create new nodes based on the rules. Select Node. Generates a Select node to select records to which the currently selected rule applies. This option is disabled if no rule is selected. Rule set. Generates a Rule Set node to predict values for a single target field.
For more information, see the topic Generating a Rule Set from an Association Model Nugget in Chapter 12 on p. 398. Chapter 4 Screening Models Screening Fields and Records Several modeling nodes can be used during the preliminary stages of an analysis in order to locate fields and records that are most likely to be of interest in modeling. You can use the Feature Selection node to screen and rank fields by importance and the Anomaly Detection node to locate unusual records that do not conform to the known patterns of “normal” data. The Feature Selection node screens input fields for removal based on a set of criteria (such as the percentage of missing values); it then ranks the importance of remaining inputs relative to a specified target. For example, given a data set with hundreds of potential inputs, which are most likely to be useful in modeling patient outcomes? For more information, see the topic Feature Selection Node on p. 70. The Anomaly Detection node identifies unusual cases, or outliers, that do not conform to patterns of “normal” data. With this node, it is possible to identify outliers even if they do not fit any previously known patterns and even if you are not exactly sure what you are looking for. For more information, see the topic Anomaly Detection Node on p. 77. Note that anomaly detection identifies unusual records or cases through cluster analysis based on the set of fields selected in the model without regard for any specific target (dependent) field and regardless of whether those fields are relevant to the pattern you are trying to predict. For this reason, you may want to use anomaly detection in combination with feature selection or another technique for screening and ranking fields. For example, you can use feature selection to identify the most important fields relative to a specific target and then use anomaly detection to locate the records that are the most unusual with respect to those fields. 
(An alternative approach would be to build a decision tree model and then examine any misclassified records as potential anomalies. However, this method would be more difficult to replicate or automate on a large scale.) Feature Selection Node Data mining problems may involve hundreds, or even thousands, of fields that can potentially be used as inputs. As a result, a great deal of time and effort may be spent examining which fields or variables to include in the model. To narrow down the choices, the Feature Selection algorithm can be used to identify the fields that are most important for a given analysis. For example, if you are trying to predict patient outcomes based on a number of factors, which factors are the most likely to be important? Feature selection consists of three steps: Screening. Removes unimportant and problematic inputs and records, or cases, such as input fields with too many missing values or with too much or too little variation to be useful. Ranking. Sorts remaining inputs and assigns ranks based on importance. Selecting. Identifies the subset of features to use in subsequent models—for example, by preserving only the most important inputs and filtering or excluding all others. In an age where many organizations are overloaded with too much data, the benefits of feature selection in simplifying and speeding the modeling process can be substantial. By focusing attention quickly on the fields that matter most, you can reduce the amount of computation required; more easily locate small but important relationships that might otherwise be overlooked; and, ultimately, obtain simpler, more accurate, and more easily explainable models. By reducing the number of fields used in the model, you may find that you can reduce scoring times as well as the amount of data collected in future iterations. Example.
A telephone company has a data warehouse containing information about responses to a special promotion by 5,000 of the company’s customers. The data includes a large number of fields containing customers’ ages, employment, income, and telephone usage statistics. Three target fields show whether or not the customer responded to each of three offers. The company wants to use this data to help predict which customers are most likely to respond to similar offers in the future. Requirements. A single target field (one with its role set to Target), along with multiple input fields that you want to screen or rank relative to the target. Both target and input fields can have a measurement level of Continuous (numeric range) or Categorical. Feature Selection Model Settings The settings on the Model tab include standard model options along with settings that allow you to fine-tune the criteria for screening input fields. Figure 4-1 Feature Selection Model tab Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name. Screening Input Fields Screening involves removing inputs or cases that do not add any useful information with respect to the input/target relationship. Screening options are based on attributes of the field in question without respect to predictive power relative to the selected target field. Screened fields are excluded from the computations used to rank inputs and optionally can be filtered or removed from the data used in modeling. Fields can be screened based on the following criteria: Maximum percentage of missing values. Screens fields with too many missing values, expressed as a percentage of the total number of records. Fields with a large percentage of missing values provide little predictive information. Maximum percentage of records in a single category.
Screens fields that have too many records falling into the same category relative to the total number of records. For example, if 95% of the customers in the database drive the same type of car, including this information is not useful in distinguishing one customer from the next. Any fields that exceed the specified maximum are screened. This option applies to categorical fields only. Maximum number of categories as a percentage of records. Screens fields with too many categories relative to the total number of records. If a high percentage of the categories contains only a single case, the field may be of limited use. For example, if every customer wears a different hat, this information is unlikely to be useful in modeling patterns of behavior. This option applies to categorical fields only. Minimum coefficient of variation. Screens fields with a coefficient of variation less than or equal to the specified minimum. This measure is the ratio of the input field standard deviation to the mean of the input field. If this value is near zero, there is not much variability in the values for the variable. This option applies to continuous (numeric range) fields only. Minimum standard deviation. Screens fields with standard deviation less than or equal to the specified minimum. This option applies to continuous (numeric range) fields only. Records with missing data. Records or cases that have missing values for the target field, or missing values for all inputs, are automatically excluded from all computations used in the rankings. Feature Selection Options The Options tab allows you to specify the default settings for selecting or excluding input fields in the model nugget. You can then add the model to a stream to select a subset of fields for use in subsequent model-building efforts. Alternatively, you can override these settings by selecting or deselecting additional fields in the model browser after generating the model.
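The screening criteria described above can be made concrete with a minimal sketch. This is an illustrative reimplementation, not the node’s code; the threshold defaults are assumptions rather than SPSS Modeler’s documented defaults, and missing values are represented here as None:

```python
import statistics

def screen_field(values, is_categorical,
                 max_missing_pct=70.0, max_single_cat_pct=90.0,
                 max_categories_pct=95.0, min_cv=0.1, min_stddev=0.0):
    """Apply screening criteria like those described above to one
    field. Returns "kept" or a reason the field was screened."""
    n = len(values)
    present = [v for v in values if v is not None]
    # Maximum percentage of missing values.
    if 100.0 * (n - len(present)) / n > max_missing_pct:
        return "screened: too many missing values"
    if is_categorical:
        counts = {}
        for v in present:
            counts[v] = counts.get(v, 0) + 1
        # Maximum percentage of records in a single category.
        if 100.0 * max(counts.values()) / n > max_single_cat_pct:
            return "screened: single category too dominant"
        # Maximum number of categories as a percentage of records.
        if 100.0 * len(counts) / n > max_categories_pct:
            return "screened: too many categories"
    else:
        # Minimum standard deviation, then minimum coefficient of
        # variation (ratio of standard deviation to mean).
        sd = statistics.pstdev(present)
        if sd <= min_stddev:
            return "screened: standard deviation too low"
        mean = statistics.mean(present)
        if mean != 0 and sd / abs(mean) <= min_cv:
            return "screened: coefficient of variation too low"
    return "kept"
```

For example, a numeric field whose values are all identical is screened for zero standard deviation, and a categorical field in which 95% of records share one value exceeds a 90% single-category limit.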
However, the default settings make it possible to apply the model nugget without further changes, which may be particularly useful for scripting purposes. For more information, see the topic Feature Selection Model Results on p. 74. Figure 4-2 Feature Selection Options tab The following options are available: All fields ranked. Selects fields based on their ranking as important, marginal, or unimportant. You can edit the label for each ranking as well as the cutoff values used to assign records to one rank or another. Top number of fields. Selects the top n fields based on importance. Importance greater than. Selects all fields with importance greater than the specified value. The target field is always preserved regardless of the selection. Importance Ranking Options All categorical. When all inputs and the target are categorical, importance can be ranked based on any of four measures: Pearson chi-square. Tests for independence of the target and the input without indicating the strength or direction of any existing relationship. Likelihood-ratio chi-square. Similar to Pearson’s chi-square but also tests for target-input independence. Cramer’s V. A measure of association based on Pearson’s chi-square statistic. Values range from 0, which indicates no association, to 1, which indicates perfect association. Lambda. A measure of association reflecting the proportional reduction in error when the variable is used to predict the target value. A value of 1 indicates that the input field perfectly predicts the target, while a value of 0 means the input provides no useful information about the target. Some categorical. When some—but not all—inputs are categorical and the target is also categorical, importance can be ranked based on either the Pearson or likelihood-ratio chi-square. (Cramer’s V and lambda are not available unless all inputs are categorical.) Categorical versus continuous.
When ranking a categorical input against a continuous target or vice versa (one or the other is categorical but not both), the F statistic is used. Both continuous. When ranking a continuous input against a continuous target, the t statistic based on the correlation coefficient is used. Feature Selection Model Nuggets Feature Selection model nuggets display the importance of each input relative to a selected target, as ranked by the Feature Selection node. Any fields that were screened out prior to the ranking are also listed. For more information, see the topic Feature Selection Node on p. 70. When you run a stream containing a Feature Selection model nugget, the model acts as a filter that preserves only selected inputs, as indicated by the current selection on the Model tab. For example, you could select all fields ranked as important (one of the default options) or manually select a subset of fields on the Model tab. The target field is also preserved regardless of the selection. All other fields are excluded. Filtering is based on the field name only; for example, if you select age and income, any field that matches either of these names will be preserved. The model does not update field rankings based on new data; it simply filters fields based on the selected names. For this reason, care should be used in applying the model to new or updated data. When in doubt, regenerating the model is recommended. Feature Selection Model Results The Model tab for a Feature Selection model nugget displays the rank and importance of all inputs in the upper pane and allows you to select fields for filtering by using the check boxes in the column on the left. When you run the stream, only the checked fields are preserved. The other fields are discarded. The default selections are based on the options specified in the model-building node, but you can select or deselect additional fields as needed. 
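To make the chi-square-based ranking measures from the Importance Ranking Options above concrete, the following sketch computes the Pearson chi-square statistic and Cramer’s V from a contingency table of target-by-input counts. This is the generic textbook computation, not the node’s implementation:

```python
import math

def pearson_chi2(table):
    """Pearson chi-square statistic for a contingency table of
    observed counts (rows = target categories, columns = input
    categories)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def cramers_v(table):
    """Cramer's V scales the chi-square statistic into the 0-1
    range: 0 indicates no association, 1 perfect association."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(pearson_chi2(table) / (n * (k - 1)))

# A perfectly associated 2x2 table yields V = 1.0.
print(cramers_v([[10, 0], [0, 10]]))
```

A table of independent counts, such as [[5, 5], [5, 5]], gives a chi-square statistic of zero and hence V = 0.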
The lower pane lists inputs that have been excluded from the rankings based on the percentage of missing values or on other criteria specified in the modeling node. As with the ranked fields, you can choose to include or discard these fields by using the check boxes in the column on the left. For more information, see the topic Feature Selection Model Settings on p. 71. Figure 4-3 Feature Selection model results To sort the list by rank, field name, importance, or any of the other displayed columns, click on the column header. Or, to use the toolbar, select the desired item from the Sort By list, and use the up and down arrows to change the direction of the sort. You can use the toolbar to check or uncheck all fields and to access the Check Fields dialog box, which allows you to select fields by rank or importance. You can also press the Shift and Ctrl keys while clicking on fields to extend the selection and use the spacebar to toggle on or off a group of selected fields. For more information, see the topic Selecting Fields by Importance on p. 76. The threshold values for ranking inputs as important, marginal, or unimportant are displayed in the legend below the table. These values are specified in the modeling node. For more information, see the topic Feature Selection Options on p. 72. Selecting Fields by Importance When scoring data using a Feature Selection model nugget, all fields selected from the list of ranked or screened fields—as indicated by the check boxes in the column on the left—will be preserved. Other fields will be discarded. To change the selection, you can use the toolbar to access the Check Fields dialog box, which allows you to select fields by rank or importance. Figure 4-4 Check fields dialog box All fields marked. Selects all fields marked as important, marginal, or unimportant. Top number of fields. Allows you to select the top n fields based on importance. Importance greater than.
Selects all fields with importance greater than the specified threshold. Generating a Filter from a Feature Selection Model Based on the results of a Feature Selection model, you can generate one or more Filter nodes that include or exclude subsets of fields based on importance relative to the specified target. While the model nugget can also be used as a filter, this gives you the flexibility to experiment with different subsets of fields without copying or modifying the model. The target field is always preserved by the filter regardless of whether include or exclude is selected. Figure 4-5 Generating a Filter node Include/Exclude. You can choose to include or exclude fields—for example, to include the top 10 fields or exclude all fields marked as unimportant. Selected fields. Includes or excludes all fields currently selected in the table. All fields marked. Selects all fields marked as important, marginal, or unimportant. Top number of fields. Allows you to select the top n fields based on importance. Importance greater than. Selects all fields with importance greater than the specified threshold. Anomaly Detection Node Anomaly detection models are used to identify outliers, or unusual cases, in the data. Unlike other modeling methods that store rules about unusual cases, anomaly detection models store information on what normal behavior looks like. This makes it possible to identify outliers even if they do not conform to any known pattern, and it can be particularly useful in applications such as fraud detection, where new patterns may constantly be emerging. Anomaly detection is an unsupervised method, which means that it does not require a training dataset containing known cases of fraud to use as a starting point. While traditional methods of identifying outliers generally look at one or two variables at a time, anomaly detection can examine large numbers of fields to identify clusters or peer groups into which similar records fall.
Each record can then be compared to others in its peer group to identify possible anomalies. The further away a case is from the normal center, the more likely it is to be unusual. For example, the algorithm might lump records into three distinct clusters and flag those that fall far from the center of any one cluster. Figure 4-6 Using clustering to identify potential anomalies Each record is assigned an anomaly index, which is the ratio of the group deviation index to its average over the cluster that the case belongs to. The larger the value of this index, the more the case deviates from the average. Under usual circumstances, cases with anomaly index values less than 1 or even 1.5 would not be considered anomalies, because their deviation is about the same as, or only a bit more than, the average. However, cases with an index value greater than 2 could be good anomaly candidates because the deviation is at least twice the average. Anomaly detection is an exploratory method designed for quick detection of unusual cases or records that should be candidates for further analysis. These should be regarded as suspected anomalies, which, on closer examination, may or may not turn out to be real. You may find that a record is perfectly valid but choose to screen it from the data for purposes of model building. Alternatively, if the algorithm repeatedly turns up false anomalies, this may point to an error or artifact in the data collection process. Note that anomaly detection identifies unusual records or cases through cluster analysis based on the set of fields selected in the model without regard for any specific target (dependent) field and regardless of whether those fields are relevant to the pattern you are trying to predict. For this reason, you may want to use anomaly detection in combination with feature selection or another technique for screening and ranking fields.
For example, you can use feature selection to identify the most important fields relative to a specific target and then use anomaly detection to locate the records that are the most unusual with respect to those fields. (An alternative approach would be to build a decision tree model and then examine any misclassified records as potential anomalies. However, this method would be more difficult to replicate or automate on a large scale.) Example. In screening agricultural development grants for possible cases of fraud, anomaly detection can be used to discover deviations from the norm, highlighting those records that are abnormal and worthy of further investigation. You are particularly interested in grant applications that seem to claim too much (or too little) money for the type and size of farm. Requirements. One or more input fields. Note that only fields with a role set to Input using a source or Type node can be used as inputs. Target fields (role set to Target or Both) are ignored. Strengths. By flagging cases that do not conform to a known set of rules rather than those that do, Anomaly Detection models can identify unusual cases even when they don’t follow previously known patterns. When used in combination with feature selection, anomaly detection makes it possible to screen large amounts of data to identify the records of greatest interest relatively quickly. Anomaly Detection Model Options Figure 4-7 Anomaly Detection Model tab Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name. Determine cutoff value for anomaly based on. Specifies the method used to determine the cutoff value for flagging anomalies. The following options are available: Minimum anomaly index level. Specifies the minimum cutoff value for flagging anomalies. Records that meet or exceed this threshold are flagged.
Percentage of most anomalous records in the training data. Automatically sets the threshold at a level that flags the specified percentage of records in the training data. The resulting cutoff is included as a parameter in the model. Note that this option determines how the cutoff value is set, not the actual percentage of records to be flagged during scoring. Actual scoring results may vary depending on the data.

Number of most anomalous records in the training data. Automatically sets the threshold at a level that flags the specified number of records in the training data. The resulting threshold is included as a parameter in the model. Note that this option determines how the cutoff value is set, not the specific number of records to be flagged during scoring. Actual scoring results may vary depending on the data.

Note: Regardless of how the cutoff value is determined, it does not affect the underlying anomaly index value reported for each record. It simply specifies the threshold for flagging records as anomalous when estimating or scoring the model. If you later want to examine a larger or smaller number of records, you can use a Select node to identify a subset of records based on the anomaly index value ($O-AnomalyIndex > X).

Number of anomaly fields to report. Specifies the number of fields to report as an indication of why a particular record is flagged as an anomaly. The most anomalous fields are reported, defined as those that show the greatest deviation from the field norm for the cluster to which the record is assigned.

Anomaly Detection Expert Options

To specify options for missing values and other settings, set the mode to Expert on the Expert tab.

Figure 4-8 Anomaly Detection Expert tab

Adjustment coefficient. Value used to balance the relative weight given to continuous (numeric range) and categorical fields in calculating the distance. Larger values increase the influence of continuous fields. This must be a nonzero value.
Automatically calculate number of peer groups. Anomaly detection can rapidly analyze a large number of possible solutions to choose the optimal number of peer groups for the training data. You can broaden or narrow the range by setting the minimum and maximum number of peer groups. Larger values allow the system to explore a broader range of possible solutions; however, the cost is increased processing time.

Specify number of peer groups. If you know how many clusters to include in your model, select this option and enter the number of peer groups. Selecting this option will generally result in improved performance.

Noise level and ratio. These settings determine how outliers are treated during two-stage clustering. In the first stage, a cluster feature (CF) tree is used to condense the data from a very large number of individual records to a manageable number of clusters. The tree is built based on similarity measures, and when a node of the tree gets too many records in it, it splits into child nodes. In the second stage, hierarchical clustering commences on the terminal nodes of the CF tree. Noise handling is turned on in the first data pass and off in the second; cases in the noise cluster from the first data pass are assigned to the regular clusters in the second data pass.

Noise level. Specify a value between 0 and 0.5. This setting is relevant only if the CF tree fills during the growth phase, meaning that it cannot accept any more cases in a leaf node and that no leaf node can be split. If the CF tree fills and the noise level is set to 0, the threshold will be increased and the CF tree regrown with all cases. After final clustering, values that cannot be assigned to a cluster are labeled outliers. The outlier cluster is given an identification number of –1.
The outlier cluster is not included in the count of the number of clusters; that is, if you specify n clusters and noise handling, the algorithm will output n clusters and one noise cluster. In practical terms, increasing this value gives the algorithm more latitude to fit unusual records into the tree rather than assign them to a separate outlier cluster.

If the CF tree fills and the noise level is greater than 0, the CF tree will be regrown after placing any data in sparse leaves into their own noise leaf. A leaf is considered sparse if the ratio of the number of cases in it to the number of cases in the largest leaf is less than the noise level. After the tree is grown, the outliers will be placed in the CF tree if possible. If not, the outliers are discarded for the second phase of clustering.

Noise ratio. Specifies the portion of memory allocated for the component that should be used for noise buffering. This value ranges between 0.0 and 0.5. If inserting a specific case into a leaf of the tree would yield tightness less than the threshold, the leaf is not split. If the tightness exceeds the threshold, the leaf is split, adding another small cluster to the CF tree. In practical terms, increasing this setting may cause the algorithm to gravitate more quickly toward a simpler tree.

Impute missing values. For continuous fields, substitutes the field mean in place of any missing values. For categorical fields, missing categories are combined and treated as a valid category. If this option is deselected, any records with missing values are excluded from the analysis.

Anomaly Detection Model Nuggets

Anomaly Detection model nuggets contain all of the information captured by the Anomaly Detection model, as well as information about the training data and estimation process.
When you run a stream containing an Anomaly Detection model nugget, a number of new fields are added to the stream, as determined by the selections made on the Settings tab in the model nugget. For more information, see the topic Anomaly Detection Model Settings on p. 84. New field names are based on the model name, prefaced by $O, as summarized in the following table:

$O-Anomaly. Flag field indicating whether or not the record is anomalous.
$O-AnomalyIndex. The anomaly index value for the record.
$O-PeerGroup. The peer group to which the record is assigned.
$O-Field-n. Name of the nth most anomalous field in terms of deviation from the cluster norm.
$O-FieldImpact-n. Variable deviation index for the field, measuring the deviation from the field norm for the cluster to which the record is assigned.

Optionally, you can suppress scores for non-anomalous records to make the results easier to read.

Figure 4-9 Scoring results with non-anomalous records suppressed

Anomaly Detection Model Details

The Model tab for a generated Anomaly Detection model displays information about the peer groups in the model.

Figure 4-10 Anomaly Detection model nugget details

Note that the peer group sizes and statistics reported are estimates based on the training data and may differ slightly from actual scoring results, even if run on the same data.

Anomaly Detection Model Summary

The Summary tab for an Anomaly Detection model nugget displays information about the fields, build settings, and estimation process. The number of peer groups is also shown, along with the cutoff value used to flag records as anomalous.

Figure 4-11 Anomaly Detection model nugget summary

Anomaly Detection Model Settings

The Settings tab allows you to specify options for scoring the model nugget.

Figure 4-12 Scoring options for an Anomaly Detection model nugget

Indicate anomalous records with.
Specifies how anomalous records are treated in the output.

Flag and index. Creates a flag field that is set to True for all records that exceed the cutoff value included in the model. The anomaly index is also reported for each record in a separate field. For more information, see the topic Anomaly Detection Model Options on p. 79.

Flag only. Creates a flag field but without reporting the anomaly index for each record.

Index only. Reports the anomaly index without creating a flag field.

Number of anomaly fields to report. Specifies the number of fields to report as an indication of why a particular record is flagged as an anomaly. The most anomalous fields are reported, defined as those that show the greatest deviation from the field norm for the cluster to which the record is assigned.

Discard records. Select this option to discard all non-anomalous records from the stream, making it easier to focus on potential anomalies in any downstream nodes. Alternatively, you can choose to discard all anomalous records in order to limit the subsequent analysis to those records that are not flagged as potential anomalies based on the model.

Note: Due to slight differences in rounding, the actual number of records flagged during scoring may not be identical to those flagged while training the model, even if run on the same data.

Chapter 5 Automated Modeling Nodes

The automated modeling nodes estimate and compare a number of different modeling methods, allowing you to try out a variety of approaches in a single modeling run. You can select the modeling algorithms to use, and the specific options for each, including combinations that would otherwise be mutually exclusive. For example, rather than choose between the quick, dynamic, or prune methods for a Neural Net, you can try them all. The node explores every possible combination of options, ranks each candidate model based on the measure you specify, and saves the best for use in scoring or further analysis.
You can choose from three automated modeling nodes, depending on the needs of your analysis:

The Auto Classifier node creates and compares a number of different models for binary outcomes (yes or no, churn or do not churn, and so on), allowing you to choose the best approach for a given analysis. A number of modeling algorithms are supported, making it possible to select the methods you want to use, the specific options for each, and the criteria for comparing the results. The node generates a set of models based on the specified options and ranks the best candidates according to the criteria you specify. For more information, see the topic Auto Classifier Node on p. 89.

The Auto Numeric node estimates and compares models for continuous numeric range outcomes using a number of different methods. The node works in the same manner as the Auto Classifier node, allowing you to choose the algorithms to use and to experiment with multiple combinations of options in a single modeling pass. Supported algorithms include neural networks, C&R Tree, CHAID, linear regression, generalized linear regression, and support vector machines (SVM). Models can be compared based on correlation, relative error, or number of variables used. For more information, see the topic Auto Numeric Node on p. 98.

The Auto Cluster node estimates and compares clustering models, which identify groups of records that have similar characteristics. The node works in the same manner as the other automated modeling nodes, allowing you to experiment with multiple combinations of options in a single modeling pass. Models can be compared using basic measures for filtering and ranking the usefulness of the cluster models, including a measure based on the importance of particular fields. For more information, see the topic Auto Cluster Node on p. 104.

The best models are saved in a single composite model nugget, allowing you to browse and compare them, and to choose which models to use in scoring.
For binary, nominal, and numeric targets only, you can select multiple scoring models and combine the scores in a single model ensemble. By combining predictions from multiple models, limitations in individual models may be avoided, often resulting in a higher overall accuracy than can be gained from any one of the models. Optionally, you can choose to drill down into the results and generate modeling nodes or model nuggets for any of the individual models you want to use or explore further.

Models and Execution Time

Depending on the dataset and the number of models, automated modeling nodes may take hours or even longer to execute. When selecting options, pay attention to the number of models being produced. When practical, you may want to schedule modeling runs during nights or weekends, when system resources are less likely to be in demand.

If necessary, a Partition or Sample node can be used to reduce the number of records included in the initial training pass. Once you have narrowed the choices to a few candidate models, the full dataset can be restored. To reduce the number of input fields, use Feature Selection. For more information, see the topic Feature Selection Node in Chapter 4 on p. 70. Alternatively, you can use your initial modeling runs to identify fields and options that are worth exploring further. For example, if your best-performing models all seem to use the same three fields, this is a strong indication that those fields are worth keeping.

Optionally, you can limit the amount of time spent estimating any one model and specify the evaluation measures used to screen and rank models.

Automated Modeling Node Algorithm Settings

For each model type, you can use the default settings, or you can choose options for each model type.
The specific options are similar to those available in the separate modeling nodes, with the difference that rather than choosing one setting or another, in most cases you can choose as many as you want to apply. For example, if comparing Neural Net models, you can choose several different training methods, and try each method with and without a random seed. All possible combinations of the selected options will be used, making it very easy to generate many different models in a single pass. Use care, however: choosing multiple settings can cause the number of models to multiply very quickly.

Figure 5-1 Choosing algorithm settings for automated modeling

To choose options for each model type:

E On the automated modeling node, select the Expert tab.
E Click in the Model parameters column for the model type.
E From the drop-down menu, choose Specify.
E On the Algorithm settings dialog, select options from the Options column.

Note: Further options are available on the Expert tab of the Algorithm settings dialog.

Automated Modeling Node Stopping Rules

Stopping rules specified for automated modeling nodes relate to the overall node execution, not the stopping of individual models built by the node.

Figure 5-2 Stopping rules

Restrict overall execution time. (Neural Network, K-Means, Kohonen, TwoStep, SVM, KNN, Bayes Net, and C&R Tree models only) Stops execution after a specified number of hours. All models generated up to that point will be included in the model nugget, but no further models will be produced.

Stop as soon as valid models are produced. Stops execution when a model passes all criteria specified on the Discard tab (for the Auto Classifier or Auto Cluster node) or the Model tab (for the Auto Numeric node). For more information, see the topics Auto Classifier Node Discard Options on p. 96 and Auto Cluster Node Discard Options on p. 109.
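The multiplicative growth in the number of candidate models noted above behaves like a Cartesian product over the selected settings. A short illustration (the option names and values here are hypothetical, not actual node parameters):

```python
from itertools import product

# Hypothetical Expert-tab selections for one model type.
options = {
    "training_method": ["quick", "dynamic", "prune"],
    "use_random_seed": [True, False],
}

# One candidate model is built for every combination of selected options.
combinations = list(product(*options.values()))
print(len(combinations))  # 3 methods x 2 seed settings = 6 candidate models
```

Adding even one more multi-valued setting multiplies the total again, which is why the number of models can grow very quickly.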
Auto Classifier Node

The Auto Classifier node estimates and compares models for either nominal (set) or binary (yes/no) targets, using a number of different methods, allowing you to try out a variety of approaches in a single modeling run. You can select the algorithms to use, and experiment with multiple combinations of options. For example, rather than choose between the quick, dynamic, or prune methods for a Neural Net, you can try them all. The node explores every possible combination of options, ranks each candidate model based on the measure you specify, and saves the best models for use in scoring or further analysis. For more information, see the topic Automated Modeling Nodes on p. 86.

Figure 5-3 Auto Classifier modeling results

Example. A retail company has historical data tracking the offers made to specific customers in past campaigns. The company now wants to achieve more profitable results by matching the right offer to each customer.

Requirements. A target field with a measurement level of either Nominal or Flag (with the role set to Target), and at least one input field (with the role set to Input). For a flag field, the True value defined for the target is assumed to represent a hit when calculating profits, lift, and related statistics. Input fields can have a measurement level of Continuous or Categorical, with the limitation that some inputs may not be appropriate for some model types. For example, ordinal fields used as inputs in C&R Tree, CHAID, and QUEST models must have numeric storage (not string), and will be ignored by these models if specified otherwise. Similarly, continuous input fields can be binned in some cases. The requirements are the same as when using the individual modeling nodes; for example, a Bayes Net model works the same whether generated from the Bayes Net node or the Auto Classifier node.

Frequency and weight fields.
Frequency and weight are used to give extra importance to some records over others because, for example, the user knows that the build dataset under-represents a section of the parent population (Weight) or because one record represents a number of identical cases (Frequency). If specified, a frequency field can be used by C&R Tree, CHAID, QUEST, Decision List, and Bayes Net models. A weight field can be used by C&RT, CHAID, and C5.0 models. Other model types will ignore these fields and build the models anyway. Frequency and weight fields are used only for model building, and are not considered when evaluating or scoring models. For more information, see the topic Using Frequency and Weight Fields in Chapter 3 on p. 38.

Supported Model Types

Supported model types include Neural Net, C&R Tree, QUEST, CHAID, C5.0, Logistic Regression, Decision List, Bayes Net, Discriminant, Nearest Neighbor, and SVM. For more information, see the topic Auto Classifier Node Expert Options on p. 93.

Auto Classifier Node Model Options

The Model tab of the Auto Classifier node allows you to specify the number of models to be created, along with the criteria used to compare models.

Figure 5-4 Auto Classifier node: Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.

Create split models. Builds a separate model for each possible value of input fields that are specified as split fields. For more information, see the topic Building Split Models in Chapter 3 on p. 30.

Rank models by. Specifies the criteria used to compare and rank models. Options include overall accuracy, area under the ROC curve, profit, lift, and number of fields.
Note that all of these measures will be available in the summary report regardless of which is selected here. Note: For a nominal (set) target, ranking is restricted to either Overall Accuracy or Number of Fields. When calculating profits, lift, and related statistics, the True value defined for the target field is assumed to represent a hit.

Rank models using. If a partition is in use, you can specify whether ranks are based on the training dataset or the testing set. With large datasets, use of a partition for preliminary screening of models may greatly improve performance.

Number of models to use. Specifies the maximum number of models to be listed in the model nugget produced by the node. The top-ranking models are listed according to the specified ranking criterion. Note that increasing this limit may slow performance. The maximum allowable value is 100.

Calculate predictor importance. For models that produce an appropriate measure of importance, you can display a chart that indicates the relative importance of each predictor in estimating the model. Typically you will want to focus your modeling efforts on the predictors that matter most, and consider dropping or ignoring those that matter least. Note that predictor importance may extend the time needed to calculate some models, and is not recommended if you simply want a broad comparison across many different models. It is more useful once you have narrowed your analysis to a handful of models that you want to explore in greater detail. For more information, see the topic Predictor Importance in Chapter 3 on p. 51.

Profit Criteria. (Flag targets only.) Profit equals the revenue for each record minus the cost for the record. Profits for a quantile are simply the sum of profits for all records in the quantile. Profits are assumed to apply only to hits, but costs apply to all records.

Costs. Specify the cost associated with each record. You can select Fixed or Variable costs.
For fixed costs, specify the cost value. For variable costs, click the Field Chooser button to select a field as the cost field.

Revenue. Specify the revenue associated with each record that represents a hit. You can select Fixed or Variable revenue. For fixed revenue, specify the revenue value. For variable revenue, click the Field Chooser button to select a field as the revenue field.

Weight. If the records in your data represent more than one unit, you can use frequency weights to adjust the results. Specify the weight associated with each record, using Fixed or Variable weights. For fixed weights, specify the weight value (the number of units per record). For variable weights, click the Field Chooser button to select a field as the weight field.

Lift Criteria. (Flag targets only.) Specifies the percentile to use for lift calculations. Note that you can also change this value when comparing the results. For more information, see the topic Automated Model Nuggets on p. 110.

Auto Classifier Node Expert Options

The Expert tab of the Auto Classifier node allows you to apply a partition (if available), select the algorithms to use, and specify stopping rules.

Figure 5-5 Auto Classifier node: Expert tab

Models used. Use the check boxes in the column on the left to select the model types (algorithms) to include in the comparison. The more types you select, the more models will be created and the longer the processing time will be.

Model type. Lists the available algorithms (see below).

Model parameters. For each model type, you can use the default settings or select Specify to choose options for each model type. The specific options are similar to those available in the separate modeling nodes, with the difference that multiple options or combinations can be selected.
For example, if comparing Neural Net models, rather than choosing one of the six training methods, you can choose all of them to train six models in a single pass.

Number of models. Lists the number of models produced for each algorithm based on current settings. When combining options, the number of models can quickly add up, so paying close attention to this number is strongly recommended, particularly when using large datasets.

Restrict maximum time spent building a single model. (K-Means, Kohonen, TwoStep, SVM, KNN, Bayes Net, and Decision List models only) Sets a maximum time limit for any one model. For example, if a particular model requires an unexpectedly long time to train because of some complex interaction, you probably don’t want it to hold up your entire modeling run.

Note: If the target is a nominal (set) field, the Decision List option is unavailable.

Supported Algorithms

The Neural Net node uses a simplified model of the way the human brain processes information. It works by simulating a large number of interconnected simple processing units that resemble abstract versions of neurons. Neural networks are powerful general function estimators and require minimal statistical or mathematical knowledge to train or apply.

The C5.0 node builds either a decision tree or a rule set. The model works by splitting the sample based on the field that provides the maximum information gain at each level. The target field must be categorical. Multiple splits into more than two subgroups are allowed. For more information, see the topic C5.0 Node in Chapter 6 on p. 160.

The Classification and Regression (C&R) Tree node generates a decision tree that allows you to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node in the tree is considered “pure” if 100% of cases in the node fall into a specific category of the target field.
Target and input fields can be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary (only two subgroups). For more information, see the topic C&R Tree Node in Chapter 6 on p. 143.

The QUEST node provides a binary classification method for building decision trees, designed to reduce the processing time required for large C&R Tree analyses while also reducing the tendency found in classification tree methods to favor inputs that allow more splits. Input fields can be numeric ranges (continuous), but the target field must be categorical. All splits are binary. For more information, see the topic QUEST Node in Chapter 6 on p. 144.

The CHAID node generates decision trees using chi-square statistics to identify optimal splits. Unlike the C&R Tree and QUEST nodes, CHAID can generate nonbinary trees, meaning that some splits have more than two branches. Target and input fields can be numeric range (continuous) or categorical. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits but takes longer to compute. For more information, see the topic CHAID Node in Chapter 6 on p. 144.

Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range. For more information, see the topic Logistic Node in Chapter 10 on p. 259.

The Decision List node identifies subgroups, or segments, that show a higher or lower likelihood of a given binary outcome relative to the overall population. For example, you might look for customers who are unlikely to churn or are most likely to respond favorably to a campaign. You can incorporate your business knowledge into the model by adding your own custom segments and previewing alternative models side by side to compare the results.
Decision List models consist of a list of rules in which each rule has a condition and an outcome. Rules are applied in order, and the first rule that matches determines the outcome. For more information, see the topic Decision List in Chapter 9 on p. 204.

The Bayesian Network node enables you to build a probability model by combining observed and recorded evidence with real-world knowledge to establish the likelihood of occurrences. The node focuses on Tree Augmented Naïve Bayes (TAN) and Markov Blanket networks, which are primarily used for classification. For more information, see the topic Bayesian Network Node in Chapter 7 on p. 179.

Discriminant analysis makes more stringent assumptions than logistic regression but can be a valuable alternative or supplement to a logistic regression analysis when those assumptions are met. For more information, see the topic Discriminant Node in Chapter 10 on p. 285.

The k-Nearest Neighbor (KNN) node associates a new case with the category or value of the k objects nearest to it in the predictor space, where k is an integer. Similar cases are near each other, and dissimilar cases are distant from each other. For more information, see the topic KNN Node in Chapter 16 on p. 462.

The Support Vector Machine (SVM) node enables you to classify data into one of two groups without overfitting. SVM works well with wide datasets, such as those with a very large number of input fields. For more information, see the topic SVM Node in Chapter 15 on p. 457.

Misclassification Costs

In some contexts, certain kinds of errors are more costly than others. For example, it may be more costly to classify a high-risk credit applicant as low risk (one kind of error) than it is to classify a low-risk applicant as high risk (a different kind of error). Misclassification costs allow you to specify the relative importance of different kinds of prediction errors. Misclassification costs are basically weights applied to specific outcomes.
These weights are factored into the model and may actually change the prediction (as a way of protecting against costly mistakes). With the exception of C5.0 models, misclassification costs are not applied when scoring a model and are not taken into account when ranking or comparing models using an Auto Classifier node, evaluation chart, or Analysis node. A model that includes costs may not produce fewer errors than one that doesn’t, and may not rank any higher in terms of overall accuracy, but it is likely to perform better in practical terms because it has a built-in bias in favor of less expensive errors.

The cost matrix shows the cost for each possible combination of predicted category and actual category. By default, all misclassification costs are set to 1.0. To enter custom cost values, select Use misclassification costs and enter your custom values into the cost matrix. To change a misclassification cost, select the cell corresponding to the desired combination of predicted and actual values, delete the existing contents of the cell, and enter the desired cost for the cell. Costs are not automatically symmetrical. For example, if you set the cost of misclassifying A as B to be 2.0, the cost of misclassifying B as A will still have the default value of 1.0 unless you explicitly change it as well.

Auto Classifier Node Discard Options

The Discard tab of the Auto Classifier node allows you to automatically discard models that do not meet certain criteria. These models will not be listed in the summary report.

Figure 5-6 Auto Classifier node: Discard tab

You can specify a minimum threshold for overall accuracy and a maximum threshold for the number of variables used in the model. In addition, for flag targets, you can specify a minimum threshold for lift, profit, and area under the curve; lift and profit are determined as specified on the Model tab. For more information, see the topic Auto Classifier Node Model Options on p. 91.
Optionally, you can configure the node to stop execution the first time a model is generated that meets all specified criteria. For more information, see the topic Automated Modeling Node Stopping Rules on p. 88.

Auto Classifier Node Settings Options

The Settings tab of the Auto Classifier node allows you to pre-configure the score-time options that are available on the nugget.

Figure 5-7 Auto Classifier node: Settings tab

Ensemble method. You can select from the following ensemble methods:

Voting
Confidence-weighted voting
Raw propensity-weighted voting (flag targets only)
Highest confidence wins
Average raw propensity (flag targets only)

If voting is tied, select value using. For voting methods, you can specify how ties are resolved:

Random selection. One of the tied values is chosen at random.

Highest confidence. The tied value that was predicted with the highest confidence wins. Note that this is not necessarily the same as the highest confidence of all predicted values.

Raw propensity. (Flag targets only) The tied value that was predicted with the largest absolute propensity, where the absolute propensity is calculated as abs(0.5 - propensity) * 2.

Auto Numeric Node

The Auto Numeric node estimates and compares models for continuous numeric range outcomes using a number of different methods, allowing you to try out a variety of approaches in a single modeling run. You can select the algorithms to use, and experiment with multiple combinations of options. For example, you could predict housing values using neural net, linear regression, C&RT, and CHAID models to see which performs best, and you could try out different combinations of stepwise, forward, and backward regression methods. The node explores every possible combination of options, ranks each candidate model based on the measure you specify, and saves the best for use in scoring or further analysis.
For more information, see the topic Automated Modeling Nodes on p. 86.

Figure 5-8 Auto Numeric results

Example. A municipality wants to more accurately estimate real estate taxes and to adjust values for specific properties as needed without having to inspect every property. Using the Auto Numeric node, the analyst can generate and compare a number of models that predict property values based on building type, neighborhood, size, and other known factors.

Requirements. A single target field (with the role set to Target), and at least one input field (with the role set to Input). The target must be a continuous (numeric range) field, such as age or income. Input fields can be continuous or categorical, with the limitation that some inputs may not be appropriate for some model types. For example, C&R Tree models can use categorical string fields as inputs, while linear regression models cannot use these fields and will ignore them if specified. The requirements are the same as when using the individual modeling nodes. For example, a CHAID model works the same whether generated from the CHAID node or the Auto Numeric node.

Frequency and weight fields. Frequency and weight are used to give extra importance to some records over others because, for example, the user knows that the build dataset under-represents a section of the parent population (Weight) or because one record represents a number of identical cases (Frequency). If specified, a frequency field can be used by the C&R Tree and CHAID algorithms. A weight field can be used by the C&RT, CHAID, Regression, and GenLin algorithms. Other model types will ignore these fields and build the models anyway. Frequency and weight fields are used only for model building and are not considered when evaluating or scoring models. For more information, see the topic Using Frequency and Weight Fields in Chapter 3 on p. 38.
Supported Model Types

Supported model types include Neural Net, C&R Tree, CHAID, Regression, GenLin, Nearest Neighbor, and SVM. For more information, see the topic Auto Numeric Node Expert Options on p. 101.

Auto Numeric Node Model Options

The Model tab of the Auto Numeric node allows you to specify the number of models to be saved, along with the criteria used to compare models.

Figure 5-9 Auto Numeric node: Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.
Create split models. Builds a separate model for each possible value of input fields that are specified as split fields. For more information, see the topic Building Split Models in Chapter 3 on p. 30.
Rank models by. Specifies the criteria used to compare models:
Correlation. The Pearson correlation between the observed value for each record and the value predicted by the model. The correlation is a measure of linear association between two variables, with values closer to 1 indicating a stronger relationship. (Correlation values range between –1, for a perfect negative relationship, and +1 for a perfect positive relationship. A value of 0 indicates no linear relationship, while a model with a negative correlation would rank lowest of all.)
Number of fields. The number of fields used as predictors in the model. Choosing models that use fewer fields may streamline data preparation and improve performance in some cases.
Relative error. The relative error is the ratio of the variance of the observed values from those predicted by the model to the variance of the observed values from the mean.
In practical terms, it compares how well the model performs relative to a null or intercept model that simply returns the mean value of the target field as the prediction. For a good model, this value should be less than 1, indicating that the model is more accurate than the null model. A model with a relative error greater than 1 is less accurate than the null model and is therefore not useful. For linear regression models, the relative error is a simple function of the correlation (it equals one minus the squared correlation) and so adds no new information. For nonlinear models, the relative error is unrelated to the correlation and provides an additional measure for assessing model performance.
Rank models using. If a partition is in use, you can specify whether ranks are based on the training partition or the testing partition. With large datasets, use of a partition for preliminary screening of models may greatly improve performance.
Number of models to use. Specifies the maximum number of models to be shown in the model nugget produced by the node. The top-ranking models are listed according to the specified ranking criterion. Increasing this limit will allow you to compare results for more models but may slow performance. The maximum allowable value is 100.
Calculate predictor importance. For models that produce an appropriate measure of importance, you can display a chart that indicates the relative importance of each predictor in estimating the model. Typically you will want to focus your modeling efforts on the predictors that matter most, and consider dropping or ignoring those that matter least. Note that predictor importance may extend the time needed to calculate some models, and is not recommended if you simply want a broad comparison across many different models. It is more useful once you have narrowed your analysis to a handful of models that you want to explore in greater detail. For more information, see the topic Predictor Importance in Chapter 3 on p. 51.
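The two ranking measures discussed above, correlation and relative error, can both be computed directly from the observed and predicted values. A small sketch in plain Python (illustrative names, no SPSS Modeler involved):

```python
# Sketch of the Auto Numeric ranking measures: Pearson correlation between
# observed and predicted values, and relative error (error variance over
# variance about the mean; < 1 beats a model that always predicts the mean).
import math

def pearson_correlation(observed, predicted):
    n = len(observed)
    mo = sum(observed) / n
    mp = sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    so = math.sqrt(sum((o - mo) ** 2 for o in observed))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    return cov / (so * sp)

def relative_error(observed, predicted):
    mo = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mo) ** 2 for o in observed)
    return sse / sst

obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.9]   # close to observed: high correlation, low error
```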
Do not keep models if. Specifies threshold values for correlation, relative error, and number of fields used. Models that fail to meet any of these criteria will be discarded and will not be listed in the summary report:
Correlation less than. The minimum correlation (in terms of absolute value) for a model to be included in the summary report.
Number of fields used is greater than. The maximum number of fields to be used by any model to be included.
Relative error is greater than. The maximum relative error for any model to be included.
Optionally, you can configure the node to stop execution the first time a model is generated that meets all specified criteria. For more information, see the topic Automated Modeling Node Stopping Rules on p. 88.

Auto Numeric Node Expert Options

The Expert tab of the Auto Numeric node allows you to select the algorithms and options to use and to specify stopping rules.

Figure 5-10 Auto Numeric node: Expert tab

Models used. Use the check boxes in the column on the left to select the model types (algorithms) to include in the comparison. The more types you select, the more models will be created and the longer the processing time will be.
Model type. Lists the available algorithms (see below).
Model parameters. For each model type, you can use the default settings or select Specify to choose options for each model type. The specific options are similar to those available in the separate modeling nodes, with the difference that multiple options or combinations can be selected. For example, if comparing Neural Net models, rather than choosing one of the six training methods, you can choose all of them to train six models in a single pass.
Number of models. Lists the number of models produced for each algorithm based on current settings. When combining options, the number of models can quickly add up, so paying close attention to this number is strongly recommended, particularly when using large datasets.
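The discard semantics described above (a model is kept only if it clears every threshold that has been set) can be sketched as a simple filter. The dictionary layout and names here are illustrative, not an SPSS Modeler API.

```python
# Sketch of the "Do not keep models if" filter: a candidate model survives
# only when it passes every threshold the user has enabled.

def keep_model(model, min_correlation=None, max_fields=None,
               max_relative_error=None):
    if min_correlation is not None and abs(model["correlation"]) < min_correlation:
        return False
    if max_fields is not None and model["fields_used"] > max_fields:
        return False
    if max_relative_error is not None and model["relative_error"] > max_relative_error:
        return False
    return True

candidates = [
    {"name": "CHAID 1", "correlation": 0.91, "fields_used": 5,
     "relative_error": 0.20},
    {"name": "Regression 1", "correlation": 0.55, "fields_used": 12,
     "relative_error": 0.85},
]
kept = [m["name"] for m in candidates
        if keep_model(m, min_correlation=0.8, max_fields=10,
                      max_relative_error=0.5)]
```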
Restrict maximum time spent building a single model. (K-Means, Kohonen, TwoStep, SVM, KNN, Bayes Net, and Decision List models only) Sets a maximum time limit for any one model. For example, if a particular model requires an unexpectedly long time to train because of some complex interaction, you probably don’t want it to hold up your entire modeling run.

Supported Algorithms

The Neural Net node uses a simplified model of the way the human brain processes information. It works by simulating a large number of interconnected simple processing units that resemble abstract versions of neurons. Neural networks are powerful general function estimators and require minimal statistical or mathematical knowledge to train or apply.

The Classification and Regression (C&R) Tree node generates a decision tree that allows you to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node in the tree is considered “pure” if 100% of cases in the node fall into a specific category of the target field. Target and input fields can be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary (only two subgroups). For more information, see the topic C&R Tree Node in Chapter 6 on p. 143.

The CHAID node generates decision trees using chi-square statistics to identify optimal splits. Unlike the C&R Tree and QUEST nodes, CHAID can generate nonbinary trees, meaning that some splits have more than two branches. Target and input fields can be numeric range (continuous) or categorical. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits but takes longer to compute. For more information, see the topic CHAID Node in Chapter 6 on p. 144.
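To make the CHAID description above concrete, the chi-square statistic it uses to score a candidate split compares the observed target counts in each branch against the counts expected if the branch had no effect on the target. A minimal, illustrative sketch (plain Python, not the product's implementation, which also handles merging and significance testing):

```python
# Sketch: chi-square statistic for a candidate split. table[i][j] is the
# count of target category j among records sent to branch i.

def chi_square(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# A split whose branches separate the target sharply scores much higher
# than one whose branches look nearly identical.
strong = chi_square([[40, 10], [10, 40]])
weak = chi_square([[26, 24], [24, 26]])
```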
Linear regression is a common statistical technique for summarizing data and making predictions by fitting a straight line or surface that minimizes the discrepancies between predicted and actual output values.

The Generalized Linear model expands the general linear model so that the dependent variable is linearly related to the factors and covariates through a specified link function. Moreover, the model allows for the dependent variable to have a non-normal distribution. It covers the functionality of a wide range of statistical models, including linear regression, logistic regression, loglinear models for count data, and interval-censored survival models. For more information, see the topic GenLin Node in Chapter 10 on p. 294.

The k-Nearest Neighbor (KNN) node associates a new case with the category or value of the k objects nearest to it in the predictor space, where k is an integer. Similar cases are near each other and dissimilar cases are distant from each other. For more information, see the topic KNN Node in Chapter 16 on p. 462.

The Support Vector Machine (SVM) node enables you to classify data into one of two groups without overfitting. SVM works well with wide data sets, such as those with a very large number of input fields. For more information, see the topic SVM Node in Chapter 15 on p. 457.

Linear regression models predict a continuous target based on linear relationships between the target and one or more predictors. For more information, see the topic Linear models in Chapter 10 on p. 239.

Auto Numeric Node Settings Options

The Settings tab of the Auto Numeric node allows you to pre-configure the score-time options that are available on the nugget.

Figure 5-11 Auto Numeric node: Settings tab

Calculate standard error.
For a continuous (numeric range) target, a standard error calculation is run by default to calculate the difference between the measured or estimated values and the true values, and to show how closely those estimates match.

Auto Cluster Node

The Auto Cluster node estimates and compares clustering models that identify groups of records with similar characteristics. The node works in the same manner as other automated modeling nodes, allowing you to experiment with multiple combinations of options in a single modeling pass. Models can be compared using basic measures that attempt to filter and rank cluster models by usefulness, including a measure based on the importance of particular fields. Clustering models are often used to identify groups that can be used as inputs in subsequent analyses. For example, you may want to target groups of customers based on demographic characteristics such as income, or based on the services they have bought in the past. This can be done without prior knowledge about the groups and their characteristics — you may not know how many groups to look for, or what features to use in defining them. Clustering models are often referred to as unsupervised learning models, since they do not use a target field, and do not return a specific prediction that can be evaluated as true or false. The value of a clustering model is determined by its ability to capture interesting groupings in the data and provide useful descriptions of those groupings. For more information, see the topic Clustering Models in Chapter 11 on p. 347.

Figure 5-12 Auto Cluster results

Requirements. One or more fields that define characteristics of interest. Cluster models do not use target fields in the same manner as other models, because they do not make specific predictions that can be assessed as true or false. Instead they are used to identify groups of cases that may be related.
For example, you cannot use a cluster model to predict whether a given customer will churn or respond to an offer. But you can use a cluster model to assign customers to groups based on their tendency to do those things. Weight and frequency fields are not used.
Evaluation fields. While no target is used, you can optionally specify one or more evaluation fields to be used in comparing models. The usefulness of a cluster model may be evaluated by measuring how well (or badly) the clusters differentiate these fields.

Supported Model Types

Supported model types include TwoStep, K-Means, and Kohonen.

Auto Cluster Node Model Options

The Model tab of the Auto Cluster node allows you to specify the number of models to be saved, along with the criteria used to compare models.

Figure 5-13 Auto Cluster node: Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.
Rank models by. Specifies the criteria used to compare and rank models:
Silhouette. An index measuring both cluster cohesion and separation. See Silhouette Ranking Measure below for more information.
Number of clusters. The number of clusters in the model.
Size of smallest cluster. The smallest cluster size.
Size of largest cluster. The largest cluster size.
Smallest / largest cluster. The ratio of the size of the smallest cluster to the largest cluster.
Importance. The importance of the Evaluation field on the Fields tab. Note that this can only be calculated if an Evaluation field has been specified.
Rank models using. If a partition is in use, you can specify whether ranks are based on the training partition or the testing partition.
With large datasets, use of a partition for preliminary screening of models may greatly improve performance.
Number of models to keep. Specifies the maximum number of models to be listed in the nugget produced by the node. The top-ranking models are listed according to the specified ranking criterion. Note that increasing this limit may slow performance. The maximum allowable value is 100.

Silhouette Ranking Measure

The default ranking measure, Silhouette, has a default threshold of 0 because a negative value indicates that the average distance between a case and points in its assigned cluster is greater than the minimum average distance to points in another cluster. Therefore, models with a negative Silhouette can safely be discarded. The ranking measure is actually a modified silhouette coefficient, which combines the concepts of cluster cohesion (favoring models that contain tightly cohesive clusters) and cluster separation (favoring models that contain highly separated clusters). The average silhouette coefficient is simply the average, over all cases, of the following calculation for each individual case:

(B - A) / max(A, B)

where A is the distance from the case to the centroid of the cluster to which the case belongs, and B is the minimal distance from the case to the centroid of every other cluster. The silhouette coefficient (and its average) ranges between -1 (indicating a very poor model) and 1 (indicating an excellent model). The average can be computed over all cases (yielding a total silhouette) or over the cases within each cluster (yielding cluster silhouettes). Distances are calculated using the Euclidean metric.

Auto Cluster Node Expert Options

The Expert tab of the Auto Cluster node allows you to apply a partition (if available), select the algorithms to use, and specify stopping rules.

Figure 5-14 Auto Cluster node: Expert tab

Models used.
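The per-case calculation given above, (B - A) / max(A, B) over centroids with Euclidean distances, can be sketched directly. Names are illustrative; this is the centroid-based (modified) variant described in the text, not the classic all-pairs silhouette.

```python
# Sketch of the modified silhouette coefficient for a single case, using
# cluster centroids and Euclidean distance, as described in the manual.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def case_silhouette(case, own_centroid, other_centroids):
    a = euclidean(case, own_centroid)                      # cohesion
    b = min(euclidean(case, c) for c in other_centroids)   # separation
    return (b - a) / max(a, b)

# A case sitting almost on its own centroid, far from the other cluster,
# scores near +1; a case closer to another centroid scores negative.
good = case_silhouette((1.0, 1.0), (1.1, 1.0), [(9.0, 9.0)])
bad = case_silhouette((8.0, 8.0), (1.0, 1.0), [(9.0, 9.0)])
```

Averaging `case_silhouette` over all cases gives the total silhouette used for ranking; as the text notes, models averaging below 0 can safely be discarded.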
Use the check boxes in the column on the left to select the model types (algorithms) to include in the comparison. The more types you select, the more models will be created and the longer the processing time will be.
Model type. Lists the available algorithms (see below).
Model parameters. For each model type, you can use the default settings or select Specify to choose options for each model type. The specific options are similar to those available in the separate modeling nodes, with the difference that multiple options or combinations can be selected.
Number of models. Lists the number of models produced for each algorithm based on current settings. When combining options, the number of models can quickly add up, so paying close attention to this number is strongly recommended, particularly when using large datasets.
Restrict maximum time spent building a single model. (K-Means, Kohonen, TwoStep, SVM, KNN, Bayes Net, and Decision List models only) Sets a maximum time limit for any one model. For example, if a particular model requires an unexpectedly long time to train because of some complex interaction, you probably don’t want it to hold up your entire modeling run.

Supported Algorithms

The K-Means node clusters the data set into distinct groups (or clusters). The method defines a fixed number of clusters, iteratively assigns records to clusters, and adjusts the cluster centers until further refinement can no longer improve the model. Instead of trying to predict an outcome, k-means uses a process known as unsupervised learning to uncover patterns in the set of input fields. For more information, see the topic K-Means Node in Chapter 11 on p. 354.

The Kohonen node generates a type of neural network that can be used to cluster the data set into distinct groups.
When the network is fully trained, records that are similar should be close together on the output map, while records that are different will be far apart. You can look at the number of observations captured by each unit in the model nugget to identify the strong units. This may give you a sense of the appropriate number of clusters. For more information, see the topic Kohonen Node in Chapter 11 on p. 348.

The TwoStep node uses a two-step clustering method. The first step makes a single pass through the data to compress the raw input data into a manageable set of subclusters. The second step uses a hierarchical clustering method to progressively merge the subclusters into larger and larger clusters. TwoStep has the advantage of automatically estimating the optimal number of clusters for the training data. It can handle mixed field types and large data sets efficiently. For more information, see the topic TwoStep Cluster Node in Chapter 11 on p. 358.

Auto Cluster Node Discard Options

The Discard tab of the Auto Cluster node allows you to automatically discard models that do not meet certain criteria. These models will not be listed on the model nugget.

Figure 5-15 Auto Cluster node: Discard tab

You can specify the minimum silhouette value, cluster numbers, cluster sizes, and the importance of the evaluation field used in the model. Silhouette and the number and size of clusters are determined as specified in the modeling node. For more information, see the topic Auto Cluster Node Model Options on p. 105. Optionally, you can configure the node to stop execution the first time a model is generated that meets all specified criteria. For more information, see the topic Automated Modeling Node Stopping Rules on p. 88.
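The assign-and-update loop that the K-Means node uses, described under Supported Algorithms above, can be sketched in a few lines. This is a deliberately minimal one-dimensional illustration; the real node handles many fields, distance options, and convergence criteria.

```python
# Minimal sketch of k-means: repeatedly assign each value to its nearest
# center, then move each center to the mean of the values assigned to it.

def kmeans_1d(values, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster
        # (empty clusters keep their previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two well-separated groups of points settle on two stable centers.
centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.4, 9.6], [0.0, 5.0])
```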
Automated Model Nuggets

When an automated modeling node is executed, the node estimates candidate models for every possible combination of options, ranks each candidate model based on the measure you specify, and saves the best models in a composite automated model nugget. This model nugget actually contains a set of one or more models generated by the node, which can be individually browsed or selected for use in scoring. The model type and build time are listed for each model, along with a number of other measures as appropriate for the type of model. You can sort the table on any of these columns to quickly identify the most interesting models.

Figure 5-16 Auto Numeric results

To browse any of the individual model nuggets, double-click the nugget icon. From there you can then generate a modeling node for that model to the stream canvas, or a copy of the model nugget to the models palette. Thumbnail graphs give a quick visual assessment for each model type, as summarized below. You can double-click on a thumbnail to generate a full-sized graph. The full-sized plot shows up to 1000 points and will be based on a sample if the dataset contains more. (For scatterplots only, the graph is regenerated each time it is displayed, so any changes in the upstream data—such as updating of a random sample or partition if Set Random Seed is not selected—may be reflected each time the scatterplot is redrawn.) Use the toolbar to show or hide specific columns on the Model tab or to change the column used to sort the table. (You can also change the sort by clicking on the column headers.) Use the Delete button to permanently remove any unused models. To reorder columns, click on a column header and drag the column to the desired location. If a partition is in use, you can choose to view results for the training or testing partition as applicable. The specific columns depend on the type of models being compared, as detailed below.
Binary Targets

For binary models, the thumbnail graph shows the distribution of actual values, overlaid with the predicted values, to give a quick visual indication of how many records were correctly predicted in each category. Ranking criteria match the options in the Auto Classifier modeling node. For more information, see the topic Auto Classifier Node Model Options on p. 91. For the maximum profit, the percentile in which the maximum occurs is also reported. For cumulative lift, you can change the selected percentile using the toolbar.

Nominal Targets

For nominal (set) models, the thumbnail graph shows the distribution of actual values, overlaid with the predicted values, to give a quick visual indication of how many records were correctly predicted in each category. Ranking criteria match the options in the Auto Classifier modeling node. For more information, see the topic Auto Classifier Node Model Options on p. 91.

Continuous Targets

For continuous (numeric range) models, the graph plots predicted against observed values for each model, providing a quick visual indication of the correlation between them. For a good model, points should tend to cluster along the diagonal rather than be scattered randomly across the graph. Ranking criteria match the options in the Auto Numeric modeling node. For more information, see the topic Auto Numeric Node Model Options on p. 99.

Cluster Targets

For cluster models, the graph plots counts against clusters for each model, providing a quick visual indication of cluster distribution. Ranking criteria match the options in the Auto Cluster modeling node. For more information, see the topic Auto Cluster Node Model Options on p. 105.

Selecting Models for Scoring

The Use? column enables you to select the models to use in scoring. For binary, nominal, and numeric targets, you can select multiple scoring models and combine the scores in the single, ensembled model nugget.
By combining predictions from multiple models, limitations in individual models may be avoided, often resulting in a higher overall accuracy than can be gained from any one of the models. For cluster models, only one scoring model can be selected at a time. By default, the top-ranked one is selected first.

Generating Nodes and Models

You can generate a copy of the composite automated model nugget, or the automated modeling node from which it was built. For example, this may be useful if you do not have the original stream from which the automated model nugget was built. Alternatively, you can generate a nugget or modeling node for any of the individual models listed in the automated model nugget.

Automated Modeling Nugget

E From the Generate menu, select Model to Palette to add the automated model nugget to the Models palette. The generated model can be saved or used as is without rerunning the stream.
E Alternatively, you can select Generate Modeling Node from the Generate menu to add the modeling node to the stream canvas. This node can be used to reestimate the selected models without repeating the entire modeling run.

Individual Modeling Nugget

E In the Model menu, double-click on the individual nugget you require. A copy of that nugget opens in a new dialog.
E From the Generate menu in the new dialog, select Model to Palette to add the individual modeling nugget to the Models palette.
E Alternatively, you can select Generate Modeling Node from the Generate menu in the new dialog to add the individual modeling node to the stream canvas.

Generating Evaluation Charts

For binary models only, you can generate evaluation charts that offer a visual way to assess and compare the performance of each model. Evaluation charts are not available for models generated by the Auto Numeric or Auto Cluster nodes.

Figure 5-17 Response chart (cumulative) with best line and baseline

E Under the Use?
column in the Auto Classifier automated model nugget, select the models that you want to evaluate.
E From the Generate menu, choose Evaluation Chart(s).

Figure 5-18 Generating an evaluation chart

E Select the chart type and other options as desired.

Evaluation Graphs

On the Model tab of the automated model nugget you can drill down to display individual graphs for each of the models shown. For Auto Classifier and Auto Numeric nuggets, the Graph tab displays both a graph and predictor importance that reflect the results of all the models combined. For more information, see the topic Predictor Importance in Chapter 3 on p. 51. For Auto Classifier a distribution graph is shown, whereas a multiplot (also known as a scatterplot) is shown for Auto Numeric.

Figure 5-19 Auto Numeric - Multiplot graph for the ensembled models within the automated model nugget

Chapter 6. Decision Trees

Decision Tree Models

Decision tree models allow you to develop classification systems that predict or classify future observations based on a set of decision rules. If you have data divided into classes that interest you (for example, high- versus low-risk loans, subscribers versus nonsubscribers, voters versus nonvoters, or types of bacteria), you can use your data to build rules that you can use to classify old or new cases with maximum accuracy. For example, you might build a tree that classifies credit risk or purchase intent based on age and other factors.

Figure 6-1 Interactive Tree window

This approach, sometimes known as rule induction, has several advantages. First, the reasoning process behind the model is clearly evident when browsing the tree. This is in contrast to other “black box” modeling techniques in which the internal logic can be difficult to work out.
Figure 6-2 Simple decision tree for buying a car

Second, the process will automatically include in its rule only the attributes that really matter in making a decision. Attributes that do not contribute to the accuracy of the tree are ignored. This can yield very useful information about the data and can be used to reduce the data to relevant fields before training another learning technique, such as a neural net. Decision tree model nuggets can be converted into a collection of if-then rules (a rule set), which in many cases show the information in a more comprehensible form. The decision-tree presentation is useful when you want to see how attributes in the data can split, or partition, the population into subsets relevant to the problem. The rule set presentation is useful if you want to see how particular groups of items relate to a specific conclusion. For example, the following rule gives us a profile for a group of cars that is worth buying:

IF tested = 'yes' AND mileage = 'low' THEN -> 'BUY'

Tree-Building Algorithms

Four algorithms are available for performing classification and segmentation analysis. These algorithms all do basically the same thing: they examine all of the fields of your dataset to find the one that gives the best classification or prediction by splitting the data into subgroups. The process is applied recursively, splitting subgroups into smaller and smaller units until the tree is finished (as defined by certain stopping criteria). The target and input fields used in tree building can be continuous (numeric range) or categorical, depending on the algorithm used. If a continuous target is used, a regression tree is generated; if a categorical target is used, a classification tree is generated.

The Classification and Regression (C&R) Tree node generates a decision tree that allows you to predict or classify future observations.
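A rule set like the car-buying rule above is just a chain of if-then tests. A hand-written sketch of that single rule in Python (the fallback verdict for cars that fail the rule is our assumption; the manual only shows the BUY case):

```python
# Sketch of the rule "IF tested = 'yes' AND mileage = 'low' THEN -> 'BUY'"
# expressed as ordinary if-then code. The default branch is illustrative.

def classify_car(car):
    if car.get("tested") == "yes" and car.get("mileage") == "low":
        return "BUY"
    return "DON'T BUY"   # assumed default for cars the rule doesn't match

verdict = classify_car({"tested": "yes", "mileage": "low"})
```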
The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node in the tree is considered “pure” if 100% of cases in the node fall into a specific category of the target field. Target and input fields can be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary (only two subgroups). For more information, see the topic C&R Tree Node on p. 143.

The CHAID node generates decision trees using chi-square statistics to identify optimal splits. Unlike the C&R Tree and QUEST nodes, CHAID can generate nonbinary trees, meaning that some splits have more than two branches. Target and input fields can be numeric range (continuous) or categorical. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits but takes longer to compute. For more information, see the topic CHAID Node on p. 144.

The QUEST node provides a binary classification method for building decision trees, designed to reduce the processing time required for large C&R Tree analyses while also reducing the tendency found in classification tree methods to favor inputs that allow more splits. Input fields can be numeric ranges (continuous), but the target field must be categorical. All splits are binary. For more information, see the topic QUEST Node on p. 144.

The C5.0 node builds either a decision tree or a rule set. The model works by splitting the sample based on the field that provides the maximum information gain at each level. The target field must be categorical. Multiple splits into more than two subgroups are allowed. For more information, see the topic C5.0 Node on p. 160.

General Uses of Tree-Based Analysis

The following are some general uses of tree-based analysis:
Segmentation. Identify persons who are likely to be members of a particular class.
Stratification. Assign cases into one of several categories, such as high-, medium-, and low-risk groups.
Prediction. Create rules and use them to predict future events. Prediction can also mean attempts to relate predictive attributes to values of a continuous variable.

Data reduction and variable screening. Select a useful subset of predictors from a large set of variables for use in building a formal parametric model.

Interaction identification. Identify relationships that pertain only to specific subgroups and specify these in a formal parametric model.

Category merging and banding continuous variables. Regroup predictor categories and band continuous variables with minimal loss of information.

The Interactive Tree Builder

You can generate a tree model automatically, allowing the algorithm to choose the best split at each level, or you can use the interactive tree builder to take control, applying your business knowledge to refine or simplify the tree before saving the model nugget.

E Create a stream and add one of the decision tree nodes: C&R Tree, CHAID, or QUEST. Note: Interactive tree building is not supported for C5.0 trees.

E Open the node and, on the Fields tab, select target and predictor fields and specify additional model options as needed. For specific instructions, see the documentation for each tree-building node.

E On the Objectives panel of the Build Options tab, select Launch interactive session.

E Click Run to launch the tree builder.

Figure 6-3 Interactive Tree Builder window

The current tree is displayed, starting with the root node. You can edit and prune the tree level-by-level and access gains, risks, and related information before generating one or more models.

Comments

With the C&R Tree, CHAID, and QUEST nodes, any ordinal fields used in the model must have numeric storage (not string). If necessary, the Reclassify node can be used to convert them. Optionally, you can use a partition field to separate the data into training and test samples.
As an alternative to using the tree builder, you can also generate a model directly from the modeling node as with other IBM® SPSS® Modeler models. For more information, see the topic Building a Tree Model Directly on p. 140.

Growing and Pruning the Tree

The Viewer tab in the tree builder allows you to view the current tree, starting with the root node.

E To grow the tree, from the menus choose:
Tree > Grow Tree
The system builds the tree by recursively splitting each branch until one or more stopping criteria are met. At each split, the best predictor is automatically selected based on the modeling method used.

E Alternatively, select Grow Tree One Level to add a single level.

E To add a branch below a specific node, select the node and select Grow Branch.

E To choose the predictor used for a split, select the desired node and select Grow Branch with Custom Split. For more information, see the topic Defining Custom Splits on p. 121.

E To prune a branch, select a node and select Remove Branch to clear the selected node.

E To remove the bottom level from the tree, select Remove One Level.

E For C&R Tree and QUEST trees only, select Grow Tree and Prune to prune based on a cost-complexity algorithm that adjusts the risk estimate based on the number of terminal nodes, typically resulting in a simpler tree. For more information, see the topic C&R Tree Node on p. 143.

Reading Split Rules on the Viewer Tab

Figure 6-4 Split rules displayed on the Viewer tab

When viewing split rules on the Viewer tab, square brackets mean that the adjacent value is included in the range, whereas parentheses indicate that the adjacent value is excluded from the range. The expression (23,37] therefore means from 23 exclusive to 37 inclusive; that is, from just above 23 to 37. On the Model tab, the same condition would be displayed as:

Age > 23 and Age <= 37

Interrupting tree growth.
To interrupt a tree-growing operation (if it is taking longer than expected, for example), click the Stop Execution button on the toolbar. Figure 6-5 Stop Execution button The button is enabled only during tree growth. It stops the current growing operation at its current point, leaving any nodes that have already been added, without saving changes or closing the window. The tree builder remains open, allowing you to generate a model, update directives, or export output in the appropriate format, as needed. Defining Custom Splits The Define Split dialog box allows you to select the predictor and specify conditions for each split. E In the tree builder, select a node on the Viewer tab, and from the menus choose: Tree > Grow Branch with Custom Split Figure 6-6 Define Split dialog box 122 Chapter 6 E Select the desired predictor from the drop-down list, or click on the Predictors button to view details of each predictor. For more information, see the topic Viewing Predictor Details on p. 122. E You can accept the default conditions for each split or select Custom to specify conditions for the split as appropriate. For continuous (numeric range) predictors, you can use the Edit Range Values fields to specify the range of values that fall into each new node. For categorical predictors, you can use the Edit Set Values or Edit Ordinal Values fields to specify the specific values (or range of values in case of an ordinal predictor) that map to each new node. E Select Grow to regrow the branch using the selected predictor. The tree can generally be split using any predictor, regardless of stopping rules. The only exceptions are when the node is pure (meaning that 100% of cases fall into the same target class, thus there is nothing left to split) or the chosen predictor is constant (there is nothing to split against). Missing values into. 
For CHAID trees only, if missing values are available for a given predictor, you have the option when defining a custom split to assign them to a specific child node. (With C&R Tree and QUEST, missing values are handled using surrogates as defined in the algorithm. For more information, see the topic Split Details and Surrogates on p. 123.)

Viewing Predictor Details

The Select Predictor dialog box displays statistics on available predictors (or “competitors,” as they are sometimes called) that can be used for the current split.

Figure 6-7 Select Predictor dialog box

For CHAID and exhaustive CHAID, the chi-square statistic is listed for each categorical predictor; if a predictor is a numeric range, the F statistic is shown. The chi-square statistic is a measure of how independent the target field is from the splitting field. A high chi-square statistic generally corresponds to a lower probability, meaning that there is less chance that the two fields are independent—an indication that the split is a good one. Degrees of freedom are also included because these take into account the fact that it is easier for a three-way split to have a large statistic and small probability than it is for a two-way split.

For C&R Tree and QUEST, the improvement for each predictor is displayed. The greater the improvement, the greater the reduction in impurity between the parent and child nodes if that predictor is used. (A pure node is one in which all cases fall into a single target category; the lower the impurity across the tree, the better the model fits the data.) In other words, a high improvement figure generally indicates a useful split for this type of tree. The impurity measure used is specified in the tree-building node.

Split Details and Surrogates

You can select any node on the Viewer tab and select the split information button on the right side of the toolbar to view details about the split for that node.
The split rule used is displayed, along with relevant statistics. For C&R Tree categorical trees, improvement and association are displayed. The association is a measure of correspondence between a surrogate and the primary split field, with the “best” surrogate generally being the one that most closely mimics the split field. For C&R Tree and QUEST, any surrogates used in place of the primary predictor are also listed.

Figure 6-8 Interactive Tree Builder window with split information displayed

E To edit the split for the selected node, you can click the icon on the left side of the surrogates panel to open the Define Split dialog box. (As a shortcut, you can select a surrogate from the list before clicking the icon to select it as the primary split field.)

Surrogates. Where applicable, any surrogates for the primary split field are shown for the selected node. Surrogates are alternate fields used if the primary predictor value is missing for a given record. The maximum number of surrogates allowed for a given split is specified in the tree-building node, but the actual number depends on the training data. In general, the more missing data, the more surrogates are likely to be used. For other decision tree models, this tab is empty.

Note: To be included in the model, surrogates must be identified during the training phase. If the training sample has no missing values, then no surrogates will be identified, and any records with missing values encountered during testing or scoring will automatically fall into the child node with the largest number of records. If missing values are expected during testing or scoring, be sure that values are missing from the training sample as well. Surrogates are not available for CHAID trees. Although surrogates are not used for CHAID trees, when defining a custom split you have the option to assign missing values to a specific child node. For more information, see the topic Defining Custom Splits on p. 121.
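As an illustrative sketch of the two split statistics described above, the following plain Python (not SPSS Modeler functionality; all counts are invented) computes the chi-square statistic used by CHAID and the Gini-based improvement used by C&R Tree from a table of split branch versus target category:

```python
# Illustrative sketch, not SPSS Modeler code: the two split statistics
# described above, computed from invented counts.

def chi_square(table):
    """Pearson chi-square for an r x c table of split branch x target category.
    A large value suggests the target depends on the split (a good split)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    return sum((table[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
               for i in range(len(row)) for j in range(len(col)))

def gini(counts):
    """Gini impurity of a node from its per-category record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def improvement(parent, children):
    """Impurity reduction: gini(parent) minus the weighted child impurities."""
    n = sum(parent)
    return gini(parent) - sum(sum(ch) / n * gini(ch) for ch in children)

# Branch 1 holds [90 yes, 10 no]; branch 2 holds [30 yes, 70 no].
x2 = chi_square([[90, 10], [30, 70]])               # 75.0
imp = improvement([120, 80], [[90, 10], [30, 70]])  # 0.18
```

Both numbers point the same way here: the high chi-square value and the sizable impurity reduction each indicate that this split separates the target categories well.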
Customizing the Tree View The Viewer tab in the tree builder displays the current tree. By default, all branches in the tree are expanded, but you can expand and collapse branches and customize other settings as needed. 125 Decision Trees Figure 6-9 Left-to-right view with split rule details, node graphs and labels visible Click the minus sign (–) at the bottom right corner of a parent node to hide all of its child nodes. Click the plus sign (+) at the bottom right corner of a parent node to display its child nodes. Use the View menu or the toolbar to change the orientation of the tree (top-down, left-to-right, or right-to-left). Click the “Display field and value labels” button on the main toolbar to show or hide field and value labels. Use the magnifying glass buttons to zoom the view in or out, or click the tree map button on the right side of the toolbar to view a diagram of the complete tree. If a partition field is in use, you can swap the tree view between training and testing partitions (View > Partition). When the testing sample is displayed, the tree can be viewed but not edited. (The current partition is displayed in the status bar in the lower right corner of the window.) Click the split information button (the “i” button on the far right of the toolbar) to view details on the current split. For more information, see the topic Split Details and Surrogates on p. 123. Display statistics, graphs, or both within each node (see below). 126 Chapter 6 Displaying Statistics and Graphs Node statistics. For a categorical target field, the table in each node shows the number and percentage of records in each category and the percentage of the entire sample that the node represents. For a continuous (numeric range) target field, the table shows the mean, standard deviation, number of records, and predicted value of the target field. Node graphs. For a categorical target field, the graph is a bar chart of percentages in each category of the target field. 
Preceding each row in the table is a color swatch that corresponds to the color that represents each of the target field categories in the graphs for the node. For a continuous (numeric range) target field, the graph shows a histogram of the target field for records in the node.

Gains

The Gains tab displays statistics for all terminal nodes in the tree. Gains provide a measure of how far the mean or proportion at a given node differs from the overall mean. Generally speaking, the greater this difference, the more useful the tree is as a tool for making decisions. For example, an index or “lift” value of 148% for a node indicates that records in the node are about one-and-a-half times as likely to fall under the target category as for the dataset as a whole.

For C&R Tree and QUEST nodes where an overfit prevention set is specified, two sets of statistics are displayed:
tree growing set - the training sample with the overfit prevention set removed
overfit prevention set
For other C&R Tree and QUEST interactive trees, and for all CHAID interactive trees, only the tree growing set statistics are displayed.

Figure 6-10 Gains tab

The Gains tab allows you to:
Display node-by-node, cumulative, or quantile statistics.
Display gains or profits.
Swap the view between tables and charts.
Select the target category (categorical targets only).
Sort the table in ascending or descending order based on the index percentage. If statistics for multiple partitions are displayed, sorts are always applied to the training sample rather than to the testing sample.

In general, selections made in the gains table are updated in the tree view and vice versa. For example, if you select a row in the table, the corresponding node will be selected in the tree.
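The arithmetic behind these gain statistics can be sketched in a few lines (plain Python with invented counts; not Modeler functionality). The index, or lift, compares a node's response rate with the overall rate:

```python
# Illustrative sketch of the per-node gain statistics; counts are invented.

def gain_stats(node_n, node_hits, total_n, total_hits):
    """Return (Node %, Gain %, Response %, Index %) for one terminal node."""
    node_pct = 100.0 * node_n / total_n          # share of all records
    gain_pct = 100.0 * node_hits / total_hits    # share of all hits
    response_pct = 100.0 * node_hits / node_n    # hit rate inside the node
    overall_pct = 100.0 * total_hits / total_n   # hit rate in the whole dataset
    index_pct = 100.0 * response_pct / overall_pct
    return node_pct, gain_pct, response_pct, index_pct

# A node holding 500 of 10,000 records and 74 of 1,000 hits:
stats = gain_stats(node_n=500, node_hits=74, total_n=10000, total_hits=1000)
# Response is 14.8% against an overall 10%, giving an index of 148%:
# records in this node are about one-and-a-half times as likely to be hits.
```

This reproduces the 148% lift example above: a node's index is simply its response percentage expressed as a percentage of the dataset-wide response percentage.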
Classification Gains

For classification trees (those with a categorical target variable), the gain index percentage tells you how much the proportion of a given target category at each node differs from the overall proportion. For example, if the overall response to your direct mail campaign was 10% but 20% of the records that fall into node X responded positively, the index percentage for the node would be 200%, indicating that respondents in this group are twice as likely to buy relative to the overall population.

Node-by-Node Statistics

In this view, the table displays one row for each terminal node. For C&R Tree and QUEST nodes where an overfit prevention set is specified, two sets of statistics are displayed:
tree growing set - the training sample with the overfit prevention set removed
overfit prevention set
For other C&R Tree and QUEST interactive trees, and for all CHAID interactive trees, only the tree growing set statistics are displayed.

Figure 6-11 Node-by-node gain statistics

Nodes. The ID of the current node (as displayed on the Viewer tab).
Node: n. The total number of records in that node.
Node (%). The percentage of all records in the dataset that fall into this node.
Gain: n. The number of records with the selected target category that fall into this node. In other words, of all the records in the dataset that fall under the target category, how many are in this node?
Gain (%). The percentage of all records in the target category, across the entire dataset, that fall into this node.
Response (%). The percentage of records in the current node that fall under the target category. Responses in this context are sometimes referred to as “hits.”
Index (%). The response percentage for the current node expressed as a percentage of the response percentage for the entire dataset.
For example, an index value of 300% indicates that records in this node are three times as likely to fall under the target category as for the dataset as a whole.

Cumulative Statistics

In the cumulative view, the table displays one node per row, but statistics are cumulative, sorted in ascending or descending order by index percentage. For example, if a descending sort is applied, the node with the highest index percentage is listed first, and statistics in the rows that follow are cumulative for that row and above.

Figure 6-12 Cumulative gains sorted in descending order by index percentage

The cumulative index percentage decreases row by row as nodes with lower and lower response percentages are added. The cumulative index for the final row is always 100% because at this point the entire dataset is included.

Quantiles

In this view, each row in the table represents a quantile rather than a node. The quantiles are either quartiles, quintiles (fifths), deciles (tenths), vingtiles (twentieths), or percentiles (hundredths). Multiple nodes can be listed in a single quantile if more than one node is needed to make up that percentage (for example, if quartiles are displayed but the top two nodes contain fewer than 50% of all cases). The rest of the table is cumulative and can be interpreted in the same manner as the cumulative view.

Figure 6-13 Gains by quartile listed in descending order by index percentage

Classification Profits and ROI

For classification trees, gains statistics can also be displayed in terms of profit and ROI (return on investment). The Define Profits dialog box allows you to specify revenue and expenses for each category.

E On the Gains tab, click the Profit button (labeled $/$) on the toolbar to access the dialog box.

Figure 6-14 Define Profits dialog box

E Enter revenue and expense values for each category of the target field.
For example, if it costs you $0.48 to mail an offer to each customer and the revenue from a positive response is $9.95 for a three-month subscription, then each no response costs you $0.48 and each yes earns you $9.47 (calculated as 9.95 - 0.48). In the gains table, profit is calculated as the sum of revenues minus expenditures for each of the records at a terminal node. ROI is total profit divided by total expenditure at a node.

Comments

Profit values affect only the average profit and ROI values displayed in the gains table, as a way of viewing statistics in terms more applicable to your bottom line. They do not affect the basic tree model structure. Profits should not be confused with misclassification costs, which are specified in the tree-building node and are factored into the model as a way of protecting against costly mistakes. Profit specifications are not persisted between one interactive tree-building session and the next.

Regression Gains

For regression trees, you can choose between node-by-node, cumulative node-by-node, and quantile views. Average values are shown in the table. Charts are available only for quantiles.

Gains Charts

Charts can be displayed on the Gains tab as an alternative to tables.

E On the Gains tab, select the Quantiles icon (third from left on the toolbar). (Charts are not available for node-by-node or cumulative statistics.)

E Select the Charts icon.

E Select the displayed units (percentiles, deciles, and so on) from the drop-down list as desired.

E Select Gains, Response, or Lift to change the displayed measure.

Gains Chart

The gains chart plots the values in the Gains (%) column from the table.
Gains are defined as the proportion of hits in each increment relative to the total number of hits in the tree, using the equation: (hits in increment / total number of hits) x 100% 131 Decision Trees Figure 6-15 Gains chart The chart effectively illustrates how widely you need to cast the net to capture a given percentage of all the hits in the tree. The diagonal line plots the expected response for the entire sample, if the model were not used. In this case, the response rate would be constant, since one person is just as likely to respond as another. To double your yield, you would need to ask twice as many people. The curved line indicates how much you can improve your response by including only those who rank in the higher percentiles based on gain. For example, including the top 50% might net you more than 70% of the positive responses. The steeper the curve, the higher the gain. Lift Chart The lift chart plots the values in the Index (%) column in the table. This chart compares the percentage of records in each increment that are hits with the overall percentage of hits in the training dataset, using the equation: (hits in increment / records in increment) / (total number of hits / total number of records) 132 Chapter 6 Figure 6-16 Lift chart Response Chart The response chart plots the values in the Response (%) column of the table. The response is a percentage of records in the increment that are hits, using the equation: (responses in increment / records in increment) x 100% 133 Decision Trees Figure 6-17 Response chart Gains-Based Selection The Gains-Based Selection dialog box allows you to automatically select terminal nodes with the best (or worst) gains based on a specified rule or threshold. You can then generate a Select node based on the selection. 134 Chapter 6 Figure 6-18 Gains-Based Selection dialog box E On the Gains tab, select the node-by-node or cumulative view and select the target category on which you want to base the selection. 
(Selections are based on the current table display and are not available for quantiles.) E On the Gains tab, from the menus choose: Edit > Select Terminal Nodes > Gains-Based Selection Select only. You can select matching nodes or nonmatching nodes—for example, to select all but the top 100 records. Match by gains information. Matches nodes based on gain statistics for the current target category, including: Nodes where the gain, response, or lift (index) matches a specified threshold—for example, response greater than or equal to 50%. The top n nodes based on the gain for the target category. The top nodes up to a specified number of records. The top nodes up to a specified percentage of training data. E Click OK to update the selection on the Viewer tab. E To create a new Select node based on the current selection on the Viewer tab, choose Select Node from the Generate menu. For more information, see the topic Generating Filter and Select Nodes on p. 139. Note: Since you are actually selecting nodes rather than records or percentages, a perfect match with the selection criterion may not always be achieved. The system selects complete nodes up to the specified level. For example, if you select the top 12 cases and you have 10 in the first node and two in the second node, only the first node will be selected. Risks Risks tell you the chances of misclassification at any level. The Risks tab displays a point risk estimate and (for categorical outputs) a misclassification table. 135 Decision Trees Figure 6-19 Misclassification table for a categorical target For numeric predictions, the risk is a pooled estimate of the variance at each of the terminal nodes. For categorical predictions, the risk is the proportion of cases incorrectly classified, adjusted for any priors or misclassification costs. 
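As a hedged recap of the bookkeeping in this section, the following plain Python sketch (not Modeler functionality) computes node profit and ROI using the mailing figures from the earlier example ($0.48 per offer, $9.95 per positive response), plus the categorical risk estimate read off a misclassification table; the node counts themselves are invented:

```python
# Illustrative sketch: node profit, ROI, and the categorical risk estimate.

COST, REVENUE = 0.48, 9.95   # mailing example: cost per offer, revenue per yes

def node_profit_roi(n_yes, n_no):
    """Profit = revenues minus expenditures for the records at a terminal node;
    ROI = total profit divided by total expenditure."""
    expense = (n_yes + n_no) * COST      # every record in the node is mailed
    profit = n_yes * REVENUE - expense
    return profit, profit / expense

def risk_estimate(confusion):
    """Proportion of cases misclassified (before any priors or costs).
    confusion[i][j] counts cases of actual class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total

profit, roi = node_profit_roi(n_yes=100, n_no=400)   # profit 755.0, ROI ~3.15
risk = risk_estimate([[85, 15], [10, 90]])           # 25 of 200 -> 0.125
```

The profit figure matches the hand calculation in the text: 100 responders net 100 x 9.47 = 947, and the 400 non-responders cost 400 x 0.48 = 192, leaving 755.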
Saving Tree Models and Results You can save or export the results of your interactive tree-building sessions in a number of ways, including: Generate a model based on the current tree (Generate > Generate model). Save the directives used to grow the current tree. The next time the tree-building node is executed, the current tree will automatically be regrown, including any custom splits that you have defined. Export model, gain, and risk information. For more information, see the topic Exporting Model, Gain, and Risk Information on p. 139. From either the tree builder or a tree model nugget, you can: Generate a Filter or Select node based on the current tree. For more information, see the topic Generating Filter and Select Nodes on p. 139. Generate a Rule Set nugget that represents the tree structure as a set of rules defining the terminal branches of the tree. For more information, see the topic Generating a Rule Set from a Decision Tree on p. 139. 136 Chapter 6 In addition, for tree model nuggets only, you can export the model in PMML format. For more information, see the topic The Models Palette in Chapter 3 on p. 47. If the model includes any custom splits, this information is not preserved in the exported PMML. (The split is preserved, but the fact that it is custom rather than chosen by the algorithm is not.) Generate a graph based on a selected part of the current tree. Note: this only works for a nugget when it is attached to other nodes in a stream. For more information, see the topic Generating Graphs on p. 172. Note: The interactive tree itself cannot be saved. To avoid losing your work, generate a model and/or update tree directives before closing the tree builder window. Generating a Model from the Tree Builder To generate a model based on the current tree, from the tree builder menus choose: Generate > Model Figure 6-20 Generating a decision tree model You can choose from the following options: Model name. 
You can specify a custom name or generate the name automatically based on the name of the modeling node.
Create node on. You can add the node on the Canvas, GM Palette, or Both.
Include tree directives. To include the directives from the current tree in the generated model, select this box. This enables you to regenerate the tree, if required. For more information, see the topic Tree-Growing Directives on p. 136.

Tree-Growing Directives

For C&R Tree, CHAID, and QUEST models, tree directives specify conditions for growing the tree, one level at a time. Directives are applied each time the interactive tree builder is launched from the node. Directives are most safely used as a way to regenerate a tree created during a previous interactive session. For more information, see the topic Updating Tree Directives on p. 138. You can also edit directives manually, but this should be done with care. Directives are highly specific to the structure of the tree they describe. Thus, any change to the underlying data or modeling options may cause a previously valid set of directives to fail. For example, if the CHAID algorithm changes a two-way split to a three-way split based on updated data, any directives based on the previous two-way split would fail.

Note: If you choose to generate a model directly (without using the tree builder), any tree directives are ignored.

Editing Directives

E To view or edit saved directives, open the tree-building node and select the Objective panel of the Build Options tab.

E Select Launch interactive session to enable the controls, select Use tree directives, and click Directives.

Figure 6-21 Tree-growing directives

Directive Syntax

Directives specify conditions for growing the tree, starting with the root node. For example, to grow the tree one level:

Grow Node Index 0 Children 1 2

Since no predictor is specified, the algorithm chooses the best split.
Note that the first split must always be on the root node (Index 0) and the index values for both children must be specified (1 and 2 in this case). It is invalid to specify Grow Node Index 2 Children 3 4 unless you first grew the root that created Node 2.

To grow the tree:

Grow Tree

To grow and prune the tree (C&R Tree only):

Grow_And_Prune Tree

To specify a custom split for a continuous predictor:

Grow Node Index 0 Children 1 2
Spliton ( "EDUCATE", Interval ( NegativeInfinity, 12.5)
  Interval ( 12.5, Infinity ))

To split on a nominal predictor with two values:

Grow Node Index 2 Children 3 4
Spliton ( "GENDER", Group( "0.0" )Group( "1.0" ))

To split on a nominal predictor with multiple values:

Grow Node Index 6 Children 7 8
Spliton ( "ORGS", Group( "2.0","4.0" )
  Group( "0.0","1.0","3.0","6.0" ))

To split on an ordinal predictor:

Grow Node Index 4 Children 5 6
Spliton ( "CHILDS", Interval ( NegativeInfinity, 1.0)
  Interval ( 1.0, Infinity ))

Note: When specifying custom splits, field names and values (EDUCATE, GENDER, CHILDS, and so on) are case sensitive.

Directives for CHAID Trees

Directives for CHAID trees are particularly sensitive to changes in the data or model because—unlike C&R Tree and QUEST—they are not constrained to use binary splits. For example, the following syntax looks perfectly valid but would fail if the algorithm splits the root node into more than two children:

Grow Node Index 0 Children 1 2
Grow Node Index 1 Children 3 4

With CHAID, it is possible that Node 0 will have three or four children, which would cause the second line of syntax to fail.

Using Directives in Scripts

Directives can also be embedded in scripts using triple quotation marks.

Updating Tree Directives

To preserve your work from an interactive tree-building session, you can save the directives used to generate the current tree. Unlike saving a model nugget, which cannot be edited further, this allows you to regenerate the tree in its current state for further editing.
E To update directives, from the tree builder menus choose: File > Update Directives Directives are saved in the modeling node used to create the tree (either C&R Tree, QUEST, or CHAID) and can be used to regenerate the current tree. For more information, see the topic Tree-Growing Directives on p. 136. 139 Decision Trees Exporting Model, Gain, and Risk Information From the tree builder, you can export model, gain, and risk statistics in text, HTML, or image formats as appropriate. E In the tree builder window, select the tab or view that you want to export. E From the menus choose: File > Export E Select Text, HTML, or Graph as appropriate, and select the specific items you want to export from the submenu. Where applicable, the export is based on current selections. Exporting Text or HTML formats. You can export gain or risk statistics for the training or testing partition (if defined). The export is based on the current selections on the Gains tab—for example, you can choose node-by-node, cumulative, or quantile statistics. Exporting graphics. You can export the current tree as displayed on the Viewer tab or export gains charts for the training or testing partition (if defined). Available formats include .JPEG, .PNG, and .BMP. For gains, the export is based on current selections on the Gains tab (available only when a chart is displayed). Generating Filter and Select Nodes E In the tree builder window, or when browsing a decision tree model nugget, from the menus choose: Generate > Filter Node or > Select Node Filter Node. Generates a node that filters any fields not used by the current tree. This is a quick way to pare down the dataset to include only those fields that are selected as important by the algorithm. If there is a Type node upstream from this decision tree node, any fields with the role Target are passed on by the Filter model nugget. Select Node. Generates a node that selects all records that fall into the current node. 
This option requires that one or more tree branches be selected on the Viewer tab. The model nugget is placed on the stream canvas.

Generating a Rule Set from a Decision Tree

You can generate a Rule Set model nugget that represents the tree structure as a set of rules defining the terminal branches of the tree. Rule sets can often retain most of the important information from a full decision tree but with a less complex model. The most important difference is that with a rule set, more than one rule may apply for any particular record, or no rules at all may apply. For example, you might see all of the rules that predict a no outcome followed by all of those that predict yes. If multiple rules apply, each rule gets a weighted “vote” based on the confidence associated with that rule, and the final prediction is decided by combining the weighted votes of all of the rules that apply to the record in question. If no rule applies, a default prediction is assigned to the record. Rule sets can be generated only from trees with categorical target fields (no regression trees).

E In the tree builder window, or when browsing a decision tree model nugget, from the menus choose:
Generate > Rule Set

Figure 6-22 Generate Rule Set dialog box

Rule set name. Allows you to specify the name of the new Rule Set model nugget.
Create node on. Controls the location of the new Rule Set model nugget. Select Canvas, GM Palette, or Both.
Minimum instances. Specify the minimum number of instances (the number of records to which the rule applies) to preserve in the Rule Set model nugget. Rules with support less than the specified value will not be included in the new rule set.
Minimum confidence. Specify the minimum confidence for rules to be preserved in the Rule Set model nugget. Rules with confidence less than the specified value will not be included in the new rule set.
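The confidence-weighted voting just described can be sketched as follows (illustrative Python; the rule encoding and confidence values are invented for the example, not Modeler's internal format):

```python
# Illustrative sketch of rule-set scoring with confidence-weighted votes.

def score(record, rules, default):
    """rules: (condition, predicted_class, confidence) triples.
    Every matching rule votes for its class with weight = confidence;
    if no rule matches, the default prediction is returned."""
    votes = {}
    for cond, cls, conf in rules:
        if cond(record):
            votes[cls] = votes.get(cls, 0.0) + conf
    return max(votes, key=votes.get) if votes else default

# Invented rules in the spirit of the car-buying example.
rules = [
    (lambda r: r["tested"] == "yes" and r["mileage"] == "low", "BUY", 0.9),
    (lambda r: r["mileage"] == "high", "DON'T BUY", 0.7),
    (lambda r: r["tested"] == "no", "DON'T BUY", 0.6),
]
pred = score({"tested": "yes", "mileage": "low"}, rules, default="DON'T BUY")
# -> "BUY" (only the first rule fires)
unmatched = score({"tested": "unknown", "mileage": "medium"}, rules,
                  default="DON'T BUY")   # no rule fires -> default prediction
```

Note how a record can trigger several rules at once: the per-class vote totals, not any single rule, decide the final prediction, and the default covers records no rule describes.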
Building a Tree Model Directly

As an alternative to using the interactive tree builder, you can build a decision tree model directly from the node when the stream is run. This is consistent with most other model-building nodes. For C5.0 tree models, which are not supported by the interactive tree builder, this is the only method that can be used.

E Create a stream and add one of the decision tree nodes—C&R Tree, CHAID, QUEST, or C5.0.

Figure 6-23: Building a C5.0 tree directly

E For C&R Tree, QUEST, or CHAID, on the Objective panel of the Build Options tab, choose one of the main objectives. If you choose Build a single tree, ensure that Mode is set to Generate model. For C5.0, on the Model tab, set Output type to Decision tree.
E Select target and predictor fields and specify additional model options, as needed. For specific instructions, see the documentation for each tree-building node.
E Run the stream to generate the model.

Comments

When generating trees using this method, tree-growing directives are ignored. Whether interactive or direct, both methods of creating decision trees ultimately generate similar models. It’s just a question of how much control you want along the way.

Decision Tree Nodes

The Decision Tree nodes in IBM® SPSS® Modeler provide access to the tree-building algorithms introduced earlier:
C&R Tree
QUEST
CHAID
C5.0
For more information, see the topic Decision Tree Models on p. 116.

The algorithms are similar in that they can all construct a decision tree by recursively splitting the data into smaller and smaller subgroups. However, there are some important differences.

Input fields. The input fields (predictors) can be any of the following types (measurement levels): continuous, categorical, flag, nominal, or ordinal.

Target fields. Only one target field can be specified. For C&R Tree and CHAID, the target can be continuous, categorical, flag, nominal, or ordinal. For QUEST it can be categorical, flag, or nominal.
For C5.0 the target can be flag, nominal, or ordinal.

Type of split. C&R Tree and QUEST support only binary splits (that is, each node of the tree can be split into no more than two branches). By contrast, CHAID and C5.0 support splitting into more than two branches at a time.

Method used for splitting. The algorithms differ in the criteria used to decide the splits. When C&R Tree predicts a categorical output, a dispersion measure is used (by default the Gini coefficient, though you can change this). For continuous targets, the least squared deviation method is used. CHAID uses a chi-square test; QUEST uses a chi-square test for categorical predictors and analysis of variance for continuous inputs. C5.0 uses an information theory measure, the information gain ratio.

Missing value handling. All algorithms allow missing values for the predictor fields, though they use different methods to handle them. C&R Tree and QUEST use substitute prediction fields, where needed, to advance a record with missing values through the tree during training. CHAID makes the missing values a separate category and allows them to be used in tree building. C5.0 uses a fractioning method, which passes a fractional part of a record down each branch of the tree from a node where the split is based on a field with a missing value.

Pruning. C&R Tree, QUEST, and C5.0 offer the option to grow the tree fully and then prune it back by removing bottom-level splits that do not contribute significantly to the accuracy of the tree. However, all of the decision tree algorithms allow you to control the minimum subgroup size, which helps avoid branches with few data records.

Interactive tree building. C&R Tree, QUEST, and CHAID provide an option to launch an interactive session. This allows you to build your tree one level at a time, edit the splits, and prune the tree before you create the model. C5.0 does not have an interactive option.

Prior probabilities.
C&R Tree and QUEST support the specification of prior probabilities for categories when predicting a categorical target field. Prior probabilities are estimates of the overall relative frequency for each target category in the population from which the training data are drawn. In other words, they are the probability estimates that you would make for each possible target value prior to knowing anything about predictor values. CHAID and C5.0 do not support specifying prior probabilities.

Rule sets. For models with categorical target fields, the decision tree nodes provide the option to create the model in the form of a rule set, which can sometimes be easier to interpret than a complex decision tree. For C&R Tree, QUEST, and CHAID you can generate a rule set from an interactive session; for C5.0 you can specify this option on the modeling node. In addition, all decision tree models enable you to generate a rule set from the model nugget. For more information, see the topic Generating a Rule Set from a Decision Tree on p. 139.

C&R Tree Node

The Classification and Regression (C&R) Tree node is a tree-based classification and prediction method. Similar to C5.0, this method uses recursive partitioning to split the training records into segments with similar output field values. The C&R Tree node starts by examining the input fields to find the best split, measured by the reduction in an impurity index that results from the split. The split defines two subgroups, each of which is subsequently split into two more subgroups, and so on, until one of the stopping criteria is triggered. All splits are binary (only two subgroups).

Pruning

C&R Trees give you the option to first grow the tree and then prune it based on a cost-complexity algorithm that adjusts the risk estimate based on the number of terminal nodes.
This method, which allows the tree to grow large before pruning based on more complex criteria, may result in smaller trees with better cross-validation properties. Increasing the number of terminal nodes generally reduces the risk for the current (training) data, but the actual risk may be higher when the model is generalized to unseen data. In an extreme case, suppose you have a separate terminal node for each record in the training set. The risk estimate would be 0%, since every record falls into its own node, but the risk of misclassification for unseen (testing) data would almost certainly be greater than 0. The cost-complexity measure attempts to compensate for this.

Example. A cable TV company has commissioned a marketing study to determine which customers would buy a subscription to an interactive news service via cable. Using the data from the study, you can create a stream in which the target field is the intent to buy the subscription and the predictor fields include age, sex, education, income category, hours spent watching television each day, and number of children. By applying a C&R Tree node to the stream, you will be able to predict and classify the responses to get the highest response rate for your campaign.

Requirements. To train a C&R Tree model, you need one or more Input fields and exactly one Target field. Target and input fields can be continuous (numeric range) or categorical. Fields set to Both or None are ignored. Fields used in the model must have their types fully instantiated, and any ordinal (ordered set) fields used in the model must have numeric storage (not string). If necessary, the Reclassify node can be used to convert them.

Strengths. C&R Tree models are quite robust in the presence of problems such as missing data and large numbers of fields. They usually do not require long training times to estimate.
In addition, C&R Tree models tend to be easier to understand than some other model types—the rules derived from the model have a very straightforward interpretation. Unlike C5.0, C&R Tree can accommodate continuous as well as categorical output fields.

CHAID Node

CHAID, or Chi-squared Automatic Interaction Detection, is a classification method for building decision trees by using chi-square statistics to identify optimal splits. CHAID first examines the crosstabulations between each of the input fields and the outcome, and tests for significance using a chi-square independence test. If more than one of these relations is statistically significant, CHAID selects the input field that is the most significant (smallest p value). If an input has more than two categories, these are compared, and categories that show no differences in the outcome are collapsed together. This is done by successively joining the pair of categories showing the least significant difference. This category-merging process stops when all remaining categories differ at the specified testing level. For nominal input fields, any categories can be merged; for an ordinal set, only contiguous categories can be merged. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits for each predictor but takes longer to compute.

Requirements. Target and input fields can be continuous or categorical; nodes can be split into two or more subgroups at each level. Any ordinal fields used in the model must have numeric storage (not string). If necessary, the Reclassify node can be used to convert them.

Strengths. Unlike the C&R Tree and QUEST nodes, CHAID can generate nonbinary trees, meaning that some splits have more than two branches. It therefore tends to create a wider tree than the binary growing methods. CHAID works for all types of inputs, and it accepts both case weights and frequency variables.
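The category-merging step described above can be sketched in Python for the simplest case: nominal input categories against a flag (yes/no) target. This is a deliberately simplified illustration, not the actual CHAID algorithm — it omits the Bonferroni adjustment, ordinal contiguity constraints, and resplitting, and uses the exact 1-degree-of-freedom chi-square p-value, which the standard library can compute via `math.erfc`.

```python
import math
from itertools import combinations

def chi2_p_2x2(a, b, c, d):
    """Pearson chi-square p-value for a 2x2 table [[a, b], [c, d]] (1 df).
    For 1 df, the survival function is exactly erfc(sqrt(x2 / 2))."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 1.0  # degenerate table: no evidence of a difference
    x2 = n * (a * d - b * c) ** 2 / margins
    return math.erfc(math.sqrt(x2 / 2))

def merge_categories(counts, alpha_merge=0.05):
    """counts maps each category to its (target_yes, target_no) frequencies.
    Successively join the pair of categories showing the least significant
    difference, stopping when all remaining pairs differ at alpha_merge."""
    cats = {k: tuple(v) for k, v in counts.items()}
    while len(cats) > 1:
        # Find the least significantly different pair (largest p value).
        (c1, c2), p = max(
            (((u, v), chi2_p_2x2(*cats[u], *cats[v]))
             for u, v in combinations(cats, 2)),
            key=lambda pair_p: pair_p[1],
        )
        if p <= alpha_merge:  # all remaining pairs differ significantly
            break
        merged = tuple(x + y for x, y in zip(cats.pop(c1), cats.pop(c2)))
        cats[f"{c1}+{c2}"] = merged
    return cats

# A and B have identical outcome distributions, so they merge; C differs.
print(merge_categories({"A": (10, 10), "B": (10, 10), "C": (30, 5)}))
```

With these invented counts, categories A and B (identical 50/50 outcome splits) are joined into one group, while C remains separate because its distribution differs significantly from the merged group.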
QUEST Node

QUEST—or Quick, Unbiased, Efficient Statistical Tree—is a binary classification method for building decision trees. A major motivation in its development was to reduce the processing time required for large C&R Tree analyses with either many variables or many cases. A second goal of QUEST was to reduce the tendency found in classification tree methods to favor inputs that allow more splits, that is, continuous (numeric range) input fields or those with many categories.

QUEST uses a sequence of rules, based on significance tests, to evaluate the input fields at a node. For selection purposes, as little as a single test may need to be performed on each input at a node. Unlike C&R Tree, not all splits are examined, and unlike C&R Tree and CHAID, category combinations are not tested when evaluating an input field for selection. This speeds the analysis. Splits are determined by running quadratic discriminant analysis using the selected input on groups formed by the target categories. This method again results in a speed improvement over exhaustive search (C&R Tree) to determine the optimal split.

Requirements. Input fields can be continuous (numeric ranges), but the target field must be categorical. All splits are binary. Weight fields cannot be used. Any ordinal (ordered set) fields used in the model must have numeric storage (not string). If necessary, the Reclassify node can be used to convert them.

Strengths. Like CHAID, but unlike C&R Tree, QUEST uses statistical tests to decide whether or not an input field is used. It also separates the issues of input selection and splitting, applying different criteria to each. This contrasts with CHAID, in which the statistical test result that determines variable selection also produces the split. Similarly, C&R Tree employs the impurity-change measure both to select the input field and to determine the split.
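The impurity-change measure that C&R Tree uses to select and evaluate splits can be illustrated with a short sketch. The text names the Gini coefficient as the default dispersion measure for categorical targets; the code below is an illustrative calculation only, with invented class counts, not SPSS Modeler's implementation.

```python
def gini(counts):
    """Gini impurity for a node with the given per-class record counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def impurity_change(parent, left, right):
    """Decrease in Gini impurity produced by a binary split: the parent's
    impurity minus the size-weighted impurities of the two child nodes."""
    n = sum(parent)
    n_left, n_right = sum(left), sum(right)
    return gini(parent) - (n_left / n) * gini(left) - (n_right / n) * gini(right)

# A "pure" node (all records in one class) has impurity 0;
# a 50/50 two-class node has the maximum two-class impurity, 0.5.
print(gini([10, 0]))   # 0.0
print(gini([5, 5]))    # 0.5
# A split that separates the classes perfectly removes all impurity.
print(impurity_change([5, 5], [5, 0], [0, 5]))   # 0.5
```

C&R Tree chooses the split with the largest impurity decrease; as described later under the node's advanced options, a split is not made if the best available decrease falls below the minimum change in impurity.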
Decision Tree Node Fields Options

On the Fields tab, you choose whether you want to use the field role settings already defined in upstream nodes, or make the field assignments manually.

Figure 6-24: C&R Tree node, Fields tab

Use predefined roles. This option uses the role settings (targets, predictors, and so on) from an upstream Type node (or the Types tab of an upstream source node).

Use custom field assignments. Choose this option if you want to assign targets, predictors, and other roles manually on this screen.

Fields. Use the arrow buttons to assign items manually from this list to the various role fields on the right of the screen. The icons indicate the valid measurement levels for each role field. Click the All button to select all the fields in the list, or click an individual measurement level button to select all fields with that measurement level.

Target. Choose one field as the target for the prediction.

Predictors (Inputs). Choose one or more fields as inputs for the prediction.

Analysis Weight. (CHAID and C&RT only) To use a field as a case weight, specify the field here. Case weights are used to account for differences in variance across levels of the output field. For more information, see the topic Using Frequency and Weight Fields in Chapter 3 on p. 38.

Decision Tree Node Build Options

The Build Options tab is where you set all the options for building the model. You can, of course, just click the Run button to build a model with all the default options, but normally you will want to customize the build for your own purposes. You can choose here whether to build a new model or update an existing one. You also set the main objective of the node: to build a standard model, to build one with enhanced accuracy or stability, or to build one for use with very large datasets.

Figure 6-25: C&R Tree node, Build Options tab

What do you want to do?

Build new model.
(Default) Creates a completely new model each time you run a stream containing this modeling node.

Continue training existing model. By default, a completely new model is created each time a modeling node is executed. If this option is selected, training continues with the last model successfully produced by the node. This makes it possible to update or refresh an existing model without having to access the original data and may result in significantly faster performance, since only the new or updated records are fed into the stream. Details on the previous model are stored with the modeling node, making it possible to use this option even if the previous model nugget is no longer available in the stream or Models palette. Note: This option is activated only if you select Create a model for very large datasets as the objective.

What is your main objective?

Build a single tree. Creates a single, standard decision tree model. Standard models are generally easier to interpret, and can be faster to score, than models built using the other objective options.

Mode. Specifies the method used to build the model. Generate model creates a model automatically when the stream is run. Launch interactive session opens the tree builder, which allows you to build your tree one level at a time, edit splits, and prune as desired before creating the model nugget.

Use tree directives. Select this option to specify directives to apply when generating an interactive tree from the node. For example, you can specify the first- and second-level splits, and these would automatically be applied when the tree builder is launched. You can also save directives from an interactive tree-building session in order to re-create the tree at a future date. For more information, see the topic Updating Tree Directives on p. 138.

Enhance model accuracy (boosting). Choose this option if you want to use a special method, known as boosting, to improve the model accuracy rate.
Boosting works by building multiple models in a sequence. The first model is built in the usual way. Then, a second model is built in such a way that it focuses on the records that were misclassified by the first model. Then a third model is built to focus on the second model’s errors, and so on. Finally, cases are classified by applying the whole set of models to them, using a weighted voting procedure to combine the separate predictions into one overall prediction. Boosting can significantly improve the accuracy of a decision tree model, but it also requires longer training.

Enhance model stability (bagging). Choose this option if you want to use a special method, known as bagging (bootstrap aggregating), to improve the stability of the model and to avoid overfitting. This option creates multiple models and combines them, in order to obtain more reliable predictions. Models obtained using this option can take longer to build and score than standard models.

Create a model for very large datasets. Choose this option when working with datasets that are too large to build a model using any of the other objective options. This option divides the data into smaller data blocks and builds a model on each block. The most accurate models are then automatically selected and combined into a single model nugget. You can perform incremental model updating if you select the Continue training existing model option on this screen. Note: This option for very large datasets requires a connection to IBM® SPSS® Modeler Server.

Decision Tree Nodes - Basics

This is where you specify the basic options about how the decision tree is to be built.

Figure 6-26: Decision tree basic options

Tree growing algorithm. (CHAID only) Choose the type of CHAID algorithm you want to use. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits for each predictor but takes longer to compute.

Maximum tree depth.
Specify the maximum number of levels below the root node (the number of times the sample will be split recursively). The default is 5; choose Custom and enter a value to specify a different number of levels.

Pruning (C&RT and QUEST only)

Prune tree to avoid overfitting. Pruning consists of removing bottom-level splits that do not contribute significantly to the accuracy of the tree. Pruning can help simplify the tree, making it easier to interpret and, in some cases, improving generalization. If you want the full tree without pruning, leave this option deselected.

Maximum difference in risk (in Standard Errors). Enables you to specify a more liberal pruning rule. The standard error rule allows the algorithm to select the simplest tree whose risk estimate is close to (but possibly greater than) that of the subtree with the smallest risk. The value indicates the size of the allowable difference in the risk estimate between the pruned tree and the tree with the smallest risk. For example, if you specify 2, a tree whose risk estimate is (2 × standard error) larger than that of the tree with the smallest risk could be selected.

Maximum surrogates. Surrogates are a method for dealing with missing values. For each split in the tree, the algorithm identifies the input fields that are most similar to the selected split field. Those fields are the surrogates for that split. When a record must be classified but has a missing value for a split field, its value on a surrogate field can be used to make the split. Increasing this setting will allow more flexibility to handle missing values but may also lead to increased memory usage and longer training times.

Decision Tree Nodes - Stopping Rules

Figure 6-27: Options for stopping rules

These options control how the tree is constructed. Stopping rules determine when to stop splitting specific branches of the tree. Set the minimum branch sizes to prevent splits that would create very small subgroups.
Minimum records in parent branch prevents a split if the number of records in the node to be split (the parent) is less than the specified value. Minimum records in child branch prevents a split if the number of records in any branch created by the split (the child) would be less than the specified value.

Use percentage. Allows you to specify sizes in terms of a percentage of the overall training data.

Use absolute value. Allows you to specify sizes as absolute numbers of records.

Decision Tree Nodes - Ensembles

Figure 6-28: Options for ensembling

These settings determine the behavior of ensembling that occurs when boosting, bagging, or very large datasets are requested in Objectives. Options that do not apply to the selected objective are ignored.

Bagging and Very Large Datasets. When scoring an ensemble, this is the rule used to combine the predicted values from the base models to compute the ensemble score value.

Default combining rule for categorical targets. Ensemble predicted values for categorical targets can be combined using voting, highest probability, or highest mean probability. Voting selects the category that has the highest probability most often across the base models. Highest probability selects the category that achieves the single highest probability across all base models. Highest mean probability selects the category with the highest value when the category probabilities are averaged across base models.

Default combining rule for continuous targets. Ensemble predicted values for continuous targets can be combined using the mean or median of the predicted values from the base models.

Note that when the objective is to enhance model accuracy, the combining rule selections are ignored. Boosting always uses a weighted majority vote to score categorical targets and a weighted median to score continuous targets.

Boosting and Bagging.
Specify the number of base models to build when the objective is to enhance model accuracy or stability; for bagging, this is the number of bootstrap samples. It should be a positive integer.

C&R Tree and QUEST Nodes - Costs & Priors

Figure 6-29: Setting misclassification costs and prior probabilities

Misclassification Costs

In some contexts, certain kinds of errors are more costly than others. For example, it may be more costly to classify a high-risk credit applicant as low risk (one kind of error) than it is to classify a low-risk applicant as high risk (a different kind of error). Misclassification costs allow you to specify the relative importance of different kinds of prediction errors.

Misclassification costs are basically weights applied to specific outcomes. These weights are factored into the model and may actually change the prediction (as a way of protecting against costly mistakes). With the exception of C5.0 models, misclassification costs are not applied when scoring a model and are not taken into account when ranking or comparing models using an Auto Classifier node, evaluation chart, or Analysis node. A model that includes costs may not produce fewer errors than one that doesn’t and may not rank any higher in terms of overall accuracy, but it is likely to perform better in practical terms because it has a built-in bias in favor of less expensive errors.

The cost matrix shows the cost for each possible combination of predicted category and actual category. By default, all misclassification costs are set to 1.0. To enter custom cost values, select Use misclassification costs and enter your custom values into the cost matrix. To change a misclassification cost, select the cell corresponding to the desired combination of predicted and actual values, delete the existing contents of the cell, and enter the desired cost for the cell. Costs are not automatically symmetrical.
For example, if you set the cost of misclassifying A as B to be 2.0, the cost of misclassifying B as A will still have the default value of 1.0 unless you explicitly change it as well.

Priors

These options allow you to specify prior probabilities for categories when predicting a categorical target field. Prior probabilities are estimates of the overall relative frequency for each target category in the population from which the training data are drawn. In other words, they are the probability estimates that you would make for each possible target value prior to knowing anything about predictor values. There are three methods of setting priors:

Based on training data. This is the default. Prior probabilities are based on the relative frequencies of the categories in the training data.

Equal for all classes. Prior probabilities for all categories are defined as 1/k, where k is the number of target categories.

Custom. You can specify your own prior probabilities. Starting values for prior probabilities are set as equal for all classes. You can adjust the probabilities for individual categories to user-defined values. To adjust a specific category’s probability, select the probability cell in the table corresponding to the desired category, delete the contents of the cell, and enter the desired value.

The prior probabilities for all categories should sum to 1.0 (the probability constraint). If they do not sum to 1.0, a warning is displayed, with an option to automatically normalize the values. This automatic adjustment preserves the proportions across categories while enforcing the probability constraint. You can perform this adjustment at any time by clicking the Normalize button. To reset the table to equal values for all categories, click the Equalize button.

Adjust priors using misclassification costs. This option allows you to adjust the priors, based on misclassification costs (specified on the Costs tab).
This enables you to incorporate cost information directly into the tree-growing process for trees that use the Twoing impurity measure. (When this option is not selected, cost information is used only in classifying records and calculating risk estimates for trees based on the Twoing measure.)

CHAID Node - Costs

Figure 6-30: Misclassification costs in the CHAID node

In some contexts, certain kinds of errors are more costly than others. For example, it may be more costly to classify a high-risk credit applicant as low risk (one kind of error) than it is to classify a low-risk applicant as high risk (a different kind of error). Misclassification costs allow you to specify the relative importance of different kinds of prediction errors.

Misclassification costs are basically weights applied to specific outcomes. These weights are factored into the model and may actually change the prediction (as a way of protecting against costly mistakes). With the exception of C5.0 models, misclassification costs are not applied when scoring a model and are not taken into account when ranking or comparing models using an Auto Classifier node, evaluation chart, or Analysis node. A model that includes costs may not produce fewer errors than one that doesn’t and may not rank any higher in terms of overall accuracy, but it is likely to perform better in practical terms because it has a built-in bias in favor of less expensive errors.

The cost matrix shows the cost for each possible combination of predicted category and actual category. By default, all misclassification costs are set to 1.0. To enter custom cost values, select Use misclassification costs and enter your custom values into the cost matrix. To change a misclassification cost, select the cell corresponding to the desired combination of predicted and actual values, delete the existing contents of the cell, and enter the desired cost for the cell. Costs are not automatically symmetrical.
For example, if you set the cost of misclassifying A as B to be 2.0, the cost of misclassifying B as A will still have the default value of 1.0 unless you explicitly change it as well.

C&R Tree Node - Advanced

The advanced options enable you to fine-tune the tree-building process.

Figure 6-31: Setting advanced options for the C&R Tree node

Minimum change in impurity. Specify the minimum change in impurity to create a new split in the tree. Impurity refers to the extent to which subgroups defined by the tree have a wide range of output field values within each group. For categorical targets, a node is considered “pure” if 100% of cases in the node fall into a specific category of the target field. The goal of tree building is to create subgroups with similar output values—in other words, to minimize the impurity within each node. If the best split for a branch reduces the impurity by less than the specified amount, the split will not be made.

Impurity measure for categorical targets. For categorical target fields, specify the method used to measure the impurity of the tree. (For continuous targets, this option is ignored, and the least squared deviation impurity measure is always used.) Gini is a general impurity measure based on probabilities of category membership for the branch. Twoing is an impurity measure that emphasizes the binary split and is more likely to lead to approximately equal-sized branches from a split. Ordered adds the additional constraint that only contiguous target classes can be grouped together; it is applicable only with ordinal targets. If this option is selected for a nominal target, the standard twoing measure is used by default.

Overfit prevention set. The algorithm internally separates records into a model building set and an overfit prevention set, which is an independent set of data records used to track errors during training in order to prevent the method from modeling chance variation in the data.
Specify a percentage of records. The default is 30.

Replicate results. Setting a random seed allows you to replicate analyses. Specify an integer or click Generate, which will create a pseudo-random integer between 1 and 2147483647, inclusive.

QUEST Node - Advanced

The advanced options enable you to fine-tune the tree-building process.

Figure 6-32: Setting advanced options for the QUEST node

Significance level for splitting. Specifies the significance level (alpha) for splitting nodes. The value must be between 0 and 1. Lower values tend to produce trees with fewer nodes.

Overfit prevention set. The algorithm internally separates records into a model building set and an overfit prevention set, which is an independent set of data records used to track errors during training in order to prevent the method from modeling chance variation in the data. Specify a percentage of records. The default is 30.

Replicate results. Setting a random seed allows you to replicate analyses. Specify an integer or click Generate, which will create a pseudo-random integer between 1 and 2147483647, inclusive.

CHAID Node - Advanced

The advanced options enable you to fine-tune the tree-building process.

Figure 6-33: Setting advanced options for the CHAID node

Significance level for splitting. Specifies the significance level (alpha) for splitting nodes. The value must be between 0 and 1. Lower values tend to produce trees with fewer nodes.

Significance level for merging. Specifies the significance level (alpha) for merging categories. The value must be greater than 0 and less than or equal to 1. To prevent any merging of categories, specify a value of 1. For continuous targets, this means the number of categories for the variable in the final tree matches the specified number of intervals. This option is not available for Exhaustive CHAID.

Adjust significance values using Bonferroni method.
Adjusts significance values when testing the various category combinations of a predictor. Values are adjusted based on the number of tests, which directly relates to the number of categories and the measurement level of a predictor. This is generally desirable because it better controls the false-positive error rate. Disabling this option will increase the power of your analysis to find true differences, but at the cost of an increased false-positive rate. In particular, disabling this option may be recommended for small samples.

Allow resplitting of merged categories within a node. The CHAID algorithm attempts to merge categories in order to produce the simplest tree that describes the model. If selected, this option allows merged categories to be resplit if that results in a better solution.

Chi-square for categorical targets. For categorical targets, you can specify the method used to calculate the chi-square statistic.

Pearson. This method provides faster calculations but should be used with caution on small samples.

Likelihood ratio. This method is more robust than Pearson but takes longer to calculate. For small samples, this is the preferred method. For continuous targets, this method is always used.

Minimum change in expected cell frequencies. When estimating cell frequencies (for both the nominal model and the row effects ordinal model), an iterative procedure (epsilon) is used to converge on the optimal estimate used in the chi-square test for a specific split. Epsilon determines how much change must occur for iterations to continue; if the change from the last iteration is smaller than the specified value, iterations stop. If you are having problems with the algorithm not converging, you can increase this value or increase the maximum number of iterations until convergence occurs.

Maximum iterations for convergence. Specifies the maximum number of iterations before stopping, whether convergence has taken place or not.

Replicate results.
Setting a random seed allows you to replicate analyses. Specify an integer or click Generate, which will create a pseudo-random integer between 1 and 2147483647, inclusive. Decision Tree Node Model Options On the Model Options tab, you can choose whether to specify a name for the model, or generate a name automatically. You can also choose to obtain predictor importance information, as well as raw and adjusted propensity scores for flag targets. Figure 6-34 Setting the model options for a decision tree node Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name. Model Evaluation Calculate predictor importance. For models that produce an appropriate measure of importance, you can display a chart that indicates the relative importance of each predictor in estimating the model. Typically you will want to focus your modeling efforts on the predictors that matter most, and consider dropping or ignoring those that matter least. Note that predictor importance may take longer to calculate for some models, particularly when working with large datasets, and is off by default for some models as a result. Predictor importance is not available for decision list models. For more information, see the topic Predictor Importance in Chapter 3 on p. 51. Propensity Scores Propensity scores can be enabled in the modeling node, and on the Settings tab in the model nugget. This functionality is available only when the selected target is a flag field. For more information, see the topic Propensity Scores in Chapter 3 on p. 41. Calculate raw propensity scores. Raw propensity scores are derived from the model based on the training data only. If the model predicts the true value (will respond), then the propensity is the same as P, where P is the probability of the prediction.
If the model predicts the false value, then the propensity is calculated as (1 – P). If you choose this option when building the model, propensity scores will be enabled in the model nugget by default. However, you can always choose to enable raw propensity scores in the model nugget whether or not you select them in the modeling node. When scoring the model, raw propensity scores will be added in a field with the letters RP appended to the standard prefix. For example, if the predictions are in a field named $R-churn, the name of the propensity score field will be $RRP-churn. Calculate adjusted propensity scores. Raw propensities are based purely on estimates given by the model, which may be overfitted, leading to over-optimistic estimates of propensity. Adjusted propensities attempt to compensate by looking at how the model performs on the test or validation partitions and adjusting the propensities to give a better estimate accordingly. This setting requires that a valid partition field is present in the stream. Unlike raw confidence scores, adjusted propensity scores must be calculated when building the model; otherwise, they will not be available when scoring the model nugget. When scoring the model, adjusted propensity scores will be added in a field with the letters AP appended to the standard prefix. For example, if the predictions are in a field named $R-churn, the name of the propensity score field will be $RAP-churn. Adjusted propensity scores are not available for logistic regression models. When calculating the adjusted propensity scores, the test or validation partition used for the calculation must not have been balanced. To avoid this, be sure the Only balance training data option is selected in any upstream Balance nodes. In addition, if a complex sample has been taken upstream this will invalidate the adjusted propensity scores. Adjusted propensity scores are not available for “boosted” tree and rule set models. 
For more information, see the topic Boosted C5.0 Models on p. 171. Based on. For adjusted propensity scores to be computed, a partition field must be present in the stream. You can specify whether to use the testing or validation partition for this computation. For best results, the testing or validation partition should include at least as many records as the partition used to train the original model. C5.0 Node Note: This feature is available in SPSS Modeler Professional and SPSS Modeler Premium. This node uses the C5.0 algorithm to build either a decision tree or a rule set. A C5.0 model works by splitting the sample based on the field that provides the maximum information gain. Each subsample defined by the first split is then split again, usually based on a different field, and the process repeats until the subsamples cannot be split any further. Finally, the lowest-level splits are reexamined, and those that do not contribute significantly to the value of the model are removed or pruned. Note: The C5.0 node can predict only a categorical target. When analyzing data with categorical (nominal or ordinal) fields, the node is more likely to group categories together than versions of C5.0 prior to release 11.0. C5.0 can produce two kinds of models. A decision tree is a straightforward description of the splits found by the algorithm. Each terminal (or “leaf”) node describes a particular subset of the training data, and each case in the training data belongs to exactly one terminal node in the tree. In other words, exactly one prediction is possible for any particular data record presented to a decision tree. In contrast, a rule set is a set of rules that tries to make predictions for individual records. Rule sets are derived from decision trees and, in a way, represent a simplified or distilled version of the information found in the decision tree.
Rule sets can often retain most of the important information from a full decision tree but with a less complex model. Because of the way rule sets work, they do not have the same properties as decision trees. The most important difference is that with a rule set, more than one rule may apply for any particular record, or no rules at all may apply. If multiple rules apply, each rule gets a weighted “vote” based on the confidence associated with that rule, and the final prediction is decided by combining the weighted votes of all of the rules that apply to the record in question. If no rule applies, a default prediction is assigned to the record. Example. A medical researcher has collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of five medications. You can use a C5.0 model, in conjunction with other nodes, to help find out which drug might be appropriate for a future patient with the same illness. Requirements. To train a C5.0 model, there must be one categorical (i.e., nominal or ordinal) Target field, and one or more Input fields of any type. Fields set to Both or None are ignored. Fields used in the model must have their types fully instantiated. A weight field can also be specified. Strengths. C5.0 models are quite robust in the presence of problems such as missing data and large numbers of input fields. They usually do not require long training times to estimate. In addition, C5.0 models tend to be easier to understand than some other model types, since the rules derived from the model have a very straightforward interpretation. C5.0 also offers the powerful boosting method to increase accuracy of classification. Note: C5.0 model building speed may benefit from enabling parallel processing. C5.0 Node Model Options Figure 6-35 C5.0 node model options Model name. Specify the name of the model to be produced. Auto.
With this option selected, the model name will be generated automatically, based on the target field name(s). This is the default. Custom. Select this option to specify your own name for the model nugget that will be created by this node. Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model. Create split models. Builds a separate model for each possible value of input fields that are specified as split fields. For more information, see the topic Building Split Models in Chapter 3 on p. 30. Output type. Specify here whether you want the resulting model nugget to be a Decision tree or a Rule set. Group symbolics. If this option is selected, C5.0 will attempt to combine symbolic values that have similar patterns with respect to the output field. If this option is not selected, C5.0 will create a child node for every value of the symbolic field used to split the parent node. For example, if C5.0 splits on a COLOR field (with values RED, GREEN, and BLUE), it will create a three-way split by default. However, if this option is selected, and the records where COLOR = RED are very similar to records where COLOR = BLUE, it will create a two-way split, with the GREENs in one group and the BLUEs and REDs together in the other. Use boosting. The C5.0 algorithm has a special method for improving its accuracy rate, called boosting. It works by building multiple models in a sequence. The first model is built in the usual way. Then, a second model is built in such a way that it focuses on the records that were misclassified by the first model. Then a third model is built to focus on the second model’s errors, and so on. Finally, cases are classified by applying the whole set of models to them, using a weighted voting procedure to combine the separate predictions into one overall prediction.
Boosting can significantly improve the accuracy of a C5.0 model, but it also requires longer training. The Number of trials option allows you to control how many models are used for the boosted model. This feature is based on the research of Freund & Schapire, with some proprietary improvements to handle noisy data better. Cross-validate. If this option is selected, C5.0 will use a set of models built on subsets of the training data to estimate the accuracy of a model built on the full dataset. This is useful if your dataset is too small to split into traditional training and testing sets. The cross-validation models are discarded after the accuracy estimate is calculated. You can specify the number of folds, or the number of models used for cross-validation. Note that in previous versions of IBM® SPSS® Modeler, building the model and cross-validating it were two separate operations. In the current version, no separate model-building step is required. Model building and cross-validation are performed at the same time. Mode. For Simple training, most of the C5.0 parameters are set automatically. Expert training allows more direct control over the training parameters. Simple Mode Options Favor. By default, C5.0 will try to produce the most accurate tree possible. In some instances, this can lead to overfitting, which can result in poor performance when the model is applied to new data. Select Generality to use algorithm settings that are less susceptible to this problem. Note: Models built with the Generality option selected are not guaranteed to generalize better than other models. When generality is a critical issue, always validate your model against a held-out test sample. Expected noise (%). Specify the expected proportion of noisy or erroneous data in the training set. Expert Mode Options Pruning severity. Determines the extent to which the decision tree or rule set will be pruned. Increase this value to obtain a smaller, more concise tree. 
Decrease it to obtain a more accurate tree. This setting affects local pruning only (see “Use global pruning” below). Minimum records per child branch. The size of subgroups can be used to limit the number of splits in any branch of the tree. A branch of the tree will be split only if two or more of the resulting subbranches would contain at least this many records from the training set. The default value is 2. Increase this value to help prevent overtraining with noisy data. Use global pruning. Trees are pruned in two stages: First, a local pruning stage, which examines subtrees and collapses branches to increase the accuracy of the model. Second, a global pruning stage considers the tree as a whole, and weak subtrees may be collapsed. Global pruning is performed by default. To omit the global pruning stage, deselect this option. Winnow attributes. If this option is selected, C5.0 will examine the usefulness of the predictors before starting to build the model. Predictors that are found to be irrelevant are then excluded from the model-building process. This option can be helpful for models with many predictor fields and can help prevent overfitting. Note: C5.0 model building speed may benefit from enabling parallel processing. Decision Tree Model Nuggets Decision tree model nuggets represent the tree structures for predicting a particular output field discovered by one of the decision tree modeling nodes (C&R Tree, CHAID, QUEST or C5.0). Tree models can be generated directly from the tree-building node, or indirectly from the interactive tree builder. For more information, see the topic The Interactive Tree Builder on p. 119. Scoring Tree Models When you run a stream containing a tree model nugget, the specific result depends on the type of tree. For classification trees (categorical target), two new fields, containing the predicted value and the confidence for each record, are added to the data.
The prediction is based on the most frequent category for the terminal node to which the record is assigned; if a majority of respondents in a given node is yes, the prediction for all records assigned to that node is yes. For regression trees, only predicted values are generated; confidences are not assigned. Optionally, for CHAID, QUEST, and C&R Tree models, an additional field can be added that indicates the ID for the node to which each record is assigned. The new field names are derived from the model name by adding prefixes. For C&R Tree, CHAID, and QUEST, the prefixes are $R- for the prediction field, $RC- for the confidence field, and $RI- for the node identifier field. For C5.0 trees, the prefixes are $C- for the prediction field and $CC- for the confidence field. If multiple tree model nodes are present, the new field names will include numbers in the prefix to distinguish them if necessary—for example, $R1- and $RC1-, and $R2-. Working with Tree Model Nuggets You can save or export information related to the model in a number of ways. Note: Many of these options are also available from the tree builder window. From either the tree builder or a tree model nugget, you can: Generate a Filter or Select node based on the current tree. For more information, see the topic Generating Filter and Select Nodes on p. 139. Generate a Rule Set nugget that represents the tree structure as a set of rules defining the terminal branches of the tree. For more information, see the topic Generating a Rule Set from a Decision Tree on p. 139. In addition, for tree model nuggets only, you can export the model in PMML format. For more information, see the topic The Models Palette in Chapter 3 on p. 47. If the model includes any custom splits, this information is not preserved in the exported PMML. (The split is preserved, but the fact that it is custom rather than chosen by the algorithm is not.) Generate a graph based on a selected part of the current tree.
Note: This only works for a nugget when it is attached to other nodes in a stream. For more information, see the topic Generating Graphs on p. 172. For boosted C5.0 models only, you can choose Single Decision Tree (Canvas) or Single Decision Tree (GM Palette) to create a new single rule set derived from the currently selected rule. For more information, see the topic Boosted C5.0 Models on p. 171. Note: Although the Build Rule node was replaced by the C&R Tree node, decision tree nodes in existing streams that were originally created using a Build Rule node will still function properly. Single Tree Model Nuggets If you select Build a single tree as the main objective on the modeling node, the resulting model nugget contains the following tabs:

Model. Displays the rules that define the model. For more information, see the topic Decision Tree Model Rules on p. 165.
Viewer. Displays the tree view of the model. For more information, see the topic Decision Tree Model Viewer on p. 168.
Summary. Displays information about the fields, build settings, and model estimation process. For more information, see the topic Model Nugget Summary / Information in Chapter 3 on p. 50.
Settings. Enables you to specify options for confidences and for SQL generation during model scoring. For more information, see the topic Decision Tree/Rule Set Model Nugget Settings on p. 170.
Annotation. Enables you to add descriptive annotations, specify a custom name, add tooltip text and specify search keywords for the model.

Decision Tree Model Rules The Model tab for a decision tree nugget displays the rules that define the model. Optionally, a graph of predictor importance and a third panel with information about history, frequencies, and surrogates can also be displayed. Note: If you select the Create a model for very large datasets option on the CHAID node Build Options tab (Objective panel), the Model tab displays the tree rule details only.
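As a quick illustration of the scored-field naming convention described earlier (the $R-/$RC-/$RI- prefixes for C&R Tree, CHAID, and QUEST, $C-/$CC- for C5.0, with numbers added to the prefix when several nuggets score the same stream), a short sketch follows. The helper function and its numbering logic are purely illustrative and are not part of any SPSS Modeler API:

```python
# Illustrative sketch of the scored-output field naming described above.
# The function is hypothetical; only the prefix scheme comes from the manual.

def scored_field_names(target, family="cart", index=0):
    """Return (prediction, confidence) field names for a tree nugget.

    family: "cart" for C&R Tree/CHAID/QUEST ($R-/$RC-),
            "c5"   for C5.0 trees ($C-/$CC-).
    index:  nonzero to simulate the numbering used when several
            nuggets are present (e.g. $R1-, $RC1-).
    """
    base = {"cart": ("R", "RC"), "c5": ("C", "CC")}[family]
    n = str(index) if index else ""
    return tuple("$%s%s-%s" % (prefix, n, target) for prefix in base)

print(scored_field_names("churn"))            # ('$R-churn', '$RC-churn')
print(scored_field_names("churn", index=1))   # ('$R1-churn', '$RC1-churn')
```

For example, with a target named churn, a C&R Tree nugget would add $R-churn and $RC-churn to the data, matching the convention above.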
Figure 6-36 Decision tree model nugget Tree Rules The left pane displays a list of conditions defining the partitioning of data discovered by the algorithm—essentially a series of rules that can be used to assign individual records to child nodes based on the values of different predictors. Decision trees work by recursively partitioning the data based on input field values. The data partitions are called branches. The initial branch (sometimes called the root) encompasses all data records. The root is split into subsets, or child branches, based on the value of a particular input field. Each child branch can be further split into sub-branches, which can in turn be split again, and so on. At the lowest level of the tree are branches that have no more splits. Such branches are known as terminal branches (or leaves). The rule browser shows the input values that define each partition or branch and a summary of output field values for the records in that split. For general information on using the model browser, see Browsing Model Nuggets. For splits based on numeric fields, the branch is shown by a line of the form:

fieldname relation value [summary]

where relation is a numeric relation. For example, a branch defined by values greater than 100 for the revenue field would be shown as:

revenue > 100 [summary]

For splits based on symbolic fields, the branch is shown by a line of the form:

fieldname = value [summary]

or

fieldname in [values] [summary]

where values represents the field values that define the branch. For example, a branch that includes records where the value of region can be North, West, or South would be represented as:

region in ["North" "West" "South"] [summary]

For terminal branches, a prediction is also given, adding an arrow and the predicted value to the end of the rule condition.
For example, a leaf defined by revenue > 100 that predicts a value of high for the output field would be displayed as:

revenue > 100 [Mode: high] → high

The summary for the branch is defined differently for symbolic and numeric output fields. For trees with numeric output fields, the summary is the average value for the branch, and the effect of the branch is the difference between the average for the branch and the average of its parent branch. For trees with symbolic output fields, the summary is the mode, or the most frequent value, for records in the branch. To fully describe a branch, you need to include the condition that defines the branch, plus the conditions that define the splits further up the tree. For example, in the tree:

revenue > 100
  region = "North"
  region in ["South" "East" "West"]
revenue <= 200

the branch represented by the second line is defined by the conditions revenue > 100 and region = “North”. If you click Show Instances/Confidence on the toolbar, each rule will also show information about the number of records to which the rule applies (Instances) and the proportion of those records for which the rule is true (Confidence). Predictor Importance Optionally, a chart that indicates the relative importance of each predictor in estimating the model may also be displayed on the Model tab. Typically you will want to focus your modeling efforts on the predictors that matter most and consider dropping or ignoring those that matter least. Note this chart is only available if Calculate predictor importance is selected on the Analyze tab before generating the model. For more information, see the topic Predictor Importance in Chapter 3 on p. 51. Additional Model Information If you click Show Additional Information Panel on the toolbar, you will see a panel at the bottom of the window displaying detailed information for the selected rule. The information panel contains three tabs.
Figure 6-37 Surrogates displayed in the Information panel History. This tab traces the split conditions from the root node down to the selected node. This provides a list of conditions that determine when a record is assigned to the selected node. Records for which all of the conditions are true will be assigned to this node. Frequencies. For models with symbolic target fields, this tab shows, for each possible target value, the number of records assigned to this node (in the training data) that have that target value. The frequency figure, expressed as a percentage (shown to a maximum of three decimal places), is also displayed. For models with numeric targets, this tab is empty. Surrogates. Where applicable, any surrogates for the primary split field are shown for the selected node. Surrogates are alternate fields used if the primary predictor value is missing for a given record. The maximum number of surrogates allowed for a given split is specified in the tree-building node, but the actual number depends on the training data. In general, the more missing data, the more surrogates are likely to be used. For other decision tree models, this tab is empty. Note: To be included in the model, surrogates must be identified during the training phase. If the training sample has no missing values, then no surrogates will be identified, and any records with missing values encountered during testing or scoring will automatically fall into the child node with the largest number of records. If missing values are expected during testing or scoring, be sure that values are missing from the training sample, as well. Surrogates are not available for CHAID trees. Decision Tree Model Viewer The Viewer tab for a decision tree model nugget resembles the display in the tree builder. The main difference is that when browsing the model nugget, you cannot grow or modify the tree. Other options for viewing and customizing the display are similar between the two components.
For more information, see the topic Customizing the Tree View on p. 124. Note: The Viewer tab is not displayed for CHAID model nuggets built with the Create a model for very large datasets option selected on the Build Options tab - Objective panel. Figure 6-38 Decision tree Viewer tab with tree map window When viewing split rules on the Viewer tab, square brackets mean that the adjacent value is included in the range whereas parentheses indicate that the adjacent value is excluded from the range. The expression (23,37] therefore means from 23 exclusive to 37 inclusive, i.e. from just above 23 to 37. On the Model tab, the same condition would be displayed as: Age > 23 and Age <= 37 Figure 6-39 Split rules displayed on the Viewer tab Decision Tree/Rule Set Model Nugget Settings The Settings tab for a decision tree or Rule Set model nugget allows you to specify options for confidences and for SQL generation during model scoring. This tab is available only after the model nugget has been added to a stream. Figure 6-40 Example of decision tree model nugget settings Calculate confidences. Select to include confidences in scoring operations. When scoring models in the database, excluding confidences allows you to generate more efficient SQL. For regression trees, confidences are not assigned. Note: If you select the Create a model for very large datasets option on the Build Options tab - Method panel for CHAID models, this checkbox is available only in the model nuggets for categorical targets of nominal or flag. Calculate raw propensity scores. For models with a flag target (which return a yes or no prediction), you can request propensity scores that indicate the likelihood of the true outcome specified for the target field. These are in addition to other prediction and confidence values that may be generated during scoring.
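As described earlier for the modeling node, the raw propensity for a flag target equals P when the model predicts the true value and 1 – P when it predicts the false value, where P is the probability (confidence) of the prediction. A minimal sketch of that calculation, with an illustrative function that is not part of SPSS Modeler:

```python
# Raw propensity for a flag target, as described above:
# P if the model predicts the true value, 1 - P otherwise,
# where P is the confidence of the prediction. Illustrative only.

def raw_propensity(predicted_true, confidence):
    """Likelihood of the true outcome for a flag target."""
    return confidence if predicted_true else 1.0 - confidence

# A record predicted "false" with confidence 0.8 still has roughly
# a 0.2 propensity toward the true outcome:
print(round(raw_propensity(False, 0.8), 2))  # 0.2
print(round(raw_propensity(True, 0.8), 2))   # 0.8
```

When scored in the product, this value would appear in a field such as $RRP-churn, following the prefix convention described earlier.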
Note: If you select the Create a model for very large datasets option on the Build Options tab - Method panel for CHAID models, this checkbox is available only in model nuggets with a categorical target of flag. Calculate adjusted propensity scores. Raw propensity scores are based only on the training data and may be overly optimistic due to the tendency of many models to overfit this data. Adjusted propensities attempt to compensate by evaluating model performance against a test or validation partition. This option requires that a partition field be defined in the stream and adjusted propensity scores be enabled in the modeling node before generating the model. Note: Adjusted propensity scores are not available for boosted tree and rule set models. For more information, see the topic Boosted C5.0 Models on p. 171. Rule identifier. For CHAID, QUEST, and C&R Tree models, this option adds a field in the scoring output that indicates the ID for the terminal node to which each record is assigned. Note: When this option is selected, SQL generation is not available. Generate SQL for this model. When using data from a database, SQL code can be pushed back to the database for execution, providing superior performance for many operations. Select one of the following options to specify how SQL generation is performed. Default: Score using Server Scoring Adapter (if installed) otherwise in process. If connected to a database with a scoring adapter installed, generates SQL using the scoring adapter, otherwise generates SQL natively inside SPSS Modeler. Generate with no missing value support. Select to enable SQL generation without the overhead of handling missing values. This option simply sets the prediction to null ($null$) when a missing value is encountered while scoring a case. Note: This option is not available for CHAID models. For other model types, it is only available for decision trees (not rule sets). Generate with missing value support. 
For CHAID, QUEST, and C&R Tree models, you can enable SQL generation with full missing value support. This means that SQL is generated so that missing values are handled as specified in the model. For example, C&R Trees use surrogate rules and biggest child fallback. Note: For C5.0 models, this option is only available for rule sets (not decision trees). Boosted C5.0 Models Note: This feature is available in SPSS Modeler Professional and SPSS Modeler Premium. Figure 6-41 Boosted C5.0 model nugget, Model tab When you create a boosted C5.0 model (either a rule set or a decision tree), you actually create a set of related models. The model rule browser for a boosted C5.0 model shows the list of models at the top level of the hierarchy, along with the estimated accuracy of each model and the overall accuracy of the ensemble of boosted models. To examine the rules or splits for a particular model, select that model and expand it as you would a rule or branch in a single model. You can also extract a particular model from the set of boosted models and create a new Rule Set model nugget containing just that model. To create a new rule set from a boosted C5.0 model, select the rule set or tree of interest and choose either Single Decision Tree (GM Palette) or Single Decision Tree (Canvas) from the Generate menu. Generating Graphs The Tree nodes provide a lot of information; however, it may not always be in an easily accessible format for business users. To provide the data in a way that can be easily incorporated into business reports, presentations, and so on, you can produce graphs of selected data. For example, from either the Model or the Viewer tabs of a model nugget, or from the Viewer tab of an interactive tree, you can generate a graph for a selected part of the tree, thereby only creating a graph for the cases in the selected tree or branch node.
Note: You can only generate a graph from a nugget when it is attached to other nodes in a stream. Generate a graph The first step is to select the information to be shown on the graph: On the Model tab of a nugget, expand the list of conditions and rules in the left pane and select the one in which you are interested. On the Viewer tab of a nugget, expand the list of branches and select the node in which you are interested. On the Viewer tab of an interactive tree, expand the list of branches and select the node in which you are interested. Note: You cannot select the top node on either Viewer tab. The way in which you create a graph is the same, regardless of how you select the data to be shown: 1. From the Generate menu, select Graph (from selection); alternatively, on the Viewer tab, click the Graph (from selection) button in the bottom left corner. The Graphboard Basic tab is displayed. Figure 6-42 Graphboard node dialog box, Basic tab Note: Only the Basic and Detailed tabs are available when you display the Graphboard in this way. 2. Using either the Basic or Detailed tab settings, specify the details to be displayed on the graph. 3. Click OK to generate the graph. Figure 6-43 Histogram generated from Graphboard Basic tab The graph heading identifies the nodes or rules that were chosen for inclusion. Model Nuggets for Boosting, Bagging and Very Large Datasets If you select Enhance model accuracy (boosting), Enhance model stability (bagging), or Create a model for very large datasets as the main objective on the modeling node, IBM® SPSS® Modeler builds an ensemble of multiple models. For more information, see the topic Models for Ensembles in Chapter 3 on p. 53. The resulting model nugget contains the following tabs. The Model tab provides a number of different views of the model.
Model Summary. Displays a summary of the ensemble quality and (except for boosted models and continuous targets) diversity, a measure of how much the predictions vary across the different models. For more information, see the topic Model Summary in Chapter 3 on p. 55.
Predictor Importance. Displays a chart indicating the relative importance of each predictor (input field) in estimating the model. For more information, see the topic Predictor Importance in Chapter 3 on p. 56.
Predictor Frequency. Displays a chart showing the relative frequency with which each predictor is used in the set of models. For more information, see the topic Predictor Frequency in Chapter 3 on p. 57.
Component Model Accuracy. Plots a chart of the predictive accuracy of each of the different models in the ensemble.
Component Model Details. Displays information on each of the different models in the ensemble. For more information, see the topic Component Model Details in Chapter 3 on p. 60.
Information. Displays information about the fields, build settings, and model estimation process. For more information, see the topic Model Nugget Summary / Information in Chapter 3 on p. 50.

The remaining tabs are as follows:

Settings. Enables you to include confidences in scoring operations. For more information, see the topic Decision Tree/Rule Set Model Nugget Settings on p. 170.
Annotation. Enables you to add descriptive annotations, specify a custom name, add tooltip text and specify search keywords for the model.

Rule Set Model Nuggets A Rule Set model nugget represents the rules for predicting a particular output field discovered by the association rule modeling node (Apriori) or by one of the tree-building nodes (C&R Tree, CHAID, QUEST, or C5.0). For association rules, the rule set must be generated from an unrefined Rule nugget.
For trees, a rule set can be generated from the tree builder, from a C5.0 model-building node, or from any tree model nugget. Unlike unrefined Rule nuggets, Rule Set nuggets can be placed in streams to generate predictions.

When you run a stream containing a Rule Set nugget, two new fields, containing the predicted value and the confidence for each record, are added to the data. The new field names are derived from the model name by adding prefixes. For association rule sets, the prefixes are $A- for the prediction field and $AC- for the confidence field. For C5.0 rule sets, the prefixes are $C- for the prediction field and $CC- for the confidence field. For C&R Tree rule sets, the prefixes are $R- for the prediction field and $RC- for the confidence field.

In a stream with multiple Rule Set nuggets in a series predicting the same output field(s), the new field names will include numbers in the prefix to distinguish them from each other. The first association Rule Set nugget in the stream will use the usual names, the second node will use names starting with $A1- and $AC1-, the third node will use names starting with $A2- and $AC2-, and so on.

How rules are applied. Rule sets generated from association rules are unlike other model nuggets because for any particular record, more than one prediction can be generated, and those predictions may not all agree. There are two methods for generating predictions from rule sets.

Note: Rule sets generated from decision trees return the same results regardless of which method is used, because the rules derived from a decision tree are mutually exclusive.

Voting. This method attempts to combine the predictions of all of the rules that apply to the record. For each record, all rules are examined, and each rule that applies to the record is used to generate a prediction and an associated confidence.
The sum of confidence figures for each output value is computed, and the value with the greatest confidence sum is chosen as the final prediction. The confidence for the final prediction is the confidence sum for that value divided by the number of rules that fired for that record.

First hit. This method simply tests the rules in order, and the first rule that applies to the record is the one used to generate the prediction.

The method used can be controlled in the stream options.

Generating nodes. The Generate menu allows you to create new nodes based on the rule set.

Filter Node. Creates a new Filter node to filter fields that are not used by rules in the rule set.

Select Node. Creates a new Select node to select records to which the selected rule applies. The generated node will select records for which all antecedents of the rule are true. This option requires a rule to be selected.

Rule Trace Node. Creates a new SuperNode that will compute a field indicating which rule was used to make the prediction for each record. When a rule set is evaluated using the first hit method, this is simply a symbol indicating the first rule that would fire. When the rule set is evaluated using the voting method, this is a more complex string showing the input to the voting mechanism.

Single Decision Tree (Canvas)/Single Decision Tree (GM Palette). Creates a new single Rule Set nugget derived from the currently selected rule. Available only for boosted C5.0 models. For more information, see the topic Boosted C5.0 Models on p. 171.

Model to Palette. Returns the model to the models palette. This is useful in situations where a colleague may have sent you a stream containing the model and not the model itself.

Note: The Settings and Summary tabs in the Rule Set nugget are identical to those for decision tree models.

Rule Set Model Tab

The Model tab for a Rule Set nugget displays a list of rules extracted from the data by the algorithm.
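As background, the voting and first hit scoring methods described above can be sketched in Python. This is a simplified illustration only, not SPSS Modeler's implementation; the rule representation (predicate functions paired with a predicted value and confidence) is a hypothetical stand-in for the product's internal format.

```python
# Each rule is a hypothetical (antecedent predicate, predicted value, confidence) triple.

def applicable(rules, record):
    """Return (value, confidence) for every rule whose antecedents all hold."""
    return [(value, conf) for test, value, conf in rules if test(record)]

def score_voting(rules, record):
    """Sum the confidences per output value; the winner's confidence is its
    confidence sum divided by the number of rules that fired."""
    fired = applicable(rules, record)
    if not fired:
        return None, 0.0
    sums = {}
    for value, conf in fired:
        sums[value] = sums.get(value, 0.0) + conf
    best = max(sums, key=sums.get)
    return best, sums[best] / len(fired)

def score_first_hit(rules, record):
    """Use the first rule, in order, that applies to the record."""
    for test, value, conf in rules:
        if test(record):
            return value, conf
    return None, 0.0
```

For example, with rules predicting "yes" with confidence 0.8 and 0.6 and both firing for a record, voting predicts "yes" with confidence (0.8 + 0.6) / 2, while first hit simply returns "yes" with confidence 0.8.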
Figure 6-44 Rule Set model nugget, Model tab

Rules are broken down by consequent (predicted category) and are presented in the following format:

if antecedent_1 and antecedent_2 ... and antecedent_n then predicted value

where antecedent_1 through antecedent_n are all conditions and the consequent is the predicted value. The rule is interpreted as “for records where antecedent_1 through antecedent_n are all true, the consequent is also likely to be true.”

If you click the Show Instances/Confidence button on the toolbar, each rule will also show information on the number of records to which the rule applies—that is, for which the antecedents are true (Instances), and the proportion of those records for which the entire rule is true (Confidence).

Note that confidence is calculated somewhat differently for C5.0 rule sets. C5.0 uses the following formula for calculating the confidence of a rule:

(1 + number of records where rule is correct) / (2 + number of records for which the rule's antecedents are true)

This calculation of the confidence estimate adjusts for the process of generalizing rules from a decision tree (which is what C5.0 does when it creates a rule set).

Importing Projects from AnswerTree 3.0

IBM® SPSS® Modeler can import projects saved in AnswerTree 3.0 or 3.1 using the standard File > Open dialog box, as follows:

E From the SPSS Modeler menus choose: File > Open Stream

E From the Files of Type drop-down list, select AT Project Files (*.atp, *.ats).

Each imported project is converted into an SPSS Modeler stream with the following nodes:

One source node that defines the data source used (for example, an IBM® SPSS® Statistics data file or database source).

For each tree in the project (there can be several), one Type node is created that defines properties for each field (variable), including type, role (input or predictor field versus output or predicted field), missing values, and other options.
For each tree in the project, a Partition node is created that partitions the data for a training or test sample, and a tree-building node is created that defines parameters for generating the tree (either a C&R Tree, QUEST, or CHAID node).

E To view the generated tree(s), run the stream.

Comments

Decision trees generated in SPSS Modeler cannot be exported to AnswerTree; the import from AnswerTree to SPSS Modeler is a one-way trip.

Profits defined in AnswerTree are not preserved when the project is imported into SPSS Modeler.

Chapter 7 Bayesian Network Models

Bayesian Network Node

The Bayesian Network node enables you to build a probability model by combining observed and recorded evidence with “common-sense” real-world knowledge to establish the likelihood of occurrences by using seemingly unlinked attributes. The node focuses on Tree Augmented Naïve Bayes (TAN) and Markov Blanket networks that are primarily used for classification.

Bayesian networks are used for making predictions in many varied situations; some examples are:

Selecting loan opportunities with low default risk.

Estimating when equipment will need service, parts, or replacement, based on sensor input and existing records.

Resolving customer problems via online troubleshooting tools.

Diagnosing and troubleshooting cellular telephone networks in real time.

Assessing the potential risks and rewards of research-and-development projects in order to focus resources on the best opportunities.

A Bayesian network is a graphical model that displays variables (often referred to as nodes) in a dataset and the probabilistic, or conditional, independencies between them. Causal relationships between nodes may be represented by a Bayesian network; however, the links in the network (also known as arcs) do not necessarily represent direct cause and effect.
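The kind of calculation such a network supports can be made concrete with a toy sketch. The example below is not the Asia model and not SPSS Modeler code; it uses a hypothetical two-node network (Disease → Symptom) with made-up probabilities and applies Bayes' theorem to reverse the direction of the link:

```python
# Toy Bayesian network: Disease -> Symptom, with hypothetical probabilities.
p_disease = {True: 0.01, False: 0.99}          # P(disease)
p_symptom_given = {True: 0.90, False: 0.10}    # P(symptom | disease)

def p_disease_given_symptom(symptom=True):
    """Bayes' theorem: P(D | S) = P(S | D) P(D) / sum over d of P(S | d) P(d)."""
    joint = {}
    for d in (True, False):
        p_s = p_symptom_given[d] if symptom else 1 - p_symptom_given[d]
        joint[d] = p_s * p_disease[d]          # P(S, D=d)
    return joint[True] / sum(joint.values())
```

With these numbers, observing the symptom raises the disease probability from 1% to only about 8%, because the disease is rare; this interplay between link strengths and base rates is exactly what the network graph encodes.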
For example, a Bayesian network can be used to calculate the probability of a patient having a specific disease, given the presence or absence of certain symptoms and other relevant data, if the probabilistic independencies between symptoms and disease as displayed on the graph hold true. Networks are very robust where information is missing and make the best possible prediction using whatever information is present.

A common, basic example of a Bayesian network was created by Lauritzen and Spiegelhalter (1988). It is often referred to as the “Asia” model and is a simplified version of a network that may be used to diagnose a doctor’s new patients; the direction of the links roughly corresponds to causality. Each node represents a facet that may relate to the patient’s condition; for example, “Smoking” indicates that the patient is a confirmed smoker, and “VisitAsia” shows whether they recently visited Asia. Probability relationships are shown by the links between nodes; for example, smoking increases the chances of the patient developing both bronchitis and lung cancer, whereas age only seems to be associated with the possibility of developing lung cancer. In the same way, abnormalities on an x-ray of the lungs may be caused by either tuberculosis or lung cancer, while the chances of a patient suffering from shortness of breath (dyspnea) are increased if they also suffer from either bronchitis or lung cancer.

Figure 7-1 Lauritzen and Spiegelhalter’s Asia network example

There are several reasons why you might decide to use a Bayesian network:

It helps you learn about causal relationships. From this, it enables you to understand a problem area and to predict the consequences of any intervention.

The network provides an efficient approach for avoiding the overfitting of data.

A clear visualization of the relationships involved is easily observed.

Requirements.
Target fields must be categorical and can have a measurement level of Nominal, Ordinal, or Flag. Inputs can be fields of any type. Continuous (numeric range) input fields will be automatically binned; however, if the distribution is skewed, you may obtain better results by manually binning the fields using a Binning node before the Bayesian Network node. For example, use Optimal Binning where the Supervisor field is the same as the Bayesian Network node Target field.

Example. An analyst for a bank wants to be able to predict customers, or potential customers, who are likely to default on their loan repayments. You can use a Bayesian network model to identify the characteristics of customers most likely to default, and build several different types of model to establish which is the best at predicting potential defaulters.

Example. A telecommunications operator wants to reduce the number of customers who leave the business (known as “churn”), and update the model on a monthly basis using each preceding month’s data. You can use a Bayesian network model to identify the characteristics of customers most likely to churn, and continue training the model each month with the new data.

Bayesian Network Node Model Options

Figure 7-2 Bayesian Network node: Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.

Build model for each split. Builds a separate model for each possible value of input fields that are specified as split fields. For more information, see the topic Building Split Models in Chapter 3 on p. 30.

Partition.
This field allows you to specify a field used to partition the data into separate samples for the training, testing, and validation stages of model building. By using one sample to generate the model and a different sample to test it, you can get a good indication of how well the model will generalize to larger datasets that are similar to the current data. If multiple partition fields have been defined by using Type or Partition nodes, a single partition field must be selected on the Fields tab in each modeling node that uses partitioning. (If only one partition is present, it is automatically used whenever partitioning is enabled.) Also note that to apply the selected partition in your analysis, partitioning must also be enabled in the Model Options tab for the node. (Deselecting this option makes it possible to disable partitioning without changing field settings.)

Splits. For split models, select the split field or fields. This is similar to setting the field role to Split in a Type node. You can designate only fields with a measurement level of Flag, Nominal, Ordinal, or Continuous as split fields. Fields chosen as split fields cannot be used as target, input, partition, frequency, or weight fields. For more information, see the topic Building Split Models in Chapter 3 on p. 30.

Continue training existing model. If you select this option, the results shown on the model nugget Model tab are regenerated and updated each time the model is run. For example, you would do this when you have added a new or updated data source to an existing model.

Note: This can only update the existing network; it cannot add or remove nodes or connections. Each time you retrain the model, the network will be the same shape; only the conditional probabilities and predictor importance will change.
If your new data are broadly similar to your old data, then this does not matter, since you expect the same things to be significant; however, if you want to check or update what is significant (as opposed to how significant it is), you will need to build a new model, that is, build a new network.

Structure type. Select the structure to be used when building the Bayesian network:

TAN. The Tree Augmented Naïve Bayes model (TAN) creates a simple Bayesian network model that is an improvement over the standard Naïve Bayes model. This is because it allows each predictor to depend on another predictor in addition to the target variable, thereby increasing the classification accuracy.

Markov Blanket. This selects the set of nodes in the dataset that contain the target variable’s parents, its children, and its children’s parents. Essentially, a Markov blanket identifies all the variables in the network that are needed to predict the target variable. This method of building a network is considered to be more accurate; however, with large datasets there may be a processing time penalty due to the high number of variables involved. To reduce the amount of processing, you can use the Feature Selection options on the Expert tab to select the variables that are significantly related to the target variable.

Include feature selection preprocessing step. Selecting this box enables you to use the Feature Selection options on the Expert tab.

Parameter learning method. Bayesian network parameters refer to the conditional probabilities for each node given the values of its parents. There are two possible selections that you can use to control the task of estimating the conditional probability tables between nodes where the values of the parents are known:

Maximum likelihood. Select this box when using a large dataset. This is the default selection.

Bayes adjustment for small cell counts.
For smaller datasets there is a danger of overfitting the model, as well as the possibility of a high number of zero counts. Select this option to alleviate these problems by applying smoothing to reduce the effect of any zero counts and any unreliable estimates.

Bayesian Network Node Expert Options

Figure 7-3 Bayesian Network node: Expert tab

The node expert options enable you to fine-tune the model-building process. To access the expert options, set Mode to Expert on the Expert tab.

Missing values. By default, IBM® SPSS® Modeler only uses records that have valid values for all fields used in the model. (This is sometimes called listwise deletion of missing values.) If you have a lot of missing data, you may find that this approach eliminates too many records, leaving you without enough data to generate a good model. In such cases, you can deselect the Use only complete records option. SPSS Modeler then attempts to use as much information as possible to estimate the model, including records where some of the fields have missing values. (This is sometimes called pairwise deletion of missing values.) However, in some situations, using incomplete records in this manner can lead to computational problems in estimating the model.

Append all probabilities. Specifies whether probabilities for each category of the output field are added to each record processed by the node. If this option is not selected, the probability of only the predicted category is added.

Independence test. A test of independence assesses whether paired observations on two variables are independent of each other. Select the type of test to be used; the available options are:

Likelihood ratio. Tests for target-predictor independence by calculating a ratio between the maximum probability of a result under two different hypotheses.

Pearson chi-square.
Tests for target-predictor independence by using a null hypothesis that the relative frequencies of occurrence of observed events follow a specified frequency distribution.

Bayesian network models conduct conditional tests of independence where additional variables are used beyond the tested pairs. In addition, the models explore not only the relations between the target and predictors, but also the relations among the predictors themselves.

Note: The Independence test options are only available if you select either Include feature selection preprocessing step or a Structure type of Markov Blanket on the Model tab.

Significance level. Used in conjunction with the Independence test settings, this enables you to set a cut-off value to be used when conducting the tests. The lower the value, the fewer links remain in the network; the default level is 0.01.

Note: This option is only available if you select either Include feature selection preprocessing step or a Structure type of Markov Blanket on the Model tab.

Maximal conditioning set size. The algorithm for creating a Markov Blanket structure uses conditioning sets of increasing size to carry out independence testing and remove unnecessary links from the network. Because tests involving a high number of conditioning variables require more time and memory for processing, you can limit the number of variables to be included. This can be especially useful when processing data with strong dependencies among many variables. Note, however, that the resulting network may contain some superfluous links. Specify the maximal number of conditioning variables to be used for independence testing. The default setting is 5.

Note: This option is only available if you select either Include feature selection preprocessing step or a Structure type of Markov Blanket on the Model tab.

Feature selection. These options enable you to restrict the number of inputs used when processing the model in order to speed up the model-building process.
This is especially useful when creating a Markov Blanket structure due to the possibly large number of potential inputs; it enables you to select the inputs that are significantly related to the target variable.

Note: The feature selection options are only available if you select Include feature selection preprocessing step on the Model tab.

Inputs always selected. Using the Field Chooser (button to the right of the text field), select the fields from the dataset that are always to be used when building the Bayesian network model. Note that the target field is always selected.

Maximum number of inputs. Specify the total number of inputs from the dataset to be used when building the Bayesian network model. The highest number you can enter is the total number of inputs in the dataset.

Note: If the number of fields selected in Inputs always selected exceeds the value of Maximum number of inputs, an error message is displayed.

Bayesian Network Model Nuggets

Figure 7-4 Bayesian Network and associated Predictor Importance model details

Note: If you selected Continue training existing parameters on the modeling node Model tab, the information shown on the model nugget Model tab is updated each time you regenerate the model.

The model nugget Model tab is split into two panels:

Left Panel

Basic. This view contains a network graph of nodes that displays the relationship between the target and its most important predictors, as well as the relationships between the predictors. The importance of each predictor is shown by the density of its color; a strong color shows an important predictor, and vice versa. The bin values for nodes representing a range are displayed in a popup ToolTip when you hover the mouse pointer over the node. You can use IBM® SPSS® Modeler’s graph tools to interact with, edit, and save the graph, for example, for use in other applications such as MS Word.
Tip: If the network contains a lot of nodes, you can click on a node and drag it to make the graph more legible.

Distribution. This view displays the conditional probabilities for each node in the network as a mini graph. Hover the mouse pointer over a graph to display its values in a popup ToolTip.

Right Panel

Predictor Importance. This displays a chart that indicates the relative importance of each predictor in estimating the model. For more information, see the topic Predictor Importance in Chapter 3 on p. 51.

Conditional Probabilities. When you select a node or mini distribution graph in the left panel, the associated conditional probabilities table is displayed in the right panel. This table contains the conditional probability value for each node value and each combination of values in its parent nodes. In addition, it includes the number of records observed for each node value and each combination of values in the parent nodes.

Bayesian Network Model Settings

The Settings tab for a Bayesian Network model nugget specifies options for modifying the built model. For example, you may use the Bayesian Network node to build several different models using the same data and settings, then use this tab in each model to slightly modify the settings to see how that affects the results.

Note: This tab is only available after the model nugget has been added to a stream.

Figure 7-5 Settings tab for a Bayesian Network model

Calculate raw propensity scores. For models with a flag target (which return a yes or no prediction), you can request propensity scores that indicate the likelihood of the true outcome specified for the target field. These are in addition to other prediction and confidence values that may be generated during scoring.

Calculate adjusted propensity scores. Raw propensity scores are based only on the training data and may be overly optimistic due to the tendency of many models to overfit this data.
Adjusted propensities attempt to compensate by evaluating model performance against a test or validation partition. This option requires that a partition field be defined in the stream and adjusted propensity scores be enabled in the modeling node before generating the model.

Append all probabilities. Specifies whether probabilities for each category of the output field are added to each record processed by the node. If this option is not selected, the probability of only the predicted category is added. The default setting of this check box is determined by the corresponding check box on the Expert tab of the modeling node. For more information, see the topic Bayesian Network Node Expert Options on p. 183.

Bayesian Network Model Summary

Figure 7-6 Summary tab for a Bayesian Network model

The Summary tab of a model nugget displays information about the model itself (Analysis), fields used in the model (Fields), settings used when building the model (Build Settings), and model training (Training Summary).

When you first browse the node, the Summary tab results are collapsed. To see the results of interest, use the expander control to the left of an item to unfold it, or click the Expand All button to show all results. To hide the results when you have finished viewing them, use the expander control to collapse the specific results that you want to hide, or click the Collapse All button to collapse all results.

Analysis. Displays information about the specific model.

Fields. Lists the fields used as the target and the inputs in building the model.

Build Settings. Contains information about the settings used in building the model.

Training Summary. Shows the type of model, the stream used to create it, the user who created it, when it was built, and the elapsed time for building the model.
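To close the chapter's discussion of parameter learning, a conditional probability table of the kind shown in the nugget's right panel is, in essence, estimated from counts of each node value within each combination of parent values. The sketch below is illustrative only (the field names and data are hypothetical, and this is not SPSS Modeler's algorithm); it shows maximum-likelihood estimation with an optional pseudo-count, in the spirit of the Bayes adjustment for small cell counts option described earlier:

```python
from collections import Counter, defaultdict

def conditional_table(records, node, parents, values, smoothing=0.0):
    """Estimate P(node | parents) from records (a list of dicts).
    smoothing > 0 adds a pseudo-count to every cell, pulling
    zero-count estimates away from 0 (a Laplace-style adjustment)."""
    counts = defaultdict(Counter)
    for r in records:
        key = tuple(r[p] for p in parents)   # combination of parent values
        counts[key][r[node]] += 1
    table = {}
    for key, c in counts.items():
        total = sum(c.values()) + smoothing * len(values)
        table[key] = {v: (c[v] + smoothing) / total for v in values}
    return table
```

For instance, if "cancer" is never observed among non-smokers in a tiny sample, the maximum-likelihood cell is exactly 0; with smoothing=1 it becomes a small positive probability instead, which is the effect the smoothing option is designed to achieve.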
Chapter 8 Neural Networks

A neural network can approximate a wide range of predictive models with minimal demands on model structure and assumptions. The form of the relationships is determined during the learning process. If a linear relationship between the target and predictors is appropriate, the results of the neural network should closely approximate those of a traditional linear model. If a nonlinear relationship is more appropriate, the neural network will automatically approximate the “correct” model structure.

The trade-off for this flexibility is that the neural network is not easily interpretable. If you are trying to explain an underlying process that produces the relationships between the target and predictors, it would be better to use a more traditional statistical model. However, if model interpretability is not important, you can obtain good predictions using a neural network.

Figure 8-1 Fields tab

Field requirements. There must be at least one Target and one Input. Fields set to Both or None are ignored. There are no measurement level restrictions on targets or predictors (inputs). For more information, see the topic Modeling Node Fields Options in Chapter 3 on p. 35.

The Neural Networks Model

Neural networks are simple models of the way the nervous system operates. The basic units are neurons, which are typically organized into layers, as shown in the following figure.

Figure 8-2 Structure of a neural network

A neural network is a simplified model of the way the human brain processes information. It works by simulating a large number of interconnected processing units that resemble abstract versions of neurons. The processing units are arranged in layers. There are typically three parts in a neural network: an input layer, with units representing the input fields; one or more hidden layers; and an output layer, with a unit or units representing the target field(s).
The units are connected with varying connection strengths (or weights). Input data are presented to the first layer, and values are propagated from each neuron to every neuron in the next layer. Eventually, a result is delivered from the output layer.

The network learns by examining individual records, generating a prediction for each record, and making adjustments to the weights whenever it makes an incorrect prediction. This process is repeated many times, and the network continues to improve its predictions until one or more of the stopping criteria have been met.

Initially, all weights are random, and the answers that come out of the net are probably nonsensical. The network learns through training. Examples for which the output is known are repeatedly presented to the network, and the answers it gives are compared to the known outcomes. Information from this comparison is passed back through the network, gradually changing the weights. As training progresses, the network becomes increasingly accurate in replicating the known outcomes. Once trained, the network can be applied to future cases where the outcome is unknown.

Using Neural Networks with Legacy Streams

Version 14 of IBM® SPSS® Modeler introduced a new Neural Net node, supporting boosting and bagging techniques and optimization for very large datasets. Existing streams containing the old node will still build and score models in this release. However, this support will be removed in a future release, so we recommend using the new version from now on.

From version 13 onwards, fields with unknown values (that is, values not present in the training data) are no longer automatically treated as missing values, and are scored with the value $null$. Thus, if you want to score fields with unknown values as non-null using an older (pre-13) Neural Net model in version 13 or later, you should mark unknown values as missing values (for example, by means of the Type node).
Note that, for compatibility, any legacy streams that still contain the old node may still be using the Limit set size option from Tools > Stream Properties > Options; this option only applies to Kohonen nets and K-Means nodes from version 14 onwards.

Objectives

Figure 8-3 Objectives settings

What do you want to do?

Build a new model. Build a completely new model. This is the usual operation of the node.

Continue training an existing model. Training continues with the last model successfully produced by the node. This makes it possible to update or refresh an existing model without having to access the original data, and may result in significantly faster performance since only the new or updated records are fed into the stream. Details on the previous model are stored with the modeling node, making it possible to use this option even if the previous model nugget is no longer available in the stream or Models palette.

Note: When this option is enabled, all other controls on the Fields and Build Options tabs are disabled.

What is your main objective?

Create a standard model. The method builds a single model to predict the target using the predictors. Generally speaking, standard models are easier to interpret and can be faster to score than boosted, bagged, or large dataset ensembles.

Enhance model accuracy (boosting). The method builds an ensemble model using boosting, which generates a sequence of models to obtain more accurate predictions. Ensembles can take longer to build and to score than a standard model.

Boosting produces a succession of “component models”, each of which is built on the entire dataset. Prior to building each successive component model, the records are weighted based on the previous component model’s residuals. Cases with large residuals are given relatively higher analysis weights so that the next component model will focus on predicting these records well. Together these component models form an ensemble model.
The ensemble model scores new records using a combining rule; the available rules depend upon the measurement level of the target.

Enhance model stability (bagging). The method builds an ensemble model using bagging (bootstrap aggregating), which generates multiple models to obtain more reliable predictions. Ensembles can take longer to build and to score than a standard model.

Bootstrap aggregation (bagging) produces replicates of the training dataset by sampling with replacement from the original dataset. This creates bootstrap samples of equal size to the original dataset. Then a “component model” is built on each replicate. Together these component models form an ensemble model. The ensemble model scores new records using a combining rule; the available rules depend upon the measurement level of the target.

Create a model for very large datasets (requires IBM® SPSS® Modeler Server). The method builds an ensemble model by splitting the dataset into separate data blocks. Choose this option if your dataset is too large to build any of the models above, or for incremental model building. This option can take less time to build, but can take longer to score than a standard model. This option requires SPSS Modeler Server connectivity.

When there are multiple targets, this method will only create a standard model, regardless of the selected objective.

Basics

Figure 8-4 Basics settings

Neural network model. The type of model determines how the network connects the predictors to the targets through the hidden layer(s). The multilayer perceptron (MLP) allows for more complex relationships at the possible cost of increasing the training and scoring time. The radial basis function (RBF) may have lower training and scoring times, at the possible cost of reduced predictive power compared to the MLP.

Hidden Layers. The hidden layer(s) of a neural network contain unobservable units.
The value of each hidden unit is some function of the predictors; the exact form of the function depends in part upon the network type. A multilayer perceptron can have one or two hidden layers; a radial basis function network can have one hidden layer. Automatically compute number of units. This option builds a network with one hidden layer and computes the “best” number of units in the hidden layer. Customize number of units. This option allows you to specify the number of units in each hidden layer. The first hidden layer must have at least one unit. Specifying 0 units for the second hidden layer builds a multilayer perceptron with a single hidden layer. Note: You should choose values so that the number of nodes does not exceed the number of continuous predictors plus the total number of categories across all categorical (flag, nominal, and ordinal) predictors. 194 Chapter 8 Stopping Rules Figure 8-5 Stopping Rules settings These are the rules that determine when to stop training multilayer perceptron networks; these settings are ignored when the radial basis function algorithm is used. Training proceeds through at least one cycle (data pass), and can then be stopped according to the following criteria. Use maximum training time (per component model). Choose whether to specify a maximum number of minutes for the algorithm to run. Specify a number greater than 0. When an ensemble model is built, this is the training time allowed for each component model of the ensemble. Note that training may go a bit beyond the specified time limit in order to complete the current cycle. Customize number of maximum training cycles. The maximum number of training cycles allowed. If the maximum number of cycles is exceeded, then training stops. Specify an integer greater than 0. Use minimum accuracy. With this option, training will continue until the specified accuracy is attained. 
This may never happen, but you can interrupt training at any point and save the net with the best accuracy achieved so far. The training algorithm will also stop if the error in the overfit prevention set does not decrease after each cycle, if the relative change in the training error is small, or if the ratio of the current training error is small compared to the initial error. 195 Neural Networks Ensembles Figure 8-6 Ensembles settings These settings determine the behavior of ensembling that occurs when boosting, bagging, or very large datasets are requested in Objectives. Options that do not apply to the selected objective are ignored. Bagging and Very Large Datasets. When scoring an ensemble, this is the rule used to combine the predicted values from the base models to compute the ensemble score value. Default combining rule for categorical targets. Ensemble predicted values for categorical targets can be combined using voting, highest probability, or highest mean probability. Voting selects the category that has the highest probability most often across the base models. Highest probability selects the category that achieves the single highest probability across all base models. Highest mean probability selects the category with the highest value when the category probabilities are averaged across base models. Default combining rule for continuous targets. Ensemble predicted values for continuous targets can be combined using the mean or median of the predicted values from the base models. Note that when the objective is to enhance model accuracy, the combining rule selections are ignored. Boosting always uses a weighted majority vote to score categorical targets and a weighted median to score continuous targets. Boosting and Bagging. Specify the number of base models to build when the objective is to enhance model accuracy or stability; for bagging, this is the number of bootstrap samples. It should be a positive integer. 
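The three combining rules for categorical targets can be sketched as follows. This is a simplified Python illustration under the assumption that each base model returns a dictionary of category probabilities; it is not the actual SPSS Modeler implementation.

```python
# Sketch of the combining rules for categorical targets described above.
# Each base model is assumed to return a dict of category probabilities.

def voting(base_predictions):
    # Category that has the highest probability most often across models.
    votes = {}
    for probs in base_predictions:
        winner = max(probs, key=probs.get)
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)

def highest_probability(base_predictions):
    # Category achieving the single highest probability in any base model.
    best_cat, best_p = None, -1.0
    for probs in base_predictions:
        for cat, p in probs.items():
            if p > best_p:
                best_cat, best_p = cat, p
    return best_cat

def highest_mean_probability(base_predictions):
    # Category with the highest probability averaged across base models.
    totals = {}
    for probs in base_predictions:
        for cat, p in probs.items():
            totals[cat] = totals.get(cat, 0.0) + p
    return max(totals, key=totals.get)

preds = [{"yes": 0.60, "no": 0.40},
         {"yes": 0.55, "no": 0.45},
         {"yes": 0.10, "no": 0.90}]
```

Note that the three rules can disagree: on the example predictions above, voting favors "yes" (two of three winners), while the single highest probability (0.90) and the highest mean probability both favor "no".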
196 Chapter 8 Advanced Figure 8-7 Advanced settings Advanced settings provide control over options that do not fit neatly into other groups of settings. Overfit prevention set. The neural network method internally separates records into a model building set and an overfit prevention set, which is an independent set of data records used to track errors during training in order to prevent the method from modeling chance variation in the data. Specify a percentage of records. The default is 30. Replicate results. Setting a random seed allows you to replicate analyses. Specify an integer or click Generate, which will create a pseudo-random integer between 1 and 2147483647, inclusive. By default, analyses are replicated with seed 229176228. Missing values in predictors. This specifies how to treat missing values. Delete listwise removes records with missing values on predictors from model building. Impute missing values will replace missing values in predictors and use those records in the analysis. Continuous fields impute the average of the minimum and maximum observed values; categorical fields impute the most frequently occurring category. Note that records with missing values on any other field specified on the Fields tab are always removed from model building. 197 Neural Networks Model Options Figure 8-8 Model Options tab Model Name. You can generate the model name automatically based on the target fields or specify a custom name. The automatically generated name is the target field name. If there are multiple targets, then the model name is the field names in order, connected by ampersands. For example, if field1 field2 field3 are targets, then the model name is: field1 & field2 & field3. Make Available for Scoring. When the model is scored, the selected items in this group should be produced. The predicted value (for all targets) and confidence (for categorical targets) are always computed when the model is scored. 
The computed confidence can be based on the probability of the predicted value (the highest predicted probability) or the difference between the highest predicted probability and the second highest predicted probability. Predicted probability for categorical targets. This produces the predicted probabilities for categorical targets. A field is created for each category. Propensity scores for flag targets. For models with a flag target (which return a yes or no prediction), you can request propensity scores that indicate the likelihood of the true outcome specified for the target field. The model produces raw propensity scores; if partitions are in effect, the model also produces adjusted propensity scores based on the testing partition. For more information, see the topic Propensity Scores in Chapter 3 on p. 41. Model Summary Figure 8-9 Neural Networks Model Summary view The Model Summary view is a snapshot, at-a-glance summary of the neural network predictive or classification accuracy. Model summary. The table identifies the target, the type of neural network trained, the stopping rule that stopped training (shown if a multilayer perceptron network was trained), and the number of neurons in each hidden layer of the network. Neural Network Quality. The chart displays the accuracy of the final model, which is presented in larger is better format. For a categorical target, this is simply the percentage of records for which the predicted value matches the observed value. For a continuous target, this is 1 minus the ratio of the mean absolute error in prediction (the average of the absolute values of the predicted values minus the observed values) to the range of predicted values (the maximum predicted value minus the minimum predicted value). Multiple targets. If there are multiple targets, then each target is displayed in the Target row of the table. The accuracy displayed in the chart is the average of the individual target accuracies.
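The continuous-target accuracy described above can be computed directly from the stated formula. A minimal sketch (the function name is ours; illustration only):

```python
# Accuracy for a continuous target as described in the Model Summary
# view: 1 minus (mean absolute error / range of the predicted values).

def continuous_accuracy(predicted, observed):
    mae = sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)
    value_range = max(predicted) - min(predicted)
    return 1.0 - mae / value_range

acc = continuous_accuracy([2.0, 4.0, 6.0], [2.5, 4.0, 5.5])
```

Here the mean absolute error is 1/3 and the predicted range is 4, giving an accuracy of 1 - (1/3)/4, roughly 0.917.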
199 Neural Networks Predictor Importance Figure 8-10 Predictor Importance view Typically, you will want to focus your modeling efforts on the predictor fields that matter most and consider dropping or ignoring those that matter least. The predictor importance chart helps you do this by indicating the relative importance of each predictor in estimating the model. Since the values are relative, the sum of the values for all predictors on the display is 1.0. Predictor importance does not relate to model accuracy. It just relates to the importance of each predictor in making a prediction, not whether or not the prediction is accurate. Multiple targets. If there are multiple targets, then each target is displayed in a separate chart and there is a Target dropdown list that controls which target to display. 200 Chapter 8 Predicted By Observed Figure 8-11 Predicted By Observed view For continuous targets, this displays a binned scatterplot of the predicted values on the vertical axis by the observed values on the horizontal axis. Multiple targets. If there are multiple continuous targets, then each target is displayed in a separate chart and there is a Target dropdown list that controls which target to display. 201 Neural Networks Classification Figure 8-12 Classification view, row percents style For categorical targets, this displays the cross-classification of observed versus predicted values in a heat map, plus the overall percent correct. Table styles. There are several different display styles, which are accessible from the Style dropdown list. Row percents. This displays the row percentages (the cell counts expressed as a percent of the row totals) in the cells. This is the default. Cell counts. This displays the cell counts in the cells. The shading for the heat map is still based on the row percentages. Heat map. This displays no values in the cells, just the shading. Compressed. This displays no row or column headings, or values in the cells. 
It can be useful when the target has a lot of categories. Missing. If any records have missing values on the target, they are displayed in a (Missing) row under all valid rows. Records with missing values do not contribute to the overall percent correct. Multiple targets. If there are multiple categorical targets, then each target is displayed in a separate table and there is a Target dropdown list that controls which target to display. Large tables. If the displayed target has more than 100 categories, no table is displayed. 202 Chapter 8 Network Figure 8-13 Network view, inputs on the left, effects style This displays a graphical representation of the neural network. Chart styles. There are two different display styles, which are accessible from the Style dropdown list. Effects. This displays each predictor and target as one node in the diagram irrespective of whether the measurement scale is continuous or categorical. This is the default. Coefficients. This displays multiple indicator nodes for categorical predictors and targets. The connecting lines in the coefficients-style diagram are colored based on the estimated value of the synaptic weight. Diagram orientation. By default, the network diagram is arranged with the inputs on the left and the targets on the right. Using toolbar controls, you can change the orientation so that inputs are on top and targets on the bottom, or inputs on the bottom and targets on top. 203 Neural Networks Predictor importance. Connecting lines in the diagram are weighted based on predictor importance, with greater line width corresponding to greater importance. There is a Predictor Importance slider in the toolbar that controls which predictors are shown in the network diagram. This does not change the model, but simply allows you to focus on the most important predictors. Multiple targets. If there are multiple targets, all targets are displayed in the chart. 
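The row-percents style of the Classification view described earlier expresses each cell count as a percentage of its row total. A minimal sketch (illustration only, not SPSS Modeler code):

```python
# Sketch of the row-percents table style from the Classification view:
# each cell count is expressed as a percentage of its row (observed
# category) total. Heat-map shading would be based on these values.

def row_percents(counts):
    """counts[i][j] = number of records observed as i, predicted as j."""
    table = []
    for row in counts:
        total = sum(row)
        table.append([100.0 * c / total for c in row])
    return table

pct = row_percents([[40, 10],
                    [5, 45]])
```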
Settings Figure 8-14 Settings tab When the model is scored, the selected items in this tab should be produced. The predicted value (for all targets) and confidence (for categorical targets) are always computed when the model is scored. The computed confidence can be based on the probability of the predicted value (the highest predicted probability) or the difference between the highest predicted probability and the second highest predicted probability. Predicted probability for categorical targets. This produces the predicted probabilities for categorical targets. A field is created for each category. Propensity scores for flag targets. For models with a flag target (which return a yes or no prediction), you can request propensity scores that indicate the likelihood of the true outcome specified for the target field. The model produces raw propensity scores; if partitions are in effect, the model also produces adjusted propensity scores based on the testing partition. For more information, see the topic Propensity Scores in Chapter 3 on p. 41. Generate SQL for this model. When using data from a database, SQL code can be pushed back to the database for execution, providing superior performance for many operations. Score by converting to native SQL. If selected, generates SQL to score the model natively within the application. Chapter 9 Decision List Decision List models identify subgroups or segments that show a higher or lower likelihood of a binary (yes or no) outcome relative to the overall sample. For example, you might look for customers who are least likely to churn or most likely to say yes to a particular offer or campaign. The Decision List Viewer gives you complete control over the model, allowing you to edit segments, add your own business rules, specify how each segment is scored, and customize the model in a number of other ways to optimize the proportion of hits across all segments. 
As such, it is particularly well-suited for generating mailing lists or otherwise identifying which records to target for a particular campaign. You can also use multiple mining tasks to combine modeling approaches—for example, by identifying high- and low-performing segments within the same model and including or excluding each in the scoring stage as appropriate. Figure 9-1 Decision List model Segments, Rules, and Conditions A model consists of a list of segments, each of which is defined by a rule that selects matching records. A given rule may have multiple conditions; for example: RFM_SCORE > 10 and MONTHS_CURRENT <= 9 Rules are applied in the order listed, with the first matching rule determining the outcome for a given record. Taken independently, rules or conditions may overlap, but the order of rules resolves ambiguity. If no rule matches, the record is assigned to the remainder rule. Complete Control over Scoring The Decision List Viewer allows you to view, modify, and reorganize segments and to choose which to include or exclude for purposes of scoring. For example, you can choose to exclude one group of customers from future offers and include others and immediately see how this affects your overall hit rate. Decision List models return a score of Yes for included segments and $null$ for everything else, including the remainder. This direct control over scoring makes Decision List models ideal for generating mailing lists, and they are widely used in customer relationship management, including call center or marketing applications. Figure 9-2 Decision List model Mining Tasks, Measures, and Selections The modeling process is driven by mining tasks. Each mining task effectively initiates a new modeling run and returns a new set of alternative models to choose from.
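The first-hit rule evaluation described above can be sketched as follows. This is a simplified Python illustration: the first segment reuses the example rule from the text, the second is a hypothetical excluded segment, and None stands in for $null$.

```python
# Sketch of first-hit Decision List scoring as described above: rules
# are applied in order, the first matching segment determines the
# outcome, and records matching no rule fall into the remainder.
# Included segments score "Yes"; everything else scores None ($null$).

segments = [
    # (condition, included in scoring?)
    (lambda r: r["RFM_SCORE"] > 10 and r["MONTHS_CURRENT"] <= 9, True),
    (lambda r: r["RFM_SCORE"] > 5, False),   # hypothetical excluded segment
]

def score(record):
    for condition, included in segments:
        if condition(record):
            return "Yes" if included else None
    return None  # remainder

result = score({"RFM_SCORE": 12, "MONTHS_CURRENT": 6})
```

A record matching the first (included) segment scores "Yes"; a record matching only the excluded segment, or no segment at all, scores None.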
The default task is based on your initial specifications in the Decision List node, but you can define any number of custom tasks. You can also apply tasks iteratively—for example, you can run a high probability search on the entire training set and then run a low probability search on the remainder to weed out low-performing segments. 207 Decision List Figure 9-3 Creating a mining task Data Selections You can define data selections and custom model measures for model building and evaluation. For example, you can specify a data selection in a mining task to tailor the model to a specific region and create a custom measure to evaluate how well that model performs on the whole country. Unlike mining tasks, measures don’t change the underlying model but provide another lens to assess how well it performs. 208 Chapter 9 Figure 9-4 Creating a data selection Adding Your Business Knowledge By fine-tuning or extending the segments identified by the algorithm, the Decision List Viewer allows you to incorporate your business knowledge right into the model. You can edit the segments generated by the model or add additional segments based on rules that you specify. You can then apply the changes and preview the results. Figure 9-5 Specifying a rule For further insight, a dynamic link with Excel allows you to export your data to Excel, where it can be used to create presentation charts and to calculate custom measures, such as complex profit and ROI, which can be viewed in the Decision List Viewer while you are building the model. Example. The marketing department of a financial institution wants to achieve more profitable results in future campaigns by matching the right offer to each customer. You can use a Decision List model to identify the characteristics of customers most likely to respond favorably based on previous promotions and to generate a mailing list based on the results. 209 Decision List Requirements. 
A single categorical target field with a measurement level of type Flag or Nominal that indicates the binary outcome you want to predict (yes/no), and at least one input field. When the target field type is Nominal, you must manually choose a single value to be treated as a hit, or response; all the other values are lumped together as not hit. An optional frequency field may also be specified. Continuous date/time fields are ignored. Continuous numeric range inputs are automatically binned by the algorithm as specified on the Expert tab in the modeling node. For finer control over binning, add an upstream binning node and use the binned field as input with a measurement level of Ordinal. Decision List Model Options Figure 9-6 Decision List node: Model tab Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name. Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model. 210 Chapter 9 Create split models. Builds a separate model for each possible value of input fields that are specified as split fields. For more information, see the topic Building Split Models in Chapter 3 on p. 30. Mode. Specifies the method used to build the model. Generate model. Automatically generates a model on the models palette when the node is executed. The resulting model can be added to streams for purposes of scoring but cannot be further edited. Launch interactive session. Opens the interactive Decision List Viewer modeling (output) window, allowing you to pick from multiple alternatives and repeatedly apply the algorithm with different settings to progressively grow or modify the model. For more information, see the topic Decision List Viewer on p. 213. Use saved interactive session information. Launches an interactive session using previously saved settings. 
Interactive settings can be saved from the Decision List Viewer using the Generate menu (to create a model or modeling node) or the File menu (to update the node from which the session was launched). Target value. Specifies the value of the target field that indicates the outcome you want to model. For example, if the target field churn is coded 0 = no and 1 = yes, specify 1 to identify rules that indicate which records are likely to churn. Find segments with. Indicates whether the search for the target variable should look for a High probability or Low probability of occurrence. Finding low-probability segments and excluding them from scoring can be a useful way to improve your model, particularly when the remainder has a low probability. Maximum number of segments. Specifies the maximum number of segments to return. The top N segments are created, where the best segment is the one with the highest probability or, if more than one segment has the same probability, the highest coverage. The minimum allowed setting is 1; there is no maximum setting. Minimum segment size. The two settings below dictate the minimum segment size. The larger of the two values takes precedence. For example, if the percentage value equates to a number higher than the absolute value, the percentage setting takes precedence. As percentage of previous segment (%). Specifies the minimum group size as a percentage of records. The minimum allowed setting is 0; the maximum allowed setting is 99.9. As absolute value (N). Specifies the minimum group size as an absolute number of records. The minimum allowed setting is 1; there is no maximum setting. Segment rules. Maximum number of attributes. Specifies the maximum number of conditions per segment rule. The minimum allowed setting is 1; there is no maximum setting. Allow attribute re-use. When enabled, each cycle can consider all attributes, even those that have been used in previous cycles.
The conditions for a segment are built up in cycles, where each cycle adds a new condition. The number of cycles is defined using the Maximum number of attributes setting. 211 Decision List Confidence interval for new conditions (%). Specifies the confidence level for testing segment significance. This setting plays a significant role in the number of segments (if any) that are returned as well as the number-of-conditions-per-segment rule. The higher the value, the smaller the returned result set. The minimum allowed setting is 50; the maximum allowed setting is 99.9. Decision List Node Expert Options Figure 9-7 Decision List node: Expert tab Expert options allow you to fine-tune the model-building process. Binning method. The method used for binning continuous fields (equal count or equal width). Number of bins. The number of bins to create for continuous fields. The minimum allowed setting is 2; there is no maximum setting. Model search width. The maximum number of model results per cycle that can be used for the next cycle. The minimum allowed setting is 1; there is no maximum setting. Rule search width. The maximum number of rule results per cycle that can be used for the next cycle. The minimum allowed setting is 1; there is no maximum setting. Bin merging factor. The minimum amount by which a segment must grow when merged with its neighbor. The minimum allowed setting is 1.01; there is no maximum setting. 212 Chapter 9 Allow missing values in conditions. True to allow the IS MISSING test in rules. Discard intermediate results. When True, only the final results of the search process are returned. A final result is a result that is not refined any further in the search process. When False, intermediate results are also returned. Maximum number of alternatives. Specifies the maximum number of alternatives that can be returned upon running the mining task. The minimum allowed setting is 1; there is no maximum setting. 
Note that the mining task will only return the actual number of alternatives, up to the maximum specified. For example, if the maximum is set to 100 and only 3 alternatives are found, only those 3 are shown. Decision List Model Nugget A model consists of a list of segments, each of which is defined by a rule that selects matching records. You can easily view or modify the segments before generating the model and choose which ones to include or exclude. When used in scoring, Decision List models return Yes for included segments and $null$ for everything else, including the remainder. This direct control over scoring makes Decision List models ideal for generating mailing lists, and they are widely used in customer relationship management, including call center or marketing applications. Figure 9-8 Decision List model nugget When you run a stream containing a Decision List model, the node adds three new fields containing the score, either 1 (meaning Yes) for records that fall in included segments or $null$ for records in excluded segments or the remainder, the probability (hit rate) for the segment within which the record falls, and the ID number for the segment. The names of the new fields are derived from the name of the output field being predicted, prefixed with $D- for the score, $DP- for the probability, and $DI- for the segment ID. The model is scored based on the target value specified when the model was built. You can manually exclude segments so that they score as $null$. For example, if you run a low probability search to find segments with lower than average hit rates, these “low” segments will be scored as Yes unless you manually exclude them. If necessary, nulls can be recoded as No using a Derive or Filler node. PMML A Decision List model can be stored as a PMML RuleSetModel with a “first hit” selection criterion. However, all of the rules are expected to have the same score.
To allow for changes to the target field or the target value, multiple rule set models can be stored in one file to be applied in order, cases not matched by the first model being passed to the second, and so on. The algorithm name DecisionList is used to indicate this non-standard behavior, and only rule set models with this name are recognized as Decision List models and scored as such. Decision List Model Nugget Settings The Settings tab for a Decision List model nugget allows you to obtain propensity scores and to enable or disable SQL optimization. This tab is available only after adding the model nugget to a stream. Calculate raw propensity scores. For models with a flag target (which return a yes or no prediction), you can request propensity scores that indicate the likelihood of the true outcome specified for the target field. These are in addition to other prediction and confidence values that may be generated during scoring. Calculate adjusted propensity scores. Raw propensity scores are based only on the training data and may be overly optimistic due to the tendency of many models to overfit this data. Adjusted propensities attempt to compensate by evaluating model performance against a test or validation partition. This option requires that a partition field be defined in the stream and adjusted propensity scores be enabled in the modeling node before generating the model. Score by converting to native SQL. If selected, generates SQL to score the model natively within the application. Decision List Viewer The easy-to-use, task-based Decision List Viewer graphical interface takes the complexity out of the model building process, freeing you from the low-level details of data mining techniques and allowing you to devote your full attention to those parts of the analysis requiring user intervention, such as setting objectives, choosing target groups, analyzing the results, and selecting the optimal model. 
214 Chapter 9 Figure 9-9 Decision List interactive viewer Working Model Pane The working model pane displays the current model, including mining tasks and other actions that apply to the working model. 215 Decision List Figure 9-10 Working model pane ID. Identifies the sequential segment order. Model segments are calculated, in sequence, according to their ID number. Segment Rules. Provides the segment name and defined segment conditions. By default, the segment name is the field name or concatenated field names used in the conditions, with a comma as a separator. Score. Represents the field that you want to predict, whose value is assumed to be related to the values of other fields (the predictors). Note: The following options can be toggled to display via the Organizing Model Measures dialog. Cover. The pie chart visually identifies the coverage each segment has in relation to the entire cover. Cover (n). Lists the coverage for each segment in relation to the entire cover. Frequency. Lists the number of hits received in relation to the cover. For example, when the cover is 79 and the frequency is 50, that means that 50 out of 79 responded for the selected segment. Probability. Indicates the segment probability. For example, when the cover is 79 and the frequency is 50, that means that the probability for the segment is 63.29% (50 divided by 79). Error. Indicates the segment error. The information at the bottom of the pane indicates the cover, frequency, and probability for the entire model. Working Model Toolbar The working model pane provides the following functions via a toolbar. Note: Some functions are also available by right-clicking a model segment. 216 Chapter 9 Table 9-1 Working model toolbar buttons Launches the Generate New Model dialog, which provides options for creating a new model nugget. Saves the current state of the interactive session. 
The Decision List modeling node is updated with the current settings, including mining tasks, model snapshots, data selections, and custom measures. To restore a session to this state, check the Use saved session information box on the Model tab of the modeling node and click Run. Displays the Organize Model Measures dialog. For more information, see the topic Organizing Model Measures on p. 230. Displays the Organize Data Selections dialog. For more information, see the topic Organizing Data Selections on p. 224. Displays the Snapshots tab. For more information, see the topic Snapshots Tab on p. 218. Displays the Alternatives tab. For more information, see the topic Alternatives Tab on p. 216. Takes a snapshot of the current model structure. Snapshots display on the Snapshots tab and are commonly used for model comparison purposes. Launches the Inserting Segments dialog, which provides options for creating new model segments. Launches the Editing Segment Rules dialog, which provides options for adding conditions to model segments or changing previously defined model segment conditions. Moves the selected segment up in the model hierarchy. Moves the selected segment down in the model hierarchy. Deletes the selected segment. Toggles whether the selected segment is included in the model. When excluded, the segment results are added to the remainder. This differs from deleting a segment in that you have the option of reactivating the segment. Alternatives Tab Generated when you click Find Segments, the Alternatives tab lists all alternative mining results for the selected model or segment on the working model pane. E To promote an alternative to be the working model, highlight the required alternative and click Load; the alternative model is displayed in the working model pane. Note: The Alternatives tab is only displayed if you have set Maximum number of alternatives on the Decision List modeling node Expert tab to create more than one alternative. 
217 Decision List Figure 9-11 Alternatives tab Each generated model alternative displays specific model information: Name. Each alternative is sequentially numbered. The first alternative usually contains the best results. Target. Indicates the target value. For example: 1, which equals “true”. No. of Segments. The number of segment rules used in the alternative model. Cover. The coverage of the alternative model. Freq. The number of hits in relation to the cover. Prob. Indicates the probability percentage of the alternative model. Note: Alternative results are not saved with the model; results are valid only during the active session. 218 Chapter 9 Snapshots Tab A snapshot is a view of a model at a specific point in time. For example, you could take a model snapshot when you want to load a different alternative model into the working model pane but do not want to lose the work on the current model. The Snapshots tab lists all model snapshots manually taken for any number of working model states. Note: Snapshots are saved with the model. We recommend that you take a snapshot when you load the first model. This snapshot will then preserve the original model structure, ensuring that you can always return to the original model state. The generated snapshot name displays as a timestamp, indicating when it was generated. Create a Model Snapshot E Select an appropriate model/alternative to display in the working model pane. E Make any necessary changes to the working model. E Click Take Snapshot. A new snapshot is displayed on the Snapshots tab. Figure 9-12 Snapshots tab 219 Decision List Name. The snapshot name. You can change a snapshot name by double-clicking the snapshot name. Target. Indicates the target value. For example: 1, which equals “true”. No. of Segments. The number of segment rules used in the model. Cover. The coverage of the model. Freq. The number of hits in relation to the cover. Prob. Indicates the probability percentage of the model. 
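The cover, frequency, and probability measures that appear throughout these panes are related by a simple ratio: probability is the segment frequency (hits) divided by its cover (records matched). A minimal sketch, using the numbers from the example given earlier (cover 79, frequency 50); the helper name is ours:

```python
# Sketch of the segment probability measure described earlier:
# probability = frequency (hits) / cover (records matched by the segment).

def segment_probability(frequency, cover):
    return frequency / cover

prob = segment_probability(50, 79)   # the text's example: 50 of 79 responded
```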
E To promote a snapshot to be the working model, highlight the required snapshot and click Load; the snapshot model is displayed in the working model pane. E You can delete a snapshot by clicking Delete or by right-clicking the snapshot and choosing Delete from the menu. Working with Decision List Viewer A model that best predicts customer response and behavior is built in various stages. When Decision List Viewer launches, the working model is populated with the defined model segments and measures, ready for you to start a mining task, modify the segments/measures as required, and generate a new model or modeling node. You can add one or more segment rules until you have developed a satisfactory model. You can add segment rules to the model by running mining tasks or by using the Edit Segment Rule function. In the model building process, you can assess the performance of the model by validating the model against measure data, by visualizing the model in a chart, or by generating custom Excel measures. When you feel certain about the model’s quality, you can generate a new model and place it on the IBM® SPSS® Modeler canvas or Model palette. Mining Tasks A mining task is a collection of parameters that determines the way new rules are generated. Some of these parameters are selectable to provide you with the flexibility to adapt models to new situations. A task consists of a task template (type), a target, and a build selection (mining dataset). The following sections detail the various mining task operations: Running Mining Tasks Creating and Editing a Mining Task Organizing Data Selections Running Mining Tasks Decision List Viewer allows you to add segment rules to a model by running mining tasks or by copying and pasting segment rules between models. 
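The Cover, Freq., and Prob. columns listed above are related quantities. The following toy computation shows one plausible reading of those descriptions (the exact definitions Decision List Viewer uses are assumed here, and the field names and records are invented):

```python
# Toy computation of per-segment Cover, Freq., and Prob. measures,
# assuming: cover = number of records the segment captures,
# freq = captured records that hit the target value,
# prob = freq / cover (displayed as a percentage in the viewer).

def segment_measures(records, matches, target_field, target_value):
    captured = [r for r in records if matches(r)]  # records the segment covers
    cover = len(captured)
    freq = sum(1 for r in captured if r[target_field] == target_value)
    prob = freq / cover if cover else 0.0
    return cover, freq, prob

records = [
    {"age": 40, "churn": 1}, {"age": 45, "churn": 1},
    {"age": 50, "churn": 0}, {"age": 30, "churn": 1},
]
# Hypothetical segment rule: age >= 35
print(segment_measures(records, lambda r: r["age"] >= 35, "churn", 1))
```

Here the segment covers three of the four records, two of which hit the target, so the probability is two thirds.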
A mining task holds information on how to generate new segment rules (the data mining parameter settings, such as the search strategy, source attributes, search width, confidence level, and so on), the customer behavior to predict, and the data to investigate. The goal of a mining task is to search for the best possible segment rules. To generate a model segment rule by running a mining task: E Click the Remainder row. If there are already segments displayed on the working model pane, you can also select one of the segments to find additional rules based on the selected segment. After selecting the remainder or segment, use one of the following methods to generate the model, or alternative models: From the Tools menu, choose Find Segments. Right-click the Remainder row/segment and choose Find Segments. Click the Find Segments button on the working model pane. While the task is processing, the progress is displayed at the bottom of the workspace and informs you when the task has completed. Precisely how long a task takes to complete depends on the complexity of the mining task and the size of the dataset. If there is only a single model in the results it is displayed on the working model pane as soon as the task completes; however, where the results contain more than one model they are displayed on the Alternatives tab. Note: A task result will either complete with models, complete with no models, or fail. The process of finding new segment rules can be repeated until no new rules are added to the model. This means that all significant groups of customers have been found. It is possible to run a mining task on any existing model segment. If the result of a task is not what you are looking for, you can choose to start another mining task on the same segment. This will provide additional found rules based on the selected segment. 
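A decision list built this way is an ordered set of segment rules: when a record is scored, the first segment whose rule matches determines the outcome, and unmatched records fall to the remainder. A minimal sketch of that first-match behavior, with hypothetical segments and values:

```python
# First-match scoring over an ordered decision list. The segments,
# field names, and probabilities below are hypothetical, for
# illustration only.

def score(record, segments, remainder_prob):
    for seg in segments:                 # list order = model priority
        if seg["rule"](record):
            return seg["probability"]
    return remainder_prob                # record falls to the remainder

segments = [
    {"rule": lambda r: r["income"] == "high", "probability": 0.89},
    {"rule": lambda r: r["age"] >= 35 and r["married"] == "yes",
     "probability": 0.72},
]

print(score({"income": "low", "age": 40, "married": "yes"}, segments, 0.20))
print(score({"income": "low", "age": 20, "married": "no"}, segments, 0.20))
```

Because matching is first-come-first-served, anything that changes a segment also changes which records reach the segments below it; this is why results below an edited segment must be recalculated.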
Segments that are “below” the selected segment (that is, added to the model later than the selected segment) are replaced by the new segments because each segment depends on its predecessors. Creating and Editing a Mining Task A mining task is the mechanism that searches for the collection of rules that make up a data model. Alongside the search criteria defined in the selected template, a task also defines the target (the actual question that motivated the analysis, such as how many customers are likely to respond to a mailing), and it identifies the datasets to be used. The goal of a mining task is to search for the best possible models. Create a mining task To create a mining task: E Select the segment from which you want to mine additional segment conditions. 221 Decision List E Click Settings. The Create/Edit Mining Task dialog opens. This dialog provides options for defining the mining task. E Make any necessary changes and click OK to return to the working model pane. Decision List Viewer uses the settings as the defaults to run for each task until an alternative task or settings is selected. E Click Find Segments to start the mining task on the selected segment. Edit a mining task The Create/Edit Mining Task dialog provides options for defining a new mining task or editing an existing one. Most parameters available for mining tasks are similar to those offered in the Decision List node. The exceptions are shown below. For more information, see the topic Decision List Model Options on p. 209. Figure 9-13 Create/Edit Mining Task dialog 222 Chapter 9 Load Settings: When you have created more than one mining task, select the required task. New... Click to create a new mining task based on the settings of the task currently displayed. Target Target Field: Represents the field that you want to predict, whose value is assumed to be related to the values of other fields (the predictors). Target value. 
Specifies the value of the target field that indicates the outcome you want to model. For example, if the target field churn is coded 0 = no and 1 = yes, specify 1 to identify rules that indicate which records are likely to churn. Simple Settings Maximum number of alternatives. Specifies the number of alternatives that will be displayed upon running the mining task. The minimum allowed setting is 1; there is no maximum setting. Expert Settings Edit... Opens the Edit Advanced Parameters dialog that allows you to define the advanced settings. For more information, see the topic Edit Advanced Parameters on p. 222. Data Build selection. Provides options for specifying the evaluation measure that Decision List Viewer should analyze to find new rules. The listed evaluation measures are created/edited in the Organize Data Selections dialog. Available fields. Provides options for displaying all fields or manually selecting which fields to display. Edit... If the Custom option is selected, this opens the Customize Available Fields dialog that allows you to select which fields are available as segment attributes found by the mining task. For more information, see the topic Customize Available Fields on p. 223. Edit Advanced Parameters Figure 9-14 Advanced Parameters The Edit Advanced Parameters dialog provides the following configuration options. 223 Decision List Binning method. The method used for binning continuous fields (equal count or equal width). Number of bins. The number of bins to create for continuous fields. The minimum allowed setting is 2; there is no maximum setting. Model search width. The maximum number of model results per cycle that can be used for the next cycle. The minimum allowed setting is 1; there is no maximum setting. Rule search width. The maximum number of rule results per cycle that can be used for the next cycle. The minimum allowed setting is 1; there is no maximum setting. Bin merging factor. 
The minimum amount by which a segment must grow when merged with its neighbor. The minimum allowed setting is 1.01; there is no maximum setting. Allow missing values in conditions. True to allow the IS MISSING test in rules. Discard intermediate results. When True, only the final results of the search process are returned. A final result is a result that is not refined any further in the search process. When False, intermediate results are also returned. Customize Available Fields Figure 9-15 Customize Available Fields dialog The Customize Available Fields dialog allows you to select which fields are available as segment attributes found by the mining task. Available. Lists the fields that are currently available as segment attributes. To remove fields from the list, select the appropriate fields and click Remove >>. The selected fields move from the Available list to the Not Available list. Not Available. Lists the fields that are not available as segment attributes. To include the fields in the available list, select the appropriate fields and click << Add. The selected fields move from the Not Available list to the Available list. 224 Chapter 9 Organizing Data Selections By organizing data selections (a mining dataset), you can specify which evaluation measures Decision List Viewer should analyze to find new rules and select which data selections are used as the basis for measures. To organize data selections: E From the Tools menu, choose Organize Data Selections, or right-click a segment and choose the option. The Organize Data Selections dialog opens. Figure 9-16 Organize Data Selections dialog Note: The Organize Data Selections dialog also allows you to edit or delete existing data selections. E Click the Add new data selection button. A new data selection entry is added to the existing table. E Click Name and enter an appropriate selection name. E Click Partition and select an appropriate partition type. 
E Click Condition and select an appropriate condition option. When Specify is selected, the Specify Selection Condition dialog opens, providing options for defining specific field conditions. 225 Decision List Figure 9-17 Specify Selection Condition dialog E Define the appropriate condition and click OK. The data selections are available from the Build Selection drop-down list in the Create/Edit Mining Task dialog. The list allows you to select which evaluation measure is used for a particular mining task. Segment Rules You find model segment rules by running a mining task based on a task template. You can manually add segment rules to a model using the Insert Segment or Edit Segment Rule functions. If you choose to mine for new segment rules, the results, if any, are displayed on the Viewer tab of the Interactive List dialog. You can quickly refine your model by selecting one of the alternative results from the Model Albums dialog and clicking Load. In this way, you can experiment with differing results until you are ready to build a model that accurately describes your optimum target group. Inserting Segments You can manually add segment rules to a model using the Insert Segment function. To add a segment rule condition to a model: E In the Interactive List dialog, select a location where you want to add a new segment. The new segment will be inserted directly above the selected segment. E From the Edit menu, choose Insert Segment, or access this selection by right-clicking a segment. The Insert Segment dialog opens, allowing you to insert new segment rule conditions. E Click Insert. The Insert Condition dialog opens, allowing you to define the attributes for the new rule condition. 226 Chapter 9 E Select a field and an operator from the drop-down lists. Note: If you select the Not in operator, the selected condition will function as an exclusion condition and displays in red in the Insert Rule dialog. 
For example, when the condition region = 'TOWN' displays in red, it means that TOWN is excluded from the result set. E Enter one or more values or click the Insert Value icon to display the Insert Value dialog. The dialog allows you to choose a value defined for the selected field. For example, the field married will provide the values yes and no. E Click OK to return to the Insert Segment dialog. Click OK a second time to add the created segment to the model. The new segment will display in the specified model location. Editing Segment Rules The Edit Segment Rule functionality allows you to add, change, or delete segment rule conditions. To change a segment rule condition: E Select the model segment that you want to edit. E From the Edit menu, choose Edit Segment Rule, or right-click on the rule to access this selection. The Edit Segment Rule dialog opens. E Select the appropriate condition and click Edit. The Edit Condition dialog opens, allowing you to define the attributes for the selected rule condition. E Select a field and an operator from the drop-down lists. Note: If you select the Not in operator, the selected condition will function as an exclusion condition and displays in red in the Edit Segment Rule dialog. For example, when the condition region = 'TOWN' displays in red, it means that TOWN is excluded from the result set. E Enter one or more values or click the Insert Value button to display the Insert Value dialog. The dialog allows you to choose a value defined for the selected field. For example, the field married will provide the values yes and no. E Click OK to return to the Edit Segment Rule dialog. Click OK a second time to return to the working model. The selected segment will display with the updated rule conditions. Deleting Segment Rule Conditions To delete a segment rule condition: E Select the model segment containing the rule conditions that you want to delete. 
227 Decision List E From the Edit menu, choose Edit Segment Rule, or right-click on the segment to access this selection. The Edit Segment Rule dialog opens, allowing you to delete one or more segment rule conditions. E Select the appropriate rule condition and click Delete. E Click OK. Deleting one or more segment rule conditions causes the working model pane to refresh its measure metrics. Copying Segments Decision List Viewer provides you with a convenient way to copy model segments. When you want to apply a segment from one model to another model, simply copy (or cut) the segment from one model and paste it into another model. You can also copy a segment from a model displayed in the Alternative Preview panel and paste it into the model displayed in the working model pane. These cut, copy, and paste functions use a system clipboard to store or retrieve temporary data. This means that in the clipboard the conditions and target are copied. The clipboard contents are not solely reserved to be used in Decision List Viewer but can also be pasted in other applications. For example, when the clipboard contents are pasted in a text editor, the conditions and target are pasted in XML-format. To copy or cut model segments: E Select the model segment that you want to use in another model. E From the Edit menu, choose Copy (or Cut), or right-click on the model segment and select Copy or Cut. E Open the appropriate model (where the model segment will be pasted). E Select one of the model segments, and click Paste. Note: Instead of the Cut, Copy, and Paste commands you can also use the key combinations: Ctrl+X, Ctrl+C, and Ctrl+V. The copied (or cut) segment is inserted above the previously selected model segment. The measures of the pasted segment and segments below are recalculated. Note: Both models in this procedure must be based on the same underlying model template and contain the same target, otherwise an error message is displayed. 
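The rule-condition semantics used throughout these editing operations can be sketched as follows. A segment rule is a conjunction of conditions, and a Not in condition (shown in red in the dialogs) acts as an exclusion. The operators and field names here are assumptions for illustration, not the product's internal representation:

```python
# A segment rule as a conjunction of conditions. A "not in" condition
# is an exclusion: it rejects records whose field value is in the
# listed set. Field names and values are hypothetical.

def rule_matches(record, conditions):
    for field, op, values in conditions:
        value = record[field]
        if op == "in" and value not in values:
            return False
        if op == "not in" and value in values:   # exclusion condition
            return False
    return True

conditions = [("married", "in", {"yes"}),
              ("region", "not in", {"TOWN"})]    # TOWN excluded from results

print(rule_matches({"married": "yes", "region": "CITY"}, conditions))
print(rule_matches({"married": "yes", "region": "TOWN"}, conditions))
```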
Alternative Models Where there is more than one result, the Alternatives tab displays the results of each mining task. Each result consists of the conditions in the selected data that most closely match the target, as well as any “good enough” alternatives. The total number of alternatives shown depends on the search criteria used in the analysis process. 228 Chapter 9 To view alternative models: E Click on an alternative model on the Alternatives tab. The alternative model segments display, or replace the current model segments, in the Alternative Preview panel. E To work with an alternative model in the working model pane, select the model and click Load in the Alternative Preview panel or right-click an alternative name on the Alternatives tab and choose Load. Note: Alternative models are not saved when you generate a new model. Customizing a Model Data are not static. Customers move, get married, and change jobs. Products lose market focus and become obsolete. Decision List Viewer offers business users the flexibility to adapt models to new situations easily and quickly. You can change a model by editing, prioritizing, deleting, or inactivating specific model segments. Prioritizing Segments You can rank model rules in any order you choose. By default, model segments are displayed in order of priority, the first segment having the highest priority. When you assign a different priority to one or more of the segments, the model is changed accordingly. You may alter the model as required by moving segments to a higher or lower priority position. To prioritize model segments: E Select the model segment to which you want to assign a different priority. E Click one of the two arrow buttons on the working model pane toolbar to move the selected model segment up or down the list. After prioritization, all previous assessment results are recalculated and the new values are displayed. Deleting Segments To delete one or more segments: E Select a model segment. 
E From the Edit menu, choose Delete Segment, or click the delete button on the toolbar of the working model pane. The measures are recalculated for the modified model, and the model is changed accordingly. 229 Decision List Excluding Segments As you are searching for particular groups, you will probably base business actions on a selection of the model segments. When deploying a model, you may choose to exclude segments within a model. Excluded segments are scored as null values. Excluding a segment does not mean the segment is not used; it means that all records matching this rule are excluded from the mailing list. The rule is still applied but differently. To exclude specific model segments: E Select a segment from the working model pane. E Click the Toggle Segment Exclusion button on the toolbar of the working model pane. Excluded is now displayed in the selected Target column of the selected segment. Note: Unlike deleted segments, excluded segments remain available for reuse in the final model. Excluded segments affect chart results. Change Target Value The Change Target Value dialog allows you to change the target value for the current target field. Snapshots and session results with a different target value than the Working Model are identified by changing the table background for that row to yellow. This indicates that snapshot/session result is outdated. The Create/Edit Mining Task dialog displays the target value for the current working model. The target value is not saved with the mining task. It is instead taken from the Working Model value. When you promote a saved model to the Working Model that has a different target value from the current working model (for example, by editing an alternative result or editing a copy of a snapshot), the target value of the saved model is changed to be the same as the working model (the target value shown in the Working Model pane is not changed). The model metrics are reevaluated with the new target. 
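The segment-exclusion behavior described above can be sketched as follows. This is an assumed reading of "excluded segments are scored as null values": the excluded segment still captures the records its rule matches, but it returns null instead of its probability. All segments and values are hypothetical:

```python
# Sketch of Toggle Segment Exclusion: an excluded segment still
# captures matching records, but scores them as null (None) rather
# than with the segment's probability. This interpretation, and all
# rules and values below, are assumptions for illustration.

def score(record, segments, remainder_prob):
    for seg in segments:
        if seg["rule"](record):
            return None if seg["excluded"] else seg["probability"]
    return remainder_prob

segments = [
    {"rule": lambda r: r["region"] == "TOWN", "excluded": True,
     "probability": 0.80},
    {"rule": lambda r: r["age"] >= 35, "excluded": False,
     "probability": 0.65},
]

print(score({"region": "TOWN", "age": 50}, segments, 0.15))  # null: excluded
print(score({"region": "CITY", "age": 50}, segments, 0.15))
```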
Generate New Model The Generate New Model dialog provides options for naming the model and selecting where the new node is created. Model name. Select Custom to adjust the auto-generated name or to create a unique name for the node as displayed on the stream canvas. Create node on. Selecting Canvas places the new model on the working canvas; selecting GM Palette places the new model on the Models palette; selecting Both places the new model on both the working canvas and the Models palette. Include interactive session state. When enabled, the interactive session state is preserved in the generated model. When you later generate a modeling node from the model, the state is carried over and used to initialize the interactive session. Regardless of whether the option is selected, the model itself scores new data identically. When the option is not selected, the model is still able to 230 Chapter 9 create a build node, but it will be a more generic build node that starts a new interactive session rather than pick up where the old session left off. If you change the node settings but execute with a saved state, the settings you have changed are ignored in favor of the settings from the saved state. Note: The standard metrics are the only metrics that remain with the model. Additional metrics are preserved with the interactive state. The generated model does not represent the saved interactive mining task state. Once you launch the Decision List Viewer, it displays the settings originally made through the Viewer. For more information, see the topic Regenerating a Modeling Node in Chapter 3 on p. 64. Model Assessment Successful modeling requires the careful assessment of the model before implementation in the production environment takes place. Decision List Viewer provides a number of statistical and business measures that can be used to assess the impact of a model in the real world. 
These include gains charts and full interoperability with Excel, thus enabling cost/benefit scenarios to be simulated for assessing the impact of deployment. You can assess your model in the following ways: Using the predefined statistical and business model measures available in Decision List Viewer (probability, frequency). Evaluating measures imported from Microsoft Excel. Visualizing the model using a gains chart. Organizing Model Measures Decision List Viewer provides options for defining the measures that are calculated and displayed as columns. Each segment can include the default cover, frequency, probability, and error measures represented as columns. You can also create new measures that will be displayed as columns. Defining Model Measures To add a measure to your model or to define an existing measure: E From the Tools menu, choose Organize Model Measures, or right-click the model to make this selection. The Organize Model Measures dialog opens. 231 Decision List Figure 9-18 Organize Model Measures dialog E Click the Add new model measure button (to the right of the Show column). A new measure is displayed in the table. E Provide a measure name and select an appropriate type, display option, and selection. The Show column indicates whether the measure will display for the working model. When defining an existing measure, select an appropriate metric and selection and indicate if the measure will display for the working model. E Click OK to return to the Decision List Viewer workspace. If the Show column for the new measure was checked, the new measure will display for the working model. Custom Metrics in Excel For more information, see the topic Assessment in Excel on p. 232. Refreshing Measures In certain cases, it may be necessary to recalculate the model measures, such as when you apply an existing model to a new set of customers. To recalculate (refresh) the model measures: From the Edit menu, choose Refresh All Measures. 232 Chapter 9 or Press F5. 
All measures are recalculated, and the new values are shown for the working model. Assessment in Excel Decision List Viewer can be integrated with Microsoft Excel, allowing you to use your own value calculations and profit formulas directly within the model building process to simulate cost/benefit scenarios. The link with Excel allows you to export data to Excel, where it can be used to create presentation charts, calculate custom measures, such as complex profit and ROI measures, and view them in Decision List Viewer while building the model. Note: In order for you to work with an Excel spreadsheet, the analytical CRM expert has to define configuration information for the synchronization of Decision List Viewer with Microsoft Excel. The configuration is contained in an Excel spreadsheet file and indicates which information is transferred from Decision List Viewer to Excel, and vice versa. The following steps are valid only when MS Excel is installed. If Excel is not installed, the options for synchronizing models with Excel are not displayed. To synchronize models with MS Excel: E Open the model, run an interactive session, and choose Organize Model Measures from the Tools menu. E Select Yes for the Calculate custom measures in Excel option. The Workbook field activates, allowing you to select a preconfigured Excel workbook template. E Click the Connect to Excel button. The Open dialog opens, allowing you to navigate to the preconfigured template location on your local or network file system. E Select the appropriate Excel template and click Open. The selected Excel template launches; use the Windows taskbar (or press Alt-Tab) to navigate back to the Choose Inputs for Custom Measures dialog. E Select the appropriate mappings between the metric names defined in the Excel template and the model metric names and click OK. Once the link is established, Excel starts with the preconfigured Excel template that displays the model rules in the spreadsheet. 
The results calculated in Excel are displayed as new columns in Decision List Viewer. Note: Excel metrics do not remain when the model is saved; the metrics are valid only during the active session. However, you can create snapshots that include Excel metrics. The Excel metrics saved in the snapshot views are valid only for historical comparison purposes and do not refresh when reopened. For more information, see the topic Snapshots Tab on p. 218. The Excel metrics will not display in the snapshots until you reestablish a connection to the Excel template. 233 Decision List MS Excel Integration Setup The integration between Decision List Viewer and Microsoft Excel is accomplished through the use of a preconfigured Excel spreadsheet template. The template consists of three worksheets: Model Measures. Displays the imported Decision List Viewer measures, the custom Excel measures, and the calculation totals (defined on the Settings worksheet). Settings. Provides the variables to generate calculations based on the imported Decision List Viewer measures and the custom Excel measures. Configuration. Provides options for specifying which measures are imported from Decision List Viewer and for defining the custom Excel measures. WARNING: The structure of the Configuration worksheet is rigidly defined. Do NOT edit any cells in the green shaded area. Metrics From Model. Indicates which Decision List Viewer metrics are used in the calculations. Metrics To Model. Indicates which Excel-generated metric(s) will be returned to Decision List Viewer. The Excel-generated metrics display as new measure columns in Decision List Viewer. Note: Excel metrics do not remain with the model when you generate a new model; the metrics are valid only during the active session. Changing the Model Measures The following examples explain how to change Model Measures in several ways: Change an existing measure. Import an additional standard measure from the model. 
Export an additional custom measure to the model. Change an existing measure E Open the template and select the Configuration worksheet. E Edit any Name or Description by highlighting and typing over them. Note that if you want to change a measure (for example, to prompt the user for Probability instead of Frequency), you only need to change the name and description in Metrics From Model; this is then displayed in the model and the user can choose the appropriate measure to map. Import an additional standard measure from the model E Open the template and select the Configuration worksheet. E From the menus choose: Tools > Protection > Unprotect Sheet E Select cell A5, which is shaded yellow and contains the word End. E From the menus choose: Insert > Rows E Type in the Name and Description of the new measure. For example, Error and Error associated with segment. E In cell C5, enter the formula =COLUMN('Model Measures'!N3). E In cell D5, enter the formula =ROW('Model Measures'!N3)+1. These formulae will cause the new measure to be displayed in column N of the Model Measures worksheet, which is currently empty. E From the menus choose: Tools > Protection > Protect Sheet E Click OK. E On the Model Measures worksheet, ensure that cell N3 has Error as a title for the new column. E Select all of column N. E From the menus choose: Format > Cells E By default, all of the cells have a General number category. Click Percentage to change how the figures are displayed. This helps you check your figures in Excel; in addition, it enables you to utilize the data in other ways, for example, as an output to a graph. E Click OK. E Save the spreadsheet as an Excel 2003 template, with a unique name and the file extension .xlt. For ease of locating the new template, we recommend you save it in the preconfigured template location on your local or network file system. 
Export an additional custom measure to the model E Open the template to which you added the Error column in the previous example; select the Configuration worksheet. E From the menus choose: Tools > Protection > Unprotect Sheet E Select cell A14, which is shaded yellow and contains the word End. E From the menus choose: Insert > Rows E Type in the Name and Description of the new measure. For example, Scaled Error and Scaling applied to error from Excel. E In cell C14, enter the formula =COLUMN('Model Measures'!O3). E In cell D14, enter the formula =ROW('Model Measures'!O3)+1. These formulae specify that column O will supply the new measure to the model. E Select the Settings worksheet. E In cell A17, enter the description - Scaled Error. E In cell B17, enter the scaling factor of 10. E On the Model Measures worksheet, enter the description Scaled Error in cell O3 as a title for the new column. E In cell O4, enter the formula =N4*Settings!$B$17. E Select the corner of cell O4 and drag it down to cell O22 to copy the formula into each cell. E From the menus choose: Tools > Protection > Protect Sheet E Click OK. E Save the spreadsheet as an Excel 2003 template, with a unique name and the file extension .xlt. For ease of locating the new template, we recommend you save it in the preconfigured template location on your local or network file system. When you connect to Excel using this template, the Scaled Error value is available as a new custom measure. Visualizing Models The best way to understand the impact of a model is to visualize it. Using a gains chart, you can obtain valuable day-to-day insight into the business and technical benefit of your model by studying the effect of multiple alternatives in real time. The Gains Chart section shows the benefit of a model over randomized decision-making and allows the direct comparison of multiple charts when there are alternative models. 
Gains Chart The gains chart plots the values in the Gains % column from the table. Gains are defined as the proportion of hits in each increment relative to the total number of hits, using the equation: (hits in increment / total number of hits) x 100% Gains charts effectively illustrate how widely you need to cast the net to capture a given percentage of all of the hits. The diagonal line plots the expected response for the entire sample if the model is not used. In this case, the response rate would be constant, since one person is just as likely to respond as another. To double your yield, you would need to ask twice as many people. The curved line indicates how much you can improve your response by including only those who rank in the higher percentiles based on gain. For example, including the top 50% might net you more than 70% of the positive responses. The steeper the curve, the higher the gain. Figure 9-19 Gains tab To view a gains chart: E Open a stream that contains a Decision List node and launch an interactive session from the node. E Click the Gains tab. Depending on which partitions are specified, you may see one or two charts (two charts would display, for example, when both the training and testing partitions are defined for the model measures). By default, the charts display as segments. You can switch the charts to display as quantiles by selecting Quantiles and then selecting the appropriate quantile method from the drop-down menu. Note: See the documentation on working with graphs for more information. Chart Options The Chart Options feature provides options for selecting which models and snapshots are charted, which partitions are plotted, and whether or not segment labels display. Figure 9-20 Chart Options dialog Models to Plot Current Models. Allows you to select which models to chart. You can select the working model or any created snapshot models. Partitions to Plot Partitions for left-hand chart. 
The drop-down list provides options for displaying all defined partitions or all data.

Partitions for right-hand chart. The drop-down list provides options for displaying all defined partitions, all data, or only the left-hand chart. When Graph only left is selected, only the left chart is displayed.

Display Segment Labels. When enabled, each segment label is displayed on the charts.

Chapter 10 Statistical Models

Statistical models use mathematical equations to encode information extracted from the data. In some cases, statistical modeling techniques can provide adequate models very quickly. Even for problems in which more flexible machine-learning techniques (such as neural networks) can ultimately give better results, you can use some statistical models as baseline predictive models to judge the performance of more advanced techniques.

The following statistical modeling nodes are available.

Linear regression models predict a continuous target based on linear relationships between the target and one or more predictors. For more information, see the topic Linear models on p. 239.

Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range. For more information, see the topic Logistic Node on p. 259.

The PCA/Factor node provides powerful data-reduction techniques to reduce the complexity of your data. Principal components analysis (PCA) finds linear combinations of the input fields that do the best job of capturing the variance in the entire set of fields, where the components are orthogonal (perpendicular) to each other. Factor analysis attempts to identify underlying factors that explain the pattern of correlations within a set of observed fields. For both approaches, the goal is to find a small number of derived fields that effectively summarize the information in the original set of fields.
For more information, see the topic PCA/Factor Node on p. 277.

Discriminant analysis makes more stringent assumptions than logistic regression but can be a valuable alternative or supplement to a logistic regression analysis when those assumptions are met. For more information, see the topic Discriminant Node on p. 285.

The Generalized Linear model expands the general linear model so that the dependent variable is linearly related to the factors and covariates through a specified link function. Moreover, the model allows for the dependent variable to have a non-normal distribution. It covers the functionality of a wide number of statistical models, including linear regression, logistic regression, loglinear models for count data, and interval-censored survival models. For more information, see the topic GenLin Node on p. 294.

A generalized linear mixed model (GLMM) extends the linear model so that the target can have a non-normal distribution, is linearly related to the factors and covariates via a specified link function, and so that the observations can be correlated. Generalized linear mixed models cover a wide variety of models, from simple linear regression to complex multilevel models for non-normal longitudinal data. For more information, see the topic GLMM Node on p. 307.

The Cox regression node enables you to build a survival model for time-to-event data in the presence of censored records. The model produces a survival function that predicts the probability that the event of interest has occurred at a given time (t) for given values of the input variables. For more information, see the topic Cox Node on p. 336.

Linear Node

Linear regression is a common statistical technique for summarizing data and making predictions by fitting a straight line or surface that minimizes the discrepancies between predicted and actual output values.
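For intuition, the line-fitting idea can be sketched in a few lines of illustrative Python (ordinary least squares for a single predictor; the node itself is configured through its dialogs, not through code):

```python
# Minimal ordinary least squares for one predictor: the slope and intercept
# that minimize the squared discrepancies between predicted and actual values.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
# a prediction for a new x is then intercept + slope * x
```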
Figure 10-1 Simple linear regression graph

Requirements. Only numeric fields can be used in a linear regression model. You must have exactly one target field (with the role set to Target) and one or more predictors (with the role set to Input). Fields with a role of Both or None are ignored, as are non-numeric fields. (If necessary, non-numeric fields can be recoded using a Derive node.)

Strengths. Linear regression models are relatively simple and give an easily interpreted mathematical formula for generating predictions. Because linear regression is a long-established statistical procedure, the properties of these models are well understood. Linear models are also typically very fast to train. The Linear node provides methods for automatic field selection in order to eliminate nonsignificant input fields from the equation.

Note: In cases where the target field is categorical rather than continuous, such as yes/no or churn/don't churn, logistic regression can be used as an alternative. Logistic regression also provides support for non-numeric inputs, removing the need to recode these fields. For more information, see the topic Logistic Node on p. 259.

Linear models

Linear models predict a continuous target based on linear relationships between the target and one or more predictors.

Linear models are relatively simple and give an easily interpreted mathematical formula for scoring. The properties of these models are well understood, and linear models can typically be built very quickly compared to other model types (such as neural networks or decision trees) on the same dataset.

Example. An insurance company with limited resources to investigate homeowners' insurance claims wants to build a model for estimating claims costs. By deploying this model to service centers, representatives can enter claim information while on the phone with a customer and immediately obtain the "expected" cost of the claim based on past data.
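The deployment scenario can be pictured as a simple scoring function; the field names and coefficients below are invented purely for illustration and are not produced by the node:

```python
# Hypothetical deployed linear model for claim-cost estimation.
# Both the input fields and the coefficient values are made up.
coefficients = {"intercept": 500.0, "property_age": 12.0, "damage_score": 85.0}

def expected_cost(record):
    # predicted cost = intercept + sum of (coefficient x input value)
    return (coefficients["intercept"]
            + coefficients["property_age"] * record["property_age"]
            + coefficients["damage_score"] * record["damage_score"])

claim = {"property_age": 20, "damage_score": 7}
cost = expected_cost(claim)  # the "expected" cost quoted to the caller
```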
Figure 10-2 Fields tab

Field requirements. There must be a Target and at least one Input. By default, fields with predefined roles of Both or None are not used. The target must be continuous (scale). There are no measurement level restrictions on predictors (inputs); categorical (flag, nominal, and ordinal) fields are used as factors in the model, and continuous fields are used as covariates. For more information, see the topic Modeling Node Fields Options in Chapter 3 on p. 35.

Objectives

Figure 10-4 Objectives settings

What do you want to do?

Build a new model. Build a completely new model. This is the usual operation of the node.

Continue training an existing model. Training continues with the last model successfully produced by the node. This makes it possible to update or refresh an existing model without having to access the original data, and may result in significantly faster performance since only the new or updated records are fed into the stream. Details on the previous model are stored with the modeling node, making it possible to use this option even if the previous model nugget is no longer available in the stream or Models palette. Note: When this option is enabled, all other controls on the Fields and Build Options tabs are disabled.

What is your main objective?

Create a standard model. The method builds a single model to predict the target using the predictors. Generally speaking, standard models are easier to interpret and can be faster to score than boosted, bagged, or large dataset ensembles.

Enhance model accuracy (boosting). The method builds an ensemble model using boosting, which generates a sequence of models to obtain more accurate predictions. Ensembles can take longer to build and to score than a standard model. Boosting produces a succession of "component models", each of which is built on the entire dataset.
Prior to building each successive component model, the records are weighted based on the previous component model's residuals. Cases with large residuals are given relatively higher analysis weights so that the next component model will focus on predicting these records well. Together these component models form an ensemble model. The ensemble model scores new records using a combining rule; the available rules depend upon the measurement level of the target.

Enhance model stability (bagging). The method builds an ensemble model using bagging (bootstrap aggregating), which generates multiple models to obtain more reliable predictions. Ensembles can take longer to build and to score than a standard model. Bootstrap aggregation (bagging) produces replicates of the training dataset by sampling with replacement from the original dataset. This creates bootstrap samples of equal size to the original dataset. Then a "component model" is built on each replicate. Together these component models form an ensemble model. The ensemble model scores new records using a combining rule; the available rules depend upon the measurement level of the target.

Create a model for very large datasets (requires IBM® SPSS® Modeler Server). The method builds an ensemble model by splitting the dataset into separate data blocks. Choose this option if your dataset is too large to build any of the models above, or for incremental model building. This option can take less time to build, but can take longer to score than a standard model. This option requires SPSS Modeler Server connectivity.

See Ensembles on p. 245 for settings related to boosting, bagging, and very large datasets.

Basics

Figure 10-5 Basics settings

Automatically prepare data. This option allows the procedure to internally transform the target and predictors in order to maximize the predictive power of the model; any transformations are saved with the model and applied to new data for scoring.
The original versions of transformed fields are excluded from the model. By default, the following automatic data preparation steps are performed.

Date and Time handling. Each date predictor is transformed into a new continuous predictor containing the elapsed time since a reference date (1970-01-01). Each time predictor is transformed into a new continuous predictor containing the time elapsed since a reference time (00:00:00).

Adjust measurement level. Continuous predictors with fewer than 5 distinct values are recast as ordinal predictors. Ordinal predictors with more than 10 distinct values are recast as continuous predictors.

Outlier handling. Values of continuous predictors that lie beyond a cutoff value (3 standard deviations from the mean) are set to the cutoff value.

Missing value handling. Missing values of nominal predictors are replaced with the mode of the training partition. Missing values of ordinal predictors are replaced with the median of the training partition. Missing values of continuous predictors are replaced with the mean of the training partition.

Supervised merging. This makes a more parsimonious model by reducing the number of fields to be processed in association with the target. Similar categories are identified based upon the relationship between the input and the target. Categories that are not significantly different (that is, having a p-value greater than 0.1) are merged. If all categories are merged into one, the original and derived versions of the field are excluded from the model because they have no value as a predictor.

Confidence level. This is the level of confidence used to compute interval estimates of the model coefficients in the Coefficients view. Specify a value greater than 0 and less than 100. The default is 95.

Model Selection

Figure 10-6 Model Selection settings

Model selection method.
Choose one of the model selection methods (details below) or Include all predictors, which simply enters all available predictors as main effects model terms. By default, Forward stepwise is used.

Forward Stepwise Selection. This starts with no effects in the model and adds and removes effects one step at a time until no more can be added or removed according to the stepwise criteria.

Criteria for entry/removal. This is the statistic used to determine whether an effect should be added to or removed from the model. Information Criterion (AICC) is based on the likelihood of the training set given the model, and is adjusted to penalize overly complex models. F Statistics is based on a statistical test of the improvement in model error. Adjusted R-squared is based on the fit of the training set, and is adjusted to penalize overly complex models. Overfit Prevention Criterion (ASE) is based on the fit (average squared error, or ASE) of the overfit prevention set. The overfit prevention set is a random subsample of approximately 30% of the original dataset that is not used to train the model.

If any criterion other than F Statistics is chosen, then at each step the effect that corresponds to the greatest positive increase in the criterion is added to the model. Any effects in the model that correspond to a decrease in the criterion are removed.

If F Statistics is chosen as the criterion, then at each step the effect that has the smallest p-value less than the specified threshold, Include effects with p-values less than, is added to the model. The default is 0.05. Any effects in the model with a p-value greater than the specified threshold, Remove effects with p-values greater than, are removed. The default is 0.10.

Customize maximum number of effects in the final model. By default, all available effects can be entered into the model.
Alternatively, if the stepwise algorithm ends a step with the specified maximum number of effects, the algorithm stops with the current set of effects.

Customize maximum number of steps. The stepwise algorithm stops after a certain number of steps. By default, this is 3 times the number of available effects. Alternatively, specify a positive integer maximum number of steps.

Best Subsets Selection. This checks "all possible" models, or at least a larger subset of the possible models than forward stepwise, to choose the best according to the best subsets criterion. Information Criterion (AICC) is based on the likelihood of the training set given the model, and is adjusted to penalize overly complex models. Adjusted R-squared is based on the fit of the training set, and is adjusted to penalize overly complex models. Overfit Prevention Criterion (ASE) is based on the fit (average squared error, or ASE) of the overfit prevention set. The overfit prevention set is a random subsample of approximately 30% of the original dataset that is not used to train the model. The model with the greatest value of the criterion is chosen as the best model.

Note: Best subsets selection is more computationally intensive than forward stepwise selection. When best subsets is performed in conjunction with boosting, bagging, or very large datasets, it can take considerably longer to build than a standard model built using forward stepwise selection.

Ensembles

Figure 10-7 Ensembles settings

These settings determine the behavior of ensembling that occurs when boosting, bagging, or very large datasets are requested in Objectives. Options that do not apply to the selected objective are ignored.

Bagging and Very Large Datasets. When scoring an ensemble, this is the rule used to combine the predicted values from the base models to compute the ensemble score value.

Default combining rule for continuous targets.
Ensemble predicted values for continuous targets can be combined using the mean or median of the predicted values from the base models.

Note that when the objective is to enhance model accuracy, the combining rule selections are ignored. Boosting always uses a weighted majority vote to score categorical targets and a weighted median to score continuous targets.

Boosting and Bagging. Specify the number of base models to build when the objective is to enhance model accuracy or stability; for bagging, this is the number of bootstrap samples. It should be a positive integer.

Advanced

Figure 10-8 Advanced settings

Replicate results. Setting a random seed allows you to replicate analyses. The random number generator is used to choose which records are in the overfit prevention set. Specify an integer or click Generate, which will create a pseudo-random integer between 1 and 2147483647, inclusive. The default is 54752075.

Model Options

Figure 10-9 Model Options tab

Model Name. You can generate the model name automatically based on the target fields or specify a custom name. The automatically generated name is the target field name.

Note that the predicted value is always computed when the model is scored. The name of the new field is the name of the target field, prefixed with $L-. For example, for a target field named sales, the new field would be named $L-sales.

Model Summary

Figure 10-10 Model Summary view

The Model Summary view is a snapshot, at-a-glance summary of the model and its fit.

Table. The table identifies some high-level model settings, including:
- The name of the target specified on the Fields tab
- Whether automatic data preparation was performed as specified on the Basics settings
- The model selection method and selection criterion specified on the Model Selection settings

The value of the selection criterion for the final model is also displayed, and is presented in smaller is better format.

Chart.
The chart displays the accuracy of the final model, which is presented in larger is better format. The value is 100 × the adjusted R² for the final model.

Automatic Data Preparation

Figure 10-11 Automatic Data Preparation view

This view shows information about which fields were excluded and how transformed fields were derived in the automatic data preparation (ADP) step. For each field that was transformed or excluded, the table lists the field name, its role in the analysis, and the action taken by the ADP step. Fields are sorted by ascending alphabetical order of field names. The possible actions taken for each field include:

- Derive duration: months computes the elapsed time in months from the values in a field containing dates to the current system date.
- Derive duration: hours computes the elapsed time in hours from the values in a field containing times to the current system time.
- Change measurement level from continuous to ordinal recasts continuous fields with fewer than 5 unique values as ordinal fields.
- Change measurement level from ordinal to continuous recasts ordinal fields with more than 10 unique values as continuous fields.
- Trim outliers sets values of continuous predictors that lie beyond a cutoff value (3 standard deviations from the mean) to the cutoff value.
- Replace missing values replaces missing values of nominal fields with the mode, ordinal fields with the median, and continuous fields with the mean.
- Merge categories to maximize association with target identifies "similar" predictor categories based upon the relationship between the input and the target. Categories that are not significantly different (that is, having a p-value greater than 0.05) are merged.
- Exclude constant predictor / after outlier handling / after merging of categories removes predictors that have a single value, possibly after other ADP actions have been taken.
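Two of the ADP actions above, outlier trimming and mean imputation, can be sketched as follows (illustrative Python, simplified relative to the actual ADP step):

```python
import statistics

# Trim outliers: clamp continuous values beyond n_sd standard deviations
# from the mean to the cutoff value, as described for the ADP step.
def trim_outliers(values, n_sd=3):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lo, hi = mean - n_sd * sd, mean + n_sd * sd
    return [min(max(v, lo), hi) for v in values]

# Replace missing values: for a continuous field, substitute the mean of
# the observed (non-missing) values.
def impute_mean(values):
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]
```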
Predictor Importance

Figure 10-12 Predictor Importance view

Typically, you will want to focus your modeling efforts on the predictor fields that matter most and consider dropping or ignoring those that matter least. The predictor importance chart helps you do this by indicating the relative importance of each predictor in estimating the model. Since the values are relative, the sum of the values for all predictors on the display is 1.0. Predictor importance does not relate to model accuracy. It relates only to the importance of each predictor in making a prediction, not to whether or not the prediction is accurate.

Predicted By Observed

Figure 10-13 Predicted By Observed view

This displays a binned scatterplot of the predicted values on the vertical axis by the observed values on the horizontal axis. Ideally, the points should lie on a 45-degree line; this view can tell you whether any records are predicted particularly badly by the model.

Residuals

Figure 10-14 Residuals view, histogram style

This displays a diagnostic chart of model residuals.

Chart styles. There are different display styles, which are accessible from the Style drop-down list.

Histogram. This is a binned histogram of the studentized residuals with an overlay of the normal distribution. Linear models assume that the residuals have a normal distribution, so the histogram should ideally closely approximate the smooth line.

P-P Plot. This is a binned probability-probability plot comparing the studentized residuals to a normal distribution. If the slope of the plotted points is less steep than the normal line, the residuals show greater variability than a normal distribution; if the slope is steeper, the residuals show less variability than a normal distribution. If the plotted points have an S-shaped curve, then the distribution of residuals is skewed.
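For intuition, here is a simplified residual standardization (plain standardized residuals, a stand-in for the studentized residuals the view actually uses, which also adjust for each record's leverage):

```python
import statistics

# Illustrative: residual = observed - predicted, scaled by the residual
# standard deviation so the values are comparable across models.
def standardized_residuals(observed, predicted):
    resid = [o - p for o, p in zip(observed, predicted)]
    sd = statistics.stdev(resid)
    return [r / sd for r in resid]
```

If the model's assumptions hold, a histogram of these values should look approximately standard normal, which is what the Histogram style checks visually.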
Outliers

Figure 10-15 Outliers view

This table lists records that exert undue influence upon the model, and displays the record ID (if specified on the Fields tab), target value, and Cook's distance. Cook's distance is a measure of how much the residuals of all records would change if a particular record were excluded from the calculation of the model coefficients. A large Cook's distance indicates that excluding a record changes the coefficients substantially, and the record should therefore be considered influential. Influential records should be examined carefully to determine whether you can give them less weight in estimating the model, truncate the outlying values to some acceptable threshold, or remove the influential records completely.

Effects

Figure 10-16 Effects view, diagram style

This view displays the size of each effect in the model.

Styles. There are different display styles, which are accessible from the Style drop-down list.

Diagram. This is a chart in which effects are sorted from top to bottom by decreasing predictor importance. Connecting lines in the diagram are weighted based on effect significance, with greater line width corresponding to more significant effects (smaller p-values). Hovering over a connecting line reveals a tooltip that shows the p-value and importance of the effect. This is the default.

Table. This is an ANOVA table for the overall model and the individual model effects. The individual effects are sorted from top to bottom by decreasing predictor importance. Note that by default, the table is collapsed to only show the results for the overall model. To see the results for the individual model effects, click the Corrected Model cell in the table.

Predictor importance. There is a Predictor Importance slider that controls which predictors are shown in the view. This does not change the model, but simply allows you to focus on the most important predictors.
By default, the top 10 effects are displayed.

Significance. There is a Significance slider that further controls which effects are shown in the view, beyond those shown based on predictor importance. Effects with significance values greater than the slider value are hidden. This does not change the model, but simply allows you to focus on the most important effects. By default the value is 1.00, so that no effects are filtered based on significance.

Coefficients

Figure 10-17 Coefficients view, diagram style

This view displays the value of each coefficient in the model. Note that factors (categorical predictors) are indicator-coded within the model, so that effects containing factors will generally have multiple associated coefficients; one for each category except the category corresponding to the redundant (reference) parameter.

Styles. There are different display styles, which are accessible from the Style drop-down list.

Diagram. This is a chart which displays the intercept first, and then sorts effects from top to bottom by decreasing predictor importance. Within effects containing factors, coefficients are sorted by ascending order of data values. Connecting lines in the diagram are colored based on the sign of the coefficient (see the diagram key) and weighted based on coefficient significance, with greater line width corresponding to more significant coefficients (smaller p-values). Hovering over a connecting line reveals a tooltip that shows the value of the coefficient, its p-value, and the importance of the effect the parameter is associated with. This is the default style.

Table. This shows the values, significance tests, and confidence intervals for the individual model coefficients. After the intercept, the effects are sorted from top to bottom by decreasing predictor importance. Within effects containing factors, coefficients are sorted by ascending order of data values.
Note that by default the table is collapsed to only show the coefficient, significance, and importance of each model parameter. To see the standard error, t statistic, and confidence interval, click the Coefficient cell in the table. Hovering over the name of a model parameter in the table reveals a tooltip that shows the name of the parameter, the effect the parameter is associated with, and (for categorical predictors) the value labels associated with the model parameter. This can be particularly useful for seeing the new categories created when automatic data preparation merges similar categories of a categorical predictor.

Predictor importance. There is a Predictor Importance slider that controls which predictors are shown in the view. This does not change the model, but simply allows you to focus on the most important predictors. By default, the top 10 effects are displayed.

Significance. There is a Significance slider that further controls which coefficients are shown in the view, beyond those shown based on predictor importance. Coefficients with significance values greater than the slider value are hidden. This does not change the model, but simply allows you to focus on the most important coefficients. By default the value is 1.00, so that no coefficients are filtered based on significance.

Estimated Means

Figure 10-18 Estimated Means view

These are charts displayed for significant predictors. The chart displays the model-estimated value of the target on the vertical axis for each value of the predictor on the horizontal axis, holding all other predictors constant. It provides a useful visualization of the effects of each predictor's coefficients on the target.

Note: If no predictors are significant, no estimated means are produced.
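The "holding all other predictors constant" computation can be sketched with invented coefficients (illustrative only; the view derives these values from the fitted model):

```python
# Hypothetical fitted model: one factor (levels A/B/C, "A" as reference)
# and one continuous covariate. All numbers are made up for illustration.
coeffs = {"intercept": 5.0, "covariate": 2.0,
          "factor": {"A": 0.0, "B": 1.5, "C": -0.5}}
covariate_mean = 10.0  # other predictors are held at a fixed (mean) value

# Estimated mean of the target for each factor level, covariate held at mean.
estimated_means = {
    level: coeffs["intercept"] + coeffs["covariate"] * covariate_mean + eff
    for level, eff in coeffs["factor"].items()
}
```

Plotting `estimated_means` against the factor levels is exactly the kind of per-predictor profile the Estimated Means charts show.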
Model Building Summary

Figure 10-19 Model Building Summary view, forward stepwise algorithm

When a model selection algorithm other than None is chosen on the Model Selection settings, this view provides some details of the model building process.

Forward stepwise. When forward stepwise is the selection algorithm, the table displays the last 10 steps in the stepwise algorithm. For each step, the value of the selection criterion and the effects in the model at that step are shown. This gives you a sense of how much each step contributes to the model. Each column allows you to sort the rows so that you can more easily see which effects are in the model at a given step.

Best subsets. When best subsets is the selection algorithm, the table displays the top 10 models. For each model, the value of the selection criterion and the effects in the model are shown. This gives you a sense of the stability of the top models; if they tend to have many similar effects with a few differences, then you can be fairly confident in the "top" model; if they tend to have very different effects, then some of the effects may be too similar and should be combined (or one removed). Each column allows you to sort the rows so that you can more easily see which effects are in the model at a given step.

Settings

Figure 10-20 Settings tab

Note that the predicted value is always computed when the model is scored. The name of the new field is the name of the target field, prefixed with $L-. For example, for a target field named sales, the new field would be named $L-sales.

Generate SQL for this model. When using data from a database, SQL code can be pushed back to the database for execution, providing superior performance for many operations.

Score by converting to native SQL. If selected, generates SQL to score the model natively within the application.
Logistic Node

Logistic regression, also known as nominal regression, is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric one. Both binomial models (for targets with two discrete categories) and multinomial models (for targets with more than two categories) are supported.

Logistic regression works by building a set of equations that relate the input field values to the probabilities associated with each of the output field categories. Once the model is generated, it can be used to estimate probabilities for new data. For each record, a probability of membership is computed for each possible output category. The target category with the highest probability is assigned as the predicted output value for that record.

Binomial example. A telecommunications provider is concerned about the number of customers it is losing to competitors. Using service usage data, you can create a binomial model to predict which customers are liable to transfer to another provider and customize offers so as to retain as many customers as possible. A binomial model is used because the target has two distinct categories (likely to transfer or not).

Note: For binomial models only, string fields must be limited to eight characters. If necessary, longer strings can be recoded using a Reclassify node.

Multinomial example. A telecommunications provider has segmented its customer base by service usage patterns, categorizing the customers into four groups. Using demographic data to predict group membership, you can create a multinomial model to classify prospective customers into groups and then customize offers for individual customers.

Requirements. One or more input fields and exactly one categorical target field with two or more categories. For a binomial model the target must have a measurement level of Flag.
For a multinomial model the target can have a measurement level of Flag, or of Nominal with two or more categories. Fields set to Both or None are ignored. Fields used in the model must have their types fully instantiated.

Strengths. Logistic regression models are often quite accurate. They can handle symbolic and numeric input fields. They can give predicted probabilities for all target categories so that a second-best guess can easily be identified. Logistic models are most effective when group membership is a truly categorical field; if group membership is based on values of a continuous range field (for example, high IQ versus low IQ), you should consider using linear regression to take advantage of the richer information offered by the full range of values. Logistic models can also perform automatic field selection, although other approaches such as tree models or Feature Selection may do this more quickly on large datasets. Finally, since logistic models are well understood by many analysts and data miners, they may be used by some as a baseline against which other modeling techniques can be compared.

When processing large datasets, you can improve performance noticeably by disabling the likelihood-ratio test, an advanced output option. For more information, see the topic Logistic Regression Advanced Output on p. 268.

Logistic Node Model Options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.

Create split models. Builds a separate model for each possible value of input fields that are specified as split fields. For more information, see the topic Building Split Models in Chapter 3 on p. 30.

Procedure. Specifies whether a binomial or multinomial model is created.
The options available in the dialog box vary depending on which type of modeling procedure is selected.

Binomial. Used when the target field is a flag or nominal field with two discrete values (dichotomous), such as yes/no, on/off, or male/female.

Multinomial. Used when the target field is a nominal field with more than two values. You can specify Main effects, Full factorial, or Custom.

Include constant in equation. This option determines whether the resulting equations will include a constant term. In most situations, you should leave this option selected.

Binomial Models

Figure 10-21 Logistic node, binomial model options

For binomial models, the following methods and options are available:

Method. Specify the method to be used in building the logistic regression model.

Enter. This is the default method, which enters all of the terms into the equation directly. No field selection is performed in building the model.

Forwards. The Forwards method of field selection builds the model by moving forward step by step. With this method, the initial model is the simplest possible, containing only the constant, and terms can only be added to the model. At each step, terms not yet in the model are tested based on how much they would improve the model, and the best of those terms is added. When no more terms can be added, or the best candidate term does not produce a large-enough improvement in the model, the final model is generated.

Backwards. The Backwards method is essentially the opposite of the Forwards method. With this method, the initial model contains all of the terms as predictors, and terms can only be removed from the model. Model terms that contribute little to the model are removed one by one until no more terms can be removed without significantly worsening the model, yielding the final model.

Categorical inputs. Lists the fields that are identified as categorical, that is, those with a measurement level of flag, nominal, or ordinal.
You can specify the contrast and base category for each categorical field.

Field Name. This column contains the field names of the categorical inputs and is prepopulated with all flag and nominal fields in the data. To add continuous or numeric inputs to this column, click the Add Fields icon to the right of the list and select the required inputs.

Contrast. The interpretation of the regression coefficients for a categorical field depends on the contrasts that are used. The contrast determines how hypothesis tests are set up to compare the estimated means. For example, if you know that a categorical field has an implicit order, such as a pattern or grouping, you can use the contrast to model that order. The available contrasts are:

Indicator. Contrasts indicate the presence or absence of category membership. This is the default method.

Simple. Each category of the predictor field, except the reference category, is compared to the reference category.

Difference. Each category of the predictor field, except the first category, is compared to the average effect of previous categories. Also known as reverse Helmert contrasts.

Helmert. Each category of the predictor field, except the last category, is compared to the average effect of subsequent categories.

Repeated. Each category of the predictor field, except the first category, is compared to the category that precedes it.

Polynomial. Orthogonal polynomial contrasts. Categories are assumed to be equally spaced. Polynomial contrasts are available for numeric fields only.

Deviation. Each category of the predictor field, except the reference category, is compared to the overall effect.

Base Category. Specifies how the reference category is determined for the selected contrast type. Select First to use the first category for the input field (sorted alphabetically) or select Last to use the last category. The default value is First.
Note: This field is unavailable if the contrast setting is Difference, Helmert, Repeated, or Polynomial.

The estimate of each field's effect on the overall response is computed as an increase or decrease in the likelihood of each of the other categories relative to the reference category. This can help you identify the fields and values that are more likely to give a specific response.

The base category is shown in the output as 0.0. This is because comparing it to itself produces an empty result. All other categories are shown as equations relative to the base category. For more information, see the topic Logistic Nugget Model Details on p. 272.

Multinomial Models

Figure 10-22 Logistic node, multinomial model options

For multinomial models, the following methods and options are available:

Method. Specify the method to be used in building the logistic regression model.

Enter. This is the default method, which enters all of the terms into the equation directly. No field selection is performed in building the model.

Stepwise. The Stepwise method of field selection builds the equation in steps, as the name implies. The initial model is the simplest model possible, with no model terms (except the constant) in the equation. At each step, terms that have not yet been added to the model are evaluated, and if the best of those terms adds significantly to the predictive power of the model, it is added. In addition, terms that are currently in the model are reevaluated to determine if any of them can be removed without significantly detracting from the model. If so, they are removed. The process repeats, and other terms are added and/or removed. When no more terms can be added to improve the model, and no more terms can be removed without detracting from it, the final model is generated.

Forwards. The Forwards method of field selection is similar to the Stepwise method in that the model is built in steps.
However, with this method the initial model is the simplest possible, containing only the constant, and terms can only be added to the model. At each step, terms not yet in the model are tested based on how much they would improve the model, and the best of those terms is added. When no more terms can be added, or the best candidate term does not produce a large-enough improvement in the model, the final model is generated.

Backwards. The Backwards method is essentially the opposite of the Forwards method. With this method, the initial model contains all of the terms as predictors, and terms can only be removed from the model. Model terms that contribute little to the model are removed one by one until no more terms can be removed without significantly worsening the model, yielding the final model.

Backwards Stepwise. The Backwards Stepwise method is essentially the opposite of the Stepwise method. With this method, the initial model contains all of the terms as predictors. At each step, terms in the model are evaluated, and any terms that can be removed without significantly detracting from the model are removed. In addition, previously removed terms are reevaluated to determine if the best of those terms adds significantly to the predictive power of the model. If so, it is added back into the model. When no more terms can be removed without significantly detracting from the model, and no more terms can be added to improve the model, the final model is generated.

Note: The automatic methods, including Stepwise, Forwards, and Backwards, are highly adaptable learning methods and have a strong tendency to overfit the training data. When using these methods, it is especially important to verify the validity of the resulting model, either with new data or with a hold-out test sample created using the Partition node.

Base category for target. Specifies how the reference category is determined.
This is used as the baseline against which the regression equations for all other categories in the target are estimated. Select First to use the first category for the current target field (sorted alphabetically) or select Last to use the last category. Alternatively, you can select Specify to choose a specific category and select the desired value from the list. Available values can be defined for each field in a Type node.

Often you would specify the category in which you are least interested as the base category, for example, a loss-leader product. The other categories are then related to this base category in a relative fashion to identify what makes them more likely to be in their own category. This can help you identify the fields and values that are more likely to give a specific response.

The base category is shown in the output as 0.0. This is because comparing it to itself produces an empty result. All other categories are shown as equations relative to the base category. For more information, see the topic Logistic Nugget Model Details on p. 272.

Model type. There are three options for defining the terms in the model. Main Effects models include only the input fields individually and do not test interactions (multiplicative effects) between input fields. Full Factorial models include all interactions as well as the input field main effects. Full factorial models are better able to capture complex relationships but are also much more difficult to interpret and are more likely to suffer from overfitting. Because of the potentially large number of possible combinations, automatic field selection methods (methods other than Enter) are disabled for full factorial models. Custom models include only the terms (main effects and interactions) that you specify. When selecting this option, use the Model Terms list to add or remove terms in the model.

Model Terms. When building a Custom model, you will need to explicitly specify the terms in the model.
The list shows the current set of terms for the model. The buttons on the right side of the Model Terms list allow you to add and remove model terms.

To add terms to the model, click the Add new model terms button.

To delete terms, select the desired terms and click the Delete selected model terms button.

Adding Terms to a Logistic Regression Model

When requesting a custom logistic regression model, you can add terms to the model by clicking the Add new model terms button on the Logistic Regression Model tab. A new dialog box opens in which you can specify terms.

Figure 10-23 Logistic Regression New Terms dialog box

Type of term to add. There are several ways to add terms to the model, based on the selection of input fields in the Available fields list.

Single interaction. Inserts the term representing the interaction of all selected fields.

Main effects. Inserts one main effect term (the field itself) for each selected input field.

All 2-way interactions. Inserts a 2-way interaction term (the product of the input fields) for each possible pair of selected input fields. For example, if you have selected input fields A, B, and C in the Available fields list, this method will insert the terms A * B, A * C, and B * C.

All 3-way interactions. Inserts a 3-way interaction term (the product of the input fields) for each possible combination of selected input fields, taken three at a time. For example, if you have selected input fields A, B, C, and D in the Available fields list, this method will insert the terms A * B * C, A * B * D, A * C * D, and B * C * D.

All 4-way interactions. Inserts a 4-way interaction term (the product of the input fields) for each possible combination of selected input fields, taken four at a time. For example, if you have selected input fields A, B, C, D, and E in the Available fields list, this method will insert the terms A * B * C * D, A * B * C * E, A * B * D * E, A * C * D * E, and B * C * D * E.
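These n-way expansions are plain combinatorial enumerations of the selected fields. As an illustration only (this is not an SPSS Modeler API; the function name and field labels are invented), the term lists above can be reproduced in a few lines of Python:

```python
from itertools import combinations

def interaction_terms(fields, order):
    """Return one n-way interaction term (the product of the fields)
    for each combination of the given fields, taken 'order' at a time."""
    return [" * ".join(combo) for combo in combinations(fields, order)]

print(interaction_terms(["A", "B", "C"], 2))
# ['A * B', 'A * C', 'B * C']
print(interaction_terms(["A", "B", "C", "D"], 3))
# ['A * B * C', 'A * B * D', 'A * C * D', 'B * C * D']
```

Because `combinations` preserves the order of the input list, the generated terms appear in the same order as the examples in the text.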
Available fields. Lists the available input fields to be used in constructing model terms.

Preview. Shows the terms that will be added to the model if you click Insert, based on the selected fields and term type.

Insert. Inserts terms in the model (based on the current selection of fields and term type) and closes the dialog box.

Logistic Node Expert Options

If you have detailed knowledge of logistic regression, expert options allow you to fine-tune the training process. To access expert options, set Mode to Expert on the Expert tab.

Figure 10-24 Logistic Regression Expert tab

Scale (Multinomial models only). You can specify a dispersion scaling value that will be used to correct the estimate of the parameter covariance matrix. Pearson estimates the scaling value by using the Pearson chi-square statistic. Deviance estimates the scaling value by using the deviance function (likelihood-ratio chi-square) statistic. You can also specify your own user-defined scaling value. It must be a positive numeric value.

Append all probabilities. If this option is selected, probabilities for each category of the output field will be added to each record processed by the node. If this option is not selected, the probability of only the predicted category is added. For example, the results of a multinomial model with three categories will include five new columns: one column showing the predicted category, one showing the probability of that prediction, and a further three columns showing the probability of each of the three categories. For more information, see the topic Logistic Model Nugget on p. 271.

Note: This option is always selected for binomial models.

Singularity tolerance. Specify the tolerance used in checking for singularities.

Convergence. These options allow you to control the parameters for model convergence.
When you execute the model, the convergence settings control how many times the parameter estimates are iteratively refined. The more iterations performed, the closer the estimates come to a stable solution (that is, the results converge). For more information, see the topic Logistic Regression Convergence Options on p. 267.

Output. These options allow you to request additional statistics that will be displayed in the advanced output of the model nugget built by the node. For more information, see the topic Logistic Regression Advanced Output on p. 268.

Stepping. These options allow you to control the criteria for adding and removing fields with the Stepwise, Forwards, Backwards, or Backwards Stepwise estimation methods. (The button is disabled if the Enter method is selected.) For more information, see the topic Logistic Regression Stepping Options on p. 270.

Logistic Regression Convergence Options

You can set the convergence parameters for logistic regression model estimation.

Figure 10-25 Logistic Regression Convergence options

Maximum iterations. Specify the maximum number of iterations for estimating the model.

Maximum step-halving. Step-halving is a technique used by logistic regression to deal with complexities in the estimation process. Under normal circumstances, you should use the default setting.

Log-likelihood convergence. Iterations stop if the relative change in the log-likelihood is less than this value. The criterion is not used if the value is 0.

Parameter convergence. Iterations stop if the absolute change or relative change in the parameter estimates is less than this value. The criterion is not used if the value is 0.

Delta (Multinomial models only). You can specify a value between 0 and 1 to be added to each empty cell (combination of input field and output field values).
This can help the estimation algorithm deal with data where there are many possible combinations of field values relative to the number of records in the data. The default is 0.

Logistic Regression Advanced Output

Select the optional output you want to display in the advanced output of the Regression model nugget. To view the advanced output, browse the model nugget and click the Advanced tab. For more information, see the topic Logistic Model Nugget Advanced Output on p. 276.

Binomial Options

Figure 10-26 Logistic Regression, Binomial output options

Select the types of output to be generated for the model. For more information, see the topic Logistic Model Nugget Advanced Output on p. 276.

Display. Select whether to display the results at each step or to wait until all steps have been worked through.

CI for exp(B). Select the confidence intervals for each coefficient (shown as Beta) in the expression. Specify the level of the confidence interval (the default is 95%).

Residual Diagnosis. Requests a Casewise Diagnostics table of residuals.

Outliers outside (std. dev.). List only residual cases for which the absolute standardized value of the listed variable is at least as large as the value you specify. The default value is 2.

All cases. Include all cases in the Casewise Diagnostics table of residuals. Note: Because this option lists each of the input records, it may result in an exceptionally large table in the report, with one line for every record.

Classification cutoff. This allows you to determine the cutpoint for classifying cases. Cases with predicted values that exceed the classification cutoff are classified as positive, while those with predicted values smaller than the cutoff are classified as negative. To change the default, enter a value between 0.01 and 0.99.

Multinomial Options

Figure 10-27 Logistic Regression, Multinomial output options

Select the types of output to be generated for the model.
For more information, see the topic Logistic Model Nugget Advanced Output on p. 276.

Note: Selecting the Likelihood ratio tests option greatly increases the processing time required to build a logistic regression model. If your model is taking too long to build, consider disabling this option or using the Wald and Score statistics instead. For more information, see the topic Logistic Regression Stepping Options on p. 270.

Iteration history for every. Select the step interval for printing iteration status in the advanced output.

Confidence Interval. The confidence intervals for coefficients in the equations. Specify the level of the confidence interval (the default is 95%).

Logistic Regression Stepping Options

Figure 10-28 Logistic Regression Stepping Criteria

Number of terms in model (Multinomial models only). You can specify the minimum number of terms in the model for Backwards and Backwards Stepwise models, and the maximum number of terms for Forwards and Stepwise models. If you specify a minimum value greater than 0, the model will include at least that many terms, even if some of them would have been removed based on statistical criteria. The minimum setting is ignored for Forwards, Stepwise, and Enter models. If you specify a maximum, some terms may be omitted from the model even though they would have been selected based on statistical criteria. The Specify Maximum setting is ignored for Backwards, Backwards Stepwise, and Enter models.

Entry criterion (Multinomial models only). Select Score to maximize speed of processing. The Likelihood Ratio option may provide somewhat more robust estimates but takes longer to compute. The default setting is to use the Score statistic.

Removal criterion. Select Likelihood Ratio for a more robust model. To shorten the time required to build the model, you can try selecting Wald.
However, if you have complete or quasi-complete separation in the data (which you can determine by using the Advanced tab on the model nugget), the Wald statistic becomes particularly unreliable and should not be used. The default setting is to use the likelihood-ratio statistic. For binomial models, there is the additional option Conditional. This provides removal testing based on the probability of the likelihood-ratio statistic computed from conditional parameter estimates.

Significance thresholds for criteria. This option allows you to specify selection criteria based on the statistical probability (the p value) associated with each field. Fields will be added to the model only if the associated p value is smaller than the Entry value, and will be removed only if the p value is larger than the Removal value. The Entry value must be smaller than the Removal value.

Requirements for entry or removal (Multinomial models only). For some applications, it doesn't make mathematical sense to add interaction terms to the model unless the model also contains the lower-order terms for the fields involved in the interaction term. For example, it may not make sense to include A * B in the model unless A and B are also included. These options let you determine how such dependencies are handled during stepwise term selection.

Hierarchy for discrete effects. Higher-order effects (interactions involving more fields) will enter the model only if all lower-order effects (main effects or interactions involving fewer fields) for the relevant fields are already in the model, and lower-order effects will not be removed if higher-order effects involving the same fields are in the model. This option applies only to categorical fields.

Hierarchy for all effects. This option works in the same way as the previous option, except that it applies to all input fields.

Containment for all effects.
Effects can be included in the model only if all of the effects contained in the effect are also included in the model. This option is similar to the Hierarchy for all effects option except that continuous fields are treated somewhat differently. For an effect to contain another effect, the contained (lower-order) effect must include all of the continuous fields involved in the containing (higher-order) effect, and the contained effect's categorical fields must be a subset of those in the containing effect. For example, if A and B are categorical fields and X is a continuous field, the term A * B * X contains the terms A * X and B * X.

None. No relationships are enforced; terms are added to and removed from the model independently.

Logistic Model Nugget

A Logistic model nugget represents the equation estimated by a Logistic node. It contains all of the information captured by the logistic regression model, as well as information about the model structure and performance. This type of equation may also be generated by other models such as Oracle SVM.

When you run a stream containing a Logistic model nugget, the node adds two new fields containing the model's prediction and the associated probability. The names of the new fields are derived from the name of the output field being predicted, prefixed with $L- for the predicted category and $LP- for the associated probability. For example, for an output field named colorpref, the new fields would be named $L-colorpref and $LP-colorpref. In addition, if you have selected the Append all probabilities option in the Logistic node, an additional field will be added for each category of the output field, containing the probability of belonging to the corresponding category for each record. These additional fields are named based on the values of the output field, prefixed by $LP-. For example, if the legal values of colorpref are Red, Green, and Blue, three new fields will be added: $LP-Red, $LP-Green, and $LP-Blue.
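The field-naming convention can be sketched as follows. This is an illustration of the convention only, not an SPSS Modeler API; the probability values are borrowed from the Red/Green/Blue example used in this section:

```python
# Per-category probabilities for one scored record (values illustrative).
probs = {"Red": 0.6, "Green": 0.3, "Blue": 0.1}
target = "colorpref"

predicted = max(probs, key=probs.get)
fields = {
    f"$L-{target}": predicted,           # predicted category
    f"$LP-{target}": probs[predicted],   # probability of the prediction
}
# With "Append all probabilities" selected, one extra field per category:
for category, p in probs.items():
    fields[f"$LP-{category}"] = p

print(fields["$L-colorpref"])  # Red
```

Note that `$LP-colorpref` repeats whichever per-category probability belongs to the predicted category, which is why a three-category model yields five new columns in total.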
Generating a Filter node. The Generate menu allows you to create a new Filter node to pass input fields based on the results of the model. Fields that are dropped from the model due to multicollinearity are filtered out by the generated node, as are fields not used in the model.

Logistic Nugget Model Details

For multinomial models, the Model tab in a Logistic model nugget has a split display with model equations in the left pane and predictor importance on the right. For binomial models, the tab displays predictor importance only. For more information, see the topic Predictor Importance in Chapter 3 on p. 51.

Model Equations

For multinomial models, the left pane displays the actual equations estimated for the logistic regression model. There is one equation for each category in the target field, except the baseline category. The equations are displayed in a tree format. This type of equation may also be generated by certain other models such as Oracle SVM.

Figure 10-29 Logistic nugget model details with predictor importance displayed

Equation For. Shows the regression equations used to derive the target category probabilities, given a set of predictor values. The last category of the target field is considered the baseline category; the equations shown give the log-odds for the other target categories relative to the baseline category for a particular set of predictor values. The predicted probability for each category of the given predictor pattern is derived from these log-odds values.

How Probabilities Are Calculated

Each equation calculates the log-odds for a particular target category, relative to the baseline category. The log-odds, also called the logit, is the natural logarithm of the ratio of the probability for the specified target category to that of the baseline category. For the baseline category, the odds of the category relative to itself are 1.0, and thus the log-odds are 0.
You can think of this as an implicit equation for the baseline category, where all coefficients are 0. To derive the probability for a particular target category, you take the log-odds value calculated by the equation for that category and apply the formula

P(group_i) = exp(g_i) / Σ_k exp(g_k)

where g_i is the calculated log-odds for category i, and the sum over k runs from 1 to the number of target categories.

Predictor Importance

Optionally, a chart that indicates the relative importance of each predictor in estimating the model may also be displayed on the Model tab. Typically you will want to focus your modeling efforts on the predictors that matter most and consider dropping or ignoring those that matter least. Note that this chart is only available if Calculate predictor importance is selected on the Analyze tab before generating the model. For more information, see the topic Predictor Importance in Chapter 3 on p. 51.

Note: Predictor importance may take longer to calculate for logistic regression than for other types of models, and is not selected on the Analyze tab by default. Selecting this option may slow performance, particularly with large datasets.

Logistic Model Nugget Summary

The summary for a logistic regression model displays the fields and settings used to generate the model. In addition, if you have executed an Analysis node attached to this modeling node, information from that analysis will also be displayed in this section. For general information on using the model browser, see Browsing Model Nuggets on p. 49.

Figure 10-30 Logistic Regression model nugget Summary tab

Logistic Model Nugget Settings

The Settings tab in a Logistic model nugget specifies options for confidences, probabilities, propensity scores, and SQL generation during model scoring. This tab is only available after the model nugget has been added to a stream, and displays different options depending on the type of model and target.
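The log-odds-to-probability conversion described under How Probabilities Are Calculated can be checked with a small worked example. The log-odds values below are invented for illustration; the baseline category's implicit log-odds is 0, as noted above:

```python
import math

# Log-odds relative to the baseline for each target category;
# Blue is the baseline, so its implicit log-odds is 0.
log_odds = {"Red": 0.5, "Green": -1.0, "Blue": 0.0}

denom = sum(math.exp(g) for g in log_odds.values())
probs = {cat: math.exp(g) / denom for cat, g in log_odds.items()}

# The probabilities sum to 1, and the predicted category is the one
# with the highest probability.
print(max(probs, key=probs.get))  # Red
```

Applying the baseline's implicit equation (all coefficients 0) in the denominator is what makes the probabilities sum to exactly 1 across all target categories.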
Multinomial Models

For multinomial models, the following options are available:

Calculate confidences. Specifies whether confidences are calculated during scoring.

Calculate raw propensity scores (flag targets only). For models with flag targets only, you can request raw propensity scores that indicate the likelihood of the true outcome specified for the target field. These are in addition to standard prediction and confidence values. Adjusted propensity scores are not available. For more information, see the topic Modeling Node Analyze Options in Chapter 3 on p. 39.

Append all probabilities. Specifies whether probabilities for each category of the output field are added to each record processed by the node. If this option is not selected, the probability of only the predicted category is added. For a nominal target with three categories, for example, the scoring output will include a column for each of the three categories, plus a fourth column indicating the probability for whichever category is predicted. For example, if the probabilities for categories Red, Green, and Blue are 0.6, 0.3, and 0.1 respectively, the predicted category would be Red, with a probability of 0.6.

Score by converting to native SQL. If selected, generates SQL to score the model natively within the application.

Note: For multinomial models, SQL generation is unavailable if Append all probabilities has been selected, or, for models with nominal targets, if Calculate confidences has been selected. SQL generation with confidence calculations is supported for multinomial models with flag targets only. SQL generation is not available for binomial models.

Binomial Models

For binomial models, confidences and probabilities are always enabled, and the settings that would allow you to disable these options are not available. SQL generation is not available for binomial models. The only setting that can be changed for binomial models is the ability to calculate raw propensity scores.
As noted earlier for multinomial models, this applies to models with flag targets only. For more information, see the topic Modeling Node Analyze Options in Chapter 3 on p. 39.

Logistic Model Nugget Advanced Output

Figure 10-31 Sample Logistic Regression Equation node Advanced tab

The advanced output for logistic regression (also known as nominal regression) gives detailed information about the estimated model and its performance. Most of the information contained in the advanced output is quite technical, and extensive knowledge of logistic regression analysis is required to properly interpret this output.

Warnings. Indicates any warnings or potential problems with the results.

Case processing summary. Lists the number of records processed, broken down by each symbolic field in the model.

Step summary (optional). Lists the effects added or removed at each step of model creation when using automatic field selection. Note: Only shown for the Stepwise, Forwards, Backwards, or Backwards Stepwise methods.

Iteration history (optional). Shows the iteration history of parameter estimates for every n iterations, beginning with the initial estimates, where n is the value of the print interval. The default is to print every iteration (n = 1).

Model fitting information (Multinomial models). Shows the likelihood-ratio test of your model (Final) against one in which all of the parameter coefficients are 0 (Intercept Only).

Classification (optional). Shows the matrix of predicted and actual output field values, with percentages.

Goodness-of-fit chi-square statistics (optional). Shows Pearson's and likelihood-ratio chi-square statistics. These statistics test the overall fit of the model to the training data.

Hosmer and Lemeshow goodness-of-fit (optional). Shows the results of grouping cases into deciles of risk and comparing the observed probability with the expected probability within each decile.
This goodness-of-fit statistic is more robust than the traditional goodness-of-fit statistic used in multinomial models, particularly for models with continuous covariates and studies with small sample sizes.

Pseudo R-square (optional). Shows the Cox and Snell, Nagelkerke, and McFadden R-square measures of model fit. These statistics are in some ways analogous to the R-square statistic in linear regression.

Monotonicity measures (optional). Shows the number of concordant pairs, discordant pairs, and tied pairs in the data, as well as the percentage of the total number of pairs that each represents. Somers' D, Goodman and Kruskal's Gamma, Kendall's tau-a, and the Concordance Index C are also displayed in this table.

Information criteria (optional). Shows Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (BIC).

Likelihood ratio tests (optional). Shows statistical tests of whether the coefficients of the model effects are statistically different from 0. Significant input fields are those with very small significance levels in the output (labeled Sig.).

Parameter estimates (optional). Shows estimates of the equation coefficients, tests of those coefficients, odds ratios derived from the coefficients (labeled Exp(B)), and confidence intervals for the odds ratios.

Asymptotic covariance/correlation matrix (optional). Shows the asymptotic covariances and/or correlations of the coefficient estimates.

Observed and predicted frequencies (optional). For each covariate pattern, shows the observed and predicted frequencies for each output field value. This table can be quite large, especially for models with numeric input fields. If the resulting table would be too large to be practical, it is omitted and a warning is displayed.

PCA/Factor Node

The PCA/Factor node provides powerful data-reduction techniques to reduce the complexity of your data. Two similar but distinct approaches are provided.
Principal components analysis (PCA) finds linear combinations of the input fields that do the best job of capturing the variance in the entire set of fields, where the components are orthogonal (perpendicular) to each other. PCA focuses on all variance, including both shared and unique variance.

Factor analysis attempts to identify underlying concepts, or factors, that explain the pattern of correlations within a set of observed fields. Factor analysis focuses on shared variance only. Variance that is unique to specific fields is not considered in estimating the model. Several methods of factor analysis are provided by the PCA/Factor node.

For both approaches, the goal is to find a small number of derived fields that effectively summarize the information in the original set of fields.

Requirements. Only numeric fields can be used in a PCA/Factor model. To estimate a factor analysis or PCA, you need one or more fields with the role set to Input. Fields with the role set to Target, Both, or None are ignored, as are non-numeric fields.

Strengths. Factor analysis and PCA can effectively reduce the complexity of your data without sacrificing much of the information content. These techniques can help you build more robust models that execute more quickly than would be possible with the raw input fields.

PCA/Factor Node Model Options

Figure 10-32 PCA/Factor Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.

Extraction Method. Specify the method to be used for data reduction.

Principal Components. This is the default method, which uses PCA to find components that summarize the input fields.

Unweighted Least Squares.
This factor analysis method works by finding the set of factors that is best able to reproduce the pattern of relationships (correlations) among the input fields.

Generalized Least Squares. This factor analysis method is similar to unweighted least squares, except that it uses weighting to de-emphasize fields with a lot of unique (unshared) variance.

Maximum Likelihood. This factor analysis method produces factor equations that are most likely to have produced the observed pattern of relationships (correlations) in the input fields, based on assumptions about the form of those relationships. Specifically, the method assumes that the training data follow a multivariate normal distribution.

Principal Axis Factoring. This factor analysis method is very similar to the principal components method, except that it focuses on shared variance only.

Alpha Factoring. This factor analysis method considers the fields in the analysis to be a sample from the universe of potential input fields. It maximizes the statistical reliability of the factors.

Image Factoring. This factor analysis method uses data estimation to isolate the common variance and find factors that describe it.

PCA/Factor Node Expert Options

If you have detailed knowledge of factor analysis and PCA, expert options allow you to fine-tune the training process. To access expert options, set Mode to Expert on the Expert tab.

Figure 10-33 PCA/Factor Expert tab

Missing values. By default, IBM® SPSS® Modeler only uses records that have valid values for all fields used in the model. (This is sometimes called listwise deletion of missing values.) If you have a lot of missing data, you may find that this approach eliminates too many records, leaving you without enough data to generate a good model. In such cases, you can deselect the Only use complete records option.
SPSS Modeler then attempts to use as much information as possible to estimate the model, including records where some of the fields have missing values. (This is sometimes called pairwise deletion of missing values.) However, in some situations, using incomplete records in this manner can lead to computational problems in estimating the model.

Fields. Specify whether to use the correlation matrix (the default) or the covariance matrix of the input fields in estimating the model.

Maximum iterations for convergence. Specify the maximum number of iterations for estimating the model.

Extract factors. There are two ways to select the number of factors to extract from the input fields.

Eigenvalues over. This option retains all factors or components with eigenvalues larger than the specified criterion. Eigenvalues measure the ability of each factor or component to summarize variance in the set of input fields. When using the correlation matrix, the model retains all factors or components with eigenvalues greater than the specified value. When using the covariance matrix, the criterion is the specified value times the mean eigenvalue. That scaling gives this option a similar meaning for both types of matrix.

Maximum number. This option retains the specified number of factors or components in descending order of eigenvalues. In other words, the factors or components corresponding to the n highest eigenvalues are retained, where n is the specified criterion. The default extraction criterion is five factors/components.

Component/factor matrix format. These options control the format of the factor matrix (or component matrix for PCA models).

Sort values. If this option is selected, factor loadings in the model output will be sorted numerically.

Hide values below. If this option is selected, loadings below the specified threshold will be hidden to make it easier to see the pattern in the matrix.

Rotation.
These options allow you to control the rotation method for the model. For more information, see the topic PCA/Factor Node Rotation Options on p. 280.

PCA/Factor Node Rotation Options

Figure 10-34 PCA/Factor Rotation options

In many cases, mathematically rotating the set of retained factors can increase their usefulness and especially their interpretability. Select a rotation method:

No rotation. The default option. No rotation is used.

Varimax. An orthogonal rotation method that minimizes the number of fields with high loadings on each factor. It simplifies the interpretation of the factors.

Direct oblimin. A method for oblique (non-orthogonal) rotation. When Delta equals 0 (the default), solutions are oblique. As delta becomes more negative, the factors become less oblique. To override the default delta of 0, enter a number less than or equal to 0.8.

Quartimax. An orthogonal method that minimizes the number of factors needed to explain each field. It simplifies the interpretation of the observed fields.

Equamax. A rotation method that is a combination of the Varimax method, which simplifies the factors, and the Quartimax method, which simplifies the fields. The number of fields that load highly on a factor and the number of factors needed to explain a field are minimized.

Promax. An oblique rotation, which allows factors to be correlated. It can be calculated more quickly than a direct oblimin rotation, so it can be useful for large datasets. Kappa controls the obliqueness of the solution (the extent to which factors can be correlated).

PCA/Factor Model Nugget

A PCA/Factor model nugget represents the factor analysis or principal component analysis (PCA) model created by a PCA/Factor node. Model nuggets contain all of the information captured by the trained model, as well as information about the model’s performance and characteristics.
When you run a stream containing a factor equation model, the node adds a new field for each factor or component in the model. The new field names are derived from the model name, prefixed by $F- and suffixed by -n, where n is the number of the factor or component. For example, if your model is named Factor and contains three factors, the new fields would be named $F-Factor-1, $F-Factor-2, and $F-Factor-3.

To get a better sense of what the factor model has encoded, you can do some further downstream analysis. A useful way to view the result of the factor model is to view the correlations between factors and input fields using a Statistics node. This shows you which input fields load heavily on which factors and can help you discover whether your factors have any underlying meaning or interpretation.

You can also assess the factor model by using the information available in the advanced output. To view the advanced output, click the Advanced tab of the model nugget browser. The advanced output contains a lot of detailed information and is meant for users with extensive knowledge of factor analysis or PCA. For more information, see the topic PCA/Factor Model Nugget Advanced Output on p. 284.

PCA/Factor Model Nugget Equations

The Model tab for a Factor model nugget displays the factor score equation for each factor. Factor or component scores are calculated by multiplying each input field value by its coefficient and summing the results.

Figure 10-35 PCA/Factor nugget Model tab

PCA/Factor Model Nugget Summary

The Summary tab for a factor model displays the number of factors retained in the factor/PCA model, along with additional information on the fields and settings used to generate the model. For more information, see the topic Browsing Model Nuggets in Chapter 3 on p. 49.
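The score computation shown on the Model tab (multiply each input field value by its coefficient and sum the results) can be sketched in Python. The coefficient values and field names here are invented for illustration; real coefficients come from the nugget, and SPSS Modeler applies them to the prepared input values automatically when the stream runs.

```python
# Hypothetical score coefficients for a model named "Factor" with two factors.
# In SPSS Modeler, these values appear on the nugget's Model tab.
coefficients = {
    "$F-Factor-1": {"age": 0.42, "income": 0.51, "debt": -0.08},
    "$F-Factor-2": {"age": -0.10, "income": 0.05, "debt": 0.61},
}

def factor_scores(record):
    """Multiply each input field value by its coefficient and sum the results."""
    return {name: sum(coef[field] * record[field] for field in coef)
            for name, coef in coefficients.items()}

# One scored record; each factor gets its own derived field.
print(factor_scores({"age": 1.2, "income": 0.4, "debt": -0.5}))
```

This mirrors the field-naming convention described above: one new `$F-`-prefixed field per factor or component.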
Figure 10-36 Sample Factor Equation node Summary tab

PCA/Factor Model Nugget Advanced Output

Figure 10-37 Sample Factor Equation node Advanced tab

The advanced output for factor analysis gives detailed information on the estimated model and its performance. Most of the information contained in the advanced output is quite technical, and extensive knowledge of factor analysis is required to properly interpret this output.

Warnings. Indicates any warnings or potential problems with the results.

Communalities. Shows the proportion of each field’s variance that is accounted for by the factors or components. Initial gives the initial communalities with the full set of factors (the model starts with as many factors as input fields), and Extraction gives the communalities based on the retained set of factors.

Total variance explained. Shows the total variance explained by the factors in the model. Initial Eigenvalues shows the variance explained by the full set of initial factors. Extraction Sums of Squared Loadings shows the variance explained by factors retained in the model. Rotation Sums of Squared Loadings shows the variance explained by the rotated factors. Note that for oblique rotations, Rotation Sums of Squared Loadings shows only the sums of squared loadings and does not show variance percentages.

Factor (or component) matrix. Shows correlations between input fields and unrotated factors.

Rotated factor (or component) matrix. Shows correlations between input fields and rotated factors for orthogonal rotations.

Pattern matrix. Shows the partial correlations between input fields and rotated factors for oblique rotations.

Structure matrix. Shows the simple correlations between input fields and rotated factors for oblique rotations.

Factor correlation matrix. Shows correlations among factors for oblique rotations.

Discriminant Node

Discriminant analysis builds a predictive model for group membership.
The model is composed of a discriminant function (or, for more than two groups, a set of discriminant functions) based on linear combinations of the predictor variables that provide the best discrimination between the groups. The functions are generated from a sample of cases for which group membership is known; the functions can then be applied to new cases that have measurements for the predictor variables but have unknown group membership.

Example. A telecommunications company can use discriminant analysis to classify customers into groups based on usage data. This allows them to score potential customers and target those who are most likely to be in the most valuable groups.

Requirements. You need one or more input fields and exactly one target field. The target must be a categorical field (with a measurement level of Flag or Nominal) with string or integer storage. (Storage can be converted using a Filler or Derive node if necessary.) Fields set to Both or None are ignored. Fields used in the model must have their types fully instantiated.

Strengths. Discriminant analysis and Logistic Regression are both suitable classification models. However, Discriminant analysis makes more assumptions about the input fields (for example, that they are normally distributed, and they should be continuous), and it can give better results than Logistic Regression if those requirements are met, especially when the sample size is small.

Discriminant Node Model Options

Figure 10-38 Discriminant node dialog box, Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.

Create split models. Builds a separate model for each possible value of input fields that are specified as split fields.
For more information, see the topic Building Split Models in Chapter 3 on p. 30.

Method. The following options are available for entering predictors into the model:

Enter. This is the default method, which enters all of the terms into the equation directly, whether or not they add significantly to the predictive power of the model.

Stepwise. The initial model is the simplest model possible, with no model terms (except the constant) in the equation. At each step, terms that have not yet been added to the model are evaluated, and if the best of those terms adds significantly to the predictive power of the model, it is added.

Note: The Stepwise method has a strong tendency to overfit the training data. When using this method, it is especially important to verify the validity of the resulting model with a hold-out test sample or new data.

Discriminant Node Expert Options

If you have detailed knowledge of discriminant analysis, expert options allow you to fine-tune the training process. To access expert options, set Mode to Expert on the Expert tab.

Figure 10-39 Discriminant node dialog box, Expert tab

Prior Probabilities. This option determines whether the classification coefficients are adjusted for a priori knowledge of group membership.

All groups equal. Equal prior probabilities are assumed for all groups; this has no effect on the coefficients.

Compute from group sizes. The observed group sizes in your sample determine the prior probabilities of group membership. For example, if 50% of the observations included in the analysis fall into the first group, 25% in the second, and 25% in the third, the classification coefficients are adjusted to increase the likelihood of membership in the first group relative to the other two.

Use Covariance Matrix. You can choose to classify cases using a within-groups covariance matrix or a separate-groups covariance matrix.

Within-groups.
The pooled within-groups covariance matrix is used to classify cases.

Separate-groups. Separate-groups covariance matrices are used for classification. Because classification is based on the discriminant functions (not on the original variables), this option is not always equivalent to quadratic discrimination.

Output. These options allow you to request additional statistics that will be displayed in the advanced output of the model nugget built by the node. For more information, see the topic Discriminant Node Output Options on p. 288.

Stepping. These options allow you to control the criteria for adding and removing fields with the Stepwise estimation method. (The button is disabled if the Enter method is selected.) For more information, see the topic Discriminant Node Stepping Options on p. 290.

Discriminant Node Output Options

Figure 10-40 Discriminant node Advanced Output options

Select the optional output you want to display in the advanced output of the discriminant model nugget. To view the advanced output, browse the model nugget and click the Advanced tab. For more information, see the topic Discriminant Model Nugget Advanced Output on p. 292.

Descriptives. Available options are means (including standard deviations), univariate ANOVAs, and Box’s M test.

Means. Displays total and group means, as well as standard deviations for the independent variables.

Univariate ANOVAs. Performs a one-way analysis-of-variance test for equality of group means for each independent variable.

Box's M. A test for the equality of the group covariance matrices. For sufficiently large samples, a nonsignificant p value means there is insufficient evidence that the matrices differ. The test is sensitive to departures from multivariate normality.

Function Coefficients. Available options are Fisher’s classification coefficients and unstandardized coefficients.

Fisher's.
Displays Fisher's classification function coefficients that can be used directly for classification. A separate set of classification function coefficients is obtained for each group, and a case is assigned to the group for which it has the largest discriminant score (classification function value).

Unstandardized. Displays the unstandardized discriminant function coefficients.

Matrices. Available matrices of coefficients for independent variables are within-groups correlation matrix, within-groups covariance matrix, separate-groups covariance matrix, and total covariance matrix.

Within-groups correlation. Displays a pooled within-groups correlation matrix that is obtained by averaging the separate covariance matrices for all groups before computing the correlations.

Within-groups covariance. Displays a pooled within-groups covariance matrix, which may differ from the total covariance matrix. The matrix is obtained by averaging the separate covariance matrices for all groups.

Separate-groups covariance. Displays separate covariance matrices for each group.

Total covariance. Displays a covariance matrix from all cases as if they were from a single sample.

Classification. The following output pertains to the classification results.

Casewise results. Codes for actual group, predicted group, posterior probabilities, and discriminant scores are displayed for each case.

Summary table. The number of cases correctly and incorrectly assigned to each of the groups based on the discriminant analysis. Sometimes called the "Confusion Matrix."

Leave-one-out classification. Each case in the analysis is classified by the functions derived from all cases other than that case. It is also known as the "U-method."

Territorial map. A plot of the boundaries used to classify cases into groups based on function values. The numbers correspond to groups into which cases are classified. The mean for each group is indicated by an asterisk within its boundaries.
The map is not displayed if there is only one discriminant function.

Combined-groups. Creates an all-groups scatterplot of the first two discriminant function values. If there is only one function, a histogram is displayed instead.

Separate-groups. Creates separate-group scatterplots of the first two discriminant function values. If there is only one function, histograms are displayed instead.

Stepwise. Summary of Steps displays statistics for all variables after each step; F for pairwise distances displays a matrix of pairwise F ratios for each pair of groups. The F ratios can be used for significance tests of the Mahalanobis distances between groups.

Discriminant Node Stepping Options

Figure 10-41 Discriminant node Stepwise Method options

Method. Select the statistic to be used for entering or removing new variables. Available alternatives are Wilks’ lambda, unexplained variance, Mahalanobis distance, smallest F ratio, and Rao’s V. With Rao’s V, you can specify the minimum increase in V for a variable to enter.

Wilks' lambda. A variable selection method for stepwise discriminant analysis that chooses variables for entry into the equation on the basis of how much they lower Wilks' lambda. At each step, the variable that minimizes the overall Wilks' lambda is entered.

Unexplained variance. At each step, the variable that minimizes the sum of the unexplained variation between groups is entered.

Mahalanobis distance. A measure of how much a case's values on the independent variables differ from the average of all cases. A large Mahalanobis distance identifies a case as having extreme values on one or more of the independent variables.

Smallest F ratio. A method of variable selection in stepwise analysis based on maximizing an F ratio computed from the Mahalanobis distance between groups.

Rao's V. A measure of the differences between group means. Also called the Lawley-Hotelling trace.
At each step, the variable that maximizes the increase in Rao's V is entered. After selecting this option, enter the minimum value a variable must have to enter the analysis.

Criteria. Available alternatives are Use F value and Use probability of F. Enter values for entering and removing variables.

Use F value. A variable is entered into the model if its F value is greater than the Entry value and is removed if the F value is less than the Removal value. Entry must be greater than Removal, and both values must be positive. To enter more variables into the model, lower the Entry value. To remove more variables from the model, increase the Removal value.

Use probability of F. A variable is entered into the model if the significance level of its F value is less than the Entry value and is removed if the significance level is greater than the Removal value. Entry must be less than Removal, and both values must be positive. To enter more variables into the model, increase the Entry value. To remove more variables from the model, lower the Removal value.

Discriminant Model Nugget

Discriminant model nuggets represent the equations estimated by Discriminant nodes. They contain all of the information captured by the discriminant model, as well as information about the model structure and performance.

When you run a stream containing a Discriminant model nugget, the node adds two new fields containing the model’s prediction and the associated probability. The names of the new fields are derived from the name of the output field being predicted, prefixed with $D- for the predicted category and $DP- for the associated probability. For example, for an output field named colorpref, the new fields would be named $D-colorpref and $DP-colorpref.

Generating a Filter node. The Generate menu allows you to create a new Filter node to pass input fields based on the results of the model.
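The scoring behavior described above (a predicted group in $D-colorpref and its probability in $DP-colorpref) can be sketched in Python using Fisher-style classification functions. The coefficients here are invented for illustration, and deriving probabilities by exponentiating and normalizing the classification function values is an assumption of this sketch; the real values are computed by the nugget.

```python
import math

# Hypothetical Fisher classification function coefficients for a target
# named colorpref; real coefficients appear in the nugget's advanced output.
functions = {
    "red":  {"const": -1.5, "age": 0.8, "income": 0.3},
    "blue": {"const": -0.7, "age": 0.2, "income": 0.6},
}

def score(record):
    """Return the $D-colorpref and $DP-colorpref values for one record."""
    # Classification function value for each group: constant plus the
    # weighted sum of the input field values.
    values = {g: c["const"] + sum(c[f] * record[f] for f in record)
              for g, c in functions.items()}
    # The case is assigned to the group with the largest function value;
    # normalizing the exponentiated values gives a probability for it.
    total = sum(math.exp(v) for v in values.values())
    predicted = max(values, key=values.get)
    return predicted, math.exp(values[predicted]) / total

d_field, dp_field = score({"age": 2.0, "income": 1.0})
print(d_field, dp_field)
```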
Predictor Importance

Optionally, a chart that indicates the relative importance of each predictor in estimating the model may also be displayed on the Model tab. Typically you will want to focus your modeling efforts on the predictors that matter most and consider dropping or ignoring those that matter least. Note that this chart is only available if Calculate predictor importance is selected on the Analyze tab before generating the model. For more information, see the topic Predictor Importance in Chapter 3 on p. 51.

Discriminant Model Nugget Advanced Output

Figure 10-42 Discriminant model nugget, Advanced tab

The advanced output for discriminant analysis gives detailed information about the estimated model and its performance. Most of the information contained in the advanced output is quite technical, and extensive knowledge of discriminant analysis is required to properly interpret this output. For more information, see the topic Discriminant Node Output Options on p. 288.

Discriminant Model Nugget Settings

The Settings tab in a Discriminant model nugget allows you to obtain propensity scores when scoring the model. This tab is available for models with flag targets only, and only after the model nugget has been added to a stream.

Figure 10-43 Discriminant model nugget, Settings tab for a flag target

Calculate raw propensity scores. For models with a flag target (which return a yes or no prediction), you can request propensity scores that indicate the likelihood of the true outcome specified for the target field. These are in addition to other prediction and confidence values that may be generated during scoring.

Calculate adjusted propensity scores. Raw propensity scores are based only on the training data and may be overly optimistic due to the tendency of many models to overfit this data. Adjusted propensities attempt to compensate by evaluating model performance against a test or validation partition.
This option requires that a partition field be defined in the stream and adjusted propensity scores be enabled in the modeling node before generating the model.

Discriminant Model Nugget Summary

The Summary tab for a Discriminant model nugget displays the fields and settings used to generate the model. In addition, if you have executed an Analysis node attached to this modeling node, information from that analysis will also be displayed in this section. For general information on using the model browser, see Browsing Model Nuggets on p. 49.

Figure 10-44 Discriminant model nugget, Summary tab

GenLin Node

The generalized linear model expands the general linear model so that the dependent variable is linearly related to the factors and covariates via a specified link function. Moreover, the model allows for the dependent variable to have a non-normal distribution. It covers widely used statistical models, such as linear regression for normally distributed responses, logistic models for binary data, loglinear models for count data, complementary log-log models for interval-censored survival data, plus many other statistical models through its very general model formulation.

Examples. A shipping company can use generalized linear models to fit a Poisson regression to damage counts for several types of ships constructed in different time periods, and the resulting model can help determine which ship types are most prone to damage. A car insurance company can use generalized linear models to fit a gamma regression to damage claims for cars, and the resulting model can help determine the factors that contribute the most to claim size. Medical researchers can use generalized linear models to fit a complementary log-log regression to interval-censored survival data to predict the time to recurrence for a medical condition.

Generalized linear models work by building an equation that relates the input field values to the output field values.
Once the model is generated, it can be used to estimate values for new data. For categorical targets, a probability of membership is computed for each possible output category for each record, and the category with the highest probability is assigned as the predicted output value for that record.

Requirements. You need one or more input fields and exactly one target field, which can have a measurement level of Continuous or Flag. Fields used in the model must have their types fully instantiated.

Strengths. The generalized linear model is extremely flexible, but the process of choosing the model structure is not automated and thus demands a level of familiarity with your data that is not required by “black box” algorithms.

GenLin Node Field Options

Figure 10-45 GenLin node dialog box, Fields tab

In addition to the target, input, and partition custom options typically offered on modeling node Fields tabs (see Modeling Node Fields Options on p. 35), the GenLin node offers the following extra functionality.

Use weight field. The scale parameter is an estimated model parameter related to the variance of the response. The scale weights are “known” values that can vary from observation to observation. If the scale weight variable is specified, the scale parameter, which is related to the variance of the response, is divided by it for each observation. Records with scale weight values that are less than or equal to 0 or are missing are not used in the analysis.

Target field represents number of events occurring in a set of trials. When the response is a number of events occurring in a set of trials, the target field contains the number of events and you can select an additional variable containing the number of trials. Alternatively, if the number of trials is the same across all subjects, then trials may be specified using a fixed value.
The number of trials should be greater than or equal to the number of events for each record. Events should be non-negative integers, and trials should be positive integers.

GenLin Node Model Options

Figure 10-46 GenLin node dialog box, Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.

Create split models. Builds a separate model for each possible value of input fields that are specified as split fields. For more information, see the topic Building Split Models in Chapter 3 on p. 30.

Model type. There are two options for the type of model to build. Main effects only causes the model to include only the input fields individually, and not to test interactions (multiplicative effects) between input fields. Main effects and all two-way interactions includes all two-way interactions as well as the input field main effects.

Offset. The offset term is a “structural” predictor. Its coefficient is not estimated by the model but is assumed to have the value 1; thus, the values of the offset are simply added to the linear predictor of the target. This is especially useful in Poisson regression models, where each case may have different levels of exposure to the event of interest. For example, when modeling accident rates for individual drivers, there is an important difference between a driver who has been at fault in one accident in three years of experience and a driver who has been at fault in one accident in 25 years! The number of accidents can be modeled as a Poisson or negative binomial response with a log link if the natural log of the experience of the driver is included as an offset term.
Other combinations of distribution and link types would require other transformations of the offset variable.

Note: If a variable offset field is used, the specified field should not also be used as an input. Set the role for the offset field to None in an upstream source or Type node if necessary.

Base category for flag target. For a binary response, you can choose the reference category for the dependent variable. This can affect certain output, such as parameter estimates and saved values, but it should not change the model fit. For example, if your binary response takes values 0 and 1:

By default, the procedure makes the last (highest-valued) category, or 1, the reference category. In this situation, model-saved probabilities estimate the chance that a given case takes the value 0, and parameter estimates should be interpreted as relating to the likelihood of category 0.

If you specify the first (lowest-valued) category, or 0, as the reference category, then model-saved probabilities estimate the chance that a given case takes the value 1.

If you specify the custom category and your variable has defined labels, you can set the reference category by choosing a value from the list. This can be convenient when, in the middle of specifying a model, you don’t remember exactly how a particular variable was coded.

Include intercept in model. The intercept is usually included in the model. If you can assume the data pass through the origin, you can exclude the intercept.

GenLin Node Expert Options

If you have detailed knowledge of generalized linear models, expert options allow you to fine-tune the training process. To access expert options, set Mode to Expert on the Expert tab.

Figure 10-47 GenLin node dialog box, Expert tab

Target Field Distribution and Link Function

Distribution. This selection specifies the distribution of the dependent variable.
The ability to specify a non-normal distribution and non-identity link function is the essential improvement of the generalized linear model over the general linear model. There are many possible distribution-link function combinations, and several may be appropriate for any given dataset, so your choice can be guided by a priori theoretical considerations or by which combination seems to fit best.

Binomial. This distribution is appropriate only for variables that represent a binary response or a number of events.

Gamma. This distribution is appropriate for variables with positive scale values that are skewed toward larger positive values. If a data value is less than or equal to 0 or is missing, then the corresponding case is not used in the analysis.

Inverse Gaussian. This distribution is appropriate for variables with positive scale values that are skewed toward larger positive values. If a data value is less than or equal to 0 or is missing, then the corresponding case is not used in the analysis.

Negative binomial. This distribution can be thought of as the number of trials required to observe k successes and is appropriate for variables with non-negative integer values. If a data value is non-integer, less than 0, or missing, then the corresponding case is not used in the analysis. The fixed value of the negative binomial distribution’s ancillary parameter can be any number greater than or equal to 0. When the ancillary parameter is set to 0, using this distribution is equivalent to using the Poisson distribution.

Normal. This is appropriate for scale variables whose values take a symmetric, bell-shaped distribution about a central (mean) value. The dependent variable must be numeric.

Poisson. This distribution can be thought of as the number of occurrences of an event of interest in a fixed period of time and is appropriate for variables with non-negative integer values.
If a data value is non-integer, less than 0, or missing, then the corresponding case is not used in the analysis.

Tweedie. This distribution is appropriate for variables that can be represented by Poisson mixtures of gamma distributions; the distribution is “mixed” in the sense that it combines properties of continuous (takes non-negative real values) and discrete distributions (positive probability mass at a single value, 0). The dependent variable must be numeric, with data values greater than or equal to zero. If a data value is less than zero or missing, then the corresponding case is not used in the analysis. The fixed value of the Tweedie distribution’s parameter can be any number greater than one and less than two.

Multinomial. This distribution is appropriate for variables that represent an ordinal response. The dependent variable can be numeric or string, and it must have at least two distinct valid data values.

Link Functions

The link function is a transformation of the dependent variable that allows estimation of the model. The following functions are available:

Identity. f(x)=x. The dependent variable is not transformed. This link can be used with any distribution.

Complementary log-log. f(x)=log(−log(1−x)). This is appropriate only with the binomial distribution.

Cumulative Cauchit. f(x)=tan(π(x−0.5)), applied to the cumulative probability of each category of the response. This is appropriate only with the multinomial distribution.

Cumulative complementary log-log. f(x)=log(−log(1−x)), applied to the cumulative probability of each category of the response. This is appropriate only with the multinomial distribution.

Cumulative logit. f(x)=log(x / (1−x)), applied to the cumulative probability of each category of the response. This is appropriate only with the multinomial distribution.

Cumulative negative log-log. f(x)=−log(−log(x)), applied to the cumulative probability of each category of the response.
This is appropriate only with the multinomial distribution.

Cumulative probit. f(x)=Φ^−1(x), applied to the cumulative probability of each category of the response, where Φ^−1 is the inverse standard normal cumulative distribution function. This is appropriate only with the multinomial distribution.

Log. f(x)=log(x). This link can be used with any distribution.

Log complement. f(x)=log(1−x). This is appropriate only with the binomial distribution.

Logit. f(x)=log(x / (1−x)). This is appropriate only with the binomial distribution.

Negative binomial. f(x)=log(x / (x+k−1)), where k is the ancillary parameter of the negative binomial distribution. This is appropriate only with the negative binomial distribution.

Negative log-log. f(x)=−log(−log(x)). This is appropriate only with the binomial distribution.

Odds power. f(x)=[(x/(1−x))^α − 1]/α, if α ≠ 0. f(x)=log(x), if α=0. α is the required number specification and must be a real number. This is appropriate only with the binomial distribution.

Probit. f(x)=Φ^−1(x), where Φ^−1 is the inverse standard normal cumulative distribution function. This is appropriate only with the binomial distribution.

Power. f(x)=x^α, if α ≠ 0. f(x)=log(x), if α=0. α is the required number specification and must be a real number. This link can be used with any distribution.

Parameters. The controls in this group allow you to specify parameter values when certain distribution options are chosen.

Parameter for negative binomial. For the negative binomial distribution, choose either to specify a value or to allow the system to provide an estimated value.

Parameter for Tweedie. For the Tweedie distribution, specify a number between 1.0 and 2.0 for the fixed value.

Parameter Estimation. The controls in this group allow you to specify estimation methods and to provide initial values for the parameter estimates.

Method. You can select a parameter estimation method.
Choose between Newton-Raphson, Fisher scoring, or a hybrid method in which Fisher scoring iterations are performed before switching to the Newton-Raphson method. If convergence is achieved during the Fisher scoring phase of the hybrid method before the maximum number of Fisher iterations is reached, the algorithm continues with the Newton-Raphson method.

Scale parameter method. You can select the scale parameter estimation method. Maximum-likelihood jointly estimates the scale parameter with the model effects; note that this option is not valid if the response has a negative binomial, Poisson, or binomial distribution. The deviance and Pearson chi-square options estimate the scale parameter from the value of those statistics. Alternatively, you can specify a fixed value for the scale parameter.

Covariance matrix. The model-based estimator is the negative of the generalized inverse of the Hessian matrix. The robust (also called the Huber/White/sandwich) estimator is a “corrected” model-based estimator that provides a consistent estimate of the covariance, even when the specification of the variance and link functions is incorrect.

Iterations. These options allow you to control the parameters for model convergence. For more information, see the topic Generalized Linear Models Iterations on p. 301.

Output. These options allow you to request additional statistics that will be displayed in the advanced output of the model nugget built by the node. For more information, see the topic Generalized Linear Models Advanced Output on p. 302.

Singularity tolerance. Singular (or non-invertible) matrices have linearly dependent columns, which can cause serious problems for the estimation algorithm. Even near-singular matrices can lead to poor results, so the procedure will treat a matrix whose determinant is less than the tolerance as singular. Specify a positive value.
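As a minimal illustration of Newton-Raphson estimation (a Python sketch with made-up data, not SPSS Modeler's implementation), consider a Poisson model with only an intercept b, so the mean is exp(b). The score and Hessian have closed forms, and the stopping rule mirrors the parameter-convergence criterion described in the Iterations options:

```python
import math

def fit_poisson_intercept(counts, tol=1e-8, max_iter=25):
    """Newton-Raphson for the log-mean b of a Poisson sample.

    Log-likelihood (dropping constants): l(b) = sum(y)*b - n*exp(b)
    Score:    U(b) = sum(y) - n*exp(b)
    Hessian:  H(b) = -n*exp(b)
    """
    n, s = len(counts), sum(counts)
    b = 0.0                              # initial estimate
    for _ in range(max_iter):
        score = s - n * math.exp(b)
        hessian = -n * math.exp(b)
        step = -score / hessian          # Newton-Raphson step
        b += step
        if abs(step) < tol:              # parameter-convergence criterion
            break
    return b

b_hat = fit_poisson_intercept([2, 0, 3, 1, 2])
# The maximum-likelihood estimate is log(mean), here log(8/5).
```

For this model the score uses only first derivatives of the log-likelihood and the Hessian the second; Fisher scoring replaces the observed Hessian with its expected value, which coincides with it here.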
Generalized Linear Models Iterations

You can set the convergence parameters for estimating the generalized linear model.

Figure 10-48 Generalized Linear Models Iterations options

Iterations.

Maximum iterations. The maximum number of iterations the algorithm will execute. Specify a non-negative integer.

Maximum step-halving. At each iteration, the step size is reduced by a factor of 0.5 until the log-likelihood increases or the maximum number of step-halvings is reached. Specify a positive integer.

Check for separation of data points. When selected, the algorithm performs tests to ensure that the parameter estimates have unique values. Separation occurs when the procedure can produce a model that correctly classifies every case. This option is available for binomial responses with binary format.

Convergence Criteria.

Parameter convergence. When selected, the algorithm stops after an iteration in which the absolute or relative change in the parameter estimates is less than the value specified, which must be positive.

Log-likelihood convergence. When selected, the algorithm stops after an iteration in which the absolute or relative change in the log-likelihood function is less than the value specified, which must be positive.

Hessian convergence. For the Absolute specification, convergence is assumed if a statistic based on the Hessian convergence is less than the positive value specified. For the Relative specification, convergence is assumed if the statistic is less than the product of the positive value specified and the absolute value of the log-likelihood.

Generalized Linear Models Advanced Output

Figure 10-49 Generalized Linear Models Advanced Output options

Select the optional output you want to display in the advanced output of the generalized linear model nugget. To view the advanced output, browse the model nugget and click the Advanced tab. For more information, see the topic GenLin Model Nugget Advanced Output on p. 305.
The following output is available:

Case processing summary. Displays the number and percentage of cases included in and excluded from the analysis, and the Correlated Data Summary table.

Descriptive statistics. Displays descriptive statistics and summary information about the dependent variable, covariates, and factors.

Model information. Displays the dataset name, dependent variable or events and trials variables, offset variable, scale weight variable, probability distribution, and link function.

Goodness of fit statistics. Displays deviance and scaled deviance, Pearson chi-square and scaled Pearson chi-square, log-likelihood, Akaike’s information criterion (AIC), finite sample corrected AIC (AICC), Bayesian information criterion (BIC), and consistent AIC (CAIC).

Model summary statistics. Displays model fit tests, including likelihood-ratio statistics for the model fit omnibus test and statistics for the Type I or III contrasts for each effect.

Parameter estimates. Displays parameter estimates and corresponding test statistics and confidence intervals. You can optionally display exponentiated parameter estimates in addition to the raw parameter estimates.

Covariance matrix for parameter estimates. Displays the estimated parameter covariance matrix.

Correlation matrix for parameter estimates. Displays the estimated parameter correlation matrix.

Contrast coefficient (L) matrices. Displays contrast coefficients for the default effects and for the estimated marginal means, if requested on the EM Means tab.

General estimable functions. Displays the matrices for generating the contrast coefficient (L) matrices.

Iteration history. Displays the iteration history for the parameter estimates and log-likelihood and prints the last evaluation of the gradient vector and the Hessian matrix.
The iteration history table displays parameter estimates for every nth iteration beginning with the 0th iteration (the initial estimates), where n is the value of the print interval. If the iteration history is requested, then the last iteration is always displayed regardless of n.

Lagrange multiplier test. Displays Lagrange multiplier test statistics for assessing the validity of a scale parameter that is computed using the deviance or Pearson chi-square, or set at a fixed number, for the normal, gamma, and inverse Gaussian distributions. For the negative binomial distribution, this tests the fixed ancillary parameter.

Model Effects.

Analysis type. Specify the type of analysis to produce. Type I analysis is generally appropriate when you have a priori reasons for ordering predictors in the model, while Type III is more generally applicable. Wald or likelihood-ratio statistics are computed based upon the selection in the Chi-Square Statistics group.

Confidence intervals. Specify a confidence level greater than 50 and less than 100. Wald intervals are based on the assumption that parameters have an asymptotic normal distribution; profile likelihood intervals are more accurate but can be computationally expensive. The tolerance level for profile likelihood intervals is the criterion used to stop the iterative algorithm that computes the intervals.

Log-likelihood function. This controls the display format of the log-likelihood function. The full function includes an additional term that is constant with respect to the parameter estimates; it has no effect on parameter estimation and is left out of the display in some software products.

GenLin Model Nugget

A GenLin model nugget represents the equations estimated by a GenLin node. It contains all of the information captured by the model, as well as information about the model structure and performance.
When you run a stream containing a GenLin model nugget, the node adds new fields whose contents depend on the nature of the target field:

Flag target. Adds fields containing the predicted category and associated probability and the probabilities for each category. The names of the first two new fields are derived from the name of the output field being predicted, prefixed with $G- for the predicted category and $GP- for the associated probability. For example, for an output field named default, the new fields would be named $G-default and $GP-default. The latter two additional fields are named based on the values of the output field, prefixed by $GP-. For example, if the legal values of default are Yes and No, the new fields would be named $GP-Yes and $GP-No.

Continuous target. Adds fields containing the predicted mean and standard error.

Continuous target, representing number of events in a series of trials. Adds fields containing the predicted mean and standard error.

Ordinal target. Adds fields containing the predicted category and associated probability for each value of the ordered set. The names of the fields are derived from the value of the ordered set being predicted, prefixed with $G- for the predicted category and $GP- for the associated probability.

Generating a Filter node. The Generate menu allows you to create a new Filter node to pass input fields based on the results of the model.

Predictor Importance

Optionally, a chart that indicates the relative importance of each predictor in estimating the model may also be displayed on the Model tab. Typically you will want to focus your modeling efforts on the predictors that matter most and consider dropping or ignoring those that matter least. Note that this chart is only available if Calculate predictor importance is selected on the Analyze tab before generating the model. For more information, see the topic Predictor Importance in Chapter 3 on p. 51.
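The field-naming scheme described above for flag targets can be sketched as a small helper (the function itself is illustrative, not part of the product; the field names follow the default example in the text):

```python
def genlin_score_fields(target, categories):
    """Names of the fields a GenLin nugget adds when scoring a flag target."""
    fields = [f"$G-{target}", f"$GP-{target}"]          # prediction + its probability
    fields += [f"$GP-{value}" for value in categories]  # per-category probabilities
    return fields

names = genlin_score_fields("default", ["Yes", "No"])
# → ['$G-default', '$GP-default', '$GP-Yes', '$GP-No']
```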
GenLin Model Nugget Advanced Output

Figure 10-50 GenLin model nugget, Advanced tab

The advanced output for a generalized linear model gives detailed information about the estimated model and its performance. Most of the information contained in the advanced output is quite technical, and extensive knowledge of this type of analysis is required to interpret it properly. For more information, see the topic Generalized Linear Models Advanced Output on p. 302.

GenLin Model Nugget Settings

The Settings tab for a GenLin model nugget allows you to obtain propensity scores when scoring the model. This tab is available for models with flag targets only, and only after the model nugget has been added to a stream.

Figure 10-51 GenLin model nugget, Settings tab for flag targets

Calculate raw propensity scores. For models with a flag target (which return a yes or no prediction), you can request propensity scores that indicate the likelihood of the true outcome specified for the target field. These are in addition to other prediction and confidence values that may be generated during scoring.

Calculate adjusted propensity scores. Raw propensity scores are based only on the training data and may be overly optimistic due to the tendency of many models to overfit this data. Adjusted propensities attempt to compensate by evaluating model performance against a test or validation partition. This option requires that a partition field be defined in the stream and adjusted propensity scores be enabled in the modeling node before generating the model.

GenLin Model Nugget Summary

The Summary tab for a GenLin model nugget displays the fields and settings used to generate the model. In addition, if you have executed an Analysis node attached to this modeling node, information from that analysis will also be displayed in this section. For general information on using the model browser, see Browsing Model Nuggets on p. 49.
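In terms of the per-category probabilities a flag-target model produces, the raw propensity score is simply the probability assigned to whichever value is defined as the true outcome. A sketch (illustrative helper and example values only, not the product's scoring API):

```python
def raw_propensity(category_probs, true_value):
    """Raw propensity: the model's probability for the outcome defined as true.

    category_probs maps each flag value to its predicted probability;
    the probabilities are assumed to come from a scored flag-target model.
    """
    return category_probs[true_value]

p = raw_propensity({"Yes": 0.28, "No": 0.72}, true_value="Yes")
# → 0.28, regardless of whether "No" was the predicted category
```

Note that the propensity can be below 0.5 even though some category is always predicted; it tracks the likelihood of the true outcome, not of the prediction.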
Figure 10-52 GenLin model nugget, Summary tab

GLMM Node

Use this node to create a generalized linear mixed model (GLMM).

Generalized linear mixed models

Generalized linear mixed models extend the linear model so that:

The target is linearly related to the factors and covariates via a specified link function.

The target can have a non-normal distribution.

The observations can be correlated.

Generalized linear mixed models cover a wide variety of models, from simple linear regression to complex multilevel models for non-normal longitudinal data.

Examples. A district school board can use a generalized linear mixed model to determine whether an experimental teaching method is effective at improving math scores. Students from the same classroom should be correlated since they are taught by the same teacher, and classrooms within the same school may also be correlated, so we can include random effects at the school and class levels to account for the different sources of variability.

Medical researchers can use a generalized linear mixed model to determine whether a new anticonvulsant drug can reduce a patient’s rate of epileptic seizures. Repeated measurements from the same patient are typically positively correlated, so a mixed model with some random effects should be appropriate. The target field, the number of seizures, takes positive integer values, so a generalized linear mixed model with a Poisson distribution and log link may be appropriate.

Executives at a provider of cable television, phone, and internet services can use a generalized linear mixed model to learn more about potential customers. Since the possible answers have nominal measurement levels, the company analyst uses a generalized logit mixed model with a random intercept to capture correlation between answers to the service usage questions across service types (TV, phone, internet) within a given survey respondent’s answers.
Figure 10-53 Data Structure tab

The Data Structure tab allows you to specify the structural relationships between records in your dataset when observations are correlated. If the records in the dataset represent independent observations, you do not need to specify anything on this tab.

Subjects. The combination of values of the specified categorical fields should uniquely define subjects within the dataset. For example, a single Patient ID field should be sufficient to define subjects in a single hospital, but the combination of Hospital ID and Patient ID may be necessary if patient identification numbers are not unique across hospitals. In a repeated measures setting, multiple observations are recorded for each subject, so each subject may occupy multiple records in the dataset.

A subject is an observational unit that can be considered independent of other subjects. For example, the blood pressure readings from a patient in a medical study can be considered independent of the readings from other patients. Defining subjects becomes particularly important when there are repeated measurements per subject and you want to model the correlation between these observations. For example, you might expect that blood pressure readings from a single patient during consecutive visits to the doctor are correlated.

All of the fields specified as Subjects on the Data Structure tab are used to define subjects for the residual covariance structure, and they provide the list of possible fields for defining subjects for random-effects covariance structures on the Random Effect Block.

Repeated measures. The fields specified here are used to identify repeated observations. For example, a single variable Week might identify the 10 weeks of observations in a medical study, or Month and Day might be used together to identify daily observations over the course of a year.

Define covariance groups by.
The fields specified here define independent sets of repeated effects covariance parameters; one for each category defined by the cross-classification of the grouping fields. All subjects have the same covariance type; subjects within the same covariance grouping will have the same values for the parameters.

Repeated covariance type. This specifies the covariance structure for the residuals. The available structures are:

First-order autoregressive (AR1)
Autoregressive moving average (1,1) (ARMA11)
Compound symmetry
Diagonal
Scaled identity
Toeplitz
Unstructured
Variance components

Target

Figure 10-55 Target settings

These settings define the target, its distribution, and its relationship to the predictors through the link function.

Target. The target is required. It can have any measurement level, and the measurement level of the target restricts which distributions and link functions are appropriate.

Use number of trials as denominator. When the target response is a number of events occurring in a set of trials, the target field contains the number of events and you can select an additional field containing the number of trials. For example, when testing a new pesticide you might expose samples of ants to different concentrations of the pesticide and then record the number of ants killed and the number of ants in each sample. In this case, the field recording the number of ants killed should be specified as the target (events) field, and the field recording the number of ants in each sample should be specified as the trials field. If the number of ants is the same for each sample, then the number of trials may be specified using a fixed value.

The number of trials should be greater than or equal to the number of events for each record. Events should be non-negative integers, and trials should be positive integers.

Customize reference category. For a categorical target, you can choose the reference category.
This can affect certain output, such as parameter estimates, but it should not change the model fit. For example, if your target takes values 0, 1, and 2, by default the procedure makes the last (highest-valued) category, or 2, the reference category. In this situation, parameter estimates should be interpreted as relating to the likelihood of category 0 or 1 relative to the likelihood of category 2. If you specify a custom category and your target has defined labels, you can set the reference category by choosing a value from the list. This can be convenient when, in the middle of specifying a model, you don’t remember exactly how a particular field was coded.

Target Distribution and Relationship (Link) with the Linear Model. Given the values of the predictors, the model expects the distribution of values of the target to follow the specified shape, and the target values to be linearly related to the predictors through the specified link function. Shortcuts for several common models are provided, or choose a Custom setting if there is a particular distribution and link function combination you wish to fit that is not on the short list.

Linear model. Specifies a normal distribution with an identity link, which is useful when the target can be predicted using a linear regression or ANOVA model.

Gamma regression. Specifies a gamma distribution with a log link, which should be used when the target contains all positive values and is skewed towards larger values.

Loglinear. Specifies a Poisson distribution with a log link, which should be used when the target represents a count of occurrences in a fixed period of time.

Negative binomial regression. Specifies a negative binomial distribution with a log link, which should be used when the target and denominator represent the number of trials required to observe k successes.

Multinomial logistic regression. Specifies a multinomial distribution, which should be used when the target is a multi-category response.
It uses either a cumulative logit link (ordinal outcomes) or a generalized logit link (multi-category nominal responses).

Binary logistic regression. Specifies a binomial distribution with a logit link, which should be used when the target is a binary response predicted by a logistic regression model.

Binary probit. Specifies a binomial distribution with a probit link, which should be used when the target is a binary response with an underlying normal distribution.

Interval censored survival. Specifies a binomial distribution with a complementary log-log link, which is useful in survival analysis when some observations have no termination event.

Distribution

This selection specifies the distribution of the target. The ability to specify a non-normal distribution and non-identity link function is the essential improvement of the generalized linear mixed model over the linear mixed model. There are many possible distribution-link function combinations, and several may be appropriate for any given dataset, so your choice can be guided by a priori theoretical considerations or by which combination seems to fit best.

Binomial. This distribution is appropriate only for a target that represents a binary response or a number of events.

Gamma. This distribution is appropriate for a target with positive scale values that are skewed toward larger positive values. If a data value is less than or equal to 0 or is missing, then the corresponding case is not used in the analysis.

Inverse Gaussian. This distribution is appropriate for a target with positive scale values that are skewed toward larger positive values. If a data value is less than or equal to 0 or is missing, then the corresponding case is not used in the analysis.

Multinomial. This distribution is appropriate for a target that represents a multi-category response. The form of the model will depend on the measurement level of the target.
A nominal target will result in a nominal multinomial model in which a separate set of model parameters is estimated for each category of the target (except the reference category). The parameter estimates for a given predictor show the relationship between that predictor and the likelihood of each category of the target, relative to the reference category. An ordinal target will result in an ordinal multinomial model in which the traditional intercept term is replaced with a set of threshold parameters that relate to the cumulative probability of the target categories.

Negative binomial. Negative binomial regression uses a negative binomial distribution with a log link, which should be used when the target represents a count of occurrences with high variance.

Normal. This is appropriate for a continuous target whose values take a symmetric, bell-shaped distribution about a central (mean) value.

Poisson. This distribution can be thought of as the number of occurrences of an event of interest in a fixed period of time and is appropriate for variables with non-negative integer values. If a data value is non-integer, less than 0, or missing, then the corresponding case is not used in the analysis.

Link Functions

The link function is a transformation of the target that allows estimation of the model. The following functions are available:

Identity. f(x)=x. The target is not transformed. This link can be used with any distribution except the multinomial.

Complementary log-log. f(x)=log(−log(1−x)). This is appropriate only with the binomial or multinomial distribution.

Cauchit. f(x)=tan(π(x−0.5)). This is appropriate only with the binomial or multinomial distribution.

Log. f(x)=log(x). This link can be used with any distribution except the multinomial.

Log complement. f(x)=log(1−x). This is appropriate only with the binomial distribution.

Logit. f(x)=log(x / (1−x)). This is appropriate only with the binomial or multinomial distribution.

Negative log-log.
f(x)=−log(−log(x)). This is appropriate only with the binomial or multinomial distribution.

Probit. f(x)=Φ^−1(x), where Φ^−1 is the inverse standard normal cumulative distribution function. This is appropriate only with the binomial or multinomial distribution.

Power. f(x)=x^α, if α ≠ 0. f(x)=log(x), if α=0. α is the required number specification and must be a real number. This link can be used with any distribution except the multinomial.

Fixed Effects

Figure 10-56 Fixed Effects settings

Fixed effects factors are generally thought of as fields whose values of interest are all represented in the dataset and can be used for scoring. By default, fields with the predefined input role that are not specified elsewhere in the dialog are entered in the fixed effects portion of the model. Categorical (flag, nominal, and ordinal) fields are used as factors in the model, and continuous fields are used as covariates.

Enter effects into the model by selecting one or more fields in the source list and dragging them to the effects list. The type of effect created depends upon which hotspot you drop the selection on.

Main. Dropped fields appear as separate main effects at the bottom of the effects list.

2-way. All possible pairs of the dropped fields appear as 2-way interactions at the bottom of the effects list.

3-way. All possible triplets of the dropped fields appear as 3-way interactions at the bottom of the effects list.

*. The combination of all dropped fields appears as a single interaction at the bottom of the effects list.

The buttons to the right of the Effect Builder allow you to:

Delete terms from the fixed effects model by selecting the terms you want to delete and clicking the delete button,

Reorder the terms within the fixed effects model by selecting the terms you want to reorder and clicking the up or down arrow, and

Add nested terms to the model using the Add a Custom Term dialog, by clicking the Add a Custom Term button.
Include Intercept. The intercept is usually included in the model. If you can assume the data pass through the origin, you can exclude the intercept.

Add a Custom Term

Figure 10-57 Add a Custom Term dialog

You can build nested terms for your model in this procedure. Nested terms are useful for modeling the effect of a factor or covariate whose values do not interact with the levels of another factor. For example, a grocery store chain may follow the spending habits of its customers at several store locations. Since each customer frequents only one of these locations, the Customer effect can be said to be nested within the Store location effect. Additionally, you can include interaction effects, such as polynomial terms involving the same covariate, or add multiple levels of nesting to the nested term.

Limitations. Nested terms have the following restrictions:

All factors within an interaction must be unique. Thus, if A is a factor, then specifying A*A is invalid.

All factors within a nested effect must be unique. Thus, if A is a factor, then specifying A(A) is invalid.

No effect can be nested within a covariate. Thus, if A is a factor and X is a covariate, then specifying A(X) is invalid.

Constructing a nested term

E Select a factor or covariate that is nested within another factor, and then click the arrow button.
E Click (Within).
E Select the factor within which the previous factor or covariate is nested, and then click the arrow button.
E Click Add Term.

Optionally, you can include interaction effects or add multiple levels of nesting to the nested term.

Random Effects

Figure 10-58 Random Effects settings

Random effects factors are fields whose values in the data file can be considered a random sample from a larger population of values. They are useful for explaining excess variability in the target.
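To illustrate what a random intercept captures, here is a self-contained Python simulation with made-up numbers (it sketches the data-generating idea behind the school example, not the GLMM estimation itself): each school contributes its own draw from a population of intercepts, which induces correlation among students in the same school.

```python
import random

random.seed(7)

GRAND_MEAN = 70.0
SCHOOL_SD = 5.0      # between-school variability (the random intercept)
STUDENT_SD = 10.0    # residual, within-school variability

def simulate_scores(n_schools=50, students_per_school=30):
    """Simulate scores under a random-intercept model:
    score = grand mean + school effect + student residual."""
    data = []
    for school in range(n_schools):
        u = random.gauss(0.0, SCHOOL_SD)          # shared by the whole school
        for _ in range(students_per_school):
            e = random.gauss(0.0, STUDENT_SD)     # unique to the student
            data.append((school, GRAND_MEAN + u + e))
    return data

scores = simulate_scores()
# Students in the same school share u, so their scores are correlated;
# the intraclass correlation is SCHOOL_SD**2 / (SCHOOL_SD**2 + STUDENT_SD**2) = 0.2.
```

Treating School as a random effect estimates the two variance components rather than a separate fixed parameter per school, which is what makes the model generalize to schools not in the sample.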
By default, if you have selected more than one subject in the Data Structure tab, a Random Effect block will be created for each subject beyond the innermost subject. For example, if you selected School, Class, and Student as subjects on the Data Structure tab, the following random effect blocks are automatically created:
Random Effect 1: subject is school (with no effects, intercept only)
Random Effect 2: subject is school * class (no effects, intercept only)

You can work with random effects blocks in the following ways:
1. To add a new block, click Add Block... This opens the Random Effect Block dialog.
2. To edit an existing block, select the block you want to edit and click Edit Block... This opens the Random Effect Block dialog.
3. To delete one or more blocks, select the blocks you want to delete and click the delete button.

Random Effect Block

Figure 10-59 Random Effect Block dialog

Enter effects into the model by selecting one or more fields in the source list and dragging them to the effects list. The type of effect created depends on the hotspot onto which you drop the selection. Categorical (flag, nominal, and ordinal) fields are used as factors in the model, and continuous fields are used as covariates.
Main. Dropped fields appear as separate main effects at the bottom of the effects list.
2-way. All possible pairs of the dropped fields appear as 2-way interactions at the bottom of the effects list.
3-way. All possible triplets of the dropped fields appear as 3-way interactions at the bottom of the effects list.
*. The combination of all dropped fields appears as a single interaction at the bottom of the effects list.
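The 2-way and 3-way hotspots enumerate plain combinations of the dropped fields. A short Python sketch (illustrative, not Modeler code) shows the same enumeration:

```python
from itertools import combinations

def interaction_terms(fields, k):
    # One k-way interaction term for each combination of k distinct fields.
    return [" * ".join(combo) for combo in combinations(fields, k)]

fields = ["A", "B", "C"]
print(interaction_terms(fields, 2))  # ['A * B', 'A * C', 'B * C']
print(interaction_terms(fields, 3))  # ['A * B * C']
```

Dropping all three fields on the * hotspot corresponds to the single k=3 term.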
The buttons to the right of the Effect Builder allow you to:
Delete terms from the model by selecting the terms you want to delete and clicking the delete button.
Reorder the terms within the model by selecting the terms you want to reorder and clicking the up or down arrow.
Add nested terms to the model using the Add a Custom Term dialog, by clicking the Add a Custom Term button.

Include Intercept. The intercept is not included in the random effects model by default. Select this option if you want the random effects model to include an intercept.

Define covariance groups by. The fields specified here define independent sets of random effects covariance parameters, one for each category defined by the cross-classification of the grouping fields. A different set of grouping fields can be specified for each random effect block. All subjects have the same covariance type; subjects within the same covariance grouping will have the same values for the parameters.

Subject combination. This allows you to specify random effect subjects from preset combinations of subjects from the Data Structure tab. For example, if School, Class, and Student are defined as subjects on the Data Structure tab, in that order, then the Subject combination dropdown list will have None, School, School * Class, and School * Class * Student as options.

Random effect covariance type. This specifies the covariance structure for the random effects in this block. The available structures are:
First-order autoregressive (AR1)
Autoregressive moving average (1,1) (ARMA11)
Compound symmetry
Diagonal
Scaled identity
Toeplitz
Unstructured
Variance components

Weight and Offset

Figure 10-60 Weight and Offset settings

Analysis weight. The scale parameter is an estimated model parameter related to the variance of the response. The analysis weights are "known" values that can vary from observation to observation.
If the analysis weight field is specified, the scale parameter, which is related to the variance of the response, is divided by the analysis weight values for each observation. Records with analysis weight values that are less than or equal to 0, or are missing, are not used in the analysis.

Offset. The offset term is a "structural" predictor. Its coefficient is not estimated by the model but is assumed to have the value 1; thus, the values of the offset are simply added to the linear predictor of the target. This is especially useful in Poisson regression models, where each case may have different levels of exposure to the event of interest. For example, when modeling accident rates for individual drivers, there is an important difference between a driver who has been at fault in one accident in three years of experience and a driver who has been at fault in one accident in 25 years! The number of accidents can be modeled as a Poisson or negative binomial response with a log link if the natural log of the experience of the driver is included as an offset term. Other combinations of distribution and link types would require other transformations of the offset variable.

Build Options

Figure 10-61 Build Options settings

These selections specify some more advanced criteria used to build the model.

Sorting Order. These controls determine the order of the categories for the target and factors (categorical inputs) for purposes of determining the "last" category. The target sort order setting is ignored if the target is not categorical or if a custom reference category is specified on the Target settings.

Stopping Rules. You can specify the maximum number of iterations the algorithm will execute. Specify a non-negative integer. The default is 100.

Post-Estimation Settings. These settings determine how some of the model output is computed for viewing.

Confidence level.
This is the level of confidence used to compute interval estimates of the model coefficients. Specify a value greater than 0 and less than 100. The default is 95.

Degrees of freedom. This specifies how degrees of freedom are computed for significance tests. Choose Fixed for all tests (Residual method) if your sample size is sufficiently large, or the data are balanced, or the model uses a simpler covariance type; for example, scaled identity or diagonal. This is the default. Choose Varied across tests (Satterthwaite approximation) if your sample size is small, or the data are unbalanced, or the model uses a complicated covariance type; for example, unstructured.

Tests of fixed effects and coefficients. This is the method for computing the parameter estimates covariance matrix. Choose the robust estimate if you are concerned that the model assumptions are violated.

General

Figure 10-62 General settings

Model Name. You can generate the model name automatically based on the target fields or specify a custom name. The automatically generated name is the target field name. If there are multiple targets, then the model name is the field names in order, connected by ampersands. For example, if field1, field2, and field3 are targets, then the model name is: field1 & field2 & field3.

Make Available for Scoring. When the model is scored, the selected items in this group are produced. The predicted value (for all targets) and confidence (for categorical targets) are always computed when the model is scored. The computed confidence can be based on the probability of the predicted value (the highest predicted probability) or the difference between the highest predicted probability and the second highest predicted probability.

Predicted probability for categorical targets. This produces the predicted probabilities for categorical targets. A field is created for each category.

Propensity scores for flag targets.
For models with a flag target (which return a yes or no prediction), you can request propensity scores that indicate the likelihood of the true outcome specified for the target field. The model produces raw propensity scores; if partitions are in effect, the model also produces adjusted propensity scores based on the testing partition. For more information, see the topic Propensity Scores in Chapter 3 on p. 41.

Estimated Means

Figure 10-63 Estimated Means settings

This tab allows you to display the estimated marginal means for levels of factors and factor interactions. Estimated marginal means are not available for multinomial models.

Terms. The model terms in the Fixed Effects that are composed entirely of categorical fields are listed here. Check each term for which you want the model to produce estimated marginal means.

Contrast Type. This specifies the type of contrast to use for the levels of the contrast field.
If None is selected, no contrasts are produced.
Pairwise produces pairwise comparisons for all level combinations of the specified factors. This is the only available contrast for factor interactions.
Deviation contrasts compare each level of the factor to the grand mean.
Simple contrasts compare each level of the factor, except the last, to the last level. The "last" level is determined by the sort order for factors specified on the Build Options. Note that these contrast types are not all orthogonal.

Contrast Field. This specifies a factor whose levels are compared using the selected contrast type. If None is selected as the contrast type, no contrast field can (or needs to) be selected.

Continuous Fields. The listed continuous fields are extracted from the terms in the Fixed Effects that use continuous fields. When computing estimated marginal means, covariates are fixed at the specified values. Select the mean or specify a custom value.

Display estimated means in terms of.
This specifies whether to compute estimated marginal means based on the original scale of the target or based on the link function transformation.
Original target scale computes estimated marginal means for the target. Note that when the target is specified using the events/trials option, this gives the estimated marginal means for the events/trials proportion rather than for the number of events.
Link function transformation computes estimated marginal means for the linear predictor.

Adjust for multiple comparisons using. When performing hypothesis tests with multiple contrasts, the overall significance level can be adjusted from the significance levels for the included contrasts. This allows you to choose the adjustment method.
Least significant difference. This method does not control the overall probability of rejecting the hypotheses that some linear contrasts are different from the null hypothesis values.
Sequential Bonferroni. This is a sequentially step-down rejective Bonferroni procedure that is much less conservative in terms of rejecting individual hypotheses but maintains the same overall significance level.
Sequential Sidak. This is a sequentially step-down rejective Sidak procedure that is much less conservative in terms of rejecting individual hypotheses but maintains the same overall significance level.
The least significant difference method is less conservative than the sequential Sidak method, which in turn is less conservative than the sequential Bonferroni; that is, least significant difference will reject at least as many individual hypotheses as sequential Sidak, which in turn will reject at least as many individual hypotheses as sequential Bonferroni.

Model view

By default, the Model Summary view is shown. To see another model view, select it from the view thumbnails.

Model Summary

Figure 10-64 Model Summary view

This view is a snapshot, at-a-glance summary of the model and its fit.

Table.
The table identifies the target, probability distribution, and link function specified on the Target settings. If the target is defined by events and trials, the cell is split to show the events field and the trials field or fixed number of trials. Additionally, the finite-sample corrected Akaike information criterion (AICC) and Bayesian information criterion (BIC) are displayed.
Akaike Corrected. A measure for selecting and comparing mixed models based on the -2 (Restricted) log likelihood. Smaller values indicate better models. The AICC "corrects" the AIC for small sample sizes. As the sample size increases, the AICC converges to the AIC.
Bayesian. A measure for selecting and comparing models based on the -2 log likelihood. Smaller values indicate better models. The BIC also penalizes overparameterized models, but more strictly than the AIC.

Chart. If the target is categorical, a chart displays the accuracy of the final model, which is the percentage of correct classifications.

Data Structure

Figure 10-65 Data Structure view

This view provides a summary of the data structure you specified, and helps you to check that the subjects and repeated measures have been specified correctly. The observed information for the first subject is displayed for each subject field and repeated measures field, as well as for the target. Additionally, the number of levels for each subject field and repeated measures field is displayed.

Predicted by Observed

Figure 10-66 Predicted by Observed view

For continuous targets, including targets specified as events/trials, this view displays a binned scatterplot of the predicted values on the vertical axis by the observed values on the horizontal axis. Ideally, the points should lie on a 45-degree line; this view can tell you whether any records are predicted particularly badly by the model.
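The two information criteria reported in the Model Summary table follow the standard formulas, where -2LL is the -2 log-likelihood, k the number of parameters, and n the number of records. A quick sketch (the formulas are standard, but this is not Modeler code):

```python
import math

def aic(neg2ll, k):
    # AIC = -2LL + 2k.
    return neg2ll + 2 * k

def aicc(neg2ll, k, n):
    # Finite-sample corrected AIC: adds 2k(k+1)/(n-k-1) to the AIC.
    return aic(neg2ll, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(neg2ll, k, n):
    # BIC = -2LL + k*log(n): a harsher parameter penalty than AIC once log(n) > 2.
    return neg2ll + k * math.log(n)

# The AICC correction shrinks toward zero as the sample size grows,
# which is why the AICC converges to the AIC.
print(round(aicc(100.0, 3, 20) - aic(100.0, 3), 4))    # 1.5
print(round(aicc(100.0, 3, 2000) - aic(100.0, 3), 4))  # 0.012
```

In both criteria, smaller values indicate better models, as the text notes.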
Classification

Figure 10-67 Classification view

For categorical targets, this view displays the cross-classification of observed versus predicted values in a heat map, plus the overall percent correct.

Table styles. There are several different display styles, which are accessible from the Style dropdown list.
Row percents. This displays the row percentages (the cell counts expressed as a percent of the row totals) in the cells. This is the default.
Cell counts. This displays the cell counts in the cells. The shading for the heat map is still based on the row percentages.
Heat map. This displays no values in the cells, just the shading.
Compressed. This displays no row or column headings, and no values in the cells. It can be useful when the target has many categories.

Missing. If any records have missing values on the target, they are displayed in a (Missing) row under all valid rows. Records with missing values do not contribute to the overall percent correct.

Multiple targets. If there are multiple categorical targets, then each target is displayed in a separate table and there is a Target dropdown list that controls which target to display.

Large tables. If the displayed target has more than 100 categories, no table is displayed.

Fixed Effects

Figure 10-68 Fixed Effects view, diagram style

Figure 10-69 Fixed Effects view, table style

This view displays the size of each fixed effect in the model.

Styles. There are different display styles, which are accessible from the Style dropdown list.
Diagram. This is a chart in which effects are sorted from top to bottom in the order in which they were specified on the Fixed Effects settings. Connecting lines in the diagram are weighted based on effect significance, with greater line width corresponding to more significant effects (smaller p-values). This is the default.
Table. This is an ANOVA table for the overall model and the individual model effects.
The individual effects are sorted from top to bottom in the order in which they were specified on the Fixed Effects settings.

Significance. There is a Significance slider that controls which effects are shown in the view. Effects with significance values greater than the slider value are hidden. This does not change the model, but simply allows you to focus on the most important effects. By default the value is 1.00, so that no effects are filtered based on significance.

Fixed Coefficients

Figure 10-70 Fixed Coefficients view, diagram style

Figure 10-71 Fixed Coefficients view, table style

This view displays the value of each fixed coefficient in the model. Note that factors (categorical predictors) are indicator-coded within the model, so that effects containing factors will generally have multiple associated coefficients: one for each category except the category corresponding to the redundant coefficient.

Styles. There are different display styles, which are accessible from the Style dropdown list.
Diagram. This is a chart that displays the intercept first, and then sorts effects from top to bottom in the order in which they were specified on the Fixed Effects settings. Within effects containing factors, coefficients are sorted by ascending order of data values. Connecting lines in the diagram are colored and weighted based on coefficient significance, with greater line width corresponding to more significant coefficients (smaller p-values). This is the default style.
Table. This shows the values, significance tests, and confidence intervals for the individual model coefficients. After the intercept, the effects are sorted from top to bottom in the order in which they were specified on the Fixed Effects settings. Within effects containing factors, coefficients are sorted by ascending order of data values.

Multinomial.
If the multinomial distribution is in effect, then the Multinomial dropdown list controls which target category to display. The sort order of the values in the list is determined by the specification on the Build Options settings.

Exponential. This displays exponential coefficient estimates and confidence intervals for certain model types, including binary logistic regression (binomial distribution and logit link), nominal logistic regression (multinomial distribution and logit link), negative binomial regression (negative binomial distribution and log link), and log-linear models (Poisson distribution and log link).

Significance. There is a Significance slider that controls which coefficients are shown in the view. Coefficients with significance values greater than the slider value are hidden. This does not change the model, but simply allows you to focus on the most important coefficients. By default the value is 1.00, so that no coefficients are filtered based on significance.

Random Effect Covariances

This view displays the random effects covariance matrix (G).

Styles. There are different display styles, which are accessible from the Style dropdown list.
Covariance values. This is a heat map of the covariance matrix in which effects are sorted from top to bottom in the order in which they were specified on the Fixed Effects settings. Colors correspond to the cell values as shown in the key. This is the default.
Corrgram. This is a heat map of the corresponding correlation matrix.
Compressed. This is a heat map of the covariance matrix without the row and column headings.

Blocks. If there are multiple random effect blocks, then there is a Block dropdown list for selecting the block to display.

Groups. If a random effect block has a group specification, then there is a Group dropdown list for selecting the group level to display.

Multinomial.
If the multinomial distribution is in effect, then the Multinomial dropdown list controls which target category to display. The sort order of the values in the list is determined by the specification on the Build Options settings.

Covariance Parameters

Figure 10-72 Covariance Parameters view

This view displays the covariance parameter estimates and related statistics for residual and random effects. These are advanced, but fundamental, results that provide information on whether the covariance structure is suitable.

Summary table. This is a quick reference for the number of parameters in the residual (R) and random effect (G) covariance matrices, the rank (number of columns) of the fixed effect (X) and random effect (Z) design matrices, and the number of subjects defined by the subject fields that define the data structure.

Covariance parameter table. For the selected effect, the estimate, standard error, and confidence interval are displayed for each covariance parameter. The number of parameters shown depends upon the covariance structure for the effect and, for random effect blocks, the number of effects in the block. If you see that the off-diagonal parameters are not significant, you may be able to use a simpler covariance structure.

Effects. If there are random effect blocks, then there is an Effect dropdown list for selecting the residual or random effect block to display. The residual effect is always available.

Groups. If a residual or random effect block has a group specification, then there is a Group dropdown list for selecting the group level to display.

Multinomial. If the multinomial distribution is in effect, then the Multinomial dropdown list controls which target category to display. The sort order of the values in the list is determined by the specification on the Build Options settings.
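The covariance structures listed for the Random Effect Block differ mainly in how many free parameters they spend; that is what the Covariance Parameters view is counting. A small Python sketch (illustrative only) builds two of the structures and counts parameters for a third:

```python
def ar1_cov(n, sigma2, rho):
    # First-order autoregressive: cov(i, j) = sigma2 * rho**|i-j|; 2 parameters.
    return [[sigma2 * rho ** abs(i - j) for j in range(n)] for i in range(n)]

def compound_symmetry_cov(n, var, cov):
    # Compound symmetry: one common variance and one common covariance; 2 parameters.
    return [[var if i == j else cov for j in range(n)] for i in range(n)]

def n_unstructured_params(n):
    # Unstructured: every variance and covariance is free; n*(n+1)/2 parameters.
    return n * (n + 1) // 2

print(ar1_cov(3, 1.0, 0.5))      # [[1.0, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 1.0]]
print(n_unstructured_params(4))  # 10
```

If the off-diagonal parameters in the Covariance parameter table are not significant, moving from Unstructured toward one of the 2-parameter structures above is exactly the kind of simplification the text suggests.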
Estimated Means: Significant Effects

These are charts displayed for the 10 "most significant" fixed all-factor effects, starting with the three-way interactions, then the two-way interactions, and finally main effects. The chart displays the model-estimated value of the target on the vertical axis for each value of the main effect (or first listed effect in an interaction) on the horizontal axis; a separate line is produced for each value of the second listed effect in an interaction; a separate chart is produced for each value of the third listed effect in a three-way interaction; all other predictors are held constant. It provides a useful visualization of the effects of each predictor's coefficients on the target. Note that if no predictors are significant, no estimated means are produced.

Confidence. This displays upper and lower confidence limits for the marginal means, using the confidence level specified as part of the Build Options.

Estimated Means: Custom Effects

These are tables and charts for user-requested fixed all-factor effects.

Styles. There are different display styles, which are accessible from the Style dropdown list.
Diagram. This style displays a line chart of the model-estimated value of the target on the vertical axis for each value of the main effect (or first listed effect in an interaction) on the horizontal axis; a separate line is produced for each value of the second listed effect in an interaction; a separate chart is produced for each value of the third listed effect in a three-way interaction; all other predictors are held constant. If contrasts were requested, another chart is displayed to compare levels of the contrast field; for interactions, a chart is displayed for each level combination of the effects other than the contrast field.
For pairwise contrasts, it is a distance network chart; that is, a graphical representation of the comparisons table in which the distances between nodes in the network correspond to differences between samples. Yellow lines correspond to statistically significant differences; black lines correspond to non-significant differences. Hovering over a line in the network displays a tooltip with the adjusted significance of the difference between the nodes connected by the line.
For deviation contrasts, a bar chart is displayed with the model-estimated value of the target on the vertical axis and the values of the contrast field on the horizontal axis; for interactions, a chart is displayed for each level combination of the effects other than the contrast field. The bars show the difference between each level of the contrast field and the overall mean, which is represented by a black horizontal line.
For simple contrasts, a bar chart is displayed with the model-estimated value of the target on the vertical axis and the values of the contrast field on the horizontal axis; for interactions, a chart is displayed for each level combination of the effects other than the contrast field. The bars show the difference between each level of the contrast field (except the last) and the last level, which is represented by a black horizontal line.

Table. This style displays a table of the model-estimated value of the target, its standard error, and confidence interval for each level combination of the fields in the effect; all other predictors are held constant. If contrasts were requested, another table is displayed with the estimate, standard error, significance test, and confidence interval for each contrast; for interactions, there is a separate set of rows for each level combination of the effects other than the contrast field.
Additionally, a table with the overall test results is displayed; for interactions, there is a separate overall test for each level combination of the effects other than the contrast field.

Confidence. This toggles the display of upper and lower confidence limits for the marginal means, using the confidence level specified as part of the Build Options.

Layout. This toggles the layout of the pairwise contrasts diagram. The circle layout is less revealing of contrasts than the network layout but avoids overlapping lines.

Settings

Figure 10-73 Model settings

When the model is scored, the selected items in this tab are produced. The predicted value (for all targets) and confidence (for categorical targets) are always computed when the model is scored. The computed confidence can be based on the probability of the predicted value (the highest predicted probability) or the difference between the highest predicted probability and the second highest predicted probability.

Predicted probability for categorical targets. This produces the predicted probabilities for categorical targets. A field is created for each category.

Propensity scores for flag targets. For models with a flag target (which return a yes or no prediction), you can request propensity scores that indicate the likelihood of the true outcome specified for the target field. The model produces raw propensity scores; if partitions are in effect, the model also produces adjusted propensity scores based on the testing partition. For more information, see the topic Propensity Scores in Chapter 3 on p. 41.

Cox Node

Cox Regression builds a predictive model for time-to-event data. The model produces a survival function that predicts the probability that the event of interest has occurred at a given time t for given values of the predictor variables.
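To make the idea of a survival function concrete, here is a minimal Kaplan-Meier product-limit sketch in Python. It is illustrative only: the Cox node fits a regression model, and only its no-input special case reduces to a Kaplan-Meier analysis.

```python
def kaplan_meier(times, events):
    # Product-limit estimate of the survival function S(t).
    # times: observed durations; events: 1 = event occurred, 0 = censored.
    survival, s = [], 1.0
    for t in sorted(set(times)):
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        n = sum(1 for ti in times if ti >= t)
        if d:
            s *= 1.0 - d / n  # survival drops only at event times
        survival.append((t, s))
    return survival

# Five customers: churn at months 2, 3, and 5; censored (still active) at 4 and 6.
curve = kaplan_meier([2, 3, 4, 5, 6], [1, 1, 0, 1, 0])
print([(t, round(s, 2)) for t, s in curve])  # [(2, 0.8), (3, 0.6), (4, 0.6), (5, 0.3), (6, 0.3)]
```

Note how the censored records still contribute: they remain in the risk set (the `n` count) up to their censoring time, which is the sense in which censored subjects "contribute usefully" to estimation.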
The shape of the survival function and the regression coefficients for the predictors are estimated from observed subjects; the model can then be applied to new cases that have measurements for the predictor variables. Note that information from censored subjects, that is, those that do not experience the event of interest during the time of observation, contributes usefully to the estimation of the model.

Example. As part of its efforts to reduce customer churn, a telecommunications company is interested in modeling the "time to churn" in order to determine the factors that are associated with customers who are quick to switch to another service. To this end, a random sample of customers is selected, and their time spent as customers (whether or not they are still active customers) and various demographic fields are pulled from the database.

Requirements. You need one or more input fields, exactly one target field, and you must specify a survival time field within the Cox node. The target field should be coded so that the "false" value indicates survival and the "true" value indicates that the event of interest has occurred; it must have a measurement level of Flag, with string or integer storage. (Storage can be converted using a Filler or Derive node if necessary.) Fields set to Both or None are ignored. Fields used in the model must have their types fully instantiated. The survival time can be any numeric field.

Dates & Times. Date & Time fields cannot be used to directly define the survival time; if you have Date & Time fields, you should use them to create a field containing survival times, based upon the difference between the date of entry into the study and the observation date.

Kaplan-Meier Analysis. Cox regression can be performed with no input fields. This is equivalent to a Kaplan-Meier analysis.

Cox Node Fields Options

Figure 10-74 Cox node dialog box, Fields tab

Survival time.
Choose a numeric field (one with a measurement level of Continuous) in order to make the node executable. Survival time indicates the lifespan of the record being predicted. For example, when modeling customer time to churn, this would be the field that records how long the customer has been with the organization. The date on which the customer joined or churned would not affect the model; only the duration of the customer's tenure is relevant.

Survival time is taken to be a duration with no units. You must make sure that the input fields match the survival time. For example, in a study to measure churn by months, you would use sales per month as an input instead of sales per year. If your data have start and end dates instead of a duration, you must recode those dates to a duration upstream from the Cox node.

The remaining fields in this dialog box are the standard ones used throughout IBM® SPSS® Modeler. For more information, see the topic Modeling Node Fields Options in Chapter 3 on p. 35.

Cox Node Model Options

Figure 10-75 Cox node dialog box, Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.

Create split models. Builds a separate model for each possible value of input fields that are specified as split fields. For more information, see the topic Building Split Models in Chapter 3 on p. 30.

Method. The following options are available for entering predictors into the model:
Enter. This is the default method, which enters all of the terms into the model directly. No field selection is performed in building the model.
Stepwise. The Stepwise method of field selection builds the model in steps, as the name implies.
The initial model is the simplest model possible, with no model terms (except the constant) in the model. At each step, terms that have not yet been added to the model are evaluated, and if the best of those terms adds significantly to the predictive power of the model, it is added. In addition, terms that are currently in the model are reevaluated to determine if any of them can be removed without significantly detracting from the model. If so, they are removed. The process then repeats, and other terms are added and/or removed. When no more terms can be added to improve the model, and no more terms can be removed without detracting from the model, the final model is generated.
Backwards Stepwise. The Backwards Stepwise method is essentially the opposite of the Stepwise method. With this method, the initial model contains all of the terms as predictors. At each step, terms in the model are evaluated, and any terms that can be removed without significantly detracting from the model are removed. In addition, previously removed terms are reevaluated to determine if the best of those terms adds significantly to the predictive power of the model. If so, it is added back into the model. When no more terms can be removed without significantly detracting from the model, and no more terms can be added to improve the model, the final model is generated.

Note: The automatic methods, including Stepwise and Backwards Stepwise, are highly adaptable learning methods and have a strong tendency to overfit the training data. When using these methods, it is especially important to verify the validity of the resulting model, either with new data or with a hold-out test sample created using the Partition node.

Groups. Specifying a groups field causes the node to compute separate models for each category of the field. It can be any categorical field (Flag or Nominal) with string or integer storage.

Model type. There are two options for defining the terms in the model.
Main effects models include only the input fields individually and do not test interactions (multiplicative effects) between input fields.

Custom models include only the terms (main effects and interactions) that you specify. When selecting this option, use the Model Terms list to add or remove terms in the model.

Model Terms. When building a Custom model, you will need to explicitly specify the terms in the model. The list shows the current set of terms for the model. The buttons on the right side of the Model Terms list allow you to add and remove model terms.

► To add terms to the model, click the Add new model terms button.

► To delete terms, select the desired terms and click the Delete selected model terms button.

Adding Terms to a Cox Regression Model

When requesting a custom model, you can add terms to the model by clicking the Add new model terms button on the Model tab. A new dialog box opens in which you can specify terms.

Figure 10-76 New Terms dialog box

Type of term to add. There are several ways to add terms to the model, based on the selection of input fields in the Available fields list.

Single interaction. Inserts the term representing the interaction of all selected fields.

Main effects. Inserts one main effect term (the field itself) for each selected input field.

All 2-way interactions. Inserts a 2-way interaction term (the product of the input fields) for each possible pair of selected input fields. For example, if you have selected input fields A, B, and C in the Available fields list, this method will insert the terms A * B, A * C, and B * C.

All 3-way interactions. Inserts a 3-way interaction term (the product of the input fields) for each possible combination of selected input fields, taken three at a time. For example, if you have selected input fields A, B, C, and D in the Available fields list, this method will insert the terms A * B * C, A * B * D, A * C * D, and B * C * D.

All 4-way interactions.
Inserts a 4-way interaction term (the product of the input fields) for each possible combination of selected input fields, taken four at a time. For example, if you have selected input fields A, B, C, D, and E in the Available fields list, this method will insert the terms A * B * C * D, A * B * C * E, A * B * D * E, A * C * D * E, and B * C * D * E.

Available fields. Lists the available input fields to be used in constructing model terms. Note that the list may include fields that are not legal input fields, so take care to ensure that all model terms include only input fields.

Preview. Shows the terms that will be added to the model if you click Insert, based on the selected fields and the term type selected above.

Insert. Inserts terms in the model (based on the current selection of fields and term type) and closes the dialog box.

Cox Node Expert Options

Figure 10-77 Cox node dialog box, Expert tab

Convergence. These options allow you to control the parameters for model convergence. When you execute the model, the convergence settings control how many times the different parameters are repeatedly run through to see how well they fit. The more often the parameters are tried, the closer the results will be (that is, the results will converge). For more information, see the topic Cox Node Convergence Criteria on p. 341.

Output. These options allow you to request additional statistics and plots, including the survival curve, that will be displayed in the advanced output of the generated model built by the node. For more information, see the topic Cox Node Advanced Output Options on p. 342.

Stepping. These options allow you to control the criteria for adding and removing fields with the Stepwise estimation method. (The button is disabled if the Enter method is selected.) For more information, see the topic Cox Node Stepping Criteria on p. 343.
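The combinatorial expansion behind the All n-way interactions options can be sketched in a few lines of Python. This is an illustration of the combinatorics only, not code from SPSS Modeler; the function name is invented for the example.

```python
from itertools import combinations

def interaction_terms(fields, order):
    """All order-way interaction terms: one product per combination of fields."""
    return [" * ".join(combo) for combo in combinations(fields, order)]

# All 2-way interactions of A, B, C
print(interaction_terms(["A", "B", "C"], 2))
# ['A * B', 'A * C', 'B * C']

# All 3-way interactions of A, B, C, D
print(interaction_terms(["A", "B", "C", "D"], 3))
# ['A * B * C', 'A * B * D', 'A * C * D', 'B * C * D']
```

The output matches the examples in the text, and a main effect is simply the order-1 case.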
Cox Node Convergence Criteria

Figure 10-78 Cox Regression Convergence Criteria dialog box

Maximum iterations. Allows you to specify the maximum iterations for the model, which controls how long the procedure will search for a solution.

Log-likelihood convergence. Iterations stop if the relative change in the log-likelihood is less than this value. The criterion is not used if the value is 0.

Parameter convergence. Iterations stop if the absolute change or relative change in the parameter estimates is less than this value. The criterion is not used if the value is 0.

Cox Node Advanced Output Options

Figure 10-79 Cox Regression Advanced Output dialog box

Statistics. You can obtain statistics for your model parameters, including confidence intervals for exp(B) and correlation of estimates. You can request these statistics either at each step or at the last step only.

Display baseline function. Allows you to display the baseline hazard function and cumulative survival at the mean of the covariates.

Plots

Plots can help you to evaluate your estimated model and interpret the results. You can plot the survival, hazard, log-minus-log, and one-minus-survival functions.

Survival. Displays the cumulative survival function on a linear scale.

Hazard. Displays the cumulative hazard function on a linear scale.

Log minus log. Displays the cumulative survival estimate after the ln(-ln) transformation is applied to the estimate.

One minus survival. Plots one minus the survival function on a linear scale.

Plot a separate line for each value. This option is available only for categorical fields.

Value to use for plots. Because these functions depend on values of the predictors, you must use constant values for the predictors to plot the functions versus time. The default is to use the mean of each predictor as a constant value, but you can enter your own values for the plot using the grid.
For categorical inputs, indicator coding is used, so there is a regression coefficient for each category (except the last). Thus, a categorical input has a mean value for each indicator contrast, equal to the proportion of cases in the category corresponding to the indicator contrast.

Cox Node Stepping Criteria

Figure 10-80 Cox Regression Stepping Criteria dialog box

Removal criterion. Select Likelihood Ratio for a more robust model. To shorten the time required to build the model, you can try selecting Wald. There is also the Conditional option, which provides removal testing based on the probability of the likelihood-ratio statistic based on conditional parameter estimates.

Significance thresholds for criteria. This option allows you to specify selection criteria based on the statistical probability (the p value) associated with each field. Fields will be added to the model only if the associated p value is smaller than the Entry value and will be removed only if the p value is larger than the Removal value. The Entry value must be smaller than the Removal value.

Cox Node Settings Options

Figure 10-81 Cox node dialog box, Settings tab

Predict survival at future times. Specify one or more future times. Survival, that is, whether each case is likely to survive for at least that length of time (from now) without the terminal event occurring, is predicted for each record at each time value, one prediction per time value. Note that survival is the “false” value of the target field.

Regular intervals. Survival time values are generated from the specified Time interval and Number of time periods to score. For example, if 3 time periods are requested with an interval of 2 between each time, survival will be predicted for future times 2, 4, and 6. Every record is evaluated at the same time values.

Time fields.
Survival times are provided for each record in the time field chosen (one prediction field is generated), so each record can be evaluated at a different time.

Past survival time. Specify the survival time of the record so far—for example, the tenure of an existing customer as a field. Scoring the likelihood of survival at a future time will be conditional on past survival time.

Note: The values of future and past survival times must be within the range of survival times in the data used to train the model. Records whose times fall outside this range are scored as null.

Append all probabilities. Specifies whether probabilities for each category of the output field are added to each record processed by the node. If this option is not selected, the probability of only the predicted category is added. Probabilities are computed for each future time.

Calculate cumulative hazard function. Specifies whether the value of the cumulative hazard is added to each record. The cumulative hazard is computed for each future time.

Cox Model Nugget

Cox regression models represent the equations estimated by Cox nodes. They contain all of the information captured by the model, as well as information about the model structure and performance.

When you run a stream containing a generated Cox regression model, the node adds two new fields containing the model’s prediction and the associated probability. The names of the new fields are derived from the name of the output field being predicted, prefixed with $C- for the predicted category and $CP- for the associated probability, and suffixed with the number of the future time interval or the name of the time field that defines the time interval. For example, for an output field named churn and two future time intervals defined at regular intervals, the new fields would be named $C-churn-1, $CP-churn-1, $C-churn-2, and $CP-churn-2.
If future times are defined with a time field tenure, the new fields would be $C-churn_tenure and $CP-churn_tenure.

If you have selected the Append all probabilities settings option in the Cox node, two additional fields will be added for each future time, containing the probabilities of survival and failure for each record. These additional fields are named based on the name of the output field, prefixed by $CP-<false value>- for the probability of survival and $CP-<true value>- for the probability the event has occurred, and suffixed with the number of the future time interval. For example, for an output field where the “false” value is 0 and the “true” value is 1, and two future time intervals defined at regular intervals, the new fields would be named $CP-0-1, $CP-1-1, $CP-0-2, and $CP-1-2. If future times are defined with a single time field tenure, the new fields would be $CP-0-1 and $CP-1-1, since there is a single future interval.

If you have selected the Calculate cumulative hazard function settings option in the Cox node, an additional field will be added for each future time, containing the cumulative hazard function for each record. These additional fields are named based on the name of the output field, prefixed by $CH-, and suffixed with the number of the future time interval or the name of the time field that defines the time interval. For example, for an output field named churn and two future time intervals defined at regular intervals, the new fields would be named $CH-churn-1 and $CH-churn-2. If future times are defined with a time field tenure, the new field would be $CH-churn-1.

Cox Regression Output Settings

The Settings tab of the nugget contains the same controls as the Settings tab of the model node. The default values of the nugget controls are determined by the values set in the model node. For more information, see the topic Cox Node Settings Options on p. 344.
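A downstream script that needs to pick up the scored fields can reconstruct their names from the naming scheme described above. The sketch below is an illustration of the documented $C-/$CP-/$CH- convention for the regular-intervals case, not part of Modeler; the function name is invented for the example.

```python
def cox_score_fields(target, n_intervals, cum_hazard=False):
    """Field names a Cox nugget adds when future times are defined
    at regular intervals ($C- prediction, $CP- probability, $CH- hazard)."""
    fields = []
    for t in range(1, n_intervals + 1):
        fields += [f"$C-{target}-{t}", f"$CP-{target}-{t}"]
    if cum_hazard:
        # One cumulative hazard field per future time interval
        fields += [f"$CH-{target}-{t}" for t in range(1, n_intervals + 1)]
    return fields

print(cox_score_fields("churn", 2))
# ['$C-churn-1', '$CP-churn-1', '$C-churn-2', '$CP-churn-2']
```

When future times come from a time field instead, the suffix is the time field name (for example, $C-churn_tenure), as described in the text.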
Cox Regression Advanced Output

The advanced output for Cox regression gives detailed information about the estimated model and its performance, including the survival curve. Most of the information contained in the advanced output is quite technical, and extensive knowledge of Cox regression is required to properly interpret this output.

Figure 10-82 Cox model nugget, Advanced tab

Chapter 11
Clustering Models

Clustering models focus on identifying groups of similar records and labeling the records according to the group to which they belong. This is done without the benefit of prior knowledge about the groups and their characteristics. In fact, you may not even know exactly how many groups to look for. This is what distinguishes clustering models from the other machine-learning techniques—there is no predefined output or target field for the model to predict. These models are often referred to as unsupervised learning models, since there is no external standard by which to judge the model’s classification performance. There are no right or wrong answers for these models. Their value is determined by their ability to capture interesting groupings in the data and provide useful descriptions of those groupings.

Clustering methods are based on measuring distances between records and between clusters. Records are assigned to clusters in a way that tends to minimize the distance between records belonging to the same cluster.

Figure 11-1 Simple clustering model

Three clustering methods are provided:

The K-Means node clusters the data set into distinct groups (or clusters). The method defines a fixed number of clusters, iteratively assigns records to clusters, and adjusts the cluster centers until further refinement can no longer improve the model. Instead of trying to predict an outcome, k-means uses a process known as unsupervised learning to uncover patterns in the set of input fields.
For more information, see the topic K-Means Node on p. 354.

The TwoStep node uses a two-step clustering method. The first step makes a single pass through the data to compress the raw input data into a manageable set of subclusters. The second step uses a hierarchical clustering method to progressively merge the subclusters into larger and larger clusters. TwoStep has the advantage of automatically estimating the optimal number of clusters for the training data. It can handle mixed field types and large data sets efficiently. For more information, see the topic TwoStep Cluster Node on p. 358.

The Kohonen node generates a type of neural network that can be used to cluster the data set into distinct groups. When the network is fully trained, records that are similar should be close together on the output map, while records that are different will be far apart. You can look at the number of observations captured by each unit in the model nugget to identify the strong units. This may give you a sense of the appropriate number of clusters. For more information, see the topic Kohonen Node on p. 348.

Clustering models are often used to create clusters or segments that are then used as inputs in subsequent analyses. A common example of this is the market segments used by marketers to partition their overall market into homogeneous subgroups. Each segment has special characteristics that affect the success of marketing efforts targeted toward it. If you are using data mining to optimize your marketing strategy, you can usually improve your model significantly by identifying the appropriate segments and using that segment information in your predictive models.

Kohonen Node

Kohonen networks are a type of neural network that performs clustering, also known as a knet or a self-organizing map. This type of network can be used to cluster the dataset into distinct groups when you don’t know what those groups are at the beginning.
Records are grouped so that records within a group or cluster tend to be similar to each other, and records in different groups are dissimilar.

The basic units are neurons, and they are organized into two layers: the input layer and the output layer (also called the output map). All of the input neurons are connected to all of the output neurons, and these connections have strengths, or weights, associated with them. During training, each unit competes with all of the others to “win” each record. The output map is a two-dimensional grid of neurons, with no connections between the units. A 3 × 4 map is shown below, although maps are typically larger than this.

Figure 11-2 Structure of a Kohonen network

Input data is presented to the input layer, and the values are propagated to the output layer. The output neuron with the strongest response is said to be the winner and is the answer for that input.

Initially, all weights are random. When a unit wins a record, its weights (along with those of other nearby units, collectively referred to as a neighborhood) are adjusted to better match the pattern of predictor values for that record. All of the input records are shown, and weights are updated accordingly. This process is repeated many times until the changes become very small. As training proceeds, the weights on the grid units are adjusted so that they form a two-dimensional “map” of the clusters (hence the term self-organizing map). When the network is fully trained, records that are similar should be close together on the output map, whereas records that are vastly different will be far apart.

Unlike most learning methods in IBM® SPSS® Modeler, Kohonen networks do not use a target field. This type of learning, with no target field, is called unsupervised learning. Instead of trying to predict an outcome, Kohonen nets try to uncover patterns in the set of input fields.
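The competitive training pass described above can be sketched as follows. This is a minimal illustration, not Modeler’s implementation; squared Euclidean distance for the winner and a circular grid neighborhood are assumptions made for the example.

```python
import math
import random

def winner(weights, record):
    """Index of the output unit whose weight vector is closest to the record."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], record)))

def train_record(weights, grid, record, eta, radius):
    """Move the winning unit and its grid neighborhood toward the record."""
    wx, wy = grid[winner(weights, record)]
    for i, (x, y) in enumerate(grid):
        if math.hypot(x - wx, y - wy) <= radius:   # unit is in the neighborhood
            weights[i] = [w + eta * (v - w) for w, v in zip(weights[i], record)]

# A 3 x 4 output map with two input fields and random initial weights,
# as described in the text
random.seed(0)
grid = [(x, y) for x in range(3) for y in range(4)]
weights = [[random.random(), random.random()] for _ in grid]
train_record(weights, grid, [0.9, 0.1], eta=0.3, radius=1.0)
```

Repeating this over all records many times, with eta and the radius shrinking as training proceeds, produces the self-organizing behavior described above.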
Usually, a Kohonen net will end up with a few units that summarize many observations (strong units) and several units that don’t really correspond to any of the observations (weak units). The strong units (and sometimes other units adjacent to them in the grid) represent probable cluster centers.

Another use of Kohonen networks is in dimension reduction. The spatial characteristic of the two-dimensional grid provides a mapping from the k original predictors to two derived features that preserve the similarity relationships in the original predictors. In some cases, this can give you the same kind of benefit as factor analysis or PCA.

Note that the method for calculating the default size of the output grid has changed from previous versions of SPSS Modeler. The new method will generally produce smaller output layers that are faster to train and generalize better. If you find that you get poor results with the default size, try increasing the size of the output grid on the Expert tab. For more information, see the topic Kohonen Node Expert Options on p. 352.

Requirements. To train a Kohonen net, you need one or more fields with the role set to Input. Fields with the role set to Target, Both, or None are ignored.

Strengths. You do not need to have data on group membership to build a Kohonen network model. You don’t even need to know the number of groups to look for. Kohonen networks start with a large number of units, and as training progresses, the units gravitate toward the natural clusters in the data. You can look at the number of observations captured by each unit in the model nugget to identify the strong units, which can give you a sense of the appropriate number of clusters.

Kohonen Node Model Options

Figure 11-3 Kohonen node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data.
If a partition field is defined, this option ensures that data from only the training partition is used to build the model.

Continue training existing model. By default, each time you execute a Kohonen node, a completely new network is created. If you select this option, training continues with the last net successfully produced by the node.

Show feedback graph. If this option is selected, a visual representation of the two-dimensional array is displayed during training. The strength of each node is represented by color. Red denotes a unit that is winning many records (a strong unit), and white denotes a unit that is winning few or no records (a weak unit). Feedback may not display if the time taken to build the model is relatively short. Note that this feature can slow training time. To speed up training time, deselect this option.

Figure 11-4 Kohonen feedback graph

Stop on. The default stopping criterion stops training based on internal parameters. You can also specify time as the stopping criterion. Enter the time (in minutes) for the network to train.

Set random seed. If no random seed is set, the sequence of random values used to initialize the network weights will be different every time the node is executed. This can cause the node to create different models on different runs, even if the node settings and data values are exactly the same. By selecting this option, you can set the random seed to a specific value so the resulting model is exactly reproducible. A specific random seed always generates the same sequence of random values, in which case executing the node always yields the same generated model.

Note: When using the Set random seed option with records read from a database, a Sort node may be required prior to sampling in order to ensure the same result each time the node is executed. This is because the random seed depends on the order of records, which is not guaranteed to stay the same in a relational database.
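The reproducibility point is the generic property of seeded pseudo-random generation, which can be illustrated in a few lines (this is the general principle, not Modeler code; the function is invented for the example):

```python
import random

def init_weights(n_units, n_inputs, seed=None):
    """Random initial weights; fixing the seed fixes the entire sequence."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_inputs)] for _ in range(n_units)]

# Same seed -> identical initialization on every run.
assert init_weights(12, 2, seed=42) == init_weights(12, 2, seed=42)
# With no seed, each run draws a different sequence, so models differ.
```

The same logic explains the Sort node advice above: a fixed seed fixes the sequence of random numbers, but if the database returns the records in a different order, those numbers are consumed against different records, so the result can still change.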
Note: If you want to include nominal (set) fields in your model but are having memory problems in building the model, or the model is taking too long to build, consider recoding large set fields to reduce the number of values, or consider using a different field with fewer values as a proxy for the large set. For example, if you are having a problem with a product_id field containing values for individual products, you might consider removing it from the model and adding a less detailed product_category field instead.

Optimize. Select options designed to increase performance during model building based on your specific needs. Select Speed to instruct the algorithm to never use disk spilling in order to improve performance. Select Memory to instruct the algorithm to use disk spilling when appropriate, at some sacrifice of speed. This option is selected by default. Note: When running in distributed mode, this setting can be overridden by administrator options specified in options.cfg.

Append cluster label. Selected by default for new models, but deselected for models loaded from earlier versions of IBM® SPSS® Modeler, this creates a single categorical score field of the same type that is created by both the K-Means and TwoStep nodes. This string field is used in the Auto Cluster node when calculating ranking measures for the different model types. For more information, see the topic Auto Cluster Node in Chapter 5 on p. 104.

Kohonen Node Expert Options

For those with detailed knowledge of Kohonen networks, expert options allow you to fine-tune the training process. To access expert options, set the Mode to Expert on the Expert tab.

Figure 11-5 Kohonen expert options

Width and Length. Specify the size (width and length) of the two-dimensional output map as the number of output units along each dimension.

Learning rate decay. Select either linear or exponential learning rate decay.
The learning rate is a weighting factor that decreases over time, such that the network starts off encoding large-scale features of the data and gradually focuses on more fine-level detail.

Phase 1 and Phase 2. Kohonen net training is split into two phases. Phase 1 is a rough estimation phase, used to capture the gross patterns in the data. Phase 2 is a tuning phase, used to adjust the map to model the finer features of the data. For each phase, there are three parameters:

Neighborhood. Sets the starting size (radius) of the neighborhood. This determines the number of “nearby” units that get updated along with the winning unit during training. During phase 1, the neighborhood size starts at Phase 1 Neighborhood and decreases to (Phase 2 Neighborhood + 1). During phase 2, neighborhood size starts at Phase 2 Neighborhood and decreases to 1.0. Phase 1 Neighborhood should be larger than Phase 2 Neighborhood.

Initial Eta. Sets the starting value for the learning rate eta. During phase 1, eta starts at Phase 1 Initial Eta and decreases to Phase 2 Initial Eta. During phase 2, eta starts at Phase 2 Initial Eta and decreases to 0. Phase 1 Initial Eta should be larger than Phase 2 Initial Eta.

Cycles. Sets the number of cycles for each phase of training. Each phase continues for the specified number of passes through the data.

Kohonen Model Nuggets

Kohonen model nuggets contain all of the information captured by the trained Kohonen network, as well as information about the network’s architecture.

When you run a stream containing a Kohonen model nugget, the node adds two new fields containing the X and Y coordinates of the unit in the Kohonen output grid that responded most strongly to that record. The new field names are derived from the model name, prefixed by $KX- and $KY-. For example, if your model is named Kohonen, the new fields would be named $KX-Kohonen and $KY-Kohonen.
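The two-phase schedule described above can be written out explicitly. This sketch assumes linear decay (one of the Learning rate decay options) and uses hypothetical parameter values chosen for illustration, not Modeler defaults.

```python
def decay(start, end, cycle, cycles):
    """Linear interpolation from start to end over a phase's cycles."""
    return start + (end - start) * cycle / cycles

# Hypothetical settings: Phase 1/Phase 2 neighborhoods of 3 and 1,
# Phase 1/Phase 2 initial eta of 0.3 and 0.1, 20 cycles per phase.
n1, n2, eta1, eta2, cycles = 3.0, 1.0, 0.3, 0.1, 20

for c in range(cycles):                       # phase 1: rough estimation
    nbhd = decay(n1, n2 + 1, c, cycles)       # Phase 1 Neighborhood -> (Phase 2 Neighborhood + 1)
    eta = decay(eta1, eta2, c, cycles)        # Phase 1 Initial Eta  -> Phase 2 Initial Eta

for c in range(cycles):                       # phase 2: fine tuning
    nbhd = decay(n2, 1.0, c, cycles)          # Phase 2 Neighborhood -> 1.0
    eta = decay(eta2, 0.0, c, cycles)         # Phase 2 Initial Eta  -> 0
```

Each loop iteration corresponds to one pass (cycle) through the training data, so the neighborhood and learning rate shrink steadily from phase 1 values through phase 2 values to their final size.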
To get a better sense of what the Kohonen net has encoded, click the Model tab on the model nugget browser. This displays the Cluster Viewer, providing a graphical representation of clusters, fields, and importance levels. For more information, see the topic Cluster Viewer - Model Tab on p. 362.

If you prefer to visualize the clusters as a grid, you can view the result of the Kohonen net by plotting the $KX- and $KY- fields using a Plot node. (You should select X-Agitation and Y-Agitation in the Plot node to prevent each unit’s records from all being plotted on top of each other.) In the plot, you can also overlay a symbolic field to investigate how the Kohonen net has clustered the data.

Another powerful technique for gaining insight into the Kohonen network is to use rule induction to discover the characteristics that distinguish the clusters found by the network. For more information, see the topic C5.0 Node in Chapter 6 on p. 160. For general information on using the model browser, see Browsing Model Nuggets.

Kohonen Model Summary

The Summary tab for a Kohonen model nugget displays information about the architecture or topology of the network. The length and width of the two-dimensional Kohonen feature map (the output layer) are shown as $KX-model_name and $KY-model_name. For the input and output layers, the number of units in that layer is listed.

Figure 11-6 Kohonen model nugget Summary tab

K-Means Node

The K-Means node provides a method of cluster analysis. It can be used to cluster the dataset into distinct groups when you don’t know what those groups are at the beginning. Unlike most learning methods in IBM® SPSS® Modeler, K-Means models do not use a target field. This type of learning, with no target field, is called unsupervised learning. Instead of trying to predict an outcome, K-Means tries to uncover patterns in the set of input fields.
Records are grouped so that records within a group or cluster tend to be similar to each other, but records in different groups are dissimilar.

K-Means works by defining a set of starting cluster centers derived from the data. It then assigns each record to the cluster to which it is most similar, based on the record’s input field values. After all cases have been assigned, the cluster centers are updated to reflect the new set of records assigned to each cluster. The records are then checked again to see whether they should be reassigned to a different cluster, and the record assignment/cluster iteration process continues until either the maximum number of iterations is reached or the change between one iteration and the next fails to exceed a specified threshold.

Note: The resulting model depends to a certain extent on the order of the training data. Reordering the data and rebuilding the model may lead to a different final cluster model.

Requirements. To train a K-Means model, you need one or more fields with the role set to Input. Fields with the role set to Target, Both, or None are ignored.

Strengths. You do not need to have data on group membership to build a K-Means model. The K-Means model is often the fastest method of clustering for large datasets.

K-Means Node Model Options

Figure 11-7 K-Means node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.

Specified number of clusters. Specify the number of clusters to generate. The default is 5.

Generate distance field. If this option is selected, the model nugget will include a field containing the distance of each record from the center of its assigned cluster.

Cluster label.
Specify the format for the values in the generated cluster membership field. Cluster membership can be indicated as a String with the specified Label prefix (for example, "Cluster 1", "Cluster 2", and so on) or as a Number.

Note: If you want to include nominal (set) fields in your model but are having memory problems in building the model, or the model is taking too long to build, consider recoding large set fields to reduce the number of values, or consider using a different field with fewer values as a proxy for the large set. For example, if you are having a problem with a product_id field containing values for individual products, you might consider removing it from the model and adding a less detailed product_category field instead.

Optimize. Select options designed to increase performance during model building based on your specific needs. Select Speed to instruct the algorithm to never use disk spilling in order to improve performance. Select Memory to instruct the algorithm to use disk spilling when appropriate, at some sacrifice of speed. This option is selected by default. Note: When running in distributed mode, this setting can be overridden by administrator options specified in options.cfg.

K-Means Node Expert Options

For those with detailed knowledge of k-means clustering, expert options allow you to fine-tune the training process. To access expert options, set the Mode to Expert on the Expert tab.

Figure 11-8 K-Means expert options

Stop on. Specify the stopping criterion to be used in training the model. The Default stopping criterion is 20 iterations or change < 0.000001, whichever occurs first. Select Custom to specify your own stopping criteria.

Maximum Iterations. This option allows you to stop model training after the number of iterations specified.

Change tolerance. This option allows you to stop model training when the largest change in cluster centers for an iteration is less than the level specified.
Encoding value for sets. Specify a value between 0 and 1.0 to use for recoding set fields as groups of numeric fields. The default value is the square root of 0.5 (approximately 0.707107), which provides the proper weighting for recoded flag fields. Values closer to 1.0 will weight set fields more heavily than numeric fields.

K-Means Model Nuggets

K-Means model nuggets contain all of the information captured by the clustering model, as well as information about the training data and the estimation process.

When you run a stream containing a K-Means modeling node, the node adds two new fields containing the cluster membership and distance from the assigned cluster center for that record. The new field names are derived from the model name, prefixed by $KM- for the cluster membership and $KMD- for the distance from the cluster center. For example, if your model is named Kmeans, the new fields would be named $KM-Kmeans and $KMD-Kmeans.

A powerful technique for gaining insight into the K-Means model is to use rule induction to discover the characteristics that distinguish the clusters found by the model. For more information, see the topic C5.0 Node in Chapter 6 on p. 160. You can also click the Model tab on the model nugget browser to display the Cluster Viewer, providing a graphical representation of clusters, fields, and importance levels. For more information, see the topic Cluster Viewer - Model Tab on p. 362. For general information on using the model browser, see Browsing Model Nuggets.

K-Means Model Summary

The Summary tab for a K-Means model nugget contains information about the training data, the estimation process, and the clusters defined by the model. The number of clusters is shown, as well as the iteration history. If you have executed an Analysis node attached to this modeling node, information from that analysis will also be displayed in this section.
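The assignment/update loop and the two stopping criteria described in the preceding sections (Maximum Iterations and Change tolerance), together with the set-field encoding, can be sketched as follows. This is a minimal illustration, not Modeler's implementation: Euclidean distance is assumed, and the starting centers are passed in explicitly for simplicity, whereas Modeler derives them from the data.

```python
import math

def encode_set(value, categories, weight=math.sqrt(0.5)):
    """Recode a set field as a group of numeric flags scaled by the encoding value."""
    return [weight if value == c else 0.0 for c in categories]

def kmeans(records, centers, max_iter=20, tol=0.000001):
    """Iterate assignment and center updates until either stopping criterion fires."""
    k = len(centers)
    for _ in range(max_iter):                  # Maximum Iterations criterion
        clusters = [[] for _ in range(k)]
        for r in records:                      # assign each record to the nearest center
            j = min(range(k), key=lambda j: math.dist(r, centers[j]))
            clusters[j].append(r)
        # recompute each center as the mean of its assigned records
        new_centers = [[sum(col) / len(c) for col in zip(*c)] if c else centers[j]
                       for j, c in enumerate(clusters)]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                        # Change tolerance criterion
            break
    return centers

data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(kmeans(data, centers=[[0, 0], [10, 10]]))
# two centers, near (1/3, 1/3) and (10.33, 10.33)
```

A set field would first be passed through something like encode_set, so each category contributes a flag of magnitude sqrt(0.5) to the distance calculation rather than a full unit, which is the weighting the Encoding value for sets option controls.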
Figure 11-9 K-Means model nugget Summary tab

TwoStep Cluster Node

The TwoStep Cluster node provides a form of cluster analysis. It can be used to cluster the dataset into distinct groups when you don't know what those groups are at the beginning. As with Kohonen nodes and K-Means nodes, TwoStep Cluster models do not use a target field. Instead of trying to predict an outcome, TwoStep Cluster tries to uncover patterns in the set of input fields. Records are grouped so that records within a group or cluster tend to be similar to each other, but records in different groups are dissimilar.

TwoStep Cluster is a two-step clustering method. The first step makes a single pass through the data, during which it compresses the raw input data into a manageable set of subclusters. The second step uses a hierarchical clustering method to progressively merge the subclusters into larger and larger clusters, without requiring another pass through the data. Hierarchical clustering has the advantage of not requiring the number of clusters to be selected ahead of time. Many hierarchical clustering methods start with individual records as starting clusters and merge them recursively to produce ever larger clusters. Though such approaches often break down with large amounts of data, TwoStep's initial preclustering makes hierarchical clustering fast even for large datasets.

Note: The resulting model depends to a certain extent on the order of the training data. Reordering the data and rebuilding the model may lead to a different final cluster model.

Requirements. To train a TwoStep Cluster model, you need one or more fields with the role set to Input. Fields with the role set to Target, Both, or None are ignored. The TwoStep Cluster algorithm does not handle missing values. Records with blanks for any of the input fields will be ignored when building the model.

Strengths.
TwoStep Cluster can handle mixed field types and is able to handle large datasets efficiently. It also has the ability to test several cluster solutions and choose the best, so you don't need to know how many clusters to ask for at the outset. TwoStep Cluster can be set to automatically exclude outliers, or extremely unusual cases that can contaminate your results.

TwoStep Cluster Node Model Options

Figure 11-10 TwoStep Cluster node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.

Standardize numeric fields. By default, TwoStep will standardize all numeric input fields to the same scale, with a mean of 0 and a variance of 1. To retain the original scaling for numeric fields, deselect this option. Symbolic fields are not affected.

Exclude outliers. If you select this option, records that don't seem to fit into a substantive cluster will be automatically excluded from the analysis. This prevents such cases from distorting the results.

Outlier detection occurs during the preclustering step. When this option is selected, subclusters with few records relative to other subclusters are considered potential outliers, and the tree of subclusters is rebuilt excluding those records. The size below which subclusters are considered to contain potential outliers is controlled by the Percentage option. Some of those potential outlier records can be added to the rebuilt subclusters if they are similar enough to any of the new subcluster profiles. The rest of the potential outliers that cannot be merged are considered outliers and are added to a "noise" cluster and excluded from the hierarchical clustering step.
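The size screen described above can be sketched as follows. This is an illustration only: the interpretation of the Percentage threshold as a fraction of the largest subcluster, and the 25% figure, are assumptions for the sake of the example, not the product's exact rule or default:

```python
# Flag subclusters whose record count falls below a chosen percentage of
# the largest subcluster; these become candidates for the "noise" cluster.
def potential_outlier_subclusters(sizes, percentage=25):
    cutoff = max(sizes) * percentage / 100.0
    return [i for i, s in enumerate(sizes) if s < cutoff]

subcluster_sizes = [400, 350, 90, 12, 3]
print(potential_outlier_subclusters(subcluster_sizes))
# → [2, 3, 4]
```

In the real algorithm, records in the flagged subclusters get a second chance to merge into the rebuilt tree before being consigned to the noise cluster.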
When scoring data with a TwoStep model that uses outlier handling, new cases that are more than a certain threshold distance (based on the log-likelihood) from the nearest substantive cluster are considered outliers and are assigned to the "noise" cluster with the name -1.

Cluster label. Specify the format for the generated cluster membership field. Cluster membership can be indicated as a String with the specified Label prefix (for example, "Cluster 1", "Cluster 2", and so on) or as a Number.

Automatically calculate number of clusters. TwoStep cluster can very rapidly analyze a large number of cluster solutions to choose the optimal number of clusters for the training data. Specify a range of solutions to try by setting the Maximum and the Minimum number of clusters. TwoStep uses a two-stage process to determine the optimal number of clusters. In the first stage, an upper bound on the number of clusters in the model is selected based on the change in the Bayesian Information Criterion (BIC) as more clusters are added. In the second stage, the change in the minimum distance between clusters is found for all models with fewer clusters than the minimum-BIC solution. The largest change in distance is used to identify the final cluster model.

Specify number of clusters. If you know how many clusters to include in your model, select this option and enter the number of clusters.

Distance measure. This selection determines how the similarity between two clusters is computed.

Log-likelihood. The likelihood measure places a probability distribution on the variables. Continuous variables are assumed to be normally distributed, while categorical variables are assumed to be multinomial. All variables are assumed to be independent.

Euclidean. The Euclidean measure is the "straight line" distance between two clusters. It can be used only when all of the variables are continuous.

Clustering Criterion.
This selection determines how the automatic clustering algorithm determines the number of clusters. Either the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) can be specified.

TwoStep Cluster Model Nuggets

TwoStep cluster model nuggets contain all of the information captured by the clustering model, as well as information about the training data and the estimation process. When you run a stream containing a TwoStep cluster model nugget, the node adds a new field containing the cluster membership for that record. The new field name is derived from the model name, prefixed by $T-. For example, if your model is named TwoStep, the new field would be named $T-TwoStep.

A powerful technique for gaining insight into the TwoStep model is to use rule induction to discover the characteristics that distinguish the clusters found by the model. For more information, see the topic C5.0 Node in Chapter 6 on p. 160. You can also click the Model tab on the model nugget browser to display the Cluster Viewer, providing a graphical representation of clusters, fields, and importance levels. For more information, see the topic Cluster Viewer - Model Tab on p. 362. For general information on using the model browser, see Browsing Model Nuggets in Chapter 3 on p. 49.

TwoStep Model Summary

The Summary tab for a TwoStep cluster model nugget displays the number of clusters found, along with information about the training data, the estimation process, and build settings used.

Figure 11-11 Sample TwoStep cluster model nugget Summary tab

For more information, see the topic Browsing Model Nuggets in Chapter 3 on p. 49.

The Cluster Viewer

Cluster models are typically used to find groups (or clusters) of similar records based on the variables examined, where the similarity between members of the same group is high and the similarity between members of different groups is low. The results can be used to identify associations that would otherwise not be apparent.
For example, through cluster analysis of customer preferences, income level, and buying habits, it may be possible to identify the types of customers who are more likely to respond to a particular marketing campaign.

There are two approaches to interpreting the results in a cluster display:

Examine clusters to determine characteristics unique to that cluster. Does one cluster contain all the high-income borrowers? Does this cluster contain more records than the others?

Examine fields across clusters to determine how values are distributed among clusters. Does one's level of education determine membership in a cluster? Does a high credit score distinguish between membership in one cluster or another?

Using the main views and the various linked views in the Cluster Viewer, you can gain insight to help you answer these questions.

The following cluster model nuggets can be generated in IBM® SPSS® Modeler:

Kohonen net model nugget
K-Means model nugget
TwoStep cluster model nugget

To see information about the cluster model nuggets, right-click the model node and choose Browse from the context menu (or Edit for nodes in a stream). Alternatively, if you are using the Auto Cluster modeling node, double-click on the required cluster nugget within the Auto Cluster model nugget. For more information, see the topic Auto Cluster Node in Chapter 5 on p. 104.

Cluster Viewer - Model Tab

The Model tab for cluster models shows a graphical display of summary statistics and distributions for fields between clusters; this is known as the Cluster Viewer.

Note: The Model tab is not available for models built in versions of IBM® SPSS® Modeler prior to 13.

Figure 11-12 Cluster Viewer with default display

The Cluster Viewer is made up of two panels, the main view on the left and the linked, or auxiliary, view on the right. There are two main views:

Model Summary (the default). For more information, see the topic Model Summary View on p. 364.

Clusters.
For more information, see the topic Clusters View on p. 365.

There are four linked/auxiliary views:

Predictor Importance. For more information, see the topic Cluster Predictor Importance View on p. 368.

Cluster Sizes (the default). For more information, see the topic Cluster Sizes View on p. 369.

Cell Distribution. For more information, see the topic Cell Distribution View on p. 370.

Cluster Comparison. For more information, see the topic Cluster Comparison View on p. 371.

Model Summary View

Figure 11-13 Model Summary view in the main panel

The Model Summary view shows a snapshot, or summary, of the cluster model, including a Silhouette measure of cluster cohesion and separation that is shaded to indicate poor, fair, or good results. This snapshot enables you to quickly check if the quality is poor, in which case you may decide to return to the modeling node to amend the cluster model settings to produce a better result.

The results of poor, fair, and good are based on the work of Kaufman and Rousseeuw (1990) regarding interpretation of cluster structures. In the Model Summary view, a good result equates to data that reflects Kaufman and Rousseeuw's rating as either reasonable or strong evidence of cluster structure, fair reflects their rating of weak evidence, and poor reflects their rating of no significant evidence.

The silhouette measure averages, over all records, (B−A) / max(A,B), where A is the record's distance to its cluster center and B is the record's distance to the nearest cluster center that it doesn't belong to. A silhouette coefficient of 1 would mean that all cases are located directly on their cluster centers. A value of −1 would mean all cases are located on the cluster centers of some other cluster. A value of 0 means, on average, cases are equidistant between their own cluster center and the nearest other cluster.

The summary includes a table that contains the following information:

Algorithm.
The clustering algorithm used, for example, "TwoStep".

Input Features. The number of fields, also known as inputs or predictors.

Clusters. The number of clusters in the solution.

Clusters View

Figure 11-14 Cluster Centers view in the main panel

The Clusters view contains a cluster-by-features grid that includes cluster names, sizes, and profiles for each cluster. The columns in the grid contain the following information:

Cluster. The cluster numbers created by the algorithm.

Label. Any labels applied to each cluster (this is blank by default). Double-click in the cell to enter a label that describes the cluster contents; for example, "Luxury car buyers".

Description. Any description of the cluster contents (this is blank by default). Double-click in the cell to enter a description of the cluster; for example, "55+ years of age, professionals, earning over $100,000".

Size. The size of each cluster as a percentage of the overall cluster sample. Each size cell within the grid displays a vertical bar that shows the size percentage within the cluster, a size percentage in numeric format, and the cluster case counts.

Features. The individual inputs or predictors, sorted by overall importance by default. If any columns have equal sizes they are shown in ascending sort order of the cluster numbers.

Overall feature importance is indicated by the color of the cell background shading; the most important feature is darkest; the least important feature is unshaded. A guide above the table indicates the importance attached to each feature cell color.

When you hover your mouse over a cell, the full name/label of the feature and the importance value for the cell is displayed. Further information may be displayed, depending on the view and feature type. In the Cluster Centers view, this includes the cell statistic and the cell value; for example: "Mean: 4.32".
For categorical features, the cell shows the name of the most frequent (modal) category and its percentage.

Within the Clusters view, you can select various ways to display the cluster information:

Transpose clusters and features. For more information, see the topic Transpose Clusters and Features on p. 366.

Sort features. For more information, see the topic Sort Features on p. 366.

Sort clusters. For more information, see the topic Sort Clusters on p. 367.

Select cell contents. For more information, see the topic Cell Contents on p. 367.

Transpose Clusters and Features

By default, clusters are displayed as columns and features are displayed as rows. To reverse this display, click the Transpose Clusters and Features button to the left of the Sort Features By buttons. For example, you may want to do this when you have many clusters displayed, to reduce the amount of horizontal scrolling required to see the data.

Figure 11-15 Transposed clusters in the main panel

Sort Features

The Sort Features By buttons enable you to select how feature cells are displayed:

Overall Importance. This is the default sort order. Features are sorted in descending order of overall importance, and sort order is the same across clusters. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names.

Within-Cluster Importance. Features are sorted with respect to their importance for each cluster. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names. When this option is chosen the sort order usually varies across clusters.

Name. Features are sorted by name in alphabetical order.

Data order. Features are sorted by their order in the dataset.

Sort Clusters

By default, clusters are sorted in descending order of size.
The Sort Clusters By buttons enable you to sort them by name in alphabetical order or, if you have created unique labels, in alphanumeric label order instead. Clusters that have the same label are sorted by cluster name. If clusters are sorted by label and you edit the label of a cluster, the sort order is automatically updated.

Cell Contents

The Cells buttons enable you to change the display of the cell contents for features and evaluation fields.

Cluster Centers. By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields and the mode (most frequently occurring category) with category percentage for categorical fields.

Absolute Distributions. Shows feature names/labels and absolute distributions of the features within each cluster. For categorical features, the display shows bar charts overlaid with categories ordered in ascending order of the data values. For continuous features, the display shows a smooth density plot that uses the same endpoints and intervals for each cluster. The solid red colored display shows the cluster distribution, while the paler display represents the overall data.

Relative Distributions. Shows feature names/labels and relative distributions in the cells. In general the displays are similar to those shown for absolute distributions, except that relative distributions are displayed instead. The solid red colored display shows the cluster distribution, while the paler display represents the overall data.

Basic View. Where there are a lot of clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to change the display to a more compact version of the table.

Cluster Predictor Importance View

Figure 11-16 Cluster Predictor Importance view in the link panel

The Predictor Importance view shows the relative importance of each field in estimating the model.
For more information, see the topic Predictor Importance in Chapter 3 on p. 51.

Cluster Sizes View

Figure 11-17 Cluster Sizes view in the link panel

The Cluster Sizes view shows a pie chart that contains each cluster. The percentage size of each cluster is shown on each slice; hover the mouse over each slice to display the count in that slice.

Below the chart, a table lists the following size information:

The size of the smallest cluster (both a count and percentage of the whole).
The size of the largest cluster (both a count and percentage of the whole).
The ratio of size of the largest cluster to the smallest cluster.

Cell Distribution View

Figure 11-18 Cell Distribution view in the link panel

The Cell Distribution view shows an expanded, more detailed, plot of the distribution of the data for any feature cell you select in the table in the Clusters main panel.

Cluster Comparison View

Figure 11-19 Cluster Comparison view in the link panel

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters not only as compared with the overall data, but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.
The background plots show the overall distributions of each feature:

Categorical features are shown as dot plots, where the size of the dot indicates the most frequent/modal category for each cluster (by feature).

Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for selected clusters: For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster. Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer

The Cluster Viewer is an interactive display. You can:

Select a field or cluster to view more details.
Compare clusters to select items of interest.
Alter the display.
Transpose axes.
Generate Derive, Filter, and Select nodes using the Generate menu.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can also reset the viewer to the default settings, and open a dialog box to specify the contents of the Clusters view in the main panel.

Figure 11-20 Toolbars for controlling the data shown on the Cluster Viewer

The Sort Features By, Sort Clusters By, Cells, and Display options are only available when you select the Clusters view in the main panel. For more information, see the topic Clusters View on p. 365.

See Transpose Clusters and Features on p. 366
See Sort Features By on p. 366
See Sort Clusters By on p. 367
See Cells on p. 367

Generating Nodes from Cluster Models

The Generate menu enables you to create new nodes based on the cluster model.
This option is available from the Model tab of the generated model and enables you to generate nodes based on either the current display or selection (that is, all visible clusters or all selected ones). For example, you can select a single feature and then generate a Filter node to discard all other (nonvisible) features. The generated nodes are placed unconnected on the canvas. In addition, you can generate a copy of the model nugget to the models palette. Remember to connect the nodes and make any desired edits before execution.

Generate Modeling Node. Creates a modeling node on the stream canvas. This would be useful, for example, if you have a stream in which you want to use these model settings but you no longer have the modeling node used to generate them.

Model to Palette. Creates the nugget on the Models palette. This is useful in situations where a colleague may have sent you a stream containing the model and not the model itself.

Filter Node. Creates a new Filter node to filter fields that are not used by the cluster model, and/or not visible in the current Cluster Viewer display. If there is a Type node upstream from this Cluster node, any fields with the role Target are discarded by the generated Filter node.

Filter Node (from selection). Creates a new Filter node to filter fields based on selections in the Cluster Viewer. Select multiple fields using the Ctrl-click method. Fields selected in the Cluster Viewer are discarded downstream, but you can change this behavior by editing the Filter node before execution.

Select Node. Creates a new Select node to select records based on their membership in any of the clusters visible in the current Cluster Viewer display. A select condition is automatically generated.

Select Node (from selection). Creates a new Select node to select records based on membership in clusters selected in the Cluster Viewer. Select multiple clusters using the Ctrl-click method.

Derive Node.
Creates a new Derive node, which derives a flag field that assigns records a value of True or False based on membership in all clusters visible in the Cluster Viewer. A derive condition is automatically generated.

Derive Node (from selection). Creates a new Derive node, which derives a flag field based on membership in clusters selected in the Cluster Viewer. Select multiple clusters using the Ctrl-click method.

In addition to generating nodes, you can also create graphs from the Generate menu. For more information, see the topic Generating Graphs from Cluster Models on p. 374.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Figure 11-21 Cluster Viewer - Display options

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model, but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Generating Graphs from Cluster Models

Cluster models provide a lot of information; however, it may not always be in an easily accessible format for business users. To provide the data in a way that can be easily incorporated into business reports, presentations, and so on, you can produce graphs of selected data. For example, from the Cluster Viewer you can generate a graph for a selected cluster, thereby only creating a graph for the cases in that cluster.
Note: You can only generate a graph from the Cluster Viewer when the model nugget is attached to other nodes in a stream.

Generate a graph

E Open the model nugget containing the Cluster Viewer.
E On the Model tab, select Clusters from the View drop-down list.
E In the main view, select the cluster, or clusters, for which you want to produce a graph.
E From the Generate menu, select Graph (from selection); the Graphboard Basic tab is displayed.

Figure 11-22 Graphboard node dialog box, Basic tab

Note: Only the Basic and Detailed tabs are available when you display the Graphboard in this way.

E Using either the Basic or Detailed tab settings, specify the details to be displayed on the graph.
E Click OK to generate the graph.

Figure 11-23 Histogram generated from Graphboard Basic tab

The graph heading identifies the model type and cluster, or clusters, that were chosen for inclusion.

Chapter 12
Association Rules

Association rules associate a particular conclusion (the purchase of a particular product, for example) with a set of conditions (the purchase of several other products, for example). For example, the rule

beer <= cannedveg & frozenmeal (173, 17.0%, 0.84)

states that beer often occurs when cannedveg and frozenmeal occur together. The rule is 84% reliable and applies to 17% of the data, or 173 records. Association rule algorithms automatically find the associations that you could find manually using visualization techniques, such as the Web node.

Figure 12-1 Web node showing associations between market basket items

The advantage of association rule algorithms over the more standard decision tree algorithms (C5.0 and C&R Trees) is that associations can exist between any of the attributes. A decision tree algorithm will build rules with only a single conclusion, whereas association algorithms attempt to find many rules, each of which may have a different conclusion.
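The three numbers attached to the example rule (instances, support, confidence) can be reproduced on a small scale. A sketch using invented truth-table records; the field names mirror the beer example, but the data and figures below are made up for illustration:

```python
# Instances = records matching the antecedents; support = instances as a
# fraction of all records; confidence = fraction of the instances in which
# the consequent also holds.
def rule_stats(records, antecedents, consequent):
    matching = [r for r in records if all(r[a] for a in antecedents)]
    instances = len(matching)
    support = instances / len(records)
    confidence = (sum(1 for r in matching if r[consequent]) / instances
                  if instances else 0.0)
    return instances, support, confidence

records = [
    {"cannedveg": True,  "frozenmeal": True,  "beer": True},
    {"cannedveg": True,  "frozenmeal": True,  "beer": True},
    {"cannedveg": True,  "frozenmeal": True,  "beer": False},
    {"cannedveg": True,  "frozenmeal": False, "beer": False},
    {"cannedveg": False, "frozenmeal": True,  "beer": False},
]
print(rule_stats(records, ["cannedveg", "frozenmeal"], "beer"))
```

Here the rule matches three of five records (support 60%) and holds in two of those three (confidence about 67%). Note this computes antecedent support, the definition Apriori uses; CARMA and Sequence define support over the whole rule, as discussed later in this chapter.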
The disadvantage of association algorithms is that they are trying to find patterns within a potentially very large search space and, hence, can require much more time to run than a decision tree algorithm. The algorithms use a generate and test method for finding rules—simple rules are generated initially, and these are validated against the dataset. The good rules are stored and all rules, subject to various constraints, are then specialized. Specialization is the process of adding conditions to a rule. These new rules are then validated against the data, and the process iteratively stores the best or most interesting rules found. The user usually supplies some limit to the possible number of antecedents to allow in a rule, and various techniques based on information theory or efficient indexing schemes are used to reduce the potentially large search space. At the end of the processing, a table of the best rules is presented.

Unlike a decision tree, this set of association rules cannot be used directly to make predictions in the way that a standard model (such as a decision tree or a neural network) can. This is due to the many different possible conclusions for the rules. Another level of transformation is required to transform the association rules into a classification rule set. Hence, the association rules produced by association algorithms are known as unrefined models. Although the user can browse these unrefined models, they cannot be used explicitly as classification models unless the user tells the system to generate a classification model from the unrefined model. This is done from the browser through a Generate menu option.

Two association rule algorithms are supported:

The Apriori node extracts a set of rules from the data, pulling out the rules with the highest information content.
Apriori offers five different methods of selecting rules and uses a sophisticated indexing scheme to process large data sets efficiently. For large problems, Apriori is generally faster to train; it has no arbitrary limit on the number of rules that can be retained, and it can handle rules with up to 32 preconditions. Apriori requires that input and output fields all be categorical but delivers better performance because it is optimized for this type of data. For more information, see the topic Apriori Node on p. 379.

The Sequence node discovers association rules in sequential or time-oriented data. A sequence is a list of item sets that tends to occur in a predictable order. For example, a customer who purchases a razor and aftershave lotion may purchase shaving cream the next time he shops. The Sequence node is based on the CARMA association rules algorithm, which uses an efficient two-pass method for finding sequences. For more information, see the topic Sequence Node on p. 404.

Tabular versus Transactional Data

Data used by association rule models may be in transactional or tabular format, as described below. These are general descriptions; specific requirements may vary as discussed in the documentation for each model type. Note that when scoring models, the data to be scored must mirror the format of the data used to build the model. Models built using tabular data can be used to score only tabular data; models built using transactional data can score only transactional data.

Transactional Format

Transactional data have a separate record for each transaction or item. If a customer makes multiple purchases, for example, each would be a separate record, with associated items linked by a customer ID. This is also sometimes known as till-roll format.

Customer   Purchase
1          jam
2          milk
3          jam
3          bread
4          jam
4          bread
4          milk

The Apriori, CARMA, and Sequence nodes can all use transactional data.
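The two formats carry the same information. A sketch of converting the transactional records above into tabular (truth-table) form, with Python dictionaries standing in for the data stream:

```python
from collections import defaultdict

# One (customer, item) record per purchase, as in the transactional table.
transactions = [(1, "jam"), (2, "milk"), (3, "jam"), (3, "bread"),
                (4, "jam"), (4, "bread"), (4, "milk")]

# Group purchases by customer, then emit one flag per item per customer.
baskets = defaultdict(set)
for customer, item in transactions:
    baskets[customer].add(item)

items = ["jam", "bread", "milk"]
tabular = {c: {i: ("T" if i in basket else "F") for i in items}
           for c, basket in sorted(baskets.items())}
print(tabular[4])
# → {'jam': 'T', 'bread': 'T', 'milk': 'T'}
```

Each resulting row matches one row of the tabular table that follows; SPSS Modeler streams would typically perform this restructuring with SetToFlag or similar field operations rather than external code.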
Tabular Data

Tabular data (also known as basket or truth-table data) have items represented by separate flags, where each flag field represents the presence or absence of a specific item. Each record represents a complete set of associated items. Flag fields can be categorical or numeric, although certain models may have more specific requirements.

Customer   Jam   Bread   Milk
1          T     F       F
2          F     F       T
3          T     T       F
4          T     T       T

The Apriori, CARMA, and Sequence nodes can all use tabular data.

Apriori Node

The Apriori node also discovers association rules in the data. Apriori offers five different methods of selecting rules and uses a sophisticated indexing scheme to efficiently process large datasets.

Requirements. To create an Apriori rule set, you need one or more Input fields and one or more Target fields. Input and output fields (those with the role Input, Target, or Both) must be symbolic. Fields with the role None are ignored. Field types must be fully instantiated before executing the node. Data can be in tabular or transactional format. For more information, see the topic Tabular versus Transactional Data on p. 378.

Strengths. For large problems, Apriori is generally faster to train. It also has no arbitrary limit on the number of rules that can be retained and can handle rules with up to 32 preconditions. Apriori offers five different training methods, allowing more flexibility in matching the data mining method to the problem at hand.

Apriori Node Model Options

Figure 12-2 Apriori node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Minimum antecedent support. You can specify a support criterion for keeping rules in the rule set. Support refers to the percentage of records in the training data for which the antecedents (the "if" part of the rule) are true.
(Note that this definition of support differs from that used in the CARMA and Sequence nodes. For more information, see the topic Sequence Node Model Options on p. 406.) If you are getting rules that apply to very small subsets of the data, try increasing this setting.

Note: The definition of support for Apriori is based on the number of records with the antecedents. This is in contrast to the CARMA and Sequence algorithms, for which the definition of support is based on the number of records with all the items in a rule (that is, both the antecedents and consequent). The results for association models show both the (antecedent) support and rule support measures.

Minimum rule confidence. You can also specify a confidence criterion. Confidence is based on the records for which the rule’s antecedents are true and is the percentage of those records for which the consequent(s) are also true. In other words, it’s the percentage of predictions based on the rule that are correct. Rules with lower confidence than the specified criterion are discarded. If you are getting too many rules, try increasing this setting. If you are getting too few rules (or no rules at all), try decreasing this setting.

Maximum number of antecedents. You can specify the maximum number of preconditions for any rule. This is a way to limit the complexity of the rules. If the rules are too complex or too specific, try decreasing this setting. This setting also has a large influence on training time. If your rule set is taking too long to train, try reducing this setting.

Only true values for flags. If this option is selected for data in tabular (truth table) format, then only true values will be included in the resulting rules. This can help make rules easier to understand. The option does not apply to data in transactional format. For more information, see the topic Tabular versus Transactional Data on p. 378.

Optimize.
Select options designed to increase performance during model building based on your specific needs. Select Speed to instruct the algorithm to never use disk spilling in order to improve performance. Select Memory to instruct the algorithm to use disk spilling when appropriate at some sacrifice to speed. This option is selected by default. Note: When running in distributed mode, this setting can be overridden by administrator options specified in options.cfg. See the IBM® SPSS® Modeler Server Administrator’s Guide for more information.

Apriori Node Expert Options

For those with detailed knowledge of Apriori’s operation, the following expert options allow you to fine-tune the induction process. To access expert options, set the Mode to Expert on the Expert tab.

Figure 12-3 Apriori expert options

Evaluation measure. Apriori supports five methods of evaluating potential rules.

Rule Confidence. The default method uses the confidence (or accuracy) of the rule to evaluate rules. For this measure, the Evaluation measure lower bound is disabled, since it is redundant with the Minimum rule confidence option on the Model tab. For more information, see the topic Apriori Node Model Options on p. 380.

Confidence Difference. (Also called absolute confidence difference to prior.) This evaluation measure is the absolute difference between the rule’s confidence and its prior confidence. This option prevents bias where the outcomes are not evenly distributed, which helps prevent “obvious” rules from being kept. For example, it may be the case that 80% of customers buy your most popular product. A rule that predicts buying that popular product with 85% accuracy doesn’t add much to your knowledge, even though 85% accuracy may seem quite good on an absolute scale. Set the evaluation measure lower bound to the minimum difference in confidence for which you want rules to be kept.

Confidence Ratio. (Also called difference of confidence quotient to 1.)
This evaluation measure is the ratio of rule confidence to prior confidence (or, if the ratio is greater than one, its reciprocal) subtracted from 1. Like Confidence Difference, this method takes uneven distributions into account. It is especially good at finding rules that predict rare events. For example, suppose that there is a rare medical condition that occurs in only 1% of patients. A rule that is able to predict this condition 10% of the time is a great improvement over random guessing, even though on an absolute scale, 10% accuracy might not seem very impressive. Set the evaluation measure lower bound to the difference for which you want rules to be kept.

Information Difference. (Also called information difference to prior.) This measure is based on the information gain measure. If the probability of a particular consequent is considered as a logical value (a bit), then the information gain is the proportion of that bit that can be determined, based on the antecedents. The information difference is the difference between the information gain, given the antecedents, and the information gain, given only the prior confidence of the consequent. An important feature of this method is that it takes support into account, so that rules that cover more records are preferred for a given level of confidence. Set the evaluation measure lower bound to the information difference for which you want rules to be kept. Note: Because the scale for this measure is somewhat less intuitive than the other scales, you may need to experiment with different lower bounds to get a satisfactory rule set.

Normalized Chi-square. (Also called normalized chi-squared measure.) This measure is a statistical index of association between antecedents and consequents. The measure is normalized to take values between 0 and 1. This measure is even more strongly dependent on support than the information difference measure.
Set the evaluation measure lower bound to the normalized chi-square value for which you want rules to be kept. Note: As with the information difference measure, the scale for this measure is somewhat less intuitive than the other scales, so you may need to experiment with different lower bounds to get a satisfactory rule set.

Allow rules without antecedents. Select to allow rules that include only the consequent (item or item set). This is useful when you are interested in determining common items or item sets. For example, cannedveg is a single-item rule without an antecedent that indicates purchasing cannedveg is a common occurrence in the data. In some cases, you may want to include such rules if you are simply interested in the most confident predictions. This option is off by default. By convention, antecedent support for rules without antecedents is expressed as 100%, and rule support will be the same as confidence.

CARMA Node

The CARMA node uses an association rules discovery algorithm to discover association rules in the data. Association rules are statements in the form

if antecedent(s) then consequent(s)

For example, if a Web customer purchases a wireless card and a high-end wireless router, the customer is also likely to purchase a wireless music server if offered. The CARMA model extracts a set of rules from the data without requiring you to specify input or target fields. This means that the rules generated can be used for a wider variety of applications. For example, you can use rules generated by this node to find a list of products or services (antecedents) whose consequent is the item that you want to promote this holiday season. Using IBM® SPSS® Modeler, you can determine which clients have purchased the antecedent products and construct a marketing campaign designed to promote the consequent product.

Requirements. In contrast to Apriori, the CARMA node does not require Input or Target fields.
This is integral to the way the algorithm works and is equivalent to building an Apriori model with all fields set to Both. You can constrain which items are listed only as antecedents or consequents by filtering the model after it is built. For example, you can use the model browser to find a list of products or services (antecedents) whose consequent is the item that you want to promote this holiday season.

To create a CARMA rule set, you need to specify an ID field and one or more content fields. The ID field can have any role or measurement level. Fields with the role None are ignored. Field types must be fully instantiated before executing the node. Like Apriori, data may be in tabular or transactional format. For more information, see the topic Tabular versus Transactional Data on p. 378.

Strengths. The CARMA node is based on the CARMA association rules algorithm. In contrast to Apriori, the CARMA node offers build settings for rule support (support for both antecedent and consequent) rather than antecedent support. CARMA also allows rules with multiple consequents. Like Apriori, models generated by a CARMA node can be inserted into a data stream to create predictions. For more information, see the topic Model Nuggets in Chapter 3 on p. 43.

CARMA Node Fields Options

Before executing a CARMA node, you must specify input fields on the Fields tab of the CARMA node. While most modeling nodes share identical Fields tab options, the CARMA node contains several unique options. All options are discussed below.

Figure 12-4 CARMA node fields options

Use Type node settings. This option tells the node to use field information from an upstream Type node. This is the default.

Use custom settings. This option tells the node to use field information specified here instead of that given in any upstream Type node(s). After selecting this option, specify fields below according to whether you are reading data in transactional or tabular format.
Use transactional format. This option changes the field controls in the rest of this dialog box depending on whether your data are in transactional or tabular format. If you use multiple fields with transactional data, the items specified in these fields for a particular record are assumed to represent items found in a single transaction with a single timestamp. For more information, see the topic Tabular versus Transactional Data on p. 378.

Tabular data

If Use transactional format is not selected, the following fields are displayed.

Inputs. Select the input field or fields. This is similar to setting the field role to Input in a Type node.

Partition. This field allows you to specify a field used to partition the data into separate samples for the training, testing, and validation stages of model building. By using one sample to generate the model and a different sample to test it, you can get a good indication of how well the model will generalize to larger datasets that are similar to the current data. If multiple partition fields have been defined by using Type or Partition nodes, a single partition field must be selected on the Fields tab in each modeling node that uses partitioning. (If only one partition is present, it is automatically used whenever partitioning is enabled.) Also note that to apply the selected partition in your analysis, partitioning must also be enabled in the Model Options tab for the node. (Deselecting this option makes it possible to disable partitioning without changing field settings.)

Transactional data

If you select Use transactional format, the following fields are displayed.

ID. For transactional data, select an ID field from the list. Numeric or symbolic fields can be used as the ID field. Each unique value of this field should indicate a specific unit of analysis. For example, in a market basket application, each ID might represent a single customer.
For a Web log analysis application, each ID might represent a computer (by IP address) or a user (by login data).

IDs are contiguous. (Apriori and CARMA nodes only) If your data are presorted so that all records with the same ID are grouped together in the data stream, select this option to speed up processing. If your data are not presorted (or you are not sure), leave this option unselected and the node will sort the data automatically. Note: If your data are not sorted and you select this option, you may get invalid results in your model.

Content. Specify the content field(s) for the model. These fields contain the items of interest in association modeling. You can specify multiple flag fields (if data are in tabular format) or a single nominal field (if data are in transactional format).

CARMA Node Model Options

Figure 12-5 CARMA node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Minimum rule support (%). You can specify a support criterion. Rule support refers to the proportion of IDs in the training data that contain the entire rule. (Note that this definition of support differs from antecedent support used in the Apriori nodes.) If you want to focus on more common rules, increase this setting.

Minimum rule confidence (%). You can specify a confidence criterion for keeping rules in the rule set. Confidence refers to the percentage of IDs where a correct prediction is made (out of all IDs for which the rule makes a prediction). It is calculated as the number of IDs for which the entire rule is found divided by the number of IDs for which the antecedents are found, based on the training data. Rules with lower confidence than the specified criterion are discarded. If you are getting uninteresting or too many rules, try increasing this setting. If you are getting too few rules, try decreasing this setting.
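The two criteria above can be sketched in code. The following is a hedged illustration over toy per-ID item sets (the IDs and items are invented), not the CARMA implementation:

```python
# Sketch of the rule support and confidence definitions above, computed over
# toy per-ID item sets. Illustrative only; not the CARMA implementation.
baskets = {
    "id1": {"bread", "cheese", "wine"},
    "id2": {"bread", "cheese"},
    "id3": {"bread"},
    "id4": {"milk"},
}

def rule_support(antecedents, consequents, baskets):
    """Proportion of IDs whose items contain the entire rule."""
    whole_rule = antecedents | consequents
    return sum(whole_rule <= items for items in baskets.values()) / len(baskets)

def confidence(antecedents, consequents, baskets):
    """IDs containing the whole rule divided by IDs containing the antecedents."""
    with_ante = sum(antecedents <= items for items in baskets.values())
    with_rule = sum(antecedents | consequents <= items for items in baskets.values())
    return with_rule / with_ante

print(rule_support({"bread"}, {"cheese"}, baskets))  # 0.5
print(confidence({"bread"}, {"cheese"}, baskets))    # 0.666...
```

Here bread & cheese appears in two of four IDs (rule support 50%), while bread appears in three, so the rule bread -> cheese has confidence 2/3.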
Maximum rule size. You can set the maximum number of distinct item sets (as opposed to items) in a rule. If the rules of interest are relatively short, you can decrease this setting to speed up building the rule set.

CARMA Node Expert Options

For those with detailed knowledge of the CARMA node’s operation, the following expert options allow you to fine-tune the model-building process. To access expert options, set the mode to Expert on the Expert tab.

Figure 12-6 CARMA node expert options

Exclude rules with multiple consequents. Select to exclude “two-headed” consequents—that is, consequents that contain two items. For example, the rule bread & cheese & fish -> wine&fruit contains a two-headed consequent, wine&fruit. By default, such rules are included.

Set pruning value. To conserve memory, the CARMA algorithm periodically removes (prunes) infrequent item sets from its list of potential item sets during processing. Select this option to adjust the frequency of pruning; the number you specify determines how often pruning occurs. Enter a smaller value to decrease the memory requirements of the algorithm (but potentially increase the training time required), or enter a larger value to speed up training (but potentially increase memory requirements). The default value is 500.

Vary support. Select to increase efficiency by excluding infrequent item sets that seem to be frequent when they are included unevenly. This is achieved by starting with a higher support level and tapering it down to the level specified on the Model tab. Enter a value for Estimated number of transactions to specify how quickly the support level should be tapered.

Allow rules without antecedents. Select to allow rules that include only the consequent (item or item set). This is useful when you are interested in determining common items or item sets.
For example, cannedveg is a single-item rule without an antecedent that indicates purchasing cannedveg is a common occurrence in the data. In some cases, you may want to include such rules if you are simply interested in the most confident predictions. This option is unselected by default.

Association Rule Model Nuggets

Association rule model nuggets represent the rules discovered by one of the following association rule modeling nodes:

Apriori
CARMA

The model nuggets contain information about the rules extracted from the data during model building.

Viewing Results

You can browse the rules generated by association models (Apriori and CARMA) and Sequence models using the Model tab on the dialog box. Browsing a model nugget shows you the information about the rules and provides options for filtering and sorting results before generating new nodes or scoring the model.

Scoring the Model

Refined model nuggets (Apriori, CARMA, and Sequence) may be added to a stream and used for scoring. For more information, see the topic Using Model Nuggets in Streams in Chapter 3 on p. 63. Model nuggets used for scoring include an extra Settings tab on their respective dialog boxes. For more information, see the topic Association Rule Model Nugget Settings on p. 395. An unrefined model nugget cannot be used for scoring in its raw format. Instead, you can generate a rule set and use the rule set for scoring. For more information, see the topic Generating a Rule Set from an Association Model Nugget on p. 398.

Association Rule Model Nugget Details

On the Model tab of an Association Rule model nugget, you can see a table containing the rules extracted by the algorithm. Each row in the table represents a rule. The first column represents the consequents (the “then” part of the rule), while the next column represents the antecedents (the “if” part of the rule). Subsequent columns contain rule information, such as confidence, support, and lift.
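The measures shown in those columns are defined in the pages that follow; as a rough numeric sketch of how they relate (counts are invented for illustration, and the formulas are restated from this section, not Modeler code):

```python
# Rough numeric sketch of the rule measures shown in the nugget table
# (antecedent support, rule support, confidence, lift, deployability),
# using the formulas described in this section. Counts are invented.
n_records = 1000
antecedent_count = 500   # records where the antecedents are true
rule_count = 200         # records where antecedents and consequent are true
consequent_prior = 0.10  # prior probability of the consequent

antecedent_support = antecedent_count / n_records  # 0.5
rule_support = rule_count / n_records              # 0.2
confidence = rule_support / antecedent_support     # 0.4
lift = confidence / consequent_prior               # 4.0
deployability = (antecedent_count - rule_count) / n_records * 100  # 30.0

print(confidence, lift, deployability)
```

In this example, 30% of all records match the antecedents but not the consequent, which is the rule's deployability.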
Figure 12-7 Association Rule nugget Model tab

Association rules are often shown in the following format:

Consequent      Antecedent
Drug = drugY    Sex = F
                BP = HIGH

The example rule is interpreted as if Sex = “F” and BP = “HIGH,” then Drug is likely to be drugY; or to phrase it another way, for records where Sex = “F” and BP = “HIGH,” Drug is likely to be drugY. Using the dialog box toolbar, you can choose to display additional information, such as confidence, support, and instances.

Sort menu. The Sort menu button on the toolbar controls the sorting of rules. The direction of sorting (ascending or descending) can be changed using the sort direction button (up or down arrow).

Figure 12-8 Toolbar options for sorting

You can sort rules by:

Support
Confidence
Rule Support
Consequent
Lift
Deployability

Show/Hide menu. The Show/Hide menu (criteria toolbar button) controls options for the display of rules.

Figure 12-9 Show/Hide button

The following display options are available:

Rule ID displays the rule ID assigned during model building. A rule ID enables you to identify which rules are being applied for a given prediction. Rule IDs also allow you to merge additional rule information, such as deployability, product information, or antecedents, at a later time.

Instances displays information about the number of unique IDs to which the rule applies—that is, for which the antecedents are true. For example, given the rule bread -> cheese, the number of records in the training data that include the antecedent bread are referred to as instances.

Support displays antecedent support—that is, the proportion of IDs for which the antecedents are true, based on the training data. For example, if 50% of the training data includes the purchase of bread, then the rule bread -> cheese will have an antecedent support of 50%. Note: Support as defined here is the same as the instances but is represented as a percentage.
Confidence displays the ratio of rule support to antecedent support. This indicates the proportion of IDs with the specified antecedent(s) for which the consequent(s) is/are also true. For example, if 50% of the training data contains bread (indicating antecedent support) but only 20% contains both bread and cheese (indicating rule support), then confidence for the rule bread -> cheese would be Rule Support / Antecedent Support or, in this case, 40%.

Rule Support displays the proportion of IDs for which the entire rule, antecedents, and consequent(s) are true. For example, if 20% of the training data contains both bread and cheese, then rule support for the rule bread -> cheese is 20%.

Lift displays the ratio of confidence for the rule to the prior probability of having the consequent. For example, if 10% of the entire population buys bread, then a rule that predicts whether people will buy bread with 20% confidence will have a lift of 20/10 = 2. If another rule tells you that people will buy bread with 11% confidence, then the rule has a lift of close to 1, meaning that having the antecedent(s) does not make a lot of difference in the probability of having the consequent. In general, rules with lift different from 1 will be more interesting than rules with lift close to 1.

Deployability is a measure of what percentage of the training data satisfies the conditions of the antecedent but does not satisfy the consequent. In product purchase terms, it basically means what percentage of the total customer base owns (or has purchased) the antecedent(s) but has not yet purchased the consequent. The deployability statistic is defined as ((Antecedent Support in # of Records - Rule Support in # of Records) / Number of Records) * 100, where Antecedent Support means the number of records for which the antecedents are true and Rule Support means the number of records for which both antecedents and the consequent are true.

Filter button.
The Filter button (funnel icon) on the menu expands the bottom of the dialog box to show a panel where active rule filters are displayed. Filters are used to narrow the number of rules displayed on the Model tab.

Figure 12-10 Filter button

To create a filter, click the Filter icon to the right of the expanded panel. This opens a separate dialog box in which you can specify constraints for displaying rules. Note that the Filter button is often used in conjunction with the Generate menu to first filter rules and then generate a model containing that subset of rules. For more information, see Specifying Filters for Rules below.

Find Rule button. The Find Rule button (binoculars icon) enables you to search the rules shown for a specified rule ID. The adjacent display box indicates the number of rules currently displayed out of the number available. Rule IDs are assigned by the model in the order of discovery at the time and are added to the data during scoring.

Figure 12-11 Find Rule button

To reorder rule IDs:

E You can rearrange rule IDs in IBM® SPSS® Modeler by first sorting the rule display table according to the desired measurement, such as confidence or lift.
E Then, using options from the Generate menu, create a filtered model.
E In the Filtered Model dialog box, select Renumber rules consecutively starting with, and specify a start number.

For more information, see the topic Generating a Filtered Model on p. 399.

Specifying Filters for Rules

By default, rule algorithms such as Apriori, CARMA, and Sequence may generate a large and cumbersome number of rules. To enhance clarity when browsing or to streamline rule scoring, you should consider filtering rules so that consequents and antecedents of interest are more prominently displayed. Using the filtering options on the Model tab of a rule browser, you can open a dialog box for specifying filter qualifications.

Figure 12-12 Rules browser filter dialog box

Consequents.
Select Enable Filter to activate options for filtering rules based on the inclusion or exclusion of specified consequents. Select Includes any of to create a filter where rules contain at least one of the specified consequents. Alternatively, select Excludes to create a filter excluding specified consequents. You can select consequents using the picker icon to the right of the list box. This opens a dialog box listing all consequents present in the generated rules. Note: Consequents may contain more than one item. Filters will check only that a consequent contains one of the items specified.

Antecedents. Select Enable Filter to activate options for filtering rules based on the inclusion or exclusion of specified antecedents. You can select items using the picker icon to the right of the list box. This opens a dialog box listing all antecedents present in the generated rules. Select Includes all of to set the filter as an inclusionary one where all antecedents specified must be included in a rule. Select Includes any of to create a filter where rules contain at least one of the specified antecedents. Select Excludes to create a filter excluding rules that contain a specified antecedent.

Confidence. Select Enable Filter to activate options for filtering rules based on the level of confidence for a rule. You can use the Min and Max controls to specify a confidence range. When you are browsing generated models, confidence is listed as a percentage. When you are scoring output, confidence is expressed as a number between 0 and 1.

Antecedent Support. Select Enable Filter to activate options for filtering rules based on the level of antecedent support for a rule. Antecedent support indicates the proportion of training data that contains the same antecedents as the current rule, making it analogous to a popularity index. You can use the Min and Max controls to specify a range used to filter rules based on support level.

Lift.
Select Enable Filter to activate options for filtering rules based on the lift measurement for a rule. Note: Lift filtering is available only for association models built after release 8.5 or for earlier models that contain a lift measurement. Sequence models do not contain this option. Click OK to apply all filters that have been enabled in this dialog box.

Generating Graphs for Rules

The Association nodes provide a lot of information; however, it may not always be in an easily accessible format for business users. To provide the data in a way that can be easily incorporated into business reports, presentations, and so on, you can produce graphs of selected data. From the Model tab, you can generate a graph for a selected rule, thereby creating a graph only for the cases in that rule.

E On the Model tab, select the rule in which you are interested.
E From the Generate menu, select Graph (from selection). The Graphboard Basic tab is displayed.

Figure 12-13 Graphboard node dialog box, Basic tab

Note: Only the Basic and Detailed tabs are available when you display the Graphboard in this way.

E Using either the Basic or Detailed tab settings, specify the details to be displayed on the graph.
E Click OK to generate the graph.

Figure 12-14 Graphboard node dialog box, Basic tab

The graph heading identifies the rule and antecedent details that were chosen for inclusion.

Association Rule Model Nugget Settings

This Settings tab is used to specify scoring options for association models (Apriori and CARMA). This tab is available only after the model nugget has been added to a stream for purposes of scoring. Note: The dialog box for browsing an unrefined model does not include the Settings tab, since it cannot be scored. To score the “unrefined” model, you must first generate a rule set. For more information, see the topic Generating a Rule Set from an Association Model Nugget on p. 398.
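The scoring behavior configured on this tab can be pictured roughly as follows: rules whose antecedents appear in the basket are ranked by a chosen criterion and the top predictions are returned. This is a hypothetical Python illustration with invented rule data, not SPSS Modeler's implementation:

```python
# Illustrative sketch of "top predictions" scoring: rules whose antecedents
# match the basket are sorted by a chosen criterion (confidence here) and
# the top N are returned. Rule data and field names are invented.
rules = [
    {"id": 15, "antecedents": {"bread", "wine"}, "consequent": "meat", "confidence": 0.54},
    {"id": 22, "antecedents": {"cheese"}, "consequent": "fruit", "confidence": 0.43},
    {"id": 5, "antecedents": {"bread", "cheese"}, "consequent": "frozveg", "confidence": 0.24},
]

def top_predictions(basket, rules, n=3, criterion="confidence"):
    """Return up to n (consequent, score, rule ID) triples for matching rules."""
    matching = [r for r in rules if r["antecedents"] <= basket]
    matching.sort(key=lambda r: r[criterion], reverse=True)
    return [(r["consequent"], r[criterion], r["id"]) for r in matching[:n]]

print(top_predictions({"bread", "wine", "cheese"}, rules))
# [('meat', 0.54, 15), ('fruit', 0.43, 22), ('frozveg', 0.24, 5)]
```

Note that, as with the default Ignore unmatched basket items behavior, extra items in the basket do not prevent a rule from matching.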
Figure 12-15 Association Rule model nugget Settings tab

Maximum number of predictions. Specify the maximum number of predictions included for each set of basket items. This option is used in conjunction with Rule Criterion below to produce the “top” predictions, where top indicates the highest level of confidence, support, lift, and so on, as specified below.

Rule Criterion. Select the measure used to determine the strength of rules. Rules are sorted by the strength of the criterion selected here in order to return the top predictions for an item set. Available criteria are:

Confidence
Support
Rule support (Support * Confidence)
Lift
Deployability

Allow repeat predictions. Select to include multiple rules with the same consequent when scoring. For example, selecting this option allows the following rules to be scored:

bread & cheese -> wine
cheese & fruit -> wine

Turn off this option to exclude repeat predictions when scoring. Note: Rules with multiple consequents (bread & cheese & fruit -> wine & pate) are considered repeat predictions only if all consequents (wine & pate) have been predicted before.

Ignore unmatched basket items. Select to ignore the presence of additional items in the item set. For example, when this option is selected for a basket that contains [tent & sleeping bag & kettle], the rule tent & sleeping bag -> gas_stove will apply despite the extra item (kettle) present in the basket. There may be some circumstances where extra items should be excluded. For example, it is likely that someone who purchases a tent, sleeping bag, and kettle may already have a gas stove, indicated by the presence of the kettle. In other words, a gas stove may not be the best prediction. In such cases, you should deselect Ignore unmatched basket items to ensure that rule antecedents exactly match the contents of a basket. By default, unmatched items are ignored.

Check that predictions are not in basket.
Select to ensure that consequents are not also present in the basket. For example, if the purpose of scoring is to make a home furniture product recommendation, then a basket that already contains a dining room table is unlikely to warrant the purchase of another one. In such a case, you should select this option. On the other hand, if products are perishable or disposable (such as cheese, baby formula, or tissue), then rules where the consequent is already present in the basket may be of value. In the latter case, the most useful option might be Do not check basket for predictions below.

Check that predictions are in basket. Select this option to ensure that consequents are also present in the basket. This approach is useful when you are attempting to gain insight into existing customers or transactions. For example, you may want to identify rules with the highest lift and then explore which customers fit these rules.

Do not check basket for predictions. Select to include all rules when scoring, regardless of the presence or absence of consequents in the basket.

Association Rule Model Nugget Summary

The Summary tab of an association rule model nugget displays the number of rules discovered and the minimum and maximum for support, lift, confidence, and deployability of rules in the rule set.

Figure 12-16 Association Rule model nugget Summary tab

Generating a Rule Set from an Association Model Nugget

Figure 12-17 Generate Rule Set dialog box

Association model nuggets, such as Apriori and CARMA, can be used to score data directly, or you can first generate a subset of rules, known as a rule set. Rule sets are particularly useful when you are working with an unrefined model, which cannot be used directly for scoring. For more information, see the topic Unrefined Models in Chapter 3 on p. 69. To generate a rule set, choose Rule set from the Generate menu in the model nugget browser.
You can specify the following options for translating the rules into a rule set:

Rule set name. Allows you to specify the name of the new generated Rule Set node.

Create node on. Controls the location of the new generated Rule Set node. Select Canvas, GM Palette, or Both.

Target field. Determines which output field will be used for the generated Rule Set node. Select a single output field from the list.

Minimum support. Specify the minimum support for rules to be preserved in the generated rule set. Rules with support less than the specified value will not be included in the new rule set.

Minimum confidence. Specify the minimum confidence for rules to be preserved in the generated rule set. Rules with confidence less than the specified value will not be included in the new rule set.

Default value. Allows you to specify a default value for the target field that is assigned to scored records for which no rule fires.

Generating a Filtered Model

Figure 12-18 Generate New Model dialog box

To generate a filtered model from an association model nugget, such as an Apriori, CARMA, or Sequence Rule Set node, choose Filtered Model from the Generate menu in the model nugget browser. This creates a subset model that includes only those rules currently displayed in the browser. Note: You cannot generate filtered models for unrefined models.

You can specify the following options for filtering rules:

Name for New Model. Allows you to specify the name of the new Filtered Model node.

Create node on. Controls the location of the new Filtered Model node. Select Canvas, GM Palette, or Both.

Rule numbering. Specify how rule IDs will be numbered in the subset of rules included in the filtered model.

Retain original rule ID numbers. Select to maintain the original numbering of rules. By default, rules are given an ID that corresponds with their order of discovery by the algorithm. That order may vary depending on the algorithm employed.
Renumber rules consecutively starting with. Select to assign new rule IDs for the filtered rules. New IDs are assigned based on the sort order displayed in the rule browser table on the Model tab, beginning with the number you specify here. You can specify the start number for IDs using the arrows to the right.

Scoring Association Rules

Scores produced by running new data through an association rule model nugget are returned in separate fields. Three new fields are added for each prediction, with P representing the prediction, C representing confidence, and I representing the rule ID. The organization of these output fields depends on whether the input data are in transactional or tabular format. See Tabular versus Transactional Data on p. 378 for an overview of these formats.

For example, suppose you are scoring basket data using a model that generates predictions based on the following three rules:

Rule_15  bread&wine -> meat      (confidence 54%)
Rule_22  cheese -> fruit         (confidence 43%)
Rule_5   bread&cheese -> frozveg (confidence 24%)

Tabular data. For tabular data, the three predictions (3 is the default) are returned in a single record.

Table 12-1 Scores in tabular format

ID    Bread  Wine  Cheese  P1    C1    I1  P2     C2    I2  P3       C3    I3
Fred  1      1     1       meat  0.54  15  fruit  0.43  22  frozveg  0.24  5

Transactional data. For transactional data, a separate record is generated for each prediction. Predictions are still added in separate columns, but scores are returned as they are calculated. This results in records with incomplete predictions, as shown in the sample output below. The second and third predictions (P2 and P3) are blank in the first record, along with the associated confidences and rule IDs. As scores are returned, however, the final record contains all three predictions.
Table 12-2 Scores in transactional format

ID    Item    P1    C1    I1  P2      C2      I2      P3       C3      I3
Fred  bread   meat  0.54  15  $null$  $null$  $null$  $null$   $null$  $null$
Fred  cheese  meat  0.54  15  fruit   0.43    22      $null$   $null$  $null$
Fred  wine    meat  0.54  15  fruit   0.43    22      frozveg  0.24    5

To include only complete predictions for reporting or deployment purposes, use a Select node to select complete records.

Note: The field names used in these examples are abbreviated for clarity. During actual use, results fields for association models are named as follows:

New field                        Example field name
Prediction                       $A-TRANSACTION_NUMBER-1
Confidence (or other criterion)  $AC-TRANSACTION_NUMBER-1
Rule ID                          $A-Rule_ID-1

Rules with Multiple Consequents

The CARMA algorithm allows rules with multiple consequents, for example:

bread -> wine&cheese

When you are scoring such “two-headed” rules, predictions are returned in the format displayed in the following table:

Table 12-3 Scoring results including a prediction with multiple consequents

ID    Bread  Wine  Cheese  P1        C1    I1  P2     C2    I2  P3       C3    I3
Fred  1      1     1       meat&veg  0.54  16  fruit  0.43  22  frozveg  0.24  5

In some cases, you may need to split such scores before deployment. To split a prediction with multiple consequents, you will need to parse the field using the CLEM string functions.

Deploying Association Models

When scoring association models, predictions and confidences are output in separate columns (where P represents the prediction, C represents confidence, and I represents the rule ID). This is the case whether the input data are tabular or transactional. For more information, see the topic Scoring Association Rules on p. 400.

Figure 12-19 Tabular scores with predictions in columns

When preparing scores for deployment, you might find that your application requires you to transpose your output data to a format with predictions in rows rather than columns (one prediction per row, sometimes known as “till-roll” format).
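The “till-roll” reshaping just mentioned can also be sketched outside Modeler. The following is hypothetical Python over the abbreviated field names used above (P1/C1/I1, and the Pred/Crit/Rule_ID names used when renaming for the transpose); it is an illustration, not the Modeler stream itself.

```python
# Hypothetical Python reshaping of tabular scores into till-roll format
# (one prediction per row); field names follow the abbreviated examples
# above and are assumptions for this sketch.

def to_till_roll(record, n_predictions=3):
    rows = []
    for n in range(1, n_predictions + 1):
        pred = record.get(f"P{n}")
        if pred is None:
            continue  # incomplete prediction slot: nothing to emit
        rows.append({"ID": record["ID"], "Pred": pred,
                     "Crit": record[f"C{n}"], "Rule_ID": record[f"I{n}"]})
    rows.sort(key=lambda r: -r["Crit"])  # strongest criterion first
    return rows

score = {"ID": "Fred", "Bread": 1, "Wine": 1, "Cheese": 1,
         "P1": "meat", "C1": 0.54, "I1": 15,
         "P2": "fruit", "C2": 0.43, "I2": 22,
         "P3": "frozveg", "C3": 0.24, "I3": 5}
print([row["Pred"] for row in to_till_roll(score)])
# -> ['meat', 'fruit', 'frozveg']
```

Each input record yields up to three output rows, sorted by the criterion field, which is the shape the deployment examples below assume.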
Figure 12-20 Transposed scores with predictions in rows

Transposing Tabular Scores

You can transpose tabular scores from columns to rows using a combination of steps in IBM® SPSS® Modeler, as described in the steps that follow.

Figure 12-21 Example stream used to transpose tabular data into till-roll format

E Use the @INDEX function in a Derive node to ascertain the current order of predictions and save this indicator in a new field, such as Original_order.
E Add a Type node to ensure that all fields are instantiated.
E Use a Filter node to rename the default prediction, confidence, and ID fields (P1, C1, I1) to common fields, such as Pred, Crit, and Rule_ID, which will be used to append records later on. You will need one Filter node for each prediction generated.

Figure 12-22 Filtering fields for predictions 1 and 3 while renaming fields for prediction 2.

E Use an Append node to append values for the shared Pred, Crit, and Rule_ID fields.
E Attach a Sort node to sort records in ascending order for the field Original_order and in descending order for Crit, which is the field used to sort predictions by criteria such as confidence, lift, and support.
E Use another Filter node to filter the field Original_order from the output.

At this point, the data are ready for deployment.

Transposing Transactional Scores

The process is similar for transposing transactional scores. For example, the stream shown below transposes scores to a format with a single prediction in each row as needed for deployment.

Figure 12-23 Example stream used to transpose transactional data into till-roll format

With the addition of two Select nodes, the process is identical to that explained earlier for tabular data. The first Select node is used to compare rule IDs across adjacent records and include only unique or undefined records. This Select node uses the following CLEM expression to select records: ID /= @OFFSET(ID,-1) or @OFFSET(ID,-1) = undef.
The second Select node is used to discard extraneous rules, or rules where Rule_ID has a null value. This Select node uses the following CLEM expression to discard records: not(@NULL(Rule_ID)). For more information on transposing scores for deployment, contact Technical Support. Sequence Node The Sequence node discovers patterns in sequential or time-oriented data, in the format bread -> cheese. The elements of a sequence are item sets that constitute a single transaction. For example, if a person goes to the store and purchases bread and milk and then a few days later returns to the store and purchases some cheese, that person’s buying activity can be represented as two item sets. The first item set contains bread and milk, and the second one contains cheese. A sequence is a list of item sets that tend to occur in a predictable order. The Sequence node detects frequent sequences and creates a generated model node that can be used to make predictions. Requirements. To create a Sequence rule set, you need to specify an ID field, an optional time field, and one or more content fields. Note that these settings must be made on the Fields tab of the modeling node; they cannot be read from an upstream Type node. The ID field can have any role or measurement level. If you specify a time field, it can have any role but its storage must be numeric, date, time, or timestamp. If you do not specify a time field, the Sequence node will use an implied timestamp, in effect using row numbers as time values. Content fields can have any measurement level and role, but all content fields must be of the same type. If they are numeric, they must be integer ranges (not real ranges). Strengths. The Sequence node is based on the CARMA association rules algorithm, which uses an efficient two-pass method for finding sequences. In addition, the generated model node created by a Sequence node can be inserted into a data stream to create predictions. 
The generated model node can also generate SuperNodes for detecting and counting specific sequences and for making predictions based on specific sequences.

Sequence Node Fields Options

Figure 12-24 Sequence node fields options

Before executing a Sequence node, you must specify ID and content fields on the Fields tab of the Sequence node. If you want to use a time field, you also need to specify that here.

ID field. Select an ID field from the list. Numeric or symbolic fields can be used as the ID field. Each unique value of this field should indicate a specific unit of analysis. For example, in a market basket application, each ID might represent a single customer. For a Web log analysis application, each ID might represent a computer (by IP address) or a user (by login data).

IDs are contiguous. If your data are presorted so that all records with the same ID are grouped together in the data stream, select this option to speed up processing. If your data are not presorted (or you are not sure), leave this option unselected, and the Sequence node will sort the data automatically. Note: If your data are not sorted and you select this option, you may get invalid results in your Sequence model.

Time field. If you want to use a field in the data to indicate event times, select Use time field and specify the field to be used. The time field must be numeric, date, time, or timestamp. If no time field is specified, records are assumed to arrive from the data source in sequential order, and record numbers are used as time values (the first record occurs at time "1"; the second, at time "2"; and so on).

Content fields. Specify the content field(s) for the model. These fields contain the events of interest in sequence modeling.

The Sequence node can handle data in either tabular or transactional format.
If you use multiple fields with transactional data, the items specified in these fields for a particular record are assumed to represent items found in a single transaction with a single timestamp. For more information, see the topic Tabular versus Transactional Data on p. 378.

Partition. This field allows you to specify a field used to partition the data into separate samples for the training, testing, and validation stages of model building. By using one sample to generate the model and a different sample to test it, you can get a good indication of how well the model will generalize to larger datasets that are similar to the current data. If multiple partition fields have been defined by using Type or Partition nodes, a single partition field must be selected on the Fields tab in each modeling node that uses partitioning. (If only one partition is present, it is automatically used whenever partitioning is enabled.) Also note that to apply the selected partition in your analysis, partitioning must also be enabled in the Model Options tab for the node. (Deselecting this option makes it possible to disable partitioning without changing field settings.)

Sequence Node Model Options

Figure 12-25 Sequence node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.

Minimum rule support (%). You can specify a support criterion. Rule support refers to the proportion of IDs in the training data that contain the entire sequence. If you want to focus on more common sequences, increase this setting.

Minimum rule confidence (%). You can specify a confidence criterion for keeping sequences in the sequence set.
Confidence refers to the percentage of the IDs where a correct prediction is made, out of all the IDs for which the rule makes a prediction. It is calculated as the number of IDs for which the entire sequence is found divided by the number of IDs for which the antecedents are found, based on the training data. Sequences with lower confidence than the specified criterion are discarded. If you are getting too many sequences or uninteresting sequences, try increasing this setting. If you are getting too few sequences, try decreasing this setting.

Maximum sequence size. You can set the maximum number of distinct item sets (as opposed to items) in a sequence. If the sequences of interest are relatively short, you can decrease this setting to speed up building the sequence set.

Predictions to add to stream. Specify the number of predictions to be added to the stream by the resulting generated Model node. For more information, see the topic Sequence Model Nuggets on p. 409.

Sequence Node Expert Options

For those with detailed knowledge of the Sequence node’s operation, the following expert options allow you to fine-tune the model-building process. To access expert options, set the Mode to Expert on the Expert tab.

Figure 12-26 Sequence node expert options

Set maximum duration. If this option is selected, sequences will be limited to those with a duration (the time between the first and last item set) less than or equal to the value specified. If you haven’t specified a time field, the duration is expressed in terms of rows (records) in the raw data. If the time field used is a time, date, or timestamp field, the duration is expressed in seconds. For numeric fields, the duration is expressed in the same units as the field itself.

Set pruning value. The CARMA algorithm used in the Sequence node periodically removes (prunes) infrequent item sets from its list of potential item sets during processing to conserve memory.
Select this option to adjust the frequency of pruning; the number specified determines how often pruning occurs. Enter a smaller value to decrease the memory requirements of the algorithm (but potentially increase the training time required), or enter a larger value to speed up training (but potentially increase memory requirements).

Set maximum sequences in memory. If this option is selected, the CARMA algorithm will limit its memory store of candidate sequences during model building to the number of sequences specified. Select this option if IBM® SPSS® Modeler is using too much memory during the building of Sequence models. Note that the maximum sequences value you specify here is the number of candidate sequences tracked internally as the model is built. This number should be much larger than the number of sequences you expect in the final model.

Constrain gaps between item sets. This option allows you to specify constraints on the time gaps that separate item sets. If selected, item sets with time gaps smaller than the Minimum gap or larger than the Maximum gap that you specify will not be considered to form part of a sequence. Use this option to avoid counting sequences that include long time intervals or those that take place in a very short time span.

Note: If the time field used is a time, date, or timestamp field, the time gap is expressed in seconds. For numeric fields, the time gap is expressed in the same units as the time field.

For example, consider this list of transactions:

ID    Time  Content
1001  1     apples
1001  2     bread
1001  5     cheese
1001  6     dressing

If you build a model on these data with the minimum gap set to 2, you would get the following sequences:

apples -> cheese
apples -> dressing
bread -> cheese
bread -> dressing

You would not see sequences such as apples -> bread because the gap between apples and bread is smaller than the minimum gap.
Similarly, if the data were instead:

ID    Time  Content
1001  1     apples
1001  2     bread
1001  5     cheese
1001  20    dressing

and the maximum gap were set to 10, you would not see any sequences with dressing, because the gap between cheese and dressing is too large for them to be considered part of the same sequence.

Sequence Model Nuggets

Sequence model nuggets represent the sequences found for a particular output field discovered by the Sequence node and can be added to streams to generate predictions. When you run a stream containing a Sequence node, the node adds a pair of fields containing predictions and associated confidence values for each prediction from the sequence model to the data. By default, three pairs of fields containing the top three predictions (and their associated confidence values) are added. You can change the number of predictions generated when you build the model by setting the Sequence node model options at build time, as well as on the Settings tab after adding the model nugget to a stream. For more information, see the topic Sequence Model Nugget Settings on p. 413.

The new field names are derived from the model name. The field names are $S-sequence-n for the prediction field (where n indicates the nth prediction) and $SC-sequence-n for the confidence field. In a stream with multiple Sequence Rules nodes in a series, the new field names will include numbers in the prefix to distinguish them from each other. The first Sequence Set node in the stream will use the usual names, the second node will use names starting with $S1- and $SC1-, the third node will use names starting with $S2- and $SC2-, and so on. Predictions are displayed in order by confidence, so that $S-sequence-1 contains the prediction with the highest confidence, $S-sequence-2 contains the prediction with the next highest confidence, and so on.
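The minimum and maximum gap constraints shown in the two transaction examples above can be sketched as a simple filter over timestamped item pairs. This illustrative Python is not the CARMA implementation; the `gapped_pairs` helper and the (time, item) layout are assumptions.

```python
# Illustrative filter showing the Minimum gap / Maximum gap constraints
# from the examples above; helper name and data layout are assumptions,
# not the CARMA implementation.

def gapped_pairs(transactions, min_gap=0, max_gap=None):
    """Return (earlier, later) item pairs whose time gap is allowed."""
    pairs = []
    for i, (t1, item1) in enumerate(transactions):
        for t2, item2 in transactions[i + 1:]:
            gap = t2 - t1
            if gap < min_gap:
                continue  # items too close together
            if max_gap is not None and gap > max_gap:
                continue  # items too far apart
            pairs.append((item1, item2))
    return pairs

data = [(1, "apples"), (2, "bread"), (5, "cheese"), (6, "dressing")]
print(gapped_pairs(data, min_gap=2))
# -> [('apples', 'cheese'), ('apples', 'dressing'),
#     ('bread', 'cheese'), ('bread', 'dressing')]
```

With the second data set (dressing at time 20) and max_gap=10, no pair involving dressing survives, matching the example above.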
For records where the number of available predictions is smaller than the number of predictions requested, remaining predictions contain the value $null$. For example, if only two predictions can be made for a particular record, the values of $S-sequence-3 and $SC-sequence-3 will be $null$. For each record, the rules in the model are compared to the set of transactions processed for the current ID so far, including the current record and any previous records with the same ID and earlier timestamp. The k rules with the highest confidence values that apply to this set of transactions are used to generate the k predictions for the record, where k is the number of predictions specified on the Settings tab after adding the model to the stream. (If multiple rules predict the same outcome for the transaction set, only the rule with the highest confidence is used.) For more information, see the topic Sequence Model Nugget Settings on p. 413. As with other types of association rule models, the data format must match the format used in building the sequence model. For example, models built using tabular data can be used to score only tabular data. For more information, see the topic Scoring Association Rules on p. 400. Note: When scoring data using a generated Sequence Set node in a stream, any tolerance or gap settings that you selected in building the model are ignored for scoring purposes. Predictions from Sequence Rules The node handles the records in a time-dependent manner (or order-dependent, if no timestamp field was used to build the model). Records should be sorted by the ID field and timestamp field (if present). However, predictions are not tied to the timestamp of the record to which they are added. They simply refer to the most likely items to occur at some point in the future, given the history of transactions for the current ID up to the current record. Note that the predictions for each record do not necessarily depend on that record’s transactions. 
If the current record’s transactions do not trigger a specific rule, rules will be selected based on the previous transactions for the current ID. In other words, if the current record doesn’t add any useful predictive information to the sequence, the prediction from the last useful transaction for this ID is carried forward to the current record. For example, suppose you have a Sequence model with the single rule

Jam -> Bread (0.66)

and you pass it the following records:

ID   Purchase  Prediction
001  jam       bread
001  milk      bread

Notice that the first record generates a prediction of bread, as you would expect. The second record also contains a prediction of bread, because there’s no rule for jam followed by milk; therefore, the milk transaction doesn’t add any useful information, and the rule Jam -> Bread still applies.

Generating New Nodes

The Generate menu allows you to create new SuperNodes based on the sequence model.

Rule SuperNode. Creates a SuperNode that can detect and count occurrences of sequences in scored data. This option is disabled if no rule is selected. For more information, see the topic Generating a Rule SuperNode from a Sequence Model Nugget on p. 415.

Model to Palette. Returns the model to the Models palette. This is useful in situations where a colleague may have sent you a stream containing the model and not the model itself.

Sequence Model Nugget Details

The Model tab for a Sequence model nugget displays the rules extracted by the algorithm. Each row in the table represents a rule, with the antecedent (the “if” part of the rule) in the first column followed by the consequent (the “then” part of the rule) in the second column.
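The carry-forward behavior in the Jam -> Bread example above can be sketched as follows. The one-rule "model", the record layout, and the `score_stream` helper are assumptions for illustration, not Modeler's scorer.

```python
# Sketch of the carry-forward behaviour described above: a prediction
# persists for an ID as long as the rule's antecedents remain in that
# ID's transaction history. The one-rule "model" is an assumption.

RULES = {("jam",): ("bread", 0.66)}  # Jam -> Bread (0.66)

def score_stream(records):
    history, last_prediction, scored = {}, {}, []
    for record_id, item in records:
        history.setdefault(record_id, set()).add(item)
        for antecedents, (consequent, _conf) in RULES.items():
            # a rule fires if its antecedents appear in the ID's history
            if set(antecedents) <= history[record_id]:
                last_prediction[record_id] = consequent
        scored.append((record_id, item, last_prediction.get(record_id)))
    return scored

print(score_stream([("001", "jam"), ("001", "milk")]))
# -> [('001', 'jam', 'bread'), ('001', 'milk', 'bread')]
```

The milk record adds no predictive information, so the bread prediction from the jam transaction is carried forward, as in the table above.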
Figure 12-27 Sequence nugget Model tab

Each rule is shown in the following format:

Antecedent          Consequent
beer and cannedveg  beer
fish fish           fish

The first example rule is interpreted as for IDs that had “beer” and “cannedveg” in the same transaction, there is likely a subsequent occurrence of “beer.” The second example rule can be interpreted as for IDs that had “fish” in one transaction and then “fish” in another, there is a likely subsequent occurrence of “fish.” Note that in the first rule, beer and cannedveg are purchased at the same time; in the second rule, fish is purchased in two separate transactions.

Sort menu. The Sort menu button on the toolbar controls the sorting of rules. Direction of sorting (ascending or descending) can be changed using the sort direction button (up or down arrow).

Figure 12-28 Toolbar options for sorting

You can sort rules by:

Support %
Confidence %
Rule Support %
Consequent
First Antecedent
Last Antecedent
Number of Items (antecedents)

For example, the following table is sorted in descending order by number of items. Rules with multiple items in the antecedent set precede those with fewer items.

Antecedent                         Consequent
beer and cannedveg and frozenmeal  frozenmeal
beer and cannedveg                 beer
fish fish                          fish
softdrink                          softdrink

Show/hide criteria menu. The Show/hide criteria menu button (grid icon) controls options for the display of rules. The following display options are available:

Instances displays information about the number of unique IDs for which the full sequence (both antecedents and consequent) occurs. (Note this differs from Association models, for which the number of instances refers to the number of IDs for which only the antecedents apply.) For example, given the rule bread -> cheese, the number of IDs in the training data that include both bread and cheese are referred to as instances.

Support displays the proportion of IDs in the training data for which the antecedents are true.
For example, if 50% of the training data includes the antecedent bread then the support for the bread -> cheese rule would be 50%. (Unlike Association models, support is not based on the number of instances, as noted earlier.) Confidence displays the percentage of the IDs where a correct prediction is made, out of all the IDs for which the rule makes a prediction. It is calculated as the number of IDs for which the entire sequence is found divided by the number of IDs for which the antecedents are found, based on the training data. For example, if 50% of the training data contains cannedveg (indicating antecedent support) but only 20% contains both cannedveg and frozenmeal, then confidence for the rule cannedveg -> frozenmeal would be Rule Support / Antecedent Support or, in this case, 40%. Rule Support for Sequence models is based on instances and displays the proportion of training records for which the entire rule, antecedents, and consequent(s), are true. For example, if 20% of the training data contains both bread and cheese, then rule support for the rule bread -> cheese is 20%. Note that the proportions are based on valid transactions (transactions with at least one observed item or true value) rather than total transactions. Invalid transactions—those with no items or true values—are discarded for these calculations. Filter button. The Filter button (funnel icon) on the menu expands the bottom of the dialog box to show a panel where active rule filters are displayed. Filters are used to narrow the number of rules displayed on the Models tab. Figure 12-29 Filter button To create a filter, click the Filter icon to the right of the expanded panel. This opens a separate dialog box in which you can specify constraints for displaying rules. Note that the Filter button is often used in conjunction with the Generate menu to first filter rules and then generate a model containing that subset of rules. For more information, see Specifying Filters for Rules below. 
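The Support, Rule Support, and Confidence figures defined above can be reproduced with a short sketch. This is illustrative Python over an assumed per-ID ordered item list; it ignores the valid-transaction adjustment noted above and is not Modeler's internal calculation.

```python
# Illustrative computation of Support, Rule Support and Confidence as
# defined above, over an assumed per-ID ordered item list; this is not
# Modeler's code and skips the valid-transaction adjustment.

def sequence_stats(id_items, antecedent, consequent):
    n_ids = len(id_items)
    antecedent_ids = sum(1 for items in id_items.values()
                         if antecedent in items)
    rule_ids = sum(1 for items in id_items.values()
                   if antecedent in items
                   and consequent in items[items.index(antecedent) + 1:])
    support = antecedent_ids / n_ids            # antecedent support
    rule_support = rule_ids / n_ids             # entire sequence holds
    confidence = rule_ids / antecedent_ids if antecedent_ids else 0.0
    return support, rule_support, confidence

train = {"id1": ["cannedveg", "frozenmeal"],
         "id2": ["cannedveg", "beer"],
         "id3": ["fish"],
         "id4": ["bread"]}
# cannedveg appears for 2 of 4 IDs (support 0.5); the full sequence
# cannedveg -> frozenmeal holds for 1 of 4 (rule support 0.25); so
# confidence = rule support / antecedent support = 0.5.
print(sequence_stats(train, "cannedveg", "frozenmeal"))
# -> (0.5, 0.25, 0.5)
```

Note how confidence is rule support divided by antecedent support, exactly the relationship described for the cannedveg -> frozenmeal example above.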
Sequence Model Nugget Settings

The Settings tab for a Sequence model nugget displays scoring options for the model. This tab is available only after the model has been added to the stream canvas for scoring.

Figure 12-30 Sequence nugget Settings tab

Maximum number of predictions. Specify the maximum number of predictions included for each set of basket items. The rules with the highest confidence values that apply to this set of transactions are used to generate predictions for the record up to the specified limit.

Sequence Model Nugget Summary

The Summary tab for a sequence rule model nugget displays the number of rules discovered and the minimum and maximum for support and confidence in the rules. If you have executed an Analysis node attached to this modeling node, information from that analysis will also be displayed in this section.

Figure 12-31 Sequence nugget Summary tab

For more information, see the topic Browsing Model Nuggets in Chapter 3 on p. 49.

Generating a Rule SuperNode from a Sequence Model Nugget

Figure 12-32 Generate Rule SuperNode dialog box

To generate a rule SuperNode based on a sequence rule:

E On the Model tab for the sequence rule model nugget, click on a row in the table to select the desired rule.
E From the rule browser menus choose: Generate > Rule SuperNode

Important: To use the generated SuperNode, you must sort the data by ID field (and Time field, if any) before passing them into the SuperNode. The SuperNode will not detect sequences properly in unsorted data.

You can specify the following options for generating a rule SuperNode:

Detect. Specifies how matches are defined for data passed into the SuperNode.

Antecedents only. The SuperNode will identify a match any time it finds the antecedents for the selected rule in the correct order within a set of records having the same ID, regardless of whether the consequent is also found.
Note that this does not take into account timestamp tolerance or item gap constraint settings from the original Sequence modeling node. When the last antecedent item set is detected in the stream (and all other antecedents have been found in the proper order), all subsequent records with the current ID will contain the summary selected below. Entire sequence. The SuperNode will identify a match any time it finds the antecedents and the consequent for the selected rule in the correct order within a set of records having the same ID. This does not take into account timestamp tolerance or item gap constraint settings from the original Sequence modeling node. When the consequent is detected in the stream (and all antecedents have also been found in the correct order), the current record and all subsequent records with the current ID will contain the summary selected below. Display. Controls how match summaries are added to the data in the Rule SuperNode output. Consequent value for first occurrence. The value added to the data is the consequent value predicted based on the first occurrence of the match. Values are added as a new field named rule_n_consequent, where n is the rule number (based on the order of creation of Rule SuperNodes in the stream). True value for first occurrence. The value added to the data is true if there is at least one match for the ID and false if there is no match. Values are added as a new field named rule_n_flag. Count of occurrences. The value added to the data is the number of matches for the ID. Values are added as a new field named rule_n_count. Rule number. The value added is the rule number for the selected rule. Rule numbers are assigned based on the order in which the SuperNode was added to the stream. For example, the first Rule SuperNode is considered rule 1, the second Rule SuperNode is considered rule 2, etc. This option is most useful when you will be including multiple Rule SuperNodes in your stream. 
Values are added as a new field named rule_n_number.

Include confidence figures. If selected, this option will add the rule confidence to the data stream as well as the selected summary. Values are added as a new field named rule_n_confidence.

Chapter 13
Time Series Models

Why Forecast?

To forecast means to predict the values of one or more series over time. For example, you may want to predict the expected demand for a line of products or services in order to allocate resources for manufacturing or distribution. Because planning decisions take time to implement, forecasts are an essential tool in many planning processes.

Methods of modeling time series assume that history repeats itself, if not exactly, then closely enough that by studying the past, you can make better decisions in the future. To predict sales for next year, for example, you would probably start by looking at this year’s sales and work backward to figure out what trends or patterns, if any, have developed in recent years. But patterns can be difficult to gauge. If your sales increase several weeks in a row, for example, is this part of a seasonal cycle or the beginning of a long-term trend?

Using statistical modeling techniques, you can analyze the patterns in your past data and project those patterns to determine a range within which future values of the series are likely to fall. The result is more accurate forecasts on which to base your decisions.

Time Series Data

A time series is an ordered collection of measurements taken at regular intervals, for example, daily stock prices or weekly sales data. The measurements may be of anything that interests you, and each series can generally be classified as one of the following:

Dependent. A series that you want to forecast.

Predictor. A series that may help to explain the target, for example, using an advertising budget to predict sales. Predictors can only be used with ARIMA models.

Event.
A special predictor series used to account for predictable recurring incidents, for example, sales promotions.

Intervention. A special predictor series used to account for one-time past incidents, for example, a power outage or employee strike.

The intervals can represent any unit of time, but the interval must be the same for all measurements. Moreover, any interval for which there is no measurement must be set to the missing value. Thus, the number of intervals with measurements (including those with missing values) defines the length of time of the historical span of the data.

Characteristics of Time Series

Studying the past behavior of a series will help you identify patterns and make better forecasts. When plotted, many time series exhibit one or more of the following features:

Trends
Seasonal and nonseasonal cycles
Pulses and steps
Outliers

Trends

A trend is a gradual upward or downward shift in the level of the series or the tendency of the series values to increase or decrease over time.

Figure 13-1 Trend

Trends are either local or global, but a single series can exhibit both types. Historically, series plots of the stock market index show an upward global trend. Local downward trends have appeared in times of recession, and local upward trends have appeared in times of prosperity. Trends can also be either linear or nonlinear. Linear trends are positive or negative additive increments to the level of the series, comparable to the effect of simple interest on principal. Nonlinear trends are often multiplicative, with increments that are proportional to the previous series value(s). Global linear trends are fit and forecast well by both exponential smoothing and ARIMA models. In building ARIMA models, series showing trends are generally differenced to remove the effect of the trend.

Seasonal Cycles

A seasonal cycle is a repetitive, predictable pattern in the series values.
Figure 13-2 Seasonal cycle

Seasonal cycles are tied to the interval of your series. For instance, monthly data typically cycles over quarters and years. A monthly series might show a significant quarterly cycle with a low in the first quarter or a yearly cycle with a peak every December. Series that show a seasonal cycle are said to exhibit seasonality. Seasonal patterns are useful in obtaining good fits and forecasts, and there are exponential smoothing and ARIMA models that capture seasonality.

Nonseasonal Cycles

A nonseasonal cycle is a repetitive, possibly unpredictable, pattern in the series values.

Figure 13-3 Nonseasonal cycle

Some series, such as unemployment rate, clearly display cyclical behavior; however, the periodicity of the cycle varies over time, making it difficult to predict when a high or low will occur. Other series may have predictable cycles but do not neatly fit into the Gregorian calendar or have cycles longer than a year. For example, the tides follow the lunar calendar, international travel and trade related to the Olympics swell every four years, and there are many religious holidays whose Gregorian dates change from year to year. Nonseasonal cyclical patterns are difficult to model and generally increase uncertainty in forecasting. The stock market, for example, provides numerous instances of series that have defied the efforts of forecasters. All the same, nonseasonal patterns must be accounted for when they exist. In many cases, you can still identify a model that fits the historical data reasonably well, which gives you the best chance to minimize uncertainty in forecasting.

Pulses and Steps

Many series experience abrupt changes in level. They generally come in two types:

A sudden, temporary shift, or pulse, in the series level
A sudden, permanent shift, or step, in the series level

Figure 13-4 Series with a pulse

When steps or pulses are observed, it is important to find a plausible explanation.
Time series models are designed to account for gradual, not sudden, change. As a result, they tend to underestimate pulses and be ruined by steps, which leads to poor model fits and uncertain forecasts. (Some instances of seasonality may appear to exhibit sudden changes in level, but the level is constant from one seasonal period to the next.) If a disturbance can be explained, it can be modeled using an intervention or event. For example, during August 1973, an oil embargo imposed by the Organization of Petroleum Exporting Countries (OPEC) caused a drastic change in the inflation rate, which then returned to normal levels in the ensuing months. By specifying a point intervention for the month of the embargo, you can improve the fit of your model, thus indirectly improving your forecasts. Similarly, a retail store might find that sales were much higher than usual on the day all items were marked 50% off. By specifying the 50%-off promotion as a recurring event, you can improve the fit of your model and estimate the effect of repeating the promotion on future dates.

Outliers

Shifts in the level of a time series that cannot be explained are referred to as outliers. These observations are inconsistent with the remainder of the series and can dramatically influence the analysis and, consequently, affect the forecasting ability of the time series model. The following figure displays several types of outliers commonly occurring in time series. The blue lines represent a series without outliers. The red lines suggest a pattern that might be present if the series contained outliers. These outliers are all classified as deterministic because they affect only the mean level of the series.
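The distinction between a temporary disturbance and a permanent one can be made concrete with a minimal pure-Python illustration (our own sketch, not SPSS Modeler code): an additive outlier disturbs a single observation, while a level shift moves every subsequent observation.

```python
# A flat series with level 10; we perturb it two different ways.
base = [10.0] * 12

additive = base[:]
additive[5] += 4.0            # one surprising value; later points unaffected

level_shift = base[:]
for t in range(5, len(level_shift)):
    level_shift[t] += 4.0     # permanent move to a new level

print(additive[6])     # 10.0 (back to the original level)
print(level_shift[6])  # 14.0 (stays at the new level)
```

This is why the two cases must be handled differently: a model can treat the additive point as a one-off, but a step changes the level for the rest of the series.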
Figure 13-5 Outlier types (panels: Additive Outlier, Innovational Outlier, Level Shift Outlier, Transient Change Outlier, Seasonal Additive Outlier, Local Trend Outlier; each panel contrasts the series with and without the outlier)

Additive Outlier. An additive outlier appears as a surprisingly large or small value occurring for a single observation. Subsequent observations are unaffected by an additive outlier. Consecutive additive outliers are typically referred to as additive outlier patches.

Innovational Outlier. An innovational outlier is characterized by an initial impact with effects lingering over subsequent observations. The influence of the outlier may increase as time proceeds.

Level Shift Outlier. For a level shift, all observations appearing after the outlier move to a new level. In contrast to additive outliers, a level shift outlier affects many observations and has a permanent effect.

Transient Change Outlier. Transient change outliers are similar to level shift outliers, but the effect of the outlier diminishes exponentially over the subsequent observations. Eventually, the series returns to its normal level.

Seasonal Additive Outlier. A seasonal additive outlier appears as a surprisingly large or small value occurring repeatedly at regular intervals.

Local Trend Outlier. A local trend outlier yields a general drift in the series caused by a pattern in the outliers after the onset of the initial outlier.

Outlier detection in time series involves determining the location, type, and magnitude of any outliers present. Tsay (1988) proposed an iterative procedure for detecting mean level change to identify deterministic outliers.
This process involves comparing a time series model that assumes no outliers are present to another model that incorporates outliers. Differences between the models yield estimates of the effect of treating any given point as an outlier.

Autocorrelation and Partial Autocorrelation Functions

Autocorrelation and partial autocorrelation are measures of association between current and past series values and indicate which past series values are most useful in predicting future values. With this knowledge, you can determine the order of processes in an ARIMA model. More specifically:

Autocorrelation function (ACF). At lag k, this is the correlation between series values that are k intervals apart.
Partial autocorrelation function (PACF). At lag k, this is the correlation between series values that are k intervals apart, accounting for the values of the intervals between.

Figure 13-6 ACF plot for a series

The x axis of the ACF plot indicates the lag at which the autocorrelation is computed; the y axis indicates the value of the correlation (between −1 and 1). For example, a spike at lag 1 in an ACF plot indicates a strong correlation between each series value and the preceding value, a spike at lag 2 indicates a strong correlation between each value and the value occurring two points previously, and so on. A positive correlation indicates that large current values correspond with large values at the specified lag; a negative correlation indicates that large current values correspond with small values at the specified lag. The absolute value of a correlation is a measure of the strength of the association, with larger absolute values indicating stronger relationships.

Series Transformations

Transformations are often useful for stabilizing a series before estimating models. This is particularly important for ARIMA models, which require series to be stationary before models are estimated.
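The lag-k autocorrelation defined above is simply the Pearson correlation between the series and a copy of itself shifted by k intervals. A minimal pure-Python sketch (the helper names are ours, not an SPSS Modeler API) makes the idea concrete:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def acf(series, k):
    """Sample autocorrelation at lag k: correlate values k intervals apart."""
    return pearson(series[:-k], series[k:])

# A series repeating every 4 intervals shows a strong positive spike at
# lag 4 and a strong negative spike at lag 2 (values half a cycle apart).
s = [0.0, 1.0, 0.0, -1.0] * 6
print(round(acf(s, 4), 3))   # 1.0
print(round(acf(s, 2), 3))   # -1.0
```

This mirrors how a spike at a seasonal lag in the ACF plot reveals the periodicity of a series.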
A series is stationary if the global level (mean) and average deviation from the level (variance) are constant throughout the series. While most interesting series are not stationary, ARIMA is effective as long as the series can be made stationary by applying transformations, such as the natural log, differencing, or seasonal differencing.

Variance stabilizing transformations. Series in which the variance changes over time can often be stabilized using a natural log or square root transformation. These are also called functional transformations.

Natural log. The natural logarithm is applied to the series values.
Square root. The square root function is applied to the series values.

Natural log and square root transformations cannot be used for series with negative values.

Level stabilizing transformations. A slow decline of the values in the ACF indicates that each series value is strongly correlated with the previous value. By analyzing the change in the series values, you obtain a stable level.

Simple differencing. The differences between each value and the previous value in the series are computed; because the oldest value has no predecessor, the differenced series has one less value than the original series.
Seasonal differencing. Identical to simple differencing, except that the differences between each value and the previous seasonal value are computed.

When either simple or seasonal differencing is used together with either the log or square root transformation, the variance stabilizing transformation is always applied first. When simple and seasonal differencing are both in use, the resulting series values are the same whether simple differencing or seasonal differencing is applied first.

Predictor Series

Predictor series include related data that may help explain the behavior of the series to be forecast.
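The differencing rules above can be sketched in plain Python (the helper names are ours): each order of differencing shortens the series by one value, and any log or square root transform is applied before differencing.

```python
import math

def simple_diff(series):
    """First differences: each value minus the previous one."""
    return [b - a for a, b in zip(series, series[1:])]

def seasonal_diff(series, period):
    """Each value minus the value one seasonal period earlier."""
    return [series[t] - series[t - period] for t in range(period, len(series))]

sales = [100.0, 110.0, 121.0, 133.1, 146.41]  # roughly 10% growth per interval

# Variance-stabilizing transform first, then differencing: the multiplicative
# growth becomes a roughly constant (stationary) differenced series.
logged = [math.log(v) for v in sales]
stabilized = simple_diff(logged)

print(len(stabilized))                    # 4 (one less than the original)
print([round(d, 3) for d in stabilized])  # [0.095, 0.095, 0.095, 0.095]
```

Differencing the raw series instead would leave growing increments (10, 11, 12.1, ...), which is why the log is taken first for series like this one.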
For example, a Web- or catalog-based retailer might forecast sales based on the number of catalogs mailed, the number of phone lines open, or the number of hits to the company Web page. Any series can be used as a predictor provided that the series extends as far into the future as you want to forecast and has complete data with no missing values. Use care when adding predictors to a model. Adding large numbers of predictors will increase the time required to estimate models. While adding predictors may improve a model’s ability to fit the historical data, it doesn’t necessarily mean that the model does a better job of forecasting, so the added complexity may not be worth the trouble. Ideally, the goal should be to identify the simplest model that does a good job of forecasting. As a general rule, the number of predictors should be less than the sample size divided by 15 (at most, one predictor per 15 cases).

Predictors with missing data. Predictors with incomplete or missing data cannot be used in forecasting. This applies to both historical data and future values. In some cases, you can avoid this limitation by setting the model’s estimation span to exclude the oldest data when estimating models.

Time Series Modeling Node

The Time Series node estimates exponential smoothing, univariate Autoregressive Integrated Moving Average (ARIMA), and multivariate ARIMA (or transfer function) models for time series and produces forecasts based on the time series data. Exponential smoothing is a method of forecasting that uses weighted values of previous series observations to predict future values. As such, exponential smoothing is not based on a theoretical understanding of the data. It forecasts one point at a time, adjusting its forecasts as new data come in. The technique is useful for forecasting series that exhibit trend, seasonality, or both.
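The weighting idea behind exponential smoothing can be illustrated with the generic textbook recursion (a sketch of the simple, no-trend variant, not the exact algorithm SPSS Modeler implements): a running level blends each new observation with the previous level.

```python
def simple_exp_smoothing(series, alpha=0.5):
    """Return the smoothed level after processing the series.
    alpha near 1 tracks recent values; alpha near 0 averages over the past."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level  # serves as the one-step-ahead forecast

# On a flat series, the forecast is simply the level of the series.
print(simple_exp_smoothing([20.0] * 8))                   # 20.0
# With alpha = 1.0, the forecast is just the last observation.
print(simple_exp_smoothing([3.0, 7.0, 12.0], alpha=1.0))  # 12.0
```

Because older observations are repeatedly multiplied by (1 − alpha), their weights decay exponentially, which is where the method gets its name.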
You can choose from a variety of exponential smoothing models that differ in their treatment of trend and seasonality. ARIMA models provide more sophisticated methods for modeling trend and seasonal components than do exponential smoothing models, and, in particular, they allow the added benefit of including independent (predictor) variables in the model. This involves explicitly specifying autoregressive and moving average orders as well as the degree of differencing. You can include predictor variables and define transfer functions for any or all of them, as well as specify automatic detection of outliers or an explicit set of outliers. Note: In practical terms, ARIMA models are most useful if you want to include predictors that may help to explain the behavior of the series being forecast, such as the number of catalogs mailed or the number of hits to a company Web page. Exponential smoothing models describe the behavior of the time series without attempting to understand why it behaves as it does. For example, a series that historically has peaked every 12 months will probably continue to do so even if you don’t know why. Also available is an Expert Modeler, which automatically identifies and estimates the best-fitting ARIMA or exponential smoothing model for one or more target variables, thus eliminating the need to identify an appropriate model through trial and error. In all cases, the Expert Modeler picks the best model for each of the target variables specified. If in doubt, use the Expert Modeler. If predictor variables are specified, the Expert Modeler selects for inclusion in ARIMA models those variables that have a statistically significant relationship with the dependent series. Model variables are transformed where appropriate using differencing and/or a square root or natural log transformation. By default, the Expert Modeler considers all exponential smoothing models and all ARIMA models and picks the best model among them for each target field. 
You can, however, limit the Expert Modeler to pick only the best of the exponential smoothing models or only the best of the ARIMA models. You can also specify automatic detection of outliers.

Example. An analyst for a national broadband provider is required to produce forecasts of user subscriptions in order to predict utilization of bandwidth. Forecasts are needed for each of the local markets that make up the national subscriber base. You can use time series modeling to produce forecasts for the next three months for a number of local markets.

Requirements

The Time Series node is different from other IBM® SPSS® Modeler nodes in that you cannot simply insert it into a stream and run the stream. The Time Series node must always be preceded by a Time Intervals node that specifies such information as the time interval to use (years, quarters, months, etc.), the data to use for estimation, and how far into the future to extend a forecast, if used.

Figure 13-7 Always precede a Time Series node with a Time Intervals node

The time series data must be evenly spaced. Methods for modeling time series data require a uniform interval between each measurement, with any missing values indicated by empty rows. If your data do not already meet this requirement, the Time Intervals node can transform values as needed. Other points to note in connection with Time Series nodes are:

Fields must be numeric
Date fields cannot be used as inputs
Partitions are ignored

Field Options

Figure 13-8 Time Series node dialog box, Fields tab

The Fields tab is where you specify the fields to be used in building the model. Before you can build a model, you need to specify which fields you want to use as targets and as inputs. Typically the Time Series node uses field information from an upstream Type node. If you are using a Type node to select input and target fields, you don’t need to change anything on this tab.

Use type node settings.
This option tells the node to use field information from an upstream Type node. This is the default.

Use custom settings. This option tells the node to use field information specified here instead of that given in any upstream Type node(s). After selecting this option, specify the fields below. Note that fields stored as dates are not accepted as either target or input fields.

Targets. Select one or more target fields. This is similar to setting a field role to Target in a Type node. Target fields for a time series model must have a measurement level of Continuous. A separate model is created for each target field. A target field considers all specified Input fields except itself as possible inputs. Thus, the same field can be included in both lists; such a field will be used as a possible input to all models except the one where it is a target.

Inputs. Select the input field(s). This is similar to setting a field role to Input in a Type node. Input fields for a time series model must be numeric.

Time Series Model Options

Figure 13-9 Time Series node dialog box, Model tab

Model name. Specifies the name assigned to the model that is generated when the node is executed.

Auto. Generates the model name automatically based on the target or ID field names or the name of the model type in cases where no target is specified (such as clustering models).
Custom. Allows you to specify a custom name for the model nugget.

Continue estimation using existing model(s). If you have already generated a time series model, select this option to reuse the criteria settings specified for that model and generate a new model node in the Models palette, rather than building a new model from the beginning. In this way, you can save time by reestimating and producing a new forecast based on the same model settings as before but using more recent data.
Thus, for example, if the original model for a particular time series was Holt’s linear trend, the same type of model is used for reestimating and forecasting for that data; the system does not reattempt to find the best model type for the new data. Selecting this option disables the Method and Criteria controls. For more information, see the topic Reestimating and Forecasting on p. 437.

Method. You can choose Expert Modeler, Exponential Smoothing, or ARIMA. For more information, see the topic Time Series Modeling Node on p. 424. Select Criteria to specify options for the selected method.

Expert Modeler. Choose this option to use the Expert Modeler, which automatically finds the best-fitting model for each dependent series.
Exponential Smoothing. Use this option to specify a custom exponential smoothing model.
ARIMA. Use this option to specify a custom ARIMA model.

Time Interval Information

This section of the dialog box contains information about specifications for estimates and forecasts made on the Time Intervals node. Note that this section is not displayed if you choose the Continue estimation using existing model(s) option. The first line of the information indicates whether any records are excluded from the model or used as holdouts. The second line provides information about any forecast periods specified on the Time Intervals node. If the first line reads No time interval defined, this indicates that no Time Intervals node is connected. This situation will cause an error on attempting to run the stream; you must include a Time Intervals node upstream from the Time Series node.

Miscellaneous Information

Confidence limit width (%). Confidence intervals are computed for the model predictions and residual autocorrelations. You can specify any positive value less than 100. By default, a 95% confidence interval is used.

Maximum number of lags in ACF and PACF output.
You can set the maximum number of lags shown in tables and plots of autocorrelations and partial autocorrelations.

Build scoring model only. Check this box to reduce the amount of data that is stored in the model. Doing so can improve performance when building models with very large numbers of time series (tens of thousands). If you select this option, the Model, Parameters and Residuals tabs are not displayed in the Time Series model nugget, but you can still score the data in the usual way.

Time Series Expert Modeler Criteria

Figure 13-10 Expert Modeler Criteria dialog box, Model tab

Model Type. The following options are available:

All models. The Expert Modeler considers both ARIMA and exponential smoothing models.
Exponential smoothing models only. The Expert Modeler only considers exponential smoothing models.
ARIMA models only. The Expert Modeler only considers ARIMA models.

Expert Modeler considers seasonal models. This option is only enabled if a periodicity has been defined for the active dataset. When this option is selected, the Expert Modeler considers both seasonal and nonseasonal models. If this option is not selected, the Expert Modeler only considers nonseasonal models.

Events and Interventions. Enables you to designate certain input fields as event or intervention fields. Doing so identifies a field as containing time series data affected by events (predictable recurring situations, for example, sales promotions) or interventions (one-time incidents, for example, a power outage or employee strike). The Expert Modeler will consider only simple regression and not arbitrary transfer functions for inputs identified as event or intervention fields. Input fields must have a measurement level of Flag, Nominal, or Ordinal and must be numeric (for example, 1/0, not True/False, for a flag field) before they will be included in this list. For more information, see the topic Pulses and Steps on p. 420.
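The numeric-flag requirement above can be satisfied with a simple recode before the data reach the node. This pure-Python sketch (the field and function names are ours, not SPSS Modeler APIs) turns boolean promotion markers into the required 1/0 values:

```python
def to_numeric_flag(values):
    """Recode boolean or truthy markers to the 1/0 form required for
    event and intervention fields (True/False is not accepted)."""
    return [1 if bool(v) else 0 for v in values]

# Hypothetical monthly promotion marker alongside a sales series.
promo = [False, False, True, False, True, True]
print(to_numeric_flag(promo))  # [0, 0, 1, 0, 1, 1]
```

In a real stream, the same recode is usually done with a Derive or Filler node; the point is simply that the event field must arrive as numeric 1/0.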
Outliers

Figure 13-11 Expert Modeler Criteria dialog box, Outliers tab

Detect outliers automatically. By default, automatic detection of outliers is not performed. Select this option to perform automatic detection of outliers, then select the desired outlier types. For more information, see the topic Outliers on p. 420.

Time Series Exponential Smoothing Criteria

Figure 13-12 Exponential Smoothing Criteria dialog box

Model Type. Exponential smoothing models are classified as either seasonal or nonseasonal. Seasonal models are only available if the periodicity defined using the Time Intervals node is seasonal. The seasonal periodicities are: cyclic periods, years, quarters, months, days per week, hours per day, minutes per day, and seconds per day.

Simple. This model is appropriate for a series in which there is no trend or seasonality. Its only relevant smoothing parameter is level. Simple exponential smoothing is most similar to an ARIMA with zero orders of autoregression, one order of differencing, one order of moving average, and no constant.

Holt’s linear trend. This model is appropriate for a series in which there is a linear trend and no seasonality. Its relevant smoothing parameters are level and trend, and, in this model, they are not constrained by each other’s values. Holt’s model is more general than Brown’s model but may take longer to compute estimates for large series. Holt’s exponential smoothing is most similar to an ARIMA with zero orders of autoregression, two orders of differencing, and two orders of moving average.

Brown’s linear trend. This model is appropriate for a series in which there is a linear trend and no seasonality. Its relevant smoothing parameters are level and trend, but, in this model, they are assumed to be equal. Brown’s model is therefore a special case of Holt’s model.
Brown’s exponential smoothing is most similar to an ARIMA with zero orders of autoregression, two orders of differencing, and two orders of moving average, with the coefficient for the second order of moving average equal to one half of the coefficient for the first order squared.

Damped trend. This model is appropriate for a series in which there is a linear trend that is dying out and no seasonality. Its relevant smoothing parameters are level, trend, and damping trend. Damped exponential smoothing is most similar to an ARIMA with one order of autoregression, one order of differencing, and two orders of moving average.

Simple seasonal. This model is appropriate for a series in which there is no trend and a seasonal effect that is constant over time. Its relevant smoothing parameters are level and season. Seasonal exponential smoothing is most similar to an ARIMA with zero orders of autoregression; one order of differencing; one order of seasonal differencing; and orders 1, p, and p+1 of moving average, where p is the number of periods in a seasonal interval. For monthly data, p = 12.

Winters’ additive. This model is appropriate for a series in which there is a linear trend and a seasonal effect that is constant over time. Its relevant smoothing parameters are level, trend, and season. Winters’ additive exponential smoothing is most similar to an ARIMA with zero orders of autoregression; one order of differencing; one order of seasonal differencing; and p+1 orders of moving average, where p is the number of periods in a seasonal interval. For monthly data, p = 12.

Winters’ multiplicative. This model is appropriate for a series in which there is a linear trend and a seasonal effect that changes with the magnitude of the series. Its relevant smoothing parameters are level, trend, and season. Winters’ multiplicative exponential smoothing is not similar to any ARIMA model.

Target Transformation.
You can specify a transformation to be performed on each dependent variable before it is modeled. For more information, see the topic Series Transformations on p. 423.

None. No transformation is performed.
Square root. Square root transformation is performed.
Natural log. Natural log transformation is performed.

Time Series ARIMA Criteria

The Time Series node allows you to build custom nonseasonal or seasonal ARIMA models—also known as Box-Jenkins models—with or without a fixed set of input (predictor) variables. You can define transfer functions for any or all of the input variables and specify automatic detection of outliers or an explicit set of outliers. All input variables specified are explicitly included in the model. This is in contrast to using the Expert Modeler, where input variables are included only if they have a statistically significant relationship with the target variable.

Model

The Model tab allows you to specify the structure of a custom ARIMA model.

Figure 13-13 ARIMA Criteria dialog box, Model tab

ARIMA Orders. Enter values for the various ARIMA components of your model into the corresponding cells of the Structure grid. All values must be non-negative integers. For autoregressive and moving average components, the value represents the maximum order. All positive lower orders will be included in the model. For example, if you specify 2, the model includes orders 2 and 1. Cells in the Seasonal column are only enabled if a periodicity has been defined for the active dataset.

Autoregressive (p). The number of autoregressive orders in the model. Autoregressive orders specify which previous values from the series are used to predict current values. For example, an autoregressive order of 2 specifies that the value of the series two time periods in the past be used to predict the current value.

Difference (d). Specifies the order of differencing applied to the series before estimating models.
Differencing is necessary when trends are present (series with trends are typically nonstationary, and ARIMA modeling assumes stationarity) and is used to remove their effect. The order of differencing corresponds to the degree of series trend—first-order differencing accounts for linear trends, second-order differencing accounts for quadratic trends, and so on.

Moving Average (q). The number of moving average orders in the model. Moving average orders specify how deviations from the series mean for previous values are used to predict current values. For example, moving-average orders of 1 and 2 specify that deviations from the mean value of the series from each of the last two time periods be considered when predicting current values of the series.

Seasonal Orders. Seasonal autoregressive, moving average, and differencing components play the same roles as their nonseasonal counterparts. For seasonal orders, however, current series values are affected by previous series values separated by one or more seasonal periods. For example, for monthly data (seasonal period of 12), a seasonal order of 1 means that the current series value is affected by the series value 12 periods prior to the current one. A seasonal order of 1, for monthly data, is then the same as specifying a nonseasonal order of 12.

Target Transformation. You can specify a transformation to be performed on each target variable before it is modeled. For more information, see the topic Series Transformations on p. 423.

None. No transformation is performed.
Square root. Square root transformation is performed.
Natural log. Natural log transformation is performed.

Include constant in model. Inclusion of a constant is standard unless you are sure that the overall mean series value is 0. Excluding the constant is recommended when differencing is applied.
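The correspondence between the differencing order d and the degree of the trend is easy to verify with a small pure-Python sketch (our own illustration): differencing a quadratic series once leaves a linear trend, and differencing it twice leaves a constant series.

```python
def diff(series):
    """First differences of a series."""
    return [b - a for a, b in zip(series, series[1:])]

quadratic = [t * t for t in range(8)]  # 0, 1, 4, 9, ... (quadratic trend)

once = diff(quadratic)   # [1, 3, 5, 7, 9, 11, 13] (still trending, now linearly)
twice = diff(once)       # [2, 2, 2, 2, 2, 2] (trend removed, i.e. stationary level)

print(twice)
```

In ARIMA terms, this series would call for d = 2; a series with only a linear trend would call for d = 1.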
Transfer Functions

Figure 13-14 ARIMA Criteria dialog box, Transfer Functions tab

The Transfer Functions tab allows you to define transfer functions for any or all of the input fields. Transfer functions allow you to specify the manner in which past values of these fields are used to forecast future values of the target series. The tab is displayed only if input fields (with the role set to Input) are specified, either on the Type node or on the Fields tab of the Time Series node (select Use custom settings—Inputs). The top list shows all input fields. The remaining information in this dialog box is specific to the selected input field in the list.

Transfer Function Orders. Enter values for the various components of the transfer function into the corresponding cells of the Structure grid. All values must be non-negative integers. For numerator and denominator components, the value represents the maximum order. All positive lower orders will be included in the model. In addition, order 0 is always included for numerator components. For example, if you specify 2 for numerator, the model includes orders 2, 1, and 0. If you specify 3 for denominator, the model includes orders 3, 2, and 1. Cells in the Seasonal column are only enabled if a periodicity has been defined for the active dataset.

Numerator. The numerator order of the transfer function specifies which previous values from the selected independent (predictor) series are used to predict current values of the dependent series. For example, a numerator order of 1 specifies that the value of an independent series one time period in the past—as well as the current value of the independent series—is used to predict the current value of each dependent series.

Denominator. The denominator order of the transfer function specifies how deviations from the series mean, for previous values of the selected independent (predictor) series, are used to predict current values of the dependent series.
For example, a denominator order of 1 specifies that deviations from the mean value of an independent series one time period in the past be considered when predicting the current value of each dependent series.

Difference. Specifies the order of differencing applied to the selected independent (predictor) series before estimating models. Differencing is necessary when trends are present and is used to remove their effect.

Seasonal Orders. Seasonal numerator, denominator, and differencing components play the same roles as their nonseasonal counterparts. For seasonal orders, however, current series values are affected by previous series values separated by one or more seasonal periods. For example, for monthly data (seasonal period of 12), a seasonal order of 1 means that the current series value is affected by the series value 12 periods prior to the current one. A seasonal order of 1, for monthly data, is then the same as specifying a nonseasonal order of 12.

Delay. Setting a delay causes the input field’s influence to be delayed by the number of intervals specified. For example, if the delay is set to 5, the value of the input field at time t doesn’t affect forecasts until five periods have elapsed (t + 5).

Transformation. Specification of a transfer function for a set of independent variables also includes an optional transformation to be performed on those variables.

None. No transformation is performed.
Square root. Square root transformation is performed.
Natural log. Natural log transformation is performed.

Handling Outliers

Figure 13-15 ARIMA Criteria dialog box, Outliers tab

The Outliers tab provides a number of choices for the handling of outliers in the data.

Do not detect outliers or model them. By default, outliers are neither detected nor modeled. Select this option to disable any detection or modeling of outliers.

Detect outliers automatically.
Select this option to perform automatic detection of outliers, and select one or more of the outlier types shown.

Type of Outliers to Detect. Select the outlier type(s) you want to detect. The supported types are:

Additive (default)
Level shift (default)
Innovational
Transient
Seasonal additive
Local trend
Additive patch

For more information, see the topic Outliers on p. 420.

Generating Time Series Models

This section gives some general information about certain aspects of generating time series models:

Generating multiple models
Using time series models in forecasting
Reestimating and forecasting

The generated model nugget is described in a separate topic. For more information, see the topic Time Series Model Nugget on p. 438.

Generating Multiple Models

Time series modeling in IBM® SPSS® Modeler generates a single model (either ARIMA or exponential smoothing) for each target field. Thus, if you have multiple target fields, SPSS Modeler generates multiple models in a single operation, saving time and enabling you to compare the settings for each model. If you want to compare an ARIMA model and an exponential smoothing model for the same target field, you can perform separate executions of the Time Series node, specifying a different model each time.

Using Time Series Models in Forecasting

A time series build operation uses a specific series of ordered cases, known as the estimation span, to build a model that can be used to forecast future values of the series. This model contains information about the time span used, including the interval. In order to forecast using this model, the same time span and interval information must be used with the same series for both the target variable and predictor variables.

For example, suppose that at the beginning of January you want to forecast monthly sales of Product 1 for the first three months of the year.
You build a model using the actual monthly sales data for Product 1 from January through December of the previous year (which we’ll call Year 1), setting the Time Interval to “Months.” You can then use the model to forecast sales of Product 1 for the first three months of Year 2. In fact you could forecast any number of months ahead, but of course, the further into the future you try to predict, the less effective the model will become. It would not, however, be possible to forecast the first three weeks of Year 2, because the interval used to build the model was “Months.” It would also make no sense to use this model to predict the sales of Product 2—a time series model is relevant only for the data that was used to define it.

Reestimating and Forecasting

The estimation period is hard-coded into the model that is generated. This means that any values outside the estimation period are ignored if you apply the current model to new data. Thus, a time series model must be reestimated each time new data is available, in contrast to other IBM® SPSS® Modeler models, which can be reapplied unchanged for the purposes of scoring.

To continue the previous example, suppose that by the beginning of April in Year 2, you have the actual monthly sales data for January through March. However, if you reapply the model you generated at the beginning of January, it will again forecast January through March and ignore the known sales data for that period. The solution is to generate a new model based on the updated actual data. Assuming that you do not change the forecasting parameters, the new model can be used to forecast the next three months, April through June.

If you still have access to the stream that was used to generate the original model, you can simply replace the reference to the source file in that stream with a reference to the file containing the updated data and rerun the stream to generate the new model.
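The reestimate-and-forecast idea can be sketched outside SPSS Modeler. The following is a minimal, purely illustrative Python sketch: it uses simple exponential smoothing with an assumed smoothing constant and invented sales figures (not SPSS Modeler's estimation algorithm) to show why the April model differs from the January model once the new actuals are folded in.

```python
# Illustration only: reestimating a time series model when new data arrives.
# Simple exponential smoothing with a fixed alpha; the h-step-ahead forecast
# is flat at the final smoothed level. Alpha and all sales figures are made up.
def fit_ses(series, alpha=0.3):
    """Return the final smoothed level of the series."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

year1 = [110, 112, 108, 115, 120, 118, 125, 130, 128, 135, 140, 138]  # Jan-Dec, Year 1
jan_model = fit_ses(year1)
forecast_q1 = [jan_model] * 3        # Jan-Mar, Year 2, from the January model

# By April, the Jan-Mar actuals are known. Reapplying jan_model would ignore
# them, so we reestimate on the extended series instead.
actual_q1 = [142, 139, 145]
april_model = fit_ses(year1 + actual_q1)
forecast_q2 = [april_model] * 3      # Apr-Jun, Year 2, from the reestimated model
```

The reestimated level reflects the three new observations, which is exactly what reapplying the unchanged January model would fail to do.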
However, if all you have is the original model saved in a file, you can still use it to generate a Time Series node that you can then add to a new stream containing a reference to the updated source file. Provided this new stream precedes the Time Series node with a Time Intervals node where the interval is set to “Months,” running this new stream will then generate the required new model.

Time Series Model Nugget

The time series modeling operation creates a number of new fields with the prefix $TS-, as follows:

$TS-colname. The value forecasted by the model for each target series.
$TSLCI-colname. The lower confidence intervals for each forecasted series.*
$TSUCI-colname. The upper confidence intervals for each forecasted series.*
$TSNR-colname. The noise residual value for each column of the generated model data.*
$TS-Total. The total of the $TS-colname values for this row.
$TSLCI-Total. The total of the $TSLCI-colname values for this row.*
$TSUCI-Total. The total of the $TSUCI-colname values for this row.*
$TSNR-Total. The total of the $TSNR-colname values for this row.*

* Visibility of these fields (for example, in the output from an attached Table node) depends on options on the Settings tab of the Time Series model nugget. For more information, see the topic Time Series Model Settings on p. 445.

Figure 13-16 Time Series model nugget, Model tab

The Time Series model nugget displays details of the various models selected for each of the series input into the Time Series build node. Multiple series (such as data relating to product lines, regions, or stores) can be input, and a separate model is generated for each target series. For example, if revenue in the eastern region is found to fit an ARIMA model, but the western region fits only a simple moving average, each region is scored with the appropriate model.
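The field-naming scheme can be illustrated with a toy scored row. The series names and forecast values below are invented for illustration and are not SPSS Modeler output.

```python
# Toy illustration of the $TS- field-naming scheme described above.
# "east" and "west" and their forecast values are made up.
forecasts = {"east": 120, "west": 98}            # per-series forecast values
row = {f"$TS-{name}": value for name, value in forecasts.items()}
row["$TS-Total"] = sum(forecasts.values())       # total of the $TS- values for this row
# row -> {'$TS-east': 120, '$TS-west': 98, '$TS-Total': 218}
```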
The default output shows, for each model built, the model type, the number of predictors specified, and the goodness-of-fit measure (stationary R-squared is the default). If you have specified outlier methods, there is a column showing the number of outliers detected. The default output also includes columns for Ljung-Box Q, degrees of freedom, and significance values.

You can also choose advanced output, which displays the following additional columns:

R-squared
RMSE (Root Mean Square Error)
MAPE (Mean Absolute Percentage Error)
MAE (Mean Absolute Error)
MaxAPE (Maximum Absolute Percentage Error)
MaxAE (Maximum Absolute Error)
Norm. BIC (Normalized Bayesian Information Criterion)

Generate. Enables you to generate a Time Series modeling node back to the stream or a model nugget to the palette.

Generate Modeling Node. Places a Time Series modeling node into a stream with the settings used to create this set of models. Doing so would be useful, for example, if you have a stream in which you want to use these model settings but you no longer have the modeling node used to generate them.

Model to Palette. Places a model nugget containing all the targets in the Models manager.

Model

Figure 13-17 Check All and Uncheck All buttons

Check boxes. Choose which models you want to use in scoring. All the boxes are checked by default. The Check all and Uncheck all buttons act on all the boxes in a single operation.

Sort by. Enables you to sort the output rows in ascending or descending order for a specified column of the display. The “Selected” option sorts the output based on one or more rows selected by check boxes. This would be useful, for example, to cause target fields named “Market_1” to “Market_9” to be displayed before “Market_10,” as the default sort order displays “Market_10” immediately after “Market_1.”

View. The default view (Simple) displays the basic set of output columns.
The Advanced option displays additional columns for goodness-of-fit measures.

Number of records used in estimation. The number of rows in the original source data file.

Target. The field or fields identified as the target fields (those with a role of Target) in the Type node.

Model. The type of model used for this target field.

Predictors. The number of predictors (those with a role of Input) used for this target field.

Outliers. This column is displayed only if you have requested (in the Expert Modeler or ARIMA criteria) the automatic detection of outliers. The value shown is the number of outliers detected.

Stationary R-squared. A measure that compares the stationary part of the model to a simple mean model. This measure is preferable to ordinary R-squared when there is a trend or seasonal pattern. Stationary R-squared can be negative, with a range of negative infinity to 1. Negative values mean that the model under consideration is worse than the baseline model. Positive values mean that the model under consideration is better than the baseline model.

R-squared. Goodness-of-fit measure of a linear model, sometimes called the coefficient of determination. It is the proportion of variation in the dependent variable explained by the regression model. It ranges in value from 0 to 1. Small values indicate that the model does not fit the data well.

RMSE. Root Mean Square Error. The square root of mean square error. A measure of how much a dependent series varies from its model-predicted level, expressed in the same units as the dependent series.

MAPE. Mean Absolute Percentage Error. A measure of how much a dependent series varies from its model-predicted level. It is independent of the units used and can therefore be used to compare series with different units.

MAE. Mean Absolute Error. Measures how much the series varies from its model-predicted level. MAE is reported in the original series units.

MaxAPE. Maximum Absolute Percentage Error.
The largest forecasted error, expressed as a percentage. This measure is useful for imagining a worst-case scenario for your forecasts.

MaxAE. Maximum Absolute Error. The largest forecasted error, expressed in the same units as the dependent series. Like MaxAPE, it is useful for imagining the worst-case scenario for your forecasts. Maximum absolute error and maximum absolute percentage error may occur at different series points; for example, when the absolute error for a large series value is slightly larger than the absolute error for a small series value. In that case, the maximum absolute error will occur at the larger series value and the maximum absolute percentage error will occur at the smaller series value.

Normalized BIC. Normalized Bayesian Information Criterion. A general measure of the overall fit of a model that attempts to account for model complexity. It is a score based upon the mean square error and includes a penalty for the number of parameters in the model and the length of the series. The penalty removes the advantage of models with more parameters, making the statistic easy to compare across different models for the same series.

Q. The Ljung-Box Q statistic. A test of the randomness of the residual errors in this model.

df. Degrees of freedom. The number of model parameters that are free to vary when estimating a particular target.

Sig. Significance value of the Ljung-Box statistic. A significance value less than 0.05 indicates that the residual errors are not random.

Summary Statistics. This section contains various summary statistics for the different columns, including mean, minimum, maximum, and percentile values.

Time Series Model Parameters

Figure 13-18 Time Series model, Parameters tab

The Parameters tab lists details of various parameters that were used to build a selected model.

Display parameters for model. Select the model for which you want to display the parameter details.

Target.
The name of the target field (with the role Target) forecast by this model.

Model. The type of model used for this target field.

Field (ARIMA models only). Contains one entry for each of the variables used in the model, with the target first, followed by the predictors, if any.

Transformation. Indicates what type of transformation was specified, if any, for this field before the model was built.

Parameter. The model parameter for which the following details are displayed:

Lag (ARIMA models only). Indicates the lags, if any, considered for this parameter in the model.

Estimate. The parameter estimate. This value is used in calculating the forecast value and confidence intervals for the target field.

SE. The standard error of the parameter estimate.

t. The value of the parameter estimate divided by the standard error.

Sig. The significance level for the parameter estimate. Values above 0.05 are regarded as not statistically significant.

Time Series Model Residuals

Figure 13-19 Time Series model, Residuals tab, ACF and PACF display

The Residuals tab shows the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the residuals (the differences between expected and actual values) for each model built. For more information, see the topic Autocorrelation and Partial Autocorrelation Functions on p. 422.

Display plot for model. Select the model for which you want to display the residual ACF and residual PACF.

Time Series Model Summary

Figure 13-20 Time Series model, Summary tab

The Summary tab of a model nugget displays information about the model itself (Analysis), fields used in the model (Fields), settings used when building the model (Build Settings), and model training (Training Summary). When you first browse the node, the Summary tab results are collapsed. To see the results of interest, use the expander control to the left of an item to unfold it, or click the Expand All button to show all results.
To hide the results when you have finished viewing them, use the expander control to collapse the specific results that you want to hide, or click the Collapse All button to collapse all results.

Analysis. Displays information about the specific model.

Fields. Lists the fields used as the target and the inputs in building the model.

Build Settings. Contains information about the settings used in building the model.

Training Summary. Shows the type of model, the stream used to create it, the user who created it, when it was built, and the elapsed time for building the model.

Time Series Model Settings

Figure 13-21 Time Series model, Settings tab

The Settings tab enables you to specify what extra fields are created by the modeling operation.

Create new fields for each model to be scored. Enables you to specify the new fields to create for each model to be scored.

Calculate upper and lower confidence limits. If checked, creates new fields (with the default prefixes $TSLCI- and $TSUCI-) for the lower and upper confidence intervals, respectively, for each target field, together with totals of these values.

Calculate noise residuals. If checked, creates a new field (with the default prefix $TSNR-) for the model residuals for each target field, together with a total of these values.

Chapter 14
Self-Learning Response Node Models

SLRM Node

The Self-Learning Response Model (SLRM) node enables you to build a model that you can continually update, or reestimate, as a dataset grows, without having to rebuild the model every time using the complete dataset. For example, this is useful when you have several products and you want to identify which product a customer is most likely to buy if you offer it to them. This model allows you to predict which offers are most appropriate for customers and the probability of the offers being accepted. The model can initially be built using a small dataset with randomly made offers and the responses to those offers.
As the dataset grows, the model can be updated and therefore becomes more able to predict the most suitable offers for customers, and the probability of their acceptance, based upon other input fields such as age, gender, job, and income. The offers available can be changed by adding or removing them from within the node dialog box, instead of having to change the target field of the dataset. When coupled with IBM® SPSS® Collaboration and Deployment Services, you can set up automatic regular updates to the model. This process, without the need for human oversight or action, provides a flexible and low-cost solution for organizations and applications where custom intervention by a data miner is not possible or necessary.

Example. A financial institution wants to achieve more profitable results by matching the offer that is most likely to be accepted to each customer. You can use a self-learning model to identify the characteristics of customers most likely to respond favorably based on previous promotions and to update the model in real time based on the latest customer responses.

SLRM Node Fields Options

Figure 14-1 SLRM node dialog box, Fields tab

Before executing an SLRM node, you must specify both the target and target response fields on the Fields tab of the node.

Target field. Select the target field from the list; for example, a nominal (set) field containing the different products you want to offer to customers. Note: The target field must have string storage, not numeric.

Target response field. Select the target response field from the list; for example, Accepted or Rejected. Note: This field must be a Flag. The true value of the flag indicates offer acceptance and the false value indicates offer refusal.

The remaining fields in this dialog box are the standard ones used throughout IBM® SPSS® Modeler.
For more information, see the topic Modeling Node Fields Options in Chapter 3 on p. 35. Note: If the source data includes ranges that are to be used as continuous (numeric range) input fields, you must ensure that the metadata includes both the minimum and maximum details for each range.

SLRM Node Model Options

Figure 14-2 SLRM node dialog box, Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.

Continue training existing model. By default, a completely new model is created each time a modeling node is executed. If this option is selected, training continues with the last model successfully produced by the node. This makes it possible to update or refresh an existing model without having to access the original data, and may result in significantly faster performance since only the new or updated records are fed into the stream. Details on the previous model are stored with the modeling node, making it possible to use this option even if the previous model nugget is no longer available in the stream or Models palette.

Target field values. By default this is set to Use all, which means that a model will be built that contains every offer associated with the selected target field value. If you want to generate a model that contains only some of the target field’s offers, click Specify and use the Add, Edit, and Delete buttons to enter or amend the names of the offers for which you want to build a model. For example, if you chose a target that lists all of the products you supply, you can use this field to limit the offered products to just a few that you enter here.

Model Assessment.
The fields in this panel are independent from the model in that they do not affect the scoring. Instead, they enable you to create a visual representation of how well the model will predict results. Note: To display the model assessment results in the model nugget, you must also select the Display model evaluation box.

Include model assessment. Select this box to create graphs that show the model’s predicted accuracy for each selected offer.

Set random seed. When estimating the accuracy of a model based on a random percentage, this option allows you to duplicate the same results in another session. By specifying the starting value used by the random number generator, you can ensure the same records are assigned each time the node is executed. Enter the desired seed value. If this option is not selected, a different sample will be generated each time the node is executed.

Simulated sample size. Specify the number of records to be used in the sample when assessing the model. The default is 100.

Number of iterations. This enables you to stop building the model assessment after the number of iterations specified. Specify the maximum number of iterations; the default is 20. Note: Bear in mind that large sample sizes and high numbers of iterations will increase the amount of time it takes to build the model.

Display model evaluation. Select this option to display a graphical representation of the results in the model nugget.

SLRM Node Settings Options

Figure 14-3 SLRM node dialog box, Settings tab

The node settings options allow you to fine-tune the model-building process.

Maximum number of predictions per record. This option allows you to limit the number of predictions made for each record in the dataset. The default is 3. For example, you may have six offers (such as savings, mortgage, car loan, pension, credit card, and insurance), but you only want to know the best two to recommend; in this case you would set this field to 2.
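In effect, a "best two of six" selection ranks the offers by score and keeps the top entries. A hedged sketch, using the six offer names from the example with invented scores (SPSS Modeler's internal scoring is not shown here):

```python
# Illustration only: selecting the best two of six offers by score.
# The offer names match the example; the scores are invented.
scores = {"savings": 0.62, "mortgage": 0.55, "car loan": 0.48,
          "pension": 0.30, "credit card": 0.71, "insurance": 0.25}

# Rank offers from highest to lowest score, then keep the top two.
top_two = sorted(scores, key=scores.get, reverse=True)[:2]
# top_two -> ['credit card', 'savings']
```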
When you build the model and attach it to a table, you would see two prediction columns (and the associated confidence in the probability of the offer being accepted) per record. The predictions could be made up of any of the six possible offers.

Level of randomization. To prevent any bias—for example, in a small or incomplete dataset—and treat all potential offers equally, you can add a level of randomization to the selection of offers and the probability of their being included as recommended offers. Randomization is expressed as a percentage, shown as decimal values between 0.0 (no randomization) and 1.0 (completely random). The default is 0.0.

Set random seed. When adding a level of randomization to selection of an offer, this option allows you to duplicate the same results in another session. By specifying the starting value used by the random number generator, you can ensure the same records are assigned each time the node is executed. Enter the desired seed value. If this option is not selected, a different sample will be generated each time the node is executed. Note: When using the Set random seed option with records read from a database, a Sort node may be required prior to sampling in order to ensure the same result each time the node is executed. This is because the random seed depends on the order of records, which is not guaranteed to stay the same in a relational database.

Sort order. Select the order in which offers are to be displayed in the built model:

Descending. The model displays offers with the highest scores first. These are the offers that have the greatest probability of being accepted.
Ascending. The model displays offers with the lowest scores first. These are the offers that have the greatest probability of being rejected. For example, this may be useful when deciding which customers to remove from a marketing campaign for a specific offer.

Preferences for target fields.
When building a model, there may be certain aspects of the data that you want to actively promote or remove. For example, if building a model that selects the best financial offer to promote to a customer, you may want to ensure that one particular offer is always included regardless of how well it scores against each customer. To include an offer in this panel and edit its preferences, click Add, type the offer’s name (for example, Savings or Mortgage), and click OK.

Value. This shows the name of the offer that you added.

Preference. Specify the level of preference to be applied to the offer. Preference is expressed as a percentage, shown as decimal values between 0.0 (not preferred) and 1.0 (most preferred). The default is 0.0.

Always include. To ensure that a specific offer is always included in the predictions, select this box. Note: If the Preference is set to 0.0, the Always include setting is ignored.

Take account of model reliability. A well-structured, data-rich model that has been fine-tuned through several regenerations should always produce more accurate results compared to a brand new model with little data. To take advantage of the more mature model’s increased reliability, select this box.

SLRM Model Nuggets

Note: Results are only shown on this tab if you select both Include model assessment and Display model evaluation on the Model options tab.

Figure 14-4 SLRM model nugget display

When you run a stream containing an SLRM model, the node estimates the accuracy of the predictions for each target field value (offer) and the importance of each predictor used. Note: If you selected Continue training existing model on the modeling node Model tab, the information shown on the model nugget is updated each time you regenerate the model. For models built using IBM® SPSS® Modeler 12.0 or later, the model nugget Model tab is divided into two columns:

Left column.

View.
When you have more than one offer, select the one for which you want to display results.

Model Performance. This shows the estimated model accuracy of each offer. The test set is generated through simulation.

Right column.

View. Select whether you want to display Association with Response or Variable Importance details.

Association with Response. Displays the association (correlation) of each predictor with the target variable.

Predictor Importance. Indicates the relative importance of each predictor in estimating the model. Typically you will want to focus your modeling efforts on the predictors that matter most and consider dropping or ignoring those that matter least. This chart can be interpreted in the same manner as for other models that display predictor importance, though in the case of SLRM the graph is generated through simulation by the SLRM algorithm. This is done by removing each predictor in turn from the model and seeing how this affects the model’s accuracy. For more information, see the topic Predictor Importance in Chapter 3 on p. 51.

SLRM Model Settings

The Settings tab for an SLRM model nugget specifies options for modifying the built model. For example, you may use the SLRM node to build several different models using the same data and settings, then use this tab in each model to slightly modify the settings to see how that affects the results. Note: This tab is only available after the model nugget has been added to a stream.

Figure 14-5 SLRM model nugget dialog box, Settings tab

Maximum number of predictions per record. This option allows you to limit the number of predictions made for each record in the dataset. The default is 3. For example, you may have six offers (such as savings, mortgage, car loan, pension, credit card, and insurance), but you only want to know the best two to recommend; in this case you would set this field to 2.
When you build the model and attach it to a table, you would see two prediction columns (and the associated confidence in the probability of the offer being accepted) per record. The predictions could be made up of any of the six possible offers.

Level of randomization. To prevent any bias—for example, in a small or incomplete dataset—and treat all potential offers equally, you can add a level of randomization to the selection of offers and the probability of their being included as recommended offers. Randomization is expressed as a percentage, shown as decimal values between 0.0 (no randomization) and 1.0 (completely random). The default is 0.0.

Set random seed. When adding a level of randomization to selection of an offer, this option allows you to duplicate the same results in another session. By specifying the starting value used by the random number generator, you can ensure the same records are assigned each time the node is executed. Enter the desired seed value. If this option is not selected, a different sample will be generated each time the node is executed. Note: When using the Set random seed option with records read from a database, a Sort node may be required prior to sampling in order to ensure the same result each time the node is executed. This is because the random seed depends on the order of records, which is not guaranteed to stay the same in a relational database.

Sort order. Select the order in which offers are to be displayed in the built model:

Descending. The model displays offers with the highest scores first. These are the offers that have the greatest probability of being accepted.
Ascending. The model displays offers with the lowest scores first. These are the offers that have the greatest probability of being rejected. For example, this may be useful when deciding which customers to remove from a marketing campaign for a specific offer.

Preferences for target fields.
When building a model, there may be certain aspects of the data that you want to actively promote or remove. For example, if building a model that selects the best financial offer to promote to a customer, you may want to ensure that one particular offer is always included regardless of how well it scores against each customer. To include an offer in this panel and edit its preferences, click Add, type the offer’s name (for example, Savings or Mortgage), and click OK.

Value. This shows the name of the offer that you added.

Preference. Specify the level of preference to be applied to the offer. Preference is expressed as a percentage, shown as decimal values between 0.0 (not preferred) and 1.0 (most preferred). The default is 0.0.

Always include. To ensure that a specific offer is always included in the predictions, select this box. Note: If the Preference is set to 0.0, the Always include setting is ignored.

Take account of model reliability. A well-structured, data-rich model that has been fine-tuned through several regenerations should always produce more accurate results compared to a brand new model with little data. To take advantage of the more mature model’s increased reliability, select this box.

Chapter 15
Support Vector Machine Models

About SVM

Support Vector Machine (SVM) is a robust classification and regression technique that maximizes the predictive accuracy of a model without overfitting the training data. SVM is particularly suited to analyzing data with very large numbers (for example, thousands) of predictor fields. SVM has applications in many disciplines, including customer relationship management (CRM), facial and other image recognition, bioinformatics, text mining concept extraction, intrusion detection, protein structure prediction, and voice and speech recognition.

How SVM Works

SVM works by mapping data to a high-dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable.
A separator between the categories is found, then the data are transformed in such a way that the separator could be drawn as a hyperplane. Following this, characteristics of new data can be used to predict the group to which a new record should belong. For example, consider the following figure, in which the data points fall into two different categories:

Figure 15-1 Original dataset

The two categories can be separated with a curve:

Figure 15-2 Data with separator added

After the transformation, the boundary between the two categories can be defined by a hyperplane:

Figure 15-3 Transformed data

The mathematical function used for the transformation is known as the kernel function. SVM in IBM® SPSS® Modeler supports the following kernel types:

Linear
Polynomial
Radial basis function (RBF)
Sigmoid

A linear kernel function is recommended when linear separation of the data is straightforward. In other cases, one of the other functions should be used. You will need to experiment with the different functions to obtain the best model in each case, as they each use different algorithms and parameters.

Tuning an SVM Model

Besides the separating line between the categories, a classification SVM model also finds marginal lines that define the space between the two categories:

Figure 15-4 Data with a preliminary model

The data points that lie on the margins are known as the support vectors. The wider the margin between the two categories, the better the model will be at predicting the category for new records. In the previous example, the margin is not very wide, and the model is said to be overfitted.
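The four kernel types listed above have standard textbook forms, sketched below in plain Python. This is a simplified illustration, not SPSS Modeler's implementation; the parameter names gamma, coef0, and degree follow common SVM conventions and are assumptions, not SPSS Modeler option names.

```python
import math

# Textbook forms of the four kernel functions named above (illustrative only).
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear(u, v):
    return dot(u, v)

def polynomial(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree

def rbf(u, v, gamma=1.0):
    # Radial basis function: decays with squared distance between the points.
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def sigmoid(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)
```

Each kernel scores the similarity of two records; the RBF kernel, for instance, returns 1.0 for identical points and approaches 0 as they move apart, which is one reason it handles curved boundaries that defeat the linear kernel.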
A small amount of misclassification can be accepted in order to widen the margin, for example:

Figure 15-5 Data with an improved model

In some cases, linear separation is more difficult, for example:

Figure 15-6 A problem for linear separation

In a case like this, the goal is to find the optimum balance between a wide margin and a small number of misclassified data points. The kernel function has a regularization parameter (known as C) that controls the trade-off between these two values. You will probably need to experiment with different values of this and other kernel parameters in order to find the best model.

SVM Node

The SVM node enables you to use a support vector machine to classify data. SVM is particularly suited for use with wide datasets, that is, those with a large number of predictor fields. You can use the default settings on the node to produce a basic model relatively quickly, or you can use the Expert settings to experiment with different types of SVM model. When the model has been built, you can:

Browse the model nugget to display the relative importance of the input fields in building the model.
Append a Table node to the model nugget to view the model output.

Example. A medical researcher has obtained a dataset containing characteristics of a number of human cell samples extracted from patients who were believed to be at risk of developing cancer. Analysis of the original data showed that many of the characteristics differed significantly between benign and malignant samples. The researcher wants to develop an SVM model that can use the values of similar cell characteristics in samples from other patients to give an early indication of whether their samples might be benign or malignant.

SVM Node Model Options

Figure 15-7 SVM node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
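The margin-versus-misclassification trade-off that the regularization parameter C controls, described earlier in this chapter, can be illustrated numerically with the standard linear SVM objective: a margin term (a small weight norm means a wide margin) plus C times the hinge loss (the penalty for points on the wrong side of the margin). This is a hypothetical one-dimensional Python sketch, not SPSS Modeler code; the two candidate models and the data are made up for illustration.

```python
# Objective = 0.5 * ||w||^2  +  C * (sum of margin violations).
def hinge_loss(w, b, points):
    """Sum of max(0, 1 - y*(w*x + b)) over labeled 1-D points (x, y)."""
    return sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in points)

def objective(w, b, points, C):
    return 0.5 * w * w + C * hinge_loss(w, b, points)

# One outlier at x = -0.5 (labeled +1) among otherwise separable data.
data = [(-2, -1), (-1.5, -1), (-0.5, +1), (1.5, +1), (2, +1)]

# A wide-margin model that tolerates the outlier, versus a narrow-margin
# model that bends to fit it.
print(objective(1.0, 0.0, data, C=1.0))    # → 2.0  (wide margin wins at low C)
print(objective(4.0, 2.5, data, C=1.0))    # → 8.5
print(objective(1.0, 0.0, data, C=100.0))  # → 150.5
print(objective(4.0, 2.5, data, C=100.0))  # → 58.0 (narrow margin wins at high C)
```

At a small C the wide-margin model scores better; at a large C the penalty for the outlier dominates and the narrow, overfitted model wins, which is why increasing C improves training accuracy but risks overfitting.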
Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.

Create split models. Builds a separate model for each possible value of input fields that are specified as split fields. For more information, see the topic Building Split Models in Chapter 3 on p. 30.

SVM Node Expert Options

If you have detailed knowledge of support vector machines, expert options allow you to fine-tune the training process. To access the expert options, set Mode to Expert on the Expert tab.

Figure 15-8 SVM node expert options

Append all probabilities (valid only for categorical targets). If selected (checked), specifies that probabilities for each possible value of a nominal or flag target field are displayed for each record processed by the node. If this option is not selected, the probability of only the predicted value is displayed for nominal or flag target fields. The setting of this check box determines the default state of the corresponding check box on the model nugget display.

Stopping criteria. Determines when to stop the optimization algorithm. Values range from 1.0E–1 to 1.0E–6; the default is 1.0E–3. Reducing the value results in a more accurate model, but the model will take longer to train.

Regularization parameter (C). Controls the trade-off between maximizing the margin and minimizing the training error term. The value should normally be between 1 and 10 inclusive; the default is 10. Increasing the value improves the classification accuracy (or reduces the regression error) for the training data, but this can also lead to overfitting.

Regression precision (epsilon). Used only if the measurement level of the target field is Continuous. Causes errors to be accepted provided that they are less than the value specified here. Increasing the value may result in faster modeling, but at the expense of accuracy.

Kernel type.
Determines the type of kernel function used for the transformation. Different kernel types cause the separator to be calculated in different ways, so it is advisable to experiment with the various options. The default is RBF (Radial Basis Function).

RBF gamma. Enabled only if the kernel type is set to RBF. The value should normally be between 3/k and 6/k, where k is the number of input fields. For example, if there are 12 input fields, values between 0.25 and 0.5 would be worth trying. Increasing the value improves the classification accuracy (or reduces the regression error) for the training data, but this can also lead to overfitting.

Gamma. Enabled only if the kernel type is set to Polynomial or Sigmoid. Increasing the value improves the classification accuracy (or reduces the regression error) for the training data, but this can also lead to overfitting.

Bias. Enabled only if the kernel type is set to Polynomial or Sigmoid. Sets the coef0 value in the kernel function. The default value of 0 is suitable in most cases.

Degree. Enabled only if Kernel type is set to Polynomial. Controls the complexity (dimension) of the mapping space. Normally you would not use a value greater than 10.

SVM Model Nugget

The SVM model creates a number of new fields. The most important of these is the $S-fieldname field, which shows the target field value predicted by the model. The number and names of the new fields created by the model depend on the measurement level of the target field (this field is indicated in the following tables by fieldname). To see these fields and their values, add a Table node to the SVM model nugget and execute the Table node.

Table 15-1 Target field measurement level is 'Nominal' or 'Flag'

$S-fieldname     Predicted value of target field.
$SP-fieldname    Probability of predicted value.
$SP-value        Probability of each possible value of a nominal or flag target (displayed only if Append all probabilities is checked on the Settings tab of the model nugget).
$SRP-value and $SAP-value    (Flag targets only) Raw (SRP) and adjusted (SAP) propensity scores, indicating the likelihood of a "true" outcome for the target field. These scores are displayed only if the corresponding check boxes are selected on the Analyze tab of the SVM modeling node before the model is generated. For more information, see the topic Modeling Node Analyze Options in Chapter 3 on p. 39.

Table 15-2 Target field measurement level is 'Continuous'

$S-fieldname     Predicted value of target field.

Predictor Importance

Optionally, a chart that indicates the relative importance of each predictor in estimating the model may also be displayed on the Model tab. Typically you will want to focus your modeling efforts on the predictors that matter most and consider dropping or ignoring those that matter least. Note that this chart is only available if Calculate predictor importance is selected on the Analyze tab before generating the model. For more information, see the topic Predictor Importance in Chapter 3 on p. 51.

Note: Predictor importance may take longer to calculate for SVM than for other types of models, and is not selected on the Analyze tab by default. Selecting this option may slow performance, particularly with large datasets.

SVM Model Settings

Figure 15-9 SVM model, Settings tab

The Settings tab enables you to specify extra fields to be displayed when viewing the results (for example, by executing a Table node attached to the nugget). You can see the effect of each of these options by selecting them and clicking the Preview button; scroll to the right of the Preview output to see the extra fields.

Append all probabilities (valid only for categorical targets).
If this option is checked, probabilities for each possible value of a nominal or flag target field are displayed for each record processed by the node. If this option is unchecked, only the predicted value and its probability are displayed for nominal or flag target fields. The default setting of this check box is determined by the corresponding check box on the modeling node.

Calculate raw propensity scores. For models with a flag target (which return a yes or no prediction), you can request propensity scores that indicate the likelihood of the true outcome specified for the target field. These are in addition to other prediction and confidence values that may be generated during scoring.

Calculate adjusted propensity scores. Raw propensity scores are based only on the training data and may be overly optimistic due to the tendency of many models to overfit this data. Adjusted propensities attempt to compensate by evaluating model performance against a test or validation partition. This option requires that a partition field be defined in the stream and adjusted propensity scores be enabled in the modeling node before generating the model.

Chapter 16 Nearest Neighbor Models

KNN Node

Nearest Neighbor Analysis is a method for classifying cases based on their similarity to other cases. In machine learning, it was developed as a way to recognize patterns of data without requiring an exact match to any stored patterns, or cases. Similar cases are near each other and dissimilar cases are distant from each other. Thus, the distance between two cases is a measure of their dissimilarity. Cases that are near each other are said to be "neighbors." When a new case (holdout) is presented, its distance from each of the cases in the model is computed. The classifications of the most similar cases (the nearest neighbors) are tallied, and the new case is placed into the category that contains the greatest number of nearest neighbors.
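The nearest-neighbor voting just described can be sketched in a few lines of Python. This is an illustrative toy implementation, not the SPSS Modeler algorithm; the training points and the choice of k are made up, and both distance metrics offered by the KNN node (Euclidean and City Block) are included.

```python
from collections import Counter
import math

def distance(a, b, metric="euclidean"):
    """Distance between two cases given as coordinate tuples."""
    if metric == "euclidean":
        # Square root of the sum of squared differences over all dimensions.
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    # City Block (Manhattan): sum of absolute differences over all dimensions.
    return sum(abs(p - q) for p, q in zip(a, b))

def knn_classify(new_case, training, k, metric="euclidean"):
    """Place the new case in the majority category of its k nearest neighbors."""
    nearest = sorted(training, key=lambda rec: distance(new_case, rec[0], metric))[:k]
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]

training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B"), ((6, 6), "B")]

# As in the k = 5 versus k = 9 example, the chosen k can change the result:
print(knn_classify((2, 2), training, k=3))  # → A (the 3 nearest are all "A")
print(knn_classify((2, 2), training, k=7))  # → B (4 of all 7 cases are "B")
```

For a continuous target, the same neighbor search would end with the mean or median of the neighbors' target values instead of a vote.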
You can specify the number of nearest neighbors to examine; this value is called k. The pictures show how a new case would be classified using two different values of k. When k = 5, the new case is placed in category 1 because a majority of the nearest neighbors belong to category 1. However, when k = 9, the new case is placed in category 0 because a majority of the nearest neighbors belong to category 0.

Figure 16-1 The effects of changing k on classification

Nearest neighbor analysis can also be used to compute values for a continuous target. In this situation, the average or median target value of the nearest neighbors is used to obtain the predicted value for the new case.

KNN Node Objectives Options

Figure 16-2 KNN node objectives options

The Objectives tab is where you can choose either to build a model that predicts the value of a target field in your input data based on the values of its nearest neighbors, or simply to find which are the nearest neighbors for a particular case of interest.

What type of analysis do you want to perform?

Predict a target field. Choose this option if you want to predict the value of a target field based on the values of its nearest neighbors.

Only identify the nearest neighbors. Choose this option if you only want to see which are the nearest neighbors for a particular input field. If you choose to identify only the nearest neighbors, the remaining options on this tab relating to accuracy and speed are disabled, as they are relevant only for predicting targets.

What is your objective? When predicting a target field, this group of options lets you decide whether speed, accuracy, or a blend of both are the most important factors. Alternatively, you can choose to customize settings yourself.
If you choose the Balance, Speed, or Accuracy option, the algorithm preselects the most appropriate combination of settings for that option. Advanced users may wish to override these selections; this can be done on the various panels of the Settings tab.

Balance speed and accuracy. Selects the best number of neighbors within a small range.

Speed. Finds a fixed number of neighbors.

Accuracy. Selects the best number of neighbors within a larger range, and uses predictor importance when calculating distances.

Custom analysis. Choose this option to fine-tune the algorithm on the Settings tab.

Note: The size of the resulting KNN model, unlike most other models, increases linearly with the quantity of training data. If you see an "out of memory" error when trying to build a KNN model, try increasing the maximum system memory used by IBM® SPSS® Modeler. To do so, choose Tools > Options > System Options and enter the new size in the Maximum memory field. Changes made in the System Options dialog do not take effect until you restart SPSS Modeler.

KNN Node Settings

The Settings tab is where you specify the options that are specific to Nearest Neighbor Analysis. The sidebar on the left of the screen lists the panels that you use to specify the options.

Model

Figure 16-3 KNN node model options

The Model panel provides options that control how the model is to be built, for example, whether to use partitioning or split models, whether to transform numeric input fields so that they all fall within the same range, and how to manage cases of interest. You can also choose a custom name for the model.

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.
Create split models. Builds a separate model for each possible value of input fields that are specified as split fields. For more information, see the topic Building Split Models in Chapter 3 on p. 30.

To select fields manually... By default, the node uses the partition and split field settings (if any) from the Type node, but you can override those settings here. To activate the Partition and Splits fields, select the Fields tab and choose Use Custom Settings, then return here.

Partition. This field allows you to specify a field used to partition the data into separate samples for the training, testing, and validation stages of model building. By using one sample to generate the model and a different sample to test it, you can get a good indication of how well the model will generalize to larger datasets that are similar to the current data. If multiple partition fields have been defined by using Type or Partition nodes, a single partition field must be selected on the Fields tab in each modeling node that uses partitioning. (If only one partition is present, it is automatically used whenever partitioning is enabled.) Also note that to apply the selected partition in your analysis, partitioning must also be enabled in the Model Options tab for the node. (Deselecting this option makes it possible to disable partitioning without changing field settings.)

Splits. For split models, select the split field or fields. This is similar to setting the field role to Split in a Type node. You can designate only fields of type Flag, Nominal, or Ordinal as split fields. Fields chosen as split fields cannot be used as target, input, partition, frequency, or weight fields. For more information, see the topic Building Split Models in Chapter 3 on p. 30.

Normalize range inputs. Check this box to normalize the values for continuous input fields. Normalized features have the same range of values, which can improve the performance of the estimation algorithm.
Adjusted normalization, [2*(x−min)/(max−min)]−1, is used. Adjusted normalized values fall between −1 and 1.

Use case labels. Check this box to enable the drop-down list, from which you can choose a field whose values will be used as labels to identify the cases of interest in the predictor space chart, peers chart, and quadrant map in the model viewer. You can choose any field with a measurement level of Nominal, Ordinal, or Flag to use as the labeling field. If you do not choose a field here, records are displayed in the model viewer charts with nearest neighbors identified by row number in the source data. If you will be manipulating the data at all after building the model, use case labels to avoid having to refer back to the source data each time to identify the cases in the display.

Identify focal record. Check this box to enable the drop-down list, which allows you to mark an input field of particular interest (for flag fields only). If you specify a field here, the points representing that field are initially selected in the model viewer when the model is built. Selecting a focal record here is optional; any point can temporarily become a focal record when selected manually in the model viewer.

Neighbors

Figure 16-4 KNN node neighbors options

The Neighbors panel has a set of options that control how the number of nearest neighbors is calculated.

Number of Nearest Neighbors (k). Specify the number of nearest neighbors for a particular case. Note that using a greater number of neighbors will not necessarily result in a more accurate model. If the objective is to predict a target, you have two choices:

Specify fixed k. Use this option if you want to specify a fixed number of nearest neighbors to find.

Automatically select k. You can alternatively use the Minimum and Maximum fields to specify a range of values and allow the procedure to choose the "best" number of neighbors within that range.
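The adjusted normalization formula applied by Normalize range inputs, [2*(x−min)/(max−min)]−1, can be checked with a short Python sketch. This is illustrative only; the example values are made up.

```python
def adjusted_normalize(values):
    """Apply [2*(x - min)/(max - min)] - 1, mapping every value into [-1, 1]."""
    lo, hi = min(values), max(values)
    return [2 * (x - lo) / (hi - lo) - 1 for x in values]

# A hypothetical continuous input field before and after normalization.
ages = [20, 35, 50, 65, 80]
print(adjusted_normalize(ages))  # → [-1.0, -0.5, 0.0, 0.5, 1.0]
```

The minimum always maps to −1 and the maximum to 1, so every continuous input contributes on the same scale when distances between neighbors are computed.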
The method for determining the number of nearest neighbors depends upon whether feature selection is requested on the Feature Selection panel:

If feature selection is in effect, then feature selection is performed for each value of k in the requested range, and the k, with its accompanying feature set, that has the lowest error rate (or the lowest sum-of-squares error if the target is continuous) is selected.

If feature selection is not in effect, then V-fold cross-validation is used to select the "best" number of neighbors. See the Cross-Validation panel for control over the assignment of folds.

Distance Computation. This is the metric used to measure the similarity of cases.

Euclidean metric. The distance between two cases, x and y, is the square root of the sum, over all dimensions, of the squared differences between the values for the cases.

City Block metric. The distance between two cases is the sum, over all dimensions, of the absolute differences between the values for the cases. Also called Manhattan distance.

Optionally, if the objective is to predict a target, you can choose to weight features by their normalized importance when computing distances. Feature importance for a predictor is calculated as the ratio of the error rate or sum-of-squares error of the model with the predictor removed, to the error rate or sum-of-squares error for the full model. Normalized importance is calculated by reweighting the feature importance values so that they sum to 1.

Weight features by importance when computing distances. (Displayed only if the objective is to predict a target.) Check this box to cause predictor importance to be used when calculating the distances between neighbors. Predictor importance will then be displayed in the model nugget, and used in predictions (and so will affect scoring). For more information, see the topic Predictor Importance in Chapter 3 on p. 51.

Predictions for Range Target.
(Displayed only if the objective is to predict a target.) If a continuous (numeric range) target is specified, this defines whether the predicted value is computed based upon the mean or the median value of the nearest neighbors.

Feature Selection

Figure 16-5 KNN node feature selection options

This panel is activated only if the objective is to predict a target. It allows you to request and specify options for feature selection. By default, all features are considered for feature selection, but you can optionally select a subset of features to force into the model.

Perform feature selection. Check this box to enable the feature selection options.

Forced entry. Click the field chooser button next to this box and choose one or more features to force into the model.

Stopping Criterion. At each step, the feature whose addition to the model results in the smallest error (computed as the error rate for a categorical target and sum-of-squares error for a continuous target) is considered for inclusion in the model set. Forward selection continues until the specified condition is met.

Stop when the specified number of features have been selected. The algorithm adds a fixed number of features in addition to those forced into the model. Specify a positive integer. Decreasing the number to select creates a more parsimonious model, at the risk of missing important features. Increasing the number to select will capture all the important features, at the risk of eventually adding features that actually increase the model error.

Stop when the change in the absolute error ratio is less than or equal to the minimum. The algorithm stops when the change in the absolute error ratio indicates that the model cannot be further improved by adding more features. Specify a positive number.
Decreasing values of the minimum change will tend to include more features, at the risk of including features that do not add much value to the model. Increasing the value of the minimum change will tend to exclude more features, at the risk of losing features that are important to the model. The "optimal" value of the minimum change will depend upon your data and application. See the Feature Selection Error Log in the output to help you assess which features are most important. For more information, see the topic Predictor Selection Error Log on p. 480.

Cross-Validation

Figure 16-6 KNN node cross-validation options

This panel is activated only if the objective is to predict a target. The options on this panel control whether to use cross-validation when calculating the nearest neighbors.

Cross-validation divides the sample into a number of subsamples, or folds. Nearest neighbor models are then generated, excluding the data from each subsample in turn. The first model is based on all of the cases except those in the first sample fold, the second model is based on all of the cases except those in the second sample fold, and so on. For each model, the error is estimated by applying the model to the subsample excluded in generating it. The "best" number of nearest neighbors is the one that produces the lowest error across folds.

Cross-Validation Folds. V-fold cross-validation is used to determine the "best" number of neighbors. It is not available in conjunction with feature selection, for performance reasons.

Randomly assign cases to folds. Specify the number of folds that should be used for cross-validation. The procedure randomly assigns cases to folds, numbered from 1 to V, the number of folds.

Set random seed. When estimating the accuracy of a model based on a random percentage, this option allows you to duplicate the same results in another session.
By specifying the starting value used by the random number generator, you can ensure the same records are assigned each time the node is executed. Enter the desired seed value. If this option is not selected, a different sample will be generated each time the node is executed.

Use field to assign cases. Specify a numeric field that assigns each case in the active dataset to a fold. The field must be numeric and take values from 1 to V. If any values in this range are missing (or missing on any split fields, if split models are in effect), this will cause an error.

Analyze

Figure 16-7 KNN node analyze options

The Analyze panel is activated only if the objective is to predict a target. You can use it to specify whether the model is to include additional variables to contain:

probabilities for each possible target field value
distances between a case and its nearest neighbors
raw and adjusted propensity scores (for flag targets only)

Append all probabilities. If this option is checked, probabilities for each possible value of a nominal or flag target field are displayed for each record processed by the node. If this option is unchecked, only the predicted value and its probability are displayed for nominal or flag target fields.

Save distances between cases and k nearest neighbors. For each focal record, a separate variable is created for each of the focal record's k nearest neighbors (from the training sample) and the corresponding k nearest distances.

Propensity Scores

Propensity scores can be enabled in the modeling node, and on the Settings tab in the model nugget. This functionality is available only when the selected target is a flag field. For more information, see the topic Propensity Scores in Chapter 3 on p. 41.

Calculate raw propensity scores. Raw propensity scores are derived from the model based on the training data only.
If the model predicts the true value (will respond), then the propensity is the same as P, where P is the probability of the prediction. If the model predicts the false value, then the propensity is calculated as (1 − P). If you choose this option when building the model, propensity scores will be enabled in the model nugget by default. However, you can always choose to enable raw propensity scores in the model nugget, whether or not you select them in the modeling node. When scoring the model, raw propensity scores will be added in a field with the letters RP appended to the standard prefix. For example, if the predictions are in a field named $R-churn, the name of the propensity score field will be $RRP-churn.

Calculate adjusted propensity scores. Raw propensities are based purely on estimates given by the model, which may be overfitted, leading to over-optimistic estimates of propensity. Adjusted propensities attempt to compensate by looking at how the model performs on the test or validation partitions and adjusting the propensities to give a better estimate accordingly. This setting requires that a valid partition field is present in the stream. Unlike raw confidence scores, adjusted propensity scores must be calculated when building the model; otherwise, they will not be available when scoring the model nugget. When scoring the model, adjusted propensity scores will be added in a field with the letters AP appended to the standard prefix. For example, if the predictions are in a field named $R-churn, the name of the propensity score field will be $RAP-churn. Adjusted propensity scores are not available for logistic regression models.

When calculating the adjusted propensity scores, the test or validation partition used for the calculation must not have been balanced. To avoid this, be sure the Only balance training data option is selected in any upstream Balance nodes.
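The raw propensity rule described above (the propensity equals P when the model predicts the true value, and 1 − P otherwise) can be sketched as a small Python function. This is an illustration of the rule only; the flag values and probabilities are made-up examples.

```python
def raw_propensity(predicted_value, probability, true_value="T"):
    """Likelihood of the 'true' outcome for a flag target.

    If the model predicts the true value, the propensity is the prediction
    probability P; if it predicts the false value, it is 1 - P.
    """
    if predicted_value == true_value:
        return probability
    return 1 - probability

# A record predicted "T" with probability 0.8 and a record predicted "F"
# with probability 0.7 get raw propensities of 0.8 and 0.3 respectively.
print(raw_propensity("T", 0.8))            # → 0.8
print(round(raw_propensity("F", 0.7), 2))  # → 0.3
```

Either way the score expresses the same thing: how likely the "true" outcome is for that record, which is what makes propensities comparable across records regardless of the predicted value.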
In addition, if a complex sample has been taken upstream, this will invalidate the adjusted propensity scores. Adjusted propensity scores are not available for "boosted" tree and rule set models. For more information, see the topic Boosted C5.0 Models in Chapter 6 on p. 171.

Based on. For adjusted propensity scores to be computed, a partition field must be present in the stream. You can specify whether to use the testing or validation partition for this computation. For best results, the testing or validation partition should include at least as many records as the partition used to train the original model.

KNN Model Nugget

The KNN model creates a number of new fields, as shown in the following table. To see these fields and their values, add a Table node to the KNN model nugget and execute the Table node, or click the Preview button on the nugget.

Table 16-1 KNN model fields

$KNN-fieldname    Predicted value of target field.
$KNNP-fieldname   Probability of predicted value.
$KNNP-value       Probability of each possible value of a nominal or flag field. Included only if Append all probabilities is checked on the Settings tab of the model nugget.
$KNN-neighbor-n   The name of the nth nearest neighbor to the focal record. Included only if Display Nearest on the Settings tab of the model nugget is set to a non-zero value.
$KNN-distance-n   The relative distance from the focal record of the nth nearest neighbor. Included only if Display Nearest on the Settings tab of the model nugget is set to a non-zero value.

Model View

Figure 16-8 Nearest Neighbor Analysis Model View

The model view has a 2-panel window:

The first panel displays an overview of the model, called the main view.
The second panel displays one of two types of views:
An auxiliary model view shows more information about the model, but is not focused on the model itself.
A linked view shows details about one feature of the model when the user drills down on part of the main view.

By default, the first panel shows the predictor space and the second panel shows the predictor importance chart. If the predictor importance chart is not available (that is, when Weight features by importance was not selected on the Neighbors panel of the Settings tab), the first available view in the View dropdown is shown.

Figure 16-9 Nearest Neighbor Analysis Model View dropdown

When a view has no available information, it is omitted from the View dropdown.

Predictor Space

Figure 16-10 Predictor Space

The predictor space chart is an interactive graph of the predictor space (or a subspace, if there are more than 3 predictors). Each axis represents a predictor in the model, and the location of points in the chart shows the values of these predictors for cases in the training and holdout partitions.

Keys. In addition to the predictor values, points in the plot convey other information. Shape indicates the partition to which a point belongs, either Training or Holdout. The color/shading of a point indicates the value of the target for that case, with distinct color values equal to the categories of a categorical target, and shades indicating the range of values of a continuous target. The indicated value for the training partition is the observed value; for the holdout partition, it is the predicted value. If no target is specified, this key is not shown. Heavier outlines indicate that a case is focal. Focal records are shown linked to their k nearest neighbors.

Controls and Interactivity. A number of controls in the chart allow you to explore the predictor space. You can choose which subset of predictors to show in the chart and change which predictors are represented on the dimensions. "Focal records" are simply points selected in the Predictor Space chart.
If you specified a focal record variable, the points representing the focal records will initially be selected. However, any point can temporarily become a focal record if you select it. The "usual" controls for point selection apply: clicking on a point selects that point and deselects all others; Control-clicking on a point adds it to the set of selected points. Linked views, such as the Peers Chart, will automatically update based upon the cases selected in the Predictor Space. You can change the number of nearest neighbors (k) to display for focal records. Hovering over a point in the chart displays a tooltip with the value of the case label, or case number if case labels are not defined, and the observed and predicted target values. A "Reset" button allows you to return the Predictor Space to its original state.

Changing the Axes on the Predictor Space Chart

You can control which features are displayed on the axes of the Predictor Space chart. To change the axis settings:

1. Click the Edit Mode button (paintbrush icon) in the left-hand panel to select Edit mode for the Predictor Space.
2. Change the view (to anything) on the right-hand panel. The Show zones panel appears between the two main panels.
3. Click the Show zones check box.
4. Click any data point in the Predictor Space.
5. To replace an axis with a predictor of the same data type: drag the new predictor over the zone label (the one with the small X button) of the one you want to replace.
6. To replace an axis with a predictor of a different data type: on the zone label of the predictor you want to replace, click the small X button. The predictor space changes to a two-dimensional view. Drag the new predictor over the Add dimension zone label.
7. Click the Explore Mode button (arrowhead icon) in the left-hand panel to exit from Edit mode.
Predictor Importance

Figure 16-11 Predictor Importance

Typically, you will want to focus your modeling efforts on the predictor fields that matter most and consider dropping or ignoring those that matter least. The predictor importance chart helps you do this by indicating the relative importance of each predictor in estimating the model. Because the values are relative, the sum of the values for all predictors in the display is 1.0. Predictor importance does not relate to model accuracy. It relates only to the importance of each predictor in making a prediction, not to whether the prediction is accurate.

Nearest Neighbor Distances

Figure 16-12 Nearest Neighbor Distances

This table displays the k nearest neighbors and their distances for focal records only. It is available if a focal record identifier is specified on the modeling node, and it displays only the focal records identified by that variable. Each row of the table describes one focal record:

The Focal Record column contains the value of the case labeling variable for the focal record; if case labels are not defined, this column contains the case number of the focal record.
The ith column under the Nearest Neighbors group contains the value of the case labeling variable for the ith nearest neighbor of the focal record; if case labels are not defined, this column contains the case number of the ith nearest neighbor.
The ith column under the Nearest Distances group contains the distance of the ith nearest neighbor to the focal record.

Peers

Figure 16-13 Peers Chart

This chart displays the focal cases and their k nearest neighbors on each predictor and on the target. It is available if a focal case is selected in the Predictor Space. The Peers chart is linked to the Predictor Space in two ways:

Cases selected (focal) in the Predictor Space are displayed in the Peers chart, along with their k nearest neighbors.
The value of k selected in the Predictor Space is used in the Peers chart.

Select Predictors. Enables you to select the predictors to display in the Peers chart.

Quadrant Map

Figure 16-14 Quadrant Map

This chart displays the focal cases and their k nearest neighbors on a scatterplot (or dotplot, depending upon the measurement level of the target) with the target on the y-axis and a scale predictor on the x-axis, paneled by predictors. It is available if there is a target and if a focal case is selected in the Predictor Space. Reference lines are drawn for continuous variables at the variable means in the training partition.

Select Predictors. Enables you to select the predictors to display in the Quadrant Map.

Predictor Selection Error Log

Figure 16-15 Predictor Selection

Points on the chart display the error (either the error rate or the sum-of-squares error, depending upon the measurement level of the target) on the y-axis for the model with the predictor listed on the x-axis (plus all features to the left on the x-axis). This chart is available if there is a target and feature selection is in effect.

Classification Table

Figure 16-18 Classification Table

This table displays the cross-classification of observed versus predicted values of the target, by partition. It is available if there is a target and it is categorical (flag, nominal, or ordinal). The (Missing) row in the Holdout partition contains holdout cases with missing values on the target. These cases contribute to the Holdout Sample: Overall Percent values but not to the Percent Correct values.

Error Summary

Figure 16-19 Error Summary

This table is available if there is a target variable. It displays the error associated with the model: the sum-of-squares error for a continuous target, and the error rate (100% − overall percent correct) for a categorical target.
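The quantities in the Classification Table and Error Summary above can be derived directly from observed and predicted target values. The following is a minimal illustrative sketch for a categorical target, not SPSS Modeler's implementation; the observed and predicted values are hypothetical.

```python
from collections import Counter

# Illustrative sketch only -- not SPSS Modeler's implementation.
# Cross-classify observed versus predicted values of a categorical
# target, then derive the error rate as 100% minus the overall
# percent correct. The data below are hypothetical.

def classification_summary(observed, predicted):
    # (observed, predicted) -> cell count of the cross-classification
    counts = Counter(zip(observed, predicted))
    correct = sum(n for (obs, pred), n in counts.items() if obs == pred)
    percent_correct = 100.0 * correct / len(observed)
    return counts, percent_correct, 100.0 - percent_correct

observed  = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "no", "yes"]
counts, percent_correct, error_rate = classification_summary(observed, predicted)
print(percent_correct)  # 80.0
print(error_rate)       # 20.0
```

A by-partition version of the table would simply apply the same function separately to the training and holdout subsets of the cases.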
KNN Model Settings

Figure 16-20 KNN model nugget settings

The Settings tab enables you to specify extra fields to be displayed when viewing the results (for example, by executing a Table node attached to the nugget). You can see the effect of each of these options by selecting them and clicking the Preview button; scroll to the right of the Preview output to see the extra fields.

Append all probabilities (valid only for categorical targets). If this option is checked, probabilities for each possible value of a nominal or flag target field are displayed for each record processed by the node. If this option is unchecked, only the predicted value and its probability are displayed for nominal or flag target fields. The default setting of this check box is determined by the corresponding check box on the modeling node.

Calculate raw propensity scores. For models with a flag target (which return a yes or no prediction), you can request propensity scores that indicate the likelihood of the true outcome specified for the target field. These are in addition to other prediction and confidence values that may be generated during scoring.

Calculate adjusted propensity scores. Raw propensity scores are based only on the training data and may be overly optimistic due to the tendency of many models to overfit this data. Adjusted propensities attempt to compensate by evaluating model performance against a test or validation partition. This option requires that a partition field be defined in the stream and adjusted propensity scores be enabled in the modeling node before generating the model.

Display nearest. If you set this value to n, where n is a positive integer, the n nearest neighbors to the focal record are included in the model, together with their relative distances from the focal record.

Appendix A
Notices

This information was developed for products and services offered worldwide.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785, U.S.A. For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to: Intellectual Property Licensing, Legal and Intellectual Property Law, IBM Japan Ltd., 1623-14, Shimotsuruma, Yamato-shi, Kanagawa 242-8502 Japan. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. 
Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact: IBM Software Group, Attention: Licensing, 233 S. Wacker Dr., Chicago, IL 60606, USA. Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee. The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us. Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary.
Users of this document should verify the applicable data for their specific environment. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. All statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. If you are viewing this information softcopy, the photographs and color illustrations may not appear.

Trademarks

IBM, the IBM logo, ibm.com, and SPSS are trademarks of IBM Corporation, registered in many jurisdictions worldwide. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries.
Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Other product and service names might be trademarks of IBM or other companies. 485 Notices Index absolute confidence difference to prior apriori evaluation measure, 382 add model rules, 225 additional information panel decision tree models, 168 additive outliers, 421 patches, 421 Time Series Modeler, 436 adjusted propensity scores balancing data, 41 decision list models, 213 discriminant models, 292 generalized linear models, 305 adjusted R-square in linear models, 244 advanced output Cox regression models, 342 Factor/PCA node, 281 advanced parameters, 222 Akaike information criterion in linear models, 244 algorithms, 43, 117 alternative models, 227 Alternative Rules pane, 225 Alternatives tab, 216 analysis of variance in generalized linear mixed models, 307 anomaly detection models, 82 adjustment coefficient, 80 anomaly fields, 79, 84 anomaly index, 79 cutoff value, 79, 83 missing values, 80 modeling node, 77 noise level, 80 peer groups, 80, 83 scoring, 81, 84 ANOVA in linear models, 254 antecedent rules without, 387 application examples, 4 apriori models evaluation measures, 381 expert options, 381 modeling node, 379 modeling node options, 380 tabular versus transactional data, 36 ARIMA models, 424 autoregressive orders, 433 constant, 433 criteria in time series models, 432 differencing orders, 433 moving average orders, 433 outliers, 436 seasonal orders, 433 © Copyright IBM Corporation 1994, 2012. 
transfer functions, 434 assess a model, 230 assessment in Excel, 232 association rule models, 170, 175–176, 377, 409, 411, 413–414 apriori, 379 CARMA, 383 deploying, 401 for sequences, 404 generating a filtered model, 399 generating a rule set, 398 graph generation, 393 IBM InfoSphere Warehouse, 36 model nugget, 388 model nugget details, 388 model nugget summary, 397 scoring rules, 400 settings, 395 specifying filters, 391 transposing scores, 401 asymptotic correlations logistic regression models, 268, 276 asymptotic covariance logistic regression models, 268 auto classifier models, 86 algorithm settings, 87 discarding models, 96 evaluation charts, 113 evaluation graphs, 114 generating modeling nodes and nuggets, 112 introduction, 89 model nugget, 110 model types, 93 modeling node, 89, 91 partitions, 93 ranking models, 91 results browser window, 110 settings, 97 stopping rules, 88 auto cluster models, 86 algorithm settings, 87 discarding models, 109 evaluation charts, 113 generating modeling nodes and nuggets, 112 model nugget, 110 model types, 107 modeling node, 105 partitions, 107 ranking models, 105 results browser window, 110 stopping rules, 88 Auto Cluster models modeling node, 104 auto numeric models, 86 algorithm settings, 87 evaluation charts, 113 486 487 Index evaluation graphs, 114 generating modeling nodes and nuggets, 112 model nugget, 110 model types, 101 modeling node, 98–99 modeling options, 99 results browser window, 110 settings, 104 stopping rules, 88, 101 autocorrelation function series, 422 automated modeling nodes auto classifier models, 86 auto cluster models, 86 auto numeric models, 86 automatic data preparation in linear models, 249 autoregression ARIMA models, 433 available fields, 223 bagging, 148 in linear models, 241 in neural networks, 191 banding continuous variables, 118 base category Logistic node, 264 basket data, 378, 400–401 Bayesian network models expert options, 183 model nugget, 185 model nugget settings, 186 model nugget 
summary, 187 model options, 181 modeling node, 179 best subsets in linear models, 244 binomial logistic regression models, 259–260 Bonferroni adjustment CHAID node, 158 boosting, 148, 162, 171 in linear models, 241 in neural networks, 191 Box’s M test Discriminant node, 288 Build Rule node , 164 build selections defining, 220 C&R Tree models case weights, 38 ensembling, 151 field options, 145 frequency weights, 38 graph generation from model nugget, 172 impurity measures, 155 misclassification costs, 152 model nugget, 164 modeling node, 119, 141, 143, 168, 170 objectives, 146 prior probabilities, 153 pruning, 149 stopping options, 150 surrogates, 150 tree depth, 149 C5.0 models, 117 boosting, 162, 171 graph generation from model nugget, 172 misclassification costs, 162 model nugget, 164, 175–176 modeling node, 160, 162, 168, 170–171 options, 162 parallel processing, 161, 164 performance, 161, 164 pruning, 162 CARMA models content field(s), 383 data formats, 383 expert options, 387 field options, 383 ID field, 383 modeling node, 383 modeling node options, 386 multiple consequents, 401 tabular versus transactional data, 387 time field, 383 category merging, 118 CHAID models, 117 ensembling, 151 exhaustive CHAID, 149 field options, 145 graph generation from model nugget, 172 misclassification costs, 154 model nugget, 164 modeling node, 119, 141, 144, 168, 170 objectives, 146 stopping options, 150 tree depth, 149 change target value, 229 chart options, 236 chi-square CHAID node, 158 feature selection, 73 classification gains decision trees, 127, 130 classification table in Nearest Neighbor Analysis, 480 logistic regression models, 268 classification trees, 143–144, 160 cluster analysis anomaly detection, 80 number of clusters, 359 cluster viewer about cluster models, 361 488 Index basic view, 367 cell content display, 367 cell distribution view, 370 cluster centers view, 365 cluster comparison view, 371 cluster display sort, 367 cluster predictor importance view, 368 
cluster sizes view, 369 clusters view, 365 comparison of clusters, 371 distribution of cells, 370 feature display sort, 366 flip clusters and features, 366 graph generation, 374 model summary, 364 overview, 362 predictor importance, 368 size of clusters, 369 sort cell contents, 367 sort clusters, 367 sort features, 366 summary view, 364 transpose clusters and features, 366 using, 372 clustering, 347–348, 354, 357–358, 361 overall display, 362 viewing clusters, 362 coefficient of variance screening fields, 72 combining rules in linear models, 245 in neural networks, 195 confidence Apriori node, 380 association rules, 390–391, 413 CARMA node, 386 decision tree models, 167 for sequences, 411 Sequence node, 406 confidence difference apriori evaluation measure, 382 confidence intervals logistic regression models, 268 confidence ratio apriori evaluation measure, 382 confidence scores, 41 confidences decision tree models, 170 logistic regression models, 274 rule sets, 170 consequent multiple consequents, 387 content field(s) CARMA node, 383 Sequence node, 405 continuous variables segmenting, 118 contrast coefficients matrix generalized linear models, 302 convergence options CHAID node, 158 Cox regression models, 341 generalized linear models, 301 logistic regression models, 267 copying model links, 45 correlation matrix generalized linear models, 302 costs decision trees, 152, 154 covariance matrix generalized linear models, 302 Cox regression models, 345 advanced output, 342, 345 convergence criteria, 341 expert options, 341 field options, 337 model nugget, 345 model options, 338 modeling node, 336 settings options, 344 stepping criteria, 343 Cramér’s V feature selection, 73 custom splits decision trees, 121–123 customize a model, 228 data reduction, 118 PCA/factor models, 277 decision list models alternatives tab, 216 binning method, 211 excluding segments, 205 expert options, 211 mailing lists, 204 model options, 209 modeling node, 204 PMML, 212 requirements, 209 
scoring, 205, 212 search direction, 209 search width, 211 segments, 212 settings, 213 snapshots tab, 218 SQL generation, 213 target value, 209 viewer workspace, 213 working model pane, 214 working with viewer, 219 decision tree models, 116, 119–120, 124, 141, 143–145, 160, 164, 168, 172 additional information panel, 168 custom splits, 121 489 Index exporting results, 139 gains, 126–127, 130, 133 generating, 135–136 graph generation, 172 misclassification costs, 152, 154 model nugget, 165 modeling node, 139 predictor importance, 165 predictors, 122 profits, 129 ROI, 129 rule frequencies, 168 stopping options, 150 surrogates, 123, 168 tree rules, 165 viewer, 168 deleting model links, 44 deployability measure, 390 descriptive statistics generalized linear models, 302 difference of confidence quotient to 1 apriori evaluation measure, 382 differencing transformation, 423 ARIMA models, 433 dimension reduction, 348 direct oblimin rotation PCA/factor models, 280 directives decision trees, 138 discriminant models advanced output, 288, 292 convergence criteria, 286 expert options, 286 model form, 286 model nugget, 291–293 modeling node, 285 propensity scores, 292 scoring, 291 stepping criteria (field selection), 290 documentation, 4 DTD, 66 edit advanced parameters, 222 eigenvalues PCA/factor models, 279 ensemble viewer, 53 automatic data preparation, 61 component model accuracy, 58 component model details, 60 model summary, 55 predictor frequency, 57 predictor importance, 56 ensembles in linear models, 245 in neural networks, 195 epsilon for convergence CHAID node, 158 equamax rotation PCA/factor models, 280 error summary in Nearest Neighbor Analysis, 481 evaluation charts from auto classifier models, 113 from auto cluster models, 113 from auto numeric models, 113 evaluation graphs from auto classifier models, 114 from auto numeric models, 114 evaluation measures Apriori node, 381 events identifying, 420 examples Applications Guide, 4 overview, 5 exhaustive CHAID, 117, 119, 
149 expert modeler criteria in time series models, 429 outliers, 430 expert options Apriori node, 381 Bayesian network node, 183 CARMA node, 387 Cox regression models, 341 k-means models, 356 Kohonen models, 352 Sequence node, 407 expert output Cox regression models, 342 exponential smoothing, 424 criteria in time series models, 431 exporting model nuggets, 47 PMML, 65, 67 SQL, 50 F statistic feature selection, 73 in linear models, 244 factor models advanced output, 284 eigenvalues, 279 equations, 281 expert options, 279 factor scores, 279 iterations, 279 missing-value handling, 279 model nugget, 281–282, 284 model options, 278 modeling node, 277 number of factors, 279 rotation, 280 490 Index feature selection models, 74, 76 generating Filter nodes, 76 importance, 70–71, 74 ranking predictors, 70–71, 74 screening predictors, 70–71, 74 field importance filtering fields, 52 model results, 39, 51–52 ranking fields, 70, 72–74, 76 field options Cox node, 337 modeling nodes, 35 SLRM node, 447 Filter node generating from decision trees, 139 filtering rules, 391, 413 association rules, 391 first hit rule set, 175 focal records, 466 folds, cross-validation, 470 forecasting overview, 417 predictor series, 423 forward stepwise in linear models, 244 fraud detection anomaly detection, 77 frequencies decision tree models, 168 frequency fields, 38 functional transformation, 423 gains chart, 235 decision trees, 126–127, 130 exporting, 139 gains-based selection, 133 general estimable function generalized linear models, 302 general linear model generalized linear mixed models, 307 generalized linear mixed models, 307 analysis weight, 319 classification table, 327 covariance parameters, 333 custom terms, 314 data structure, 325 estimated marginal means, 322 estimated means, 334 fixed coefficients, 330 fixed effects, 313, 328 link function, 310 model summary, 324 model view, 323 offset, 319 predicted by observed, 326 random effect block, 317 random effect covariances, 332 random 
effects, 316 scoring options, 321 settings, 335 target distribution, 310 generalized linear model in generalized linear mixed models, 307 generalized linear models advanced output, 302, 305 convergence options, 301 expert options, 297 fields, 295 model form, 296 model nugget, 303, 306 modeling node, 294 propensity scores, 305 generate new model, 229 generated sequence rule set, 399 getting started, 213 Gini impurity measure, 155 goodness-of-fit statistics generalized linear models, 302 logistic regression models, 276 graph generation association rules, 393 hierarchical models generalized linear mixed models, 307 history decision tree models, 168 hits decision tree gains, 126 Hosmer and Lemeshow goodness-of-fit logistic regression models, 276 IBM InfoSphere Warehouse (ISW) PMML export, 67 IBM SPSS Modeler, 1 documentation, 4 IBM SPSS Statistics models, 29 ID field CARMA node, 383 Sequence node, 405 importance filtering fields, 52 predictors in models, 39, 51–52 ranking predictors, 70, 72–74, 76 importing PMML, 47, 66–67 impurity measures C&R Tree node, 155 decision trees, 155 index decision tree gains, 126 information criteria in linear models, 244 491 Index information difference apriori evaluation measure, 382 innovational outliers, 421 Time Series Modeler, 436 input fields screening, 72 selecting for analysis, 72 instances, 390, 412 decision tree models, 167 integration ARIMA models, 433 interaction identification, 118 interactions logistic regression models, 265 interactive trees, 116, 119–120, 122, 124 custom splits, 121 exporting results, 139 gains, 126–127, 130, 133 generating models, 135–136 graph generation, 172 profits, 129 ROI, 129 surrogates, 123 interventions identifying, 420 iteration history generalized linear models, 302 logistic regression models, 268 k-means models, 347, 354–356 clustering, 354, 357 distance field, 355 encoding value for sets, 356 expert options, 356 model nugget, 357 stopping criteria, 356 K-Means models graph generation from 
model nugget, 374 kernel functions support vector machine models, 455 KNN. See nearest neighbor models, 462 Kohonen models, 347–348, 350, 352 binary set encoding option (removed), 350 expert options, 352 feedback graph, 350 graph generation from model nugget, 374 learning rate, 352 model nugget, 353 modeling node, 348 neighborhood, 348, 352 neural networks, 348, 353 stopping criteria, 350 L matrix generalized linear models, 302 labels value, 66 variable, 66 lag ACF and PACF, 422 Lagrange multiplier test generalized linear models, 303 lambda feature selection, 73 legal notices, 483 level shift outliers, 421 Time Series Modeler, 436 level stabilizing transformation, 423 lift, 390 association rules, 391 decision tree gains, 126 lift charts decision tree gains, 131 likelihood ratio test logistic regression models, 268, 276 likelihood-ratio chi-square CHAID node, 158 feature selection, 73 linear kernel support vector machine models, 455 linear models, 239 ANOVA table, 254 automatic data preparation, 242, 249 coefficients, 255 combining rules, 245 confidence level, 242 ensembles, 245 estimated means, 257 information criterion, 248 model building summary, 258 model options, 247 model selection, 244 model summary, 248 nugget settings, 259 objectives, 241 outliers, 253 predicted by observed, 251 predictor importance, 250 R-square statistic, 248 replicating results, 246 residuals, 252 linear regression models, 238 modeling node, 239 weighted least squares, 38 linear trends identifying, 418 linearnode node, 239 link function generalized linear mixed models, 310 links model, 43 loading model nuggets, 47 492 Index local trend outliers, 422 Time Series Modeler, 436 log transformation, 423 Time Series Modeler, 434 log-odds logistic regression models, 272 logistic regression generalized linear mixed models, 307 logistic regression models, 238 adding terms, 265 advanced output, 268, 276 binomial options, 260 convergence options, 267 expert options, 266 interactions, 265 main 
effects, 265 model equations, 272 model nugget, 271–274 modeling node, 259 multinomial options, 260 predictor importance, 272 stepping options, 270 loglinear analysis in generalized linear mixed models, 307 longitudinal models generalized linear mixed models, 307 mailing lists decision list models, 204 main effects logistic regression models, 265 managers Models tab, 47 mining task starting, 220 mining tasks, 219 creating, 220 decision list models, 204 editing, 221 misclassification costs C5.0 node, 162 decision trees, 95, 152, 154 missing data predictor series, 423 missing values CHAID trees, 122 excluding from SQL, 170 screening fields, 72 mixed models generalized linear mixed models, 307 MLP (multilayer perceptron) in neural networks, 193 model fit logistic regression models, 276 model information generalized linear models, 302 model links, 43 and SuperNodes, 45 copying and pasting, 45 defining and removing, 44 model measures defining, 230 refresh, 231 model nuggets, 43, 69, 164, 170–171, 175–176, 306 ensemble models, 53 exporting, 47, 49 generating processing nodes, 63 menus, 49 printing, 49 saving, 49 saving and loading, 47 scoring data with, 63 split models, 61 Summary tab, 50 using in streams, 63 model options Bayesian network node, 181 Cox regression models, 338 SLRM node, 448 model refresh self-learning response models, 448 model view in generalized linear mixed models, 323 in Nearest Neighbor Analysis, 474 modeling nodes, 24, 77, 160, 179, 348, 354, 358, 379, 404, 446 models ARIMA, 433 importing, 47 replacing, 46 split, 30, 33–34 Summary tab, 50 models palette, 43, 47 moving average ARIMA models, 433 MS Excel setup integration format, 233 multilayer perceptron (MLP) in neural networks, 193 multilevel models generalized linear mixed models, 307 multinomial logistic regression generalized linear mixed models, 307 multinomial logistic regression models, 259–260 natural log transformation, 423 Time Series Modeler, 434 Nearest Neighbor Analysis model view, 474 
Index

nearest neighbor distances
    in Nearest Neighbor Analysis, 477
nearest neighbor models
    about, 462
    analyze options, 471
    cross-validation options, 470
    feature selection options, 469
    model options, 465
    modeling node, 462
    neighbors options, 467
    objectives options, 463
    settings options, 464
neural network models
    field options, 35
neural networks, 189
    classification, 201
    combining rules, 195
    ensembles, 195
    hidden layers, 193
    missing values, 196
    model options, 197
    model summary, 198
    multilayer perceptron (MLP), 193
    network, 202
    nugget settings, 203
    objectives, 191
    overfit prevention, 196
    predicted by observed, 200
    predictor importance, 199
    radial basis function (RBF), 193
    replicating results, 196
    stopping rules, 194
neuralnetwork node, 189
nodeName node, 307
nominal regression, 259
nonlinear trends
    identifying, 418
nonseasonal cycles, 419
normalized chi-square
    apriori evaluation measure, 382
optimizing performance, 351, 356, 381
ordered twoing impurity measure, 155
organize data selections, 224
outliers, 420
    additive patches, 421
    ARIMA models, 436
    deterministic, 420
    expert modeler, 430
    identifying, 77
    in series, 420
    in time series models, 436
    innovational, 421
    level shift, 421
    local trend, 422
    seasonal additive, 421
    transient change, 421
overfit prevention
    in neural networks, 196
overfit prevention criterion
    in linear models, 244
overfitting
    SVM model, 456
p value, 73
parallel processing
    C5.0 models, 161, 164
parameter estimates
    generalized linear models, 302
    logistic regression models, 276
parameters
    in time series models, 442
partial autocorrelation function
    series, 422
partitions, 37, 384, 406, 465
    model building, 91, 100, 106, 162, 181, 209, 260, 278, 286, 296, 338, 350, 355, 359, 406, 448, 458, 465
    selecting, 37, 384, 406, 465
PCA models
    advanced output, 284
    eigenvalues, 279
    equations, 281
    expert options, 279
    factor scores, 279
    iterations, 279
    missing-value handling, 279
    model nugget, 281–282, 284
    model options, 278
    modeling node, 277
    number of factors, 279
    rotation, 280
Pearson chi-square
    CHAID node, 158
    feature selection, 73
peer groups
    anomaly detection, 80
peers
    in Nearest Neighbor Analysis, 478
performance
    C5.0 models, 161, 164
performance enhancements, 270, 351, 356, 381
periodicity
    Time Series Modeler, 434
PMML
    exporting models, 47, 65, 67
    importing models, 47, 66–67
point interventions
    identifying, 420
Poisson regression
    generalized linear mixed models, 307
predictor importance
    decision tree models, 165
    discriminant models, 291
    filtering fields, 52
    generalized linear models, 303
    in Nearest Neighbor Analysis, 477
    linear models, 250
    logistic regression models, 272
    model results, 39, 51–52
    neural networks, 199
predictor selection
    in Nearest Neighbor Analysis, 480
predictor series, 423
    missing data, 423
predictor space chart
    in Nearest Neighbor Analysis, 475
predictors
    decision trees, 122
    ranking importance, 70, 72–74, 76
    screening, 70, 74, 76
    selecting for analysis, 70, 72, 74, 76
    surrogates, 123
preview model contents, 50
principal components analysis. See PCA models, 277, 281
prior probabilities, 153
    decision trees, 153
probabilities
    logistic regression models, 272
probit analysis
    generalized linear mixed models, 307
profits
    decision tree gains, 129
promax rotation
    PCA/factor models, 280
propensity scores
    balancing data, 41
    decision list models, 213
    discriminant models, 292
    generalized linear models, 305
pruning decision trees, 143, 149
pseudo R-square
    logistic regression models, 276
pulses
    in series, 420
quadrant map
    in Nearest Neighbor Analysis, 479
quartimax rotation
    PCA/factor models, 280
QUEST models, 117
    ensembling, 151
    field options, 145
    graph generation from model nugget, 172
    misclassification costs, 152
    model nugget, 164
    modeling node, 119, 141, 144, 168, 170
    objectives, 146
    prior probabilities, 153
    pruning, 149
    stopping options, 150
    surrogates, 150
    tree depth, 149
R-square
    in linear models, 248
radial basis function (RBF)
    in neural networks, 193
ranking predictors, 70, 72–74, 76
raw propensity scores, 41
RBF (radial basis function)
    in neural networks, 193
reference category
    Logistic node, 264
refreshing
    measures, 231
refreshing models
    self-learning response models, 448
regression gains
    decision trees, 130, 133
regression models
    modeling node, 239
regression trees, 143–144
removing model links, 44
replacing models, 46
residuals
    in time series models, 443
response charts
    decision tree gains, 126, 132
risk estimate
    decision tree gains, 134
risks
    exporting, 139
ROI
    decision tree gains, 129
rotation
    PCA/factor models, 280
rule conditions
    decision list models, 204
rule ID, 391
rule induction, 116, 143–144, 160, 379
rule set, 139, 170, 175–176, 395, 398–399
    generating from decision trees, 139
Rule SuperNode
    generating from sequence rules, 415
rules
    association rules, 379, 383
    rule support, 390, 413
run a mining task, 220
score statistic, 268, 270
scoring data, 63
screening input fields, 72
screening predictors, 70, 74, 76
seasonal additive outliers, 421
    Time Series Modeler, 436
seasonal differencing
    transformation, 423
    ARIMA models, 433
seasonal orders
    ARIMA models, 433
seasonality, 419
    identifying, 418
segment rule generation, 220
segmentation, 118
segments
    copy, 227
    decision list models, 204
    deleting, 228
    deleting rule conditions, 226
    editing, 226
    excluding, 229
    inserting, 225
    prioritizing, 228
Select node
    generating from decision trees, 139
self-learning response models
    field options, 447
    model nugget, 451
    model refresh, 448
    modeling node, 446
    preferences for target fields, 451, 454
    randomization of results, 450, 454
    settings, 450, 453
    variable importance, 451
self-organizing maps, 348
sequence browser, 414
sequence detection, 377, 404
sequence models
    content field(s), 405
    data formats, 405
    expert options, 407
    field options, 405
    generating a rule SuperNode, 415
    ID field, 405
    model nugget, 409, 411, 413–414
    model nugget details, 411
    model nugget settings, 413
    model nugget summary, 414
    modeling node, 404
    options, 406
    predictions, 409
    sequence browser, 414
    sorting, 414
    tabular versus transactional data, 407
    time field, 405
series
    transforming, 423
settings options
    Cox regression models, 344
    SLRM node, 450
significance levels
    for merging, 158
    for splitting, 157
SLRM. See self-learning response models, 446
snapshot
    creating, 218
Snapshots tab, 218
split models, 465
    building, 30
    features affected by, 34
    modeling nodes, 33
    versus partitioning, 33
split-model nuggets, 61
    Summary tab, 50
    viewer, 61
splits
    decision trees, 121–123
SPSS Modeler Server, 2
SQL
    export, 50
    logistic regression models, 275
    rule sets, 170
square root transformation, 423
    Time Series Modeler, 434
statistical models, 238
Statistics models, 29
step interventions
    identifying, 420
stepping options
    Cox regression models, 343
    logistic regression models, 270
stepwise field selection
    Discriminant node, 290
stopping options
    decision trees, 150
stratification, 118
SuperNodes
    and model links, 45
support
    antecedent support, 390, 413
    Apriori node, 380
    association rules, 391
    CARMA node, 386–387
    for sequences, 411
    rule support, 390, 413
    Sequence node, 406
support vector machine models
    about, 455
    expert options, 458
    kernel functions, 455
    model nugget, 460, 473
    model options, 458
    modeling node, 457
    overfitting, 456
    settings, 461
    tuning, 456
surrogates
    decision tree models, 168
    decision trees, 123, 150
SVM. See support vector machine models, 455
t statistic
    feature selection, 73
tabular data, 378, 400
    Apriori node, 36
    CARMA node, 383
    Sequence node, 405
    transposing, 401
territorial map
    Discriminant node, 288
till-roll data, 378, 400–401
time field
    CARMA node, 383
    Sequence node, 405
time series models
    ARIMA criteria, 432
    ARIMA models, 424
    expert modeler criteria, 429
    exponential smoothing, 424
    exponential smoothing criteria, 431
    model nugget, 438
    model parameters, 442
    modeling node, 424
    outliers, 430, 436
    periodicity, 434
    requirements, 425
    residuals, 443
    series transformation, 434
    transfer functions, 434
trademarks, 484
transactional data, 378, 400–401
    Apriori node, 36
    CARMA node, 383
    MS Association Rules node, 36
    Sequence node, 405
transfer functions, 434
    delay, 434
    denominator orders, 434
    difference orders, 434
    numerator orders, 434
    seasonal orders, 434
transforming series, 423
transient change outliers, 421
transient outliers
    Time Series Modeler, 436
transposing tabular output, 401
tree builder, 119–120, 124
    custom splits, 121
    exporting results, 139
    gains, 126–127, 130, 133
    generating models, 135–136
    graph generation, 172
    predictors, 122
    profits, 129
    ROI, 129
    surrogates, 123
tree depth, 149
tree directives, 148
    C&R Tree node, 136
    CHAID node, 136, 138
    decision trees, 138
    QUEST node, 136
tree map
    decision tree models, 168
    graph generation, 172
tree-based analysis
    general uses, 118
trends
    identifying, 418
truth-table data, 378, 400–401
two-headed rules, 387
twoing impurity measure, 155
TwoStep cluster models, 347, 359–361
    clustering, 361
    graph generation from model nugget, 374
    model nugget, 360–361
    modeling node, 358
    number of clusters, 359
    options, 359
    outlier handling, 359
    standardization of fields, 359
unrefined models, 69, 74, 76, 377
unrefined rule models, 388, 397–398
unsupervised learning, 347–348
variable importance
    self-learning response models, 451
variables
    screening, 118
variance stabilizing transformation, 423
varimax rotation
    PCA/factor models, 280
viewer tab
    decision tree models, 168
    graph generation, 172
visualization
    clustering models, 362
    decision trees, 168
    graph generation, 172, 374, 393
visualize a model, 235
voting rule set, 175
Wald statistic, 268, 270
weight fields, 37–38
weighted least squares, 38
working model pane, 214
