Front cover
Enhance Your Business Applications: Simple Integration of Advanced Data Mining Functions
Understand IBM DB2 Intelligent Miner Modeling, Scoring, and Visualization
Deploy data mining functions in today's business applications
Learn how to configure the advanced data mining functions
Corinne Baragoin
Ronnie Chan
Helena Gottschalk
Gregor Meyer
Paulo Pereira
Jaap Verhees
ibm.com/redbooks
International Technical Support Organization

Enhance Your Business Applications: Simple Integration of Advanced Data Mining Functions

December 2002
SG24-6879-00
Note: Before using this information and the product it supports, read the information in “Notices” on page xvii.
First Edition (December 2002) This edition applies to IBM DB2 Intelligent Miner Modeling Version 8.1, IBM DB2 Intelligent Miner Scoring Version 8.1, and IBM DB2 Intelligent Miner Visualization Version 8.1.
© Copyright International Business Machines Corporation 2002. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents

Figures . . . . . xi
Tables . . . . . xiii
Examples . . . . . xv
Notices . . . . . xvii
Trademarks . . . . . xviii
Preface . . . . . xix
The team that wrote this redbook . . . . . xix
Become a published author . . . . . xxii
Comments welcome . . . . . xxiii

Part 1. Advanced data mining functions overview . . . . . 1

Chapter 1. Data mining functions in the database . . . . . 3
1.1 The evolution of data mining . . . . . 4
1.2 Data mining does not stand alone anymore . . . . . 6
1.2.1 Faster time to market and closing the loop . . . . . 7
1.2.2 Real-time analytics . . . . . 11
1.2.3 Leveraging existing IT skills . . . . . 12
1.2.4 Building repeatable processes and tasks . . . . . 12
1.2.5 Efficiency and effectiveness . . . . . 13
1.2.6 Cost reduction of mining analytics . . . . . 14

Chapter 2. Overview of the new data mining functions . . . . . 15
2.1 Why relational database management system (RDBMS) functions . . . . . 16
2.1.1 Easy use of automation and integration . . . . . 16
2.1.2 Operational efficiency . . . . . 16
2.1.3 Performance . . . . . 17
2.1.4 Administrative efficiency . . . . . 17
2.2 Scoring: Deploying data mining models . . . . . 18
2.2.1 Scoring as an SQL extension . . . . . 18
2.2.2 Batch mode and real time . . . . . 19
2.2.3 Support for the new PMML 2.0 standard . . . . . 19
2.2.4 Leveraging existing IT skills . . . . . 20
2.3 Modeling: Building a mining model using SQL . . . . . 21
2.3.1 Interoperability . . . . . 22
© Copyright IBM Corp. 2002. All rights reserved.
iii
2.3.2 Models in DB2 UDB . . . . . 22
2.3.3 Support of the new PMML 2.0 standard . . . . . 22
2.3.4 Required skills . . . . . 23
2.4 Visualization: Understanding the data mining model . . . . . 24
2.4.1 Interoperability . . . . . 24
2.4.2 Choice in use . . . . . 24
2.4.3 Multiplatform capability . . . . . 24
2.4.4 Support of the new PMML 2.0 standard . . . . . 24
2.4.5 Required skills . . . . . 25
2.5 IM Modeling, Scoring, and Visualization interactions . . . . . 25
2.5.1 The whole picture . . . . . 25
2.5.2 Configurations . . . . . 26
2.5.3 Positioning with Intelligent Miner for Data . . . . . 27
2.6 Conclusion . . . . . 32

Chapter 3. Business scenario deployment examples . . . . . 33
3.1 Customer profiling . . . . . 34
3.1.1 Business benefits . . . . . 35
3.2 Fraud detection . . . . . 37
3.2.1 Business benefits . . . . . 38
3.3 Campaign management . . . . . 39
3.3.1 Business benefits . . . . . 42
3.4 Up-to-date promotion . . . . . 42
3.4.1 Business benefits . . . . . 43
3.5 Integrating the generic components . . . . . 44
3.5.1 Generic environment and components . . . . . 44
3.5.2 The method . . . . . 47

Part 2. Deploying data mining functions . . . . . 49

Chapter 4. Customer profiling example . . . . . 51
4.1 The business issue . . . . . 52
4.2 Mapping the business issue to data mining functions . . . . . 52
4.3 The business application . . . . . 54
4.4 Environment, components, and implementation flow . . . . . 54
4.5 Step-by-step implementation . . . . . 57
4.5.1 Configuration . . . . . 58
4.5.2 Workbench data mining . . . . . 60
4.5.3 Scoring . . . . . 62
4.5.4 Application integration . . . . . 64
4.6 Benefits . . . . . 72
4.6.1 End-to-end implementation . . . . . 72
4.6.2 DB2 mining functions next to the workbench . . . . . 72
4.6.3 Real-time analytics . . . . . 72
4.6.4 Automated and on demand for multi-channels . . . . . 73

Chapter 5. Fraud detection example . . . . . 75
5.1 The business issue . . . . . 76
5.2 Mapping the business issue to data mining functions . . . . . 78
5.3 The business application . . . . . 79
5.4 Environment, components, and implementation flow . . . . . 80
5.5 Data to be used . . . . . 81
5.5.1 Data extraction . . . . . 82
5.5.2 Data manipulation and enrichment . . . . . 82
5.6 Implementation in DB2 UDB V8.1 . . . . . 84
5.6.1 Enabling database for modeling and scoring . . . . . 85
5.6.2 Installing additional UDFs and stored procedures . . . . . 85
5.6.3 Model building . . . . . 85
5.7 Implementation in DB2 UDB V7.2 . . . . . 86
5.7.1 Prerequisite: Initial model building . . . . . 87
5.7.2 Data settings . . . . . 87
5.7.3 Model parameter settings . . . . . 87
5.7.4 Building the mining task . . . . . 89
5.7.5 Running the model by calling a stored procedure . . . . . 89
5.7.6 Scoring script generation . . . . . 89
5.7.7 Applying the scoring model . . . . . 90
5.7.8 Ranking and listing the five smallest clusters . . . . . 91
5.7.9 Actionable result for investigation . . . . . 93
5.7.10 Scheduling the job to run at regular intervals . . . . . 93
5.8 Benefits . . . . . 94
5.8.1 A system that adapts to changes in undesirable behavior . . . . . 94
5.8.2 Fast deployment of fraud detection system . . . . . 95
5.8.3 Better use of data mining resource . . . . . 95
5.8.4 A repeatable data mining process in a production environment . . . . . 95
5.8.5 Enhanced communication . . . . . 95
5.8.6 Leveraged IT skills for advanced analytical application . . . . . 95
5.8.7 Actionable result . . . . . 95

Chapter 6. Campaign management solution examples . . . . . 97
6.1 Campaign management overview . . . . . 98
6.2 Trigger-based marketing . . . . . 99
6.2.1 The business issue . . . . . 100
6.2.2 Mapping the business issue to data mining functions . . . . . 100
6.2.3 The business application . . . . . 101
6.2.4 Environment, components, and implementation flow . . . . . 102
6.2.5 Step-by-step implementation . . . . . 105
6.2.6 Benefits . . . . . 113
6.3 Retention campaign . . . . . 114
6.3.1 The business issue . . . . . 114
6.3.2 Mapping the business issue to data mining functions . . . . . 115
6.3.3 The business application . . . . . 116
6.3.4 Environment, components, and implementation flow . . . . . 116
6.3.5 Step-by-step implementation . . . . . 119
6.3.6 Benefits . . . . . 128
6.4 Cross-selling campaign . . . . . 128
6.4.1 The business issue . . . . . 129
6.4.2 Mapping the business issue to data mining functions . . . . . 129
6.4.3 The business application . . . . . 130
6.4.4 Environment, components, and implementation flow . . . . . 130
6.4.5 Step-by-step implementation . . . . . 136
6.4.6 Other considerations . . . . . 145

Chapter 7. Up-to-date promotion example . . . . . 149
7.1 The business issue . . . . . 150
7.2 Mapping the business issue to data mining functions . . . . . 150
7.3 The business application . . . . . 151
7.4 Environment, components, and implementation flow . . . . . 151
7.5 Step-by-step implementation . . . . . 152
7.5.1 Configuration . . . . . 154
7.5.2 Data model . . . . . 154
7.5.3 Modeling . . . . . 154
7.5.4 Application integration . . . . . 157
7.6 Benefits . . . . . 159
7.6.1 Automating models: Easy to use . . . . . 159
7.6.2 Calibration: New data = new model . . . . . 159

Chapter 8. Other possibilities of integration . . . . . 161
8.1 Real-time scoring on the Web (using Web analytics) . . . . . 162
8.1.1 The business issue . . . . . 162
8.1.2 Mapping the business issue to data mining functions . . . . . 162
8.1.3 The business application . . . . . 164
8.1.4 Integration with the application example . . . . . 164
8.2 Business Intelligence integration . . . . . 166
8.2.1 Integration with DB2 OLAP . . . . . 166
8.2.2 Integration with QMF . . . . . 168
8.3 Integration with e-commerce . . . . . 172
8.4 Integration with WebSphere Personalization . . . . . 175
8.5 Integration using Java . . . . . 185
8.5.1 Online scoring with IM Scoring Java Beans . . . . . 185
8.5.2 Typical business issues . . . . . 187
8.5.3 Mapping to mining functions using IM Scoring Java Beans . . . . . 188
8.5.4 The business application . . . . . 189
8.5.5 Integration with the application example . . . . . 189
8.6 Conclusion . . . . . 190

Part 3. Configuring the DB2 functions for data mining . . . . . 191

Chapter 9. IM Scoring functions for existing mining models . . . . . 193
9.1 Scoring functions . . . . . 194
9.1.1 Scoring mining models . . . . . 194
9.1.2 Scoring results . . . . . 195
9.2 IM Scoring configuration steps . . . . . 196
9.3 Step-by-step configuration . . . . . 197
9.3.1 Configuring the DB2 UDB instance . . . . . 197
9.3.2 Configuring the database . . . . . 198
9.3.3 Exporting models from the modeling environment . . . . . 200
9.3.4 Importing the data mining model in the relational database management system (RDBMS) . . . . . 202
9.3.5 Scoring the data . . . . . 205
9.3.6 Exploiting the results . . . . . 210
9.4 Conclusion . . . . . 210

Chapter 10. Building the mining models using IM Modeling functions . . . . . 211
10.1 IM Modeling functions . . . . . 212
10.2 Data mining process with IM Modeling . . . . . 212
10.3 Configuring a database for mining . . . . . 213
10.3.1 Enabling the DB2 UDB instance for modeling . . . . . 213
10.3.2 Configuring the individual database for modeling . . . . . 214
10.3.3 IM Modeling in DB2 UDB V8.1 . . . . . 215
10.4 Specifying mining data . . . . . 216
10.4.1 Defining mining settings . . . . . 217
10.4.2 Defining mining tasks . . . . . 221
10.4.3 Building and storing mining models . . . . . 223
10.4.4 Testing the classification models . . . . . 224
10.4.5 Working with mining models and test results . . . . . 225
10.5 Hybrid modeling . . . . . 226
10.6 Conclusion . . . . . 227

Chapter 11. Using IM Visualization functions . . . . . 229
11.1 IM Visualization functions . . . . . 230
11.1.1 Common and different tasks . . . . . 230
11.1.2 Applets or Java API . . . . . 231
11.2 Configuration settings . . . . . 231
11.2.1 Loading a model from a local file system . . . . . 231
11.2.2 Loading a model from a database . . . . . 232
11.3 Using IM Visualizers . . . . . 235
11.3.1 Using IM Visualizers as applets . . . . . 235
11.3.2 Complete example script . . . . . 238
11.4 Examples of IM Visualization . . . . . 241
Part 4. Appendixes . . . . . 245

Appendix A. SQL script to configure database for data mining function . . . . . 247

Appendix B. SQL scripts for the customer profiling scenario . . . . . 249
Script to create and load the customer segment table . . . . . 250
Script to score new customers . . . . . 251

Appendix C. SQL scripts for the fraud detection scenario . . . . . 255
Script to prepare the data . . . . . 256
Script to build the data mining model . . . . . 263
Script to score the data . . . . . 264
Script to get the scoring results . . . . . 266

Appendix D. SQL scripts for the retention campaign scenario . . . . . 269
Script to create a table . . . . . 270
Script to import the data mining model with PMML file . . . . . 271
Script to create a view of the resulting score . . . . . 271
Script to create a table with the resulting score . . . . . 272

Appendix E. SQL scripts for the up-to-date promotion scenario . . . . . 275
Script for function to build the associations rule model . . . . . 276
Script for a function that transforms the resulting rule model . . . . . 276
Script to build the rules model . . . . . 278
Script to extract rules to a table . . . . . 279

Appendix F. UDF to create data mining models . . . . . 281

Appendix G. UDF to extract rules from a model to a table . . . . . 285

Appendix H. Embedding an IM Visualization applet . . . . . 289
Syntax to embed the IM Visualization applet . . . . . 290
Parameters to use . . . . . 290

Appendix I. IM Scoring Java Bean code example . . . . . 293
Source code of IM Scoring Java Bean . . . . . 294
Setting up the environment variables: The paths.bat file . . . . . 297

Appendix J. Demographic clustering: Technical differences . . . . . 299
Appendix K. Additional material . . . . . 301
Locating the Web material . . . . . 301
Using the Web material . . . . . 301
System requirements for downloading the Web material . . . . . 302
How to use the Web material . . . . . 302

Glossary . . . . . 303

Abbreviations and acronyms . . . . . 305

Related publications . . . . . 307
IBM Redbooks . . . . . 307
Other resources . . . . . 307
Referenced Web sites . . . . . 308
How to get IBM Redbooks . . . . . 309
IBM Redbooks collections . . . . . 309

Index . . . . . 311
Figures

1-1 A historical view of data mining . . . . . 4
1-2 Shift in the use of data mining technology and audience . . . . . 5
1-3 The generic data mining method . . . . . 8
1-4 Focus on modeling and deployment of scores . . . . . 10
1-5 Integration between different modeling and scoring techniques . . . . . 13
2-1 IM Scoring and skills . . . . . 21
2-2 IM Modeling data mining functions and skills . . . . . 23
2-3 IM Visualization and skills . . . . . 25
2-4 Configurations . . . . . 27
2-5 Positioning DB2 mining functions with DB2 Intelligent Miner for Data . . . . . 28
2-6 IM Scoring: Example CRM application for campaign management . . . . . 30
2-7 Interoperability of DB2 Intelligent Miner for Data with IM Scoring . . . . . 31
3-1 Getting to 1:1 marketing . . . . . 40
3-2 Business scenarios components . . . . . 44
4-1 Customer profiling using IM Scoring, deployment to Business Objects . . . . . 55
4-2 Implementation flow of customer profile scores . . . . . 57
4-3 Demographic clustering in DB2 Intelligent Miner for Data . . . . . 61
4-4 Exporting the mining model in the PMML format . . . . . 62
4-5 Business Objects report first page: Scores graphical display . . . . . 67
4-6 Business Objects report first page: Scores textual display . . . . . 68
4-7 Business Objects report second page: Graphics cluster . . . . . 70
4-8 Business Objects report third page: Detailed data of selected customer . . . . . 71
5-1 Fraud detection: Deployment environment and components . . . . . 81
5-2 Data enrichment process . . . . . 83
5-3 Step-by-step implementation flow with DB2 UDB V8.1 . . . . . 84
5-4 Step-by-step implementation flow using DB2 UDB V7.2 . . . . . 86
5-5 Sample window showing job scheduled to run everyday at 0130am . . . . . 94
5-6 Report on risky transactions in Business Objects . . . . . 96
6-1 Deployment environment, components in customer churn system . . . . . 102
6-2 A simple workflow implementation using DB2 UDB triggers . . . . . 103
6-3 Trigger-based marketing implementation flow . . . . . 104
6-4 Deployment environment, components in customer churn system . . . . . 117
6-5 Retention case study implementation flow . . . . . 118
6-6 Table with the churn score (confidence) . . . . . 122
6-7 Unica Affinium campaign: Example of DB2 scoring script integration . . . . . 125
6-8 Unica Affinium campaign: Detail of the schedule process . . . . . 126
6-9 Unica Affinium campaign: Trigger invoking DB2 scoring script . . . . . 127
6-10 Campaign management data flow . . . . . 131
6-11 Customer loyalty architecture . . . . . 132
6-12 DB2 OLAP server outline . . . . . 133
6-13 Cross-selling outbound campaign implementation flow . . . . . 135
6-14 Outbound treatment process . . . . . 141
6-15 MQSeries Server transport adapter . . . . . 142
6-16 Business Service example . . . . . 142
6-17 Siebel integration object XML representation . . . . . 145
6-18 Campaign summary Business Objects report . . . . . 147
7-1 Up-to-date promotion components . . . . . 152
7-2 Implementation flow of the up-to-date promotion . . . . . 153
7-3 Rules in IM Visualization . . . . . 156
7-4 Business Objects report with the product combinations . . . . . 158
8-1 Using modeling and real-time scoring services for Web analytics . . . . . 164
8-2 Tracing a customer behavior profile based on session traffic . . . . . 165
8-3 OLAP outline with initial rough hierarchy to present customer groups . . . . . 167
8-4 Customer segments placed as dimensions in an OLAP outline . . . . . 168
8-5 Results of IM Scoring run in QMF V7.2 . . . . . 171
8-6 B2C, B2B, and business focal point in an e-store . . . . . 173
8-7 E-commerce analysis . . . . . 174
8-8 WebSphere Personalization . . . . . 177
8-9 Personalizing at Amazon.com when you visit their Web site . . . . . 178
8-10 The Amazon three-step recommendation wizard . . . . . 179
8-11 Amazon has no recommendations yet, but tries to get on target . . . . . 180
8-12 The Amazon.com recommendations wizard suggesting a book . . . . . 181
8-13 REI home page . . . . . 182
8-14 Recommendations at the time of shopping cart entries . . . . . 183
8-15 IM Scoring Java Beans: JavaDoc on class RecordScorer . . . . . 186
8-16 Architecture sample to realize a call-center scenario . . . . . 187
9-1 IM Scoring to apply models to new data . . . . . 195
9-2 Application architecture with modeling and scoring . . . . . 197
9-3 Export model menu from DB2 Intelligent Miner for Data . . . . . 201
9-4 Conceptual overview of SQL scoring elements . . . . . 206
9-5 Pseudo code for a SQL script for scoring . . . . . 207
9-6 Sample code for scoring with a churn model . . . . . 208
10-1 Hybrid modeling run for classification of banking card holders . . . . . 227
11-1 IM Visualizer: Opening a model from the local file system . . . . . 232
11-2 Opening a model from the database system . . . . . 234
11-3 Loading the model in a secure manner from the database . . . . . 234
11-4 Invoking an IM Visualizer applet to be embedded in an HTML browser . . . . . 239
11-5 Invoking IM Visualizer applet via a push button in HTML browser . . . . . 240
11-6 Cluster visualization . . . . . 242
11-7 Tree visualization of predicted classes . . . . . 243
11-8 Visualization of associations . . . . . 244
Tables
3-1 Step-by-step modeling process  47
3-2 Step-by-step scoring process  47
4-1 Table fields to use  59
5-1 CDR fields  82
5-2 Attributes used for clustering  83
6-1 Steps to configure the automatic e-mail component in DB2 UDB  105
6-2 Customer data in telecommunications churn case  109
6-3 Customer information table  137
6-4 Campaign information table  139
6-5 Campaign activity customer history table  139
6-6 Hub customer table example  146
7-1 Customer data in retail case  154
7-2 Product_Name table  156
7-3 Rules_Definition table  157
9-1 Step-by-step action categories and steps  196
9-2 Steps for enabling the DB2 UDB instance  198
9-3 Steps for enabling the DB2 UDB database  198
9-4 Middleware prerequisites for federated database access  199
9-5 Configuring a remote DB2 UDB table as a target table  199
9-6 Matching models to UDFs, UDTs, and DB2 tables  203
9-7 Model types and UDFs  205
9-8 Matching scoring functions, result types, and UDFs  209
10-1 Steps to set up a modeling run  212
10-2 Database instance parameters required for scoring  214
10-3 Database parameters required for scoring  214
10-4 Stored procedures for starting mining runs  215
10-5 Methods for defining mining data  216
10-6 UDFs and frequently used methods for defining mining settings  217
10-7 Tables that store data mining tasks  222
10-8 Algorithms and the associated stored procedures input and output  223
10-9 Procedures for testing a classification model  224
11-1 Parameters to load a model from a database environment  233
H-1 Parameters to embed the IM Visualization applet  290
J-1 Demographic clustering: Application mode settings options  300
© Copyright IBM Corp. 2002. All rights reserved.
Examples
4-1 Batch job for scoring  56
4-2 Script to enable the USBANK database for scoring  58
4-3 Importing the clustering model  62
4-4 Defining the view CustomerSegmentsView  63
4-5 Write scoring results into table CustomerSegmentsTable  64
4-6 Defining the CustomerSegmentsTable  64
4-7 HTML code to show IM Visualization applet in BO report  65
5-1 Stored procedure to invoke clustering run in DB2 UDB V8.1  85
5-2 SQL script to define data setting in DB2 UDB V7.2  87
5-3 SQL script to define model setting for clustering  88
5-4 SQL script to build a data mining task  89
5-5 Calling a stored procedure to run the segmentation task  89
5-6 Creating the SQL script for scoring with a data mining model  90
5-7 Scoring with a generated script  90
5-8 Script to list the connections in the five smallest clusters  91
5-9 Script to list all clusters with less than 1% of all transactions  92
6-1 Sample Java code creating a MailUDF that sends e-mail within DB2  106
6-2 DB2 UDF source code for mail  108
6-3 SQL code used to import the PMML model into the CHURN database  111
6-4 The scoring code used for this scenario  111
6-5 Scoring with generated script  112
6-6 Trigger to run scoring code when a customer attribute changes  112
6-7 DB2 UDB code to create a trigger to send e-mail  113
6-8 Importing the data mining model (PMML file)  120
6-9 Applying the tree classification model  120
6-10 Creating the churn score table  121
6-11 New score and rescore DB2 UDB triggers  123
6-12 Script to enable the CMPG_MGM database for scoring  136
6-13 Starting an MQ Receiver task  143
6-14 Other parameters for MQ Receiver  144
7-1 Building the associations rules model  155
7-2 Extracting the associations rules  157
8-1 QMF V7.2 query for IM Scoring  170
9-1 Passing the record and applying a classification model  208
9-2 Extracting the score from a UDT DM_RegResult  209
10-1 Using DM_testClasModelCmdT stored procedure  224
10-2 Exporting a model for distribution  225
10-3 Extracting error rate from test result  226
11-1 Calling IM Visualizer via command line parameter settings  235
11-2 Clustering Visualizer launched with a start button  236
11-3 Syntax to open visualizer as applet in HTML document  236
11-4 Classification visualizer embedded in an HTML document  237
11-5 Syntax to embed a single view as an applet in an HTML document  237
11-6 Association rules view embedded in an HTML document  237
11-7 Script to embed and open a Visualizer applet for a cluster model  238
11-8 Script to launch IM Visualizer applet via push button in HTML browser  240
A-1 SQL script for configuring databases  247
B-1 SQL script to create and load segment table  250
B-2 SQL script Score.db2 for customer profile scoring  251
C-1 SQL script for data preparation  256
C-2 SQL script defining data, model parameters, task, and modeling  263
C-3 SQL script generated by IDMMKSQL and modified for the example  264
C-4 Script to generate a list of customers from the smallest five clusters  266
D-1 Creating a table to be scored  270
D-2 Inserting the PMML model  271
D-3 Applying the score  271
D-4 Creating a table with the churn score  272
D-5 New Customer score  272
D-6 Rescoring  273
E-1 Creating the BuildRulesModel function  276
E-2 Function creation  277
E-3 Building associations rules  278
E-4 Extracting rules to a table  279
F-1 SQL script to create and load segment table  281
G-1 Creating the function ListRules  285
H-1 Embedding IM Visualization in an HTML document  290
I-1 Source code of IM Scoring Java Bean: CustomerScore.java file  294
I-2 The paths.bat file  297
Notices This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. 
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrates programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. 
You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to IBM's application programming interfaces.
Trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:
Redbooks (logo)™, AIX®, AS/400®, Balance®, DB2®, DB2 Connect™, DB2 OLAP Server™, DRDA®, IBM®, Illustra™, IMS™, Intelligent Miner™, MQSeries®, OS/400®, Perform™, QMF™, Redbooks™, SP™, TME®, WebSphere®, z/OS™
The following terms are trademarks of other companies: ActionMedia, LANDesk, MMX, Pentium and ProShare are trademarks of Intel Corporation in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. C-bus is a trademark of Corollary, Inc. in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. SET, SET Secure Electronic Transaction, and the SET Logo are trademarks owned by SET Secure Electronic Transaction LLC. Other company, product, and service names may be trademarks or service marks of others.
Preface

Today data mining is no longer thought of as a set of stand-alone techniques, far from the business applications, used only by data mining specialists or statisticians. Integrating data mining with mainstream applications is becoming an important issue for e-business applications. To support this move to applications, data mining is now an extension of the relational database that database administrators and IT developers use as they would any other standard relational function.

This IBM Redbook positions the new DB2 data mining functions:
- IBM DB2 Intelligent Miner Modeling (IM Modeling in this redbook)
- IBM DB2 Intelligent Miner Scoring (IM Scoring in this redbook)
- IBM DB2 Intelligent Miner Visualization (IM Visualization in this redbook)

Part 1 of this redbook helps business analysts and implementers to understand and position these new DB2 data mining functions. Part 2 provides examples for implementers on how to easily and quickly integrate the data mining functions in business applications to enhance them. Part 3 helps database administrators and IT developers to configure these functions once, to prepare them for use and integration in any application.
The team that wrote this redbook

This redbook was produced by a team of specialists from around the world working at the International Technical Support Organization, San Jose Center.

Corinne Baragoin is a Business Intelligence (BI) project leader at the International Technical Support Organization, San Jose Center. She has over 17 years of experience as an IT specialist on DB2 Universal Database (UDB) and related solutions. Before joining the ITSO in 2000, she worked as an IT Specialist for IBM France. There, she supported Business Intelligence technical presales activities and assisted customers on DB2 UDB, data warehouse, and Online Analytical Processing (OLAP) solutions.

Ronnie Chan is a senior IT specialist with DB2 Data Management Software for IBM Australia. He is also a member of the Technical Leadership Council with IBM Software Group, Asia Pacific. He has 15 years of experience as a system engineer for large software organizations. He holds a bachelor of science degree with honors from the University of Salford in the United Kingdom (UK). His areas
of expertise include DB2 on distributed platforms, SAS, Business Intelligence, OLAP, and data mining. His most recent publication in the area of data mining is a paper entitled “Protecting rivers and streams by monitoring chemical concentrations and algae communities using neural network” for European Network for fuzzy logic and uncertainty modeling. Helena Gottschalk is a data mining consultant, working as an IBM Business Partner in Brazil. Previously she worked for a research institute, IBM, and Pricewaterhouse. She has 13 years of experience in applied mathematics, neural networks, statistics, and high performance computing, solving business needs. She has worked in many data mining projects in different industries, some in data warehouse environments. She has participated in several conferences and written many scientific and business publications. She holds a master's degree in electrical engineering from the Federal University of Rio de Janeiro (UFRJ). Gregor Meyer is a senior software engineer at the Silicon Valley Lab in San Jose. He has worked for IBM since 1997, when he joined the product development team for DB2 Intelligent Miner in Germany. Now, he is responsible for the technical integration of data mining and other Business Intelligence technologies with DB2. He represents IBM in the Data Mining Group (DMG) defining the Predictive Model Markup Language (PMML) standard for mining models. Gregor studied computer science in Brunswick and Stuttgart, Germany. He received his doctorate from the University of Hagen, Germany. Paulo Pereira is a Business Intelligence specialist at the Business Intelligence Solutions Center (BISC) in Dallas, Texas. He has over seven years of experience with customer projects in the Business Intelligence arena, in consulting, architecture design, data modeling, and implementing large data warehouse systems. 
He has worked with the majority of the Business Intelligence IBM Data Management and partners portfolio, specializing in parallel UNIX solutions. He holds a master's degree in electrical engineering from the Catholic University of Rio de Janeiro (PUC-RJ), Brazil. Jaap Verhees is an architecture and business/IT alignment consultant with Ordina Finance Consulting, based in the Netherlands. Prior to his five years at Ordina, he worked for IBM for six years. He has 11 years of IT experience, both in the Business Intelligence (BI) and Object Technology (OT) areas. He holds a Ph.D degree in econometrics from Groningen University, The Netherlands. His areas of expertise include data mining, OLAP, and modeling in the BI arena. He also trains clients in techniques for business process modeling, data modeling, and system analysis, in addition to application development methodologies and data mining methodologies. He has written extensively on techniques for analyzing multidimensional data.
The team (from left to right): Helena, Gregor, Jaap, Paulo, Corinne, and Ronnie
Thanks to the following people for their contributions to this project.

By providing their technical input and reviewing this book:
Leston Nay, UNICA Corp
Ute Baumbach, Inge Buecker, Toni Bollinger, Cornelius Dufft, Carsten Schulz, and Gerd Piel, IBM Development Lab in Boeblingen
Jay Bruce, IBM Silicon Valley Lab
Prasad Vishnubhotla, IBM Austin
Don Beville, IBM Dallas
Micol Trezza, IBM Italy

By reviewing this redbook:
Wout van Zeeland, Ordina
Fidel Reijerse, INTELLIDAT
Damiaan Zwietering, IBM Netherlands
Graham Bent, IBM UK
Tommy Eunice, IBM Worldwide Marketing
Become a published author

Join us for a two- to seven-week residency program! Help write an IBM Redbook dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You'll team with IBM technical professionals, Business Partners and/or customers. Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you'll develop a network of contacts in IBM development labs, and increase your productivity and marketability. Find out more about the residency program, browse the residency index, and apply online at: ibm.com/redbooks/residencies.html
Comments welcome

Your comments are important to us! We want our Redbooks to be as helpful as possible. Send us your comments about this or other Redbooks in one of the following ways:
- Use the online Contact us review redbook form found at: ibm.com/redbooks
- Send your comments in an Internet note to: [email protected]
- Mail your comments to: IBM Corporation, International Technical Support Organization, Dept. QXXE Building 80-E2, 650 Harry Road, San Jose, California 95120-6099
Part 1. Advanced data mining functions overview

This part provides an overview of the following data mining deployment topics. It explains:
- The next step for data mining industrialization
- The drivers for deploying data mining models and scores in any business environment
- Why data mining functions are part of the database
- Deployment examples according to typical business scenarios in the banking, retail, and telecommunications industry sectors
Chapter 1. Data mining functions in the database

This chapter discusses:
- How data mining has evolved
- The drivers for deployment instead of using data mining alone
- The effective deployment of scoring into the business data environment
- The issues that result from industrializing the mining models and the challenges during the actual deployment of the models
1.1 The evolution of data mining

Over the past decade, data mining has become a valuable business tool: it helps companies gain more information, better understand how their business runs, and find new ways and ideas to extend the business into other markets. Its analytical power to uncover new business areas and opportunities no longer needs to be proven to business analysts or data mining analysts. Figure 1-1 shows a historical view of data mining.
Figure 1-1 A historical view of data mining (a timeline from the beginnings of AI in 1956, through query languages, DBMS, the SQL standard, pattern recognition, rule-based reasoning, and data warehousing, to Web mining and real-time scoring in 2002, including DB2 UDB Intelligent Miner in 1996 and DB2 UDB Mining Extenders in 2002)
Today, data mining is no longer thought of as a set of stand-alone techniques, far from the business applications. Over the last three years, integrating data mining with mainstream applications has become an important part of e-commerce applications. Meta Group has noted that data mining tools and workbenches are difficult to sell as stand-alone tools; they fare better as part of a solution. Enterprises increasingly require the integration of data mining technology with relational databases and their business-oriented applications. To support this move to applications, data mining products are shifting from stand-alone technology to being integrated in the relational database. Yesterday, data mining experts had to advise how to integrate the use of data mining results into company processes. Today, the technology itself supports inserting data mining findings into the company's processes of interaction between different business users. As shown in Figure 1-2, data mining has moved from workbenches used by power users to being embedded and integrated directly in applications. With the
traditional approach to data mining, an expert uses a workbench to run preprocessing steps, mining algorithms, and visualization of the mining models. The goal is to find new and interesting patterns in the data and somehow use the gained knowledge in business decisions. The power users, such as data mining analysts and advanced Online Analytical Processing (OLAP) users, typically require separate, specialized datamarts for analyzing preprocessed data. The integration of mining results into the operational business is usually done in an ad hoc manner. With the integration of mining, the focus shifts toward the deployment of mining in business applications (Figure 1-2). The audience includes the end users in the line of business, who need an easy-to-use interface to the mining results and a tight integration with the existing environment. A power user is still required to design and build optimized and efficient data mining models. As in the traditional, non-integrated approach, the power user considers all the data from both the operational databases and the data warehouse to be potentially important. Simple predictive models, anticipation of customer behavior, and automatic optimization of business processes by means of data mining become more important than general knowledge discovery.
Figure 1-2 Shift in the use of data mining technology and audience (from GUI-based interactive mining on a workbench to application-focused integrated mining, where database developers use the Intelligent Miner Extenders to serve business users and customers through applications such as risk assessment, targeted mailing, campaign management, churn management, and sales forecasting)
When data mining becomes part of a solution, the actual user of the mining functions is a developer who packages the mining results for the end-user
community. This person has database application development skills. They provide the models and scores back to the data warehouse with its aggregated data, to specialized datamarts, and to operational databases that hold data at the transactional level. A person with SQL programming skills applies the scores, for example, to estimate the likelihood of events in the near future or to forecast customer behavior. This person also ensures that the creation of the model is not treated as the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained needs to be organized and presented in a way that the end user can use it. Depending on the business requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases, the business end user, together with the person with the SQL skills, carries out the deployment steps rather than the data analyst. However, even if the data analyst does not carry out the deployment effort, the business end user must understand up front what actions need to be carried out to actually use the created models. The actual deployment of the models and scores is where the real benefits are harvested by the business.
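To make this concrete, applying a score with SQL skills can look like the following minimal sketch. It assumes the IM Scoring conventions described later in this book (the IDMMX schema, the DM_applyClasModel, DM_getPredClass, and DM_applData UDFs, and the ClassifModels table); the customer table, its columns, and the model name are hypothetical.

```sql
-- Hypothetical sketch: apply a previously imported classification model
-- (for example, a churn model) to every row of a customer table and store
-- the predicted class. Table, column, and model names are illustrative only.
UPDATE bank.customers
SET churn_class = IDMMX.DM_getPredClass(
      IDMMX.DM_applyClasModel(
        (SELECT model FROM IDMMX.ClassifModels
          WHERE modelname = 'ChurnModel'),
        IDMMX.DM_applData(
          IDMMX.DM_applData('AGE', age),
          'INCOME', income)));
```

An application or report then simply reads the churn_class column; no mining workbench is involved at deployment time.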
1.2 Data mining does not stand alone anymore

As stated in the previous section, data mining is no longer thought of as a set of stand-alone techniques, far from business applications. Today many applications, such as Web analytics, Web personalization, e-commerce, and operative Customer Relationship Management (CRM), require integrated data mining functions. There are several business drivers for deploying models and scores in any business environment instead of applying data mining on its own. For example, there is the need for:
- Faster time to market and closing the loop
- Real-time analytics
- Leveraging existing IT skills
- Building repeatable processes and tasks
- Efficiency and effectiveness
- Cost reduction of mining analytics
The following sections discuss each of these business drivers.
1.2.1 Faster time to market and closing the loop

Time to market is essential to effectively use data mining, across industries such as retail, telecommunications, and finance. In retail, for example, recency is short for perishable products in the fast-moving goods sector. In the food retail business, it is essential to know in time when and which products to lay out, or which product to price competitively for the next days or weeks, to achieve a profitable turnover. In the telecommunications business, time to market is more and more an essential way of doing business: it provides customers a means to select their preferred manner of communications at prices that may change from time to time, to minimize the possibility of losing them to the competition. And in the finance industry, financial institutions provide services where real-time analytics are useful to detect fraudulent behavior. Banks offer a large set of services that customers may select or configure themselves to match their preferred customer touch point. The flexibility offered by a bank opens the possibility of fraudulent behavior, and people may use the services for just enough time before any fraudulent behavior is actually detected. Fraudulent behavior is, by nature, very agile, because customers sense they will not be easily detected if they connect and disconnect to services in short bursts of time. Therefore, it is in the interest of banks to discover such behavior as soon as possible to reduce the costs and risks of loss. Another issue where time to market is noteworthy relates more to the analytical-process side than the business-application side. Several data mining vendors have come up with the right tools and introduced a data mining process to the marketplace. Over the past years, a generic data mining method has evolved in the data mining community.
This method is meant to be used by organizations that need to perform a data mining exercise, end-to-end, for the first time in their business environment. It is in such full-blown, months-long data mining projects that you also find the data mining workbenches, in the early stages of data mining within organizations. In regard to time to market, many organizations so far have not considered refining scoring models from their data warehouse environment, datamarts, or operational data stores and passing the scores back to the operational databases with transaction-oriented data. Closing the loop from retrieval, through discovery, to actual deployment is certainly the next stage for most businesses to gain Return On Investment (ROI) by applying data mining in a business environment. Figure 1-3 displays this generic data mining method. The steps that are involved are outlined here:
1. Define the business issue in a precise statement.
2. Prepare the data:
   a. Define the data model and the data requirements.
   b. Source the data from all your available repositories and prepare it.
   c. Evaluate the data quality.
3. Choose the mining function and run it.
4. Interpret the results and detect new information.
5. Deploy the results and the new knowledge into your business.
Figure 1-3 The generic data mining method (the figure shows the cycle: define the business issue, prepare the data, mine the data, interpret the results, and deploy the results; a mining workbench supports the traditional mining steps, while SQL and RDBMS functions for scoring support the deployment)
Effective and timely deployment of mining results becomes a problem with this data mining method if you do not place the right skills and technology at the right moment in the overall process. Typically, the time to move from step 1 to step 5
can take several weeks to several months, depending on the maturity of the data warehouse or operational data stores, if they are in place at all. Eighty percent of that time is taken by steps 1 through 4, regardless of how well the business issue is defined and how well described and cleansed the data is before the mining run starts.

Fortunately, new application integration requirements open the door to integrating mining functions, such as modeling and scoring, with the relational database management system (RDBMS). This allows for a faster ROI and a quicker answer to the business. It also addresses the need to apply the modeling functions to any data preparation source, and to score new transactions, without requiring in-depth knowledge of the mining techniques.

The new mining functions for both modeling and scoring are now more closely integrated with the RDBMS. They shift the focus to the mining and deployment steps of the overall generic data mining process, which addresses the time-to-market and closed-loop driver. See Figure 1-4.

With the advent of new applications where the need for quicker time to market is more urgent than before, there is less concern with the accuracy or goodness of fit of the original data mining model. Here are some examples:

- Sacrificing accuracy for speed results in faster time to market of, for example, marketing campaigns or competitive pricing strategies for up-sell or cross-sell.
- Fraudulent behavior is detected as soon as possible after customer touch points take place in a short span of time.
- Click-stream analysis satisfies the need where speedy analysis and targeting are more important than accuracy of close to 100 percent.

The focus of data mining runs has evolved away from trying to develop highly accurate models, an effort that lasts for months to years.
The focus has shifted to models that support CRM analytics with enough flexibility and speed, at the cost of possibly missing precise targeting of each individual customer. Who has not heard that timing is essential? It is not just about the content, but also about the moment in time at which we convey a message. We need to act upon information that we gather from data and day-to-day business experience to seize an opportunity in business. We do not want to wait for days to act or react.
Figure 1-4 Focus on modeling and deployment of scores (the figure shows the same cycle, with SQL and RDBMS functions for modeling supporting the mining step and SQL and RDBMS functions for scoring supporting the deployment step)
Visualization, scoring, and deployment have become the key words in the data mining methodology. The modeling itself, that is, the choice of the mining technique, is less important in contrast to the need for real-time analytics. We prefer less exact preparation of the model parameters, and accept a less than 100% correct selection of the data, if it means delivering a model within a matter of days instead of weeks. The goal is to quickly produce an actual deployment in an operational business environment and finally succeed in applying a closed-loop, real-time CRM application. As Figure 1-4 shows, scoring and deployment become the center of focus in the data mining process. Preparing the data and the data model is done with time to market in mind, at an acceptable loss of accuracy. Visualization may be part of the process, but merely for visual inspection of the model and the initial results of the first run.
The real reward comes when the model is deployed. This is a deployment in the operational environment that integrates the outcome of the mining model, the scores, with the actual data. In the end, for example, a third-party tool that supports a specific CRM task uses the scores to address the real reward: overall customer satisfaction. Or the mining model is triggered by an event, such as a customer interaction via one of the preferred customer touch points, which leads to a new score of the customer lifetime value or of some other propensity. Deployment takes many forms. Refer to 3.5, “Integrating the generic components” on page 44, which explains deployment issues in more detail.
1.2.2 Real-time analytics

Real-time analytics is a driver that is closely related to the need for faster time to market and a closed loop. But here the focus is not, for example, on delivering predictive scores faster for a campaign scheduled to run in the weeks after it has been decided for which segment of the overall customer base the campaign will run. Real-time analytics means real-time scoring: a new customer initiates a transaction via their preferred channel. The channel can be a phone call to a customer sales center, a personal visit to one of the offices, or an online form at the Web site where the company promotes its products. Customer data gathered via any of these channels is fed into the front-end application. The company then gives instant feedback, information, and offerings suited to the customer, for whom a profile is generated within seconds. This is a common practice among companies today. Scoring data on entry is what sets this apart from traditional stand-alone data mining, which focused on mining data from tables. Trigger-based scoring, for example the so-called trigger-based marketing concept for faster targeted marketing, serves real-time analytics and scoring of data on entry. This results in faster time-to-market responses by business-to-consumer (B2C) companies. Real-time analytics, initial modeling, and iterative recalibration of the models make the most sense in fast-moving markets, in highly competitive markets, and in markets where customer interaction is vital to gain a high return on investment. For example, the Bank of Montreal relies on scoring to execute models. It uses real-time scoring to track changes in customer profiles, such as customers who recently purchased certain products (indicating they may be interested in a new product) or who closed several accounts in one week (indicating they are about to churn).
When there is a change in a customer profile, users get the score
without waiting for a report at the end of the month. Each time something changes in a customer profile, the scoring application immediately reruns the scores. Real-time scoring also enhances applications with the ability to simulate and perform micro-testing. With Intelligent Miner (IM) Scoring, input data for scoring can come both from a database table and from entry data. For example, you can score a known customer with altered values to study the effects of an action.
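A what-if score of this kind might be obtained with a query along the following lines; it rescores one known customer with an artificially raised income. The table, model, and function names are assumptions in the style of the IM Scoring SQL API, not its documented signatures:

```sql
-- Hypothetical sketch: what-if scoring -- rescore customer 12345 as if
-- their income were 10% higher (all names here are illustrative).
SELECT IDMMX.DM_getPredClass(
         IDMMX.DM_applyClasModel(
           (SELECT model FROM IDMMX.CLASSIFMODELS
             WHERE modelname = 'ChurnModel'),
           IDMMX.DM_impApplData(
             REC2XML(1.0, 'COLATTVAL', 'row',
                     c.age, c.income * 1.10)))) AS what_if_class
  FROM customers c
 WHERE c.customer_id = 12345;
```

Because the altered value never touches the base table, the simulation leaves the operational data untouched.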
1.2.3 Leveraging existing IT skills

These days staff turnover is high. When staff with data mining skills leave your company, automating (part of) the mining process is the way to tackle this issue. If several tasks in the overall data mining process are automated as far as possible, the risk of losing critical skills is reduced. The actual deployment of scoring results can then be done by someone with SQL programming skills, who brings the scores back to the data warehouse, specialized datamarts, and operational databases.
1.2.4 Building repeatable processes and tasks

Building repeatable business processes and tasks to reduce redeployment and maintenance costs is another business driver for deploying a scoring model and scores in a business environment. You need to implement a process so that the essence of a successful model is captured, made operational, and guarantees repeatable success. You want to use models that are tightly integrated with databases. Without tight integration, you would still have to deploy the scoring results manually in the data warehouse or operational databases, and you would not benefit from the automation facilities that database management systems offer. With tight integration, the deployment of scoring results is done with ease and with less intervention to make the results available. The scores must be updated at regular intervals for the obvious reason that the business environment changes all the time. Repeatable, and eventually automated, processes are the way to go for deployment. This way you can schedule batch runs for automated rescoring or recalibration of models, which is cost effective. Another interesting possibility when you have a repeatable process is regenerating the model based on new data, or triggering a new model to be built when the old one no longer fits.
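A scheduled rescoring run can then be as small as one SQL statement that a scheduler executes, for example, monthly. The scoring function and model table names below are illustrative assumptions, not the documented IM Scoring API:

```sql
-- Hypothetical sketch: batch rescoring -- refresh the stored score for
-- every customer in a single statement (names are illustrative).
UPDATE customers AS c
   SET churn_class =
       IDMMX.DM_getPredClass(
         IDMMX.DM_applyClasModel(
           (SELECT model FROM IDMMX.CLASSIFMODELS
             WHERE modelname = 'ChurnModel'),
           IDMMX.DM_impApplData(
             REC2XML(1.0, 'COLATTVAL', 'row', c.age, c.income))));
```

Because the whole run is one statement, any standard database scheduling facility can automate it without extra tooling.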
1.2.5 Efficiency and effectiveness

When modeling and scoring analytics must be applied in the business environment, effectiveness and efficiency become a driver. Technology is also needed to provide ease of use and ease of interaction close to where the data resides. To gain scoring results in an efficient manner, and to have effective scoring models to achieve those results, business applications require interaction and tighter integration between different modeling and scoring techniques. One technique is no longer sufficient to model a customer’s behavior. Consider an example where you need to profile a customer and predict the likelihood that they will accept a targeted marketing campaign. Figure 1-5 shows an example of both scoring against a profile and predictive scoring. You can combine the separate profile and target models into a sequential scoring model, which may well be more effective in profiling purchase habits and targeting new customers via campaigns.
Figure 1-5 Integration between different modeling and scoring techniques (panel 1, Profiling, asks “Who is our new customer?” and answers it through segmentation; panel 2, Targeting, asks “Are they likely to buy the product once we send them a specific offering in our next campaign?” and answers it with a decision tree of yes/no outcomes)
1.2.6 Cost reduction of mining analytics

Scoring data on entry enables what-if analysis, which was not possible before, even when data mining had become somewhat industrialized. What-if analysis can reduce the cost of mining analytics and is itself a driver for deploying a scoring model. Why? You may need to decide whether a particular segment of customers in the overall customer base is likely to migrate or merge into another segment that is more profitable to the business. In this case, you can cut costs by simulating the model instead of applying it in an actual segment-targeted campaign. Simulation is carried out, for example, by a telecommunications company in cross-sell campaign programs. The company would simulate whether a flat-fee offering to a region- and language-specific segment of customers in its database helps to migrate those customers to another segment characterized by heavier use of flat-fee services. In a similar way, food retailers simulate different pricing strategies to evaluate the possible effect of cross-selling products to specific groups of customers without actually having to reprice the products on the shelves. And banks may want to use scoring models to simulate new rules, or combinations of rules and hierarchies of their customers, to introduce different interest-pricing offerings for products they service to select groups of customers. This way they can gain more profit without taking on more risk.
Chapter 2. Overview of the new data mining functions

The new modeling and scoring services directly integrate data mining technology into the relational database management system for rapid application development and deployment. This chapter describes the new data mining functions (Intelligent Miner (IM) Modeling, IM Scoring, and IM Visualization) and the philosophy behind them. It provides:

- An overview of the separate DB2 Universal Database (UDB) functions for Scoring, Modeling, and Visualization
- The principles and capabilities of these mining functions
- The whole picture of how the three mining functions interact
- A positioning of these DB2 mining functions relative to DB2 Intelligent Miner for Data
2.1 Why relational database management system (RDBMS) functions

What makes the mining functions so appealing is their integration with DB2 UDB. Built as DB2 data mining functions, the new Modeling and Scoring services directly integrate data mining technology into DB2 UDB. This leads to faster application performance. Developers want integration and performance, as well as any facility that makes their job easier.

Note: The Scoring data mining function is available on both DB2 UDB and Oracle. IM Scoring can be integrated into applications in the same manner, even in an environment where different database systems are used.

The benefits of storing a model and scores as database objects are:

- Administrative efficiency
- Ease of automation and integration
- Operational efficiency
- Performance advantages
2.1.1 Ease of automation and integration

The application programming interface (API) of the mining functions is tightly integrated into Structured Query Language (SQL). A model can be stored as a Binary Large Object (BLOB) and used within any SQL statement. This means the scoring function can be invoked with ease from any application that is SQL aware, whether in batch, in real time, or as a trigger. Batch mode also facilitates automatically scheduled runs of the applications that invoke the models, so scheduled scoring is possible. Because DB2 UDB is ODBC-, OLE DB-, JDBC-, and SQLJ-enabled, the model can be easily integrated with applications.
2.1.2 Operational efficiency

Storing a model and scores as database objects also leads to operational efficiency. For instance, storing models in a relational table allows more efficient marketing automation. Versions of models that are calibrated at different points in time, or models for different marketing segments, can all be stored in a single table. See Figure 1-5 on page 13.
Marketers can mix and match models with different marketing campaigns with ease and little coding. When a model is updated, there is no need to change the scoring code; the new SQL mining functions eliminate this error-prone task. Updating a model is as simple as updating a value in the database. Besides, storing a model and scores as database objects facilitates simulation. For example, the analyst can simulate the effects of actions on selected customers that the model segmented together. Or the analyst can study (simulate) the impact of a change in the values of key variables in the clustering (the variables are already ranked by the cluster model, which helps the analysis) to define actions that migrate customers from one segment to another. These operations are efficient because the analyst has direct access to both the model and the simulated scores in one environment, the database. Another point is transparency. You can hide the application of a model to data behind a database view. Also, the complexity of applying models is stored in the database itself, and any application can access it.
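The transparency point can be sketched as a view, so that applications simply select a score column without seeing the model plumbing. The table, model, and function names below are illustrative assumptions, not the exact IM Scoring API:

```sql
-- Hypothetical sketch: hide the scoring call behind a view
-- (all IDMMX names here are illustrative).
CREATE VIEW churn_scores (customer_id, churn_class) AS
  SELECT c.customer_id,
         IDMMX.DM_getPredClass(
           IDMMX.DM_applyClasModel(m.model,
             IDMMX.DM_impApplData(
               REC2XML(1.0, 'COLATTVAL', 'row', c.age, c.income))))
    FROM customers c,
         IDMMX.CLASSIFMODELS m
   WHERE m.modelname = 'ChurnModel';

-- Applications then query scores like any other column, for example:
-- SELECT customer_id FROM churn_scores WHERE churn_class = 'high risk'
```

When the model row is replaced by a newer version, the view definition and every application built on it stay unchanged.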
2.1.3 Performance

There is a performance advantage to invoking a user-defined function (UDF) or method: it runs inside the database engine rather than as part of your application. The mining functions are executed where the data is stored, so there is no overhead of copying all the data to the application. An additional performance benefit occurs when working with a Large Object (LOB). You can create a function that extracts information from a LOB right at the database server and passes only the extracted value back to the application. This strategy avoids passing the entire LOB value back to the application and extracting the needed information there.
2.1.4 Administrative efficiency

The database administrator (DBA) can administer access from a single point of control, such as the DB2 Control Center. This means more administrative efficiency: when you perform a database or table space backup, the models and scores are backed up automatically. Models and scores stored inside the database immediately inherit the benefits of a secure database. Access to the models and scores is controlled from a single point, the DBA. This is far more secure than storing the models in the file system.
Installation and maintenance are also simpler. There is no need to set up additional tools or client/server interfaces. A DBA can manage the configuration using standard database tools, and every piece of information is simply an object in a database table.
2.2 Scoring: Deploying data mining models

Scoring is the use of existing data mining models that are based on historical data by simply applying those models to new data. For example, you may have a classification model that contains rules for estimating the churn risk of customers. Given the profile data about a particular customer, the scoring function computes the churn risk. You can apply this function in real time to a single record, for example, for the customer who is currently talking to someone in the call center.

Note: Scoring does not come up with new models or new predictive rules. The actual mining models that contain the rules are computed by other modeling functions or are imported from external mining workbenches.

The guiding principle behind DB2 IM Scoring is the notion of mined rules versus manually defined business rules. The business environment is in flux: frequent updates are necessary, and rules age quickly. Business rules are often identified manually, which is a laborious and time-consuming process. IM Scoring identifies unsuspected targets, opportunities, and problems automatically. For example, new data (a change in behavior or in the characteristics of transactions) may trigger the IM Scoring application to produce a score based on the underlying mining model. It then matches this score against a range of other scores to automatically signal whether the result is expected or unexpected. IM Scoring is an economical and easy-to-use mining deployment capability that:

- Is a database extension
- Is implemented in batch mode or real time
- Supports the new Predictive Model Markup Language (PMML)
- Leverages existing IT skills
2.2.1 Scoring as an SQL extension

IM Scoring is an economical and easy-to-use mining deployment capability because it is implemented as a database extension. The scoring functions are simple, standard extensions to SQL, so the actual scoring is easy to implement through the SQL interface. You can combine the scoring functions with any SQL query, VIEW, or TRIGGER. IM Scoring uses the well-known industry standard PMML for mining models.
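As an illustration of what such an SQL extension might look like, the following sketch applies a stored classification model to every customer row. All schema, table, and function names (IDMMX.CLASSIFMODELS, DM_applyClasModel, DM_getPredClass, DM_impApplData) are assumptions in the style of the IM Scoring SQL API; check the product reference for the exact signatures:

```sql
-- Hypothetical sketch: score every customer with a churn model that
-- is stored as a BLOB in a models table (names are illustrative).
SELECT c.customer_id,
       IDMMX.DM_getPredClass(
         IDMMX.DM_applyClasModel(
           (SELECT m.model
              FROM IDMMX.CLASSIFMODELS m
             WHERE m.modelname = 'ChurnModel'),
           -- build the input record for the scoring function
           IDMMX.DM_impApplData(
             REC2XML(1.0, 'COLATTVAL', 'row', c.age, c.income))))
         AS churn_class
  FROM customers c;
```

Because this is ordinary SQL, the same expression can equally be wrapped in a view or a trigger, or run as a scheduled batch statement.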
2.2.2 Batch mode and real time

Scoring can be implemented in batch mode or by transaction in real time. In both cases, the invocation is done either by using the SQL API in SQL statements or by using the Java API in applications built in Java. Batch mode can be used, for example, in a telecommunications industry scenario where the churn model is run on a regular basis, probably monthly or quarterly, for customers who have to renew their contracts. Scoring by transaction in real time is what is dubbed real-time scoring. Real-time scoring takes place in two variations. The first variation occurs, for instance, in a bank scenario in which a loan officer scores a loan applicant's request online, on the basis of the applicant's online input to the mining model for the loan application. Similarly, in a retail environment, the model runs on demand when a customer visits a touch point in the store; if the customer appears similar to previously targeted consumer groups, they receive the offering. The other variation occurs in e-business scenarios where the input data for scoring may include data that has not yet been made persistent in the database. Think of personalization in WebSphere, where the current data may depend on the most recent mouse click on a Web page. In this case, a small Java API for the scoring functions allows for high-speed predictive mining as well.
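Trigger-based scoring on entry, the real-time variation, can be sketched as a DB2 before-insert trigger. Again, the scoring function and model table names are illustrative assumptions; consult the IM Scoring reference for the documented API:

```sql
-- Hypothetical sketch: score each new transaction as it is inserted,
-- so the fraud class is available the moment the row arrives.
CREATE TRIGGER score_on_entry
  NO CASCADE BEFORE INSERT ON transactions
  REFERENCING NEW AS n
  FOR EACH ROW MODE DB2SQL
  SET n.fraud_class =
      IDMMX.DM_getPredClass(
        IDMMX.DM_applyClasModel(
          (SELECT model FROM IDMMX.CLASSIFMODELS
            WHERE modelname = 'FraudModel'),
          IDMMX.DM_impApplData(
            REC2XML(1.0, 'COLATTVAL', 'row', n.amount, n.channel))));
```

The inserting application needs no mining code at all; the score simply appears in the row it writes.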
2.2.3 Support for the new PMML 2.0 standard

IM Scoring uses the PMML standard so that it can access mining models that have been generated by DB2 Intelligent Miner for Data. It also uses this standard to work with the mining models of other data mining tool vendors, such as SAS with its data mining workbench SAS Enterprise Miner (SAS/EM).
More on PMML: PMML is an XML-based language that provides a quick and easy way for companies to exchange models between compliant vendors' applications. It gives applications a vendor-independent method of defining models, so that proprietary issues and incompatibilities are no longer a barrier to exchanging models between applications. It allows users to develop models within one vendor's application and to use other vendors' applications to visualize, analyze, evaluate, or otherwise use those models. Previously this was virtually impossible, but with PMML, the exchange of models between compliant applications is now seamless. Because PMML is an XML-based standard, the specification comes in the form of an XML Document Type Definition (DTD).

IM Scoring provides a model conversion facility that converts mining models from DB2 Intelligent Miner for Data format to PMML 2.0 format. The most important advantages of IM Scoring using PMML are:

- Computational resources are exploited efficiently.
- It allows real-time (event-triggered) processing and batch processing.
- It gives foreign software applications access to the modeling logic.
- It allows the generalization of any future model type.
- Data mining experts on-site are not necessary (see the following section).
- It enhances ease of integration between mining and core operational systems.
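To give a feel for the format, here is a minimal, illustrative PMML 2.0 skeleton for a classification tree. The structure follows the PMML DTD, but the field names and values are made up for this example; a real exported model also carries full node predicates and statistics:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<PMML version="2.0">
  <Header copyright="example"/>
  <DataDictionary numberOfFields="3">
    <DataField name="age"    optype="continuous"/>
    <DataField name="income" optype="continuous"/>
    <DataField name="churn"  optype="categorical"/>
  </DataDictionary>
  <TreeModel modelName="ChurnModel" functionName="classification">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="income"/>
      <MiningField name="churn" usageType="predicted"/>
    </MiningSchema>
    <Node score="no">
      <True/>
      <!-- child nodes with split predicates omitted -->
    </Node>
  </TreeModel>
</PMML>
```

Any compliant consumer, whether IM Scoring, a visualizer, or a third-party tool, reads the same document.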
2.2.4 Leveraging existing IT skills

To use the DB2 data mining function for IM Scoring effectively, the following skills are required (see Figure 2-1):

- For the IT specialist: To use the SQL API, the specialist needs basic SQL skills to apply data management procedures within the business and to actually deploy the scores in the business environment. Java skills may be required when using the Java API.
- For the business users (for example, marketing or sales): These users can use IM Scoring through the interface or report built by the IT specialist. They must have business domain knowledge and a thorough understanding of the business issue to check the usability of the scores and to evaluate the deployment from a business perspective.
Figure 2-1 IM Scoring and skills (the figure shows the SQL and RDBMS functions for scoring at the intersection of IT skills, business skills, and data mining skills)
If the IT specialist who develops the IM Scoring functionality has Java skills, or skills in a reporting tool (such as QMF, Business Objects, or similar), to build a Java interface or a report, then anyone with some business skills can launch IM Scoring.
2.3 Modeling: Building a mining model using SQL

Using the DB2 data mining functions for modeling gives you an extra push when you analyze data: you can find an adequate data mining model easily and in a short time. IM Modeling provides this ease of use as a set of functions that are an add-on service to DB2 UDB. The set consists of user-defined types (UDTs), UDFs, methods, and stored procedures that extend the capabilities of DB2 UDB to include data mining modeling functions. You can easily use these functions from SQL statements. Modeling offers:

- Interoperability
- Building and using data mining models stored in DB2 UDB
- Support for the new PMML V2.0 standard

It requires both IT skills and a basic understanding of data mining.
2.3.1 Interoperability

The models may have been developed by other applications and tools that support interoperability via PMML models, or models from DB2 Intelligent Miner for Data may have been exported as PMML models. In the modeling phase of a data mining project, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, you can choose one or more of several techniques for the same data mining problem type. For example, you may try to deduce a model to predict credit ratings for new customers of a bank. In this case, you use the neural prediction technique to generate the model. In parallel with, or subsequent to, the neural prediction effort, you can use the decision tree technique to state the decision rules of the model that you will use over the following months.
2.3.2 Models in DB2 UDB

You perform modeling essentially by making calls to the DB2 data mining functions; for example, you call the mining technique that you want to use in the modeling stage. The SQL API provides the means to make these calls. It enables SQL application programs to call associations discovery, clustering, and classification techniques to develop models based on data that is (also) accessed by DB2 UDB. The models reside in DB2 UDB, which makes for ease of management, centralized control, and security. In addition, keeping multiple versions of models in one central database makes it easier to switch between models and simplifies oversight. This ease of management leads to cost reduction in data and model management.
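A modeling call of this kind could look roughly like the following sketch, which builds a clustering model and stores it in a models table. The settings type, method, and build function names are hypothetical placeholders in the style of the IM Modeling API; the real names and signatures are in the IM Modeling reference:

```sql
-- Hypothetical sketch: build a clustering model entirely in SQL and
-- store it as a database object (every IDMMX name is illustrative).
INSERT INTO IDMMX.CLUSTERMODELS (MODELNAME, MODEL)
  VALUES ('CustomerSegments',
          IDMMX.DM_buildClusModel(
            -- which table and columns to mine
            IDMMX.DM_MiningData()..DM_defMiningData('CUSTOMERS'),
            -- technique parameters, e.g. an upper bound on clusters
            IDMMX.DM_ClusSettings()..DM_setMaxNumClus(9)));
```

Because the result lands in an ordinary table, the scoring and visualization functions can pick the model up directly.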
2.3.3 Support of the new PMML 2.0 standard

The move from traditional data mining workbenches (SAS Enterprise Miner, DB2 Intelligent Miner for Data) toward more modular tooling for performing and deploying data mining is gaining momentum, now that many leading data mining product vendors conform to the PMML standard format for mining models. The result is that organizations may now buy a mining model from one company and then use the visualization and application features of another company's tooling to deploy the model to their business data. For example, a bank may decide to use a model developed with SAS/EM, export the model in PMML, and then use the IBM mining tooling for scoring and visualization to deploy the results to the operational business environment. Vice versa, you can use the IBM mining tooling to model the data and then decide to incorporate the model in the deployment phase of the mining application. You can incorporate the prebuilt mining model, via PMML import, into a Customer Relationship Management (CRM) application such as Siebel. In this manner, the organization does not depend on a single vendor, nor does it rely on the buy-in of one mining model. Instead, it can choose a package on the basis of best-of-breed solutions for the business issue at hand.
2.3.4 Required skills

You need the following skills to use the DB2 data mining function for IM Modeling (see Figure 2-2):

- SQL programming: You need this knowledge to use the SQL API to call the mining technique you want to use in the modeling stage. You also need it to apply data management procedures that provide (read-only) access to the models, so that the business can run scoring applications.
- A basic understanding of data mining and of the business issue: You need this knowledge to call an appropriate mining technique.
Figure 2-2 IM Modeling data mining functions and skills (the figure shows the SQL and RDBMS functions for modeling at the intersection of IT skills, business skills, and data mining skills)
2.4 Visualization: Understanding the data mining model

The DB2 data mining function IBM DB2 Intelligent Miner Visualization (IM Visualization) provides Java visualizers to present data modeling results for analysis. Visualization provides analysts with visual summaries of data from a database. It can also be used as a method for understanding the information extracted using other data mining methods. Features that are difficult to detect by scanning rows and columns of numbers in databases often become obvious when viewed graphically. Visualization is based on the following principles:

- Interoperability
- Choice in use
- Multiplatform capability
- Support of the new PMML V2.0 standard

It also requires business analyst skills.
2.4.1 Interoperability

The models may have been developed by using IM Modeling, or by other applications and tools that support interoperability through the use of PMML models. Or models from DB2 Intelligent Miner for Data may have been exported as PMML models. The different visualizers interoperate with these models.
2.4.2 Choice in use

Applications can call the Intelligent Miner visualizers to present model results. You can also deploy the visualizers as applets in a Web browser for ready dissemination.
2.4.3 Multiplatform capability

Because the IM Visualization functionality is written in Java, you can install and run the visualizers on different hardware and software platforms.
2.4.4 Support of the new PMML 2.0 standard

You can use the Intelligent Miner visualizers to visualize PMML-conforming mining models.
2.4.5 Required skills

As with any end-user-oriented visualization tool that has a dynamic graphical user interface, you need the following skills to use the DB2 data mining function for IM Visualization effectively (see Figure 2-3):

- A basic understanding of the graphical user interface functionality, such as mouse click-through, drill-down, and resizing
- Strong business domain knowledge
- A basic understanding of how to read data mining results presented as decision trees, tree rules, clusters and patterns, and association rules
Figure 2-3 IM Visualization and skills (the figure shows the functions for visualization at the intersection of IT skills, business skills, and data mining skills)
2.5 IM Modeling, Scoring, and Visualization interactions

This section explains the whole picture of using the three DB2 data mining functions and how to deploy them. It presents the different packaging options, and it positions the DB2 data mining functions relative to Intelligent Miner for Data.
2.5.1 The whole picture

To start with data mining, you use the standard data mining process (see also Figure 1-3 on page 8). You develop an application into which you want to include some interfaces to mine data, so as to actually benefit from the mining results in your business environment. Your data mining project consists of these five data mining process phases:
1. Preparing the data
2. Building the data mining model
3. Testing the model, in the case of a classification model
4. Visualizing the results
5. Applying the model

IM Modeling enables you to reduce the work in phase 1 and to complete phase 2 of the data mining process. In the case of the classification mining function, IM Modeling also enables you to test the model. For phase 4, use IM Visualization, which displays your data mining results and helps you to analyze and interpret them. For phase 5, the application phase, use IM Scoring. If the mining models are created by means of IM Modeling, IM Scoring can apply these models directly, because IM Modeling writes the models into database tables. Mining models that are applied through the SQL API of IM Scoring must be contained in database tables.
2.5.2 Configurations

You can use the three DB2 data mining functions, IM Modeling, IM Scoring, and IM Visualization, together to cover all the phases in the data mining process. However, conformance to such standards as SQL and PMML offers you the choice to use the three DB2 data mining functions independently, in different package combinations with tools from other data mining vendors that conform to the PMML standard format. For example, you can define predictive models in IM Modeling and share the models between compliant vendors' applications. Or you can buy a model from one vendor, use IM Scoring to produce scoring results, and use another vendor's application to visualize the results.

As Figure 2-4 shows, you can expect the following different packages to be used from modeling until deployment into the business environment:
- IM Modeling
- IM Scoring in combination with a PMML model imported from another mining tool (for instance, DB2 Intelligent Miner for Data or SAS/EM)
- IM Modeling in combination with IM Scoring
- IM Modeling with PMML export to a third-party scoring tool
- IM Modeling with PMML export to a third-party scoring tool, plus IM Visualization
- IM Modeling, Scoring, and Visualization as a fully integrated package

All of these packaged solutions can also be combined with third-party end-user software (for instance, Business Objects, Siebel, or SAP).
Figure 2-4 Configurations (the packaging options range from IM Modeling alone, IM Scoring with an imported PMML model, and IM Modeling with IM Scoring, up to the fully integrated IM Modeling, Scoring, and Visualization package; models flow as PMML to applications and third-party end-user software such as Business Objects, Siebel, and SAP, with scores and visualizations flowing back)
2.5.3 Positioning with Intelligent Miner for Data

DB2 Intelligent Miner for Data is a suite of statistical, preprocessing, and mining functions that you can use to analyze large databases. It also provides visualization tools for viewing and interpreting mining results. Data mining is an iterative process, as Figure 1-3 on page 8 shows, that typically involves selecting input data, transforming it, running a mining function, and interpreting the results. DB2 Intelligent Miner for Data assists you with all the steps in this process. You can apply its functions independently, iteratively, or in combination. Mining functions use elaborate mathematical techniques to discover hidden patterns in your data. After you interpret the results of your data mining process, you can modify your selection of data, data processing and statistical functions, or mining parameters to improve and extend your results.
Chapter 2. Overview of the new data mining functions
27
DB2 Intelligent Miner for Data is designed as a full data mining workbench. A workbench environment supports a data mining specialist who goes through the complete data mining process. When you start from scratch, an interactive mining workbench, such as DB2 Intelligent Miner for Data, may be the fastest way to create a model, because workbenches are the right tools for exploratory data mining: they involve less programming effort and assist the exploration with mining techniques embedded in a GUI environment. But beyond exploration, once the scenario is defined, the API of the DB2 data mining functions helps to actually close the loop with the operational applications and reduces the effort for continuous deployment. DB2 Intelligent Miner for Data does not offer, right from the start, the real-time deployment support that a business analyst, data warehouse administrator, or operational database administrator with SQL programming skills would look for. This is a major difference from what the DB2 data mining functions offer. As Figure 2-5 shows, the combination with IM Scoring offers support for the deployment of the models and scores in the operational business environment.
Figure 2-5 Positioning DB2 mining functions with DB2 Intelligent Miner for Data (IM4D provides modeling, scoring, and visualizing through a workbench; IM Modeling, IM Scoring, and IM Visualization provide modeling, scoring, and visualizing through DB2 functions, plus deployment to the operational database)
For a complete overview of DB2 Intelligent Miner for Data, see:
- Intelligent Miner for Data Applications Guide, SG24-5252
- Intelligent Miner for Data V8.1 Application Programming Interface and Utility Reference, SH12-6751

In addition to the differences in use within the generic data mining method and in the audience of users, DB2 Intelligent Miner for Data offers slight differences in functionality. Refer to Appendix J, "Demographic clustering: Technical differences" on page 299, to learn about the differences that matter most to those of you who will build mining models.

IM Scoring is a separate but companion product to DB2 Intelligent Miner for Data. It supports all scoring functions offered by DB2 Intelligent Miner for Data. For its mining deployment capability, it can use the PMML format from DB2 Intelligent Miner for Data. If the mining models are created by means of DB2 Intelligent Miner for Data, they must be exported from that product and imported into database tables so that the models can be applied by the SQL API of IM Scoring. IM Scoring provides UDFs to import the models.

Access to predictive mining results closes the loop between operational applications and the data warehouse analytical environment. Operational applications are all too often isolated islands where no data sharing takes place, yet data sharing is essential for an optimal customer-facing CRM process. The data warehouse analytical environment, on the other side of the spectrum, needs a means to provide a more customer-centric view; here, performance and time to market are often lacking. Access to predictive mining results (scores computed on the basis of a mining model) is the middleware between the two. For example, a call center application can automatically enrich current customer information with a predicted churn risk: the risk that this particular customer may leave and no longer use the services of the company.
The prediction is computed by a data mining scoring function on the basis of a mining model, plus data similar to what was stored in the data warehouse (but this time fed to a mining database as input to the mining model). The result is then transferred back to the operational data that maps to the call center customer. This redbook provides a series of examples of how IM Scoring is used in customized applications or in partner tools such as Business Objects, UNICA, or Siebel. You can learn more about these in 3.5, "Integrating the generic components" on page 44.
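The call center flow just described can be sketched in SQL. This is only an illustration: the table BANK.CUSTOMERS, its columns, the model name, and the file path are hypothetical, and the IDMMX functions shown (DM_impClasFile, DM_applyClasModel, DM_applData, DM_getPredClass, DM_getConfidence) follow our reading of the IM Scoring V8.1 SQL API; verify the exact names and signatures against the IM Scoring reference before use.

```sql
-- One-time step: import the PMML model exported from the workbench into
-- the models table that IM Scoring creates when the database is enabled.
INSERT INTO IDMMX.CLASSIFMODELS (MODELNAME, MODEL)
  VALUES ('ChurnRisk', IDMMX.DM_impClasFile('/models/churn_risk.pmml'));

-- At customer-contact time: enrich the customer record with the predicted
-- churn class and the confidence of that prediction.
WITH SCORED (CUST_ID, RESULT) AS (
  SELECT c.CUST_ID,
         IDMMX.DM_applyClasModel(
           m.MODEL,
           IDMMX.DM_applData(
             IDMMX.DM_applData('AGE', c.AGE),  -- mining field name, value
             'AVG_BALANCE', c.AVG_BALANCE))
  FROM   BANK.CUSTOMERS c, IDMMX.CLASSIFMODELS m
  WHERE  m.MODELNAME = 'ChurnRisk'
)
SELECT CUST_ID,
       IDMMX.DM_getPredClass(RESULT)  AS CHURN_CLASS,
       IDMMX.DM_getConfidence(RESULT) AS CHURN_CONFIDENCE
FROM   SCORED;
```

In a call center application, the second statement would typically carry a predicate on CUST_ID so that only the customer currently on the phone is scored on the fly.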
Interoperability between workbench and data mining functions

There are not only differences in usage and technology between the DB2 Intelligent Miner for Data workbench and the DB2 data mining functions for modeling and scoring. The technologies also interoperate. For example, consider a business case of a CRM application for campaign management, where the technologies come together as shown in Figure 2-6. In this scenario, the data mining expert develops a campaign management model that suggests which high-potential customers to target with the product:
1. They produce this model in DB2 Intelligent Miner for Data.
2. They export this model in PMML format and transfer it to the operational environment, where it is embedded in a Web application.
3. The marketing department launches the campaign.
4. The marketers use IM Scoring to produce scores for customers. They subselect customers based on a ranking of scores to target:
   - The high-potential customers (first), hopefully to gain a higher ROI from the campaign
   - All the newly scored customers
Figure 2-6 IM Scoring: Example CRM application for campaign management (1. The mining expert develops a model describing high-potential target customers in Intelligent Miner for Data against the data warehouse or data mart. 2. The model is transferred in industry-standard PMML format to the Web application on the operational server. 3. Marketing launches the campaign. 4. Offers are made to target customers through IM Scoring.)
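Once the scores produced in step 4 have been written to a table, the subselection by ranking can be expressed as an ordinary SQL query. The table and column names, the threshold, and the row count below are arbitrary illustrations, not part of the product:

```sql
-- First campaign wave: the highest-scoring (high-potential) customers.
SELECT CUST_ID, SCORE
FROM   CAMPAIGN.SCORED_CUSTOMERS
WHERE  SCORE >= 0.8             -- campaign-specific cutoff
ORDER  BY SCORE DESC
FETCH  FIRST 1000 ROWS ONLY;    -- budget-limited target list
```

Dropping the WHERE clause and the FETCH FIRST limit yields the second option, a list of all the newly scored customers.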
The previous example shows interoperability of the mining tools in a business scenario. Figure 2-7 shows a more technical and generic flow as an example of where the mining tools interoperate. Figure 2-7 also shows the data mining expert producing a model in DB2 Intelligent Miner for Data and exporting the model in the PMML format (an XML-based language). The scoring application uses the SQL interface of IM Scoring to apply the PMML mining model to data. Then a DB2 UDF is called to store the scored data back to the operational database.
Mining scenario for scoring models:
- Models could be supplied by a consultant, solution provider, or central support group within an enterprise.
- Models can be exchanged between data mining tools from compliant vendors.
- Added value: a consultant might merge purchased data, such as demographic or industry-specific data, with the data mined.

Figure 2-7 Interoperability of DB2 Intelligent Miner for Data with IM Scoring (the data analyst selects, transforms, mines, and assimilates data from the data warehouse with Intelligent Miner for Data; the resulting classification model is exported in XML (PMML) format; the scoring application calls a DB2 UDF through SQL to produce the scored data)
In summary, sourcing the data mining model can be done in several ways. If the model to be deployed is created in a data mining workbench, such as DB2 Intelligent Miner for Data, the model must be exported into a PMML file. Then you import the PMML file into DB2 UDB using an SQL script. If no mining model exists in the DB2 Intelligent Miner for Data workbench, you create an SQL script that uses the modeling API to create the model via the mining algorithm of IM Modeling. After the script is set up, another application or the batch scheduler can invoke it. Every time a mining run is invoked using the
SQL API of IM Modeling, a new mining model is created. When this model is applied using the IM Scoring API, individual records are scored. For example, in the case of clustering, the individual records are scored with a cluster identifier (number). At regular intervals, you invoke the mining run and the scoring run on the data. This way, you detect whether changes in customer behavior, and possibly in demographic characteristics, lead to a change of score, because the customer may be scored into a different cluster over time.
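A scoring run of this kind could look as follows. The customer table, the mining field names, and the model name are hypothetical; the IDMMX UDFs (DM_applyClusModel, DM_getClusterID, DM_applData) are part of the IM Scoring SQL API as we understand it, but their exact signatures should be checked in the product reference:

```sql
-- Refresh each customer's cluster assignment after a new mining run.
-- A change of CLUSTER_ID between runs signals changed customer behavior.
UPDATE BANK.CUSTOMERS c
SET    CLUSTER_ID =
       (SELECT IDMMX.DM_getClusterID(
                 IDMMX.DM_applyClusModel(
                   m.MODEL,
                   IDMMX.DM_applData(
                     IDMMX.DM_applData('AGE', c.AGE),
                     'INCOME', c.INCOME)))
        FROM   IDMMX.CLUSTERMODELS m
        WHERE  m.MODELNAME = 'CustomerSegments');
```

Invoked from the batch scheduler after each mining run, a statement like this keeps the operational table in step with the latest model.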
2.6 Conclusion

Database-induced modeling and scoring leads to single point-of-entry mining and deployment of actionable analytic results within a business environment. The reason is that the following items are all stored and accessible in one place in your RDBMS environment:
- Operational and demographic data
- Settings for the schematics of the models
- The mining models themselves
- Scores as predicted by the models
- Relationship data (call center, campaign, Web traffic)
Chapter 3. Business scenario deployment examples

This chapter introduces several business scenarios in different industries that face mining deployment and integration issues:
- Customer profiling in banking, which describes conventional mining with a workbench combined with scoring in the database
- Fraud detection in telecommunications, which demonstrates how modeling can be done in the database without any workbench and describes the modeling and scoring steps in SQL in more detail
- Campaign management in telecommunications and banking, which presents further variations of modeling or scoring in the database
- Up-to-date promotion in retail, which presents a further variation of using modeling in the database only

This chapter presents an overview of these scenarios, including both the business issues and business benefits. For a more in-depth explanation of these scenarios, see Part 2, "Deploying data mining functions" on page 49. This chapter also describes the generic environment and components for these scenarios that are used to integrate the data mining functions into the business applications.
3.1 Customer profiling

The challenge that most banking organizations face is that the volumes of data that can potentially be collected are so great, and the range of customer behavior is so diverse, that it seems impossible to rationalize what is happening. You may have reached the stage where you modeled the business environment and became successful at explaining the customer behavior to one another. Even at this stage, you still face the hurdle of ever-changing customer behavior. The question that you must address, beyond discovering new information about customers and using it to drive business forward, is how to anticipate changes in customer behavior frequently and in near real time. Traditionally, there are several banking business questions to which data mining workbenches can provide answers, for example:
- What are the characteristics of my customers?
- To which customers should I target specific products or offers?
- Which of my customers are likely to leave?
- Which customers should I try to acquire?
- Which customers may not be ones that I want to encourage?
- Can I identify previously known fraudulent behavior?
In this case, consider the question of how to characterize the customers that use the bank's services and products, using data that you routinely collect about them. The issue is to figure out which customers may not be ones that you want to encourage. For example, a depositor with a $200 average balance, a lot of call center time, and a lot of teller time may not be profitable for the organization. On the other hand, that same interaction profile with trust funds, a brokerage account, or other investments may be highly profitable. The net is that you want to determine which customers you want, and treat them appropriately, as well as which customers you don't want, and treat them appropriately.

To expand this question, you also need to look at the time factor: the need for real-time servicing and fast personalized service offerings to stay on top as a bank. A bank in today's competitive banking arena needs to profile and score new customers in a fast and easy manner. New customers to the bank who visit any channel (teller, Web site, call center, kiosk, and so on) for the first time have become spoiled because of what competitors offer. Also, the assumption that a customer stays with a bank once they apply for a checking account is something from the past. There seem to be more reasons for understanding which customers stay with or leave the bank and why customers
change their behaviors more frequently and on short notice. The bank has to keep up to date on these changes when it applies its business models to the competitive marketplace. Therefore, you choose to restrict yourselves to one very important business issue:
What are the characteristics of the newly arrived bank customers, the customers with whom you have had very few initial contacts through any bank channel in a very short or recent time period? How should you target them right now so that they stay on board?

We also explain how to operationally deploy the results into your business and how to use the scores. This is so important because all too often data mining is seen as an analytical tool for gaining business insight, but it is difficult to integrate into existing Customer Relationship Management (CRM) systems. Segmentations generated by data mining contain a large potential value for your business. However, this value is not realized unless the segmentations are integrated into your business and combined with actual scores on the basis of the segmentations.

Informational deployment, by integrating the result of your segmentation into your bank's Management Information System (MIS) or data warehouse, allows management and employees to analyze and make decisions on the basis of this new information. Working with the new segmentation in the system also allows them to become comfortable with the segmentation. An important business need is to see the segmentation results in an end-user tool that already exists in the organization (for example, Business Objects) and to use reports with characteristic information on the customers; the need is not to see the results in a workbench environment. End-user tooling such as Business Objects for reporting informs bank teller personnel and sales management about the behavioral, transactional, and demographic characteristics of the bank customers. Again, this occurs in near real time, because scoring happens on the fly when the customer contact takes place. This supports the bank personnel with the latest information at the time of customer interaction.
3.1.1 Business benefits

Clustering for customer profiling is used largely in CRM (see 3.3, "Campaign management" on page 39). The business insights provided by the use of clustering for customer profiling enable the bank to offer specific, personalized services and products to its customers. In the commercial environment, clustering and profile scoring are used in the following areas, among many others:
- Cross-marketing
- Cross-selling
- Customizing marketing plans for different customer types
- Deciding which media approach to use
- Understanding shopping goals
The first task for any commercial organization is always to know and profile the existing customers in the bank's customer base. Then, from there, score newly arriving customers in near real time to target them in a fast, timely manner, so that you can:
- Reuse customer profiles and scores in the operational databases
- Exploit customer profiles and scores and combine them using the end-user reporting tool
- Trigger additional actions based on customer profiles and scores
- Personalize any customer access based on these customer profiles and scores
Operational use and reuse

Feeding profile score results back to the operational databases, for both long-time customers and new customers, makes for effective use of customer profiles in your business environment. Reuse by other personnel and end-user applications is enhanced once the scores are fed back to the operational databases that are accessible to many end users within the bank.
Reporting and overviews

Customer profiles are visualized in a reporting tool for the end users to provide reports on a regular, daily basis or on demand. The end users could be bank personnel such as the bank manager and the bank teller operators. Reports and overviews are provided via the reporting tool with ease of use. They depict typical characteristics of several customers or of one particular customer.
Multi-channel usage

There is a need to understand the customer at any touch point of the organization. There is also a need to provide equivalent offers and treatment regardless of where customers "touch". A common database of customer information is necessary, along with consistent scoring algorithms. An example may be a customer who just closed or opened an account through the call center and then goes to the Web banking interface to move funds. You need to appropriately analyze this activity to determine whether they are increasing or decreasing their relationship or profit potential. You must also determine whether there are other services that may be appropriately offered.
Real-time personalization

Real-time personalization may address newly arrived customers via any of the customer-preferred channels to the bank. It offers more chances of addressing these new customers with better targeted services and personalized products. While you can quickly score and categorize a new customer, you must also apply the personalization to an existing customer, both on their entrance into the touch point and based on their actions while they are interacting. Real-time scoring, right from the start of the customer's initial contact and during every subsequent contact, raises your chances of a (prolonged) customer lifetime value. This business scenario, and the steps involved with respect to implementation and deployment to the Business Objects end-user environment, is discussed in more (and technical) detail in Chapter 4, "Customer profiling example" on page 51.
3.2 Fraud detection

The case for fraud detection in large organizations in the banking, insurance, telecommunications, and government sectors has been well documented. Evaluating transactions on a weekly or monthly basis after they occur may be good enough in businesses where the potential fraud is known. This is applicable in detecting white-collar crime such as fraudulent claims by service providers of any kind. However, other types of business transactions may require much more timely intervention. In the case of detecting unauthorized usage of credit cards, mobile phone calls made from stolen mobile phones, money laundering, and insider trading, you need a near real-time or very timely detection system.

The fraud detection scenario illustrates how to map a solution to the problem of fraud as experienced by a telecommunications service provider. Specifically, the scenario provides answers to the following questions:
- What are the characteristics of fraudulent connections?
- Which of the connections in the Call Detail Records is likely to be fraudulent?
- How can I generate a list of suspicious connections so our investigators can focus their attention on the urgent cases?
- We have millions of transactions in a day. How can we automate the process?
- Fraudulent patterns change, so how do we know our system can evolve with the changes?
3.2.1 Business benefits

The business benefits are multiple.
Up-to-date analysis on current data

This fraud detection scenario illustrates how the mining analysis can be activated at any time by a business user and how it uses the most recent data. There is no latency between the data analysis and the deployment of the results. This semi-automatic approach to fraud detection speeds up the time between learning from the past and deploying this knowledge for business advantage. There is some lead time while the business solutions are designed and configured, but once the system is set up, it returns results almost immediately.
A system that adapts to changes in undesirable behavior

The segmentation model is refreshed automatically at regular intervals. This allows you to detect new patterns of undesirable behavior. Because undesirable behavior changes over time, chances are that the new behaviors show up as outliers in some way, and this evolving approach is more capable of detecting new outliers. In addition, the semi-automatic approach ensures consistency and a repeatable process that an organization can use to gain productivity.
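Assuming that the scoring run writes each record's cluster assignment and a fit measure to a table (all table, column names, and the cutoff below are hypothetical illustrations), the daily outlier list reduces to a simple query:

```sql
-- Daily list of suspicious connections: records that fit their assigned
-- cluster poorly are treated as outliers and ranked worst-first.
SELECT CALL_ID, CUST_ID, CLUSTER_ID, FIT_SCORE
FROM   TELCO.SCORED_CALLS
WHERE  CALL_DATE = CURRENT DATE
AND    FIT_SCORE < 0.5          -- illustrative outlier cutoff
ORDER  BY FIT_SCORE ASC;
```

Because the model behind FIT_SCORE is refreshed at regular intervals, the same query keeps flagging newly emerging outlier patterns without any change to the application.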
Capture data mining expertise

Data mining skills are scarce. It is generally expensive to keep a full-time data miner on board to maintain a fraud detection model. This approach allows an organization to use the services of the data miner only for defining the model in the first instance. For example, external consultants can set up a model and then pass on the maintenance to the database development team for implementation.
Enhance repeatable success

The entire business process can be captured in a system maintained by IT, in a database maintained by a database development team and database administrator (DBA). This provides an environment that is conducive to automation and a repeatable process.
Potential for enhanced communication

Since every component is stored and maintained by IT, the entire model is documented. That includes data settings, model settings, tasks, and results. All
components can be queried and visualized by anyone who needs access. The communication process between the developers, data miners, and the business can be greatly enhanced.
Actionable result

The return on investment (ROI) of any Business Intelligence initiative depends on how an organization can turn knowledge into actionable steps. The results of this scenario are highly actionable. A list of risky transactions is generated on a daily basis. With such tools as Business Objects, the telecommunications service provider can regularly generate a list of customer transactions that warrant further investigation. This business scenario and the steps involved with respect to implementation and deployment are discussed in more (and technical) detail in Chapter 5, "Fraud detection example" on page 75.
3.3 Campaign management

Campaign management is one of the most interesting and challenging processes in the marketing arena. It needs a lot of data points to get to one-to-one (1:1) marketing (see Figure 3-1). It is impossible to analyze all of the data points without some mining or scoring capability.
Rich customer data ultimately fuels your ability to move customers along the relationship continuum.

Figure 3-1 Getting to 1:1 marketing (plots the ability to individualize products, services, and processes against the number of market segments: from mass marketing with 1-2 segments and no customer knowledge ("You don't know me"), through databased marketing with traditional and geodemographic segmentation and multi-dimensional query-based research, to discovery-based 1:1 relationship marketing with 500+ numerical models, customer buying patterns and forecasts of future behavior, and millions of individual customers as segments ("You understand my wants and needs"))
In an increasingly competitive environment, many companies struggle with the important issue of how to avoid losing customers and how to retain them. To effectively create and run a marketing campaign, professionals must be able to profile customers, products, and services; create a strategy to allocate the right amount of resources to maximize ROI; and measure the results of each action. They have to determine the right combination of customer type and attrition situation, as well as the best time to present each offer. These challenges are not easily overcome. The first task is often to determine who should receive a differentiated treatment. But how do you make this determination, especially when not enough history is available (new customers and prospects) or when the lack of profitability is due to unsuccessful cross-sales strategies?
Determining the right time to act and which offer to present requires careful planning. It also requires an integrated infrastructure that can support data exchange between the data storage systems (such as a data warehouse), the analytical systems (such as data mining and reporting applications), and the various channels (such as call centers). Although companies today have an enormous amount of customer data, direct marketers still struggle to extract useful and actionable information from those large databases. To effectively create a high-ROI marketing strategy, a shift from traditional list-generation (mass) campaign management to a more targeted and complete process was necessary to deal with:
- Different profiles for different campaigns
- Dynamic profiles for customers and products
- A large number of campaigns
- Promotion cannibalization
- Optimization for budget allocation
- Integration of analysis and channels
To achieve this goal, two key steps are necessary:
1. Develop a comprehensive analytical scenario based on a data warehouse environment. An effective marketing automation process uses a number of business intelligence tools. These tools provide quantitative and qualitative information about customers and prospects to determine the right message, offer, and channel for a given initiative.
2. Develop a strategy for campaign automation. To effectively manage a large number of campaigns that contact an ever growing number of prospects and customers, marketing specialists require an automated process to:
   - Manage the channel utilization
   - Deal with different types of customer interaction
   - Track responses
   - Analyze the actions' feedback using consolidated and detailed campaign reports
Note: To enable a comprehensive view of existing and potential customers, a centralized data repository is usually necessary to collect data from all appropriate operational systems, channels, and third-party sources. This repository should support the customer-centric, cross-product view necessary to develop effective marketing campaigns.

Refer to Chapter 6, "Campaign management solution examples" on page 97, which provides scenarios on:
- How to automatically trigger marketing actions when a customer event appears (trigger-based marketing scenario)
- How to use the data mining functions to automatically launch a campaign when a customer has a high propensity to leave (retention campaign scenario)
- How to use the data mining functions to enable the analytical scenario and campaign automation based on a data warehouse solution (cross-selling campaign scenario)
3.3.1 Business benefits

The advantage of this approach is that the campaign is timely and automatic. It is also more focused because, for example, a churn score is updated every time there is a change in the customer's behavior. The campaign management system is designed to automatically run different campaigns or scripts for different combinations of churn score, customer value, and tenure. With the cross-selling option, the campaign is also more precise. It can be less expensive to have an accurate real-time system that identifies and acts fast enough to contact the customers that the company really wants to keep (focus market segment) and the ones that can cause trouble. The result of this fast and accurate contact can be to prevent the customer churn (most likely). Therefore, the campaign must be as targeted as possible to avoid customer dissatisfaction and to save campaign funds. This business scenario and the steps involved with respect to implementation and deployment are discussed in more (and technical) detail in Chapter 6, "Campaign management solution examples" on page 97.
3.4 Up-to-date promotion
In the retail industry, the large number of products and their possible combinations makes the business analyst's decisions difficult. The analyst needs to keep track of customer purchase behavior and the best mix of products in each store to decide on each product's optimal price and marketing appeal. In this way, the manager of each store can control the warehousing time, and the marketing analyst can control the efficiency of a promotion while it is still running and even change the promotion. The manager or business analyst has to design and run a different promotion every day based on the mix of products in the store. For example, the business analyst has to sell perishable products in a short time frame and needs to determine the best location in each store. In the same time frame, they can also check the efficiency of their decision and change it if they want. Another example is to use the Web to perform “micro testing” on
promotions and to put items up for sale on the Web, and then to determine how much to produce (overseas) before ordering the merchandise for the stores.
3.4.1 Business benefits
The business benefits are multiple.
Automation
There is an automated way to find the association rules embedded in an application, without using any separate data mining software or specialist skills. The business analyst (in this case the manager) can see and interpret the product combinations every time a new product is launched or a new transaction is inserted in the transactional system. They can also test the efficiency of a promotion while it is still running.
Calibration: New data = new model
The advantage of having the association technique embedded in an application is that it can be run in batch every night, so the business analyst keeps track of the product association patterns. In this way, the manager of each store can control the warehousing time, and the marketing analyst can control the efficiency of a promotion while it is still running and even change the promotion. IM Modeling brings the ability to calibrate and recalculate a data mining model every time new transactional data is loaded. The faster the data is inserted in the transactional database, the faster the calibration is done.
Fast and accurate
Again, the faster and more repeatable the process is, the more accurately the business analyst can decide in this industry, which is characterized by a large number of product combinations and customer profiles.
Ready for the Web environment
You can use this promotion application in a Web-based environment to perform “micro testing” on promotions. You can also use it to compare the success rates of two different promotions. This business scenario and the steps involved in its implementation and deployment to the Business Objects end-user environment are discussed in more technical detail in Chapter 7, “Up-to-date promotion example” on page 149.
3.5 Integrating the generic components
This section describes the generic components that cover an end-to-end deployment process for integrating IM Modeling, IM Scoring, and IM Visualization in business applications and solutions.
3.5.1 Generic environment and components
The different environments and components used in the different business scenarios, as illustrated in Figure 3-2, include:
Customer profiling
Campaign management
Trigger-based marketing on retention campaign
Fraud detection
Up-to-date promotion
Applications with embedded mining
[Figure 3-2 is a component diagram. It shows the mining workbench and applications with embedded mining and scoring (including schedulers for model calibration, segmentation, Business Objects, target campaigns, and the call center) connected through the application integration layer (CLI/JDBC/SQLJ/ODBC) to the IM Modeling and IM Scoring APIs. In the modeling environment, mining runs against the data warehouse or analytical data mart produce a model stored as a BLOB; the model is exported and transported as a CLOB or PMML file to the scoring environment, where mining runs against the Operational Data Store produce the data models and scores.]
Figure 3-2 Business scenarios components
There are three distinct environments:
The business applications environment
The modeling environment
The scoring environment
Business applications environment
Business applications can be any applications that lend themselves to analytics. This may include real-time or batch applications that embed a fraud detection application. They may also include:
A call center application that computes a customer’s new propensity to churn, based on changed circumstances
A call center application that promotes an up-to-the-minute “next best offer”, based on the latest customer information
A weekly batch application, a so-called scheduler, that generates customer leads for targeted campaigns
A credit scoring application that sits on the loan officer’s desk in the branches
A traditional workbench environment such as DB2 Intelligent Miner for Data or SAS Enterprise Miner
With the new mining and scoring extenders, any application can now embed modeling, scoring, or both. The customer profiling scenario and the trigger-based marketing on retention scenario only use the scoring function. The fraud detection scenario showcases an application where both the modeling and scoring extenders are used in a single business application. Therefore, from an application perspective, the applications can span both the modeling and scoring environments. The applications environment is integrated with the modeling and scoring environments using CLI, ODBC, JDBC, and SQLJ.
IM Modeling environment
The modeling environment is a DB2 database that is “enabled” for mining. The database has all the additional database objects required for executing the Modeling API. Typically, the source database is part of a data warehouse, where the data to be modeled is cleansed, joined, pre-processed, and enriched. Using SQL scripts, the application developers use the mining extenders to mine or “model” the data in a batch environment. The result of a modeling run is a model stored as a special DB2 datatype within the database. The model is exported to PMML in an external file or a DB2 CLOB type for transport and storage. For more information, refer to Chapter 10, “Building the mining models using IM Modeling functions” on page 211.
This model is then transported, by means of export or DB2 Universal Database (UDB) Connect, to the scoring environment, where the model is applied to score new customers.
IM Scoring environment
Scoring is the application of the model built in a modeling environment. The model has to exist before it can be used for scoring new customer data. From a DB2 perspective, there is no difference between a mining and a scoring environment. In practice, most IT shops separate their modeling and scoring environments for security, operational, and efficiency purposes. The scoring environment is a DB2 database that is enabled for scoring. An enabled database has all the additional database objects required for executing the scoring extender via the SQL API. The enablement process is the same for both modeling and scoring. Typically the scoring environment is a single table within an Operational Data Store (ODS). You can think of this ODS as a house midway between the data warehouse and OLTP systems. Models are developed in the data warehouse and deployed to the ODS. Finally, the scores that result from the scoring run are most likely integrated into the front-of-house or operational OLTP applications, where scores are needed for different campaign, call center, or CRM applications.
Components
The components that are used throughout this chapter include:
Applications
IM Modeling data mining function and API
IM Scoring data mining function and API
Data mining model represented in different forms: models can be stored as datatype BLOB in DB2 UDB, as PMML in external files, or as CLOB in DB2 UDB. Refer to Chapter 10, “Building the mining models using IM Modeling functions” on page 211, for more information.
Table 3-1 outlines the basic steps for deployment of modeling capability into the database.
Table 3-1 Step-by-step modeling process (modeling environment)
1. Build a model based on business requirements, using either the IM Modeling SQL API or a workbench such as DB2 Intelligent Miner for Data or SAS Enterprise Miner.
2. Train the model and save it.
3. Validate the model.
4. If Modeling and Scoring are in separate databases, export the resulting model as a PMML file for distribution.
Table 3-2 outlines the basic steps for deployment of a scoring capability into the database.
Table 3-2 Step-by-step scoring process
Scoring environment:
1. If Modeling and Scoring are in different databases, import the PMML model into DB2 UDB.
2. Build the SQL script that uses the imported models for scoring.
3. Deploy the score script.
Application environment:
4. Integrate the SQL score scripts using CLI, ODBC, SQLJ, and JDBC.
3.5.2 The method
For each scenario, refer to the following chapters:
Customer profiling: Chapter 4, “Customer profiling example” on page 51
Fraud detection: Chapter 5, “Fraud detection example” on page 75
Campaign management: Chapter 6, “Campaign management solution examples” on page 97
Up-to-date promotion: Chapter 7, “Up-to-date promotion example” on page 149
For each scenario and its implementation process, the same methodology and workflow were adopted:
The business issue:
– What are the current problems in the business?
– What are the possible approaches for a solution?
– Why should the company choose a solution that involves data mining? That is, why are the other approaches not good enough?
– Where is the ROI?
Mapping the business issue to data mining functions: This section describes the problem at a more analytical level.
– What is the key idea where a solution designer would say “this looks like mining function X; maybe we should try that”?
The business application:
– How does the application look from an end user’s point of view?
– Who is the user, are there different kinds of users, and what tools or applications do they use?
Environment, components, and implementation flow: This is the technical counterpart of the business application. It sketches the design of the solution with reference to mining.
The step-by-step implementation:
– How does the input data look? Optionally, how can it be derived from the source tables in the warehouse?
– How do you invoke mining?
– How do you show and integrate the results?
The benefits:
– What are the technical benefits of such integrated solutions?
Part 2
Deploying data mining functions
This part provides examples of how to easily integrate data mining functions through SQL scripts to enhance your business applications. This requires more IT or database administrator support than real data mining knowledge. This part presents several business scenarios and highlights the implementation and deployment of the DB2 data mining functions, based on the generic components that cover an end-to-end deployment process for integrating IM Modeling, IM Scoring, and IM Visualization in business applications and solutions (see 3.5.1, “Generic environment and components” on page 44). This part also elaborates on other ways to integrate the DB2 data mining functions with different e-commerce- and Business Intelligence-related technologies.
© Copyright IBM Corp. 2002. All rights reserved.
Chapter 4. Customer profiling example
This chapter presents a basic scenario for customer profile scoring of new customers for a bank. This business scenario leans heavily on the use of scoring techniques and scoring results. It describes conventional mining with a workbench, combined with scoring in the database. In this chapter, the mining models are created with a workbench such as DB2 Intelligent Miner for Data. This chapter also discusses the steps involved with respect to IM Scoring and deployment to the Business Objects end-user environment.
4.1 The business issue
Understanding customer behavior and being able to predict customers' preferences and actions is crucial for every business. This is particularly true in the highly competitive banking industry. An advantage in addressing the customers' needs can make a big difference in the long run. Customer segmentation has been a very useful technique for guiding interaction with the customer. Traditionally, segmentation is done by market experts who somehow come up with descriptions of segments as they feel appropriate. With data mining, however, a bank can detect different patterns in customer behavior without depending on instinct. Even when data mining techniques are used to optimize the channels toward a “segment of one”, a bank still faces the issue of how to better enable front-office employees to use the mining results in daily operations. Only if data mining becomes operationalized in customer interactions can an institution achieve the full return on investment (ROI) of data mining. As an example, we address the issue of how to distinguish profitable and non-profitable customers. More specifically, we demonstrate how end users can easily access customer scores from within their usual applications. We also address the requirement that end users must be able to retrieve results that were computed using the most recent information about a specific customer. This approach allows you to score new customers. The other important issue here is that not only do profiles have to be discovered in a timely manner, but scoring against existing profiles also has to take place in a fast and frequent way. New customers often join a bank or change banks for private or business purposes. This happens particularly in the U.S.A., where an individual still has to close one account and open a completely new account in the next place of residence when they relocate.
Therefore, chances are that after a first visit, or a couple of informative visits, they will never return if the services or products are not targeted or recognizable to them at an affordable price. In our example, we consider the question of how to profile and score new arrivals at the bank, via any channel, with near real-time scoring.
4.2 Mapping the business issue to data mining functions
To generate the customer profiles, you use the traditional workbench and run a data mining technique for segmenting the customer database called clustering.
Clustering: Customer segmentation is a method whereby customers are grouped into a number of segments based on their behavior. In data mining, this corresponds to the clustering technique. The customer behavior is described with profile data that includes features, such as age and income, and behavior indicators, such as the aggregated monetary value of transactions during the last four weeks. Clustering groups the customers based on the similarity of their profiles. There is no need to predefine particular criteria for the grouping operations: the clustering technique can find the right groups automatically. The models and scores that you produce can be used by the business users and integrated in their reports (Business Objects, for example) to identify the different types of customers and the potential niche market segments. The clustering technique typically generates three different result types:
A graphical result for the analyst and business user
An output table (or flat file) with scores
The clustering model with segment characteristics
To address the newly arrived customers to our bank, you use real-time scoring to deliver fast results on a typical profile fit. With real-time scoring, you can match the business actions faster and center them more on the possible needs of the newer customers. For example, a recent visitor to the bank Web site who registers their preferences for your products and services and confirms their registration is more likely to become a steady customer than the one-click Web user who merely visits your bank Web site. Real-time scoring of the behavior of this new registrant enables you to use the scoring technique to match their profile with those of existing customers of your specific Web channel. From there, you can personalize the bank Web site pages to target this newly arrived customer.
You can offer products that other customers in their segment also became aware of and then decided to receive service for through your bank. Similarly, scoring can automatically and instantly trigger an e-mail to this customer. The e-mail confirms the registration and, on the basis of the fit of demographic characteristics and product preferences, targets this recent customer in a more personalized way from the start. If this same scenario occurs at a bank teller or sales account desk, rather than through the Web channel, real-time scoring instantly provides a quick score report with which the customer is serviced in the personal interview.
4.3 The business application
The features of the bank's service and product types can lead the business analysts to choose certain segments of customers and set other segments aside. A mortgage product may be of interest only to one or two segments of customers and not to all segments. Demographic characteristics, such as family size, household income, and ownership of private property, may trigger the business analyst to select a certain segment as a target set. The customer lifetime of the bank customers may lead the bank manager to decide to target a subset of customers in a segment differently from other customers in the same segment. The first stage is to know customers better:
To have a better understanding of your new customers as a bank by scoring
To score these new customers on the basis of their characteristics and their match to profiles of certain segments
To have customers interact with the bank via different channels of choice, be it the bank teller, ATM, Web, telephone, or kiosk
So the bank can offer diverse services and products (checking, savings, bank funds, credit card, mortgage, insurance, automatic payments), which, over time, need adjustments and additional features
To help the business analyst in the bank to know what types of customers the bank deals with in the first place
To act with a customer-oriented focus instead of a service or product focus
The segmentation models are produced by the business analyst who is in charge of data mining, using a mining workbench that already exists in the enterprise. The scores can be used by the business users. They can be integrated in their day-to-day reports (Business Objects, for example) to identify the different types of customers and the potential niche market segments.
4.4 Environment, components, and implementation flow
Figure 4-1 shows a typical end-to-end deployment environment with its components in this case of profile scoring.
[Figure 4-1 is a component diagram. It shows applications with embedded scoring, a scheduler, model calibration, segmentation, and Business Objects connected through the application integration layer (CLI/JDBC/SQLJ/ODBC) to the IM Scoring API. In the modeling environment, the mining workbench mines the data warehouse or analytical data mart using demographic clustering; the resulting cluster model is deployed to the scoring environment, where the Operational Data Store holds the data, models, and scores.]
Figure 4-1 Customer profiling using IM Scoring, deployment to Business Objects
The mining workbench mines the data using the demographic clustering technique. The results (a model with clusters), already produced by the business analyst in charge of the data mining workbench, are stored and used by the IM Scoring function to score new customers (or customers with changed demographic or transactional behavior). The visualization of a customer, with their demographic data and bank-related transactional data, plus the segment to which they belong, assists the bank manager in profiling the customer. The customer is profiled according to their most important characteristics. The visualization of this customer is done via Business Objects, an end-user reporting tool that tightly integrates with IM Scoring. Visualization is also done by embedding the graphical or textual cluster visualizations produced by IM Visualization in Business Objects reports. The Business Objects report then shows the individual segments to the end user when it displays the segment identifier, score, and detailed data for a customer that the end user selected. Spatial data may also be used: address, income levels, and scoring can all determine where to open the next branch or in which geographical area to target the next campaign. Scoring can be done on demand, perhaps because a new customer who was not in the customer base overnight arrived in person at the bank manager’s desk the following afternoon.
In any case, at regular intervals, you invoke scoring runs on the data, in this way detecting whether the customer behavior, or possibly a change in the demographic characteristics, may lead to a change of score. This is driven by the notion that a customer may be scored to a different cluster over time. Not only are the demographic characteristics of the customer transient over time, but the transactional behavior of the customer in relationship to the bank also changes. To schedule scoring, you can use the DB2 UDB scheduler so that the Business Objects report at the end user’s desktop displays the most recent profile scores of bank customers on a frequent basis. Example 4-1 shows the batch job score.bat for scoring, where the SQL script score.db2 writes the scoring results into a DB2 table for customer segments.
Example 4-1 Batch job for scoring db2cmd /c /w /i db2 -tvf score.db2
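This redbook does not show how the batch job is registered with a scheduler. As one illustrative possibility only (the time, the schedule, and the path to score.bat are assumptions, not taken from the product documentation), the at command built into Windows could run the batch job every night:

```
at 02:00 /every:M,T,W,Th,F,S,Su "c:\scoring\score.bat"
```

Any other scheduling facility, including the DB2 UDB scheduler, can invoke the same batch file.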
The high-level view of the implementation flow in Figure 4-2 shows two main phases of activity:
Design and modeling: This phase comprises mining, modeling updates of mining models, creating the static report, and finally linking the scoring functions, the profile data, and the report. At this application integration stage, the database administrator (DBA) and the end user interact to define the initial report setup. They also select the right business terminology in the report, the right informational fields, and the format of the report itself.
End-user interaction: This phase is driven by the end user. Somehow there is a new customer. The scoring function is activated, and the end user reads the results from the report.
Note: In the implementation flow shown in Figure 4-2, input may be added to trigger the generation of a new model.
[Figure 4-2 is a flowchart. Configuration covers database enablement and the table to be scored. Workbench data mining exports a PMML file. In the scoring phase, if the latest data mining model is not in place, the data mining model is imported; an external event (a new customer or changed behavior of an existing customer) triggers applying the scoring function. In the application integration phase, the Business Objects report is built if it does not yet exist and is then displayed.]
Figure 4-2 Implementation flow of customer profile scores
The next section discusses the implementation in more detail.
4.5 Step-by-step implementation
The detailed implementation, according to the flow shown in Figure 4-2, consists of the steps that are explained in this section. Some steps run once, some run each time a tune-up of the mining model occurs, and some run at each scoring instance. All in all, each step is short and easy. The implementation flow overall is simple because most of the steps, such as the database configuration setup, are done once up front.
4.5.1 Configuration
You must do the configuration once, during the deployment stage. It consists of database enablement and table creation.
Database enablement
Make sure that both the DB2 UDB instance and the BANK database for the customer profile scoring project are enabled. The steps to set up the environment for the BANK database are:
1. Enable the DB2 UDB instance. Check whether the DB2 UDB instance is enabled. If it is not, enable it by referring to the steps in 9.3.1, “Configuring the DB2 UDB instance” on page 197, for IM Scoring.
2. Enable the working database. Check whether the DB2 UDB database for scoring is enabled. If it is not, enable the database (BANK) by referring to the steps in 9.3.2, “Configuring the database” on page 198, for IM Scoring. In this case, you invoke the script in Example 4-2.
Example 4-2 Script to enable the BANK database for scoring
idmenabledb BANK fenced tables
Table to be scored
Table 4-1 lists the data fields that are used for clustering. This list contains the demographic, relationship, and transactional data that is used for profiling the bank's customer base.
Note: Only use the variables that are available in the online environment where you deploy the model. A typical mistake is to generate clusters based on everything you know about your current customers, without realizing that you do not have all that information for prospects.
58
Enhance Your Business Applications: Simple Integration of Advanced Data Mining Functions
Table 4-1 Table fields to use (field description: field name)
1. Customer identification: Client_id
2. Age: Age
3. Sex: Sex
4. US state of residence: State
5. Monthly income: Salary
6. Profession: Profession
7. Home owner: House_ownership
8. Car owner: Car_ownership
9. Marital status: Marital_status
10. Number of dependents: N_of_dependents
11. Has children: Has_children
12. Time as customer: Time_as_Customer
13. Monthly checking amount: Checking_amount
14. Total monthly amount of automatic payments: Total_amount_automatic_payments
15. Monthly savings amount: Savings
16. Monthly value of bank funds: Bank_funds
17. Monthly value of stocks: Stocks
18. Monthly value of government bonds: Govern_bonds
19. Amount of gain or loss of funds (bank funds, stocks and government bonds): Gain_loss_funds
20. Number of mortgages: N_mortgages
21. Monthly amount of mortgage: Mortgage_amount
22. Amount monthly overdrawn: Money_monthly_overdrawn
23. Average value of debit transactions: AVG_debit_transact
24. Number of checks monthly written: Monthly_checks_written
25. Number of automatic payments: N_automatic_payments
26. Number of transactions using bank teller: N_trans_teller
28. Number of transactions using ATM: N_trans_ATM
29. Number of transactions using Web bank: N_trans_Web_bank
30. Number of transactions using kiosk: N_trans_kiosk
31. Number of transactions using call center: N_trans_Call_center
32. Monthly amount of credit card limit: Credit_card_limits
33. Has life insurance: Insurance
The table with all the customers that the bank wants to score either already exists or can be created by joining the diverse data fields from the different operational databases in the bank environment. For example, the customer demographic data from the customer table in the bank data warehouse is joined with sales transactional data from the table owned by the bank’s accounts department. The SQL script to create and load a sample input table containing customer profile data in DB2 UDB for this scenario is: Script to create and load the customer segm table for scoring.sql
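As a hedged sketch of such a load, the following SQL shows how an input table of new customers might be populated from the warehouse. The source table name (CUSTOMER_DEMOGRAPHICS) is a hypothetical illustration; the target table and column names are the ones referenced by the scoring view later in this chapter (Example 4-4) and in Table 4-1.

```sql
-- Illustrative sketch only: populate the table to be scored from a
-- hypothetical warehouse table. The column list is limited to the
-- demographic fields that the scoring view in Example 4-4 references.
INSERT INTO RECENT_NEW_CUSTOMERS
       (CLIENT_ID, AGE, SEX, STATE, SALARY, PROFESSION,
        HOUSE_OWNERSHIP, CAR_OWNERSHIP, MARITAL_STATUS,
        N_OF_DEPENDENTS, HAS_CHILDREN)
SELECT d.CLIENT_ID, d.AGE, d.SEX, d.STATE, d.SALARY, d.PROFESSION,
       d.HOUSE_OWNERSHIP, d.CAR_OWNERSHIP, d.MARITAL_STATUS,
       d.N_OF_DEPENDENTS, d.HAS_CHILDREN
FROM   CUSTOMER_DEMOGRAPHICS d;
```

In practice, a WHERE clause or a join with transactional tables would restrict the load to the customers that still need scoring.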
4.5.2 Workbench data mining
Collect all the customer demographic and transactional data and store it in a relational database, such as DB2 UDB. Then use the DB2 Intelligent Miner for Data workbench to exercise the generic data mining method, using the demographic clustering technique. Figure 4-3 shows an example of the Summary page for this technique.
Note: For a description of the clustering algorithms that DB2 Intelligent Miner for Data offers, see Mining Your Own Business in Banking Using DB2 Intelligent Miner for Data, SG24-6272, and Mining Your Own Business in Retail Using DB2 Intelligent Miner for Data, SG24-6271.
In this case, use the statistical clustering algorithm at the bank because it automatically determines the “natural” number of clusters. The clustering mining function groups data records on the basis of how similar they are. A data record may, for example, consist of information about a customer. In this case, clustering groups similar customers together, while maximizing the differences between the different customer groups formed in this way.
The groups that are found are called clusters or segments. Each cluster tells a specific story about customer identity or behavior. For example, the cluster may tell about their demographic background or about their preferred products or product combinations. In this way, customers that are similar are grouped together in homogeneous groups that are then available for marketing or for other business processes.
Figure 4-3 Demographic clustering in DB2 Intelligent Miner for Data
Exporting the PMML model
The model to be deployed is created in DB2 Intelligent Miner for Data. Therefore, the model needs to be exported from the workbench and transferred to the database where the scoring is invoked. The model may be exported into a PMML file (see Figure 4-4) or into the DB2 Intelligent Miner for Data format. Both formats can be used by IM Scoring. In this example, use the PMML format. The model was exported to the CustomerSegmentation.PMML file. Refer to 9.3.3, “Exporting models from the modeling environment” on page 200, to learn how to export the model in PMML format.
[Figure 4-4 shows the DB2 Intelligent Miner for Data export dialog, which offers a choice between the native IM4D format and the PMML format.]
Figure 4-4 Exporting the mining model in the PMML format
4.5.3 Scoring
Use IM Scoring to score additional customers in the bank’s operational database against the segments to which they belong. Scoring the data is the most important step. By itself, it is a simple step because of the ease of use of the different handy SQL scripts that are provided with IM Scoring and DB2 UDB.
Importing the data mining model
The model is imported into DB2 UDB using an SQL script. Refer to 9.3.4, “Importing the data mining model in the relational database management system (RDBMS)” on page 202, to learn how to import the model in PMML format. If the PMML model was exported as the CustomerSegmentation.PMML file, then the script in Example 4-3, which is part of the score.db2 file, imports the PMML model.
Example 4-3 Importing the clustering model
INSERT INTO IDMMX.CLUSTERMODELS VALUES
  ('Demographic clustering of customer base of an US bank',
   IDMMX.DM_impClusFile('c:\scoring\CustomerSegmentation.PMML'));
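After the INSERT, a quick way to confirm that the model is in place is to query the same models table. This check is a suggestion rather than a step taken from the redbook; it uses only the table and column that the scoring view already references:

```sql
-- List the clustering models currently stored in the database
SELECT MODELNAME FROM IDMMX.CLUSTERMODELS;
```

The model name inserted above, 'Demographic clustering of customer base of an US bank', should appear in the result.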
Apply the scoring functions
You apply the scoring functions on the basis of a cluster model by invoking the DM_applyClusModel command. In this example, the view CustomerSegmentsView was created. The REC2XML command was used to
62
Enhance Your Business Applications: Simple Integration of Advanced Data Mining Functions
convert the data that is to be scored to XML format. In this scenario, the data to be scored is data for new customers who were saved to the bank customer base in the table named Recent_New_Customers. See the SQL script in Example 4-4.
Example 4-4 Defining the view CustomerSegmentsView

CREATE VIEW CustomerSegmentsView ( Customer_id, Result ) AS
  SELECT data.Client_id,
         IDMMX.DM_applyClusModel( models.MODEL,
           IDMMX.DM_impApplData(
             REC2XML(1,'COLATTVAL','',
               data."CAR_OWNERSHIP", data."HAS_CHILDREN",
               data."HOUSE_OWNERSHIP", data."MARITAL_STATUS",
               data."PROFESSION", data."SEX", data."STATE",
               data."N_OF_DEPENDENTS", data."AGE", data."SALARY")))
  FROM IDMMX.CLUSTERMODELS models, RECENT_NEW_CUSTOMERS data
  WHERE models.MODELNAME =
        'Demographic clustering of customer base of an US bank';
This SQL script can actually be generated using the IDMMKSQL command. The tip in “Constructing the record and applying the model” on page 207 explains how to use this command. The complexity of the scoring code is encapsulated in the view CustomerSegmentsView: you can access the scoring results by simply reading from the view. Because views are evaluated dynamically, the scoring results are computed on the fly. The mining result contains the cluster ID, the score, and further indicators.
Note: With the new IM Visualizers in V8, the names of the clusters can be modified. That is, an abstract integer represents the cluster ID, and a name can be assigned to a certain cluster, for example, “dead beats” instead of clusterID=5.

A table that contains the cluster ID and the score as separate entries is constructed by the SQL statement in Example 4-5.
Example 4-5 Write scoring results into table CustomerSegmentsTable

--- Use view CustomerSegmentsView to score data and to
--- write scoring results into a table CustomerSegmentsTable
INSERT INTO CustomerSegmentsTable
  SELECT Customer_id,
         IDMMX.DM_getClusterID( Result ),
         IDMMX.DM_getClusScore( Result )
  FROM CustomerSegmentsView;
The table CustomerSegmentsTable is defined once as shown in Example 4-6.
Example 4-6 Defining the CustomerSegmentsTable

--- DROP TABLE CustomerSegmentsTable;
CREATE TABLE CustomerSegmentsTable (
  Customer_id CHAR(8),
  Cluster_id  INTEGER,
  Score       FLOAT
);
The SQL statements in both of the previous examples are part of the score.db2 file. This file is available in the additional materials for this redbook. For more information, see Appendix K, “Additional material” on page 301. The reason for creating a table in addition to a view is that the bank’s account managers, who are involved in client engagements over a period of time, want to see static results in their reports, not scores that change dynamically at all times. If the bank’s end users require only real-time scoring, the view alone is sufficient.
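If the static table is refreshed once per reporting period, the refresh can be sketched as follows (an assumption, not part of the provided scripts: the table is rebuilt completely each period):

```sql
-- Hypothetical periodic refresh: discard last period's scores, then rescore
DELETE FROM CustomerSegmentsTable;
INSERT INTO CustomerSegmentsTable
  SELECT Customer_id,
         IDMMX.DM_getClusterID( Result ),
         IDMMX.DM_getClusScore( Result )
  FROM CustomerSegmentsView;
```

Because the view computes the scores on the fly, the refresh consists of nothing more than re-reading the view.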
4.5.4 Application integration

Because the business issue calls for integration with an existing enterprise query and reporting tool, this section demonstrates how end users can visualize scoring results in a reporting tool they already know, in this case Business Objects.
Integrating with Business Objects reports

The purpose of profiling the bank’s customers with data mining is to discover the different behavior groups in your customer base by extracting the natural segments in your data. This allows your end users to identify different customer types and to develop the most appropriate way to service (and price) the different customer groups. The mining results are available in plain database tables or views, which makes them readily available to a reporting tool. There is no need to configure mining functions in the report. Moreover, when mining results are read from the view, the scores are computed on the fly. That is, when a Business Objects user loads a report, the scores are evaluated in real time. The customer profile scores are visualized with the cluster visualization that IM Visualization offers. The visualizations are embedded as applets into an HTML version of a marketing report for the end user. To learn how to embed the IM Visualizers in an HTML document as an applet that is either a graphic inside the HTML version of the report or that is started from a Start button on the report, see 11.3.1, “Using IM Visualizers as applets” on page 235. Example 4-7 shows the code used to generate the report pages that are shown later in this section. This particular piece of code shows how to include the IM Visualization applet with a graphical view of clusters on a page titled “New scores of today” in a Business Objects report. The complete code is provided as an example in the additional materials for this redbook. The code is in the
0 - File0.htmlToshow10customerscoresplusgraphicalviewIMVisAppletpageNew ScoresToday.htm file. See Appendix K, “Additional material” on page 301, for more information.
Example 4-7 HTML code to show IM Visualization applet in BO report

<TITLE>New scores of today</TITLE>
<META NAME="GENERATOR" CONTENT="BusinessObjects 4.0 on Windows">
...
Scoring of new customers of the bank at: ...
Demographic clustering of customer base of an US bank ...
<APPLET ... >
  <param name="model" value="">
  <param name="embedded" value="true">
  <param name="view" value="graphical">
  <param name="JDBC_URL" value="jdbc:db2://localhost/BANK">
  <param name="JDBC_Driver" value="COM.ibm.db2.jdbc.net.DB2Driver">
  <param name="DBUserName" value="db2admin">
  <param name="DBPassword" value="db2admin">
  <param name="DBTable" value="IDMMX.CLUSTERMODELS">
  <param name="DBPrimaryKeyCol" value="MODELNAME">
  <param name="DBModelCol" value="MODEL">
  <param name="DBModelKey" value="Demographic clustering of customer base of an US bank">
</APPLET>
...
Customer | Segment | Score
65000001 | 1.00    | 0.49
...
Typically, end users want to see the score results in an application environment to which they are accustomed. An end-user reporting tool, such as Business Objects, suffices. The customer profile scores and profiles are shown together in a Business Objects customer profiles report. The different report pages, which are portrayed in a Web environment to the end users, are shown in Figure 4-5, Figure 4-6, Figure 4-7, and Figure 4-8 respectively. The report consists of three pages. The first page (Figure 4-5) is displayed graphically to the end user. It shows the scores for 10 new customers in the bank customer database.
Figure 4-5 Business Objects report first page: Scores graphical display
The scores indicate how well each of the 10 customers matches the clusters (one to six). Next to this, the end user sees a visualization of the six clusters in an applet. They can click the clusters to understand, or remind themselves of, the characteristics that a customer has according to the cluster profile they fit best. Next to this graphical presentation, the end user also has a textual representation of the same six clusters (see Figure 4-6).
Figure 4-6 Business Objects report first page: Scores textual display
In this scenario, the bank has six clusters. Several of the new customers were scored primarily to cluster 1 and cluster 5. The end user sees that cluster 1 typically characterizes these customers as predominantly:

A single male
Mostly between 20 and 25 years of age
Resides in the state of New York
Doesn’t own a home but owns a car
Has a low income
Has a low savings account
Has been with the bank for only three years
Has made very few transactions across all channels

This may be a customer whom the account managers decide not to target for cross-sell actions. Instead, they may advise such customers to set more money aside in the savings account to lower the base monthly fee for keeping it, or to use the Web channel more often instead of writing checks, because that is less costly to process for both parties, the customer and the bank.

Likewise, cluster 5 shows a profile that is characterized as predominantly:

A divorced male
Above 45 years of age
Resides in the state of New York
Owns both a home and a car
Has a low savings account
Has a high income and a high mortgage amount
Does a low number of teller transactions but a higher number of ATM transactions
Has been with the bank for only two years at present

The account managers may call this customer a “risk-taking yuppie”. Based on the information distilled from the report, the bank manager decides to treat this type of customer with care for the next two years. That way, if the customer’s social status changes in the near future, their financial standing, beyond the already high income, may improve further. This customer is then likely to buy more services from the bank at that point.
Note: The file for visualizing the clusters in IM Visualizer is called Demographic clustering of customer base of a US bank.vis. You can find this file in the additional materials that accompany this book. For more information, see Appendix K, “Additional material” on page 301.
The Business Objects report also displays a page where each individual customer can be analyzed. Again, the visualizer for the segment that most likely fits the profile of this new customer is shown in an applet (see Figure 4-7). The customer with ID ‘65000010’ fits cluster 5 with a score of 0.41. Based on the cluster details, the bank manager has another view of the customer’s characteristics according to their fit to cluster 5. The information that the manager distills from this particular view again validates the idea to treat this type of customer with care for the next two years and to contact the customer by phone at regular intervals.
Figure 4-7 Business Objects report second page: Graphics cluster
Figure 4-8 shows the third page of the HTML report. It includes the details of the 10 new individual customers.
Figure 4-8 Business Objects report third page: Detailed data of selected customer
With this view of each individual customer, the bank employee who is in contact with the customer at some point sees the customer’s transactional and demographic data values (used for scoring), together with the segment identifier and score. Through this view, the bank employee has an overview of which demographic, relationship, and transactional data values, aggregated or not, have led to the fit to cluster 5. The cluster fit serves as a simple trigger to address the needs of the customer, and by having the basic and historical data at hand, the employee is more effective in relating to the customer.
4.6 Benefits

Apart from the business-oriented benefits stated in 3.1.1, “Business benefits” on page 35, there are several, mostly technically oriented, benefits to addressing the business issue with the implementation flow proposed in this business scenario.
4.6.1 End-to-end implementation

The bank’s requirement to characterize, know, and profile its customers is now met via the end-to-end implementation flow where:
The operational data is accessed.
Clustering is applied.
IM Scoring is invoked for repeatable scoring of new customers.
The actual scores are visualized in the Business Objects reporting tool for the end users.
Scoring on a regular basis and monitoring the changing score over a period of time can rapidly bring to the attention of the bank those customers with unusual or sudden changes in behavior. You can do this using the score and not simply the movement between clusters, which may occur too slowly.
4.6.2 DB2 mining functions next to the workbench

A workbench, such as DB2 Intelligent Miner for Data, is used in an environment separate from the operational database. IM Scoring, however, is set up to access the operational database directly. The workbench is meant to find an initial clustering model in one run, but it is the IM Scoring functions that facilitate repeated and timely profile scoring. They can score newly arrived customers in a timely manner via a database trigger that stems from a business rule that, for example, says: “If the customer bought a home and is applying for a mortgage, the bank should also look at their profile characteristics again and re-score.”
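Such a business rule can be sketched as a database trigger. The trigger below is a hypothetical illustration, not part of the product: it assumes the tables and view defined earlier in this chapter (RECENT_NEW_CUSTOMERS, CustomerSegmentsView, CustomerSegmentsTable) and rescores a customer whose relevant attributes change:

```sql
-- Hypothetical trigger: rescore a customer after relevant attributes change
CREATE TRIGGER RESCORE_CUSTOMER
  AFTER UPDATE OF MARITAL_STATUS, HOUSE_OWNERSHIP, SALARY
  ON RECENT_NEW_CUSTOMERS
  REFERENCING NEW AS n
  FOR EACH ROW MODE DB2SQL
  UPDATE CustomerSegmentsTable t
    SET (Cluster_id, Score) =
        (SELECT IDMMX.DM_getClusterID( v.Result ),
                IDMMX.DM_getClusScore( v.Result )
         FROM CustomerSegmentsView v
         WHERE v.Customer_id = n.CLIENT_ID)
    WHERE t.Customer_id = n.CLIENT_ID;
```

The trigger relies on the view to compute the new score, so no application code needs to change.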
4.6.3 Real-time analytics

The segmentation model and the scores can also be implemented in your operational database system. The scoring functions can be accessed via the SQL or the Java API. Then they can be invoked to score the bank customer data from within an application and display the results directly in a Business Objects report. This allows your segmentation model to work in real time. This is particularly relevant if you want to score new customers on the fly and on demand.
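On-demand scoring of a single customer then reduces to a simple query against the view (a sketch, using the customer ID from the report example earlier in this chapter):

```sql
-- Hypothetical on-demand lookup: score one customer in real time
SELECT IDMMX.DM_getClusterID( Result ) AS Segment,
       IDMMX.DM_getClusScore( Result ) AS Score
FROM   CustomerSegmentsView
WHERE  Customer_id = '65000010';
```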
4.6.4 Automated and on demand for multiple channels

IM Scoring functions offer repeated and timely profile scoring without the need for human interaction. The production of reports for each workday can be automated by scheduling batch jobs that apply the scoring function to the operational data at hand. The end users of the reports are bank personnel, such as the bank manager and the bank tellers, at the times when they interact with a customer. The automatic scoring can also trigger new message flows at the bank’s ATMs. All of these communication channels interact with the identified customer at the customer touch point. And each channel that a customer may prefer from time to time can use the profile information of the segmented customer to target the customer via effective CRM activities.
Chapter 5. Fraud detection example

This chapter describes a scenario that shows how to use the data mining functions from scratch to build a quick, reusable data model and to score data. The scenario shows how to integrate the functions into existing operational applications, for example, to detect fraud in telecommunications. It demonstrates how you can perform both modeling and scoring in the database and describes the modeling and scoring steps in SQL in more detail.
© Copyright IBM Corp. 2002. All rights reserved.
5.1 The business issue

Fraudulent behavior is detected when a telecommunications company tries to bill a customer but never receives any payment. It is possible to manually evaluate these cases, classifying them as fraud, customer bankruptcy, or a technical processing error. But fraudulent customers are creative and use techniques that were not previously identified as characteristic of fraud. The ability to detect “unusual” behavior, without knowing in advance which features make a certain behavior unusual, is necessary to detect these cases as early as possible. This business scenario is based on a real-life fraud detection exercise performed by IBM data mining experts at a large telecommunications provider in Europe. This organization was aware that it was a victim of fraud in its Premium Rate services business unit. Premium Rate services provide expert hotline services on such topics as:
Advice on insurance and judicial or legal matters
Stock market tips
Adult services
The charge for these services can be up to 2 euro per minute and is billed by the telecommunications company. The business model is that a company offering such a service immediately earns quite a high share of this rate from the phone company, whereas the phone company itself charges the caller for the entire amount. Given this arrangement, fraud can be carried out in the following way:
1. The (fraudulent) company offering a service via a premium number cooperates conspiratorially with a partner.
2. The partner makes frequent and long phone calls from one or a few other phone numbers to the (expensive) premium number.
3. A high amount of call charges accumulates within a few weeks (this task can be technically facilitated by using a computer or an automatic dialing device to perform the phone calls).
4. The service provider receives its comparably high share of the call charges from the phone company.
5. Meanwhile, the phone company tries in vain to get its money from the conspiratorial partner. But the partner most likely used a false name and disappeared, so the conspiracy with the service provider cannot be proven.
Note: This kind of fraud falls into a category commonly known as ghosting in the telecommunications industry. This kind of fraudulent behavior is also common in other industries where service providers charge for services that were never carried out. The technique used in this scenario may equally apply to other industries where there is a high incidence of fraud.

In the past, fraud detection models were typically created in a workbench environment by data mining experts using algorithms that produce predictive models. There are a few practical inconveniences with this approach:

Effort to model fraud
In general, fraudulent behavior is relatively scarce, and labeled fraudulent behavior is even more so. Using a classification approach, fraud has to be identified up front and fraudulent cases must be labeled in the data. In reality, this is both time consuming and resource intensive for most organizations. Even where a business can accurately identify fraudulent behavior, fraudulent transactions are relatively scarce; in most cases, they make up less than 1% of the total number of transactions. Analyzing rare cases, such as fraud, requires artificial injection of data for modeling purposes, which in turn requires more planning, documentation, and effort.

Timeliness of the fraud detection model
It usually takes a significant amount of time to produce the model. By the time it is deployed to the fraud detection system, the model has lost some of its predictive power. The result is fraud detection models that perform well in the workbench but fail when deployed to the business.

Shelf life of the fraud model
Because of its elusive nature, improper behavior changes over time, and with increasing speed. It is typically nontrivial to define known fraudulent behavior and model it in a timely manner. By the time the model is deployed, manual recalibration of the model may be required. The fraud model takes a long time to build and has only a short useful life, or shelf life, similar to perishable goods that must be used by a certain date.
5.2 Mapping the business issue to data mining functions

Although fraudulent behavior can take many different patterns, a common feature of most fraudulent behavior is that it differs from the norm in some way. As an alternative approach, we look at how to use a deviation detection technique to uncover fraudulent behavior in a more automated manner. This scenario focuses on the Demographic Clustering technique in IM Modeling and shows how it was successfully applied to fraud detection in a telecommunications environment. The approach addresses the business issues that were identified previously:

The deviation technique requires less effort
Part of the problem of modeling fraud is that you don’t know what to model. The deviation detection technique overcomes this problem in some ways, because it does not require a target variable or labeled data. Fraudulent cases do not need to be defined up front. This is useful because fraud is not always easy to define with a high degree of certainty. Using a demographic clustering technique, you group the connections that are similar to each other into clusters and assign them cluster IDs. Potentially undesirable behaviors are likely to deviate from the norm; for example, they are usually found in the clusters with the smallest size.

The deviation approach results in a faster time-to-market model
In general, clustering models are faster to build because there is no need to identify previously confirmed fraudulent cases. Consequently, there is no need to oversample. In many cases, this reduces the inertia in producing a model.

Increase the life of a model by automatic recalibration
In the past, fraud detection models had a short shelf life: they took a long time to build but had only a short useful life. By its nature, fraud evolves quickly, sometimes from week to week. Using the SQL API provided in IM Modeling, the model can be recalibrated at a much more frequent interval. The SQL API allows the model to be rebuilt on fresh data as a scheduled job. This approach offers an opportunity for fraud detectors to keep up with the pace at which fraud evolves.

The result of the modeling run is a set of clusters. Empirical evidence has shown that by examining the n smallest outlying clusters, organizations have identified undesirable behaviors of which they previously knew. The semi-automatic approach may not produce models that are as accurate as ones produced by highly skilled data miners. However, the objective here
is to produce a model quickly in a production environment, rather than to produce a perfect model in a controlled environment. When you consider the value of a model as a function of predictive power and timeliness, this is a viable approach to modeling fraud in terms of “time to market”.
5.3 The business application

The application in this scenario is fed by data generated in the Call Detail Record (CDR). Considering the changing nature of fraud, it can be difficult to come up with hard-and-fast business rules for fraud detection. Given the volume of data involved, some business rules were set up to ensure effective use of resources, for example:

The system only mines call connections that involve a caller_id making long calls.
Limited resources allow for detailed investigation of only the five smallest clusters in terms of rank.
Given that the clustering technique only flags a set of records as “different” from the crowd, rather than giving a propensity that a connection is fraudulent, each connection in the “interesting clusters” should always be cross-examined by business analysts or investigators.
Note: Clustering techniques are commonly used for segmentation purposes, where users are interested in the larger segments. But clustering also finds small clusters and marks entries that don’t really fit into any cluster. That is, you can use clustering as a method for detecting outliers or unusual patterns. Here, the focus is only on the small clusters instead of the larger ones.

The business application was previously built in a less automated manner using Intelligent Miner for Data. The model built using the demographic clustering technique proved successful in identifying clusters of transactions that were different from the norm and later proved fraudulent by investigators. The operational use of the first version of the solution started at the phone company as early as 1998. A customer uses the solution continuously and with great success in several locations. Each week, about 5,000,000 new CDRs are analyzed. Fraud attempts on the scale of tens of thousands of euro are detected and prevented. The return on investment (ROI) was achieved after only six months. This scenario builds on the success of version 1 of the model. The model parameters are translated into SQL scripts, using the IM Modeling and IM Scoring SQL APIs. After the models are coded in SQL scripts, the workbench approach
in version 1 of this solution is transformed into an automated and table-driven approach to data mining. Most components of the application are objects in the database. It is scheduled to rerun and remodel at regular intervals, ensuring that the initial success is repeatable. Every time the model is refreshed (by the scheduler), a new list of suspicious transactions is generated. The investigator then uses Business Objects to look at a report on the list (generated as a DB2 Universal Database (UDB) view). Refer to Figure 5-6 on page 96 to see the end-user report.
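Generating the list of suspicious connections can be sketched in SQL. The table and column names below are hypothetical placeholders for the scored connection data; the idea is simply to rank cluster sizes and keep the clusters with the fewest members:

```sql
-- Hypothetical: identify the five smallest clusters among scored connections
SELECT Cluster_id, COUNT(*) AS members
FROM   ConnectionScores
GROUP  BY Cluster_id
ORDER  BY members ASC
FETCH FIRST 5 ROWS ONLY;
```

A view defined over such a query is the kind of object an end-user report can read directly.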
5.4 Environment, components, and implementation flow

In this scenario, the modeling environment is the same as the scoring environment. When a new model is built, the solution immediately reads the scoring results and chooses records from very small clusters. The environment is enabled once. This enablement creates both the mining and scoring extenders and APIs inside the database, so that the mining and scoring scripts can execute. The application environment consists of four components:

A DB2 UDB script that creates the model: The model created by the mining script is stored as a Binary Large Object (BLOB) inside a special DB2 UDB table created during the enablement process.
A DB2 UDB script that applies the model
A scheduled job
A Business Objects report that highlights the suspicious connections

A DB2 UDB job is scheduled to run every week. This job runs the two DB2 UDB scripts that mine and score the data in succession. The first script sets up the mining environment and creates the mining model. The second script selects the model that was created and scores the connections. The result of the scoring is that each connection is pigeonholed into a cluster. The last component is the Business Objects report that lists the transactions that fall into the smallest clusters. Based on empirical evidence, these transactions deviate from normal behavior and are likely to be fraudulent connections.
Note: In this scenario, because the modeling and scoring environments are the same, there is no need to transport the model between them. If the two environments are different, an extra step may be required to transport the model.
Figure 5-1 illustrates the mining and scoring environment for this fraud detection application.
[The figure shows a fraud detection application with embedded mining and scoring: a scheduler runs a job that executes an SQL script for mining (model calibration) and an SQL script for scoring, which generates the list of risky customers. An application integration layer (CLI/JDBC/SQLJ/ODBC or message queuing) connects the application to the IM Modeling and IM Scoring APIs. The model is stored as a BLOB in the database, and CDRs are loaded weekly into the operational data store, which holds the data, models, and scored results and serves as both the modeling and the scoring environment.]

Figure 5-1 Fraud detection: Deployment environment and components
This architecture allows you to turn the art of fraud detection into a science. By porting it from a workbench to the production environment, it is easier to get such a project off the ground and ensure repeatable success.
5.5 Data to be used

For each phone call, the phone company records a Call Detail Record (CDR), which is stored in a database. This database is a useful starting point for detecting fraud. A CDR consists of about 50 fields, of which the six most important with respect to the solution described here are:
Identification of the caller (CALLER_ID in Table 5-1)
Identification of the premium number called (PREMIUM_ID in Table 5-1)
Date when the phone call started
Time when the phone call started
Duration in seconds
Charges

Table 5-1 shows the fields in the CDR that are used in this scenario.

Table 5-1 CDR fields

Caller_ID | Premium_ID | Start date | Start time | Duration | Charges | ... 40 other fields not used
87645     | 1800...    | 12/12/01   | 0400       | 6000     | 900     | ...
87645     | 1899...    | 12/12/01   | 0400       | 6000     | 900     | ...
5.5.1 Data extraction

CDR data is sourced at regular intervals from the exchanges and is loaded into a database every week. Given the large volume of data involved, the trend is to extract the data in near real time, using middleware such as MQSeries Integrator. Detailed descriptions of data extraction techniques are beyond the scope of this case study. From the CDRs, the following tasks are performed to create the connection table for modeling:

Select CDRs for long distance calls to reduce the amount of data to be analyzed
Enrich the data by creating derived variables that describe connection behavior
5.5.2 Data manipulation and enrichment

CDRs are loaded into DB2 UDB, and extra variables are derived from the original data. Based on previous work done in the generic data steps and on their intimate business knowledge, the data miner and business analyst came up with seven derived fields. These fields encapsulate connection patterns and reveal the underlying undesirable behavior. Table 5-2 lists the seven attributes used for clustering.
Table 5-2 Attributes used for clustering

Derived column | Description
SUM_DUR  | Whole duration of all calls on a specific connection (that is, from a specific caller to the premium number)
NO_CALLS | Number of all calls on the connection
REL_DUR  | Indicates whether the connection has an extraordinarily high share of the turnover generated by all connections with the same premium number (precisely, the relationship between the duration of all calls on the specific connection and the average duration of all different connections with the same premium number)
SUM_COST | Call charges for the connection
MAX_DUR  | Duration of the longest call on the connection
VAR_DUR  | Variance of the call duration on a connection
NO_CLRS  | Number of all different connections to the premium number
Figure 5-2 illustrates the transformation process to derive the seven attributes.
[The figure shows rows of raw CDRs (CALLER_ID, PREMIUM_ID, start date, start time, duration in seconds, charges, and further fields) being aggregated per connection into the connection table. Each row of the connection table contains CALLER_ID, PREMIUM_ID, and the derived attributes SUM_DUR, NO_CALLS, REL_DUR, SUM_COST, MAX_DUR, VAR_DUR, and NO_CLRS.]

Figure 5-2 Data enrichment process
You can find the SQL script for performing this data preprocessing in Appendix C, “SQL scripts for the fraud detection scenario” on page 255.
5.6 Implementation in DB2 UDB V8.1

DB2 UDB V8.1 or above supports extra XML functions that allow you to run the modeling step in fewer lines of code. To run IM Modeling in DB2 UDB V8.1 or higher, refer to the process flow diagram in Figure 5-3.
[Figure 5-3 shows the flow: enable the database for modeling and scoring, and install the extra UDFs and stored procedures (configuration); run the stored procedure that defines the data table settings, defines the model parameter settings, and builds the mining task, and run the modeling task whenever new patterns must be found (modeling); when a fresh CDR extract is available, build the score script if it does not yet exist, run the scoring, and rank the scores (scoring); then build the investigation report if it does not yet exist and display it (application integration).]

Figure 5-3 Step-by-step implementation flow with DB2 UDB V8.1
84
Enhance Your Business Applications: Simple Integration of Advanced Data Mining Functions
5.6.1 Enabling database for modeling and scoring

Before you can use the modeling function and the scoring function, you must enable them by configuring the database. Refer to Table 10-2 on page 214 and Table 10-3 on page 214 for details.
5.6.2 Installing additional UDFs and stored procedures

With DB2 UDB V8.1, some extra UDFs and stored procedures are provided through this redbook to simplify the creation of a data mining model. Refer to Appendix F, “UDF to create data mining models” on page 281, to install these new database objects.
5.6.3 Model building

After you install the additional database objects, you use the new stored procedure BuildClusModel for building a clustering model. The following example uses the stored procedure to create and save a segmentation model named “Connection_Segmentation” in the table IDMMX.CLUSTERMODELS (the default). The parameters allow a maximum of 30 clusters to be defined and set the fields “CALLER_ID” and “PREMIUM_ID” to be information fields only. You can insert additional parameters into the stored procedure by using additional methods. For a list of commonly used methods for clustering models, refer to Table 10-6 on page 217. After you invoke the stored procedure shown in Example 5-1, DB2 UDB creates a model called “Connection_Segmentation”.
Example 5-1 Stored procedure to invoke clustering run in DB2 UDB V8.1
call BuildClusModel('Connection_Segmentation', 'CONNECTION_TABLE',
  IDMMX.DM_ClusSettings()
    ..DM_useClusDataSpec(MiningData('CONNECTION_TABLE')..DM_genDataSpec())
    ..DM_setMaxNumClus(30)
    ..DM_setFldUsageType('CALLER_ID',2)
    ..DM_setFldUsageType('PREMIUM_ID',2)
    ..DM_expClusSettings());
You can now use the model to score the table as explained in 5.7.7, “Applying the scoring model” on page 90.
5.7 Implementation in DB2 UDB V7.2

Figure 5-4 illustrates the step-by-step implementation of the model in DB2 UDB V7.2.
[Figure 5-4 Step-by-step implementation flow using DB2 UDB V7.2: Configuration (enable the database for modeling and scoring; define the data model setting and the model parameter settings; build the mining task); Modeling (if new patterns need to be found, run the stored procedure to build the model); Scoring (when a fresh CDR extract is available, build the score script if it does not yet exist, then run and rank the scores); Application Integration (build the investigation report if it does not yet exist, then display it).]
5.7.1 Prerequisite: Initial model building

Model building was first performed in DB2 Intelligent Miner for Data. Model building with new cases of data is best performed in an interactive workbench environment, where data can be visualized and understood in an interactive manner. Initial model building is also typically performed in cycles where the data miner may experiment with different parameter settings and configurations. For example, for the fraud detection exercise, you may want to experiment with the maximum number of clusters visually, before you commit the max_clusters parameter into the scheduled job using the IM Modeling function.
5.7.2 Data settings

The first step to integrate the modeling step into DB2 UDB is to define the data settings using UDFs in the IM Modeling data mining function:
- Table name
- Usage of table columns for modeling

The script in Example 5-2 defines the data setting in DB2 UDB V7.2.
Example 5-2 SQL script to define data setting in DB2 UDB V7.2
connect to premiums;
delete from IDMMX.miningdata where id = 'Connection';
insert into IDMMX.miningdata values ('Connection',
  IDMMX.DM_MiningData()..DM_defMiningData('CONNECTION_TABLE')..DM_SetColumns(' '));
5.7.3 Model parameter settings

After the input data for the modeling step is defined, the next step is to define the modeling parameter settings. At this phase, the parameter settings must already have been defined and optimized by a highly skilled
data miner. The IT specialist implements the model in a production environment by plugging these parameters into the SQL script that uses the IM Modeling function. For the technique that is used for this scenario, demographic clustering, the possible parameters are:
- Active/supplementary fields
- Field weights
- Value weighting
- Similarity scale
- Treatment of outliers
- Similarity matrix
- Maximum number of clusters
- Similarity threshold
- Desired execution time
- Minimum percentage of data to be processed
The script in Example 5-3 shows how to define cluster training parameters.
Example 5-3 SQL script to define model setting for clustering
delete from IDMMX.ClusSettings where id='Connection_Segmentation';
insert into IDMMX.ClusSettings
  select 'Connection_Segmentation',
    IDMMX.DM_ClusSettings()
      ..DM_useClusDataSpec(MININGDATA..DM_genDataSpec())
      ..DM_setMaxNumClus(30)
      ..DM_setFldUsageType('CALLER_ID',2)
      ..DM_setFldUsageType('PREMIUM_ID',2)
  from IDMMX.MiningData
  where ID='Connection';
In this scenario, mostly default parameter settings are used, except for setting the maximum number of clusters to 30. For a market segmentation exercise, 30 may be too high. For detection of undesirable behavior, however, 30 or more is appropriate, since the goal of the segmentation is to identify clusters of out-of-the-norm behavior. Because fraudulent behavior is likely to be out of the norm in some way, setting a large number of clusters improves the chance of unearthing those small clusters.
5.7.4 Building the mining task

The data settings and model settings are used as input to build the mining task. The task is stored as a UDT in IDMMX.ClusTasks. The script in Example 5-4 demonstrates how to build a data mining task.
Example 5-4 SQL script to build a data mining task
delete from IDMMX.ClusTasks where id='Connection_Segmentation_Task';
insert into IDMMX.ClusTasks
  select 'Connection_Segmentation_Task',
    IDMMX.DM_clusBldTask()..DM_defClusBldTask(d.miningdata, s.settings)
  from IDMMX.MiningData D, IDMMX.ClusSettings S
  where d.id='Connection' and s.id='Connection_Segmentation';
5.7.5 Running the model by calling a stored procedure

After the task is built, you can invoke it by calling a stored procedure. This task is typically scheduled to run at regular intervals, thereby recalibrating the segmentation model much more regularly than in the past. The script in Example 5-5 demonstrates starting a clustering run by calling the procedure IDMMX.DM_BuildClusModelcmd.
Example 5-5 Calling a stored procedure to run the segmentation task
call IDMMX.DM_BuildClusModelcmd('IDMMX.CLUSTASKS','TASK','ID',
  'Connection_Segmentation_Task',
  'IDMMX.CLUSTERMODELS','MODEL','MODELNAME',
  'ConnectionSegmentationModel');
5.7.6 Scoring script generation

Scoring script generation needs to be done only once when configuring the environment. The scoring script depends on two inputs:
- Input table
- Model to use
The score script in Example 5-6 is part of the SQL script generated by the IDMMKSQL command.
Example 5-6 Creating the SQL script for scoring with a data mining model
CREATE VIEW Scoring_engine( premium_id, caller_id, Result ) AS
  SELECT data.premium_id, data.caller_id,
    IDMMX.DM_applyClusModel(models.model,
      IDMMX.DM_impApplData(
        REC2XML(1,'COLATTVAL','',
          data."NO_CALLS", data."NO_CLRS", data."SUM_DUR",
          data."REL_DUR", data."SUM_COST", data."MAX_DUR",
          data."VAR_DUR")))
  FROM IDMMX.CLUSTERMODELS models, connection_table data
  WHERE models.MODELNAME = 'ConnectionSegmentationModel';
5.7.7 Applying the scoring model

The complexity of the score code is encapsulated in the view. To perform the scoring, you query the view that contains the score script. At a minimum, you want the record ID and the cluster ID. Example 5-7 illustrates the scoring script using the input data and the model.
Example 5-7 Scoring with a generated script
INSERT INTO Allocated_Cluster
  SELECT PREMIUM_ID, CALLER_ID,
    IDMMX.DM_getClusterID( Result ),
    IDMMX.DM_getClusScore( Result )
  FROM scoring_engine;
--The ALLOCATED_CLUSTER table is defined once as below; the SQL is modified
--from the result script generated by IDMMKSQL.
CREATE TABLE Allocated_Cluster (
  premium_id char(12),
  caller_id  char(12),
  Clus_id    INTEGER,
  Score      FLOAT );
5.7.8 Ranking and listing the five smallest clusters

The connections are now scored with the segmentation model, and each connection is assigned a cluster. Since the goal is to find connections that exhibit behaviors departing from the norm, rank the clusters by their size in ascending order, so that the clusters with the lowest ranks are those with the smallest size. By extracting the connections that have a rank of less than or equal to five, you obtain a subset of connections that are different from the norm. By examining these connections in detail, there is a good chance of finding connections that are potentially fraudulent or undesirable. The final step of this business scenario is to generate a list of connections in the smallest (bottom) five clusters, as shown in Example 5-8.
Example 5-8 Script to list the connections in the five smallest clusters
select scored.clus_id, attr.caller_id, attr.premium_id,
  attr.sum_dur, attr.no_calls, attr.rel_dur, attr.sum_cost,
  attr.max_dur, attr.var_dur, attr.no_clrs
from allocated_cluster scored, connection_table attr
where scored.premium_id = attr.premium_id
  and scored.caller_id = attr.caller_id
  and scored.clus_id in
    (select clus_id from
      ( select clus_id, count(*),
          rank() over(order by count(*)) as top_N
        from allocated_cluster
        group by clus_id
      ) as temp
     where top_n <= 5);

Select Screens-> Business Service Administration-> Business Service Details. Figure 6-16 shows how the window may appear if you have all the necessary services for this scenario.
Figure 6-16 Business Service example
2. Create your workflow process. Siebel has several sample workflows that you can use as a base for activity treatment. Unfortunately this information is not well advertised.
To find the samples that Siebel provides, use the Siebel Client to connect to the sample database. To go to the Workflow windows, select Screens-> Siebel Workflow Administration-> Workflow Processes-> All Processes. Here you can find the list of available examples. You need to export them from the Sample database and close the client. Start the client again using the server database and import the selected flow into the Workflow display using the import button.

After you have the Workflow in the Process view, highlight it and click Process Designer. This shows you the Workflow diagram of the existing processes. You can modify existing ones or create new ones using the flow chart components in the left navigator. If you double-click each box in the Workflow diagram, you go to the underlying Business Service. Here is where you can specify what you want to have happen in your workflow. The MQ Transport is where you specify your queue names.

3. Invoke your workflow. There is a process simulator that takes you through the workflow and allows you to test whether it is correct. Once it is working properly, you can create an automatic Workflow to pull messages from the queue. Or you can use Visual Basic or Siebel eScript to create control buttons (for example, read from queue, create activity, and so on). If you use eScript, you may need to change the scripting DLL in your configuration file. If you are using Siebel Visual Basic, you must specify sscfbas.dll. You cannot use both languages in your application; you have to choose which one to use. According to the Siebel documentation, the EAI components prefer eScript.

With the MQ Receiver, you can manually start an MQSeries Receiver task through the Server Tasks view, with the parameters shown in Example 6-13.
Example 6-13 Starting an MQ Receiver task
MQSeries Physical Queue Name            SIEBEL_INPUT
MQSeries Queue Manager Name             HUB_QM
MQSeries Response Physical Queue Name   SIEBEL_OUTPUT
MQSeries Response Queue Manager         HUB_QM
Workflow Process Name                   Receive Activity Record
The rest of the parameters for the receiver are defaults, as shown in Example 6-14.
Chapter 6. Campaign management solution examples
143
Example 6-14 Other parameters for MQ Receiver
Receiver Dispatch Method    RunProcess
Receiver Dispatch Service   Workflow Process Manager
Receiver Method Name        ReceiveDispatchSend
Then in Receive Activity Record, the XML message is converted to a Property Set, Create Activity Record. After the outbound call is made, the workflow automatically receives the response data for that particular Activity_id and converts the Activity Response property set to XML. The MQSeries Receiver places the response on the queue.

4. Create the integration objects for the MQSeries Server Transport Adapter. There are several sample integration objects listed in Tools. If you highlight them and click Generate Schema, you see a DTD file that shows you the various elements. You can create new ones from the same window. Copying and modifying existing samples is faster than defining a new object. The process is:
a. Highlight the sample you want to use.
b. Right-click and choose Copy record.
c. Rename the new copy.
d. Click the Synchronize button.
If you expand that object, you see the list of components associated with your primary integration object. Select the ones that you want to use. Do not choose all of them if possible; otherwise your integration object will be rather large.

Each integration object has its own XML representation. Using Siebel Tools, it is easy to obtain the DTD definition of the objects that you want to work with. When creating the messages in the analytical part of the solution, it is important that the XML format complies with Siebel’s standards. Figure 6-17 provides a simple XML representation of an account object.
[Figure 6-17 Siebel integration object XML representation: a Siebel XML document represents an integration object with a Siebel Message element such as <SiebelMessage MessageId="" MessageType="Integration Object" IntObjectName="Sample Account">, containing an object list element, a root component element with its component field elements, and a component container element holding component elements such as <Business Address PhoneNumber="5553334422"> with their own component field elements.]
6.4.6 Other considerations

The function calls described in this book for working with IM Scoring can present a challenge when working with inflexible SQL applications. It is particularly difficult to have all vendors add capabilities to use those function calls. A simple way to overcome this is to work with relational views. You can hide the function calls by embedding the SQL in CREATE VIEW statements. For all applications accessing IM Scoring results through the views, the results appear as regular data elements that can be queried and used in more complex processing.
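As a sketch of this approach, the scoring call from Example 5-6 can be wrapped a second time so that a reporting tool only ever sees plain columns. The view name CUST_SEGMENTS and the column split into cluster ID and score follow the pattern of Examples 5-6 and 5-7; the names are illustrative, not part of the product.

```sql
-- Hide the IM Scoring function calls behind a view; applications
-- simply select from CUST_SEGMENTS as if it were an ordinary table.
-- (View name is an illustrative assumption; the functions are those
-- already used in Examples 5-6 and 5-7.)
CREATE VIEW CUST_SEGMENTS (premium_id, caller_id, clus_id, score) AS
  SELECT premium_id, caller_id,
         IDMMX.DM_getClusterID(Result),
         IDMMX.DM_getClusScore(Result)
  FROM Scoring_engine;

-- Any SQL application can now use the scores transparently:
SELECT caller_id, clus_id, score
FROM CUST_SEGMENTS;
```

Because the mining logic lives entirely in the view definition, recalibrating the model or changing the scoring function requires no change in the consuming applications.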
Data mining functions and OLAP

Especially when working with the OLAP integration server, it is important for the input data to be in a well-behaved format. You can use free-hand SQL to add the function calls to the OLAP load process. However, if the results are used by a number of load and communication processes (for example, stored procedures sending messages to Siebel), it is beneficial to create a view to facilitate the process.
Data mining functions and Siebel

All data applications have their own data models and definitions, and it is no different with Siebel. When interacting from a data warehouse with those systems, it is important to maintain consistency among the data elements in the
involved systems. A data hub is necessary to maintain this consistency. Imagine sending a message from the data warehouse to the call center to contact a particular customer, but the customer_id and the campaign_id are not available on the Siebel side. The process will fail. It is important to keep the two systems synchronized to avoid this kind of inconsistency.

In this particular example, the data hub can be implemented in the data warehouse. A series of workflow and exception treatments needs to be in place. For example, we implement an exception routine for when a message reaches Siebel from the data warehouse without the appropriate correspondence in the online transactional process (OLTP) database. A new process starts to send a message to the data warehouse to ask for the information, and the new data is created on the OLTP database.

On the data warehouse side, a series of tables is necessary to maintain a surrogate key for each data element; this surrogate key becomes the master key for the data hub. Each master key has a key on each system associated with it. Consider customer_id, for example. When a customer is added to the data warehouse from a source other than Siebel, a Siebel key must be made available in the data warehouse. Part of the Extract Transformation Load (ETL) process is responsible for sending a message to the Siebel system to add the new customer and retrieve the new Siebel keys. At the end of this process, the data hub customer table should appear like the example in Table 6-6.

Table 6-6 Hub customer table example
Data element   Hub master key   Data warehouse key   Siebel key
Customer_ID    452156           00358234             98735
Customer_ID    452157           00378645             89732
Campaign_ID    452158           C874                 78234
...
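A minimal sketch of such a hub lookup follows, assuming a hub table named HUB_KEYS with columns mirroring Table 6-6. All table and column names here are illustrative assumptions for this redbook's example, not a prescribed schema.

```sql
-- Hub table holding one surrogate master key per data element,
-- with the corresponding key in each connected system.
-- (Names are assumptions that mirror Table 6-6.)
CREATE TABLE HUB_KEYS (
  data_element   varchar(30),
  hub_master_key integer,
  dw_key         varchar(20),
  siebel_key     varchar(20)
);

-- Translate a data warehouse customer key into the Siebel key,
-- so an outbound message carries an identifier Siebel understands.
SELECT h.siebel_key
FROM HUB_KEYS h
WHERE h.data_element = 'Customer_ID'
  AND h.dw_key = '00358234';
```

The ETL process that registers a new customer with Siebel would insert the returned Siebel key into this table, keeping the two systems synchronized.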
Business Objects campaign reports

Figure 6-18 provides an integration example in Business Objects, the query product used by the end users.
Figure 6-18 Campaign summary Business Objects report
Chapter 7. Up-to-date promotion example

This chapter explains how to quickly build a data mining model to answer a pressing need: running an up-to-date promotion on products in retail. This case uses IM Modeling and IM Visualization and integrates them with the end user's standard reporting tool.
7.1 The business issue

In the retail industry, the large number of products and possible combinations of them makes the business analyst's decisions difficult. The analyst needs to keep track of customer purchase behavior and the best mix of products in each store. This helps them to decide the optimal product price, marketing appeal, and warehousing time. In this way, the manager of each store can control the efficiency of a promotion while it is still running and even change the promotion.

The manager or business analyst must design and run a different promotion every day based on the mix of products they have in the store. For example, the store manager has to sell perishable products in a short time frame and needs to determine the best location in each store. In the same time frame, they can also check the efficiency of their decision and change it if they want. Restrictions on storage space, budget, or the number of stores can lead the business analyst to quickly choose a precise mix of products in each store and arrange the products on the shelves so they do not stay there too long.

This chapter presents a situation where the manager of each store has to make a promotion every week. With a fast and automatic system, they can actually change their decision or make another promotion every day.
7.2 Mapping the business issue to data mining functions

The data mining technique called association rules is very useful to help the manager identify cross-selling and up-selling opportunities. Knowing this, the IT analyst can use IM Modeling to speed up the process of identifying patterns and opportunities. They do this by running association rules embedded in another application. With IM Visualization, the manager can check whether the rules make sense. The business analyst can keep track of the new rules and decide quickly what promotion to make.

The results of the association mining run are the product combinations (rule head and body), the probability that a combination occurs again (confidence), and the percentage of all transactions in which the combination occurs (support).

For example, consider a total of 1000 transactions. In 100 of the transactions, we found the product wine. In 30 of the transactions where we had the product wine, we also found the product cream. Now for the rule Wine => Cream, the rule head is
the wine, the rule body is the cream, the support is 30/1000 (3%), and the confidence is 30/100 (30%).

The manager must decide in a short time frame how to sell a particular product that is more perishable (cream), whose valid date expires in the next two days. Therefore, the manager has to quickly sell and change the selling strategy before the product's valid date expires. Finding the correlations between products using the association rules technique from IM Modeling helps the business analyst to decide the best product locations and the optimal promotion. Since it can be run in a fast and automated way, the business analyst can change their decision based on the selling pattern before the end of the week (the end of the promotion time).
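The support and confidence arithmetic above can also be computed directly in SQL from a transaction table. A hedged sketch follows, assuming a TRANSACTIONS_DATA table shaped as in Table 7-1; the item numbers '001' (wine) and '002' (cream) are placeholders, not values from the case study.

```sql
-- Support and confidence for the rule Wine => Cream, computed from
-- raw transactions. Item numbers '001' (wine) and '002' (cream)
-- are illustrative placeholders.
WITH wine AS (
  SELECT DISTINCT transaction_id
  FROM transactions_data WHERE item_no = '001'
),
both_items AS (
  SELECT DISTINCT t.transaction_id
  FROM transactions_data t, wine w
  WHERE t.transaction_id = w.transaction_id AND t.item_no = '002'
),
total AS (
  SELECT COUNT(DISTINCT transaction_id) AS n FROM transactions_data
)
SELECT
  100.0 * (SELECT COUNT(*) FROM both_items) / total.n  AS support_pct,
  100.0 * (SELECT COUNT(*) FROM both_items)
        / (SELECT COUNT(*) FROM wine)                  AS confidence_pct
FROM total;
```

With the 1000/100/30 figures from the text, this query would report a support of 3% and a confidence of 30%; IM Modeling computes the same quantities for every rule it generates.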
7.3 The business application

The business application provides the store manager with information to discover which combinations of products the customers purchase. It helps the store manager to plan promotions and to act quickly, changing decisions to make new promotions every day. Cross-selling the product offerings to the customers is based on the associations technique, which discovers the relationships between products at all levels of the product hierarchy.
7.4 Environment, components, and implementation flow

The suggested deployment process is outlined here and shown in Figure 7-1:
1. Collect all the data transactions made by each store.
2. Store the data in a relational database such as DB2 Universal Database (UDB).
3. Use IM Modeling to identify which products are sold more often and the possible product combinations. This is done using the associations technique that generates basic descriptive statistics and defines the association rules.
4. Use IM Visualization to view the rules produced by the associations technique.
5. Extract the association rules to a DB2 UDB table.
6. Feed the rules into any analytical query tool (Business Objects) that can access DB2 UDB, combining the resulting rules with product sales by store, time range, category, promotion type, or region.
Chapter 7. Up-to-date promotion example
151
To implement this solution, which can be easily integrated with any software that can run SQL, the skills required are those of a database administrator and a programmer. The database administrator skills allow you to schedule and manage the data mining models. The programmer skills allow you to build the application. The business analyst skill is required to interpret the rules and give proper feedback to the programmer to build a transformation or filtering step. It should be clear that IM Modeling was developed to leverage the IT data mining skill, based on the feedback of the business analyst who uses IM Visualization.
[Figure 7-1 Up-to-date promotion components: applications with embedded mining use a scheduler to run jobs that search for association rules and recalibrate the model against the transactional data in the data warehouse/analytical data mart; the IM Modeling API mines the data in the modeling environment; IM Visualization and Business Objects consume the results through the application integration layer (CLI/JDBC/SQLJ or ODBC).]
7.5 Step-by-step implementation

This scenario shows how, after you have a new transactional data load and new purchase behavior, you can automatically run an association rule model. Then you can help the manager of a retail store decide what kind of promotion to make today. The association rules data mining technique permits you to find product combinations, ordered by sequence, based on customer purchase behavior.
That is, if you buy products A and B, you may also buy products C and D. See Mining Your Own Business in Retail Using DB2 Intelligent Miner for Data, SG24-6271, for more information on association rules.

Some steps in this implementation (Figure 7-2) are done only once (configuration), while building the model is scheduled depending on the timing and urgency of the manager of each retail store.
[Figure 7-2 Implementation flow of the up-to-date promotion: Configuration (database enablement; database profile in IM Visualization; transaction table); scheduled operational flow (when there are new products or new customer purchase behaviors, build filters and develop the model rules with the defined input data and parameter settings; if the rules make sense, extract them into a table, otherwise revise the model); Modeling then feeds Application Integration (update the Business Objects reports; select rules and design the promotion).]
Chapter 7. Up-to-date promotion example
153
7.5.1 Configuration

The database where IM Modeling runs must first be enabled. When enabled and configured, this database can also be used in IM Visualization and Business Objects.
Database enablement

You must first enable the DB2 UDB instance and the RETAIL database to allow the modeling functions to work. To enable the DB2 UDB instance, refer to the steps in 9.3.1, “Configuring the DB2 UDB instance” on page 197, for IM Modeling. To enable the RETAIL database, refer to the steps in 9.3.2, “Configuring the database” on page 198, for IM Modeling. You must do this only once for this working database and instance.
Database profile in IM Visualization

The database profile can be edited using IM Visualization as explained in 11.2.2, “Loading a model from a database” on page 232.
7.5.2 Data model

The everyday purchases of every customer provide transactional data as input to be loaded in DB2 UDB. Table 7-1 describes the Transactions_Data table to be used in this scenario.

Table 7-1 Customer data in retail case
Field name       Field description                                                          Data type
CUSTOMER_NO      Customer number                                                            char(7)
DATE             Purchase date                                                              char(6)
ITEM_NO          Purchase item number                                                       char(3)
STORE_NO         Store number where the customer has purchased the items                    char(3)
TRANSACTION_ID   Transaction ID received whenever a customer buys one or a group of items   char(5)
A job can easily be scheduled to run for any update or insert that occurs in this transactional table and provides up-to-date reports.
7.5.3 Modeling

Before you build the associations model that is required in this scenario by the business analyst, it may be interesting to apply some filters on the data.
Building filters

Depending on the support of the product, the business analyst may need to exclude frequent products such as soda, salt, sugar, garlic, and so on. One of the store manager's problems is to sell perishable products in a short time frame. Therefore, a possible filter may be to select all the transactions concerning products whose perishable expiration date falls in the following week.
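Such a filter can be expressed as a simple view over the tables in Table 7-1 and Table 7-2. A sketch follows; the view name is illustrative, and the conversion of the YYYYMMDD-style EXPIRATION_DATE field to a date is an assumption about the stored format.

```sql
-- Keep only transactions for products that expire within the next
-- seven days, so the model is built on soon-to-perish items.
-- View name and the EXPIRATION_DATE conversion are assumptions;
-- Table 7-2 shows the field in YYYYMMDD form.
CREATE VIEW perishable_transactions AS
  SELECT t.transaction_id, t.customer_no, t.store_no, t.item_no
  FROM transactions_data t, product_name p
  WHERE t.item_no = p.item_no
    AND DATE(p.expiration_date)
        BETWEEN CURRENT DATE AND CURRENT DATE + 7 DAYS;
```

Pointing the modeling step at this view instead of the full transaction table restricts the generated rules to the products the manager must move quickly.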
Developing model rules with input data and parameter settings

The model was developed to run only in DB2 UDB V8.1. In a single command, the model takes the data specification, the mining technique, the model parameter settings, and the name mapping as input. Example 7-1 shows the Build_associations.db2 script, which is also included in the additional material that accompanies this redbook. For more information, see Appendix K, “Additional material” on page 301.
Example 7-1 Building the associations rules model
-- build association rules model
call BuildRulesModel('Up_to_date_promotion','TRANSACTIONS_DATA',
  IDMMX.DM_RuleSettings()
    ..DM_useRuleDataSpec(MiningData('TRANSACTIONS_DATA')..DM_genDataSpec()
      ..DM_addNmp('desc','PRODUCT_NAME','ITEM_NO','DESCRIPTION')
      ..DM_setFldNmp('ITEM_NO','desc'))
    ..DM_setGroup('TRANSACTION_ID')
    ..DM_setItemFld('ITEM_NO')
    ..DM_setMinSupport(3)
    ..DM_setMinConf(30)
    ..DM_expRuleSettings());
The input data table is Transactions_Data; the product descriptions come from the Product_Name table (Table 7-2). The business analyst requires a minimum confidence of 30% and a minimum support of 3%.

Note: The high confidence value ensures that you only produce rules that are strong enough. The low support value gives you a variety of rules, so you are likely to find a rule for most of the products later.
The rule model generated is named Up_to_date_promotion.
Table 7-2 Product_Name table
Field name        Field description           Data type
ITEM_NO           Purchase item number        char(3)
DESCRIPTION       Purchase item description   char(15)
EXPIRATION_DATE   Item expiration date        YYYYMMDD
The rules model is loaded into IM Visualization as shown in Figure 7-3.
Figure 7-3 Rules in IM Visualization
The business analyst, in agreement with the IT specialist, can get a quick view of the number of rules generated and perform a consistency check before exploiting the rules and sending them to the store managers through daily reports.
Extracting rules into a table

With DB2 UDB V8.1, some extra UDFs and stored procedures are provided through this redbook, such as the UDF ListRules. You can find the UDF ListRules in Appendix G, “UDF to extract rules from a model to a table” on page 285. The UDF ListRules is used first to extract the rules from the model into a table so the rules can be exploited by end-user tools such as Business Objects.
Next you run a script, such as extract_rules.db2 (Example 7-2), that calls the function ListRules and selects the model called Up_to_date_promotion that was created earlier. This script is also available in the additional materials that accompany this redbook.
Example 7-2 Extracting the associations rules
-- list rules using item descriptions
INSERT INTO Rules_Definition
  SELECT P1.Description as Antecedent, P2.Description as Consequence,
         R.support, R.confidence
  FROM table ( ListRules(
         (SELECT MODEL FROM IDMMX.RuleModels
          WHERE MODELNAME='Up_to_date_promotion') ) ) R,
       PRODUCT_NAME P1, PRODUCT_NAME P2
  WHERE R.Antecedent = P1.ITEM_NO
    AND R.Consequent = P2.ITEM_NO;
Table 7-3 shows the layout of the table that is created (Rules_Definition).

Table 7-3 Rules_Definition table
Field name    Field description           Data type
ANTECEDENT    Rule head (only one item)   char(15)
CONSEQUENCE   Rule body (only one item)   char(15)
SUPPORT       Rule support                decimal(12,2)
CONFIDENCE    Rule confidence             decimal(12,2)
Now, with this table, any application that accesses DB2 UDB can select the rule head, rule body, and the respective support and confidence. Using the DB2 UDB scheduler, a single DB2 UDB script that builds the association rules and inserts the results into the table can be run every day.
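For instance, a report query might pull only the strong rules from the extracted table. The 60% cutoff below simply mirrors the highlighting convention used in the Business Objects report in this scenario; any threshold could be used.

```sql
-- Select the strongest product combinations from the extracted
-- rules, ordered so the most confident rules appear first.
SELECT antecedent, consequence, support, confidence
FROM Rules_Definition
WHERE confidence > 60
ORDER BY confidence DESC, support DESC;
```

Because the rules live in an ordinary table, this query can be embedded in any reporting tool without any data mining-specific functions.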
7.5.4 Application integration

The application used in this case study is the Business Objects reports that the manager of each retail store receives to help them in designing a promotion.
Designing a promotion with Business Objects reports

Once you have the resulting association rules in a table, the business analyst can use any query and report tool that they feel comfortable using. This example shows how the end-user Business Objects report tool, already in place
Chapter 7. Up-to-date promotion example
157
in the enterprise, can help the manager of a store make a decision, such as a promotion, by looking at the report they receive daily. This report shows the product combinations based on the latest customer purchase behavior. Figure 7-4 shows a report where the product combinations (association rules) are highlighted when the confidence is greater than 60%.
Figure 7-4 Business Objects report with the product combinations
With the Business Objects report, the manager can select the product combinations that are the most relevant in their everyday business. Since they must quickly sell perishable products, such as cream, they may create a promotion to sell disposable nappies together with cream. With the Business Objects scheduling feature, an updated and accurate report is on their desk every morning. If required, the manager can ask to change the scheduling and receive the information more often.
7.6 Benefits The benefits of this solution are the ability to: Act and react quickly in a market that is in permanent motion and change Recalibrate and quickly build a new data mining model
7.6.1 Automating models: Easy to use There is an automated way to find the association rules embedded in an application without using any stand-alone data mining software or specialist skills. This approach allows the business analyst (in this case the manager) to see and interpret the rules with IM Visualization. Then the IT analyst can use the SQL API to discover new rules every time a new product is launched, or test the efficiency of a promotion that is still running.
7.6.2 Calibration: New data = new model The advantage of embedding the association technique in an application is that it can be run in batch, for example every night. The business analyst can also keep track of the product association patterns. In this way, the manager of each store can control the warehousing time, and the marketing analyst can control the efficiency of a promotion while it is still running and even change the promotion. IM Modeling brings the ability to calibrate or recalculate a data mining model every time new transactional data is loaded. The faster the data is inserted into the transactional tables, the faster the calibration is done, and the more accurately the business analyst can decide. This up-to-date promotion application can also be used in a Web-based environment. It can perform micro testing on promotions or compare the success rate of two different promotions prior to implementing them in the brick-and-mortar environment. For example, retailers may want to put items up for sale on the Web and then determine how much to produce (overseas) before ordering merchandise for the stores.
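The nightly recalibration decision can be sketched as a simple check: rebuild the model only when the transactional table has grown past a threshold since the last build. The 10% threshold and row counts below are hypothetical; in practice, the rebuild step itself would be the scheduled IM Modeling SQL script.

```python
def needs_recalibration(rows_at_last_build, rows_now, threshold=0.10):
    """Return True when the transactional table has grown by more than
    `threshold` (as a fraction) since the model was last built."""
    if rows_at_last_build == 0:
        return True  # no model yet: always build
    growth = (rows_now - rows_at_last_build) / rows_at_last_build
    return growth > threshold

# Nightly batch: rebuild only when at least 10% new data has arrived.
print(needs_recalibration(100_000, 112_000))  # 12% growth -> True (rebuild)
print(needs_recalibration(100_000, 104_000))  # 4% growth -> False (skip)
```

Tying the rebuild to data growth rather than a fixed calendar keeps the model fresh exactly when the underlying purchase behavior has actually changed.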
Chapter 8. Other possibilities of integration

Integrated mining makes mining simple and offers many more opportunities for integration into any kind of application. A recent trend for DB2 data mining is integration into analytical solutions. Web analytical solutions are in place to provide real-time scoring based on Web site traffic behavior. DB2 OLAP Server now has a feature that highlights deviations and hidden business opportunities. The DB2 Intelligent Miner for Data technology has been integrated into SAP Business Warehouse and Customer Relationship Management (CRM). For example, customer interaction is enhanced by product recommendations computed by mining. Business Objects Application Foundation, as well as QMF for Windows, integrate IM Scoring functions into traditional reporting and OLAP. The WebSphere Commerce Analyzer comes with predefined mining configurations. This chapter covers several of these examples from a business user's point of view. A brief look under the covers shows application designers how mining functions work, or may work, in these cases. A business end user can benefit from mining analysis without even knowing the technology.
8.1 Real-time scoring on the Web (using Web analytics) Marketing across traditional channels has changed forever due to the Internet explosion, which forced changes in traditional marketing techniques, including the capture and use of extensive customer data. For example, today's online retailers are drowning in customer data. Industry experts concur that understanding customers as individuals, and leveraging every interaction to maximize those insights, is crucial to success.
8.1.1 The business issue Understanding interactions between customers and enterprises through the Web, and actually acting on them with measurable results, presents a challenge. Barriers to real customer insight are:
Lack of a centralized information source Lack of coordination across channels Lack of deep, current customer data Lack of real-time response capability, especially online
Assume that in our day-to-day business, we have succeeded in addressing the first two barriers. When the channel is the World Wide Web, we are often still confronted with the last two issues. Our business issue in that case translates into: We want to better understand where traffic originates, who the visitor is, what the customer preferences are, and how to approach desired customers effectively in a real-time Web environment.
8.1.2 Mapping the business issue to data mining functions To address this business issue, let’s look at customer-related data and Web-surfing behavior data. The mining model is in place so that a number of preset characteristic profiles are ready to target via the Web channel. To achieve this, a more personalized marketing approach is required, beyond mass marketing. On the one hand, the scoring engine underneath uses the model. On the other hand, there is the data to score the Web visitor against a certain profile. From there, you can work toward understanding the customer preferences. However, both sets of data, customer-related data and Web-surfing behavior, are often incomplete or even missing, as in the case of Web traffic. This may simply be because it is a first-time customer to the business, a first-time visit to the Web site, or an incomplete online application form. In that case, you would make
a prediction of the likelihood that this particular Web site visitor has a certain customer behavior profile. This prediction may be based on predicted values for characteristics such as age group, sex, and income class, along with predicted preferences for product characteristics (model, color, price range). The data used for prediction may be the data that changes dynamically during the session and the specific Web site traffic in the session (Web page, page sections, and product information traffic). This data would be combined with other data prepared in the data warehouse and linked to the operational database. You may use modeling and scoring, as part of the Web analytics approach, to perform real-time scoring against the data warehouse or operational database. For example, the higher the female gender propensity score grows, the more interested in female-oriented content the viewer is likely to be. Real-time scores are produced for up-sell and cross-sell campaigns or other CRM initiatives, depending on the business issue. Real-time scoring services for CRM initiatives run against the data warehouse or operational database. In this scenario, you can address the business issue of up-selling and cross-selling via the Web, with or without a combination with data-driven personalization. For example, if the female gender propensity increases over the Web traffic session, the personalized recommendations focus more on female product lines. The product recommendations can come from an association modeling run (or a collaborative filtering tool) against the data warehouse. This solution is shown in Figure 8-1.
Figure 8-1 Using modeling and real-time scoring services for Web analytics (the figure shows Web analytics, Web delivery, and customer information feeding a data warehouse; modeling produces segments, and real-time scoring services produce customer value segment, up-sell, cross-sell, risk, product category, and attrition scores)
8.1.3 The business application To better understand the new customers that visit the Web site, you can use scoring. This way you can address your customers on a more individual basis, rather than as generic customers via a static Web site. You can also make product recommendations that can lead to up-selling or cross-selling.
8.1.4 Integration with the application example Both modeling and scoring are integrated in the Web application (see Figure 8-2). The features of both a clustering and an association model are reflected in the monitor and trace facility that the IT developer set up for the end users, whether that is the marketing department or the Web site manager. The features are set up as part of the Web delivery to internal viewers. The trace facility is not meant for external viewing purposes and, therefore, is hidden from customers by the Web delivery tools.
The scoring engine is used to score several features of the profiles that are set up on the basis of the models. Both the demographic features in the cluster model and the navigational and transactional product features in the association model are updated continuously and in real time by the scoring engine. The scoring engine continuously monitors the visitor traffic during the life span of each individual visitor’s Web session. Based on this, you can approach customers more effectively in real time in a Web environment.
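One simple way to picture this continuous profile update is an exponentially weighted score that moves toward 1 when the visitor views matching content and decays otherwise. This is only a conceptual sketch: the weight, the page tags, and the update rule are hypothetical and do not reflect the actual IM Scoring algorithm.

```python
def update_propensity(score, page_tags, weight=0.2):
    """Exponentially weighted update of a propensity score (0..1)
    based on the tags of the page just viewed."""
    signal = 1.0 if "female_content" in page_tags else 0.0
    return (1 - weight) * score + weight * signal

score = 0.5  # neutral prior for a first-time visitor
session = [{"female_content"}, {"female_content"}, {"sports"}]
for tags in session:
    score = update_propensity(score, tags)
print(round(score, 3))  # -> 0.544
```

The score is recomputed after every click, so the delivery layer can adjust recommended content mid-session rather than waiting for a batch update.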
Monitor and trace customer profile behavior: This page was specially designed to gain insight into the price sensitivity of the visitor. The gender propensity was scored as part of the visitor profile on the basis of other Web page traffic during this or another session of this same visitor. The information in the customer behavior profile is derived from many sources. Information comes directly from purchase history and indirectly via subtle questioning. This is similar to the approach that a master salesperson would undertake.
Figure 8-2 Tracing a customer behavior profile based on session traffic
This example of Web analytics demonstrates the focus on maximizing the customer relationship and the return it generates. At several stages of the customer’s interaction with the online application, both the mining model and the online scoring principle are used. IM Modeling and IM Scoring both support this principle as integrated technology parts of a Web analytics solution for a cross-sell or up-sell application.
The models, tightly integrated with the data itself in the database management system (DBMS), facilitate automating the process of dynamically evaluating and responding to individual customer preferences and behaviors. Certain product items and price offerings, preferred color combinations, and other micro campaigns can be offered to Web site visitors in a more dynamic way, based on the underlying mining model.
8.2 Business Intelligence integration This section discusses the integration of IM Scoring with other Business Intelligence tools. In particular, it covers tools for online analytical processing (OLAP) using the DB2 OLAP Server and tools for query reporting using QMF.
8.2.1 Integration with DB2 OLAP Making data mining results available to the business analyst using multidimensional OLAP front-end tools gives new insights to solve real business problems such as to: Find the most profitable customer segments Understand customer buying behavior (find product associations using market basket analysis) Optimize the product portfolio Keep profitable customers by understanding attrition factors
Knowledge that was previously hidden in the data warehouse and data mining models, and that was only available to data mining experts, is now available for cross-dimensional browsing using both DB2 UDB and DB2 OLAP Server. Integrating IM Modeling and IM Scoring further into OLAP solutions, by automating steps previously done manually by the OLAP designer, reduces the steep learning curve for OLAP users applying mining technology. It also brings faster time to market for marketing- and sales-related actions based on the discovered knowledge, because the automation eliminates the manual effort.
Basic understanding of an OLAP cube An OLAP cube is a multidimensional view into a company's data. Typical dimensions include time, products, market or geography, sales organization, scenario (plan versus actual), and a measure dimension (which includes such measures as revenue, cost of goods sold (COGS), profit, or ratios like margin). The structure of dimensions that defines a multidimensional view is called an outline.
Each dimension can be hierarchical in structure. For example, the time dimension can be broken down into years, the years into quarters, the quarters into months, and so on. Typically the outline contains a hierarchy that business analysts have used for a long time in their reporting. For example, the customer dimension in the banking industry could be aggregated according to customer segments such as private individuals, corporate customers, public authorities, and so on. The cube typically does not contain attribute dimensions for all attributes that are known in the warehouse about the base dimension “customer”. In the banking industry, for example, the warehouse may have dozens of attributes, such as age, marital status, number of children, commute distance, number of years as a customer, and so on, for each customer.
Integrating a new dimension in the cube The attributes described in Chapter 4, “Customer profiling example” on page 51, can be represented as an additional dimension or as an attribute dimension. An attribute dimension simply assigns the attribute as a label to the base dimension. Defining a hierarchy for customers, such as one based on geographical information, is easy to do in an OLAP system. Using market regions in an OLAP cube is common practice. However, a hierarchy that is easy to define, such as a geographical hierarchy, does not necessarily give valuable information about the business. Data mining using IM Modeling and IM Scoring, instead, can produce a segmentation of customers. It takes more information about customers into account, such as family status, size of city, estimated income, and other demographic data. Such segments, also called clusters, can then be used to define a customer dimension in OLAP, as shown in Figure 8-3. The cluster identifiers can be added to the OLAP cube as additional attribute dimensions, and a meaningful short description of each cluster can be added to the dimension.
Figure 8-3 OLAP outline with initial rough hierarchy to present customer groups
The bank's focus group consists of private individuals, and the bank's data warehouse contains dozens of attributes such as age, marital status, number of children, and customer annual income segment. IM Modeling, run with the clustering technique on those attributes, found the following groups of customers:
Seniors Families with children Yuppies Other
These customer segments are loaded into the OLAP cube (Figure 8-4). They allow cross-dimensional analysis on each customer segment by geography, product, and time.
Figure 8-4 Customer segments placed as dimensions in an OLAP outline
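Conceptually, adding the mined segments to the cube amounts to labeling each customer with a cluster description and rolling measures up to those labels. A minimal sketch, with hypothetical customer IDs, revenue figures, and cluster assignments (the labels match the segments above):

```python
from collections import defaultdict

# Hypothetical scoring output: customer -> cluster id,
# plus a short description per cluster (as added to the outline).
cluster_labels = {0: "Seniors", 1: "Families with children", 2: "Yuppies", 3: "Other"}
scored = {"C1": 1, "C2": 0, "C3": 1, "C4": 2}
revenue = {"C1": 120.0, "C2": 80.0, "C3": 200.0, "C4": 150.0}

# Roll revenue up to the new customer-segment dimension.
by_segment = defaultdict(float)
for cust, cluster_id in scored.items():
    by_segment[cluster_labels[cluster_id]] += revenue[cust]

print(by_segment["Families with children"])  # -> 320.0 (120.0 + 200.0)
```

In the real solution, the scored cluster identifiers live in DB2 tables and the rollup is performed by DB2 OLAP Server during the cube load; this sketch only illustrates what the new dimension adds analytically.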
When the OLAP analyst views a slice that shows only the customer segment “families with children”, they may want to understand this segment better and invoke the clustering visualizer. The clustering visualizer shows, for a set of attributes, the distribution of the attribute values in a specific segment, compared to the distribution of the attribute values in all customer records. This scenario can be enhanced by a cross-selling scenario as described in 6.4, “Cross-selling campaign” on page 128. Such a scenario provides, in its implementation flow, an example of how to integrate with DB2 OLAP Server when building the cube outline.
8.2.2 Integration with QMF Business analysts frequently need timely and easy reporting of information about their customer base. A typical business question is to understand
the customer base by itself: “What type of data do we have on our customers, and based on that, what do our customers look like?” End-user reporting to address these questions is often based on query reporting. Query reporting relies on queries that business analysts may initially have had set up by the database people in the IT department. In a later stage, these same business analysts may configure and enhance the queries themselves, since they need to run similar queries on a more ad hoc basis. As the need arises, they may want to configure the queries at their desktops once they are accustomed to the code and format.
Mapping the business issue to data mining functions QMF for Windows users can improve the quality and timeliness of business intelligence available to analysts by using the IM Scoring feature of DB2 UDB Extended Edition and Enterprise Extended Edition. Using the new QMF for Windows V7.2 Expression Builder, users can easily invoke IM Scoring to apply the latest mining analytics in real time. The QMF V7.2 Expression Builder helps QMF users build the SQL expression to invoke IM Scoring which, in turn, applies these rules to new business situations. An example of using IM Scoring for clustering functions in QMF is provided through the following business application.
The business application A financial institution, such as a bank, typically runs weekly, monthly, and quarterly reports to monitor transactional relationships with its customer base. Apart from this, these same base reporting queries also serve as starters for ad hoc reporting. Queries need to be run to feed other end-user applications that are used by sales account managers, portfolio managers, mortgage loan officers, and bank call center operators who interact with customers. These people need to interact with the customer based on the customer's needs. They also want to communicate a sense of personalized and up-to-date service to the customer at hand. Personalized answers to customer questions and needs in near real time are key to achieving customer satisfaction. These end users need the information right there, right now, when they interact with the customer. For this to happen, access to the data on the typical sets of customers in the database, as well as the means to make slight variations for ad hoc query reporting, are useful.
Integration with the application example To provide access to the customer database of the bank and, at the same time, achieve near real-time response based on the database mining extenders
embedded in the relational database management system (RDBMS) of our bank, we build the SQL query shown in Example 8-1 in the Expression Builder panel of QMF V7.2. We use the scoring functions on the basis of a cluster model. The query is called QMF Query to use IM Scoring.qry.
Example 8-1 QMF V7.2 query for IM Scoring

-- DROP TABLE CBARAR3.QMFResultTable;
-- CREATE TABLE CBARAR3.QMFResultTable (
--   Customer   CHAR(8),
--   Cluster_id INTEGER,
--   Score      DOUBLE,
--   Confidence DOUBLE
-- );
INSERT INTO CBARAR3.QMFResultTable (Customer, Cluster_id, Score, Confidence)
SELECT CLIENT_ID,
  Q.PredictClusterID(
    'Demographic clustering of customer base of an US bank',
    REC2XML( 2, 'COLATTVAL', '',
      CAR_OWNERSHIP, HAS_CHILDREN, HOUSE_OWNERSHIP, MARITAL_STATUS,
      PROFESSION, SEX, STATE, N_OF_DEPENDENTS, AGE, SALARY ) ),
  Q.PredictClusScore(
    'Demographic clustering of customer base of an US bank',
    REC2XML( 2, 'COLATTVAL', '',
      CAR_OWNERSHIP, HAS_CHILDREN, HOUSE_OWNERSHIP, MARITAL_STATUS,
      PROFESSION, SEX, STATE, N_OF_DEPENDENTS, AGE, SALARY ) ),
  Q.PredictClusConf(
    'Demographic clustering of customer base of an US bank',
    REC2XML( 2, 'COLATTVAL', '',
      CAR_OWNERSHIP, HAS_CHILDREN, HOUSE_OWNERSHIP, MARITAL_STATUS,
      PROFESSION, SEX, STATE, N_OF_DEPENDENTS, AGE, SALARY ) )
FROM CBARAR3.BANKING_SCORING, IDMMX.CLUSTERMODELS
WHERE IDMMX.CLUSTERMODELS.MODELNAME =
  'Demographic clustering of customer base of an US bank';
Figure 8-5 shows the result of running this query in QMF.
Figure 8-5 Results of IM Scoring run in QMF V7.2
The results can be displayed in a table or exported to a file format for use by the end user in another business application.
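The REC2XML calls in Example 8-1 package one customer row as an XML fragment for the scoring UDFs. The following Python sketch approximates the shape of that record; it is an illustration only, since DB2's REC2XML itself handles escaping, NULLs, and numeric formatting, and the exact output may differ in detail.

```python
def rec2xml_colattval(**columns):
    """Approximate the record shape produced by DB2's
    REC2XML(..., 'COLATTVAL', ...): one <column name="..."> element
    per column inside a <row> element. Illustrative only."""
    parts = "".join(
        f'<column name="{name}">{value}</column>'
        for name, value in columns.items()
    )
    return f"<row>{parts}</row>"

print(rec2xml_colattval(AGE=42, SEX="F"))
# -> <row><column name="AGE">42</column><column name="SEX">F</column></row>
```

Packing the attribute columns into a single XML argument is what lets one scalar UDF call accept an arbitrary set of model input fields.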
8.3 Integration with e-commerce To be competitive in the global marketplace, businesses need to offer greater levels of customer service and support than ever before. When customers access a Web site today, they expect to browse through a product catalog, buy the products online in a secure environment, and have the products delivered to their doorstep. Electronic commerce, or e-commerce, involves doing business online, typically via the Web. E-commerce implies that goods and services can be purchased online, whereas e-business may be used as more of an umbrella term for a total presence on the Web, which includes the e-commerce component of a Web site. Note: Some of the key concepts of an e-commerce Web site include: User profile: Information that is entered and gathered during a user’s visits forms the user profile. Product catalog: On the Web, this is analogous to a printed catalog. Products are organized into logical groups. The display of the products is tailored to maximize sales. Customers can browse the catalog to search for products and then place orders. Shopping flow: In the e-commerce environment, this is the process where customers browse the catalog, select products, and purchase the products. Shopping cart: The metaphor of a shopping cart has become widely used on the Web to represent an online order basket. Customers browse an e-commerce site and add products to their shopping carts. Shoppers proceed to the check-out to purchase the products in their shopping carts.
The Business-to-Consumer (B2C) e-commerce store model is a publicly accessible Web site offering products for sale. It is analogous to a store on the street, where any member of the public can walk in and make a purchase. A new, unknown customer is called a guest shopper. The guest shopper has the option to make purchases, after they provide general information about themselves to fulfill the transaction (name, address, credit card, and so on). Most B2C sites encourage users to register and become members. In doing so, the business can establish a relationship with the customer, provide better service, and build customer loyalty.
The Business-to-Business (B2B) e-commerce store model refers to an e-commerce store specifically designed for organizations to conduct business over the Internet. The two entities are known to each other and all users are registered. In our case, where we deal with a B2C e-store, after we set up our store, we are likely to be interested in how successful the store is. We may want to have access to specific information about the success of different campaigns and initiatives. We may also want to know more about the customers who are using the store and the responses to specific campaigns and initiatives. See the example in Figure 8-6.
Figure 8-6 B2C, B2B, and business focal point in an e-store (the figure shows campaigns and initiatives, catalog, orders, payment, sales, and customers as Business-to-Consumer focal points, and workflow, contracts, and negotiation as Business-to-Business focal points)
The business issue is to leverage the information, stored separately on both campaigns and customers, by combining these two information sources to gain more lift in responders to campaign initiatives. Web and call center interfaces allow us to test campaigns and effectiveness before applying them to the masses. For example, campaign A targets the first 50 eligible visitors to our Web site, and campaign B targets the next 50 eligible visitors. We can determine offer response, customer segmentation, and so on. Then we can create the appropriate batch or campaign management offer to the segment that best fits the response profile.
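The first-50/next-50 test described above boils down to comparing response rates between the two campaign cells before rolling the winner out to the best-fitting segment. A minimal sketch, with hypothetical response counts:

```python
def response_rate(responses, visitors):
    """Fraction of eligible visitors who responded to the offer."""
    return responses / visitors

# Hypothetical test: campaign A shown to the first 50 eligible
# visitors, campaign B to the next 50.
rate_a = response_rate(9, 50)   # 0.18
rate_b = response_rate(4, 50)   # 0.08
winner = "A" if rate_a > rate_b else "B"
print(winner, rate_a, rate_b)
```

With cells this small the difference may not be statistically significant; in practice the winning offer would be validated on a larger sample before the batch campaign is created.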
We address the following business issues: What are the characteristics of the initiatives that are most successful? What are the characteristics of the customers that respond most favorably to the initiatives?
Mapping the business issues to data mining functions After the e-store is deployed, there are many activities that you must perform to manage the e-store and to examine how well the B2C e-store is performing, as shown in Figure 8-7. Both IM Modeling and IM Scoring may help you to better understand and differentiate between online campaign initiatives and between customers.
Visitor traffic: traffic volumes on visitors, repeat visitors, and registered customers; traffic trends based on hour of day and day of week; time spent by visitors on pages viewed; effectiveness of referrers (advertisements and links on partner Web sites); visitor network geography (domain, subdomain, country); search key words
E-commerce: products seen or selected; shopping behavior (searching or browsing); shopping cart abandon rate; campaign effectiveness; business impact measures; conversion rates (browse-to-buy, search-to-buy)
Path analysis: popular paths through the site; content effectiveness of specific site features; customer needs (search analysis)
Personalization: customer profiling; effectiveness of business rules
Site operation: Web site bugs (broken links, server errors); speed of page load or response
Figure 8-7 E-commerce analysis
IM Modeling can be used to discover additional characteristics, such as why one initiative is more successful than another, or why a particular cluster of customers responds more favorably to a campaign than another cluster. For the first question, the outcome is a set of mining models that describes several clusters of campaigns or initiatives. For the second question, you have a mining model that describes several clusters of online customers who are more or less favorable responders to the campaign initiatives.
IM Scoring may be used to score similar, but newly set up, online campaigns or initiatives against the customer segmentation results stored in DB2 UDB tables. Likewise, in the case of the second question, it allows you to score new customers against existing initiatives, or in combination with new campaigns to be run on short notice. These DB2 data mining functions integrate the mining results automatically back into the datamart, where they can be used to revise e-commerce marketing strategies. Using IM Modeling and IM Scoring, you can score data on online customers and at the same time display the history of these customers, because both are stored in the data warehouse.
8.4 Integration with WebSphere Personalization Most B2C sites try to maintain information about users. They encourage users to register. Information that is entered and gathered during the users’ visits forms the user profile. The user profile information can be used as a powerful marketing tool to personalize the Web site content for the user. The personalized content can be used to filter the product catalog to only those products that the customer is interested in. The content can also be used to implement such selling techniques as cross-selling and up-selling. Web-based personalization is meant to better match the individual needs and preferences of the visitor to the online site. The intent is that, by targeting the Web content to the customer’s needs, the visitor will become a true customer beyond the status of Web site visitor. Online shops, such as online auction sites (eBay), camping equipment stores (REI), and book stores (Amazon), try their best to match Web content to their visitors to turn them into frequent visitors and eventually buying customers. In Web channel-based communication, a medium that provides no actual personal contact with the visitor, real-time response to the visitor’s needs and preferences is essential. Typically, a visitor to a Web site must have their needs recognized within split seconds. Otherwise, they will not return to see whether the site offers the information, services, and products they were looking for and considered buying. Fast delivery of matching content is important both to new visitors and to established customers who have shopped at the online site before. We want to address the business issue of minimizing those who leave (leavers) at all costs, by improving their experience with our B2C site. At the same time, we must turn navigational and information-request visits into actual sales transactions at each occurrence of a Web session.
Mapping the business issue to data mining functions Companies with online shop sites, such as Amazon, use recommendation engines to address the possible needs of their Web site visitors in a more personalized manner. In this way, they sell to new guest shoppers or up-sell and cross-sell books and other items to existing customers. The association techniques of IM Scoring may be used to monitor the flow of Web pages that a user visits during a session. The navigational clicks by visitors who search for items, whether for information requests or simple one-time page visits, tell what information needs the guest shopper or customer may have. At the same time, a series of mouse clicks to register, search, select, and pay for items in the online shopping cart further enhances the personalized services to customers with whom you start to establish a customer lifetime relationship. This Web transactional behavior by visitors, who became part of the customer base after registration, leads to a useful shopping history. It allows IM Scoring to provide scores to the recommendation engine that will be more effective in future Web sessions by the same customers. Figure 8-8 shows a solution that maps the business issue to technology involving personalization and recommendation engines, with the data mining functions of IM Modeling and IM Scoring in place. Here, WebSphere Application Server is running with personalizations based on the recommendation engine of WebSphere Personalization. Scoring results are plugged in by using the Java API of IM Scoring. In certain e-business scenarios, such as first-time visits to the Web site, the input data for scoring may include data that is not yet made persistent in the database. The current data may depend on the most recent mouse click on a Web page. A small Java API for the scoring functions allows for high-speed predictive mining in these cases as well.
It also offers support for personalization in WebSphere. In addition, the IM Visualizers may be plugged in using the applet view of these visualizations, once a Web browser view of the personalizations is set up by the IT department for the content manager or managers.
Figure 8-8 WebSphere Personalization (the figure shows the Personalization Workspace, included with WebSphere Personalization V4.0, where the business manager manages campaigns, previews the Web site, and develops business rules; the WebSphere Personalization recommendation, resource, and rules engines evaluating the rules and selecting content for each site visitor on WebSphere Application Server; WebSphere Site Analyzer analyzing the business rules for effectiveness; and profile and content sources feeding pages personalized for each site visitor)
The business application or applications
Personalization is quite important in two business applications: Amazon, and Recreational Equipment, Inc. (REI). In both, the association techniques of IM Scoring are used to monitor the flow of Web pages that the Web site user visits during a session.
Amazon When you access the online bookstore Amazon (http://www.amazon.com), the site immediately uses the recommendation engine. Amazon invites you to become a registered guest shopper. See Figure 8-9.
Chapter 8. Other possibilities of integration
177
Figure 8-9 Personalizing at Amazon.com when you visit their Web site
Amazon also invites you to tell them your interests, so that they will remember them and can personalize their site just for you. After you decide to have Amazon.com personalized to your interests, the site displays a recommendation wizard (Figure 8-10).
Figure 8-10 The Amazon three-step recommendation wizard
Amazon asks you to follow three steps to receive personalized recommendations:
1. Tell what your favorite stores and categories are.
2. Tell what your favorite products are.
3. Confirm your favorites to receive recommendations from now on.
Based on the purchase history that you supplied in the recommendations wizard, you see a recommendation immediately after your registration. (If you are a first-time guest shopper at the Amazon site, Amazon may look at your purchase history with other stores.) If Amazon cannot find anything to recommend, you see a friendly personal message instead: "Hello Jaap Verhees...We're sorry. We were unable to find any titles to recommend after looking at your purchase history."
You may also see a suggestion hint if the items recommended by Amazon are not on target (right-hand side in Figure 8-11). In this way, by improving your experience with the Amazon e-store, Amazon tries to minimize the likelihood that you leave the Web site and do not return in the near future.
Figure 8-11 Amazon has no recommendations yet, but tries to get on target
You then refine your recommendations by rating products that you have bought, whether at the Amazon Web site under another registration in the past or by shopping at competitors with online or offline stores; the recommendation engine treats these ratings as purchase history. In this example, we specified that we are interested in e-stores that sell:
- Books
- Magazines
- Products for outdoor living
Next, in the Books store, we selected the categories Arts & Photography, Outdoors & Nature, and Travel, and made similar category selections in the Magazines and Outdoor Living stores. Finally, we rated books that were listed to us in our preferred categories, and indicated which of these books we own. We had bought a number of books with titles such as "Discovering Data Mining". From then on, the wizard had enough purchase history and preferences to recommend a product that may likely make us behave as those who stay (stayers) instead of leavers. See Figure 8-12.
Figure 8-12 The Amazon.com recommendations wizard suggesting a book
REI The other example is the REI online store (Figure 8-13). Formed in 1938 as a co-op to provide high-quality climbing and camping gear at reasonable prices, REI is now the nation's largest consumer cooperative with 2 million members. The respected outdoor gear retailer has 59 brick-and-mortar stores in 24 states.
Kiosks in every store allow customers to access the REI Web site at http://www.rei.com, where approximately 78,000 SKUs are listed.
Figure 8-13 REI home page
There is also a value-priced storefront, REI-OUTLET.com, as well as 800-number shopping. With 6,500 employees, REI generates approximately $700 million in sales, of which $100 million comes from the online stores. REI is known for integrating its multiple sales channels to provide its customers (an exceedingly loyal crowd) with a consistently convenient, pleasant, and informative shopping experience. REI's in-store point-of-sale (POS) terminals have been Web-enabled since 1999. They can be used, for example, to order items that are out of stock at the store. REI's multi-channel retailing strategy, moreover, has proven itself beyond a doubt. In a 24-month period, REI found that dual-channel shoppers spent 114 percent more per customer than
single-channel shoppers. Tri-channel customers spent 48 percent more than dual-channel customers. With the importance of its online store firmly rooted in its overall retail strategy, REI began seeking ways to simplify its underlying technology as the site and its functionality grew. REI wanted to focus its energies on what it does best: building more personalized relationships with its customers to improve their experience with REI (see Figure 8-14).
Figure 8-14 Recommendations at the time of shopping cart entries
By scanning the current shopping cart entries, REI’s personalization engine looks at the associated products in its product catalog database. Then it sets up those items as suitable recommendations. If you are a first time guest shopper, you will not have a purchase history in their database. Therefore, the engine does not cross-reference recommendations back to the sales and purchases
data tables. But after you become a frequent shopper, the recommended products are also filtered against your previous purchases, so that you do not receive recommendations for products that you already have. Personalization clearly has more chance of succeeding when your purchase history is taken into account. REI also tries to boost sales by dynamically linking Web content with targeted marketing information, not only with sales and purchase information. For example, if a customer is reading an REI "Learn & Share" article on backpacking, the personalized recommendation engine could drop an image of the hiking boots featured that week onto the Web page. Personalization helps REI use the Web site as a powerful marketing tool, and it also enhances the multichannel integration for the ultimate benefit of REI's customers. For instance, the recommendation engine of the Web site can refer new Web customers to nearby stores that are having sales. It can trigger an e-mail with a coupon, redeemable in stores or online, to a recent brick-and-mortar customer who has purchased a bicycle, offering discounts on helmets and other complementary products.
The integration with the application examples
By handling your item preferences or purchase history, the Web applications of Amazon and REI recommend other associated items to you. The steps for this may be as follows:
1. The IM Modeling functions for association analysis produce a model of associations between the numerous products sold by an e-store, such as Amazon or REI, over time.
2. Your itemized preferences are then matched one by one to the association rules.
3. For each itemized preference, the IM Scoring function for associations selects the association rule with the highest lift (confidence level/support) as the product item to recommend. Example: "If book title A was purchased, then book title B is often bought in combination or within a short time period afterward."
4. The list of selected products is presented next on the personalized Web page to the guest shopper as recommendations. With IM Modeling and IM Scoring, next to WebSphere Personalization and Application Server powering a Web site, online stores are a lot more efficient and able to quickly make changes that enhance the way they interact with shoppers.
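Step 3 above can be sketched in a few lines of Java. This is a toy illustration with invented rule names and numbers, not the IM Scoring API: it computes lift as confidence divided by support, as described above, and returns the consequent of the highest-lift rule whose antecedent matches the shopper's preference item.

```java
// Toy illustration (not the IM Scoring API): selecting the association
// rule with the highest lift for a given preference item, as in step 3
// above. Rule contents and numbers are invented for the example.
import java.util.ArrayList;
import java.util.List;

public class RuleSelector {
    public static class Rule {
        public final String ifItem, thenItem;
        public final double support, confidence;
        public Rule(String ifItem, String thenItem, double support, double confidence) {
            this.ifItem = ifItem; this.thenItem = thenItem;
            this.support = support; this.confidence = confidence;
        }
        // Lift as described in the text: confidence level divided by support.
        public double lift() { return confidence / support; }
    }

    // Return the consequent of the highest-lift rule whose antecedent
    // matches the given preference item, or null if no rule matches.
    public static String recommend(List<Rule> rules, String item) {
        Rule best = null;
        for (Rule r : rules) {
            if (r.ifItem.equals(item) && (best == null || r.lift() > best.lift())) {
                best = r;
            }
        }
        return best == null ? null : best.thenItem;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("Discovering Data Mining", "Data Warehousing", 0.10, 0.40));
        rules.add(new Rule("Discovering Data Mining", "Business Intelligence", 0.05, 0.30));
        rules.add(new Rule("Hiking Boots", "Backpack", 0.20, 0.60));
        // Lift 0.30/0.05 = 6.0 beats lift 0.40/0.10 = 4.0
        System.out.println(recommend(rules, "Discovering Data Mining")); // prints Business Intelligence
    }
}
```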
8.5 Integration using Java
The final, but one of the most worthwhile, capabilities for integrating IM Scoring into any Web-enabled end-user business application is the use of Java Beans technology. This section explains the IM Scoring Java Bean concept through the business case that we present in Chapter 4, "Customer profiling example" on page 51.
8.5.1 Online scoring with IM Scoring Java Beans
IM Scoring Java Beans can be used to score single or multiple data records using a specified mining model. They are designed for applications where the online scoring of data records is the main task: given a PMML model, they enable you to score a single data record in any Java application. This can be used to integrate scoring into e-business applications, for example for real-time scoring in CRM systems. Basically, the IM Scoring Java Beans are a good way to integrate scoring into any Web application. The Java Beans implementation of IM Scoring is designed to offer:
- Fast deployment
- Ease of use in a Java programming environment
- Scoring available to any Web-based application
The functions of IM Scoring Java Beans are implemented as methods of the class com.ibm.iminer.scoring.RecordScorer. The Java API is documented as Javadoc in the directory \doc\ScoringBean\index.html under the IM Scoring program files. The Javadoc is shown in Figure 8-15.
Figure 8-15 IM Scoring Java Beans: JavaDoc on class RecordScorer
8.5.2 Typical business issues A possible application area of IM Scoring Java Beans in CRM systems may be the realization of an Internet-based call center scenario. In this scenario, the required business logic, the scoring functions, runs on a Web or application server. Clients can connect to the server and send to it a data record that was specified by a call-center operator by means of a user interface on the client. The data record is scored on the server. Then the result is passed back to the client in real time. Figure 8-16 shows a simplified design of how such a scenario can be realized using IM Scoring Java Beans. Here, IM Scoring Java Beans are integrated into a Java 2 Enterprise Edition (J2EE) implementation using, for example, servlets or Enterprise JavaBeans (EJB).
Figure 8-16 Architecture sample to realize a call-center scenario
Note: For optimum performance throughput, you may decide to run each mining model in a separate process. In this case, you would pass only the new records to the appropriate scoring process. This results in a considerable performance improvement. The reason for the improvement is that the model-loading step, which is very time-consuming, is done only once.
Another typical application area is the bank customer profile scoring case from 3.5, "Integrating the generic components" on page 44. In this case, the Internet-based part of the bank business environment uses scoring to profile new or recent guest shoppers at the bank. On the basis of the profile that the guest shopper has entered in the online form, it decides what product or service offer most likely suits them. The remainder of this section continues with this case.
8.5.3 Mapping to mining functions using IM Scoring Java Beans
To perform the scoring for a new customer, you must specify the following input:
- The mining model to be used
- The data record with the data of the customer for whom we want to compute a score value
When you specify the necessary input, you can apply scoring and then access the result fields that were computed. Appendix I, "IM Scoring Java Bean code example" on page 293, contains the complete Java code CustomerScore.java. This code runs an IM Scoring Java Bean to score a new customer with a specified customer ID against the clusters defined in a clustering model that has segmented the customer base. The code example performs the following actions:
1. Takes a bank customer ID as input.
2. Retrieves the customer record using Java Database Connectivity (JDBC).
3. Loads the ResultSet into a record.
4. Uses the scoring bean class RecordScorer to load a selected model and score the record.
5. Displays the result of the score.
Note: The Java Bean class RecordScorer performs scoring on a single data record. The record is a set of (field name, value) pairs and must be defined by the user. The computed result can be accessed through a set of getXXX() methods. Before calling the score(Map) method, you must specify the following items:
- The actual values of the mining fields used by the scoring algorithm
- The connection, if the mining model that is used is stored in a database
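The record-scoring flow can be imitated with a small self-contained sketch. The class below is a stand-in, not the real com.ibm.iminer.scoring.RecordScorer: it takes a record as a map of (field name, value) pairs, assigns it to the nearest of a few invented cluster centers, and exposes the result through accessor methods in the spirit of the bean's getXXX() methods.

```java
// Illustrative stand-in for the RecordScorer flow (NOT the real
// com.ibm.iminer.scoring.RecordScorer): a record is a map of
// (field name, value) pairs, and scoring assigns the record to the
// nearest cluster center. Field names and centers are invented.
import java.util.HashMap;
import java.util.Map;

public class ToyClusterScorer {
    private final double[][] centers;   // one row per cluster
    private final String[] fields;      // mining fields used by the model
    private int clusterId = -1;
    private double score;               // squared distance: lower = better fit

    public ToyClusterScorer(String[] fields, double[][] centers) {
        this.fields = fields;
        this.centers = centers;
    }

    // Analogous to score(Map): compute the best-matching cluster.
    public void score(Map<String, Double> record) {
        double best = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double d = 0.0;
            for (int f = 0; f < fields.length; f++) {
                double diff = record.get(fields[f]) - centers[c][f];
                d += diff * diff;
            }
            if (d < best) { best = d; clusterId = c; }
        }
        score = best;
    }

    // Analogous to the getXXX() result accessors.
    public int getClusterId() { return clusterId; }
    public double getScore()  { return score; }

    public static void main(String[] args) {
        String[] fields = { "AGE", "INCOME" };
        double[][] centers = { { 25.0, 20.0 }, { 55.0, 80.0 } };
        ToyClusterScorer scorer = new ToyClusterScorer(fields, centers);

        Map<String, Double> customer = new HashMap<>();
        customer.put("AGE", 52.0);
        customer.put("INCOME", 75.0);
        scorer.score(customer);
        System.out.println("Cluster " + scorer.getClusterId()); // prints Cluster 1
    }
}
```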
Note: The code uses JDBC to retrieve a record (based on customer_id as arg[0]) instead of hardcoding it. This way, you can link the scoring bean back to DB2 UDB, or to any JDBC-enabled database for that matter. For example, changing the DB2 UDB driver specification COM.ibm.db2.jdbc.app.DB2Driver to the specification of the JDBC driver for an Oracle database offers access to data records in that RDBMS.
This code also uses a method that matches the columns in the PMML model to all columns in the ResultSet, instead of hard coding the data fields.
8.5.4 The business application
Bank customers, in particular guest shoppers who use the Internet-based part of the bank business environment, often interact with the bank in short bursts. Therefore, data records based on the information entered within the Web session must be scored in near real time. The bank can then provide an immediate response to the needs that customers state through the online customer information or request form. For both the online customers and the bank, the benefit of the bank's CRM approach to its individual customers grows once IM Scoring is done at high speed and with no need for operator interference. IM Scoring facilitates an effective CRM process toward the bank's customers who use the Web channel to interact with the bank.
8.5.5 Integration with the application example
To score new data records each time a Web channel interaction occurs between the bank customer and the bank's Internet-enabled application (online form), the integration occurs as follows:
1. The online form contents from the Web page are sent in a data record format to the servlet.
2. The servlet feeds the record to the bank's front- or back-office application, which uses the Java Bean (RecordScorer) to score the customer against the customer segments in the clustering model.
3. The result (score and segment ID, matched with the customer ID) from the IM Scoring Java Bean is used to select a bank service or product to up-sell or cross-sell to the online customer.
4. The offer, based on the score, is received by the servlet.
5. The servlet sends the offer as a near real-time response during the Web session to the Web page of the online guest shopper of the bank service. Or
the servlet sends an e-mail to the registered customer in addition to a response to their Web page visit.
The Java Bean code that applies scoring to fit the new customer to the already existing ones helps to achieve a near real-time response back to the customer and can easily be reused.
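The offer-selection step in this flow, turning the segment ID returned by scoring into a concrete up-sell or cross-sell proposal, might look like the following sketch. The segment numbers and offers are invented for illustration; a real application would read them from a product or campaign table.

```java
// Toy sketch (not actual bank or product code): mapping the segment ID
// returned by scoring to an offer to up-sell or cross-sell.
// Segment numbers and offer texts are invented.
import java.util.HashMap;
import java.util.Map;

public class OfferSelector {
    private static final Map<Integer, String> OFFERS = new HashMap<>();
    static {
        OFFERS.put(0, "Student checking account");
        OFFERS.put(1, "Investment portfolio review");
        OFFERS.put(2, "Mortgage refinancing");
    }

    // Returns the offer for the customer's segment, or a default offer.
    public static String offerFor(int segmentId) {
        return OFFERS.getOrDefault(segmentId, "General service brochure");
    }

    public static void main(String[] args) {
        System.out.println(offerFor(1)); // prints Investment portfolio review
    }
}
```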
8.6 Conclusion
IM Scoring enables users to incorporate analytic mining into Business Intelligence, e-commerce, and online transaction processing (OLTP) applications. Applications score records (segment, classify, or rank the subject of those records) based on a set of predetermined criteria expressed in a data mining model. These applications can serve business and consumer users alike. For example, they can provide more informed recommendations, alter a process based on past behavior, and build more efficiency into the online experience. In general, therefore, you can be more responsive to the specific customer relationship event at hand, which often takes place in a Web-enabled business environment.
Part 3
Configuring the DB2 functions for data mining
This part provides a more technical discussion of the different configurations and uses of the DB2 data mining functions. It includes:
- Implementing the DB2 data mining function IM Scoring for existing data mining models
- Using the DB2 data mining function IM Modeling for building the data mining model
- Using the DB2 data mining function IM Visualization for visualizing the mining results, that is, the scores applied to the operational data on the basis of the model
Note: Consult the product standard documentation for more current information.
Chapter 9. IM Scoring functions for existing mining models
This chapter provides detailed information about integrating existing data mining models into a DB2 Universal Database (UDB) database for the purpose of scoring. It starts with an overview of the scoring functions and then provides a step-by-step guide on:
- Enabling the DB2 UDB database for scoring
- Importing models in various formats into the selected DB2 UDB database
- Using the imported model to score and return the result
9.1 Scoring functions
The IM Scoring data mining function makes extensive use of the following DB2 UDB features:
- User-defined functions (UDFs)
- User-defined structured types (UDTs)
- Methods
A user-defined function is a mechanism with which you can write your own extensions to SQL. For example, the API of IM Scoring is implemented with UDFs. User-defined types are useful for modeling objects that have a well-defined structure consisting of attributes. For example, a user-defined structured type that contains the classification identifier and confidence is useful for storing and structuring the result of a classification model. Methods, like UDFs, enable you to write your own extensions to SQL by defining the behavior of SQL objects. However, unlike UDFs, you can only associate a method with a structured type stored as a column in a table. In IM Scoring, you use methods to extract individual results from UDTs. To read more about UDFs, UDTs, and methods, refer to IBM DB2 Universal Database Application Development Guide: Programming Server Applications, SC09-4827.
9.1.1 Scoring mining models
IM Scoring has a two-level structure: the scoring functions provided (clustering, classification, and regression) apply models that were built by the algorithms shown in Figure 9-1. For example, the scoring function for clustering applies models from the demographic and neural clustering algorithms. The results of applying a model are referred to as scoring results. To learn more about these mining models, see Discovering Data Mining, SG24-4839.
Each scoring function applies models from the following algorithms:
- Clustering: Demographic Clustering, Neural Clustering
- Classification: Decision Tree, Neural Classification
- Regression: RBF, Neural Value Prediction, Polynomial Regression (Linear, Logistic)
Figure 9-1 IM Scoring to apply models to new data
9.1.2 Scoring results
The scoring results differ in content according to the type of model applied. When a classification model is applied, the scoring results assign a class label and a confidence value to each individual record that is being scored.
Confidence value: This is a data mining term for the reliability of the fit of a record to a certain class. If the confidence value (range 0-1) is below or near 0.5, the record may fit another class almost as well, or it may not be reasonable to assign the record to the respective class at all.
The predicted class that is produced when you apply a classification model identifies the class within the model to which the data matches. When a clustering model is applied, the scoring results for each individual record being scored are the assigned cluster ID and a measure that indicates how well the record fits into the assigned cluster. The cluster ID identifies the position of the cluster in the clustering model that is the best match for the data. When a prediction model is applied, the scoring result is the predicted value. The predicted value, which is produced when you apply a regression model, is calculated according to the relationships that are established by the model.
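The shape of a classification scoring result can be pictured with a toy class. The labels and values below are invented; only the confidence interpretation, the 0-1 range and the guideline that values below or near 0.5 are unreliable, comes from the text above.

```java
// Toy illustration of a classification scoring result: a class label
// plus a confidence value in [0, 1]. The 0.5 guideline follows the
// text above; the labels and values are invented.
public class ScoringResult {
    public final String predictedClass;
    public final double confidence;

    public ScoringResult(String predictedClass, double confidence) {
        this.predictedClass = predictedClass;
        this.confidence = confidence;
    }

    // Records below or near 0.5 may fit another class almost as well.
    public boolean isReliable() {
        return confidence > 0.5;
    }

    public static void main(String[] args) {
        ScoringResult good = new ScoringResult("stayer", 0.87);
        ScoringResult weak = new ScoringResult("leaver", 0.48);
        System.out.println(good.predictedClass + " reliable: " + good.isReliable()); // prints stayer reliable: true
        System.out.println(weak.predictedClass + " reliable: " + weak.isReliable()); // prints leaver reliable: false
    }
}
```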
9.2 IM Scoring configuration steps
The steps required to configure a DB2 UDB database for scoring are listed in Table 9-1. They are categorized according to several actions.
Table 9-1 Step-by-step action categories and steps
1. Enable the DB2 UDB instance: Update the database configuration parameter UDF_MEM_SZ. Restart the DB2 UDB instance.
2. Enable the database: Increase the database transaction log size, the database control heap size, and the application heap size. Create the database objects that are required for scoring.
3. Export models from the modeling tool: Export the selected model or models to one or more external files, either in PMML or DB2 Intelligent Miner for Data format. (The IM Modeling SQL API only exports models in PMML format.)
4. Import models: Import the models from the external files into the relational database.
5. Generate the SQL script: Generate the SQL script that scores the target table using the models.
6. Application: Invoke the SQL scoring scripts from the application.
Figure 9-2 graphically shows the steps at a high-level view of an application architecture involving a modeling and scoring (and application) layer.
(The figure shows a modeling environment, a scoring environment, and an application environment. Step 1: Enable the instance for scoring. Step 2: Enable the database for scoring. Step 3: Export models as PMML to the file system, or save the model directly through the Modeling API. Step 4: Import the models into DB2. Step 5: Score the data, for example inside a campaign management or call-center customer segmentation application, against the operational data store or analytical data mart.)
Figure 9-2 Application architecture with modeling and scoring
9.3 Step-by-step configuration
The main configuration steps that you perform once for scoring are:
1. Configure the DB2 UDB instance.
2. Configure the database.
3. Export models from the modeling tool.
4. Import models in DB2 UDB.
5. Generate the SQL script.
9.3.1 Configuring the DB2 UDB instance After the scoring module is installed, you need to configure the DB2 UDB instance and the database before you can use IM Scoring. This is done by enabling the DB2 UDB instance. See the steps in Table 9-2.
Since the scoring data mining function is implemented primarily as UDFs, you must increase the default memory size allocated to UDFs. A recommended value is 60000. Table 9-2 lists the parameters and their recommended values.
Table 9-2 Steps for enabling the DB2 UDB instance
1. (UNIX and Windows) Increase UDF_MEM_SZ:
db2 update dbm cfg using udf_mem_sz 60000
2. (Windows only) Increase the DB2 registry parameter:
db2set DB2NTMEMSIZE=APLD:240000000
3. (UNIX and Windows) Bounce the DB2 instance:
db2stop
db2start
9.3.2 Configuring the database
Once the database instance is configured for scoring, the next step is to enable the database. These steps ensure that:
- The database is configured with the appropriate database parameters.
- The required database objects are created:
  – Tables
  – UDFs
  – UDTs
  – Methods
  – Stored procedures
The steps in Table 9-3 are required for each database.
Table 9-3 Steps for enabling the DB2 UDB database
1. (UNIX and Windows) Increase the log size for the likely long transactions during scoring:
db2 update db cfg for <database> using logfilsiz 2000
2. (UNIX and Windows) Increase the application control heap (shared memory):
db2 update db cfg for <database> using app_ctl_heap_sz <value>
3. (UNIX and Windows) Increase the private memory for the application:
db2 update db cfg for <database> using applheapsz 1000
4. (UNIX and Windows) Create the database objects required for scoring, including administrative tables, UDFs, and UDTs:
idmenabledb <database> fenced tables
Federated access
If the table to be scored is a remote DB2 UDB table, such as a table in DB2 UDB on a z/OS server, you can score the table on the remote server using the federated access support in DB2 UDB. Table 9-4 summarizes the middleware prerequisites for federated access.
Table 9-4 Middleware prerequisites for federated database access
For an IM Scoring environment on DB2 UDB Enterprise Edition, on either Windows or UNIX, the prerequisites by remote database are:
- DB2 UDB for z/OS: DB2 Connect
- DB2 UDB for iSeries (OS/400): DB2 Connect
- DB2 UDB Enterprise Edition for Windows: no additional software requirement
- Oracle (Solaris, Linux, AIX): Relational Connect
- SQL Server: Relational Connect
With all the prerequisite software installed, you can now configure remote database tables for federated access. Table 9-5 summarizes the steps to achieve this.
Table 9-5 Configuring a remote DB2 UDB table as a target table
1. Catalog the remote node:
CATALOG TCPIP NODE DB2NODE REMOTE SYSTEM42 or 9.1.150.113 SERVER DB2TCP42 or 50000
2. Define the remote server:
CREATE SERVER DB2SERVER TYPE DB2/390 VERSION 6.1 WRAPPER DRDA OPTIONS (NODE 'db2node', DBNAME 'quarter4')
3. Create the wrapper:
CREATE WRAPPER DRDA
4. Create the user name mapping:
CREATE USER MAPPING FROM USER27 TO SERVER DB2SERVER AUTHID "TSOID27" PASSWORD "TSOIDPW"
5. Create a nickname:
CREATE NICKNAME DB2SALES FOR DB2SERVER.TSOID27.MIDWEST
The nickname represents the remote table that you want to score. Currently, this is the only way to score a DB2 UDB table on z/OS.
9.3.3 Exporting models from the modeling environment
Scoring requires a model as the basis of how to score. The models must be stored in a database table as model objects so that IM Scoring can use them via the SQL API. If the mining models are created by means of IM Modeling, they can be applied directly, since IM Modeling stores the result models in the database tables directly. The mining models may instead have been created in a workbench environment that supports model export, such as DB2 Intelligent Miner for Data or SAS/Enterprise Miner. In this case, the models must be exported to an intermediate file format first. DB2 Intelligent Miner for Data can export the models in the native DB2 Intelligent Miner for Data model format or in the industry-standard PMML format. After the models are in PMML format and made accessible from the file system, the SQL API can be used to import the models into DB2 UDB tables for scoring applications. Imported models are stored as Character Large Objects (CLOBs) in DB2 UDB tables.
Exporting the data mining model from the mining tool Figure 9-3 shows an example of the available formats for exporting a model from DB2 Intelligent Miner for Data.
Figure 9-3 Export model menu from DB2 Intelligent Miner for Data
Using the DB2 Intelligent Miner for Data format After a model is created using DB2 Intelligent Miner for Data, it is ready to be exported to external files for exchange with other applications, such as IM Scoring. The DB2 Intelligent Miner for Data format is appropriate when the model is for deployment using IM Scoring, since IM Scoring can import models in DB2 Intelligent Miner for Data format as well as PMML format. One point to mention is that when exporting to a file with the proprietary DB2 Intelligent Miner for Data format, the file is stored using the system codepage of the DB2 Intelligent Miner for Data client. Make sure that the language-specific characters appear correctly on this machine, which means that the system codepage is correct.
Using the PMML format
Exporting a model in PMML format is recommended when the model is to be imported into third-party tools that only support PMML. Again, it is worth mentioning that when exporting to a file in PMML format, the file is stored using the system codepage of the DB2 Intelligent Miner for Data client. However, the encoding written in the first line of the model (the XML declaration) specifies the codepage of the DB2 Intelligent Miner for Data server, where the conversion to PMML occurred. Therefore, the encoding can be erroneous if the DB2 Intelligent Miner for Data client and server are on different machines and systems.
Converting files from DB2 Intelligent Miner for Data to PMML
DB2 Intelligent Miner for Data models may be stored in the native DB2 Intelligent Miner for Data format from previous mining runs, and you may need to change the format. Consider a file that contains a model in the proprietary DB2 Intelligent Miner for Data format. You can create a file containing the same model in PMML format by explicitly calling the idmxmod command. The input file is expected to be in the system codepage of the current machine. The PMML file is written in the system codepage of the current machine, and the corresponding encoding is written in the first line of the model.
Tip: Transfer your PMML files between machines as binary objects to prevent any implicit codepage conversion.
You can specify the encoding of the model by using an additional parameter in the import function, for example: DM_impClasFileE('/tmp/myModel.pmml' , 'windows-1252')
9.3.4 Importing the data mining model in the relational database management system (RDBMS)
Once the models are stored in export formats, the next step is to load the specified model into the database for deployment of scoring applications. There are different model types and import features for you to take into account. IM Scoring provides an SQL API that imports the mining models in various formats.
Data mining model types IM Scoring can read the models produced by the following DB2 Intelligent Miner for Data functions:
Demographic and neural clustering Tree and neural classification RBF and neural prediction Polynomial regression
When you import a model, you must be aware of the type of model that is being used. These model types include:
Clustering models
Classification models
Regression models
202
Enhance Your Business Applications: Simple Integration of Advanced Data Mining Functions
IM Scoring provides an SQL interface that enables the application of PMML models to data. In this way, IM Scoring supports the PMML 2.0 format for:
Center-based clustering (neural clustering in IM Scoring)
Distribution-based clustering (demographic clustering in IM Scoring)
Decision trees (tree classification in IM Scoring)
Neural networks (neural prediction and neural classification in IM Scoring)
Regression (polynomial regression in IM Scoring)
Logistic regression
Radial Basis Function (RBF) prediction
Note: IM Scoring supports RBF prediction in addition to all the other algorithms listed, although the PMML standard version 2.0 does not yet specify RBF prediction. See the following Web site for more information: http://www.dmg.org
Importing features (DB2 Intelligent Miner for Data, PMML, CLOB) IM Scoring provides the SQL API and a set of UDFs for importing and using various scoring functions. There are different UDFs and DB2 tables for importing and storing different types of models. These UDFs and special tables are created when the database is enabled for scoring using the DB2 script idmenabledb. Table 9-6 matches the different models to the UDFs, UDTs, and DB2 UDB tables to use.

Table 9-6 Matching models to UDFs, UDTs, and DB2 tables

Models produced by: Demographic clustering, Neural clustering
  UDF to import models: DM_impClusFile
  UDT for storing the model: DM_ClusteringModel
  DB2 table where models are stored: ClusterModels
  Example DB2 command:
  db2 "insert into IDMMX.ClusterModels values ('<model name>', IDMMX.DM_impClusFile('<file name>'))"

Models produced by: Tree classification, Neural classification
  UDF to import models: DM_impClasFile
  UDT for storing the model: DM_ClassModel
  DB2 table where models are stored: ClassifModels
  Example DB2 command:
  db2 "insert into IDMMX.ClassifModels values ('<model name>', IDMMX.DM_impClasFile('<file name>'))"

Models produced by: Polynomial regression, Radial basis function, Neural prediction
  UDF to import models: DM_impRegFile
  UDT for storing the model: DM_RegressionModel
  DB2 table where models are stored: RegressionModels
  Example DB2 command:
  db2 "insert into IDMMX.RegressionModels values ('<model name>', IDMMX.DM_impRegFile('<file name>'))"
The following assumptions are made about the codepage during the import:

Importing from a file in the proprietary DB2 Intelligent Miner for Data format, functions IDMMX.DM_imp{Clas|Clus|Reg}File(file): The file is expected to be in the system codepage of the database server. It is transformed to PMML using this codepage.

Importing from a file in PMML format, functions IDMMX.DM_imp{Clas|Clus|Reg}File(file): The encoding specified in the first line of the model (the XML declaration) is assumed to be correct. If the encoding is not correct, use the import function with the file and encoding parameters instead.

Importing from a file in PMML format, functions IDMMX.DM_imp{Clas|Clus|Reg}File(file, encoding): The encoding given as a parameter is used to convert the model. The encoding specified in the first line of the model itself is ignored.

Importing from a database object in PMML format, functions IDMMX.DM_imp{Clas|Clus|Reg}Model(clob): The database object implicitly has the database codepage. The encoding specified in the first line of the model itself is ignored. Note that the model is not converted to the database codepage when it is copied from a file into the database; it is assumed that the file codepage is compatible with the database codepage.

We recommend that you use the functions IDMMX.DM_imp{Clas|Clus|Reg}File(file, encoding) when you want to override the encoding in the XML declaration of the PMML model. This may be necessary
if a previous file transfer changed the code page of the PMML file without updating the XML declaration within the file.
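Putting these rules together, a hedged sketch of an import that overrides the declared encoding might look like the following (the file path and model name are hypothetical):

```sql
-- Hypothetical file path and model name. The second argument overrides the
-- encoding given in the XML declaration of the PMML file.
insert into IDMMX.ClassifModels (MODELNAME, MODEL)
values ('Churn model 2',
        IDMMX.DM_impClasFile('/tmp/myModel.pmml', 'windows-1252'));
```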
9.3.5 Scoring the data This section provides a sample SQL script template for scoring. After the database is enabled and the models are in place, the next step is to generate the DB2 script for the actual scoring. IM Scoring provides the SQL API and a set of UDFs to apply the scoring functions. There are different UDFs for different types of models. These UDFs and special tables are created when the database is enabled for scoring using the DB2 script idmenabledb. Table 9-7 lists the model types and the relevant UDFs.

Table 9-7 Model types and UDFs

Model types: Tree, Neural classification
  UDF: DM_applyClasModel
Model types: Demographic clustering, Neural clustering
  UDF: DM_applyClusModel
Model types: Polynomial regression, Neural prediction, Radial Basis Function
  UDF: DM_applyRegModel
Figure 9-4 gives a conceptual overview of the scoring process by showing the elements of an SQL scoring script for each UDF. Business application-specific tables and their associated models are used as input to the scoring process. Multiple versions of the same model can be stored in the model tables. The main benefit of multiple versions is flexibility in model execution. You can have:
A different version of the model for each customer segment
Different versions of the same model saved at different points of recalibration
The business scenario may be generic customer data used for segmentation, customer churn tables for churn scoring, or customer product tables for cross-sell propensity scoring.
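To illustrate how a specific model version is selected at scoring time, the following sketch picks one version by its model name (all table, column, and model names here are hypothetical):

```sql
-- Hypothetical names throughout; the pattern shows that the WHERE clause
-- on the model table selects which stored model version is applied.
select b.CUSTOMER_NO,
       IDMMX.DM_applyClusModel(
         c.MODEL,
         IDMMX.DM_impApplData(REC2XML(2,'COLATTVAL','', b.AGE, b.INCOME)))
from   CRM.CUSTOMERS b,
       IDMMX.ClusterModels c
where  c.MODELNAME = 'Segmentation model V2';
```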
[Figure 9-4 Conceptual overview of SQL scoring elements. The figure shows three scoring flows inside DB2: a customer attributes table is passed to DM_applyClusModel together with a ClusterModel (versions V1, V2, V3) to produce customer segments; a customer churn table is passed to DM_applyClasModel together with a ClassifModel to produce a customer churn score; and a customer product portfolio table is passed to DM_applyRegModel together with a RegressionModel to produce a customer propensity score.]
There are four essential elements in the SQL scoring script:
The input data: the table to be scored
The input model to use
The appropriate UDF to use for the model
The output score
The pseudocode in Figure 9-5 highlights those elements.
206
Enhance Your Business Applications: Simple Integration of Advanced Data Mining Functions
insert into <output table>                                  [1]
select <key columns>,
       IDMMX.<selected UDF>                                 [2]
         (<Model Table>.model,
          IDMMX.DM_impApplData                              [3]
            (REC2XML(... column 1 ... column N)))           [4]
from   <input table>, <Model Table>
where  ModelName = '<selected model>'

Figure 9-5 Pseudocode for an SQL script for scoring
Consider the following explanation for each of the numbered markers in Figure 9-5:
1. An output table is required if scores need to be stored.
2. IDMMX.<selected UDF> is the UDF that actually scores.
3. IDMMX.DM_impApplData converts the REC2XML result to the UDT DM_ApplicationData so it can be used as an argument to the scoring UDF.
4. While passing the input record into the scoring UDF, make sure that the column names used in the model match the columns in the input table.
Applying the model There are three UDFs in the IM Scoring SQL API that apply a model to score the input data. Keep the category of the model in mind, and use the appropriate UDF for the job:
DM_applyClasModel applies a classification model
DM_applyClusModel applies a clustering model
DM_applyRegModel applies a regression model
Constructing the record and applying the model The simplest and most efficient way to construct a record for scoring is to use the DB2 UDB built-in function REC2XML in conjunction with the DM_impApplData function.
REC2XML takes a number of control parameters and a list of columns as input. It returns an XML string containing name-value pairs, which is used as input to the DM_impApplData function. Example 9-1 shows a sample code segment that applies a classification model to a record constructed with REC2XML and DM_impApplData.
Example 9-1 Passing the record and applying a classification model
IDMMX.DM_applyClasModel(
  c.model,
  IDMMX.DM_impApplData(
    rec2xml(2, 'COLATTVAL', '', ....)))
The parameter value 2 is appropriate most of the time, but the COLATTVAL format string is always required. There are other ways to pass the record, but the REC2XML function is recommended for its syntax simplicity and performance. Figure 9-6 shows an example of implementing the pseudocode.
insert into churn.output_rbf_score
select b.Customer_id,
       idmmx.DM_applyRegModel( c.model, idmmx.DM_impApplData(
         REC2XML(2,'COLATTVAL','',
           b.buying_power, b.cell_phone_contract, b.change_of_offer,
           b.conv_phone_contract, b.customer_age, b.customer_rating,
           b.distributor, b.duration_of_contract, b.gender,
           b.location_size, b.network, b.no_sim_changes,
           b.premium_account, b.revenue_category, b.revenue_development,
           b.revenue_of_3months, b.socio_dem_group)))
from churn.all_customers_test b, idmmx.RegressionModels c
where c.ModelName='Churn model 2';

Because the RBF churn model is stored in idmmx.RegressionModels, the regression UDF DM_applyRegModel is the one used here.
create table churn.output_rbf_score (
  customer_no  char(11),
  churn_score  idmmx.DM_RegResult);

Typically, results are saved into an intermediate table. At a minimum, you need a customer ID and a score. Note the data type of the score column: in this example, it is of type idmmx.DM_RegResult for regression.
After the customers are scored, you may want to export the result to another application, for example, to an outbound marketing system.
Figure 9-6 Sample code for scoring with a churn model
Tip: Using UDFs, UDTs, and methods sometimes requires a lot of typing. IM Scoring supplies a DB2 UDB command, IDMMKSQL, to generate an SQL template for any given model. We recommend that you use this command to generate the scoring code and save time. Refer to DB2 Intelligent Miner Scoring V8.1 Administration and Programming for DB2, SH12-6745, for details about the command and its syntax.
Returning the result The results of scoring are returned as the result of the query. Depending on the model used, results are returned as specific UDTs by the IM Scoring API. Table 9-8 lists the scoring result types.

Table 9-8 Matching scoring functions, result types, and UDFs

Application function: DM_applyClusModel
  User-defined data type: DM_ClusResult
  Result contains: cluster ID, score, confidence
  UDFs to extract results: DM_getClusterID, DM_getClusScore, DM_getQuality, DM_getConfidence

Application function: DM_applyClasModel
  User-defined data type: DM_ClasResult
  Result contains: predicted class, confidence
  UDFs to extract results: DM_getPredClass, DM_getConfidence

Application function: DM_applyRegModel
  User-defined data type: DM_RegResult
  Result contains: predicted value, region (RBF only)
  UDF to extract results: DM_getPredValue
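For instance, a hedged sketch of reading clustering results back out: the table SEG.OUTPUT_CLUS_SCORE and its columns are hypothetical, and CLUS_RESULT is assumed to be of type IDMMX.DM_ClusResult.

```sql
-- Hypothetical scored table; the extractor UDFs come from Table 9-8.
select CUSTOMER_NO,
       IDMMX.DM_getClusterID(CLUS_RESULT)  as CLUSTER_ID,
       IDMMX.DM_getClusScore(CLUS_RESULT)  as SCORE,
       IDMMX.DM_getConfidence(CLUS_RESULT) as CONFIDENCE
from   SEG.OUTPUT_CLUS_SCORE;
```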
Example 9-2 shows the sample coding for extracting the score from a UDT DM_RegResult.
Example 9-2 Extracting the score from a UDT DM_RegResult export to c:\temp\customer_score.csv of del SELECT CUSTOMER_NO, IDMMX.DM_getPredValue(CHURN_SCORE) AS CHURN_SCORE FROM CHURN.OUTPUT_RBF_SCORE;
Consider the following remarks:
The export command saves the result of the query to a comma-separated values (CSV) file for import into third-party applications such as campaign management.
The UDF DM_getPredValue() is used to extract the score.
9.3.6 Exploiting the results After the scored result is in a table, you can perform either of the following actions to exploit the results:
Store each propensity/score for each customer. Each customer can have multiple scores stored, and depending on the campaign, an appropriate score can be chosen. For example, the integration between Siebel Marketing and DB2 Intelligent Miner for Data works via this mechanism.
Assign each customer to a cluster based on the scoring. The customer's movement between clusters can also be made available as part of the data warehouse or operational database.
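As a hedged sketch of keeping multiple scores per customer (the table and column names are hypothetical), a history table keyed by customer and model can hold one row per scoring run:

```sql
-- Hypothetical score-history table; one row per customer per model per run.
create table CRM.CUSTOMER_SCORES (
  CUSTOMER_NO  char(11)    not null,
  MODEL_NAME   varchar(64) not null,
  SCORE        double,
  SCORED_AT    timestamp   not null with default current timestamp,
  primary key (CUSTOMER_NO, MODEL_NAME, SCORED_AT));

-- Extract the numeric value from the UDT result and append it to the history.
insert into CRM.CUSTOMER_SCORES (CUSTOMER_NO, MODEL_NAME, SCORE)
select CUSTOMER_NO, 'Churn model 2', IDMMX.DM_getPredValue(CHURN_SCORE)
from   CHURN.OUTPUT_RBF_SCORE;
```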
9.4 Conclusion IM Scoring is an add-on service for both DB2 UDB and Oracle. It consists of a set of UDTs and UDFs that extend the capabilities of the database to include data mining functions. Mining models may continue to be built using DB2 Intelligent Miner for Data, or they may be built using one of the other DB2 data mining functions, DB2 Intelligent Miner Modeling (IM Modeling). See Chapter 10, "Building the mining models using IM Modeling functions" on page 211, for more information on IM Modeling. In both cases, the modeling and the scoring functions are integrated into the database.

Using the IM Scoring UDFs, you can import certain types of mining models into a relational table and apply the models to data within the database to obtain scoring results specific to the model type. Because of the import facilities, the advantages of PMML also hold true for IM Scoring. IM Scoring includes UDFs to retrieve the values of scoring results, which supports deployment.

IM Scoring is a data mining function that works directly from the relational database, which helps to speed up the data mining process. The scoring, for example the determination of which customers are most likely to respond to a marketing action, is integrated into the database management system itself. In this way, it speeds up deployment into the business environment by (database) developers.
Chapter 10. Building the mining models using IM Modeling functions

This chapter introduces another approach to building data mining models by discussing the advantages of using IM Modeling in this process. The key concepts are:
Automation
Effectiveness
Fast time to market
This chapter describes the modeling functions that are available, the data mining process for modeling, and the implementation steps to perform modeling.
10.1 IM Modeling functions The following mining model functions are supported by IM Modeling:
Associations
Tree classification
Demographic clustering
To understand better what these mining model functions are, see Discovering Data Mining, SG24-4839, or DB2 Intelligent Miner Modeling V8.1 Administration and Programming, SH12-6736.
10.2 Data mining process with IM Modeling IM Modeling gives the database administrator (DBA) the facility to use SQL to build the data mining process. The DBA can manage and control it so that the data mining process is fast, secure, scheduled, and easily recomputed. The steps that result in a ready-to-use data mining model are:
1. Specifying mining data
2. Defining logical data specifications
3. Defining mining settings
4. Defining mining tasks
5. Building and storing mining models
6. Testing the data mining models
7. Working with mining models and test results
This chapter shows how to implement these steps with IM Modeling, after you create the database objects. Table 10-1 outlines all of the steps that are required to enable a database for mining.

Table 10-1 Steps to set up a modeling run

Step 1: Enable the DB2 UDB instance (once only per instance)
  Update the database manager configuration parameter UDF_MEM_SZ. Restart the DB2 UDB instance.
Step 2: Enable the database (once only per database)
  Increase the database transaction log size, the database control heap size, and the application heap size. Create the database objects required for modeling.
Step 3: Specify mining data (once for each table)
  Specify the name and the columns of the training table.
Step 4: Define mining settings (every model)
  Generate logical data definitions. Set a number of parameters that are specific to each data mining function.
Step 5: Define mining tasks (every model)
  Create the mining task, which can also include the test run specification.
Step 6: Build and store mining models (every model)
  Generate the SQL script that builds and stores the mining models.
Step 7: Test classification models
  Test the mining model with the predefined stored procedure.
Step 8: Work with mining models and test results
  Export the mining models and the test results.
10.3 Configuring a database for mining This section explains each of the steps in Table 10-1.
10.3.1 Enabling the DB2 UDB instance for modeling After the modeling module is installed, you need to configure the DB2 UDB instance and the database before you can use IM Modeling. Since the modeling data mining functions are implemented primarily as UDFs, you have to increase the default memory size allocated to UDFs. A recommended value is 60000. A DBA or someone with database manager (DBM) authority must perform the steps that are outlined in Table 10-2.
Table 10-2 Database instance parameters required for modeling

Step 1 (UNIX, Windows): Increase UDF_MEM_SZ.
  db2 update dbm cfg using udf_mem_sz 60000
Step 2 (Windows only): Increase the DB2 UDB registry parameter.
  db2set DB2NTMEMSIZE=APLD:240000000
Step 3 (UNIX, Windows): Bounce the database instance.
  db2stop
  db2start
10.3.2 Configuring the individual database for modeling After the database instance is configured for modeling, you must enable the database. The following steps are required for each database. A DBA or someone with database manager (DBM) authority must perform the steps that are outlined in Table 10-3.

Table 10-3 Database parameters required for modeling

Step 1 (UNIX, Windows): Increase the log size for a likely long transaction during modeling.
  db2 update db cfg for <dbname> using logfilsiz 2000
Step 2 (UNIX, Windows): Increase the application control shared memory heap.
  db2 update db cfg for <dbname> using APP_CTL_HEAP_SZ 10000
Step 3 (UNIX, Windows): Increase the private memory for the application.
  db2 update db cfg for <dbname> using APPLHEAPSZ 1000
Step 4 (UNIX, Windows): Create the database objects that are required for modeling: administrative tables, UDFs, and UDTs.
  idmenabledb <dbname> fenced tables
Step 6 (UNIX, Windows): Optional for DB2 UDB V8. With DB2 UDB V8, the use of IM Modeling can be simplified by installing additional UDFs and stored procedures. You can find the source code in Appendix F, "UDF to create data mining models" on page 281.
  db2 -tvf <path of the extra UDF and stored procedure creation scripts>
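As a hedged sketch, the per-database steps above can be collected into a single script; the database name MYDB is a placeholder, and the parameter values are the ones suggested in Table 10-3.

```shell
#!/bin/sh
# Hypothetical wrapper script; MYDB is a placeholder database name.
DB=MYDB
db2 update db cfg for $DB using logfilsiz 2000         # larger transaction log
db2 update db cfg for $DB using APP_CTL_HEAP_SZ 10000  # control shared memory
db2 update db cfg for $DB using APPLHEAPSZ 1000        # application private memory
idmenabledb $DB fenced tables                          # create IM Modeling objects
```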
10.3.3 IM Modeling in DB2 UDB V8.1 DB2 UDB V8.1 provides additional user-defined functions and stored procedures for IM Modeling. A modeling run can then be invoked with a single call to a stored procedure. Table 10-4 illustrates the invocation of modeling runs using these stored procedures (the settings methods shown in the tree and association examples are illustrative).

Table 10-4 Stored procedures for starting mining runs

Mining algorithm: Clustering
  Stored procedure: BuildClusModel
  Example:
  call redbook.BuildClusModel('myModel', 'BANKING_MODELING',
    ClusSettings('BANKING_MODELING')..DM_setMaxNumClus(6)..DM_expClusSettings());
  Note: This creates a model called myModel on the BANKING_MODELING table and creates up to six clusters.

Mining algorithm: Tree classification
  Stored procedure: BuildClasModel
  Example:
  call redbook.BuildClasModel('myModel', 'BANKING_MODELING',
    ClasSettings('BANKING_MODELING')..DM_setClasTarget('CHURN_FLAG')..DM_expClasSettings());

Mining algorithm: Association
  Stored procedure: BuildRuleModel
  Example:
  call redbook.BuildRuleModel('My_ProductMix_Model', 'CUSTOMER_PRODUCTMIX',
    RuleSettings('CUSTOMER_PRODUCTMIX')..DM_setMinSupport(70)..DM_expRuleSettings());
10.4 Specifying mining data This step specifies the name and the columns of the table that contains your training data set. This is achieved by populating the settings of the table to be used for IM Modeling in the IDMMX.MININGDATA table. Table 10-5 lists the methods for specifying the mining data.

Table 10-5 Methods for defining mining data

Method: DM_defMiningData
  Description: Defines a table as input for mining
  Input: Table name
Method: DM_setColumns
  Description: Defines the individual columns to the modeling API
  Input: Column names and data types

Example:
insert into IDMMX.MiningData values ('Connection',
  IDMMX.DM_MiningData()..DM_defMiningData('CONNECTION_TABLE')..
  DM_setColumns(' '));
10.4.1 Defining mining settings This step includes the following substeps:
1. Generate a logical data specification from the previous step.
2. Add additional parameter settings for the mining run.
Since different mining algorithms require different settings, this step involves using different UDFs and methods for the different algorithms. Table 10-6 tabulates the algorithms and the more frequently used methods associated with them. It is by no means an exhaustive listing of all the methods for each algorithm. There is also an example for each algorithm. These examples illustrate setting up the data mining settings using a table-driven approach; this is only one way to build the data mining settings.

Table 10-6 UDFs and frequently used methods for defining mining settings

Algorithm: All
  DM_genDataSpec - Generate the logical data specification for the data table. Input: none.
  DM_addNmp - Generate a name mapping. Input: name of the mapping to create (varchar).
  DM_setFldNmp - Set a name mapping active. Input: name of an existing name mapping (varchar).
  DM_setPowerOptions - Set the power options specific to an algorithm. Input: power options specific to an algorithm (varchar).

Algorithm: Association
  DM_setItemFld - Assign the role of item to an input column. Input: name of the input column (varchar).
  DM_setGroup - Assign the role of group to an input column; typically this is the transaction ID or customer ID. Input: name of the group (varchar), name of the input column (varchar).
  DM_setMaxLen - Set the maximum length for a rule. Input: length (integer).
  DM_setMinSupport - Set a minimum support threshold for rules. Input: minimum support (integer).
  DM_setMinConf - Set a minimum confidence threshold for rules. Input: minimum confidence (integer).
  DM_addTax - Create a taxonomy. Input: name of the taxonomy to be created (varchar).
  DM_setConType - Set a constraint type. Input: 0 for exclusive, 1 for inclusive (integer).
  DM_addConItem - Add a constraint item. Input: item to be included (varchar).
  DM_remConItem - Remove a constraint item. Input: item to be removed (varchar).
  DM_setFldTax - Set a taxonomy to active. Input: field name, name of the taxonomy (varchar, varchar).

Example:
insert into IDMMX.RuleSettings
select 'Connection_Segmentation',
  IDMMX.DM_RuleSettings()..
    DM_useRuleDataSpec(MININGDATA..DM_genDataSpec()..
      DM_setFldType('TRANSACTION_ID',0)..
      DM_setFldType('ITEM_NO',0))..
    DM_addNmp('newName','shop_1.transactions','SK2_code','product_name')..
    DM_setFldNmp('item','newName')..
    DM_addTax('Taxonomy_1','shop_1.prod_hierarchy','ITEM_NO','prod_group',
      cast(NULL as char),1)..
    DM_setFldTax('ITEM_NO','Taxonomy_1')..
    DM_setMinSupport(70)..
    DM_setGroup('TRANSACTION_ID')..
    DM_setItemFld('ITEM_NO')
from IDMMX.MiningData
where ID='Market_Basket';
Algorithm: Tree
  DM_setTreeClasPar - Set a tree-building parameter. Input: keyword and value:
    'MaxPur' - maximum purity (integer)
    'MaxDth' - maximum tree depth (integer)
    'MinRec' - minimum number of records per internal node (integer)
  DM_setCostMat - Specify a cost matrix for the cost of misclassification. Input: refer to the Administration Guide.
  DM_setClasTarget - Specify the target field. Input: target field (varchar).

Example:
insert into IDMMX.ClasSettings
select 'Churn_Classification',
  IDMMX.DM_ClasSettings()..
    DM_useClasDataSpec(MININGDATA..DM_genDataSpec()..
      DM_setClasTarget('CHURN_FLAG'))..
    DM_setTreeClasPar('MaxPur',95)..
    DM_setTreeClasPar('MaxDth',6)..
    DM_setTreeClasPar('MinRec',5)..
    DM_setCostMat('CUSTOMER.COSTMAT','ACTUAL','PREDICTED','WEIGHT')
from IDMMX.MiningData
where ID='Customer_churn';
Algorithm: Clustering
  DM_setDClusPar - Set a demographic clustering parameter, such as the similarity threshold. Input: refer to the reference guide.
  DM_setMaxNumClus - Set the maximum number of clusters allowed. Input: maximum number of clusters (integer).
  DM_setFldSimScale - Set the field similarity scale. Input: field name, similarity scale (varchar, double).
  DM_setFldOutlTreat - Set the treatment of outliers for a field. Input: field name, treatment (varchar, integer (1, 2, 3)).
  DM_addSimMat - Add a similarity matrix to the settings. Input: see the reference guide.
  DM_setExecTime - Set the maximum execution time. Input: execution time in minutes (integer).
  DM_setMinData - Set the minimum percentage of data that the clustering run must process. Input: percentage (double, 0-100).

Example:
insert into IDMMX.ClusSettings
select 'Connection_Segmentation',
  IDMMX.DM_ClusSettings()..
    DM_useClusDataSpec(MININGDATA..DM_genDataSpec())..
    DM_setMaxNumClus(37)..
    DM_setDClusPar('SimThr',0.45)..
    DM_setFldOutlTreat('MAX_DUR',2)..
    DM_setFldOutlTreat('NO_CALLS',2)..
    DM_setFldOutlTreat('NO_CLRS',2)..
    DM_setFldOutlTreat('REL_DUR',2)..
    DM_setFldOutlTreat('SUM_COST',2)..
    DM_setFldOutlTreat('SUM_DUR',2)..
    DM_setFldOutlTreat('VAR_DUR',2)..
    DM_setFldUsageType('CALLER_ID',2)..
    DM_setFldUsageType('PREMIUM_ID',2)
from IDMMX.MiningData
where ID='Connection';
10.4.2 Defining mining tasks This section explains how to create the mining task. The task combines all of the information you have provided to run the model training. In this step, you can also define a test run against the same mining model that you created with the training set. There are four task types that you can build in IM Modeling:
DM_ClusBldTask: Build task for clustering
DM_ClasBldTask: Build task for tree classification
DM_RuleBldTask: Build task for association rules
DM_ClasTestTask: Task for testing a classification tree
For a long running task, you may want to specify control parameters for the task types in the error message table. Refer to DB2 Intelligent Miner Modeling V8.1 Administration and Programming, SH12-6736. Table 10-7 lists the table name, UDFs, and methods for building mining tasks for each algorithm.
Table 10-7 Tables that store data mining tasks

Table name: IDMMX.RuleTasks
  Algorithm: Association
  UDT: DM_RuleBldTask
  Method: DM_defRuleBldTask
  Example:
  insert into IDMMX.RuleTasks(id,task)
  select 'Cross_sell_Task',
    IDMMX.DM_RuleBldTask()..DM_defRuleBldTask(d.miningdata,s.settings)
  from IDMMX.MiningData D, IDMMX.RuleSettings S
  where d.id='Market_Basket' and s.id='Rule_settings';

Table name: IDMMX.ClusTasks
  Algorithm: Clustering
  UDT: DM_ClusBldTask
  Method: DM_defClusBldTask
  Example:
  insert into IDMMX.ClusTasks(id,task)
  select 'Connection_Segmentation_Task',
    IDMMX.DM_ClusBldTask()..DM_defClusBldTask(d.miningdata,s.settings)
  from IDMMX.MiningData D, IDMMX.ClusSettings S
  where d.id='Connection' and s.id='Connection_Segmentation';

Table name: IDMMX.ClasTasks
  Algorithm: Tree
  UDT: DM_ClasBldTask
  Method: DM_defClasBldTask
  Example:
  insert into IDMMX.ClasTasks(id,task)
  select 'Churn_Classification_Task',
    IDMMX.DM_ClasBldTask()..
      DM_defClasBldTask(d.miningdata, cast(NULL as DM_MiningData), s.settings)
  from IDMMX.MiningData D, IDMMX.ClasSettings S
  where d.id='Customer_churn' and s.id='Churn_Classification';

Table name: IDMMX.ClasTestTasks
  Algorithm: Tree
  UDT: DM_ClasTestTask
  Method: DM_defClasTestTask
  Example:
  insert into IDMMX.ClasTestTasks(id,task)
  select 'Churn_Classification_Test_Task',
    IDMMX.DM_ClasTestTask()..DM_defClasTestTask(d.miningdata,
      'IDMMX.CLASSIFMODELS','MODEL','MODELNAME','Churn_Model')
  from IDMMX.MiningData D
  where d.id='Customer_churn';
10.4.3 Building and storing mining models This step trains the data mining model by calling the appropriate stored procedure. The stored procedures are stored in the database during the database enablement phase. You start the training phase by calling these procedures. The output is a data mining model that is created and saved in the output table. There is a procedure for each model type, as listed in Table 10-8.

Table 10-8 Algorithms and the associated stored procedures, input, and output

Algorithm: Association
  Stored procedure: IDMMX.DM_BuildRuleModel
  Input table: IDMMX.RuleTasks
  Output table: IDMMX.RuleModels
  Example:
  call IDMMX.DM_BuildRuleModelCmd('IDMMX.RULETASKS','TASK','ID',
    'Connection_Association_Task',
    'IDMMX.RULEMODELS','MODEL','MODELNAME','Cross_Sell_Rules');
  Note: Cross_Sell_Rules is the name given to the model produced by this procedure call.

Algorithm: Tree classification
  Stored procedure: IDMMX.DM_BuildClasModel
  Input table: IDMMX.ClasTasks
  Output table: IDMMX.ClasModels
  Example:
  call IDMMX.DM_BuildClasModelCmd('IDMMX.CLASTASKS','TASK','ID',
    'Connection_Classification_Task',
    'IDMMX.CLASMODELS','MODEL','MODELNAME','Churn_Model');
  Note: Churn_Model is the name given to the model produced by this procedure call.

Algorithm: Demographic clustering
  Stored procedure: IDMMX.DM_BuildClusModel
  Input table: IDMMX.ClusTasks
  Output table: IDMMX.ClusterModels
  Example:
  call IDMMX.DM_BuildClusModelCmd('IDMMX.CLUSTASKS','TASK','ID',
    'Connection_Segmentation_Task',
    'IDMMX.ClusterModels','MODEL','MODELNAME','Customer_Segmentation');
  Note: Customer_Segmentation is the name of the model produced by this procedure call.
10.4.4 Testing the classification models For classification models, there is an extra step to check the quality of the model produced. For testing purposes, you must set aside a test set with the actual value already in the target field. When the test is done, it produces a statistical report on the accuracy and quality of the model, that is, on how it performs on cases that were not seen during training. To check the quality of the model you built in the previous step, test it using one of the predefined stored procedures; the result is written to the database. To test the quality of the classification model, use one of the stored procedures listed in Table 10-9.

Table 10-9 Procedures for testing a classification model

Purpose: Test a model stored with the data type model on test data
  Stored procedure: IDMMX.DM_testClasModelCmd
Purpose: Test a model stored as a CLOB on test data
  Stored procedure: IDMMX.DM_testClasModel
Purpose: Test a model using a classification model test task
  Stored procedure: IDMMX.DM_testClasModelCmdT

Example 10-1 shows the DM_testClasModelCmdT stored procedure.

Example 10-1 Using the DM_testClasModelCmdT stored procedure
call IDMMX.DM_testClasModelCmdT('IDMMX.MININGDATA','MININGDATA',
  'ID','MyMiningData',
  'IDMMX.CLASSIFMODELS','MODEL','MODELNAME','MyClassifModel',
  'IDMMX.CLASTESTRESULTS','RESULT','ID','MyTestResult')

Notes: MyClassifModel is the model being tested. MyTestResult is the name of the test result to produce.
10.4.5 Working with mining models and test results After you build an initial model, you may need to work with the mining model and the test results, for example, to monitor model performance.

Mining models You can work with the mining models in one of two ways:
Leverage them directly in scoring.
Export them for distribution, and then import them into the scoring platform. See Example 10-2.

Example 10-2 Exporting a model for distribution
export to dummy of del lobs to c:\temp\ lobfile segmentation.xml
modified by lobsinfile
select IDMMX.DM_expClusModel(model)
from IDMMX.ClusterModels
where modelname='ConnectionSegmentationModel';
Note: IDMMX.DM_expClusModel is a UDF that converts a model into a CLOB for export.
PMML format of mining models IM Modeling also produces data mining models in the PMML V2.0 format. IM Modeling complies with the producer conformance clause of the PMML V2.0 standard for:
Association rules
Decision trees
Demographic clustering
See "IM Modeling conformance to PMML" in DB2 Intelligent Miner Modeling V8.1 Administration and Programming, SH12-6736.
Test results The most important task when working with the test results is to check the quality of the classification model. The most commonly used quality indicator is the error rate (misclassification rate) of the model on the test data. You can use DM_getClassError to extract the error rate from the test result, as shown in Example 10-3.
Chapter 10. Building the mining models using IM Modeling functions
225
Example 10-3 Extracting error rate from test result

select ID, IDMMX.DM_getClassError(result) as ErrorRate
from IDMMX.CLASTESTRESULTS
where ID='Churn_Model';
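The error rate that DM_getClassError returns is conceptually simple: the fraction of test records whose predicted class differs from the actual class. A sketch of the computation (illustrative only; the real value is produced inside DB2 by the test task):

```python
# Sketch of the misclassification rate: the fraction of test records
# whose predicted class differs from the actual class. The class labels
# and records below are invented for illustration.
def class_error(actual, predicted):
    assert len(actual) == len(predicted)
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

actual    = ["churn", "stay", "stay", "churn", "stay"]
predicted = ["churn", "stay", "churn", "churn", "stay"]
rate = class_error(actual, predicted)   # 1 of 5 wrong, so 0.2
```

A rising error rate on fresh test data is the usual signal that the model needs to be rebuilt.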
10.5 Hybrid modeling

The DB2 data mining function of IM Modeling offers ease of use for several mining techniques, including hybrid modeling (see also 1.2, "Data mining does not stand alone anymore" on page 6). Hybrid modeling means applying several mining techniques in sequence to the data. For example, in predictive modeling of the debit rating of banking card holders, the flow in Figure 10-1 first uses demographic clustering to profile the customer base. This way you can characterize each customer according to the cluster to which he or she belongs, which gives you several homogeneous groups of customers in the population of banking card holders. Next, you select the clusters whose profiles interest you most according to the business issue. In the second stage of modeling, a decision tree and its rules tell you in which nodes the customers of a certain cluster belong. These rules help to generate more lift when you target the customers of each cluster in a competitive debit rating campaign.
Figure 10-1 Hybrid modeling run for classification of banking card holders

(The figure shows an analytical data mart feeding demographic clustering, with all clusters and their attributes in an overall view, followed by tree induction on two clusters: one with 31% of banking card holders, characterized as customers with fewer than 51 debit transactions, only teller access to their account, and at least two banking cards; and one with 26% of banking card holders, characterized as customers with more than 86 debit transactions, remote access to their account, and at least one banking card.)
The ease with which you can apply a hybrid modeling process stems from the fact that data, models, and techniques are all available in a relational database management system. Once you know how to use SQL and the SQL API, you are basically up and running.
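The two-stage flow described above (cluster first, then induce rules within the clusters of interest) can be sketched in miniature as follows; all field names, thresholds, and records are invented for illustration:

```python
# Toy sketch of the hybrid flow: a first pass assigns each customer to a
# segment, and a second pass derives a simple decision rule inside one
# segment of interest. Real clustering and tree induction run inside DB2.
customers = [
    {"id": 1, "debits": 40,  "access": "teller", "good_rating": True},
    {"id": 2, "debits": 95,  "access": "remote", "good_rating": False},
    {"id": 3, "debits": 30,  "access": "teller", "good_rating": True},
    {"id": 4, "debits": 110, "access": "remote", "good_rating": False},
]

# Stage 1: "clustering" stand-in, segmenting by transaction volume.
def segment(c):
    return "low_activity" if c["debits"] < 51 else "high_activity"

# Stage 2: within the segment of interest, a one-split "tree" stand-in
# selects the customers matched by the induced rule.
target = [c for c in customers if segment(c) == "high_activity"]
rule_hits = [c["id"] for c in target if c["access"] == "remote"]
```

The point of the sketch is the sequencing: the output of the first technique (the segment assignment) becomes the input population of the second.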
10.6 Conclusion

IM Modeling creates an infrastructure of tables, database objects, and stored procedures to enable a "table driven" approach for creating data mining models. This allows a model produced by a data miner to be mapped into SQL scripts. These scripts can be automated and therefore ensure repeatable success once the initial gems are found by the model. Now that the model building process and the model are part of the database, and because DB2 UDB is open, this data mining capability can be embedded in applications using CLI, ODBC, JDBC, SQLJ, WebSphere MQ, and so on.
Embedding data mining into applications also reduces the reliance on data mining expertise after the initial model building phase. There is now a real window of opportunity to embed advanced analytics on a wider scale than before.
Chapter 11. Using IM Visualization functions

IM Visualization provides analysts with visual summaries of data from a database. It can also be used as a method for understanding the information extracted using other data mining methods. Features that are difficult to detect by scanning rows and columns of numbers in databases often become obvious when viewed graphically. Data mining necessitates the use of interactive visualization techniques that allow the user to quickly and easily change the type of information displayed, as well as the particular visualization method used (for example, change from a histogram display to a scatter plot display or to parallel coordinates). Visualizations are particularly useful for noticing phenomena that hold for a relatively small subset of the data and are, therefore, "drowned out" by the rest of the data when statistical tests are used, because these tests generally check for global features. The advantage of using visualization is that the analyst does not have to know what type of phenomenon to look for in order to notice something unusual or interesting. With statistical tests, by contrast, the analyst must ask rather specific questions, such as "Does the data fit this condition?" Often, the analyst wants to discover something unusual or interesting about a set of instances or an attribute. However, they must ask very directed questions, such as "Is the distribution
skewed?" or "Is this set of values consistent with the Poisson distribution?" No general statistical test can answer the question "Is there anything unusual about this set of instances?" There are only tests for determining whether the data is unusual in a particular way. Visualization compensates for this: humans tend to notice phenomena in visually displayed data precisely because they are unusual. Depending on the skill level of the end user who needs to analyze or interpret a data mining result, the final visualization method needs to take this into account. This chapter explains:
- The IM Visualizer functions
- How to configure the IM Visualizers
- How to use them in another application
- How to use them from the Web
- How we use the Java API to benefit from the visualization techniques
The chapter presents some examples of IM Visualization for visualizing the mining models results of clustering, classification, and association mining runs.
11.1 IM Visualization functions

IM Visualization provides the following Java Visualizers to present data modeling results for analysis:
- Associations Visualizer
- Classification Visualizer
- Clustering Visualizer
Each visualizer can show the same model in different views. Each view contains information that is not available in, or is difficult to represent in, another view. The views are synchronized. This means that when you hide items in one view, they are also hidden in the other views.
11.1.1 Common and different tasks

The IM Visualizer framework provides features and tasks that are common to all Visualizers. For example, the different views of the individual Visualizers are implemented as tabs. You can easily switch to a different view by clicking a different tab. You can perform different tasks in all views of the IM Visualizers by clicking an icon on the tool bar or by selecting a task from the menu bar. In all visualizers,
you can right-click objects to display a pop-up menu that covers the most commonly used features of the appropriate visualizer. In some views of the IM Visualizers, you can perform tasks that are specific to that particular view. For example, fanning in or fanning out is specific to the Associations Visualizer.
11.1.2 Applets or Java API

In addition to installing the IM Visualizers on your computer and using them as a stand-alone product, you can install the IM Visualizers on a Web server and use them as a Java applet embedded in an HTML document. You can store the HTML documents that embed the IM Visualization applet on a Web server and view them on the intranet with a Java-enabled Web browser, for example, with Netscape or with Internet Explorer. IM Visualization functionality is written in Java and provides an API for embedding the GUI functionality into an application.
11.2 Configuration settings

IM Visualizers can show the following models:
- Models that are created by IM Modeling
- PMML models that are created by a different application
- Models that are transformed to XML by the conversion utility of DB2 Intelligent Miner for Data
You can save models that are originally in the IM Modeling format or in the PMML format in the Visualizer format (.vis). However, you can use the Visualizer format only with the IM Visualizers. In the Visualizer format, the data is compressed, and the model properties are saved in the same file.
11.2.1 Loading a model from a local file system

You can start the IM Visualizer and load a model from the local file system. When you start the IM Visualizer, you select the format in which the model was originally saved and select the file. See Figure 11-1.
Figure 11-1 IM Visualizer: Opening a model from the local file system
11.2.2 Loading a model from a database

Loading a model from a database is more secure than loading it from a file system: when you deal with sensitive models, such as fraud detection models or credit risk models, the security features of the database system are in place. To load a model from a database, perform the following actions:
- Make sure that the Java Database Connectivity (JDBC) driver of your database is in the classpath of the applet. You can copy it into the same directory as the JAR files of the visualizers and add it to the "archive" attribute of the applet. If you forget, you will see the following error message:

  The connection to the database failed because the specified driver cannot be found: COM.ibm.db2.jdbc.net.DB2Driver
- Make sure that you have not defined the parameter "model". You must define all the parameters described in Table 11-1 that are not marked as optional.
Table 11-1 Parameters to load a model from a database environment

Parameter name             Description
JDBC_URL                   JDBC URL of the database to connect to. For DB2 UDB, this URL has the following form: jdbc:db2://hostname/DatabaseName
JDBC_Driver                Class name of the JDBC driver to use for the connection with the database. The name of the JDBC driver delivered with DB2 UDB is: COM.ibm.db2.jdbc.net.DB2Driver
DBUserName                 Name of the user to use for the connection with the database
DBPassword                 Password to use
DBTable                    Name of the table
DBPrimaryKeyCol            Name of the column that contains the primary keys of the models
DBModelCol                 Name of the column containing the models themselves
DBModelPropCol (optional)  Name of the optional column containing the visualizer properties of the models
DBModelKey                 Key of the model to load
You can set the parameter values in either of two ways. You can start IM Visualization from a command line with all parameters and values specified (see Example 11-1 on page 235). Or you can start without additional parameters and then use File -> Open to set up the right parameter values in the GUI (Figure 11-2) for selecting your model from the database.
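Whichever way you supply them, the non-optional parameters in Table 11-1 must all be present. A small hypothetical checker illustrates the rule (the parameter names mirror Table 11-1; the helper itself is not part of IM Visualization):

```python
# Hypothetical helper: verify that a parameter set covers every required
# entry from Table 11-1. DBModelPropCol is the only optional parameter,
# so it is excluded from the required set.
REQUIRED = {"JDBC_URL", "JDBC_Driver", "DBUserName", "DBPassword",
            "DBTable", "DBPrimaryKeyCol", "DBModelCol", "DBModelKey"}

def missing_params(params):
    """Return the required parameters that are not yet set, sorted."""
    return sorted(REQUIRED - set(params))

params = {
    "JDBC_URL": "jdbc:db2://localhost/BANK",
    "JDBC_Driver": "COM.ibm.db2.jdbc.net.DB2Driver",
    "DBUserName": "db2admin",
    "DBPassword": "db2admin",
    "DBTable": "IDMMX.CLUSTERMODELS",
    "DBPrimaryKeyCol": "MODELNAME",
    "DBModelCol": "MODEL",
}
gaps = missing_params(params)   # DBModelKey has not been set yet
```

If any required parameter is missing, the applet cannot locate the model and fails with a connection or lookup error.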
Figure 11-2 Opening a model from the database system
After you use the GUI to set the parameters, the database profile for the model you want to visualize is stored in the database. Figure 11-3 shows the profile “Customer Profile Scoring” that was set up in Figure 11-2.
Figure 11-3 Loading the model in a secure manner from the database
11.3 Using IM Visualizers

You can start the IM Visualizers from a command line prompt, as shown in Example 11-1.
Example 11-1 Calling IM Visualizer via command line parameter settings

imvisualizer -db2url jdbc:db2://localhost/BANK -dbdriver COM.ibm.db2.jdbc.net.DB2Driver -dbuserid db2admin -dbpassword db2admin -dbtablename IDMMX.CLUSTERMODELS -dbprimarykeycol MODELNAME -dbdatacol MODEL "Demographical clustering of customer of an US bank"
Note: Before you can work with databases, you must copy the JDBC driver from the database that you want to use to the lib directory in the IM Visualizers installation path on your operating system. The DB2 driver db2java.zip resides in the subdirectory sqllib\java. If you want to use a different database, ask your system administrator where the JDBC driver of this database is located.
For example, if you installed the IM Visualizers on a Windows operating system by using the default installation path, you must copy the file db2java.zip to the directory C:\Program Files\IBM\IMVisualization\lib. You do not need to extract the file db2java.zip.
11.3.1 Using IM Visualizers as applets

To use the IM Visualization applet on the intranet, you must embed the visualizer that shows the model that you want to publish in an HTML document. Then send the URL of the HTML document to the team members who are supposed to view the model. To view the model, the team members must click the link in the HTML document. The visualizer is displayed on their screens showing the model that you provided. You can embed the IM Visualizers in an HTML document like a graphic. Or you can embed a push button in the HTML document that starts the IM Visualizers in a separate window, independently of the browser:
- If you embed the IM Visualizers in an HTML document, the visualizer behaves like a graphic or any other element that can be embedded in an HTML document.
- If you embed a push button in an HTML document, the browser is only used to launch the visualizer. The HTML document contains the push button. When
you click the Visualizer button, the visualizer is displayed in its own window as if you had started it as a stand-alone application. Example 11-2 shows the HTML code for launching the IM Visualizer for a cluster model when it is initiated via a start button.
Example 11-2 Clustering Visualizer launched with a start button

<param name="model" value="models/clustering.xml">
<param name="startButtonLabel" value="View clustering model...">
The graphical user interface of the IM Visualization applet is the same as if you installed the IM Visualizers on your own computer. After the IM Visualization applet is started, you can open, modify, save, or print models that are located in your local file system or in a database. To embed the visualizer as an applet in an HTML document:
1. Copy the entire lib/ directory with the JAR files of the visualizer to the Web server.
2. Include the appropriate syntax in the HTML document. This depends on whether you want to:
– Open an entire visualizer as an applet in an HTML document (embedded in the document or in its own window), as shown in Example 11-3.
Example 11-3 Syntax to open visualizer as applet in HTML document

<param name="PARAM NAME #1" value="PARAM VALUE #1">
...
Example 11-4 shows the HTML code to open an entire Classification Visualizer embedded in an HTML document with the save function disabled.
Example 11-4 Classification visualizer embedded in an HTML document

<param name="model" value="models/tree.xml">
<param name="embedded" value="true">
<param name="noSave" value="true">
<param name="noExport" value="true">
– Display a single visualizer view in an HTML document as shown in Example 11-5.
Example 11-5 Syntax to embed a single view as an applet in an HTML document

<param name="PARAM NAME #1" value="PARAM VALUE #1">
...
Example 11-6 shows the HTML code to open a single Visualizer view in an HTML document without a menu bar, in this case an association rules view.
Example 11-6 Association rules view embedded in an HTML document

<param name="model" value="models/assoc.xml">
<param name="view" value="rules">
See Appendix H, “Embedding an IM Visualization applet” on page 289, for an explanation of the parameter settings options for embedding the Intelligent Miner Visualization Applet in an HTML document (as graphic) or application (start via a push button). Also, refer to “Using the Intelligent Miner Visualization Applet” in DB2 Intelligent Miner Visualizing V8.1 Using Intelligent Miner Visualizers, SH12-6737, to learn about the parameters to embed the Intelligent Miner Visualization Applet, to open the complete IM Visualizers, and to open a particular view of the IM Visualizers.
11.3.2 Complete example script

In this example, the model to display is stored in a DB2 UDB database that is installed on the localhost. The name of the database is BANK. The table is named BANK and has two columns:
- KEY, which contains the primary keys
- MODEL, of type BLOB, which contains the models
The JDBC driver for the database is contained in the db2java.zip file. The user db2admin with the password db2admin has read access to this table. The key of the model to load is "Demographical clustering of customers for a U.S. bank". If we assume that the JAR files of the visualizer are in the subdirectory lib/, relative to the location of the HTML file, and that db2java.zip was copied into the same directory, the syntax for the applet corresponds to one of the following two scenarios. The first scenario launches the IM Visualizer as an applet embedded within an HTML browser. In this case, the script in Example 11-7 is run.
Example 11-7 Script to embed and open a Visualizer applet for a cluster model

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
IM-Visualization Applet - Mining View
Customer profiles - Graphical Clustering View embedded in HTML Browser
<param name="JDBC_URL" value="jdbc:db2://localhost/BANK">
<param name="JDBC_Driver" value="COM.ibm.db2.jdbc.net.DB2Driver">
<param name="DBUserName" value="db2admin">
<param name="DBPassword" value="db2admin">
<param name="DBTable" value="IDMMX.CLUSTERMODELS">
<param name="DBPrimaryKeyCol" value="MODELNAME">
<param name="DBModelCol" value="MODEL">
<param name="DBModelKey" value="Demographic clustering of customer base of an US bank">
<param name="view" value="graphical">
As a result, a graphical view of the bank customer base segmentation is displayed. Figure 11-4 shows the different customer profiles in this visualizer.
Figure 11-4 Invoking an IM Visualizer applet to be embedded in an HTML browser
The second scenario is to launch the IM Visualizer in its own window via the push button embedded in an HTML browser of your call center application, for example, using the script in Example 11-8.
Example 11-8 Script to launch IM Visualizer applet via push button in HTML browser

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
IM-Visualization Applet - Mining View
Customer profiles - Clustering View invoked by a push button from within an HTML Browser
<param name="JDBC_URL" value="jdbc:db2://localhost/BANK">
<param name="JDBC_Driver" value="COM.ibm.db2.jdbc.net.DB2Driver">
<param name="DBUserName" value="db2admin">
<param name="DBPassword" value="db2admin">
<param name="DBTable" value="IDMMX.CLUSTERMODELS">
<param name="DBPrimaryKeyCol" value="MODELNAME">
<param name="DBModelCol" value="MODEL">
<param name="DBModelKey" value="Demographic clustering of customer base of an US bank">
<param name="startButtonLabel" value="Press for customer profiles view.">
The result (in its simplest form) is displayed in Figure 11-5.
Figure 11-5 Invoking IM Visualizer applet via a push button in HTML browser
Providing the IM Visualizers as a Java applet on a Web server includes the following key advantages:
- Central point of installation
- Availability on any client that is connected to a local area network (LAN)
- Reduced administration time
- Potential integration in existing Web applications
11.4 Examples of IM Visualization

This section provides examples of using IM Visualization for cluster results from a clustering run, tree visualization of predicted classes from a classification run, and a visualization of associations.
Figure 11-6 shows an example of the result from a clustering run. It is obvious that a result like this needs to be presented to a very skilled analyst for further interpretation.
Figure 11-6 Cluster visualization
Figure 11-7 shows the result of a decision tree. Compared to the clustering result, it is easier to understand, but it is still too complex for a quick decision by an end user.
Figure 11-7 Tree visualization of predicted classes
Finally, Figure 11-8 shows the visualization of the rules resulting from an association run against a retail sales transactions database.
Figure 11-8 Visualization of associations
The result of a data mining process is used to make quick decisions. Apart from these visualizations, other visualization techniques should be applied. To make a decision based on certain thresholds, the traffic light technique is simple to use: a green traffic light shows that the value is in range, a yellow light indicates that the value needs attention, and a red light means that an action must be taken. Customized visualizers provided with DB2 Intelligent Miner for Data help you interpret data mining results, depict model quality, present the results of various statistical functions, and convey your findings to management and analysts. Programmable and end-user interfaces enable you to further customize the mining experience for various user communities. Where "expert" mining analysts may desire access to the full function of the product, a custom interface that selects predefined subsets of the function may be preferable for other business analysts.
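The traffic-light technique described above is easy to sketch: map a score to one of three signals using two thresholds. The threshold values here are invented for illustration:

```python
# Sketch of the traffic-light technique: map a score to green, yellow,
# or red using two thresholds. The threshold values are invented; in a
# real application they come from the business rules of the scenario.
def traffic_light(score, warn=0.5, alarm=0.8):
    if score >= alarm:
        return "red"      # an action must be taken
    if score >= warn:
        return "yellow"   # the value needs attention
    return "green"        # the value is in range

lights = [traffic_light(s) for s in (0.2, 0.6, 0.9)]
```

An end user then sees only the signal, not the underlying score, which is what makes the technique suitable for quick decisions.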
Part 4. Appendixes

This part contains the appendixes for this book, which offer additional information. It includes the following appendixes:
- Appendix A, "SQL script to configure database for data mining function" on page 247
- Appendix B, "SQL scripts for the customer profiling scenario" on page 249
- Appendix C, "SQL scripts for the fraud detection scenario" on page 255
- Appendix D, "SQL scripts for the retention campaign scenario" on page 269
- Appendix E, "SQL scripts for the up-to-date promotion scenario" on page 275
- Appendix F, "UDF to create data mining models" on page 281
- Appendix G, "UDF to extract rules from a model to a table" on page 285
- Appendix H, "Embedding an IM Visualization applet" on page 289
- Appendix I, "IM Scoring Java Bean code example" on page 293
- Appendix J, "Demographic clustering: Technical differences" on page 299
- Appendix K, "Additional material" on page 301
Appendix A. SQL script to configure database for data mining function

This appendix provides an SQL script to configure the database to be used by data mining functions. This is a Windows script with SQL statements that we use to configure a database.
Example: A-1 SQL script for configuring databases

set DBNAME=%1
db2 update dbm cfg using UDF_MEM_SZ 7500
db2 update db cfg for %DBNAME% using APPLHEAPSZ 2096
db2set DB2NTMEMSIZE=APLD:30000000
db2 UPDATE DB CFG FOR %DBNAME% USING LOGFILSIZ 10000
db2 UPDATE DB CFG FOR %DBNAME% USING LOGPRIMARY 10
db2 UPDATE DB CFG FOR %DBNAME% USING LOGSECOND 10
db2 UPDATE DB CFG FOR %DBNAME% USING STAT_HEAP_SZ 20000
REM Statistics heap size (4KB) (STAT_HEAP_SZ) = 4384
REM .. for RUNSTATS
db2 CATALOG ODBC DATA SOURCE %DBNAME%
@echo off
echo Once per System:
echo Create Bufferpool b32K SIZE 400 PageSize 32 K
echo Create tablespace ts32K PAGESIZE 32 K managed by database using
echo (file 'c:\DB2\32K.tmpspace' 4000 ) bufferpool b32K
echo Create user temporary tablespace myspace managed by database
echo using ( file 'c:\DB2\user.tmpspace' 4000 )
REM DECLARE GLOBAL TEMPORARY TABLE Tmp123 as (...)
REM definition only on commit preserve rows not logged with replace
REM requires at least one USER TEMPORARY TABLESPACE
@echo on
db2stop
db2start
db2 list database directory
Appendix B. SQL scripts for the customer profiling scenario

This appendix contains the scripts for the customer profiling case study as presented in Chapter 4, "Customer profiling example" on page 51. These scripts allow you to:
- Create and load the customer segment table for scoring
- Score new customers
These scripts are included as SQL statement files in the Additional Materials folder for this redbook. For more information, see Appendix K, “Additional material” on page 301.
Script to create and load the customer segment table

The script "Script to create and load the customer segm table for scoring.sql" in Example B-1 creates and loads the customer base table that will be used for customer segmentation and profiling.
Example: B-1 SQL script to create and load segment table ----- SQL scripts to the Customer profiling case study. ----- Redbook SG24-6879, chapter 4. ------- Create the BANK customers database ---- drop database BANK; create database BANK; ----- create the Bank customer base table --connect to BANK; create table BANK.ALL_CUSTOMERS( Client_id char(8), N_of_dependents char(1), Profession char(3), Time_as_Customer char(3), State char(2), Salary integer, Savings integer, Government_bonds integer, Monthly_checks_written integer, Gain_loss_funds integer, Age integer, Credit_card_limits integer, Insurance char(1), Stocks integer, Bank_funds integer, Checking_amount integer, Has_children char(1), Money_montly_overdrawn integer, Mortgage_amount integer, N_mortgages integer, N_trans_Call_center integer,
  N_automatic_payments integer,
  T_amount_autom_payments integer,
  N_trans_WEb_bank integer,
  N_trans_kiosk integer,
  N_trans_teller integer,
  Car_ownership char(1),
  House_ownership char(1),
  N_trans_ATM integer,
  Marital_status char(8),
  SEX char(1),
  AVG_debit_transact integer
) not logged initially;

-- Import the customers' data into the Bank customer base table.
import from c:\temp\customers_bank.csv of del modified by coldel, decpt.
  method P(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32)
  commitcount 1000
  insert into BANK.ALL_CUSTOMERS;
disconnect BANK;
Script to score new customers

The script Score.db2 in Example B-2 scores new customers of the bank by using the PMML-formatted cluster model with customer segmentations and the IM Scoring functions.
Example: B-2 SQL script Score.db2 for customer profile scoring ----- SQL scripts to the Customer profiling case study. ----- Redbook SG24-6879, chapter 4. ----------------------------------------------------------------------------------- Load model in file CustomerSegmentation.pmml into table IDMMX.CLUSTERMODELS --- using modelname Demographic clustering of customer base of an US bank. ----- Fields Modelname and Tablename get default values. --- You can change these values if necessary .
INSERT INTO IDMMX.CLUSTERMODELS VALUES
  ('Demographic clustering of customer base of an US bank',
   IDMMX.DM_impClusFile(
     'H:\sd_n602_IMScoring\SG24-6879-00\pgm\Customer_segmentation\CustomerSegmentation.pmml'));

-- Create the score results table: CustomerSegmentsTable
-- DROP TABLE CustomerSegmentsTable;
CREATE TABLE CustomerSegmentsTable (
  Customer_id CHAR(8),
  Cluster_id INTEGER,
  Score FLOAT
);

-- Start Scoring Services with REC2XML
-- Create the temporary result view: CustomerSegmentsView
-- DROP VIEW CustomerSegmentsView;
CREATE VIEW CustomerSegmentsView ( Customer_id, Result ) AS
SELECT data.Client_id,
       IDMMX.DM_applyClusModel(models.MODEL,
         IDMMX.DM_impApplData(
           REC2XML(1,'COLATTVAL','',
             data."CAR_OWNERSHIP", data."HAS_CHILDREN",
             data."HOUSE_OWNERSHIP", data."MARITAL_STATUS",
             data."PROFESSION", data."SEX", data."STATE",
             data."N_OF_DEPENDENTS", data."AGE", data."SALARY")))
FROM IDMMX.CLUSTERMODELS models, RECENT_NEW_CUSTOMERS data
WHERE models.MODELNAME =
  'Demographic clustering of customer base of an US bank';
-- Use the results view CustomerSegmentsView to score data and to
-- write scoring results into a result table CustomerSegmentsTable.
INSERT INTO CustomerSegmentsTable
SELECT Customer_id,
       IDMMX.DM_getClusterID( Result ),
       IDMMX.DM_getClusScore( Result )
FROM CustomerSegmentsView;
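The scoring view in Example B-2 relies on DB2's REC2XML to wrap each record's columns in XML before handing the record to DM_impApplData. A rough pure-Python approximation of the COLATTVAL shape (illustrative only; the exact tags and escaping DB2 emits may differ):

```python
from xml.sax.saxutils import escape

# Rough stand-in for REC2XML(..., 'COLATTVAL', ...): wrap each column of a
# record in a <column name="..."> element. DB2 handles escaping and typing
# itself; this only illustrates the shape of the apply input.
def rec2xml_colattval(row):
    cols = "".join(
        '<column name="%s">%s</column>' % (name, escape(str(value)))
        for name, value in row.items()
    )
    return "<row>%s</row>" % cols

xml = rec2xml_colattval({"SEX": "F", "AGE": 42})
```

Seeing the generated XML makes it clear why the column list in the view must match the fields the cluster model was trained on: the model locates its inputs by column name.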
Appendix C. SQL scripts for the fraud detection scenario

This appendix provides the SQL scripts used in the fraud detection scenario in Chapter 5, "Fraud detection example" on page 75. These scripts allow you to:
- Prepare the data
- Build the data mining model
- Score the data
- Exploit the scoring results
Script to prepare the data

Example C-1 shows the SQL script used to prepare the data.
Example: C-1 SQL script for data preparation ------------------------------------------------------------------------ This SQL-script does the complete data preparation in DB2 -- for the demo "Detection of Fraud with Premium Phone Numbers". -- We assume that the flat file cdrs.dat is available. ---------------------------------------------------------------------------------------------------------------------------------------------- Create the database: ----------------------------------------------------------------------attach to db2inst1 user db2oinst1 using db2admin; connect to cdr user db2inst1 using db2admin; ------------------------------------------------------------------------ Build the cdr table and load the cdr's from flat file: -----------------------------------------------------------------------
-- Construct an intermediate table containing information about
-- the phone numbers (sources):
drop table b_phone_no;
create table b_phone_no (
  phone_no char(12),
  phone_dur decimal(10),
  phone_vardur decimal(10,2),
  phone_avgdur decimal(10,2),
  phone_maxdur decimal(10),
  phone_calls integer,
  phone_costs decimal(15,2),
  phone_targets integer,
  phone_mdur decimal(10,2),
  phone_topdur1 decimal(10),
  phone_rtopdur1 decimal(10,2)
);

insert into b_phone_no
( phone_no, phone_dur, phone_vardur, phone_avgdur,
  phone_maxdur, phone_calls, phone_costs )
select phone_no, sum(duration), stddev(duration)/avg(duration),
       avg(duration), max(duration), count(*), sum(costs)
from cdr
group by phone_no
having sum(duration) >= 1800;

drop index b_phone_no_i;
create index b_phone_no_i on b_phone_no ( phone_no asc );

drop table hphone;
create table hphone (
  phone_no char(12),
  premium_no char(10),
  sum_duration decimal(10)
);

insert into hphone
select cdr.phone_no, cdr.premium_no, sum(cdr.duration)
from cdr, b_phone_no
where cdr.phone_no = b_phone_no.phone_no
group by cdr.phone_no, cdr.premium_no;

drop index hphone_phone_no_i;
create index hphone_phone_no_i on hphone ( phone_no asc );

update b_phone_no
set ( phone_targets, phone_mdur, phone_topdur1, phone_rtopdur1 ) =
  ( select count(hphone.premium_no),
           b_phone_no.phone_dur/count(hphone.premium_no),
           max(hphone.sum_duration),
           max(hphone.sum_duration)/b_phone_no.phone_dur
    from hphone
    where b_phone_no.phone_no = hphone.phone_no
Appendix C. SQL scripts for the fraud detection scenario
257
group by hphone.phone_no ); ------------------------------------------------------------------------ Construct an intermediate table containing information about -- the connections between a phone and a service number: ----------------------------------------------------------------------drop table b_con; create table b_con ( phone_no char(12), premium_no char(10), con_dur decimal(10), con_vardur decimal(10,2), con_avgdur decimal(10,2), con_maxdur decimal(10), con_calls integer, con_costs decimal(15,2) ); insert into b_con ( phone_no, premium_no, con_dur, con_vardur, con_avgdur, con_maxdur, con_calls, con_costs ) select phone_no, premium_no, sum(duration), stddev(duration)/avg(duration), avg(duration), max(duration), count(*), sum(costs) from cdr group by phone_no, premium_no having sum(duration) >= 1800; drop index b_con_phone_no_i; create index b_con_phone_no_i on b_con ( phone_no asc ); drop index b_con_premium_no_i; create index b_con_premium_no_i on b_con ( premium_no asc );
------------------------------------------------------------------------
-- Construct an intermediate table containing information about
-- the premium numbers:
------------------------------------------------------------------------
drop table b_premium_no;
create table b_premium_no
( premium_no       char(10),
  premium_dur      decimal(10),
  premium_vardur   decimal(10,2),
  premium_avgdur   decimal(10,2),
  premium_maxdur   decimal(10),
  premium_calls    integer,
  premium_costs    decimal(15,2),
  premium_sources  integer,
  premium_mdur     decimal(10,2),
  premium_topdur1  decimal(10),
  premium_rtopdur1 decimal(10,2)
);

insert into b_premium_no
( premium_no, premium_dur, premium_vardur, premium_avgdur,
  premium_maxdur, premium_calls, premium_costs )
select premium_no, sum(duration), stddev(duration)/avg(duration),
       avg(duration), max(duration), count(*), sum(costs)
from cdr
group by premium_no
having sum(duration) >= 3600;

drop index b_premium_no_i;
create index b_premium_no_i on b_premium_no ( premium_no asc );

drop table hpremium;
create table hpremium (
  premium_no   char(10),
  phone_no     char(12),
  sum_duration decimal(10)
);

insert into hpremium
select cdr.premium_no, cdr.phone_no, sum(cdr.duration)
from cdr, b_premium_no
where cdr.premium_no = b_premium_no.premium_no
group by cdr.premium_no, cdr.phone_no;

drop index hpremium_pno_i;
create index hpremium_pno_i on hpremium ( premium_no asc );

update b_premium_no
set ( premium_sources, premium_mdur, premium_topdur1, premium_rtopdur1 ) =
  ( select count(hpremium.phone_no),
           b_premium_no.premium_dur/count(hpremium.phone_no),
           max(hpremium.sum_duration),
           max(hpremium.sum_duration)/b_premium_no.premium_dur
    from hpremium
    where b_premium_no.premium_no = hpremium.premium_no
    group by hpremium.premium_no );

------------------------------------------------------------------------
-- Construct the (final) analysis table which is the basis for the
-- fraud segmentation with Intelligent Miner:
------------------------------------------------------------------------
drop table connections;
create table connections
( phone_dur           decimal(10),
  phone_vardur        decimal(10,2),
  phone_avgdur        decimal(10,2),
  phone_maxdur        decimal(10),
  phone_calls         integer,
  phone_costs         decimal(15,2),
  phone_targets       integer,
  phone_mdur          decimal(10,2),
  phone_topdur1       decimal(10),
  phone_rtopdur1      decimal(10,2),
  phone_no            char(12),
  premium_no          char(10),
  con_dur             decimal(10),
  con_vardur          decimal(10,2),
  con_avgdur          decimal(10,2),
  con_maxdur          decimal(10),
  con_calls           integer,
  con_costs           decimal(15,2),
--con_antropdauer     decimal(10,2),
--con_antursdauer     decimal(10,2),
  con_lift_pre_mdur   decimal(10,2),
--con_lift_pho_mdur   decimal(10,2),
  con_phone_rate      decimal(10,2),
  premium_dur         decimal(10),
  premium_vardur      decimal(10,2),
  premium_avgdur      decimal(10,2),
  premium_maxdur      decimal(10),
  premium_calls       integer,
  premium_costs       decimal(15,2),
  premium_sources     integer,
  premium_mdur        decimal(10,2),
  premium_topdur1     decimal(10),
  premium_rtopdur1    decimal(10,2)
);
insert into connections
select b_phone_no.phone_dur, b_phone_no.phone_vardur,
       b_phone_no.phone_avgdur, b_phone_no.phone_maxdur,
       b_phone_no.phone_calls, b_phone_no.phone_costs,
       b_phone_no.phone_targets, b_phone_no.phone_mdur,
       b_phone_no.phone_topdur1, b_phone_no.phone_rtopdur1,
       b_con.phone_no, b_con.premium_no,
       b_con.con_dur, b_con.con_vardur, b_con.con_avgdur,
       b_con.con_maxdur, b_con.con_calls, b_con.con_costs,
--     b_con.con_dur/b_premium_no.premium_dur,
--     b_con.con_dur/b_phone_no.phone_dur,
       b_con.con_dur/b_premium_no.premium_mdur,
--     b_con.con_dur/b_phone_no.phone_mdur,
       1.0/b_premium_no.premium_sources,
       b_premium_no.premium_dur, b_premium_no.premium_vardur,
       b_premium_no.premium_avgdur, b_premium_no.premium_maxdur,
       b_premium_no.premium_calls, b_premium_no.premium_costs,
       b_premium_no.premium_sources, b_premium_no.premium_mdur,
       b_premium_no.premium_topdur1, b_premium_no.premium_rtopdur1
from b_phone_no, b_con, b_premium_no
where b_con.phone_no = b_phone_no.phone_no
  and b_con.premium_no = b_premium_no.premium_no;

create view connection_table
( sum_dur, no_calls, rel_dur, sum_cost, max_dur,
  var_dur, no_clrs, caller_id, premium_id )
as select con_dur, con_calls, con_lift_pre_mdur, con_costs, con_maxdur,
          con_vardur, premium_sources, phone_no, premium_no
from connections;

terminate;
Script to build the data mining model

Example C-2 shows the SQL script used to build the data mining model.
Example: C-2 SQL script defining data, model parameters, task, and modeling
------------------------------------------------------------------------
-- Purpose: Define Logical Data Settings
-- When:    set up ONCE
------------------------------------------------------------------------
connect to premiums;

delete from IDMMX.MiningData where id = 'Connection';
insert into IDMMX.MiningData values
( 'Connection',
  IDMMX.DM_MiningData()..DM_defMiningData('CONNECTION_TABLE')..
  DM_SetColumns(' '));

------------------------------------------------------------------------
-- Purpose: Define the Cluster Model Settings
-- When:    set up ONCE
------------------------------------------------------------------------
delete from IDMMX.ClusSettings where id='Connection_Segmentation';
insert into IDMMX.ClusSettings
select 'Connection_Segmentation',
       IDMMX.DM_ClusSettings()..
       DM_useClusDataSpec(MININGDATA..DM_genDataSpec())..
       DM_setMaxNumClus(30)..
       DM_setFldUsageType('CALLER_ID',2)..
       DM_setFldUsageType('PREMIUM_ID',2)
from IDMMX.MiningData
where ID='Connection';

------------------------------------------------------------------------
-- Purpose: Create the Clustering Task
-- When:    set up ONCE
------------------------------------------------------------------------
delete from IDMMX.ClusTasks where id='Connection_Segmentation_Task';
insert into IDMMX.ClusTasks
select 'Connection_Segmentation_Task',
       IDMMX.DM_clusBldTask()..DM_defClusBldTask(d.miningdata, s.settings)
from IDMMX.MiningData D, IDMMX.ClusSettings S
where d.id='Connection'
  and s.id='Connection_Segmentation';

------------------------------------------------------------------------
-- Purpose: Call the Stored Procedure to run the Clustering Task.
-- When:    to be put in a DB2 script in a batch job
------------------------------------------------------------------------
call IDMMX.DM_BuildClusModelCmd('IDMMX.CLUSTASKS','TASK','ID',
                                'Connection_Segmentation_Task',
                                'IDMMX.CLUSTERMODELS','MODEL','MODELNAME',
                                'ConnectionSegmentationModel');
Script to score the data

Example C-3 shows the SQL script generated and updated for the business case in Chapter 5, “Fraud detection example” on page 75.
Example: C-3 SQL script generated by IDMMKSQL and modified for the example
--------------------------------------------------------------------------------
--- Purpose:
--- Load model in file clustermodel.pmml into table IDMMX.CLUSTERMODELS,
--- using modelname ConnectionSegmentationModel.
--- Modelname and tablename get default values.
--- You can change these values if necessary.
--- When: set up ONCE
--------------------------------------------------------------------------------
INSERT INTO ###IDMMX.CLUSTERMODELS### VALUES
('ConnectionSegmentationModel',
 IDMMX.DM_impClusFile('###ABSOLUTE_PATH###clustermodel.pmml'));
--------------------------------------------------------------------------------
--- Purpose: Create the table ALLOCATED_CLUSTERS
--- When: set up ONCE
--------------------------------------------------------------------------------
DROP TABLE ALLOCATED_CLUSTERS;
CREATE TABLE ALLOCATED_CLUSTERS(
  premium_id char(12),
  caller_id  char(12),
  Clus_id    INTEGER,
  Score      FLOAT
);

--------------------------------------------------------------------------------
--- Purpose:
--- Start Scoring Services with REC2XML.
--- Create temporary view Resultview.
---
--- When:
--- View created once, used every batch run
--------------------------------------------------------------------------------
DROP VIEW SCORING_ENGINE;
CREATE VIEW SCORING_ENGINE( premium_id, caller_id, Result )
AS SELECT data.premium_id, data.caller_id,
   IDMMX.DM_applyClusModel(models.model,
     IDMMX.DM_impApplData(
       REC2XML(1,'COLATTVAL','',
               data."NO_CALLS", data."NO_CLRS", data."SUM_DUR",
               data."REL_DUR", data."SUM_COST", data."MAX_DUR",
               data."VAR_DUR")))
FROM IDMMX.CLUSTERMODELS models, connection_table data
WHERE models.MODELNAME = 'ConnectionSegmentationModel';
--------------------------------------------------------------------------------
--- Purpose:
--- Use view Resultview to score data and
--- write the results into table Resulttable
---
--- When:
--- Used in every batch run
--------------------------------------------------------------------------------
INSERT INTO ALLOCATED_CLUSTERS
SELECT PREMIUM_ID, CALLER_ID,
       IDMMX.DM_getClusterID( Result ),
       IDMMX.DM_getClusScore( Result )
FROM SCORING_ENGINE;
Script to get the scoring results

Example C-4 shows the SQL script used to select the customers in the five smallest clusters and to build a view.
Example: C-4 Script to generate a list of customers from the smallest five clusters
--------------------------------------------------------------------------------
--- Purpose:
--- Set up a view to generate a list of connections in
--- the 5 smallest clusters, i.e. outliers.
---
--- When: set up once, used many times in Business Objects
--------------------------------------------------------------------------------
create view risky as
select scored.clus_id, attr.caller_id, attr.premium_id,
       attr.sum_dur, attr.no_calls, attr.rel_dur, attr.sum_cost,
       attr.max_dur, attr.var_dur, attr.no_clrs
from ALLOCATED_CLUSTERS scored, connection_table attr
where scored.premium_id = attr.premium_id
  and scored.caller_id = attr.caller_id
  and scored.clus_id in
      ( select clus_id
        from ( select clus_id, count(*),
                      rank() over (order by count(*)) as top_n
               from ALLOCATED_CLUSTERS
               group by clus_id ) as temp
        where top_n <= 5 );

drop function local.mSchema;
create function local.mSchema( fqtablename varchar(100) )
  RETURNS varchar(100)
  LANGUAGE SQL not DETERMINISTIC NO EXTERNAL ACTION
  RETURN case when posstr(fqtablename,'.') > 0
              then substr(fqtablename,1,posstr(fqtablename,'.')-1)
              else USER
         end;

drop function local.mTab;
create function local.mTab( fqtablename varchar(100) )
  RETURNS varchar(100)
  LANGUAGE SQL not DETERMINISTIC NO EXTERNAL ACTION
  RETURN case when posstr(fqtablename,'.') > 0
              then substr(fqtablename,posstr(fqtablename,'.')+1)
              else fqtablename
         end;

drop function local.mcols;
create function local.mcols( fqtablename varchar(40) )
  RETURNS varchar(1000)
  LANGUAGE SQL not DETERMINISTIC READS SQL DATA NO EXTERNAL ACTION
  RETURN select cast( substr(xml2clob( xmlagg( xmlelement(name "Column",
                   xmlattributes(colname as "name", typename as "sqlType")))
                 ),1,460) as varchar(460) )
         from syscat.columns
         where tabname = local.mTab(fqtablename)
           and tabschema = local.mSchema(fqtablename)
           and keyseq is null
           and colcard

<param name="PARAM NAME #2" value="PARAM VALUE #2">
...
Parameters to use

You can specify values for the parameters specified in Table H-1.

Table H-1 Parameters to embed the IM Visualization applet

code
  Value: com.ibm.iminer.visualizer.applet.IDMVisualizerApplet
    To embed the complete visualizer.
  Value: com.ibm.iminer.visualizer.applet.IDMMiningViewApplet
    To embed a particular view only.

codebase
  Value: "."
    The visualizer libraries are located in the same directory as the
    HTML document.
  Value: "../lib"
    The visualizer libraries are located in a directory that is higher
    than the directory where the HTML document is stored.

archive
  Value: "imvisu.jar"
    Always specify this value. This means that the main JAR file that
    the browser must load is imvisu.jar. Do not change this parameter.

width, height
  The parameters width and height define the size of the Intelligent
  Miner Visualization applet on the HTML page in pixels. (1)

param name
  <param name="PARAM NAME" value="PARAM VALUE">
  With this coding between the start tag and the end tag of an applet,
  you can define parameters to modify the behavior of the visualizer
  applet. (2)
Notes:

1. If you want to embed a push button that launches the visualizer in its own window, these values define the size of the push button on the HTML page. If you want to embed the complete visualizer or a particular view of a visualizer, these values define the size of the visualizer or the view on the HTML page.

2. For example, you may want to use the parameter model to define the URL where the model that you want to display is located. You can use absolute or relative URLs. The file that contains the model that you want to display may be called model.xml. It might be located on the Web server in the directory models/. The syntax looks like this:

   <param name="model" value="models/model.xml">

   If you omit the parameter model, a File Open dialog is displayed when you launch the visualizer.
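Putting the parameters from Table H-1 together, an embedding of the complete visualizer might look like the following sketch. Only the class name, the archive name imvisu.jar, and the param syntax come from the table above; the codebase value, the pixel sizes, and the model URL are illustrative assumptions.

```html
<!-- Hypothetical embedding; codebase, width, height, and the model URL
     are example values only. -->
<applet code="com.ibm.iminer.visualizer.applet.IDMVisualizerApplet"
        codebase="../lib"
        archive="imvisu.jar"
        width="600" height="400">
  <param name="model" value="models/model.xml">
</applet>
```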
Appendix H. Embedding an IM Visualization applet
Appendix I. IM Scoring Java Bean code example

This appendix provides the Java code for running an IM Scoring Java Bean. It is referenced in 8.5, “Integration using Java” on page 185.
© Copyright IBM Corp. 2002. All rights reserved.
Source code of IM Scoring Java Bean

Example I-1 provides the Java code for running an IM Scoring Java Bean that scores a new customer with a specified customer ID. The customer is assigned to one of the clusters defined in a clustering model that has segmented the customer base of, in this case, a bank. The file is named CustomerScore.java.
Example: I-1 Source code of IM Scoring Java Bean: CustomerScore.java file
// Source File name CustomerScore.java
//
// Licensed Materials -- Property of IBM
//
// (c) Copyright International Business Machines Corporation, 1996, 1997.
// All Rights Reserved.
//
// US Government Users Restricted Rights
// Use, duplication or disclosure restricted by
// GSA ADP Schedule Contract with IBM Corp.
//
// Organization: IBM Australia
// Creation Date: 25/09/2002
//
// Pre-requisites: JDK1.3.1
//   Add scoring bean to CLASSPATH. See file PATHS.BAT
//   Use an existing model in PMML 2.0 format, and call it
//   via the method setModelFile()
//   And use a database table to score.
//
// This sample java code takes a customer ID as an input.
// It then retrieves the customer record using JDBC.
// It loads the ResultSet into a record.
// It uses the scoring bean class RecordScorer to load
// a "selected Model" and score the record.
// It displays the result of the score.
//
import com.ibm.iminer.scoring.*;
import java.sql.*;
import java.util.*;
import java.io.File;
public class customerScore {
  // Instantiates the JDBC driver.
  //
  static {
    try {
      Class.forName("COM.ibm.db2.jdbc.app.DB2Driver").newInstance();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  private RecordScorer scorer;

  public customerScore(String modelFile) {
    initModel(modelFile);
  }

  // Loads the model from a PMML file.
  // Must catch ModelException. Concrete situations where the
  // exception is thrown:
  // - The content of the specified mining model has a wrong format
  //   or is corrupted.
  // - The specified mining model cannot be found.
  //
  public void initModel(String modelFile) {
    try {
      if (scorer == null) {
        System.out.println("Initializing : " + modelFile);
        scorer = new RecordScorer(modelFile);
      } else
        scorer.setModelFile(modelFile);
    } catch (ModelException e) {
      e.printStackTrace();
    }
    System.out.println(modelFile + " Model initialised !");
  }
  private void displayClusteringResult(Map record, String description,
                                       int clusterID) {
    System.out.println("\n Segmentation result:");
    Map.Entry entry;
    String columnName;
    String value;
    for (Iterator i = record.entrySet().iterator(); i.hasNext();) {
      entry = (Map.Entry) i.next();
      columnName = (String) entry.getKey();
      value = entry.getValue().toString();
      System.out.println(columnName + " : " + value);
    }
    System.out.println("++++++++++++++++++++++++++++++++++++++++");
    System.out.println(" Customer belongs to segment: " + clusterID);
  }

  public void doScoring(Map record) {
    if (scorer != null) {
      try {
        System.out.println("Scoring customer !");
        scorer.score(record);
        if (scorer.getModelType() == scorer.CLUSTERING_TYPE) {
          int clusterID = scorer.getClusterID();
          displayClusteringResult(record,
              "Customer belongs to segment", clusterID);
        }
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
  }

  // Associate values to the mining fields:
  // The record.put method pairs a column from the model with the
  // value of the same column from the ResultSet.
  //
  public HashMap loadInputColumns(String clientID, ResultSet customerRecord) {
    HashMap record = new HashMap();
    if (scorer != null) {
      // Get the mining fields which are used by the model.
      String[] columns = scorer.getFieldNames();
      if (columns != null) {
        try {
          customerRecord.next();
          for (int i = 0; i < columns.length; i++) {
            record.put(columns[i], customerRecord.getString(columns[i]));
          }
        } catch (SQLException sqle) {
          System.out.println(sqle.toString());
        }
      }
    }
    return record;
  }

  public static void main(String argv[]) {
    File file = new File("customerSegmentation.pmml");
    customerScore cluster = new customerScore(file.getAbsolutePath());
    String clientID = argv[0];
    String url = "jdbc:db2:bank";
    System.out.println("running main!");
    try {
      Connection con = DriverManager.getConnection(url);
      String query = "select * from bank.all_customers where client_id = ?";
      PreparedStatement getRecord = con.prepareStatement(query);
      getRecord.setString(1, clientID);
      ResultSet rs = getRecord.executeQuery();
      cluster.doScoring(cluster.loadInputColumns(clientID, rs));
      rs.close();
      getRecord.close();
    } catch (SQLException sqle) {
      System.out.println(sqle.toString());
    }
  }
}
Setting up the environment variables: The paths.bat file

After you install the IM Scoring Java Beans, you must set up environment variables before you can use them from your Java applications. For a Java application to invoke RecordScorer, you must run the script shown in Example I-2 on a Windows system.
Example: I-2 The paths.bat file
rem -----------------------------
rem Set PATH and CLASSPATH.
rem -----------------------------
SET PATH=%PATH%;/bin
SET CLASSPATH=%CLASSPATH%;/java/xerces.jar
SET CLASSPATH=%CLASSPATH%;/java/idmscore.jar
Appendix J. Demographic clustering: Technical differences

This appendix discusses the technical differences between DB2 Intelligent Miner for Data on one side and IM Modeling and IM Scoring on the other with respect to demographic clustering. Beyond the differences in how they fit into the generic data mining method and in their intended audiences, there are also slight differences in functionality.

Here we describe the differences that matter most to those of you who will build the mining models. The options of IM Modeling and IM Scoring differ from the parameters of the application mode of demographic clustering in the DB2 Intelligent Miner for Data workbench, and this small gap between the mining tools entails some functional differences. The parameters for applying data against a demographic clustering model are defined in the graphical user interface (GUI) settings of DB2 Intelligent Miner for Data. In IM Scoring, however, there are no such settings: the parameters are read from the PMML model that is derived from the model in the DB2 Intelligent Miner for Data format.

Table J-1 lists the options that are available in the application mode settings of the DB2 Intelligent Miner for Data, in the DB2 Intelligent Miner for Data model, in
the PMML model with their new names, and in IM Scoring for demographic clustering.

Table J-1 Demographic clustering: Application mode settings options
(columns: DB2 Intelligent Miner for Data / DB2 Intelligent Miner for Data model / PMML model / IM Scoring / Note)

- similarity threshold / - / - / always 0.5 / (1)
- active/suppl. fields / OK / usageType / OK
- field weight / OK / fieldWeight / OK
- value weighting / OK / x-weightingType (extension) / OK
- distance units / after V6.1 / similarityScale / OK / (2)
- outlier treatment (one per model, four different values) / after V6.1.1 (one per model, four different values) / outlier (one per field, only three values, no increasing buckets) / OK (one per model, only three values, no increasing buckets) / (3) (4)
- similarity definitions / after V6.1.1 / compareFunction = “table” and Matrix / OK / (3)
- output fields / N/A / N/A / OK, return values of the scoring functions

Notes:

1. The scoring results of DB2 Intelligent Miner for Data and IM Scoring may be different if you did not install DB2 Intelligent Miner for Data Version 6.1.1.

2. If you do not have DB2 Intelligent Miner for Data Version 6.1, distance units are written into the PMML model as half a standard deviation of the field.

3. If you do not have DB2 Intelligent Miner for Data Version 6.1.1, the outlier treatment is missing and no similarity definitions are written into the PMML model.

4. For application (scoring), the outlier treatment methods “Place outliers into a lower and upper bucket” and “Create lower and upper buckets until record is accommodated” are identical. They are mapped onto the PMML method in IM Scoring.
Appendix K. Additional material

This redbook refers to additional material that can be downloaded from the Internet as described below.
Locating the Web material

The Web material associated with this redbook is available in softcopy on the Internet from the IBM Redbooks Web server. Point your Web browser to:

ftp://www.redbooks.ibm.com/redbooks/SG246879

Alternatively, you can go to the IBM Redbooks Web site at:

ibm.com/redbooks

Select Additional materials and open the directory that corresponds with the redbook form number, SG24-6879.
Using the Web material

The additional Web material that accompanies this redbook includes the following zip file, which contains the scripts used by each business scenario:

File name:    6879_add_material.zip
Description:  Zipped script samples
System requirements for downloading the Web material

The following system configuration is recommended:

Hard disk space:   1.52 MB
Operating system:  Windows
How to use the Web material

Create a subdirectory (folder) on your workstation, and unzip the contents of the Web material zip file into this folder.
Glossary

This glossary defines terms as they are used in this book. If you do not find the term you are looking for, refer to the IBM Dictionary of Computing.

API  See application programming interface.

application programming interface (API)  When a software system features an API, it provides a means by which programs written outside of the system can interface with the system to perform additional functions. For example, a data mining software system may have an API that permits user-written programs to perform such tasks as extract data, perform additional statistical analysis, create specialized charts, generate a model, or make a prediction from a model.

associations  The relationship of items in a transaction in such a way that items imply the presence of other items in the same transaction. An association algorithm creates rules that describe how often events have occurred together. Consider this example: “When prospectors buy picks, they also buy shovels 14% of the time.” Such relationships are typically expressed with a confidence interval.

classification  The assignment of objects into groups or categories based on their characteristics. Refers to the data mining problem of attempting to predict the category of categorical data by building a model based on some predictor variables.

cluster  A group of records with similar characteristics.
confidence  Indicates the strength or the reliability of the associations detected. Confidence of rule “B given A” is a measure of how much more likely it is that B occurs when A has occurred. It is expressed as a percentage, with 100% meaning B always occurs if A has occurred. Statisticians refer to this as the conditional probability of B given A. A 95% confidence interval for the mean has a probability of .95 of covering the true value of the mean.

data mining  The process of discovering valuable, previously unknown information hidden in large amounts of data. The data is analyzed without any preconceived expectation of the results. Data mining delivers knowledge that provides a deeper insight into the data. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.

decision tree  A tree-like way of representing a collection of hierarchical rules that lead to a class or value.

field  Also called variable or attribute. A set of one or more related data items grouped for processing. In this document, with regard to database tables and views, field is synonymous with column in a database table.

model  An important function of data mining is the production of a model. A model can be descriptive or predictive. A descriptive model helps in understanding underlying processes or behavior. For example, an association model describes consumer behavior. A predictive model is an equation or set of rules that makes it possible to predict an unseen or unmeasured value from other, known values.
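The confidence of a rule can be computed directly from a set of transactions. The following small Java sketch (the transaction data and class are invented for illustration, not part of the IM products) estimates the confidence of the rule “shovel given pick” as count(pick and shovel) / count(pick):

```java
import java.util.List;
import java.util.Set;

public class ConfidenceDemo {
    // Confidence of rule "b given a": among transactions containing a,
    // the fraction that also contain b.
    public static double confidence(List<Set<String>> transactions,
                                    String a, String b) {
        int withA = 0, withBoth = 0;
        for (Set<String> t : transactions) {
            if (t.contains(a)) {
                withA++;
                if (t.contains(b)) withBoth++;
            }
        }
        return withA == 0 ? 0.0 : (double) withBoth / withA;
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
            Set.of("pick", "shovel"),
            Set.of("pick"),
            Set.of("shovel", "pan"),
            Set.of("pick", "shovel", "pan"));
        // Three transactions contain "pick"; two of them also contain
        // "shovel", so the confidence is 2/3.
        System.out.printf("confidence(shovel given pick) = %.2f%n",
                          confidence(transactions, "pick", "shovel"));
    }
}
```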
OLAP  See Online Analytical Processing.

Online Analytical Processing (OLAP)  Tools that give the user the capability to perform multi-dimensional analysis of the data.
prediction  The dependency and the variation of one field’s value within a record on the other fields within the same record. A profile is then generated that can predict a value for the particular field in a new record of the same form, based on its other field values.

rule body  Represents the specified input data for a mining function.

rule head  Represents the derived items detected by the Associations mining function.

rule  A clause in the form head