HANDBOOKS IN INFORMATION SYSTEMS VOLUME 3
Handbooks in Information Systems

Editor
Andrew B. Whinston

Advisory Editors
Ba, Sulin (University of Connecticut)
Duan, Wenjing (The George Washington University)
Geng, Xianjun (University of Washington)
Gupta, Alok (University of Minnesota)
Hendershott, Terry (University of California at Berkeley)
Rao, H.R. (SUNY at Buffalo)
Santanam, Raghu T. (Arizona State University)
Zhang, Han (Georgia Institute of Technology)

Volume 3
Business Computing
Edited by
Gediminas Adomavicius, University of Minnesota
Alok Gupta, University of Minnesota
Emerald Group Publishing Limited
Howard House, Wagon Lane, Bingley BD16 1WA, UK

First edition 2009

Copyright © 2009 Emerald Group Publishing Limited

Reprints and permission service
Contact: [email protected]

No part of this book may be reproduced, stored in a retrieval system, transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without either the prior written permission of the publisher or a licence permitting restricted copying issued in the UK by The Copyright Licensing Agency and in the USA by The Copyright Clearance Center. No responsibility is accepted for the accuracy of information contained in the text, illustrations or advertisements. The opinions expressed in these chapters are not necessarily those of the Editor or the publisher.

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-1-84855-264-7
ISSN: 1574-0145
Awarded in recognition of Emerald’s production department’s adherence to quality systems and processes when preparing scholarly journals for print
Contents

Preface
Introduction

Part I: Enhancing and Managing Customer Value

CHAPTER 1 Personalization: The State of the Art and Future Directions
Alexander Tuzhilin
1. Introduction
2. Definition of personalization
3. Types of personalization
   3.1. Provider- vs. consumer- vs. market-centric personalization
   3.2. Types of personalized offerings
   3.3. Individual vs. segment-based personalization
   3.4. Smart vs. trivial personalization
   3.5. Intrusive vs. non-intrusive personalization
   3.6. Static vs. dynamic personalization
4. When does it pay to personalize?
5. Personalization process
6. Integrating the personalization process
7. Future research directions in personalization
Acknowledgments
References

CHAPTER 2 Web Mining for Business Computing
Prasanna Desikan, Colin DeLong, Sandeep Mane, Kalyan Beemanapalli, Kuo-Wei Hsu, Prasad Sriram, Jaideep Srivastava, Woong-Kee Loh and Vamsee Venuturumilli
1. Introduction
2. Web mining
   2.1. Data-centric Web mining taxonomy
   2.2. Web mining techniques—state-of-the-art
3. How Web mining can enhance major business functions
   3.1. Sales
   3.2. Purchasing
   3.3. Operations
4. Gaps in existing technology
   4.1. Lack of data preparation for Web mining
   4.2. Under-utilization of domain knowledge repositories
   4.3. Under-utilization of Web log data
5. Looking ahead: The future of Web mining in business
   5.1. Microformats
   5.2. Mining and incorporating sentiments
   5.3. e-CRM to p-CRM
   5.4. Other directions
6. Conclusion
Acknowledgments
References

CHAPTER 3 Current Issues in Keyword Auctions
De Liu, Jianqing Chen and Andrew B. Whinston
1. Introduction
2. A historical look at keyword auctions
   2.1. Early Internet advertising contracts
   2.2. Keyword auctions by GoTo.com
   2.3. Subsequent innovations by Google
   2.4. Beyond search engine advertising
3. Models of keyword auctions
   3.1. Generalized first-price auction
   3.2. Generalized second-price auction
   3.3. Weighted unit-price auction
4. How to rank advertisers
5. How to package resources
   5.1. The revenue-maximizing share structure problem
   5.2. Results on revenue-maximizing share structures
   5.3. Other issues on resource packaging
6. Click fraud
   6.1. Detection
   6.2. Prevention
7. Concluding remarks
References

CHAPTER 4 Web Clickstream Data and Pattern Discovery: A Framework and Applications
Balaji Padmanabhan
1. Background
2. Web clickstream data and pattern discovery
3. A framework for pattern discovery
   3.1. Representation
   3.2. Evaluation
   3.3. Search
   3.4. Discussion and examples
4. Online segmentation from clickstream data
5. Other applications
6. Conclusion
References

CHAPTER 5 Customer Delay in E-Commerce Sites: Design and Strategic Implications
Deborah Barnes and Vijay Mookerjee
1. E-commerce environment and consumer behavior
   1.1. E-commerce environment
   1.2. Demand generation and consumer behaviors
   1.3. System processing technique
2. The long-term capacity planning problem
   2.1. Allocating spending between advertising and information technology in electronic retailing
3. The short-term capacity allocation problem
   3.1. Optimal processing policies for an e-commerce web server
   3.2. Environmental assumptions
   3.3. Priority processing scheme
   3.4. Profit-focused policy
   3.5. Quality of service (QoS) focused policy
   3.6. Practical implications
4. The effects of competition
   4.1. A multiperiod approach to competition for capacity allocation
   4.2. Practical implications
   4.3. Long-term capacity planning under competition
   4.4. Practical applications and future adaptations
5. Conclusions and future research
References

Part II: Computational Approaches for Business Processes

CHAPTER 6 An Autonomous Agent for Supply Chain Management
David Pardoe and Peter Stone
1. Introduction
2. The TAC SCM scenario
   2.1. Component procurement
   2.2. Computer sales
   2.3. Production and delivery
3. Overview of TacTex-06
   3.1. Agent components
4. The Demand Manager
   4.1. Demand Model
   4.2. Offer Acceptance Predictor
   4.3. Demand Manager
5. The Supply Manager
   5.1. Supplier Model
   5.2. Supply Manager
6. Adaptation over a series of games
   6.1. Initial component orders
   6.2. Endgame sales
7. 2006 Competition results
8. Experiments
   8.1. Supply price prediction modification
   8.2. Offer Acceptance Predictor
9. Related work
10. Conclusions and future work
Acknowledgments
References

CHAPTER 7 IT Advances for Industrial Procurement: Automating Data Cleansing for Enterprise Spend Aggregation
Moninder Singh and Jayant R. Kalagnanam
1. Introduction
2. Techniques for data cleansing
   2.1. Overview of data cleansing approaches
   2.2. Text similarity methods
   2.3. Clustering methods
   2.4. Classification methods
3. Automating data cleansing for spend aggregation
   3.1. Data cleansing tasks for spend aggregation
   3.2. Automating data cleansing tasks for spend aggregation
4. Conclusion
References

CHAPTER 8 Spatial-Temporal Data Analysis and Its Applications in Infectious Disease Informatics
Daniel Zeng, James Ma, Hsinchun Chen and Wei Chang
1. Introduction
2. Retrospective and prospective spatial clustering
   2.1. Literature review
   2.2. Support vector clustering-based spatial-temporal data analysis
   2.3. Experimental studies
   2.4. A case study: Public health surveillance
3. Spatial-temporal cross-correlation analysis
   3.1. Literature review
   3.2. Extended K(r) function with temporal considerations
   3.3. A case study with infectious disease data
4. Conclusions
Acknowledgments
References

CHAPTER 9 Studying Heterogeneity of Price Evolution in eBay Auctions via Functional Clustering
Wolfgang Jank and Galit Shmueli
1. Introduction
2. Auction structure and data on eBay.com
   2.1. How eBay auctions work
   2.2. eBay's data
3. Estimating price evolution and price dynamics
   3.1. Estimating a continuous price curve via smoothing
   3.2. Estimating price dynamics via curve derivatives
   3.3. Heterogeneity of price dynamics
4. Auction segmentation via curve clustering
   4.1. Clustering mechanism and number of clusters
   4.2. Comparing price dynamics of auction clusters
   4.3. A differential equation for price
   4.4. Comparing dynamic and non-dynamic cluster features
   4.5. A comparison with "traditional" clustering
5. Conclusions
References

CHAPTER 10 Scheduling Tasks Using Combinatorial Auctions: The MAGNET Approach
John Collins and Maria Gini
1. Introduction
2. Decision processes in a MAGNET customer agent
   2.1. Agents and their environment
   2.2. Planning
   2.3. Planning the bidding process
   2.4. Composing a request for quotes
   2.5. Evaluating bids
   2.6. Awarding bids
3. Solving the MAGNET winner-determination problem
   3.1. Bidtree framework
   3.2. A formulation
   3.3. Iterative-deepening A*
4. Related work
   4.1. Multi-agent negotiation
   4.2. Combinatorial auctions
   4.3. Deliberation scheduling
5. Conclusions
References

Part III: Supporting Knowledge Enterprise

CHAPTER 11 Structuring Knowledge Bases Using Metagraphs
Amit Basu and Robert Blanning
1. Introduction
2. The components of organizational knowledge
3. Metagraphs and metapaths
   3.1. Metagraph definition
   3.2. Metapaths
   3.3. Metagraph algebra
   3.4. Metapath dominance and metagraph projection
4. Metagraphs and knowledge bases
   4.1. Applications of metagraphs to the four information types
   4.2. Combining data, models, rules, and workflows
   4.3. Metagraph views
5. Conclusion
   5.1. Related work
   5.2. Research opportunities
References

CHAPTER 12 Information Systems Security and Statistical Databases: Preserving Confidentiality through Camouflage
Robert Garfinkel, Ram Gopal, Manuel Nunez and Daniel Rice
1. Introduction
2. DB Concepts
   2.1. Types of statistical databases (SDBs)
   2.2. Privacy-preserving data-mining applications
   2.3. A simple database model
   2.4. Statistical inference in SDBs
3. Protecting against disclosure in SDBs
   3.1. Protecting against statistical inference
   3.2. The query restriction approach
   3.3. The data masking approach
   3.4. The confidentiality via camouflage (CVC) approach
4. Protecting data with CVC
   4.1. Computing certain queries in CVC
   4.2. Star
5. Linking security to a market for private information—A compensation model
   5.1. A market for private information
   5.2. Compensating subjects for increased risk of disclosure
   5.3. Improvement in answer quality
   5.4. The compensation model
   5.5. Shrinking algorithm
   5.6. The advantages of the star mechanism
6. Simulation model and computational results
   6.1. Sample database
   6.2. User queries
   6.3. Results
7. Conclusion
References

CHAPTER 13 The Efficacy of Mobile Computing for Enterprise Applications
John Burke, Judith Gebauer and Michael J. Shaw
1. Introduction
2. Trends
   2.1. Initial experiments in mobile information systems
   2.2. The trend towards user mobility
   2.3. The trend towards pervasive computing
   2.4. The future: ubiquitous computing
3. Theoretical frameworks
   3.1. Introduction
   3.2. The technology acceptance model
   3.3. Example of the technology acceptance model
   3.4. Limitations of the technology acceptance model
   3.5. The task technology fit model
   3.6. Limitations of the task technology fit model
4. Case study: mobile E-procurement
   4.1. Introduction
   4.2. A TTF model for mobile technologies
5. Case study findings
   5.1. Functionality
   5.2. User experiences
6. Conclusions from the case study
7. New research opportunities
8. Conclusion
References

CHAPTER 14 Web-Based Business Intelligence Systems: A Review and Case Studies
Wingyan Chung and Hsinchun Chen
1. Introduction
2. Literature review
   2.1. Business intelligence systems
   2.2. Mining the Web for BI
3. A framework for discovering BI on the Web
   3.1. Collection
   3.2. Conversion
   3.3. Extraction
   3.4. Analysis
   3.5. Visualization
   3.6. Comparison with existing frameworks
4. Case studies
   4.1. Case 1: Searching for BI across different regions
   4.2. Case 2: Exploring BI using Web visualization techniques
   4.3. Case 3: Business stakeholder analysis using Web classification techniques
5. Summary and future directions
References
PREFACE
Fueled by the rapid growth of the Internet, continuously increasing accessibility to communication technologies, and the vast amount of information collected by transactional systems, information overabundance has become an increasingly important problem. Technology evolution has also given rise to new challenges that frustrate both researchers and practitioners. For example, information overload has created data management problems for firms, while the analysis of very large datasets is forcing researchers to look beyond the bounds of inferential statistics. As a result, researchers and practitioners have been focusing on new techniques of data analysis that allow identification, organization, and processing of data in innovative ways to facilitate meaningful analysis. These approaches are based on data mining, machine learning, and advanced statistical learning techniques. The goal of these approaches is to discover models and/or identify patterns of potential interest that lead to strategic or operational opportunities. In addition, privacy, security, and trust issues have grown in importance. Recent legislation (e.g., Sarbanes–Oxley) is also beginning to impact IT infrastructure deployment.

While the popular press has given a lot of attention to the entrepreneurial activities that information technologies, in particular computer networking technologies, have facilitated, the tremendous impact on business practices has received less direct attention. Enterprises are continuously leveraging advances in computing paradigms and techniques to redefine business processes and to increase process effectiveness, leading to better productivity. Some of the important questions in these dimensions include: What new business models are created by the evolution of advanced computing infrastructures for innovative business computing? What are the IT infrastructure and risk management issues for these new business models? Business computing has been the foundation of these, often internal, innovations.

The research contributions in this collection present modeling, computational, and statistical techniques that are being developed and deployed as cutting-edge research approaches to address the problems and challenges posed by information overabundance in electronic business and electronic commerce. This book is an attempt to bring together articles from
thought leaders in their respective areas, presenting state-of-the-art knowledge in business computing research, emerging innovative techniques, and futuristic reflections and approaches that will find their way into mainstream business processes in the near future.

The intended audiences for this book are students in both graduate business and applied computer science classes who want to understand the role of modern computing machinery in business applications. The book also serves as a comprehensive research handbook for researchers who intend to conduct research on the design, evaluation, and management of computing-based innovation for business processes. Business practitioners (e.g., IT managers or technology analysts) should find the book useful as a reference on a variety of novel (current and emerging) computing approaches to important business problems. While the focus of many book chapters is data-centric, the book also provides frameworks for making the business case for computing technology's role in creating value for organizations.
INTRODUCTION
An overview of the book

The book is broadly organized in three parts. The first part (Enhancing and Managing Customer Value) focuses on presenting the state of knowledge in managing and enhancing customer value through the extraction of consumer-centric knowledge from the mountains of data that modern interactive applications generate. The extracted information can then be used to provide more personalized information to customers, provide more relevant information or products, and even to create innovative business processes that enhance the overall value to customers. The second part of the book (Computational Approaches for Business Processes) focuses on presenting several specific innovative computing artifacts and tools developed by researchers that are not yet commercially used. These represent cutting-edge thought and advances in business computing research that should soon find utility in real-world applications or as tools to analyze real-world scenarios. The final part of the book (Supporting Knowledge Enterprise) presents approaches and frameworks that focus on the ability of an enterprise to analyze, build, and protect the computing infrastructure that supports value-added dimensions to the enterprise's existing business processes.

Chapter summaries

Part I: Enhancing and managing customer value

The chapters in this part are, primarily, surveys of the state-of-the-art in research; however, each chapter points to business applications as well as future opportunities for research. The first chapter, by Alexander Tuzhilin (Personalization: The State of the Art and Future Directions), provides a survey of research in personalization technologies. The chapter focuses on providing a structured view of personalization and presents a six-step process for providing effective personalization. The chapter points out why, despite the hype, personalization applications have not reached their true potential, and lays the groundwork for significant future research.
The second chapter, by Prasanna Desikan, Colin DeLong, Sandeep Mane, Kalyan Beemanapalli, Kuo-Wei Hsu, Prasad Sriram, Jaideep Srivastava, Woong-Kee Loh, and Vamsee Venuturumilli (Web Mining for Business Computing), focuses on knowledge extraction from data collected over the Web. The chapter discusses different forms of data that can be collected and mined from different Web-based sources to extract knowledge about the content, structure, or organization of resources and their usage patterns. The chapter discusses the usage of the knowledge extracted from transactional websites in all areas of business applications, including human resources, finance, and technology infrastructure management. One of the results of Web mining has been a better understanding of consumers' browsing and search behavior and the introduction of advanced Web-based technologies and tools.

The chapter by De Liu, Jianqing Chen, and Andrew Whinston (Current Issues in Keyword Auctions) presents the state of knowledge and research opportunities in the area of markets for Web search keywords. For example, Google's popular AdWords and AdSense applications provide a way for advertisers to drive traffic to their sites or place appropriate advertisements on their webspace based on users' search or browsing patterns. While the technology issues surrounding the intent and purpose of a search and matching that with appropriate advertisers are also challenging, the chapter points out the challenges in organizing the markets for these keywords. The chapter presents the state of knowledge in keyword auctions as well as a comprehensive research agenda and issues that can lead to better and more economically efficient outcomes.

Another chapter in this part, by Balaji Padmanabhan (Web Clickstream Data and Pattern Discovery: A Framework and Applications), focuses specifically on pattern discovery in clickstream data. Management research has long distinguished between intent and action. Before the availability of clickstream data, the only data available regarding the actions of consumers on electronic commerce websites was their final product selection. However, the availability of data that captures not only buying behavior, but browsing behavior as well, can provide valuable insights into the choice criteria and product selection process of consumers. This information can be further used to design streamlined storefronts, presentation protocols, purchase processes and, of course, a personalized browsing and shopping experience. The chapter provides a framework for pattern discovery that encompasses the process of representation, learning, and evaluation of patterns, illustrated by conceptual and applied examples of discovering useful patterns.

The part ends with a chapter by Deborah Barnes and Vijay Mookerjee (Customer Delay in E-Commerce Sites: Design and Strategic Implications) examining the operational strategies and concerns with respect to delays suffered by customers on e-commerce sites. Delay management directly affects customers' satisfaction with a website and, as the chapter points out, has implications for decisions regarding the extent of efforts devoted to generating traffic, managing content, and making infrastructure decisions.
The chapter also presents ideas regarding creating innovative business practices, such as an "express lane" and/or intentionally delaying customers when appropriate and acceptable. The chapter also examines the effect of competition on the determination of capacity and service levels.

Part II: Computational approaches for business processes

The first chapter in this part, by David Pardoe and Peter Stone (An Autonomous Agent for Supply Chain Management), describes the details of their winning agent in the Trading Agent Competition for Supply Chain Management. This competition allows autonomous software agents to compete in raw-material acquisition, inventory control, production, and sales decisions in a realistic simulated environment that lasts for 220 simulated days. The complexity and multidimensional nature of the agent's decisions make the problem intractable from an analytical perspective. However, an agent still needs to predict the future state of the market and to take competitive dynamics into account to make profitable sales. It is likely that, in the not-so-distant future, several types of negotiations, particularly for commodities, may be fully automated. Therefore, intelligent and adaptive agent design, as described in this chapter, is an important area of business computing that is likely to make a significant contribution to practice.

The second chapter in this part, by Moninder Singh and Jayant Kalagnanam (IT Advances for Industrial Procurement: Automating Data Cleansing for Enterprise Spend Aggregation), examines the problem of cleansing the massive amounts of data that a reverse aggregator may need in order to make efficient buying decisions on behalf of several buyers. Increasingly, businesses are outsourcing non-core procurement. In such environments, a reverse aggregator needs to create complex negotiation mechanisms (such as electronic requests for quotes and requests for proposals). An essential part of preparing these mechanisms is to provide the rationale and business value of outsourcing. Simple tools such as spreadsheets are not sufficient to handle the scale of operations, in addition to being non-standardized and error-prone. The chapter provides a detailed roadmap and principles for developing an automated system for aggregation and clean-up of data across multiple enterprises as a first step towards facilitating such a mechanism.

The third chapter in this part, by Daniel Zeng, James Ma, Wei Chang, and Hsinchun Chen (Spatial-Temporal Data Analysis and Its Applications in Infectious Disease Informatics), discusses the use of spatial-temporal data analysis techniques to correlate information from offline and online data sources. The research addresses important questions of interest, such as whether current trends are exceptional, and whether they are due to random variations or a new systematic pattern is emerging. Furthermore, the ability to discover temporal patterns and whether they match any known event in the past is also of crucial importance in many application domains, for
example, in the areas of public health (e.g., infectious disease outbreaks), public safety, food safety, transportation systems, and financial fraud detection. The chapter provides case studies in the domain of infectious disease informatics to demonstrate the utility of the analysis techniques.

The fourth chapter, by Wolfgang Jank and Galit Shmueli (Studying Heterogeneity of Price Evolution in eBay Auctions via Functional Clustering), provides a novel technique to study price formation in online auctions. While there has been an explosion of studies that analyze online auctions from an empirical perspective in the past decade, most of the studies provide either a comparative statics analysis of prices (i.e., the factors that affect prices in an auction) or a structural view of the price formation process (i.e., assuming that game-theoretic constructs of price formation are known and captured by the data). However, the dynamics of the price formation process have rarely been studied. The dynamics of the process can provide valuable and actionable insights to both a seller and a buyer. For example, different factors may drive prices at different phases of the auction; in particular, the starting bid or the number of items available may be the driver of price movement at the beginning of an auction, while the nature of bidding activity would be the driver in the middle of the auction. The technique discussed in the chapter provides a fresh statistical approach to characterize the price formation process and can identify dynamic drivers of this process. The chapter shows the information that can be gained from this process and opens up the potential for designing a new generation of online mechanisms.

The fifth and final chapter in this part, by John Collins and Maria Gini (Scheduling Tasks Using Combinatorial Auctions: The MAGNET Approach), presents a combinatorial auction mechanism as a solution to complex business transactions that require coordinated combinations of goods and services under several business constraints, often resulting in complex combinatorial optimization problems. The chapter presents a new generation of systems that will help organizations and individuals find and exploit opportunities that are otherwise inaccessible or too complex to evaluate. These systems will help potential partners find each other and negotiate mutually beneficial deals. The authors evaluate their environment and proposed approach using the Multi-AGent NEgotiation Testbed (MAGNET). The testbed allows self-interested agents to negotiate complex coordinated tasks with a variety of constraints, including precedence and time constraints. Using the testbed, the chapter demonstrates how a customer agent can solve the complex problems that arise in such an environment.

Part III: Supporting knowledge enterprise

The first chapter in this part, by Amit Basu and Robert Blanning (Structuring Knowledge Bases Using Metagraphs), provides a graphical
modeling and analysis technique called metagraphs. Metagraphs can represent, integrate, and analyze various types of knowledge bases existing in an organization, such as data and their relationships, decision models, information structures, and organizational constraints and rules. While other graphical techniques to represent such knowledge bases exist, these approaches are usually purely representational and do not provide methods and techniques for conducting inferential analysis. A given metagraph allows the use of graph-theoretic techniques and several algebraic operations in order to analyze its constructs and the relationships among them. The chapter presents the constructs and methods available in metagraphs, some examples of usage, and directions for future research and applications.

The second chapter in this part, by Robert Garfinkel, Ram Gopal, Manuel Nunez, and Daniel Rice (Information Systems Security and Statistical Databases: Preserving Confidentiality through Camouflage), describes an innovative camouflage-based technique to ensure statistical confidentiality of data. The basic and innovative idea of this approach, as opposed to perturbation-based approaches to data confidentiality, is to provide the ability to conduct aggregate analysis with exact and correct answers to the queries posed to a database and, at the same time, provide confidentiality by ensuring that no combinations of queries reveal exact privacy-compromising information. This provides an important approach for business applications where personal data often needs to be legally protected.

The third chapter, by John Burke, Michael Shaw, and Judith Gebauer (The Efficacy of Mobile Computing for Enterprise Applications), analyzes the efficacy of the mobile platform for enterprise and business applications. The chapter provides insights as to why firms have not been able to adopt the mobile platform in a widespread manner. The authors posit that gaps exist between users' task needs and technological capabilities that prevent users from adopting these applications. They find antecedents to the acceptance of mobile applications in the context of a requisition system at a Fortune 100 company and provide insights as to what factors can enhance the chances of acceptance of the mobile platform for business applications.

The final chapter in this part and in the book, by Wingyan Chung and Hsinchun Chen (Web-based Business Intelligence Systems: A Review and Case Studies), reviews the state of knowledge in building Web-based Business Intelligence (BI) systems and proposes a framework for developing such systems. A Web-based BI system can provide managers with real-time capabilities for assessing their competitive environments and supporting managerial decisions. The authors discuss the various steps in building a Web-based BI system, such as collection, conversion, extraction, analysis, and visualization of data for BI purposes. They provide three case studies of developing Web-based BI systems and present results from experimental studies regarding the efficacy of these systems.
Concluding remarks

The importance of the topic of business computing is unquestionable. Information technology and computing-based initiatives have been and continue to be at the forefront of many business innovations. This book is intended to provide an overview of the current state of knowledge in business computing research as well as the emerging computing-based approaches and technologies that may appear in the innovative business processes of the near future. We hope that this book will serve as a source of information to researchers and practitioners, facilitate further discussions on the topic of business computing, and provide inspiration for further research and applications.

This book has been several years in the making, and we are excited to see it come to life. It contains a collection of 14 chapters written by experts in the areas of information technologies and systems, computer science, business intelligence, and advanced data analytics. We would like to thank all the authors of the book chapters for their commitment and contributions to this book. We would also like to thank all the diligent reviewers who provided comprehensive and insightful reviews of the chapters, in the process making this a much better book; our sincere thanks go to Jesse Bockstedt, Wingyan Chung, Sanjukta Das Smith, Gilbert Karuga, Wolfgang Ketter, YoungOk Kwon, Chen Li, Balaji Padmanabhan, Claudia Perlich, Pallab Sanyal, Mu Xia, Xia Zhao, and Dmitry Zhdanov. We also extend our gratitude to Emerald for their encouragement and help throughout the book publication process.

Gediminas Adomavicius and Alok Gupta
Part I Enhancing and Managing Customer Value
Adomavicius & Gupta, Eds., Handbooks in Information Systems, Vol. 3 Copyright r 2009 by Emerald Group Publishing Limited
Chapter 1
Personalization: The State of the Art and Future Directions
Alexander Tuzhilin
Stern School of Business, New York University, 44 West 4th Street, Room 8-92, New York, NY 10012, USA
Abstract

This chapter examines the major definitions and concepts of personalization, reviews various personalization types and discusses when it makes sense to personalize and when it does not. It also reviews the personalization process and discusses how various stages of this process can be integrated in a tightly coupled manner in order to avoid "discontinuity points" between its different stages. Finally, future research directions in personalization are discussed.
1. Introduction
Personalization, the ability to tailor products and services to individuals based on knowledge about their preferences and behavior, was listed in the July 2006 issue of Wired Magazine among the six major trends driving the global economy (Kelleher, 2006). This observation was echoed by Eric Schmidt, the CEO of Google, who observed in Schmidt (2006) that "we have the tiger by the tail in that we have this huge phenomenon of personalization." This is in sharp contrast to the previously reported disappointments with personalization, as expressed by numerous prior authors and eloquently summarized by Kemp (2001):

   No set of e-business applications has disappointed as much as personalization has. Vendors and their customers are realizing that, for example, truly personalized Web commerce requires a reexamination of business processes and marketing strategies as much as installation of shrinkwrapped software. Part of the problem is that personalization means something different to each e-business.
Many of these disappointing experiences happened because various businesses jumped on the popular "personalization bandwagon" in the late 1990s and early 2000s without putting considerable thought into such questions as what, why and when it makes sense to personalize. This situation fits well Gartner's hype-and-gloom curve, presented in Fig. 1, which characterizes the growth patterns of many technologies, personalization being one of them. As Fig. 1 demonstrates, expectations of personalization technologies initially exceeded the actual technological developments (the area of inflated expectations), then were followed by profound disappointments with these technologies (in the early 2000s), as reflected in Kemp's quote above, and finally reached the "slope of enlightenment," when expectations from the technologies coincided with the actual technological developments. The Wired magazine article cited above and the remark by Eric Schmidt acknowledge the fact that personalization technology has significantly matured by now and that it has a very large potential if understood well and implemented properly.

Fig. 1. Gartner's hype-and-gloom curve for personalization technologies.

It turns out that the hype-and-gloom situation with personalization of the 1990s and 2000s described above constitutes only the most recent development in the field. The roots of personalization can be traced back to antiquity, when business owners knew their customers and provided different products and services to different customers based on extensive knowledge of who they were and on a good understanding of their needs. More recently, elements of personalization can be traced to the second half of the 19th century, when Montgomery Ward added some simple personalization features to their otherwise mass-produced catalogs (Ross, 1992). However, all these early personalization activities were either done on a small scale or were quite elementary.

On a large scale, the roots of personalization can be traced to direct marketing, when the customer segmentation method based on the recency-frequency-monetary (RFM) model was developed by a catalog company to decide which customers should receive their catalog (Peterson et al., 1997). Also, the direct marketing company Metromail developed the Selection by Individual Families and Tracts (SIFT) system in the mid-1960s, which segmented customers based on such attributes as telephone ownership, length of residence, head of household, gender and the type of dwelling to make catalog shipping decisions. This approach was later refined in the late 1960s when customers were also segmented based on their ZIP codes. These segmentation efforts were also combined with content customization when Time magazine experimented with sending mass-produced letters in the 1940s that began with the salutation "Dear Mr. Smith . . ." addressed to all the Mr. Smiths on the company's mailing list (Reed, 1949). However, all these early-day personalization approaches were implemented "by hand," without using Information Technologies.

It was only in the mid-1960s that direct marketers began using IT to provide personalized services, such as producing computer-generated letters that were customized to the needs of particular segments of customers. As an early example of such computerized targeted marketing, Fingerhut targeted New York residents with personalized letters that began, "Remember last January when temperatures in the state of New York dropped to a chilly 32 degrees?" (Peterson et al., 1997). Similarly, Burger King was one of the modern early adopters of personalization, with its "Have it your way" campaign launched in the mid-1970s. However, it was not until the 1980s that the areas of direct marketing and personalization experienced major advances, due to the development of more powerful computers, database technologies and more advanced data analysis methods (Peterson et al., 1997), and automated personalization became a reality.

Personalization was taken to the next level in the mid- to late 1990s with the advancement of Web technologies and various personalization tools helping marketers interact with their customers on a 1-to-1 basis in real time. As a result, a new wave of personalization companies emerged, such as Broadvision, ATG, Blue Martini, e.Piphany, Kana, DoubleClick, Claria, ChoiceStream and several others. As an example, the PersonalWeb platform developed by Claria provides behavioral targeting of website visitors by "watching" their clicks and delivering personalized online content, such as targeted ads, news and RSS feeds, based on the analysis of their online activities. Claria achieves this behavioral targeting by requesting online users to download and install the behavior-tracking
software on their computers. Similarly, ChoiceStream software helps Yahoo, AOL, Columbia House, Blockbuster and other companies to personalize home pages for their customers and thus deliver relevant content, products, search results and advertising to them. The benefits derived from such personalized solutions should be balanced against possible problems of violating consumer privacy (Kobsa, 2007). Consequently, some of these personalization companies, including DoubleClick and Claria, have had problems with consumer privacy advocates in the past.

On the academic front, personalization has been explored in the marketing community since the 1980s. For example, Surprenant and Solomon (1987) studied personalization of services and concluded that personalization is a multidimensional construct that must be approached carefully in the context of service design, since personalization does not necessarily result in greater consumer satisfaction with the service offerings in all cases. The field of personalization was popularized by Peppers and Rogers starting with the publication of their first book (Peppers and Rogers, 1993) on 1-to-1 marketing in 1993. Since that time, many publications have appeared on personalization in the computer science, information systems, marketing, management science and economics literature.

In the computer science and information systems literature, special issues of the CACM (Communications of the ACM, 2000) and the ACM TOIT (Mobasher and Anand, 2007) journals have already been dedicated to personalization technologies, and another one (Mobasher and Tuzhilin, 2009) will be published shortly. Some of the most recent reviews and surveys of personalization include Adomavicius and Tuzhilin (2005a), Eirinaki and Vazirgiannis (2003) and Pierrakos et al. (2003). The main topics in personalization studied by computer scientists include Web personalization (Eirinaki and Vazirgiannis, 2003; Mobasher et al., 2000; Mobasher et al., 2002; Mulvenna et al., 2000; Nasraoui, 2005; Pierrakos et al., 2003; Spiliopoulou, 2000; Srivastava et al., 2000; Yang and Padmanabhan, 2005), recommender systems (Adomavicius and Tuzhilin, 2005b; Hill et al., 1995; Pazzani, 1999; Resnick et al., 1994; Schafer et al., 2001; Shardanand and Maes, 1995), building user profiles and models (Adomavicius and Tuzhilin, 2001a; Billsus and Pazzani, 2000; Cadez et al., 2001; Jiang and Tuzhilin, 2006a,b; Manavoglu et al., 2003; Mobasher et al., 2002; Pazzani and Billsus, 1997), design and analysis of personalization systems (Adomavicius and Tuzhilin, 2002; Adomavicius and Tuzhilin, 2005a; Eirinaki and Vazirgiannis, 2003; Padmanabhan et al., 2001; Pierrakos et al., 2003; Wu et al., 2003) and studies of personalized searches (Qiu and Cho, 2006; Tsoi et al., 2006).1 Most of these areas have a vast body of literature and can be the subject of a separate survey. For example, the survey of recommender systems (Adomavicius and Tuzhilin, 2005b) cites over 100 papers, the 2003 survey of Web personalization (Eirinaki and Vazirgiannis, 2003) cites 40 papers on the corresponding topics, and these numbers grow rapidly each year.

In the marketing literature, the early work on personalization (Surprenant and Solomon, 1987) and (Peppers and Rogers, 1993), described above, was followed by several authors studying such problems as targeted marketing (Chen and Iyer, 2002; Chen et al., 2001; Rossi et al., 1996), competitive personalized promotions (Shaffer and Zhang, 2002), recommender systems (Ansari et al., 2000; Haubl and Murray, 2003; Ying et al., 2006), customization (Ansari and Mela, 2003) and studies of effective strategies of personalization services firms (Pancras and Sudhir, 2007). In the economics literature, there has been work on studying personalized pricing, when companies charge different prices to different customers or customer segments (Choudhary et al., 2005; Elmaghraby and Keskinocak, 2003; Jain and Kannan, 2002; Liu and Zhang, 2006; Ulph and Vulkan, 2001). In the management science literature, the focus has been on interactions between operations issues and personalized pricing (Elmaghraby and Keskinocak, 2003) and also on the mass customization problems (Pine, 1999; Tseng and Jiao, 2001) and their limitations (Zipkin, 2001). Some of the management science and economics-based approaches to Internet-based product customization and pricing are described in Dewan et al. (2000). A review of the role of management science in research on personalization is presented in Murthi and Sarkar (2003).

With all these advances in academic research on personalization and in developing personalized solutions in the industry, personalization "is back," as is evidenced by the aforementioned quotes from the Wired magazine article and Eric Schmidt. In order to understand these sharp swings in perception about personalization, as described above, and to grasp general developments in the field, we first review the basic concepts of personalization, starting with its definition in Section 2. In Section 3, we examine different types of personalization since, according to David Smith (2000), "there are myriad ways to get personal," and we need to understand them to have a good grasp of personalization. In Section 4, we discuss when it makes sense to personalize. In Section 5, we present a personalization process. In Section 6, we explain how different stages of the personalization process can be integrated into one coherent system. Finally, we discuss future research directions in personalization in Section 7.

1 There are many papers published in each of these areas. The references cited above are either surveys or serve only as representative examples of some of this work demonstrating the scope of the efforts in these areas; they do not provide exhaustive lists of citations in each of the areas.
2. Definition of personalization
Since personalization constitutes a rapidly developing field, there still exist different points of view on what personalization is, as expressed by
academics and practitioners. Some representative definitions of personalization proposed in the literature are:

- "Personalization is the ability to provide content and services that are tailored to individuals based on knowledge about their preferences and behavior" (Hagen, 1999).
- "Personalization is the capability to customize communication based on knowledge preferences and behaviors at the time of interaction" (Dyche, 2002).
- "Personalization is about building customer loyalty by building a meaningful 1-to-1 relationship; by understanding the needs of each individual and helping satisfy a goal that efficiently and knowledgeably addresses each individual's need in a given context" (Riecken, 2000).
- "Personalization involves the process of gathering user information during interaction with the user, which is then used to deliver appropriate content and services, tailor-made to the user's needs" (www.ariadne.ac.uk/issue28/personalization).
- "Personalization is the ability of a company to recognize and treat its customers as individuals through personal messaging, targeted banner ads, special offers, . . . or other personal transactions" (Imhoff et al., 2001).
- "Personalization is the combined use of technology and customer information to tailor electronic commerce interactions between a business and each individual customer. Using information either previously obtained or provided in real-time about the customer and other customers, the exchange between the parties is altered to fit that customer's stated needs so that the transaction requires less time and delivers a product best suited to that customer" (www.personalization.com, as it was defined on this website in the early 2000s).

Although different, all these definitions identify several important points about personalization. Collectively, they maintain that personalization tailors certain offerings by providers to consumers based on certain knowledge about them, on the context in which these offerings are provided and with certain goal(s) in mind. Moreover, these personalized offerings are delivered from providers to consumers through personalization engines along certain distribution channels based on the knowledge about the consumers, the context and the personalization goals. Each of the italicized words above is important, and will be explained below.

1. Offerings. Personalized offerings can be of very different types. Some examples of these offerings include:
   - Products, both ready-made products that are selected for the particular consumer (such as books, CDs, vacation packages and other ready-made products offered by a retailer) and products manufactured in a custom-made fashion for a particular consumer (such as custom-made CDs and custom-designed clothes and shoes).
   - Services, such as individualized subscriptions to concerts and personalized access to certain information services.
   - Communications. Personalized offerings can include a broad range of marketing and other types of communications, including targeted ads, promotions and personalized email.
   - Online content. Personalized content can be generated for an individual customer and delivered to him or her in the best possible manner. This personalized content can include dynamically generated Web pages, new and modified links and insertion of the various communications described above into pre-generated Web pages.
   - Information searches. Depending on the past search history and on other personal characteristics of an online user, a search engine can return different search results or present them in a different order to customize them to the needs of a particular user (Qiu and Cho, 2006; Tsoi et al., 2006).
   - Dynamic prices. Different prices can be charged for different products depending on personal characteristics of the consumer (Choudhary et al., 2005).
These offerings constitute the marketing outputs of the personalization process (Vesanen and Raulas, 2006). Given a particular type of offering, it is necessary to specify the universe (or the space) of offerings O of that type and identify its structure. For example, in the case of personalized online content, it is necessary to identify what kind of content can be delivered to the consumer, how "granular" it is and what the structure of this content is. Similarly, in the case of personalized emails, it is necessary to specify what the structure of an email message is, which parts of the message can be tailored and which are fixed, and what the "space" of all the email messages is. Similarly, in the case of personalized prices, it is important to know what the price ranges are and what the granularity of the price unit is if the prices are discrete.

2. Consumers can be considered either at the individual level or grouped into segments, depending on the particular type of personalization, the type of targeting and the personalization objectives. The former case fits into the 1-to-1 paradigm (Peppers and Rogers, 1993), whereas the latter fits into the segmentation paradigm (Wedel and Kamakura, 2000). It is an interesting and important research question to determine which of these two approaches is better and in which sense. The 1-to-1 approach builds truly personalized models of consumers but may suffer from not having enough data and from the data being "noisy," i.e., containing various types of consumer biases, imperfect information, mistakes, etc. (Chen et al., 2001), whereas the segmentation approach has sufficient data but may suffer from the problem of having heterogeneous populations of consumers
within the segments. This question has been studied before by marketers, and the results of this work are summarized in Wedel and Kamakura (2000). In the IS/CS literature, some solutions to this problem are described in Jiang and Tuzhilin (2006a,b, 2007). Moreover, this problem will be discussed further in Section 3.3. Note that some of the definitions of personalization presented above refer to customers, while others refer to users and individuals. In the most general setting, personalization is applicable to a broad set of entities, including customers, suppliers, partners, employees and other stakeholders in the organization. In this chapter, we will collectively refer to these entities as consumers, using the most general meaning of this term in the sense described above.

3. Providers are the entities that provide personalized offerings, such as e-commerce websites, search engines and various offline outlets and organizations.

4. Tailoring. Given the space O of all the possible offerings described above and a particular consumer or a segment of consumers c, which offering or set of offerings should be selected from the space O in each particular situation to customize the offering(s) to the needs of c according to the personalization goal(s) described below? How to deliver these customized offerings to individual consumers constitutes one of the key questions of personalization. We will address this question in Section 5 (Stage 3) when describing the "matchmaking" stage of the personalization process (an illustrative sketch of this matchmaking step is given after the list of goals in item 7 below).

5. Knowledge about consumers. All the available information about the consumer, including demographic, psychographic, browsing, purchasing and other transactional information, is collected, processed, transformed, analyzed and converted into actionable knowledge that is stored in consumer profiles. This information is gathered from multiple sources. One of the crucial sources of this knowledge is the transactional information about interactions between the personalization system and the consumer, including purchasing transactions, browsing activities and various types of inquiries and information-gathering interactions. This knowledge, obtained from the collected data and stored in the consumer profiles, is subsequently used to determine how to customize offerings to the consumers.

The consumer profiles contain two types of knowledge. First, a profile contains factual knowledge about the consumer: demographic, transactional and other crucial consumer information that is processed and aggregated into a collection of facts about the person, including various statistics about the consumer's behavior. Simple factual information about the consumer can be stored as a record in a relational database or as a consumer-centric data warehouse (DW) (Kimball, 1996). More complicated factual information, such as information about the social network of a person and his or her relationships and interactions with other consumers, may require the use of taxonomies and ontologies and can be captured using XML or special languages for defining ontologies (Staab and Studer, 2003), such as OWL (Antoniou and Harmelen, 2003). Second, the consumer profile contains one or several data mining and statistical models capturing the behavior either of this particular consumer or of the segment of similar consumers to which the person belongs. These models are stored as part of the consumer-centric modelbase (Liu and Tuzhilin, 2008). Together, these two parts form the consumer profile that will be described in greater detail in Section 5.

6. Context. Tailoring of a particular offering to the needs of the consumer depends not only on the knowledge about the consumer, but also on the context in which this tailoring occurs. For example, when recommending a movie to the consumer, it is not only important to know his or her movie preferences, but also the context in which these recommendations are made, such as with whom the person is going to see a movie, when and where. If a person wants to see a movie with his girlfriend in a movie theater on Saturday night, then, perhaps, a different movie should be recommended than if he wants to see it with his parents on Thursday evening at home on a VCR. Similarly, when a consumer shops for a gift, different products should be offered to her in this context than when she shops for herself.

7. Goal(s) determine the purpose of personalization. Tailoring particular offerings to the consumers can have various objectives, including:
   - Maximizing consumer satisfaction with the provided offering and the overall consumer experience with the provider.
   - Maximizing the Lifetime Value (LTV) (Dwyer, 1989) of the consumer, which determines the total discounted value of the person derived over the entire lifespan of the consumer. This maximization is done over a long-range time horizon rather than pursuing short-term satisfaction.
   - Improving consumer retention and loyalty and decreasing churn. For example, the provider should tailor its offerings so that this tailoring would maximize repeat visits of the consumer to the provider. The dual problem is to minimize the churn rates, i.e., the rates at which the current consumers abandon the provider.
   - Better anticipating consumers' needs and, therefore, serving them better. One way to do this would be to design the personalization engine so that it would maximize the predictive performance of tailored offerings, i.e., it would try to select the offerings that the consumer likes.
   - Making interactions between providers and consumers efficient, satisfying and easier for both of them. For example, in the case of Web personalization, this amounts to the improvement of the website design and helping visitors to find relevant information quickly and efficiently. Efficiency may also include saving consumer time. For example, a well-organized website may help consumers to come in, efficiently buy product(s) and exit, thus saving precious time for the consumer.
   - Maximizing conversion rates whenever applicable, i.e., converting prospective customers into buyers. For example, in the case of Web personalization, this would amount to converting website visitors and browsers into buyers.
   - Increasing cross- and up-selling of the provider's offerings.
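The components above lend themselves to a simple computational reading. The following minimal sketch is not from the chapter; all names (Offering, ConsumerProfile, select_offering, the toy scoring model) are illustrative assumptions. It frames the "matchmaking" question of item 4: a consumer profile combining factual knowledge and a learned behavioral model (item 5) is scored against each candidate offering in the space O (item 1) for a given context (item 6), and the engine (item 8) selects the offering that best serves whichever goal from the list above the provider pursues.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Offering:
    """One element of the offering space O (a product, an ad, a price, etc.)."""
    offering_id: str
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ConsumerProfile:
    """Consumer profile with the two kinds of knowledge described in item 5:
    factual knowledge plus a learned behavioral model (both hypothetical here)."""
    consumer_id: str
    facts: Dict[str, Any] = field(default_factory=dict)
    model: Callable[[Offering, Dict[str, Any]], float] = None


def select_offering(profile: ConsumerProfile,
                    offering_space: List[Offering],
                    context: Dict[str, Any]) -> Offering:
    """Matchmaking step (item 4): pick the offering in O that the profile's
    behavioral model scores highest for this consumer in this context.
    The score stands in for whichever goal from item 7 the provider pursues
    (predicted satisfaction, expected contribution to lifetime value, etc.)."""
    return max(offering_space, key=lambda offering: profile.model(offering, context))


if __name__ == "__main__":
    def toy_model(offering: Offering, context: Dict[str, Any]) -> float:
        # Purely illustrative preference model: comedies for weekend evenings,
        # dramas otherwise.
        genre = offering.attributes.get("genre")
        if context.get("occasion") == "weekend_evening":
            return 1.0 if genre == "comedy" else 0.2
        return 1.0 if genre == "drama" else 0.3

    movies = [Offering("m1", {"genre": "comedy"}), Offering("m2", {"genre": "drama"})]
    profile = ConsumerProfile("c42", facts={"age_group": "25-34"}, model=toy_model)
    best = select_offering(profile, movies, {"occasion": "weekend_evening"})
    print(best.offering_id)  # prints "m1"
```

In a real system, the model field would correspond to the statistical or data mining models stored in the consumer-centric modelbase, and the scoring function would encode one of the marketing- or economics-oriented goals discussed next.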
The goals listed above can be classified into marketing- and economics-oriented. In the former case, the goal is to understand and satisfy the needs of the consumers, sometimes even at the expense of the short-term financial performance for the company, as is clearly demonstrated by the second (LTV) goal. For example, an online retailer may offer products and services to the consumer to satisfy his or her needs even if these offerings are not profitable to the retailer in the short term. In the latter case, the goal is to improve the short-term financial performance of the provider of the personalization service. As was extensively argued in the marketing literature, all the marketing-oriented goals eventually contribute to the long-term financial performance of the company (Kotler, 2003). Therefore, the difference between the marketing- and the economics-oriented goals boils down to the long- vs. the short-term performance of the company and, thus, both types of goals are based on fundamental economic principles. Among the seven examples of personalization goals listed above, the first five goals are marketing-oriented, whereas the last two are economics-oriented since their objectives are to increase the immediate financial performance of the company. Finally, a personalization service provider can simultaneously pursue multiple goals, among which some can be marketing- and others economics-oriented goals.
constitutes the cross-channel optimization problem in marketing (IBM Consulting Services, 2006). If implemented properly, personalization can provide several important advantages for the consumers and providers of personalized offerings depending on the choice of specific goals listed in item (7) above. In particular, it can improve consumer satisfaction with the offerings and the consumer experience with the providers; it can make consumer interactions easier, more satisfying, efficient and less time consuming. It can improve consumer loyalty, increase retention, decrease churn rates and thus can lead to higher LTVs of some of the consumers. Finally, well-designed economics-oriented personalization programs lead to higher conversion and click-through rates and better up- and cross-selling results. Besides personalization, mass customization (Tseng and Jiao, 2001; Zipkin, 2001) constitutes another popular concept in marketing and operations management, which is sometimes used interchangeably with personalization in the popular press. Therefore, it is important to distinguish these two concepts to avoid possible confusion. According to Tseng and Jiao (2001), mass customization is defined as ‘‘producing goods and services to meet individual customer’s needs with near mass production efficiency.’’ According to this definition, mass customization deals with efficient production of goods and services, including manufacturing of certain products according to specified customer needs and desires. It is also important to note that these needs and desires are usually explicitly specified by the customers in mass customization systems, such as specifying the body parameters for manufacturing customized jeans, the feet parameters for manufacturing customized shoes and computer configurations for customized PCs. In contrast to the case of mass customization, offerings are usually tailored to individual consumers without any significant production processes in case of personalization. Also, in case of personalization, the knowledge about the needs and desires of consumers is usually implicitly learned from multiple interactions with them rather than it being explicitly specified by the consumers in case of mass customization. For example, in case of customized websites, such as myYahoo!, the user specifies her interests, and the website generates content according to the specified interests of the user. This is in contrast to the personalized web page on Amazon, when Amazon observes the consumer purchases, implicitly learns her preferences and desires from these purchases and personalizes her ‘‘welcome’’ page according to this acquired knowledge. Therefore, personalization is about learning and responding to customer needs, whereas mass customization is about explicit specification of these needs by the customers and customizing offered products and services to these needs by tailoring production processes. In this section, we explained what personalization means. In the next section, we describe different types of personalization.
3 Types of personalization
Tailoring of personalized offerings by providers to consumers can come in many different forms and shapes, thus resulting in various types of personalization. As David Smith put it, "there are myriad ways to get personal" (Smith, 2000). In this section, we describe different types of personalization.

3.1 Provider- vs. consumer- vs. market-centric personalization

Personalized offerings can be delivered from providers to consumers by personalization engines in three ways, as presented in Fig. 2 (Adomavicius and Tuzhilin, 2005a). In these diagrams, providers and consumers of personalized offerings are denoted by white boxes, personalization engines by gray boxes and the interactions between consumers and providers by solid lines. Figure 2(a) presents the provider-centric personalization approach that assumes that each provider has its own personalization engine that tailors the provider's content to its consumers. This is the most common approach to personalization, as popularized by Amazon.com, Netflix and the Pandora streaming music service. In this approach, there are two sets of goals for the personalization engines. On the one hand, they should provide the best marketing service to their customers and fulfill some of the marketing-oriented goals presented in Section 2. On the other hand, these provider-centric personalization services are designed to improve financial performance of the providers of these services (e.g., Amazon.com and Netflix), and therefore their behavior is driven by the economics-oriented goals listed in Section 2. Therefore, the challenge for the provider-centric approaches to personalization is to strike a balance between the two sets of goals by keeping the customers happy with tailored offerings and making personalization solutions financially viable for the provider.
Fig. 2. Classification of personalization approaches: (a) provider-centric, (b) consumer-centric, (c) market-centric.
The second approach, presented in Fig. 2(b), is the consumer-centric approach, which assumes that each consumer has his or her own personalization engine (or agent) that "understands" this particular consumer and provides personalization services across several providers based on this knowledge. This type of consumer-centric personalization delivered across a broad range of providers and offerings is called an e-Butler service (Adomavicius and Tuzhilin, 2002) and is popularized by the PersonalWeb service from Claria (www.claria.com). The goals of a consumer-centric personalization service are limited exclusively to the needs of the consumer and should pursue only the consumer-centric objectives listed in Section 2, such as anticipating consumer needs and making interactions with a website more efficient and satisfying for the consumer. The problem with this approach lies in developing a personalization service of such quality and value to the consumers that they would be willing to pay for it. This would remove the dependency on advertising and other sources of revenue coming from the providers of personalized services, which would otherwise go against the philosophy of a purely consumer-centric service.

The third approach, presented in Fig. 2(c), is the market-centric approach, which provides personalization services for a marketplace in a certain industry or sector. In this case, the personalization engine performs the role of an infomediary by knowing the needs of the consumers and the providers' offerings and trying to match the two parties in the best ways according to their internal goals. Personalized portals customizing the services offered by their corporate partners to the individual needs of their customers would be an example of this market-centric approach.
3.2 Types of personalized offerings

Types of personalization methods can vary very significantly depending on the type of offering provided by the personalization application. For example, methods for determining personalized searches (Qiu and Cho, 2006) differ significantly from the methods for determining personalized pricing (Choudhary et al., 2005), which also differ significantly from the methods for delivering personalized content to the Web pages (Sheth et al., 2002) and personalized recommendations for useful products (Adomavicius and Tuzhilin, 2005b). In Section 2, we identified various types of offerings, including:

- Products and services,
- Communications, including targeted ads, promotions and personalized email,
- Online content, including dynamically generated Web pages and links,
- Information searches,
- Dynamic prices.
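The structure of the offering space O behind these offering types is discussed next. As a minimal illustrative sketch (the categories, slots and values below are hypothetical and not taken from this chapter), a dynamic-pricing space can be represented as a simple numeric range, while an online-content space is better represented as a hierarchical taxonomy:

```python
# Hypothetical sketch of two kinds of offering spaces O.

# Dynamic prices: a simple one-dimensional space (a price range in dollars).
price_space = {"min_price": 10.0, "max_price": 100.0, "step": 0.50}

# Online content: a hierarchical taxonomy of categories plus page slots
# that a personalization engine could fill with items from the taxonomy.
content_taxonomy = {
    "books": {
        "fiction": ["mystery", "science fiction"],
        "travel": ["travel to Africa", "travel to Europe"],
    },
}
page_slots = ["featured_section", "bargain_offers", "recent_history"]

def offerings_under(taxonomy, path):
    """Return the candidate offerings reachable under a given category path."""
    node = taxonomy
    for key in path:
        node = node[key]
    return node

print(offerings_under(content_taxonomy, ["books", "travel"]))
```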
One of the defining factors responsible for differences in methods of delivering various types of personalized offerings is the structure and complexity of the offerings space O that can vary quite significantly across the types of offerings listed above. For example, in case of dynamic prices, the structure of the offering space O is relatively simple (e.g., a discrete or continuous variable within a certain range), whereas in case of online content tailoring it can be very large and complex depending on the granularity of the web content and how the content is structured on the web pages of a particular personalization application. Another defining factor is conceptually different methods for delivering various types of targeted offerings. For example, how to specify dynamic prices depends on the underlying economic theories, whereas providing personalized recommendations depends on the underlying data mining and other recommendation methods discussed in Section 5. Similarly, methods of delivering personalized searches depend on underlying information retrieval and web search theories.

A particular application can also deal with a mixture of the various types of offerings described above, which can result in a combination of different personalization methods. For example, if an online retailer decides to add dynamic prices to the already developed personalized product offerings (i.e., customer X receives a recommendation for book Y at a personalized price Z), then this means combining personalized recommendation methods, such as the ones discussed in Section 5, with personalized pricing methods. Alternatively, a search engine may deliver personalized search results and personalized search-related ads targeted to individuals that are based not only on the search keywords specified by the consumer, but also on the personal characteristics of the consumer, as defined in his or her profile, such as the past search history, geographic location and demographic data in case it is available.

3.3 Individual vs. segment-based personalization

As was pointed out in Section 2, personalized offerings can be tailored either to the needs of individuals or segments of consumers. In the former case, the consumer profile is built exclusively from the data pertaining to this and only this consumer (Adomavicius and Tuzhilin, 2001a; Jiang and Tuzhilin, 2006a). In the latter case, the consumer is grouped into a segment of similar individuals, and the profile is built for the whole segment. This profile is subsequently applied to target the same offering to the whole segment. The smaller the segment size, the finer the targeting of the offering to the consumers in that segment and, therefore, the more personalized the offerings become. Thus, by varying segment sizes, we change the degree of personalization from being coarse for large segments to being fine for
smaller segments. In the limit, complete personalization is reached for 1-to-1 marketing when the segment size is always one. Although strongly advocated in the popular press (Peppers and Rogers, 1993; Peppers and Rogers, 2004), it is not clear that targeting personalized offerings to individual consumers will always be better than targeting them to segments of consumers because of the tradeoff between sparsity of data for individual consumers and heterogeneity of consumers within segments: individual consumer profiles may suffer from sparse data, resulting in high variance of performance measures of individual consumer models, whereas aggregate profiles of consumer segments suffer from high levels of customer heterogeneity, resulting in high performance biases. Depending on which effect dominates the other, it is possible that individualized personalization models outperform the segmented or aggregated models, and vice versa.

The tradeoff between these two approaches has been studied in Jiang and Tuzhilin (2006a), where performance of individual, aggregate and segmented models of consumer behavior was compared empirically across a broad spectrum of experimental settings. It was shown that for highly transacting consumers or poor segmentation techniques, individual-level consumer models outperform segmentation models of consumer behavior. These results reaffirm the anecdotal evidence about the advantages of personalization and 1-to-1 marketing stipulated in the popular press (Peppers and Rogers, 1993; Peppers and Rogers, 2004). However, the experiments reported in Jiang and Tuzhilin (2006a) also show that segmentation models, taken at the best granularity level(s) and generated using effective clustering methods, dominate individual-level consumer models when modeling consumers with little transactional data. Moreover, this best granularity level is significantly skewed towards the 1-to-1 case and is usually achieved at the finest segmentation levels. This finding provides additional support for the case of micro-segmentation (Kotler, 2003; McDonnell, 2001), i.e., consumer segmentation done at a highly granular level.

In conclusion, determining the right segment sizes and the optimal degree of personalization constitutes an important decision in personalization applications and involves the tradeoff between heterogeneity of consumer behavior in segmented models vs. sparsity of data for small segment sizes and individual models.

3.4 Smart vs. trivial personalization

Some personalization systems provide only superficial solutions, including presenting trivial content for the consumers, such as greeting them by name or recommending a book similar to the one the person has bought recently. As another example, a popular website personalization.com (or its alias personalizationmall.com) provides personalized engravings on various
items ranging from children's backpacks to personalized beer mugs. These examples constitute cases of trivial (Hagen, 1999) [shallow or cosmetic (Gilmore and Pine, 1997)] personalization. In contrast to this, if offerings are actively tailored to individuals based on rich knowledge about their preferences and behavior, then this constitutes smart (or deep) personalization (Hagen, 1999).

Continuing this categorization further, Paul Hagen classifies personalization applications into the following four categories, described with the 2 × 2 matrix shown in Fig. 3 (Hagen, 1999). According to Fig. 3 and Hagen (1999), one classification dimension constitutes consumer profiles that are classified into rich vs. poor. Rich profiles contain comprehensive information about consumers and their behavior of the type described in Section 2 and further explained in Section 5. Poor profiles capture only partial and trivial information about consumers, such as their names and basic preferences. The second dimension of the 2 × 2 matrix in Fig. 3 constitutes tailoring (customization) of the offerings. According to Hagen (1999), the offerings can be tailored either reactively or proactively. Reactive tailoring takes already existing knowledge about consumers' preferences and "parrots" these preferences back to them without producing any new insights about potentially new and interesting offerings. In contrast, proactive tailoring takes consumer preferences stored in consumer profiles and generates new useful offerings by using innovative matchmaking methods to be described in Section 5.

Fig. 3. Classification of personalization applications (Hagen, 1999):
                  Reactive tailoring       Proactive tailoring
  Rich profile    Lazy personalizers       Smart personalizers
  Poor profile    Trivial personalizers    Overeager personalizers

Using these two dimensions, Hagen (1999) classifies personalization applications into:

- Trivial personalizers: These applications have poor profiles and provide reactive targeting. For example, a company can ask many relevant questions about consumer preferences, but would not use this knowledge about them to build rich profiles of the customers and deliver truly personalized and relevant content. Instead, the company insults its customers by ignoring their inputs and delivering irrelevant marketing messages or doing cosmetic personalization, such as greeting the customers by name.
- Lazy personalizers: These applications build rich profiles, but do only reactive targeting. For example, an online drugstore can have rich information about a customer's allergies, but miss or even ignore this information when recommending certain drugs to patients. This can lead to recommending drugs causing allergies in patients, although the allergy information is contained in the patients' profiles.
- Overeager personalizers: These applications have poor profiles but make proactive targeting of their offerings. This can often lead to poor results because of the limited information about consumers and faulty assumptions about their preferences. Examples of these types of applications include recommending books similar to the ones the consumer bought recently and various types of baby products to a woman who recently had a miscarriage.
- Smart personalizers: These applications use rich profiles and provide proactive targeting of the offerings. For example, an online gardening website may warn a customer that the plant she just bought would not grow well in the climate of the region where the customer lives. In addition, the website would recommend alternative plants based on the customer's preferences and past purchases that would fit better the climate where the customer lives.

On the basis of this classification, Hagen (1999), obviously, argues for the need to develop smart personalization applications by building rich profiles of consumers and actively tailoring personalized offerings to them. At the heart of smart personalization lie two problems: (a) how to build rich profiles of consumers and (b) how to match the targeted offerings to these profiles well. Solutions to these two problems will be discussed further in Section 5.

3.5 Intrusive vs. non-intrusive personalization

Tailored offerings can be delivered to the consumer in an automated manner without distracting her with questions and requests for information and preferences. Alternatively, the personalization engine can ask the consumer various questions in order to provide better offerings. For example, Amazon.com, Netflix and other similar systems that recommend various products and services to individual consumers ask these consumers for some initial set of ratings of the products and services before providing recommendations regarding them. Also, when a multidimensional recommender system wants to provide a recommendation in a specific context, such as recommending a movie to a person who wants to see it with his girlfriend on Saturday night in a movie theater, the system would first ask (a) when he wants to see the movie, (b) where and (c) with whom before providing a specific recommendation (Adomavicius et al., 2005). Such personalization systems are intrusive in the sense that they keep asking consumers questions before delivering personalized offerings to
them, and the levels of consumer involvement can be very significant in some cases. Alternatively, personalization systems may not ask consumers explicit questions, but non-intrusively learn consumer preferences from various automated interactions with them. For example, the amount of time a consumer spends reading a newsgroup article can serve as a proxy of how much the consumer is interested in this article. Clearly, non-intrusive personalization systems are preferable from the consumer point of view, but they may provide less accurate recommendations. Studying the tradeoffs between intrusive and non-intrusive personalization systems and determining optimal levels of intrusiveness in various personalization applications constitutes an interesting and important research problem. This problem has already been studied by several researchers in the context of recommender systems. In particular, Oard and Kim (1998) described several ways of obtaining implicit feedback for recommender systems. The methods of minimizing the number of intrusive questions for obtaining user ratings in recommender systems have also been studied in Pennock et al. (2000), Rashid et al. (2002), Boutilier et al. (2003), Montgomery and Srinivasan (2003) and Yu et al. (2004).
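As a minimal sketch of such non-intrusive preference learning (the reading-speed assumption and the clipping rule are hypothetical and are not taken from the studies cited above), observed reading time can be converted into an implicit interest score:

```python
def implicit_interest(seconds_spent, article_length_words, words_per_minute=250):
    """Turn dwell time on an article into an implicit interest score in [0, 1].

    The observed reading time is compared with the time an average reader
    would need for the article; longer relative reading time means more interest.
    """
    expected_seconds = max(article_length_words / words_per_minute * 60.0, 1.0)
    score = seconds_spent / expected_seconds
    return max(0.0, min(1.0, score))

# A consumer spending 90 seconds on a 500-word article gets a score of 0.75.
print(implicit_interest(seconds_spent=90, article_length_words=500))
```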
3.6 Static vs. dynamic personalization

Personalization applications can be classified in terms of who selects and delivers the offerings and how this is done. On the one hand, the offerings can be selected and delivered dynamically by the personalization system. For example, the system may be monitoring the activities of the consumer and the environment and dynamically deciding to change the content of the web pages for the consumer based on her activities and the changes in the environment. One promising type of dynamic personalization is ubiquitous personalization based on mobile location-based services (LBS) (Rao and Minakakis, 2003) that deploy various types of wireless technologies to identify the location and other types of contextual information, such as the current time, the consumer's schedule and the purpose of the trip, in order to provide dynamic personalized services to the consumer based on this contextual information and the consumer profile. Examples of these LBS include suggestions of various shops, restaurants, entertainment events and other points of interest in the geographical and temporal vicinity of the consumer.

On the other hand, the offerings delivered to the consumer can be selected either by the consumer herself or by a system administrator who has selected a fixed set of business rules governing the delivery of the offerings to specific segments of the consumers. In this case, the selection is done statically and can be changed only by the consumer or the system administrator, depending on the case.
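A minimal sketch of such static, administrator-defined selection is shown below; the segment attributes, rule conditions and offerings are hypothetical, and the rule set only changes when the administrator (or the consumer) edits it:

```python
# Hypothetical business rules: each rule maps a condition on the consumer
# profile to a fixed offering, evaluated in the order chosen by the administrator.
RULES = [
    (lambda c: c["segment"] == "new_visitor",        "welcome_discount"),
    (lambda c: c["segment"] == "frequent_buyer",     "loyalty_coupon"),
    (lambda c: c["days_since_last_purchase"] > 180,  "win_back_offer"),
]

def select_offering(consumer_profile, default="standard_homepage"):
    """Return the offering of the first matching rule, or a default offering."""
    for condition, offering in RULES:
        if condition(consumer_profile):
            return offering
    return default

print(select_offering({"segment": "frequent_buyer", "days_since_last_purchase": 12}))
```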
Obviously, the dynamic selection of offerings is more flexible and preferable to the static selection process. On the other hand, it should be done in a smart way, as described in Section 3.4 above, to avoid substandard performance of the personalization system.

In summary, we discussed various types of personalization in this section, following the dictum of David Smith that "there are myriad ways to get personal" (Smith, 2000). Therefore, specific types of personalization approaches need to be selected carefully depending on the particular personalization application at hand and on the goals that this application tries to accomplish, such as the ones described in Section 2.
4 When does it pay to personalize?
One of the reasons why personalization has its share of successes and disappointments is that it does not always make sense to personalize, for both technical and economic reasons. One such technical reason is that the provision of personalized offerings can lead to questionable outcomes that do not benefit, or even worse, insult the consumer. For example, an online maternity store can start recommending various types of baby products to a woman who bought maternity clothes for herself a few months earlier without realizing that she had recently had a miscarriage. One of the fundamental assumptions in personalization is that of the stability of consumer preferences and the assumption that past consumer activities can be used to predict their possible future preferences and actions. As the previous example clearly demonstrates, this assumption does not hold in some cases. In those cases, the dangers of personalization and the risks of falling into the de-personalization trap (to be discussed in Section 5) may outweigh the potential benefits of personalization, thus making it impractical.

On the economic side, proponents of personalization first need to build a strong business case before launching a personalization project. At the most general level, personalization should be done when the benefits derived from a personalization project exceed its costs for both providers and consumers of personalized offerings. Otherwise, one of the parties will refuse to participate in the personalization project. In the rest of this section, we examine the costs vs. benefits tradeoff for both providers and consumers of personalized offerings.

Consumers. From the consumers' perspective, the benefits of personalization constitute more relevant offerings delivered by the providers at the most opportune moments. One problem with these types of benefits is that it is hard to measure their effects, as will be discussed in Section 5. The costs of personalization consist of two parts for the consumers: direct and indirect. The direct costs are subscription costs paid by the consumers. For
the provider-centric personalization, personalization services are usually provided for free and, therefore, the direct costs for the consumers are usually zero. In case of the consumer-centric personalization, consumers should pay for these services, as discussed in Section 3.1, and these fees constitute the direct costs for the consumers. Indirect costs to the consumers include time and cognitive efforts of installing and configuring personalization services, and the privacy and security issues associated with these services. As a part of the subscription, the consumers should provide certain personal information to the personalization service providers, and there are always some risks that this personal information can be misused by the providers. As in the case of benefits, these indirect costs are also hard to measure. Providers. For the providers, the key question is whether or not they should customize their offerings and if yes, then to what degree and scope. The decision to personalize its offerings or not depends on the tradeoff between the personalization costs vs. the benefits derived by the providers from personalized offerings to the consumers. We will now examine these costs and benefits. Customization does not come for free since it requires additional costs to customize offerings in most of the cases, especially in the case of customized products that need to be manufactured. Also, the more personalized an offering is, the more customization is usually required. For example, one matter is to make a certain type of shoe in 20 different varieties depending on the color, foot size and width, and a completely different and more expensive proposition is to manufacture a personal pair of shoes for a specific customer. In general, the more customized an offering is and the smaller the targeted segment, the more costly the manufacturing process becomes. In the limit, manufacturing for the segment of one is the most expensive, and it requires stronger business justification to adopt this option (Zipkin, 2001). One interesting research question is whether firms should customize their products based on one or multiple attributes and whether different firms should select the same or different attributes for customization purposes. In Syam et al. (2005), it is shown that it is better for the firms to select only one and the same attribute as a basis for customization. This problem was further explored in Ghose and Huang (2007). Moreover, it is also important to know how customization of products and prices affect each other. In Ghose and Huang (2006), it is shown that if the fixed costs of personalization are low, firms are always better off personalizing both prices and products. Shaffer and Zhang (2002) also show that similar effects can arise if firms are asymmetric in market share. As for the consumers, personalization costs for the providers consist of direct and indirect costs. The direct costs are associated with extra efforts required to customize personalized offerings, whereas indirect costs are associated with the potential problems pertaining to providing personalized
solutions, such as privacy-related and legal costs. For example, Doubleclick and some other personalization companies had to deal with legal challenges pertaining to privacy, had to incur significant legal costs and subsequently decided to abstain from certain types of personalization.

Benefits derived from providing personalized offerings include:

- Premium prices charged for these offerings under certain competitive economic conditions (Chen et al., 2001; Ghose and Huang, 2007; Shaffer and Zhang, 2002; Syam et al., 2005). For example, a shoe manufacturer can charge premium prices for custom-made shoes in many cases.
- Additional customer satisfaction, loyalty and higher retention rates resulting in higher LTV values for the customers and less churn.
- Achieving higher conversion rates from prospective to real and to loyal customers.
- Achieving higher average revenue levels per customer via cross- and up-selling capabilities.

Unfortunately, as discussed in Section 5, some of these benefits are hard to measure. Therefore, it is often hard to produce exact numbers measuring personalization benefits. To deal with this problem, Rangaswamy and Anchel (2003) proposed a framework where the decision to personalize or not for providers is measured in terms of the tradeoffs between the customization costs incurred and the heterogeneity of consumers' wants. Rangaswamy and Anchel (2003) present a 2 × 2 matrix having dimensions "customization costs" and "heterogeneity of consumer wants" and classify various manufacturing products into the quadrants of this matrix. Such products as mutual funds, music and similar types of digital products have low customization costs, while consumer wants for these products are very heterogeneous. Therefore, these types of products are primary candidates for personalization, as is witnessed by the personalized Internet radio station Pandora (www.pandora.com). On the other end of the spectrum are such products as cars, photocopiers and MBA programs. Customization costs for such products are high, whereas consumer wants are significantly more homogeneous than for the other types of products. Therefore, it is less attractive for the providers to personalize such products. An interesting situation happens for the class of products where consumer wants and customization costs are in between these two extremes, i.e., they are not too high and not too low. According to Rangaswamy and Anchel (2003), examples of such products include clothes, certain food items, computers, watches, etc. Therefore, we see certain personalization efforts for these products, such as certain customized clothes (e.g., jeans), foods prepared for individual consumers and customized computers (e.g., Dell), while still none for others (e.g., mass-produced clothes, watches, etc.).
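As a back-of-the-envelope sketch of this cost vs. benefit reasoning (with purely hypothetical numbers), personalization makes business sense only when the expected net benefit is positive for both parties:

```python
def worth_personalizing(provider_benefit, provider_cost, consumer_benefit, consumer_cost):
    """Personalize only if expected benefits exceed costs for BOTH provider and consumer."""
    return (provider_benefit - provider_cost) > 0 and (consumer_benefit - consumer_cost) > 0

# Hypothetical example: extra margin vs. customization cost for the provider,
# perceived value of relevant offerings vs. time/privacy cost for the consumer.
print(worth_personalizing(provider_benefit=12.0, provider_cost=5.0,
                          consumer_benefit=3.0, consumer_cost=1.0))
```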
In summary, only when the benefits of personalization exceed its costs for both providers and consumers of personalized offerings does it make business sense to personalize, which happens only for certain types of offerings and usually on a case-by-case basis. Moreover, it is difficult to measure the costs and benefits of personalization in many cases. Therefore, personalization decisions are often hard to make in real business settings, and they require careful cost-benefit analysis and evaluation.
5 Personalization process
As was argued by Adomavicius and Tuzhilin (2001b), personalization should be considered as an iterative process consisting of several stages that are integrated together into one tight system. In particular, Adomavicius and Tuzhilin (2001b) proposed the following five stages: (a) collecting customer data, (b) building customer profiles using this data, (c) matchmaking customized offerings to specific customer profiles to determine the most relevant offerings to individual customers, (d) delivery and presentation of customized information and offerings through the most relevant channels, at the most appropriate times and in the most appropriate form and (e) measuring customer responses to the delivered offerings. Moreover, Adomavicius and Tuzhilin (2001b) argued for the necessity of a feedback loop mechanism that takes customers’ responses to the current personalization solution, transfers appropriate information to the earlier stages of the personalization process, and adjusts, improves and corrects various activities in these earlier stages that cause poor responses from the customers. This approach of viewing personalization as a process was further developed by Murthi and Sarkar (2003), who partitioned the personalization process into the following three stages: (a) learning customer preferences, (b) matching offerings to customers’ preferences and (c) evaluation of the learning and matching processes. Murthi and Sarkar (2003) also placed personalization within the firm’s overall Value Net framework and connected it to the general business strategy of the firm. Subsequently, Adomavicius and Tuzhilin (2005a) extended and refined the previous approaches by proposing the Understand–Deliver–Measure (UDM) framework, according to which the personalization process is defined in terms of the UDM cycle consisting of the following stages as shown in Fig. 4: Understand consumers by collecting comprehensive information about them and converting it into actionable knowledge stored in consumer profiles. The output of this stage is a consumer-centric DW (Kimball, 1996) and the consumer-centric modelbase (Liu and Tuzhilin, 2008). The consumer-centric DW stores factual profiles of each consumer.
Fig. 4. Personalization process: the Understand–Deliver–Measure cycle implemented through six stages (Data Collection, Building Consumer Profiles, Matchmaking, Delivery and Presentation, Measuring Personalization Impact, and Adjusting Personalization Strategy) connected by a feedback loop.
The consumer-centric modelbase stores data mining and statistical models describing behavior of individual consumers. Collectively, factual profile and the collection of data mining models of the consumer form the consumer profile. Deliver customized offering based on the knowledge about each consumer C, as stored in the consumer profiles, and on the information about the space of offerings O. The personalization engine should find the customized offerings from the space O that are the most relevant to each consumer C within the specified context and deliver them to C in the best possible manner, including at the most appropriate time(s), through the most appropriate channels and in the most appropriate form. These customized offerings constitute marketing outputs of the personalization process. Measure personalization impact by determining how much the consumer is satisfied with the marketing outputs (in the form of delivered personalized offerings). It provides information that can enhance our understanding about consumers or point out the deficiencies of the methods for personalized delivery. Therefore, this additional information serves as a feedback for possible improvements to each of the other components of personalization process. This feedback information completes one cycle of the personalization process, and sets the stage for the next cycle where improved personalization techniques can make better personalization decisions. More recently, Vesanen and Raulas (2006) presented an alternative approach to describing the personalization process that consists of interaction,
processing, customization and delivery stages. In addition, Vesanen and Raulas (2006) explicitly introduced four objects into their framework: customers, customer data, customer profiles and marketing outputs, and showed how the aforementioned four stages are connected to these four objects. In particular, they described how customer data is obtained from the customers via interactions with them and from external sources, then how it is preprocessed into the customer profiles, and then how marketing outputs are customized based on the profiling information. Vesanen and Raulas (2006) also argue for the importance of integrating various personalization stages and describe possible problems arising from improper integration of various stages of the personalization process and the existence of "discontinuity points." Finally, Vesanen and Raulas (2006) present a case study describing how the described personalization process was implemented in a direct marketing company.

Although each of the described approaches covers different aspects of the personalization process, we will follow below the modified UDM model from Adomavicius and Tuzhilin (2005a) that is schematically described in Fig. 4, because we believe that this modified UDM model covers all the aspects of the personalization process. For example, the four personalization stages presented in Vesanen and Raulas (2006) are closely related to the six stages of the personalization process presented in Fig. 4. The UDM framework described above constitutes a high-level conceptual description of the personalization process. The technical implementation of the UDM framework consists of the following six stages (Adomavicius and Tuzhilin, 2005a) presented in Fig. 4.

Stage 1: Data Collection. The personalization process begins with collecting data across different channels of interaction between consumers and providers (e.g., Web, phone, direct mail and other channels) and from various other external data sources with the objective of obtaining the most comprehensive "picture" of a consumer. Examples of the "interactions" data include browsing, searching and purchasing data on the Web, direct mail, phone and email interactions data, and various demographic and psychographic data collected through filling in various online and offline forms and surveys. Examples of external data can be economic, industry-specific, geographic and census data either purchased or obtained from external sources through means other than direct interactions with the consumer.

Stage 2: Building Customer Profiles. Once the data is collected, one of the key issues in developing personalization applications is integrating this data and constructing accurate and comprehensive consumer profiles based on the collected data. Many personalization systems represent consumer profiles in terms of a collection of facts about the consumer. These facts may include the consumer's demographics, such as name, gender, date of birth and address. The facts can also be derived from the past transactions of a consumer,
e.g., the favorite product category of the consumer or the value of the largest purchase made at a Web site. As explained in Section 2, this simple factual information about the consumer can be stored as a record in a relational database or a consumer-centric DW. Also, more complicated factual information, such as the information about the social network of a person and his or her relationships and interactions with other consumers, may require the use of taxonomies and ontologies and can be captured using XML or special languages for defining ontologies (Staab and Studer, 2003), such as OWL (Antoniou and Harmelen, 2003).

However, such factual profiles containing collections of facts may not be sufficient in certain more advanced personalization applications, including high-precision personalized content delivery and certain advanced recommendation applications. Such applications may require the deployment of more advanced profiling techniques that include the development of data mining and statistical models capturing various aspects of the behavior of individuals or segments of consumers. These consumer models may include predictive data mining models, such as decision trees, logistic regressions and Support Vector Machines (SVMs), predicting various aspects of consumer behavior. These models can be built either for individuals or segments of consumers. The tradeoff between individual and segment-based models lies in the idiosyncrasy of individual models vs. the lack of sufficient amounts of data to build reliable predictive models (Jiang and Tuzhilin, 2006a). As was shown in Jiang and Tuzhilin (2006a), for the applications where individual consumers performed many transactions and it is possible to build reliable individual predictive models, individual models dominate the segment-based models of consumers. In contrast, in low-frequency applications micro-segmentation models outperform individual models of consumers, assuming consumers are grouped into segments using high-quality clustering methods.

In addition to the predictive models, profiles may also include descriptive models of consumer behavior based on such data mining methods as descriptive rules (including association rules), sequential and temporal models and signatures (Adomavicius and Tuzhilin, 2005a). An example of a rule describing a consumer's movie viewing behavior is "John Doe prefers to see action movies on weekends" (i.e., Name = "John Doe" & MovieType = "action" → TimeOfWeek = "weekend"). Such rules can be learned from the transactional history of the consumer (e.g., John Doe in this case) using the techniques described in Adomavicius and Tuzhilin (2001a). Consumer profiles can also contain important and frequently occurring sequences of the consumer's most popular activities, such as sequences of Web browsing behavior and various temporal sequences. For example, we may want to store in John Doe's profile his typical browsing sequence "when John Doe visits the book Web site XYZ, he usually first accesses the home page, then goes to the Home&Gardening section of the site, then browses the
Gardening section and then leaves the Web site" (i.e., XYZ: StartPage → Home&Gardening → Gardening → Exit). Such sequences can be learned from the transactional histories of consumers using frequent episodes and other sequence learning methods (Hand et al., 2001). Finally, consumer profiles can also contain signatures of consumer behavior (Cortes et al., 2000), which are data structures used to capture the evolving behavior learned from large data streams of simple transactions (Cortes et al., 2000). For example, "top 5 most frequently browsed product categories over the last 30 days" would be an example of a signature that could be stored in individual consumer profiles in a Web store application.

In summary, besides factual information about consumers, their profiles can also contain various data mining and statistical models describing consumer behavior, such as predictive, descriptive rule-based, sequential and temporal models and signatures. All this consumer profiling information can be stored in two types of repositories:

- Consumer-centric DW (Kimball, 1996), where each consumer has a unique record or a taxonomy containing demographic and other factual information describing his or her activities.
- Consumer-centric modelbase (Liu and Tuzhilin, 2008) containing one or several models describing different aspects of the behavior of a consumer. As explained before, a model can be unique for a consumer or a segment of consumers, and can be organized and stored in the modelbase in several different ways (Liu and Tuzhilin, 2008). However, each consumer should have a list of all the models describing the behavior of that consumer that is easily accessible and managed. Collectively, the set of all the models of all the consumers forms a modelbase, and it is organized and managed according to the principles described in Liu and Tuzhilin (2008).

Stage 3: Matchmaking. Once the consumer profiles are constructed, personalization systems must be able to match customized offerings to individuals or segments of consumers within a certain context, such as shopping for yourself vs. for a friend, based on the consumer profiling information obtained in Stage 2 and on the information about the space of offerings O. The matchmaking process should find customized offerings from the space O that are the most relevant to each consumer C within the specified context. Before describing the matchmaking process, we first need to clarify the following concepts.

1. Space of offerings O: This space has a certain structure that varies significantly among the offerings. For example, in case of dynamic prices, the space of offerings O can consist of a range of possible prices (e.g., from $10 to $100), whereas for content management systems presenting personalized content in the form of dynamically generated pages, links and
other content, the space O can consist of a complex structure with a certain taxonomy or ontology specifying granular and hierarchical content for a particular application.[2] For example, space O for the book portion of the Amazon website needs to specify a taxonomy of books (such as the one specified on the left-hand side of the home page of Amazon's book section and containing a classification of books based on categories, such as arts & entertainment, business & technology, children's books, fiction, travel, and subcategories, such as travel to Africa). Besides the book taxonomy, the home page of Amazon's book section has granular content containing various templates, sections and slots that are filled with specific content. Some examples of these sections include the middle section for the most interesting and appropriate books for the customer, the bargain offers section, the recent history section at the bottom and so on, with each section having its own structure and taxonomy. In summary, all these granular offerings need to be organized according to some taxonomy, which should be hierarchical in structure with complex relationships among its various components. The problem of specifying space O and organizing online content is a part of the bigger content management problem, which has been studied in Sheth et al. (2002). Defining appropriate taxonomies and ontologies of offerings for optimal targeting to consumers constitutes a challenging research problem for certain types of offerings, such as online content, and needs to be carefully studied further.

[2] Ontology is a more general concept than taxonomy and includes representation of a set of concepts, such as various types of offerings, and different types of relationships among them. However, it is more difficult to support fully fledged ontologies in the matchmaking process, and taxonomies of offerings (as well as of consumers, discussed below) may constitute a reasonable compromise.

2. Space of consumers: In addition to offerings, we need to build an ontology or a taxonomy of consumers by categorizing them according to one or more methods. For example, consumers can be categorized based on geography, occupation, consumption and spending patterns. Each of these dimensions can have a complex hierarchical or other structure, such as the geographic dimension divided into country, region, state, city, zip and other categories. One of the ways to categorize the consumers is to partition them into some segmentation hierarchy (Jiang and Tuzhilin, 2006a). For each segment, one or several models can be built describing the behavior of this segment of consumers. As explained in Stage 2 above, these models are part of consumer profiles. More generally, profiles can be built not only for individuals but also for various segments of consumers. Also, we can support more complex ontologies of consumers that incorporate their social networks and other relationships among themselves and with the products and services in which they may be interested, including various types of reviews and opinions. The problem of defining and building appropriate consumer ontologies and taxonomies, including social networks, for
optimal targeting of customized offerings constitutes an interesting research question.

3. Context: Personalization systems can deliver significantly different customized offerings to consumers depending on the context in which these offerings are made. For example, if an online book retailer knows that a consumer looks for a book for a course that she takes at a university, a different type of offering will be provided to her than in the case when she looks for a gift for her boyfriend. Defining and specifying the context can significantly improve the personalization results, as was shown in Adomavicius et al. (2005) and Gorgoglione et al. (2006). Moreover, the more specific the context and the more individualized the models that are built, the more this context matters for better customizing offerings to the consumers (Gorgoglione et al., 2006).

Given these preliminary concepts, the matchmaking process can be defined as follows. For a given context and a specified segment of consumers, (a) find the appropriate granularity level in the taxonomy associated with the offerings space O at which the offerings should be made and (b) select the best tailoring of the offering at that granularity level. For example, assume that a female high school teacher, 30–35 years old, from New York is buying a book online for herself. Then the personalization engine should figure out which books should be placed on the starting page of a female high school teacher from New York in the specified age category. It should also identify how many books should be placed and how to reorder various categories of books in the list to personalize the list for the teacher and her segment in general. The personalization engine may also expand some categories of books most suitable for the teacher into subcategories to make the book selection process more convenient for her. A related issue is how often the online retailer should send the teacher emails with various book offerings. Note that this tailoring is done for a segment of consumers (with the specified characteristics) and within a particular context (personal purchase). The special case is when this matchmaking is done for individual consumers, i.e., for segments of one.

The answer to this question depends on various factors, including the goals of personalization: what goals we want to accomplish with this particular customization of offering(s). One such goal is to maximize the utility of the offering o in O for the segment of consumers s in the context c, U(o, s, c), i.e., we want to select the offering o that maximizes the utility U(o, s, c) for the given context c and the consumer segment s. As we said before, the special case of this problem is when the segment s consists of a single consumer.

The tailoring process can be of two types:

- It requires manufacturing processes with the appropriate time delays, delivery issues and costs incurred to customize the offering(s).
Examples of such customized offerings include customized jeans, shoes, CD records and personal computers.

- It does not require any manufacturing and only needs to deal with selection and configurability issues, such as the selection of appropriate books to display on a website or the generation of personalized web pages and other online content. Such customization can be done in real time with negligible costs.

Although both problems are important, we will focus on the latter one in the rest of this section since the first one is a large subject on its own and can constitute a separate stand-alone paper. Although some of the matchmaking principles are common across all the targeted offerings and applications, such as maximizing the utility of the offering o, U(o, s, c), other matchmaking methods depend critically on the particular offering and/or application. For example, website personalization has its own set of matchmaking algorithms that are quite different from recommending books and personalizing product prices for consumers. These differences come in part from using different objective functions and dealing with different structures of the offering space O across these applications. For example, in case of recommending books, one of the possible objectives is to maximize the predictive accuracy of a recommendation. In contrast to this, one of the objectives of website personalization is to maximize the navigational simplicity of the website. Because of these different objectives and different offering spaces, the matchmaking approaches can be quite different.

There are many matchmaking technologies proposed in the literature, including recommender systems, statistics-based predictive approaches and rule-based systems, where an expert specifies business rules governing the delivery of content and services that depend on the conditions specified in the antecedent part of the rule. In particular, several industrial personalization solutions, initially developed by BroadVision and subsequently integrated into various personalization servers, support rule-based matchmaking, where the rules are defined by a domain expert. For example, a marketing manager may specify the following business rule: if a consumer of a certain type visits the online grocery store on a Sunday night, then this consumer should be shown the discount coupons for diapers.

There has been much work done on developing various recommendation-based matchmaking technologies over the past decade since the appearance of the first papers on collaborative filtering in the mid-1990s (Hill et al., 1995; Resnick et al., 1994; Shardanand and Maes, 1995). These technologies are based on a broad range of different approaches and feature a variety of methods from such disciplines as statistics, machine learning, information retrieval and human–computer interaction. Moreover, these methods are often classified into broad categories according to their recommendation approach as well as their algorithmic technique. In particular, Balabanovic
and Shoham (1997) classify these methods based on the recommendation approach as follows:

- Content-based recommendations: the consumer is recommended items (e.g., content, services, products) similar to the ones the consumer preferred in the past. In other words, content-based methods analyze the commonalities among the items the consumer has rated highly in the past. Then, only the items that have high similarity with the consumer's past preferences would get recommended.
- Collaborative recommendations (or collaborative filtering): the consumer is recommended items that people with similar tastes and preferences liked in the past. Collaborative methods first find the closest peers for each consumer, i.e., the ones with the most similar tastes and preferences. Then, only the items that are most liked by the peers would get recommended.
- Hybrid approaches: these methods combine collaborative and content-based methods. This combination can be done in many different ways, e.g., separate content-based and collaborative systems are implemented and their results are combined to produce the final recommendations. Another approach would be to use content-based and collaborative techniques in a single recommendation model, rather than implementing them separately.

Classifications based on the algorithmic technique (Breese et al., 1998) are:

- Heuristic-based techniques, which calculate recommendations using heuristics based on the previous transactions made by the consumers. An example of such a heuristic for a movie recommender system could be to find consumer X whose taste in movies is the closest to the tastes of consumer Y, and recommend to consumer Y everything that X liked that Y has not yet seen.
- Model-based techniques, which use the previous transactions to learn a model (usually using a machine learning or a statistical technique), which is then used to make recommendations. For example, based on the movies that consumer X has seen, a probabilistic model is built to estimate the probability of how consumer X would like each of the yet unseen movies.

These two classifications are orthogonal and give rise to six classes of matchmaking methods corresponding to six possible combinations of these classifications. Adomavicius and Tuzhilin survey various recommendation methods within the specified framework in Adomavicius and Tuzhilin (2005b), and the interested reader is referred to this article.

Although there has been much work done on developing different matchmaking methods, most of them do not address certain issues that are crucial for personalization technologies to be successfully deployed in real-life applications, such as not fully considering contextual information, working
Although much work has been done on developing different matchmaking methods, most of them do not address certain issues that are crucial for personalization technologies to be successfully deployed in real-life applications: they do not fully consider contextual information, they work only with single-criterion ratings, and they do not fully address explainability, trustworthiness, privacy and other issues. A detailed list of limitations of the current generation of recommender systems, and a discussion of possible approaches to overcoming these limitations, is presented in Adomavicius and Tuzhilin (2005b). How to address privacy issues in personalization is discussed in Kobsa (2007).

Many commercial ventures have implemented recommender systems over the past several years to provide useful recommendations to their customers. Examples of such companies include Amazon (books, CDs and other products), Tivo (TV programs), Netflix and Yahoo! (movies), Pandora and eMusic (music) and Verizon (phone service plans and configurations). Despite all the recent progress in developing successful matchmaking methods, "smart matchmaking" remains a complex and difficult problem, and much more work is required to advance the state-of-the-art and achieve better personalization results. To advance the state-of-the-art in recommender systems, Netflix launched a $1 million prize competition in October 2006 to improve its recommendation methods so that the recommendations would achieve better performance results (Bennett and Lanning, 2007). This competition and other related activities further reinvigorated research interest in recommender systems, as demonstrated by the launch of the ACM Conference on Recommender Systems (RecSys) in 2007.

Stage 4: Delivery and Presentation. As a result of matchmaking, one or several customized offerings are selected for the consumer. Next, these offerings should be delivered and presented to the consumer in the best possible manner, i.e., at the most appropriate time(s), through the most appropriate channels and in the most appropriate form, such as lists of offerings ordered by relevance or other criteria, through visualization methods, or using narratives. These customized offerings, when delivered to the consumer, constitute the marketing outputs of the personalization process. One classification of delivery methods is into pull, push and passive methods (Schafer et al., 2001). Push methods reach a consumer who is not currently interacting with the system, e.g., by sending an email message. Pull methods notify consumers that personalized information is available but display this information only when the consumer explicitly requests it. Passive delivery displays personalized information as a by-product of other activities of the consumer, such as up- and cross-selling activities; for example, while looking at a product on a website, a consumer also sees recommendations for related products. The problem of selecting the most appropriate delivery methods, including the choice of push, pull or passive methods and the determination of the most appropriate times, channels and forms, constitutes an interesting and underexplored problem of personalization.
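A minimal sketch of how the push/pull/passive distinction might be encoded in a delivery component follows; the toy triggering policy is an assumption made for illustration and is not part of the cited classification.

```python
from enum import Enum

class Delivery(Enum):
    PUSH = "push"        # reach the consumer outside a session, e.g., via e-mail
    PULL = "pull"        # notify that personalized content exists; show it on request
    PASSIVE = "passive"  # display as a by-product of the consumer's current activity

def choose_delivery(in_session: bool, explicit_request: bool) -> Delivery:
    """Toy policy: passive while browsing, pull on explicit request, push otherwise."""
    if in_session:
        return Delivery.PULL if explicit_request else Delivery.PASSIVE
    return Delivery.PUSH

print(choose_delivery(in_session=True, explicit_request=False))  # Delivery.PASSIVE
```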
Stage 5: Measuring Personalization Impact. In this step, it is necessary to evaluate the effectiveness of personalization using various metrics, such as accuracy, consumer LTV, loyalty value, and purchasing and consumption experience metrics. The most commonly used metrics for measuring personalization impact are accuracy-related metrics, i.e., they measure how much the consumer liked a specific personalized offering, e.g., how accurate and relevant the recommendation was (Breese et al., 1998; Pazzani, 1999). Although important and the most widely used at present, accuracy-based metrics are quite simplistic and do not capture more complex and subtle aspects of personalization. Therefore, attempts have been made to study more general aspects of personalization effectiveness by advocating the use of more advanced and comprehensive personalization metrics, such as consumer LTV, loyalty value, purchasing and consumption experience, and other metrics based on the return on consumer (Peppers and Rogers, 2004, Chapter 11). However, these constitute only initial steps, and clearly much more work is required to develop better and more practical ways to measure the impact of personalization.
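As an example of the accuracy-oriented evaluation mentioned above, predicted ratings are often compared against observed ratings on held-out data using measures such as the mean absolute error (MAE) and root mean squared error (RMSE). A minimal sketch with made-up numbers:

```python
# Hypothetical held-out ratings and the system's predictions for the same items.
actual    = [4.0, 2.0, 5.0, 3.0]
predicted = [3.5, 2.5, 4.0, 3.0]

n = len(actual)
mae  = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
rmse = (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5

print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")  # MAE = 0.50, RMSE = 0.61
```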
Stage 6: Adjusting Personalization Strategy. Finally, after the personalization impact is measured, these metrics can be used to improve each of the other five stages of the personalization process. If we are not satisfied with the measurement results, we need to identify the causes of the poor performance and adjust some of the previously discussed methods in the earlier stages, based on the feedback loops presented in Fig. 4. In other words, if the performance metrics suggest that the personalization strategy is not performing well, we need to understand whether this is caused by poor data collection, inaccurate consumer profiles, or poorly chosen techniques for matchmaking or content delivery. After identifying possible sources of the problem, it is necessary to fix it through a feedback mechanism. Alternatively, we may determine that the selected performance metric measures the wrong indicators, ones that are irrelevant for the personalization application, and needs to be replaced with more relevant metric(s). This was called the feedback integration problem in Adomavicius and Tuzhilin (2005a), since it determines how to adjust different stages of the personalization process based on the feedback from the performance measures.

For example, assume that a personalization system delivers recommendations of restaurants to consumers and does so poorly, so that the recommendation performance measures described in Stage 5 above remain low. In Stage 6, we need to examine the causes of this poor performance and identify which of the prior stages are responsible. Poor recommendation results might be due to poorly collected data in Stage 1, such as an incomplete list of restaurants available for recommendation purposes or insufficient information about these restaurants (e.g., absence of information about their chefs, or absence of consumer reviews and comments about the restaurants). Alternatively, the data about the consumers may be insufficient and need to be enhanced. Further, consumer profiles may have been poorly constructed in Stage 2 and need to be readjusted or completely rebuilt; for example, it may be the case that we did not include the list of a person's favorite websites or the list of friends in the person's profile, thus cutting off access to the consumer's social network and decreasing the quality of recommendations. Finally, we may need to re-examine the recommendation algorithm selected in Stage 3 or consider deploying a different one that can achieve better performance results. All these are examples of how we can adjust the chosen personalization solution in order to achieve better performance results. Note that the feedback integration problem is a recursive one: even if we are able to identify the underperforming stages of the personalization process, we may still face similar challenges when deciding on the specific adjustments within each stage. For example, if we need to improve the data collection phase of the personalization process, we would have to decide whether we should collect more data, collect different data, or simply use better data preprocessing techniques.

If this feedback is properly integrated in the personalization process, the quality of interactions with individual consumers, as measured by the metrics discussed above, should grow over time, resulting in the virtuous cycle of personalization.3 If this virtuous cycle is achieved, then personalization becomes a powerful process of delivering ever-increasing value to the stakeholders. This virtuous cycle is not only essential for improving the personalized service over time; it is also crucial for keeping the personalization system up to date with a constantly changing environment, e.g., for adjusting to changes in the tastes and preferences of individual customers and to changes in product offerings.

The opposite of the virtuous cycle is the process of de-personalization (Adomavicius and Tuzhilin, 2005a). It can occur when the metrics of consumer satisfaction are low from the start, when they decrease over time, or when the system cannot adjust in time to the changing environment. In either case, consumers get so frustrated with the personalization system that they stop using it. The de-personalization effect is largely responsible for the failures of some personalization projects. Therefore, one of the main challenges of personalization is to achieve the virtuous cycle of personalization and not fall into the de-personalization trap.

3 The term "virtuous cycle" was conceived in the 1950s. According to www.wordspy.com/words/virtuouscycle.asp, a virtuous cycle is a situation in which improvement in one element of a chain of circumstances leads to improvement in another element, which then leads to further improvement in the original element, and so on.
This completes the description of the personalization process. As was argued in Adomavicius and Tuzhilin (2005a) and Vesanen and Raulas (2006), it is important to integrate all the stages of the personalization process into one smooth iterative process in order to achieve the virtuous cycle of personalization. This issue is addressed in the next section.
6 Integrating the personalization process
As was pointed out above, the various stages of the personalization process described in Section 5 need to be integrated through carefully developed transitions from one stage to another in a tightly coupled manner (Adomavicius and Tuzhilin, 2005a). Without such tight coupling, there will be discontinuity points between the stages of personalization (Vesanen and Raulas, 2006), resulting in a failure to achieve the virtuous cycle of personalization. Some past failures of personalization projects are attributed to the lack of this integration. In particular, many companies have developed piecemeal solutions to their personalization initiatives by focusing on individual stages of the personalization process without much thought about how to integrate the different stages into an organic whole. For instance, Vesanen and Raulas (2006) present an example of a "discontinuity point" in a large oil and fuel marketing company, where the marketing department owns and manages the company's credit cards. However, the customers' purchasing data is owned and managed by the finance department, which produces credit card bills based on these data. Unfortunately, the finance department does not share the purchasing data with the marketing department, thus creating a discontinuity point in the company's personalization process. This is unfortunate because the marketing department cannot do much to build personalized relationships with customers without such purchasing data and the customer profiles built from it. Vesanen and Raulas (2006) also present a case study of a mail-order company in which they identify other discontinuity points in the personalization process.

This situation is typical of many personalization projects, since few of them support (a) all six stages of the personalization process presented in Fig. 4, including extensive mechanisms for measuring personalization impact, (b) feedback loops allowing adjustments of personalization strategies based on this feedback, and (c) integration of all the adjacent personalization stages in Fig. 4 to avoid discontinuity points. This is unfortunate because developing good evaluation measures, sound methods for adjusting personalization strategies and proper feedback loops constitutes one of the most important tasks of personalization, and achieving the virtuous cycle of personalization (or falling into the trap of de-personalization) crucially depends on how well these steps are implemented. A successful implementation of the personalization process that achieves the virtuous cycle of personalization needs to deploy
1. viable solutions for each of the six stages of the personalization process, and
2. sound design principles for integrating these six stages into the complete personalization process.

The technologies used in each of the six stages were discussed in Section 5. Integration principles for the personalization process are presented in Adomavicius and Tuzhilin (2005a), where they are classified into data-driven and goal-driven. According to Adomavicius and Tuzhilin (2005a), currently the most widespread method for designing the personalization process is the data-driven (or "forward") method: the data is usually collected first (or has already been collected), then consumer profiles are built based on the collected data, then these profiles are used in the matchmaking algorithms, and so on. In contrast to this currently adopted practice, Adomavicius and Tuzhilin (2005a) advocate designing the personalization process backwards, in accordance with the well-known dictum that "you cannot manage what you cannot measure." This means that the design of the personalization process should start with the specification of the measures used for determining the impact of the personalization process. The selected measure(s) should determine what types of personalized offerings should be delivered to consumers. Next, the profiling and matchmaking technologies for delivering these offerings need to be determined, as well as the types of information that should be stored in the consumer profiles and how this information should be organized. Finally, the types of relevant data to be collected for building comprehensive consumer profiles need to be determined. Adomavicius and Tuzhilin (2005a) call this approach goal-driven (as opposed to the aforementioned data-driven approach), because it starts with a predefined set of goal-oriented measures. They argue that the goal-driven approach can realize the virtuous cycle of personalization better than the data-driven approach, because it starts with the personalization goals and would therefore provide more value to providers and consumers. However, they also maintain that the goal-driven approach has not been systematically studied before, and this conjecture therefore needs to be rigorously validated by personalization researchers.
7 Future research directions in personalization
Although much work has been done in the field of personalization, as is evidenced by this survey, personalization still remains a young field, and much more research is needed to advance the state-of-the-art. Throughout the chapter, we identified various open problems and discussed possible extensions and new directions for problems that have already been studied.
Therefore, we will not repeat these observations in this section. Instead, we will summarize the issues that are, in our opinion, currently the most important in the field. We believe that the following topics are among the most important for the advancement of the field:

1. Improving each of the six stages of the personalization process presented in Fig. 4. Although some of these six stages, such as data collection and matchmaking, have been studied more extensively than others, more work is still required to develop a deeper understanding of, and improve the performance of, personalization systems in all six stages. We believe that the performance measurement and consumer profile building stages are the most underexplored and among the most crucial of the six stages; therefore, particular emphasis should be placed on advancing our understanding of these stages. Although there has been much recent work on the matchmaking stage, including work on recommender systems, much additional research is also required to advance this crucial stage.

2. Integrating the different stages of the personalization process. As was argued in Adomavicius and Tuzhilin (2005a) and Vesanen and Raulas (2006), integration of the different stages of the personalization process constitutes a very important problem, and little work has been done in this area. In addition to integrating adjacent stages, it is also important to develop viable feedback loop methods, and practically no research exists on this important problem.

3. Developing specific personalization techniques for particular types of offerings. Although the overall personalization framework, as described in this chapter, is applicable to the various types of offerings listed in Section 2, some personalization methods in various stages of the personalization process can vary across offerings, as was explained in Section 3.2. For example, the techniques for matchmaking of personalized prices can be quite different from personalized search and product recommendation techniques. Therefore, it is necessary to advance the state-of-the-art for each of the offering-specific methods in addition to developing novel offering-independent techniques. Although this is a general problem that is important for the various types of offerings described in Section 2, the delivery of targeted communications, including targeted ads, promotions and personalized emails, stands out because of its importance in business. Solutions to this problem have been developed since the mid-1990s, when companies such as Doubleclick and 24/7 introduced targeted ad delivery methods for online advertising. Still, this topic constitutes an interesting and important area of research that has become even more important in recent years with the advent of search engine marketing and advertising, popularized by sponsored search products such as Yahoo (Overture) and Google (AdWords).
4. Formalization of the whole personalization process. As stated in Sections 5 and 6, most personalization research has focused on only a few stages of the personalization process, and appropriate formal methods have been developed for those stages. For example, the field of recommender systems has witnessed rich theoretical developments over the past few years (Adomavicius and Tuzhilin, 2005b). Unfortunately, little mathematically rigorous work has been done on formalizing the whole personalization process, including formal definitions of the feedback loop mechanisms. We believe that this work is needed to gain a deeper understanding of personalization and to be able to abstract particular personalization problems for subsequent theoretical analysis.

5. Understanding how the stability (or rather instability) of consumer preferences affects the whole personalization (and customization) process. As discussed in Section 4, one of the fundamental assumptions behind the personalization approach is the stability of consumer preferences and the assumption that past consumer activities can be used to predict their future preferences and actions. Since consumer preferences change over time, it is important to understand how these changes affect the delivery of personalized offerings. Simonson (2005) provides several important insights into this problem and outlines possible future research directions. Continuing this line of work constitutes an important research topic that should be pursued by personalization researchers.

6. Privacy and its relationship to personalization. This constitutes another important topic of future research. A recent paper by Kobsa (2007) examines the tensions between personalization and privacy and outlines some possible approaches for finding the balance between the two.

We believe that these six areas require the immediate attention of personalization researchers. However, as stated before, these are not the only important problems in the personalization field, and numerous other open problems were formulated throughout this chapter. On the basis of this observation, we believe that personalization constitutes a rich area of research that will only grow in importance over time since, as Eric Schmidt of Google pointed out, we indeed "have the tiger by the tail in that we have this huge phenomenon of personalization" (Schmidt, 2006).
Acknowledgments The author would like to thank Anindya Ghose from NYU and two anonymous reviewers for their insightful comments that helped to improve the quality of the chapter.
References Adomavicius, G., A. Tuzhilin (2001a). Using data mining methods to build customer profiles. IEEE Computer 34(2), 74–82. Adomavicius, G., A. Tuzhilin (2001b). Expert-driven validation of rule-based user models in personalization applications. Data Mining and Knowledge Discovery 5(1–2), 33–58. Adomavicius, G., A. Tuzhilin (2002). An architecture of e-butler—a consumer-centric online personalization system. International Journal of Computational Intelligence and Applications 2(3), 313–327. Adomavicius, G., A. Tuzhilin (2005a). Personalization technologies: a process-oriented perspective. Communications of the ACM 48(10), 83–90. Adomavicius, G., A. Tuzhilin (2005b). Towards the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17(6), 734–749. Adomavicius, G., R. Sankaranarayanan, S. Sen, A. Tuzhilin (2005). Incorporating contextual information in recommender systems using a multidimensional approach. ACM Transactions on Information Systems 23(1), 103–145. Ansari, A., C. Mela (2003). E-customization. Journal of Marketing Research 40(2), 131–146. Ansari, A., S. Essegaier, R. Kohli (2000). Internet recommendations systems. Journal of Marketing Research 37(3), 363–375. Antoniou, G., F. Harmelen (2003). Web ontology language, in: S. Staab, R. Studer (eds.), Handbook on Ontologies in Information Systems. Springer-Verlag, Berlin. Balabanovic, M., Y. Shoham (1997). Fab: content-based, collaborative recommendation. Communications of the ACM 40(3), 66–72. Bennett, J., S. Lanning (2007). The Netflix Prize, in: Proceedings of the KDD Cup and Workshop, San Jose, CA. Billsus, D., M. Pazzani (2000). User modeling for adaptive news access. User Modeling and UserAdapted Interaction 10(2–3), 147–180. Boutilier, C., R. Zemel, B. Marlin (2003). Active collaborative filtering, in: Proceedings of the 19th Conference on Uncertainty in AI, Acapulco, Mexico. Breese, J.S., D. Heckerman, C. Kadie (1998). Empirical analysis of predictive algorithms for collaborative filtering, in: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, July 1998. Cadez, I.V., P. Smyth, H. Mannila (2001). Probabilistic modeling of transaction data with applications to profiling, visualization, and prediction, in: Proceedings of the ACM KDD Conference, San Francisco, CA. Chen, Y., C. Narasimhan, Z. Zhang (2001). Individual marketing with imperfect targetability. Marketing Science 20, 23–43. Chen, Y., G. Iyer (2002). Consumer addressability and customized pricing. Marketing Science 21(2), 197–208. Choudhary, V., A. Ghose, T. Mukhopadhyay, U. Rajan (2005). Personalized pricing and quality differentiation. Management Science 51(7), 1120–1130. Communications of the ACM (2000). Special issue on personalization. 43(8). Cortes, C., K. Fisher, D. Pregibon, A. Rogers, F. Smith (2000). Hancock: a language for extracting signatures from data streams. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA. Dewan, R., B. Jing, A. Seidmann (2000). Adoption of internet-based product customization and pricing strategies. Journal of Management Information Systems 17(2), 9–28. Dwyer, F.R. (1989). Customer lifetime valuation to support marketing decision making. Journal of Direct Marketing 3(4), 8–15. Dyche, J. (2002). The CRM Handbook. Addison-Wesley, Boston, MA. Eirinaki, M., M. Vazirgiannis (2003). Web mining for web personalization. 
ACM Transactions on Internet Technologies 3(1), 1–27.
Elmaghraby, W., P. Keskinocak (2003). Dynamic pricing in the presence of inventory considerations: research overview, current practices, and future directions. Management Science 49(10), p. 47. Ghose, A., K. Huang (2006). Personalized Pricing and Quality Design, Working Paper CeDER-06-06, Stern School, New York University. Ghose, A., K. Huang (2007). Personalization in a two dimensional model. Unpublished manuscript. Gilmore, J., B.J. Pine (1997). The four faces of mass customization. Harvard Business Review 75(1), 91–101. Gorgoglione, M., C. Palmisano, A. Tuzhilin (2006). Personalization in context: does context matter when building personalized customer models? IEEE International Conference on Data Mining, Hong Kong. Hagen, P. (1999). Smart personalization. Forrester Report. Hand, D., H. Mannila, P. Smyth (2001). Principles of Data Mining. MIT Press, Cambridge, MA. Haubl, G., K. Murray (2003). Preference construction and persistence in digital marketplaces: the role of electronic recommendation agents. Journal of Consumer Psychology 13(1), 75–91. Hill, W., L. Stead, M. Rosenstein, G. Furnas (1995). Recommending and evaluating choices in a virtual community of use, in: Proceedings of the CHI Conference. IBM Consulting Services. (2006). Cross-Channel Optimization: A Strategic Roadmap for Multichannel Retailers. The Wharton School Publishing. Imhoff, C., L. Loftis, J. Geiger (2001). Building the Customer-Centric Enterprise, Data Warehousing Techniques for Supporting Customer Relationship Management. Wiley, New York, NY. Jain, S., P.K. Kannan (2002). Pricing of information products on online servers: issues, models, and analysis. Management Science 48(9), 1123–1143. Jiang, T., A. Tuzhilin (2006a). Segmenting customers from populations to individuals: does 1-to-1 keep your customers forever? IEEE Transactions on Knowledge and Data Engineering 18(10), 1297–1311. Jiang, T., A. Tuzhilin (2006b). Improving personalization solutions through optimal segmentation of customer bases, in: Proceedings of the IEEE ICDM Conference, Hong Kong. Jiang, T., A. Tuzhilin (2007). Dynamic micro targeting: fitness-based approach to predicting individual preferences, in: Proceedings of the IEEE ICDM Conference, Omaha, NE. Kelleher, K. (2006). Personalize it. Wired Magazine, July. Kemp, T. (2001). Personalization isn’t a product. Internet Week 864, 1–2. Kimball, R. (1996). The Data Warehousing Toolkit. Wiley, New York, NY. Kobsa, A. (2007). Privacy-enhanced personalization. Communications of the ACM 50(8), 24–33. Kotler, P. (2003). Marketing Management. 11th ed. Prentice Hall. Liu, B., A. Tuzhilin (2008). Managing and analyzing large collections of data mining models. Communications of the ACM 51(2), 85–89. Liu, Y., Z.J. Zhang (2006). The benefits of personalized pricing in a channel. Marketing Science 25(1), 97–105. Manavoglu, E., D. Pavlov, C.L. Giles (2003). Probabilistic user behavior models, in: Proceedings of the ICDM Conference, Melbourne, FL. McDonnell, S. (2001). Microsegmentation, ComputerWorld, January 29. Mobasher, B., A. Tuzhilin (2009). Data mining for personalization. Special Issue of the User Modeling and User-Adapted Interaction Journal, in press. Mobasher, B., H. Dai, T. Luo, M. Nakagawa (2002). Discovery and evaluation of aggregate usage profiles for web personalization. Data Mining and Knowledge Discovery 6(1), 61–82. Mobasher, B., R. Cooley, J. Srivastava (2000). Automatic personalization based on web usage mining. Communications of the ACM 43(8), 142–151. Mobasher, B., S. Anand (eds.). (2007). 
Intelligent techniques for web personalization. Special Issue of the ACM Transactions on Internet Technologies 7(4). Montgomery, A., K. Srinivasan (2003). Learning about customers without asking, in: N. Pal, A. Rangaswamy (eds.), The Power of One: Gaining Business Value from Personalization Technologies. Trafford Publishing, Victoria, BC, Canada. Mulvenna, M., S. Anand, A. Buchner (2000). Personalization on the net using web mining. Communications of the ACM 43(8), 122–125. Murthi, B.P., S. Sarkar (2003). The role of the management sciences in research on personalization. Management Science 49(10), 1344–1362.
Nasraoui, O. (2005). World wide web personalization, in: J. Wang (ed.), The Encyclopedia of Data Warehousing and Mining, pp. 1235–1241. Oard, D.W., J. Kim (1998). Implicit feedback for recommender systems, in: Recommender Systems Papers from the 1998 Workshop. AAAI Press, Menlo Park, CA. Padmanabhan, B., Z. Zheng, S. O. Kimbrough (2001). Personalization from incomplete data: what you don’t know can hurt, in: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA. Pancras, J., K. Sudhir (2007). Optimal marketing strategies for a customer data intermediary. Journal of Marketing Research XLIV(4), 560–578. Pazzani, M. (1999). A framework for collaborative, content-based and demographic filtering. Artificial Intelligence Review 13(5–6), 393–408. Pazzani, M., D. Billsus (1997). Learning and revising user profiles: the identification of interesting web sites. Machine Learning 27, 313–331. Pennock, D.M., E. Horvitz, S. Lawrence, C.L. Giles (2000). Collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach, in: Proceedings of the 16th Conference on Uncertainty in AI, Stanford, CA. Peppers, D., M. Rogers (1993). The One-to-One Future. Doubleday, New York. Peppers, D., M. Rogers (2004). Managing Customer Relationships: A Strategic Framework. Wiley, New York, NY. Peterson, L.A., R.C. Blattberg, P. Wang (1997). Database marketing: past, present, and future. Journal of Direct marketing 11(4), 27–43. Pierrakos, D., G. Paliouras, C. Papatheodorou, C. Spyropoulos (2003). Web usage mining as a tool for personalization: a survey. User Modeling and User-Adapted Interaction 13, 311–372. Pine, J. (1999). Mass Customization: The New Frontier in Business Competition. HBS Press, Cambridge, MA. Qiu, F., J. Cho (2006). Automatic identification of USER interest for personalized search, in: Proceedings of the WWW Conference, May, Edinburgh, Scotland. Rangaswamy, A., J. Anchel (2003). From many to one: personalized product fulfillment systems, in: N. Pal, A. Rangaswamy (eds.), The Power of One: Gaining Business Value from Personalization Technologies. Trafford Publishing, Victoria, BC, Canada. Rao, B., L. Minakakis (2003). Evolution of mobile location-based services. Communications of the ACM 46(12), 61–65. Rashid, A.M., I. Albert, D. Cosley, S.K. Lam, S.M. McNee, J.A. Konstan, J. Riedl (2002). Getting to know you: learning new user preferences in recommender systems, in: Proceedings of the International Conference on Intelligent User Interfaces, Gran Canaria, Canary Islands, Spain. Reed, O. (1949). Some Random Thoughts . . . On Personalizing, The Reporter of Direct Mail Advertising, April. Resnick, P., N. Iakovou, M. Sushak, P. Bergstrom, J. Riedl (1994). GroupLens: an open architecture for collaborative filtering of netnews, in: Proceedings of the 1994 Computer Supported Cooperative Work Conference. Riecken, D. (2000). Personalized views of personalization. Communications of the ACM 43(8), 26–28. Ross, N. (1992). A history of direct marketing. Unpublished paper, Direct Marketing Association. Rossi, P.E., R.E., McCulloch, G.M. Allenby (1996). The value of purchase history data in target marketing. Marketing Science 15, 321–340. Schafer, J.B., J.A. Konstan, J. Riedl (2001). E-commerce recommendation applications. Data Mining and Knowledge Discovery 5(1/2), 115–153. Schmidt, E. (2006). ‘‘Succeed with Simplicity’’ (interview with Eric Schmidt of Google). Business 2.0 7(11), p. 86. Shaffer, G., Z. Zhang (2002). 
Competitive one-to-one promotions. Management Science 48(9), 1143–1160. Shardanand, U., P. Maes (1995). Social information filtering: algorithms for automating ‘word of mouth’, in: Proceedings of the Conference on Human Factors in Computing Systems.
Sheth, A., C. Bertram, D. Avant, B. Hammond, K. Kochut, Y. Warke (2002). Semantic Content Management for Enterprises and the Web, IEEE Computing, July/August. Simonson, I. (2005). Determinants of customers’ responses to customized offers: conceptual framework and research propositions. Journal of Marketing 69, 32–45. Smith, D. (2000). There are myriad ways to get personal. Internet Week, May 15. Spiliopoulou, M. (2000). Web usage mining for web site evaluation: making a site better fit its users. Communications of the ACM 43(8), 127–134. Srivastava, J., R. Cooley, M. Deshpande, P.-N. Tan (2000). Web usage mining: discovery and applications of usage patterns from web data. SIGKDD Explorations 1(2), 12–23. Staab, S., R. Studer (2003). Handbook on Ontologies in Information Systems. Springer-Verlag, Berlin. Surprenant, C., M.R. Solomon (1987). Predictability and personalization in the service encounter. Journal of Marketing 51, 86–96. Syam, N., R. Ruan, J. Hess (2005). Customized products: a competitive analysis. Marketing Science 24(4), 569–584. Tseng, M.M., J. Jiao (2001). Mass customization, in: Handbook of Industrial Engineering, Technology and Operation Management, 3rd ed. Wiley, New York, NY. Tsoi, A., M. Hagenbuchner, F. Scarselli (2006). Computing customized page ranks. ACM Transactions on Internet Technology 6(4), 381–414. Ulph, D., N. Vulkan (2001). E-commerce, mass customisation and price discrimination, Working Paper, Said Business School, Oxford University. Vesanen, J., M. Raulas (2006). Building bridges for personalization: a process model for marketing. Journal of Interactive Marketing 20(1), 5–20. Wedel, M., W. Kamakura (2000). Market segmentation: conceptual and methodological foundations. 2nd ed. Kluwer Publishers, Dordrecht, Boston. Wu, D., I. Im, M. Tremaine, K. Instone, M. Turoff (2003). A framework for classifying personalization scheme used on e-commerce websites, in: Proceedings of the HICSS Conference, Big Island, HI, USA. Yang, Y., B. Padmanabhan (2005). Evaluation of online personalization systems: a survey of evaluation schemes and a knowledge-based approach. Journal of Electronic Commerce Research 6(2), 112–122. Ying, Y., F. Feinberg, M. Wedel (2006). Leveraging missing ratings to improve online recommendation systems. Journal of Marketing Research 43(3), 355–365. Yu, K., A. Schwaighofer, V. Tresp, X. Xu, H.-P. Kriegel (2004). Probabilistic memory-based collaborative filtering. IEEE Transactions on Knowledge and Data Engineering 16(1), 56–69. Zipkin, P. (2001). The limits of mass customization. MIT Sloan Management Review 42(3), 81–87.
Adomavicius & Gupta, Eds., Handbooks in Information Systems, Vol. 3 Copyright r 2009 by Emerald Group Publishing Limited
Chapter 2
Web Mining for Business Computing
Prasanna Desikan, Colin DeLong, Sandeep Mane, Kalyan Beemanapalli, Kuo-Wei Hsu, Prasad Sriram, Jaideep Srivastava, Woong-Kee Loh and Vamsee Venuturumilli Department of Computer Science and Engineering, 200 Union Street SE, Room 4-192, University of Minnesota, Minneapolis, MN 55455, USA
Abstract

Over the past decade, there has been a paradigm shift in business computing, with the emphasis moving from data collection and warehousing to knowledge extraction. Central to this shift has been the explosive growth of the World Wide Web, which has enabled myriad technologies, including online stores, Web services, blogs, and social networking websites. As the number of online competitors has increased, along with consumer demand for personalization, new techniques for large-scale knowledge extraction from the Web have been developed. A popular and successful suite of techniques that has shown much promise is "Web mining." Web mining is essentially data mining for Web data, enabling businesses to turn their vast repositories of transactional and website usage data into actionable knowledge that is useful at every level of the enterprise—not just the front-end of an online store. This chapter provides an introduction to the field of Web mining and examines existing and potential Web mining applications for several business functions, such as marketing, human resources, and fiscal administration. Suggestions for improving information technology infrastructure are given, which can help businesses interested in Web mining begin implementing projects quickly.
1 Introduction
The Internet has changed the rules for today's businesses, which now increasingly face the challenge of sustaining and improving performance throughout the enterprise. The growth of the World Wide Web and
enabling technologies has made data collection, data exchange, and information exchange easier and has sped up most major business functions. Delays in retail, manufacturing, shipping, and customer service processes are no longer accepted as necessary evils, and firms improving upon these (and other) critical functions have an edge in their battle at the margins. Technology has been brought to bear on myriad business processes and has effected massive change in the form of automation, tracking, and communications, but many of the most profound changes are yet to come.

Leaps in computational power have enabled businesses to collect and process large amounts of data of different kinds. The availability of data and the necessary computational resources, together with the potential of data mining, has shown great promise in having a transformational effect on the way businesses perform their work. Well-known successes of companies such as Amazon.com have provided evidence to that end. By leveraging large repositories of data collected by corporations, data mining techniques and methods offer unprecedented opportunities for understanding business processes and predicting future behavior. With the Web serving as the realm of many of today's businesses, firms can improve their ability to know when and what customers want by understanding customer behavior, find bottlenecks in internal processes, and better anticipate industry trends. Companies such as Amazon, Google, and Yahoo have been top performers in B2C commerce because of their ability to understand the consumer and communicate effectively.

This chapter examines past success stories, current efforts, and future directions of Web mining as it applies to business computing. Examples are given from several business areas, such as product recommendations, fraud detection, process mining, and inventory management, showing how Web mining can enable revenue growth, cost minimization, and enhancement of strategic vision. Gaps in existing technology are also elaborated on, along with pointers to future directions.
2 Web mining
Web mining is the application of data mining techniques to extract knowledge from Web data, including Web documents, hyperlinks between documents, and usage logs of websites. A panel organized at ICTAI 1997 (Srivastava and Mobasher, 1997) asked the question "Is there anything distinct about Web mining (compared to data mining in general)?" While no definitive conclusions were reached then, the tremendous attention paid to Web mining in the past decade, and the number of significant ideas that have been developed, have answered this question in the affirmative. In addition, a fairly stable community of researchers interested in the area has formed through successful workshop series such as WebKDD
(held annually in conjunction with the ACM SIGKDD Conference) and Web Analytics (held in conjunction with the SIAM data mining conference). Many informative surveys exist in the literature that address various aspects of Web mining (Cooley et al., 1997; Kosala and Blockeel, 2000; Mobasher, 2005). Two different approaches have been taken in defining Web mining. The first was a "process-centric view," which defined Web mining as a sequence of tasks (Etzioni, 1996). The second was a "data-centric view," which defined Web mining in terms of the types of Web data used in the mining process (Cooley et al., 1997). The second definition has become more widely accepted, as is evident from the approach adopted in most recent papers that have addressed the issue. In this chapter, we use the data-centric view, under which Web mining is defined as the application of data mining techniques to extract knowledge from Web data, i.e., Web content, Web structure and Web usage data.
The attention paid to Web mining in research, the software industry, and Web-based organizations has led to the accumulation of substantial experience, and its application in business computing has found tremendous utility. In the following sub-sections, we describe the taxonomy of Web mining research, the applicability of Web mining to business computing, and some key aspects of Web mining that make it different from traditional data mining. First, in Section 2.1, we present the different kinds of Web data that can be captured and classify the area of Web mining according to the kinds of data collected. This classification is natural since the techniques adopted for each kind of data are more or less unique to extracting knowledge from that kind of data. Second, the unique nature of Web data has also led to novel problems that could not be addressed by earlier data mining techniques, due to the lack of enabling infrastructure, such as the Web, for collecting such data; typical examples include user-session identification, robot identification, and online recommendations. In Section 2.2, we present an overview of Web mining techniques and relevant pointers to the state-of-the-art in the literature. Some of the techniques developed have been exclusive to Web mining because of the nature of the data collected.

2.1 Data-centric Web mining taxonomy

Web mining can be broadly divided into three distinct categories according to the kinds of data to be mined. We provide a brief overview of the three categories below; an illustration of the taxonomy is shown in Fig. 1.
Fig. 1. Web mining taxonomy.
Web content mining. Web content mining is the process of extracting useful information from the contents of Web documents. Content data corresponds to the collection of information on a Web page that is conveyed to users. It may consist of text, images, audio, video, or structured records such as lists and tables. The application of text mining to Web content has been the most widely researched. Issues addressed in text mining include topic discovery, extraction of association patterns, clustering of Web documents, and classification of Web pages. Research on this topic has drawn heavily on techniques developed in other disciplines such as Information Retrieval (IR) and Natural Language Processing (NLP). While a significant body of work on extracting knowledge from images exists in the fields of image processing and computer vision, the application of these techniques to Web content mining has been limited.

Web structure mining. Web structure mining is the process of discovering structure information from the Web. The structure of a typical Web graph consists of Web pages as nodes and hyperlinks as edges connecting related pages. Web structure mining can be further divided into two kinds based on the type of structural information used.

Hyperlinks: A hyperlink is a structural unit that connects a location in a Web page to a different location, either within the same Web page or on a different Web page. A hyperlink that connects to a different part of the same page is called an intra-document hyperlink, and a hyperlink that connects two different pages is called an inter-document hyperlink. There has been a significant body of work on hyperlink analysis (see the survey by Desikan et al., 2002).
Document structure: The content within a Web page can also be organized in a tree-structured format, based on the various HTML and XML tags within the page. Here, mining efforts have focused on automatically extracting document object model (DOM) structures from documents.

Web usage mining. Web usage mining is the application of data mining techniques to discover interesting usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Usage data captures the identity or origin of Web users along with their browsing behavior at a website. Web usage mining is further classified according to the kind of usage data used:

Web server data: user logs collected by the Web server. Typical data include the IP address, page reference, and access time.

Application server data: commercial application servers, such as WebLogic, have significant features in their frameworks to enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs.

Application level data: new kinds of events can always be defined in an application, and logging can be turned on for them, generating histories of these specially defined events.
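As a small illustration of the Web server data described above, the following sketch extracts the IP address, access time and page reference from a single Apache-style log line; the log line itself is fabricated.

```python
import re

# Fabricated example line in the Apache combined log format.
line = ('192.168.1.20 - - [10/Mar/2009:14:55:36 -0600] '
        '"GET /products/item42.html HTTP/1.1" 200 2326 '
        '"http://www.example.com/index.html" "Mozilla/5.0"')

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) \S+" (?P<status>\d+)')

m = pattern.match(line)
if m:
    # The fields of interest for usage mining: who, when, and which page.
    print(m.group("ip"), m.group("time"), m.group("page"), m.group("status"))
    # 192.168.1.20 10/Mar/2009:14:55:36 -0600 /products/item42.html 200
```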
2.2 Web mining techniques—state-of-the-art

Enabling technologies such as the Internet have not only generated new kinds of data, such as Web data, but have also given rise to a new class of techniques associated with this data and with applications built on this platform. For example, the ease of obtaining user feedback has resulted in new data and new techniques for collaborative filtering. The relation between the content and the structure of the Web has itself led to a new class of relevance rank measures. Web usage data collection has given rise to new kinds of problems and to techniques to address them, such as user-session identification, user identification, and spam detection. Thus the Web has not only generated new kinds of data but has opened up a series of new problems, different from traditional data mining problems, that can be addressed given the availability of such data and its applications on the Web. In the following paragraphs we discuss the state-of-the-art in Web mining research. Web mining techniques have also adopted significant ideas from the field of information retrieval; however, our focus in this chapter is restricted to core Web mining techniques, and we do not delve deeply into the large area of information retrieval.

The interest of the research community and the rapid growth of work in this area have resulted in significant research contributions, which have been summarized in a number of surveys and book chapters over the past few years (Cooley et al., 1997; Kosala and Blockeel, 2000; Srivastava et al., 2004). Research on Web content mining has focused on issues such as extracting information from structured and unstructured data and integrating information from various sources of content. Earlier work on Web content mining is surveyed by Kosala and Blockeel (2000). Web content mining has found utility in a variety of applications such as Web page categorization and topic distillation. A special issue on Web content mining (Liu and Chang, 2004) captures recent issues addressed by the research community in this area.

Web structure mining has focused primarily on hyperlink analysis. A survey of hyperlink analysis techniques, along with a methodology for pursuing research in this area, has been provided by Desikan et al. (2002). Most of these techniques can be used independently or in conjunction with techniques based on Web content and Web usage. The most popular application is the ranking of Web pages. PageRank (Page et al., 1998), developed by the Google founders, is a popular metric for ranking the importance of hypertext documents for Web search. The key idea in PageRank is that a page has a high rank if many highly ranked pages point to it; hence the rank of a page depends upon the ranks of the pages pointing to it. Another popular measure is the hub and authority scores. The underlying model for computing these scores is a bipartite graph (Kleinberg, 1998). Web pages are modeled as "fans" and "centers" of a bipartite core, where a "fan" is regarded as a hub page and a "center" as an authority page. For a given query, a set of relevant pages is retrieved, and for each page in this set a hub score and an authority score are computed (Kleinberg, 1998).

Web usage data is the key to understanding the user's perspective of the Web, whereas content and structure reflect the creator's perspective. Understanding user profiles and user navigation patterns, for building better adaptive websites and predicting user access patterns, has evoked interest in both the research and the business communities. The primary step in Web usage mining is pre-processing the user log data, for example to separate Web page references into those made for navigational purposes and those made for content purposes (Cooley et al., 1999). The concept of the adaptive Web was introduced by researchers from the University of Washington, Seattle (Perkowitz and Etzioni, 1997). Markov models have been the most popular technique for predicting user behavior (Pirolli and Pitkow, 1999; Sarukkai, 1999; Zhu et al., 2002). More detailed information about various aspects of Web usage mining techniques can be found in a recent extensive survey on this topic (Mobasher, 2005).
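Returning to the link-analysis measures discussed above, the following is a minimal power-iteration sketch of PageRank over a tiny hypothetical Web graph. It is a simplification of the measure described by Page et al. (1998), not Google's production algorithm, and it assumes every page has at least one out-link.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: {page: [pages it links to]}. Returns an approximate PageRank per page."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Sum the rank flowing into p from every page that links to it.
            incoming = sum(rank[q] / len(graph[q]) for q in pages if p in graph[q])
            new_rank[p] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Hypothetical four-page graph: a page ranks highly if highly ranked pages point to it.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(web))
```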
3 How Web mining can enhance major business functions
This section discusses existing and potential efforts in the application of Web mining techniques to the major functional areas of businesses.
Table 1
Summary of how Web mining techniques are applicable to different business functions

Area         Function                      Application                                 Technique
Sales        Product marketing             Product recommendations                    Association rules
             Consumer marketing            Product trends                             Time series data mining
             Customer service              Expert-driven recommendations              Association rules, text mining, link analysis
Purchasing   Shipping and inventory        Inventory management                       Clustering, association rules, forecasting
Operations   Human resources               HR call centers                            Sequence similarities, clustering, association rules
             Sales management              Sales leads identification and assignment  Multi-stage supervised learning
             Fiscal management             Fraud detection                            Link mining
             Information technology        Developer duplication reduction            Clustering, text mining
             Business process management   Process mining                             Clustering, association rules
Some examples of deployed systems, as well as frameworks for emerging applications yet to be built, are discussed. It should be noted that the examples should not be regarded as solutions to all problems within the business function area in which they are cited. Their purpose is to illustrate that Web mining techniques have been applied successfully to handle certain kinds of problems, providing evidence of their utility. Table 1 provides a summary of how Web mining techniques have been successfully applied to address various issues that arise in business functions.
3.1 Sales

3.1.1 Product recommendations

Recommending products to customers is a key issue for all businesses. Currently, traditional brick-and-mortar stores have to rely on data collected explicitly from customers through surveys to offer customer-centric recommendations. However, the advent of e-commerce not only enables a level of personalization in customer-to-store interaction that is far greater than imaginable in the physical world, but also leads to unprecedented levels of data collection, especially about the "process of shopping."
Fig. 2. NetFlix.com—an example of product recommendation using Web usage mining.
The desire to understand individual customer shopping behavior and psychology in detail through data mining has led to significant advances in online customer-relationship management (e-CRM), as well as to services such as real-time recommendations. A recent survey (Adomavicius and Tuzhilin, 2005) provides an excellent taxonomy of the various techniques that have been developed for online recommendations. NetFlix.com is a good example of how an online store uses Web mining techniques to recommend products, such as movies, to customers based on their past rental profiles and movie ratings, together with the profiles of users who have similar movie rating and renting patterns. As shown in Fig. 2, Netflix uses a collaborative-filtering-based recommendation system called Cinematch (Bennet, 2007) that analyzes movie ratings given by users to make personalized recommendations based on their profiles. Knowledge gained from Web data is the key driver of Netflix features such as favorite genres, recommendations based on movies rated earlier by the user, and recommendations based on information shared with friends who are part of the user's social network. Other companies such as Amazon.com use a host of Web mining techniques, such as associations between pages visited and click-path analysis, to improve the customer's experience and provide recommendations during a "store visit." Techniques for the automatic generation of personalized product recommendations (Mobasher et al., 2000) form the basis of most Web-mining-based recommendation models.
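As a toy illustration of the association-style recommendations mentioned above (items frequently viewed or bought together), the sketch below mines pairwise co-occurrences from a handful of fabricated transactions; deployed systems such as Amazon's or Netflix's Cinematch use far more sophisticated techniques.

```python
from collections import Counter
from itertools import combinations

# Fabricated transactions: items viewed/purchased together in one session.
transactions = [
    {"dvd_player", "hdmi_cable"},
    {"dvd_player", "hdmi_cable", "speakers"},
    {"dvd_player", "speakers"},
    {"book_webmining", "book_datamining"},
]

pair_counts = Counter()
item_counts = Counter()
for t in transactions:
    item_counts.update(t)
    pair_counts.update(combinations(sorted(t), 2))

def recommend(item, min_confidence=0.5):
    """Items whose conditional co-occurrence with `item` exceeds the threshold."""
    recs = []
    for (a, b), cnt in pair_counts.items():
        if item in (a, b):
            other = b if a == item else a
            confidence = cnt / item_counts[item]   # estimate of P(other | item)
            if confidence >= min_confidence:
                recs.append((other, round(confidence, 2)))
    return sorted(recs, key=lambda x: -x[1])

print(recommend("dvd_player"))  # [('hdmi_cable', 0.67), ('speakers', 0.67)]
```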
3.1.2 Product area and trend analysis

John Ralston Saul, the Canadian author, essayist and philosopher, noted:

With the past, we can see trajectories into the future—both catastrophic and creative projections.

Businesses would definitely like to see such projections of the future, especially for identifying new product areas based on emerging trends—key for
any business to build market share. Prediction using trend analysis for a new product typically addresses two kinds of issues: first, the potential market for a particular product; and second, whether a single product may provide a platform for developing a class of products with potentially high market value. Different methods have been implemented for such prediction purposes; among the popular approaches are surveying techniques and time-series forecasting techniques. Traditionally, collecting sufficient data was a major hurdle in the application of such techniques. However, with the advent of the Web, the task of filling out forms and recording results has been reduced to a series of clicks. This enabling technology has caused a huge shift in the amount and types of data collected, especially with regard to understanding customer behavior. For example, applying Web mining to data collected from online community interactions provides a very good understanding of how such communities are defined, which can then be used for targeted marketing through advertisements and e-mail solicitation. A good example is AOL's concept of "community sponsorship," whereby an organization, Nike, for instance, may sponsor a community called "Young Athletic TwentySomethings." In return, consumer survey and new product development experts of the sponsoring organization are able to participate in that community, perhaps without the knowledge of other participants. The idea is to treat the community as a highly specialized focus group, to understand its needs and opinions about existing and new products, and to test strategies for influencing opinions. New product sales can also be modeled using other techniques, such as co-integration analysis (Franses, 1994).

The second most popular technique is time-series analysis. Box and Jenkins (1994) give an excellent account of various time-series analysis and forecasting techniques in their book, and it has also been shown how time-series analysis can be used for decision-making in business administration (Arsham, 2006). These techniques have broad applicability and can be used for predicting trends for potential products. While most of these techniques have been based on statistical approaches, recent work has shown that data mining can be used successfully to discover patterns of interest in time-series data. Keogh (2004) provides a good overview of data mining techniques for time-series analysis, most of which can also be applied to Web data.

With the growth of Web search and keyword search-based ad placement, query words have assumed a great deal of significance in the advertising world. These query words represent topics or products popular among users. Search engines have been increasingly focusing on analyzing trends in these query words, as well as their click-through rates, to improve query-related ad delivery. Fig. 3 gives an example of how keywords can be analyzed for trends. It depicts the trends for the keywords "Web Mining" and "Business Computing."
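A minimal sketch of how one might quantify whether two such query-volume series move together; the weekly volumes below are invented, and Python 3.10+ is assumed for statistics.correlation.

```python
from statistics import correlation  # available in Python 3.10+

# Fabricated weekly search volumes for two query terms.
web_mining         = [120, 135, 150, 160, 158, 170, 180]
business_computing = [80,  90,  100, 108, 105, 115, 122]

r = correlation(web_mining, business_computing)
print(f"Pearson correlation: {r:.2f}")  # close to 1.0 for these made-up series
```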
Fig. 3. Google trends.
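As a minimal illustration of this kind of keyword trend analysis, the sketch below smooths two search-volume series and computes their correlation. The weekly figures are made up for the example and are not actual Google Trends data.

```python
import pandas as pd

# Hypothetical weekly search-volume indices for two keywords.
idx = pd.date_range("2007-01-07", periods=8, freq="W")
volume = pd.DataFrame({
    "web mining":         [42, 45, 44, 50, 53, 51, 58, 60],
    "business computing": [30, 33, 31, 36, 40, 38, 44, 47],
}, index=idx)

smoothed = volume.rolling(window=3, min_periods=1).mean()   # simple moving average
trend = smoothed.diff().mean()                              # average week-over-week change
corr = volume["web mining"].corr(volume["business computing"])

print(trend)   # positive values suggest rising interest in both topics
print(corr)    # a high correlation hints the topics move together
```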
3.1.3 Expert-driven recommendations for customer assistance
Most recommender systems used in business today are product-focused, where recommendations made to a customer are typically a function of his/her interests in products (based on his/her browsing history) and those of other similar customers. However, in many cases, recommendations must be made without knowledge of a customer's preferences, such as in customer service call centers. In such cases, call center employees leverage their domain knowledge in order to align customer inquiries with appropriate answers. Here, a customer may be wrong, as is often observed when domain experts are asked questions by non-experts. Many businesses must maintain large customer service call centers, especially in retail-based operations, in order to address this need. However, advances in Web-based recommender systems may make it possible to relieve call center capacity by offering expert-based recommendations online (DeLong et al., 2005). Fig. 4 gives an overview of expert-driven customer assistance recommendations. Similar to a customer talking to a call center assistant, the recommendation system treats customer browsing behavior as a series of ‘‘questions’’ the customer wants answered or, more generally, as expressions of interest in the topic matter of a clicked-on Web page.
Fig. 4. Overview of expert-driven customer assistance recommendations.
Given the order and topic matter covered by such sequences of clicks, the recommendation system continuously refines its recommendations, which are not themselves directly a function of customer interest. Rather, they are generated by querying an abstract representation of the customer service website, called a ‘‘concept-page graph.’’ This graph contains a ranked set of topic/Web page combinations, and as the customer clicks through the website, the system looks for the Web pages that best capture the topics the customer is seeking to know more about. And since their browsing behavior helps determine the questions they want answered, the eventual recommendations are more likely to lead to the correct answer to their question, rather than to a potentially misleading one based on interest alone.
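The sketch below illustrates the flow of Fig. 4 under simplifying assumptions: a hypothetical concept-page graph and page-to-topic mapping stand in for the real data structures, and the scoring is a simple weighted count rather than the method of DeLong et al. (2005).

```python
from collections import Counter

# Hypothetical concept-page graph: topic -> ranked (page, weight) pairs.
concept_page_graph = {
    "401k":       [("pages/401k_overview", 0.9), ("pages/401k_enroll", 0.7)],
    "retirement": [("pages/retirement_faq", 0.8), ("pages/401k_overview", 0.6)],
    "benefits":   [("pages/benefits_home", 0.5)],
}

# Hypothetical mapping from each page to the topics it covers.
page_topics = {
    "pages/benefits_home": ["benefits"],
    "pages/retirement_faq": ["retirement"],
    "pages/401k_overview": ["401k", "retirement"],
}

def recommend(browsing_sequence, top_n=3):
    """Treat clicked pages as implicit questions and rank candidate answer pages."""
    interest = Counter()
    for page in browsing_sequence:
        for topic in page_topics.get(page, []):
            interest[topic] += 1
    scores = Counter()
    for topic, count in interest.items():
        for page, weight in concept_page_graph.get(topic, []):
            scores[page] += count * weight
    # Do not recommend pages the customer has already seen.
    return [p for p, _ in scores.most_common() if p not in browsing_sequence][:top_n]

print(recommend(["pages/benefits_home", "pages/retirement_faq"]))
```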
3.2 Purchasing
3.2.1 Predictive inventory management
A significant cost for a business that sells large quantities of products is the maintenance of an inventory management system to support sales. The goal of inventory management is to keep inventory acquisition and maintenance costs low while simultaneously maximizing customer satisfaction through product availability. As such, inventory management systems must keep track of customer product demand through sales trend and market analysis. By analyzing transaction data for purchasing trends, inventory needs can be addressed in a pre-emptive fashion, improving efficiency by enabling ‘‘just-in-time’’ inventory. As the Internet has permeated business computing, the task of browsing and purchasing products has been reduced to a series of clicks. This has made shopping extremely simple for customers and has lowered the barrier for businesses to obtain detailed customer feedback and shopping behavior
data. And though Web mining techniques have provided an excellent framework for personalized shopping, as well as improved direct marketing and advertisement, they can also aid companies in understanding customer access and purchasing patterns for inventory management at a very detailed level. Web mining techniques can improve inventory management in a variety of ways. First, using techniques such as Web content mining to search the Web—including competitors' websites—businesses can discover new or alternate vendors and third-party manufacturers. Second, trend analysis using Web usage mining can yield valuable information about potentially latent relationships between products, helping gauge demand for one or more products based on the sales of others. Taken together, these analyses allow inventory gaps (where there is demand for a product not yet stocked) to be identified and the missing products to be added to inventory at levels corresponding to the demand estimated from product relationship analysis. Amazon.com, again, is a great example. As it became one of the largest B2C websites, its ever-increasing customer base and product breadth (and depth) demanded an efficient inventory management system. As such, Amazon.com adopted advanced Web mining techniques to manage and plan material resource availability. These techniques have enabled Amazon.com to decrease costs incurred from idle stock maintenance and consequently increase product choice for customers, greatly increasing Amazon's revenue. By taking advantage of Web usage mining techniques and applying them to website usage data, transaction data, and external website data, other companies can reap the benefits of such predictive inventory management.
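As a minimal illustration of mining latent relationships between products from transaction data, the sketch below computes a simple co-occurrence ‘‘confidence’’ measure over made-up baskets; a production system would use full association-rule mining over real sales and usage logs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase transactions (each is a set of product IDs).
transactions = [
    {"laptop", "laptop_bag", "mouse"},
    {"laptop", "mouse"},
    {"laptop", "laptop_bag"},
    {"camera", "memory_card"},
    {"camera", "memory_card", "tripod"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

def confidence(a, b):
    # Estimated probability that product b is bought given product a was bought.
    pair = tuple(sorted((a, b)))
    return pair_counts[pair] / item_counts[a] if item_counts[a] else 0.0

# If laptops are selling well, gauge the pull-through demand for bags.
print(confidence("laptop", "laptop_bag"))   # 2/3 of laptop baskets include a bag
```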
3.3 Operations
3.3.1 Human resources call centers
Human resource (HR) departments of large companies are faced with answering many policy, benefits, and payroll related questions from employees. As the size of the company grows, the task becomes more difficult, since the department must not only handle a larger number of employees but also account for issues such as geographically local policies. An often-used approach to handling this problem is to have ‘‘call centers,’’ where human representatives provide expert assistance to employees through telephone conversations. Due in part to the cost associated with call centers, many companies have also published all relevant policy and process information to their corporate intranet websites for easy perusal by employees. However, in the face of such large repositories of detailed content, many employees still primarily seek the advice of human representatives at call centers to help them more quickly sort through their policy and procedure questions, resulting in call center escalation.
A recent study (Bose et al., 2006) has shown promise in applying Web mining techniques, as well as gene-sequence similarity approaches from bioinformatics, to the problem of providing knowledge assistance in cases such as HR call center escalation. As a result, a Web recommendation system was developed to help employees navigate HR websites by reducing the number of clicks they need to locate answers to their questions. This was done by coupling the conceptual and structural characteristics of the website such that relevant pages for a particular topic (e.g., 401(k) plans, retirement, etc.) could be determined. In the study, the conceptual characteristics are represented by the logical organization of the website as designated by the website administrator, while the structural characteristics provide a navigational path starting from a given Web page. By using this information, expert knowledge can be incorporated into website usage analysis, which recent studies (DeLong et al., 2005) have shown plays an important role in improving the predictive power of recommendation engines. Figure 5 gives an example of an employee benefits website, with a sample of the recommendations provided to a user looking for information related to a 401(k) plan.
Fig. 5. Recommendations provided to the user of an employee benefits website.
3.3.2 Sales leads identification and assignment
To quote Jeff Thull, a leading sales and marketing strategist:
Accepting that 20% of your salespeople bring in 80% of your revenue is like accepting that 80% of your manufacturing machines are, on the average, producing one-fourth of your most productive machines.
In many businesses, an effective way of countering this problem is to develop a process (or system) that allows sales managers to learn from past performance and track their current performance, both qualitatively and quantitatively. With the Internet, a number of Web-based businesses have emerged to enable customer-relationship management (CRM), a related approach used to collect, store, and analyze customer information. For example, Salesforce.com (http://www.salesforce.com) offers Web-based CRM infrastructure to companies. Web-based approaches enable easy customization and integration of different application tools, as well as location-independent access to the CRM data through the Web-based interface. Macroscopic and microscopic (detailed) views and spatio-temporal partitions of the information are also possible. Further, real-time dashboards allow easy visualization of various parameters and their statistical properties (e.g., means, medians, standard deviations). Commonly, tools for more sophisticated statistical analysis of the data are available in such applications. However, much of the burden of analyzing and interpreting such data lies with the sales managers. Given the number of types of data collected and the possible analysis techniques that can be applied, it becomes difficult for a sales manager to apply all possible techniques and search for interesting results. As such, many sales managers use only a few parameters for analysis on a daily basis. Additionally, the learning curve for this process is slow due to the manual effort required in learning which parameters are important. Effective analysis is, therefore, made that much more difficult and cumbersome by limited analytical resources and constraints on the sales manager's time, which can result in errors and the inability to properly forecast emerging leads. In sales leads identification, traditional sources of information, such as phone calls, are supplemented by Web data, providing an additional means of collecting information about buyers. For example, eBay (http://www.ebay.com/) creates behavior profiles of the buyers (and sellers) of products on its Web portal. Web usage data, such as the products bought and browsed for by buyers, provide critical sales lead information. On similar portals, a buyer can be a group of individuals, a small business, or a large organization, depending upon the portal's type (e.g., customer-to-customer, customer-to-business, or business-to-business e-commerce). Web content data from these external websites can be analyzed using Web content mining to help identify new customers and understand their requirements. Competitors' websites can be used to learn about their past performance and customers, as well as to help identify competitors in different geographical regions. This rich information about buyers can be harnessed by an enterprise to determine the most profitable markets, find new sales leads, and align them with the business's offerings. Here, Web mining approaches can play an important role in identifying connections between various customers, analyzing them, and understanding their various business impacts.
3.3.3 Fraud analysis
The Internet has dramatically changed the ways in which businesses sell products. There are now many well-established Internet sites for e-commerce, and huge numbers of items have been bought and sold online. Meanwhile, fraudulent attempts to unjustly obtain property on websites have also been increasing. Although a great deal of effort has been expended in investigating and preventing Internet fraud, criminals have shown they are capable of quickly adapting to existing defensive methods and continue to create more sophisticated ways of perpetrating fraud. Some Internet fraud, such as shilling, also exists in offline commerce. However, the frequency of such fraud has dramatically increased in online e-commerce applications due to its ease of implementation in an online setting. While some fraudulent activities are ignored when detected, others are more serious, involve large sums of lost money and property, and can result in lawsuits brought by their victims. Much Internet-based fraud is perpetrated in a cooperative manner among multiple associates. For example, in online auction shilling, fake customers (who are actually associates of a fraudulent seller) pretend not to have any connection with the seller and raise the bid price so that the seller's item is sold at a higher price than its real value. Such associates are called shills, though shilling can be perpetrated without human associates: a seller can register multiple ids and participate in a single auction from multiple computers with different IP addresses, pretending to be different bidders. Detecting such fraud often means tracking the relationships between sellers and customers over a period of time. Clearly, non-automated techniques for accomplishing this task on a wide scale will incur significant costs. To address such issues, Web mining techniques have risen to prominence through their capacity to automatically detect ‘‘fraudulent behavior’’ in Web usage data. Since Web mining techniques are often focused on discovering relationships in usage, content, and transaction data, they can be readily applied to analyzing the relationships among people participating in online trading. As previously mentioned, much Internet fraud is perpetrated in cooperation with multiple ‘‘associates.’’ In order to detect such fraudulent activity, graph mining techniques can be used to uncover latent relationships between associates by finding graphs with similar topological structures. Since a number of frauds may be perpetrated by the same group of fraudsters, these techniques make it possible to identify the group's other frauds; they have been exploited not only for detecting fraud in e-commerce, but also for antiterrorism, financial crime detection, and spam detection.
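The sketch below illustrates the simplest form of such relationship analysis: it builds a seller/bidder co-occurrence structure from hypothetical auction records and flags bidders who repeatedly appear in one seller's auctions. The records and the threshold are assumptions for the example; real graph-mining approaches use much richer topological features.

```python
from collections import defaultdict

# Hypothetical auction records: (auction_id, seller_id, bidder_id).
bids = [
    ("a1", "sellerX", "bidder1"), ("a1", "sellerX", "bidder2"),
    ("a2", "sellerX", "bidder2"), ("a3", "sellerX", "bidder2"),
    ("a4", "sellerY", "bidder3"), ("a5", "sellerY", "bidder4"),
]

# Build a bipartite seller-bidder relation weighted by distinct auctions in common.
edge_auctions = defaultdict(set)
for auction, seller, bidder in bids:
    edge_auctions[(seller, bidder)].add(auction)

# A bidder who keeps showing up in one seller's auctions is a shill candidate.
SUSPICION_THRESHOLD = 3   # assumed cutoff for the toy example
suspects = [(seller, bidder)
            for (seller, bidder), auctions in edge_auctions.items()
            if len(auctions) >= SUSPICION_THRESHOLD]

print(suspects)   # [('sellerX', 'bidder2')]
```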
3.3.4 Developer duplication reduction
Many businesses, both large and small, maintain one or more internal application development units. Thus, at any given time, there may be
hundreds, if not thousands, of projects being developed, deployed, and maintained concurrently. Due to overlapping business processes (i.e., human resources and fiscal administration) and multiple project development groups, duplication of source code often occurs (Rajapakse and Jarzabek, 2005; Kapser and Godfrey, 2003). Given the non-trivial cost of application development, mitigating such duplication is critical. Source code consistency is also an issue: for example, only one of two duplicated segments may be updated to address a bug or add a feature. Turnkey solutions for detecting source code duplication are already available, but they suffer from two major problems: they are not able to address code that is functionally similar but syntactically different, and they only detect duplication after it has already occurred. The goal of a full-featured duplication detection system would be to address both existing and potential duplication—the latter capability is currently unavailable. However, Web mining methods may offer a solution. Many businesses maintain intranets containing corporate policy information, best practices manuals, contact information, and project details—the last of which is of particular interest here. Assuming project information is kept current, it is possible to use Web mining techniques to identify functionality that is potentially duplicative, since syntactically different functions are oftentimes described using similar language. Figure 6 gives an overview of a possible approach for identifying potential duplication among multiple projects. First, the project Web pages and documents must be extracted from the intranet. Next, each document is split into fragments using common separators (periods, commas, bullet points, new lines, etc.). These fragments form the most basic element of comparison—the smallest entity capable of expressing a single thought. Using clustering techniques, these fragments can then be grouped into collections of similar fragments. When two or more fragments are part of the same collection but come from different projects, potential duplication has been identified. These fragments may then be red-flagged and brought to the attention of the affected project managers.
Fig. 6. Duplication candidate process overview.
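The sketch below illustrates the fragment-comparison step with a simple pairwise word-overlap (Jaccard) measure standing in for full clustering; the project fragments and the similarity threshold are made up for the example.

```python
from itertools import combinations

# Hypothetical project-description fragments extracted from intranet pages.
fragments = {
    ("payroll_app", "f1"): "calculate monthly payroll deductions for each employee",
    ("hr_portal",   "f2"): "compute employee payroll deductions every month",
    ("crm_tool",    "f3"): "send reminder emails to sales leads",
}

def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Flag fragment pairs from *different* projects whose wording is similar.
THRESHOLD = 0.3   # assumed similarity cutoff
candidates = [(k1, k2, round(jaccard(t1, t2), 2))
              for (k1, t1), (k2, t2) in combinations(fragments.items(), 2)
              if k1[0] != k2[0] and jaccard(t1, t2) >= THRESHOLD]

print(candidates)   # payroll_app/f1 and hr_portal/f2 flagged as potential duplication
```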
3.3.5 Business process mining
Business process mining, also called workflow mining, reveals how existing processes work and can provide considerable return on investment (ROI) when used to discover new process efficiencies. In the context of the World Wide Web, business process mining can be defined as the task of extracting useful process-related information from the click stream of users of a website or from the usage logs collected by the Web server. For example, mining of market-basket data to understand shopping behavior is perhaps the most well-known and popular application of Web mining. Similarly, one can gain a better understanding of the shopping process by modeling a customer's browsing behavior as a state transition diagram while he/she shops online. To implement such a system, Web usage logs and click stream data obtained from servers can be transformed into an XML format. These event logs can then be cleaned and the temporal ordering of business processes inferred. One can then combine Web usage mining with Web structure mining. By determining the number of traversals (usage) on each link (structure), one can estimate the transition probabilities between different states. Using these probabilities, entire business process models can be benchmarked and measured for performance increases/decreases. The discovered process model can also be checked for conformance with previously discovered models. Here, an anomaly detection system can also be used to identify deviations in existing business process behavior. Srivastava and Mobasher (1997) give an example of such a state transition diagram modeling a shopping transaction in a website, shown in Fig. 7. One can also analyze ‘‘process outcome’’ data to understand the value of various parts (e.g., states) of the process model (i.e., the impact of various states on the probability of desired/undesired outcomes). The results of such analysis can be used to help develop strategies for increasing (or decreasing) the probabilities of desired outcomes. A possible end objective of this business process mining would be to maximize the probability of reaching the final state while simultaneously maximizing the expected number (or value) of products sold from each visit, to conduct a sensitivity analysis of the state transition probabilities, and to identify appropriate promotion opportunities. Business process mining can also be applied to e-mail traffic, where we can discover how people work and interact with each other in an organization. We can see what kinds of patterns exist in workflow processes and answer questions such as: do people hand over their tasks to others, do they sub-contract, do they work together, or do they work on similar tasks? It thereby helps in determining the process, data, organizational, and social structure.
Fig. 7. State transition diagram modeling a shopping transaction in a website (Srivastava and Mobasher, 1997).
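The transition-probability estimation described above can be illustrated in a few lines; the click sessions and state names below are assumptions for the example.

```python
from collections import Counter, defaultdict

# Hypothetical sessions, each a sequence of shopping states.
sessions = [
    ["home", "browse", "cart", "checkout"],
    ["home", "browse", "browse", "exit"],
    ["home", "search", "browse", "cart", "exit"],
]

transitions = defaultdict(Counter)
for s in sessions:
    for src, dst in zip(s, s[1:]):
        transitions[src][dst] += 1

# Maximum-likelihood transition probabilities: count(src -> dst) / count(src -> *).
probs = {src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
         for src, dsts in transitions.items()}

print(probs["browse"])   # e.g. how often browsing leads to the cart vs. leaving
```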
Sometimes, the information contained in Web server logs is incomplete, noisy, too fine-grained, or specific to an application, which makes preprocessing more difficult and challenging. Further research is needed on extracting business process models from such server logs. By leveraging business process mining properly, businesses can re-engineer their processes by reducing work-in-progress, adding resources to increase capacity, or eliminating or improving the efficiency of bottleneck processes, thereby boosting business performance.
4 Gaps in existing technology
Though Web mining techniques can be extremely useful to businesses, there are gaps which must often be bridged (or dealt with entirely) in order to properly leverage Web mining's potential. In this section, we discuss a few such important gaps and how they can be addressed.
4.1 Lack of data preparation for Web mining
To properly apply Web mining in a production setting (e.g., recommending products to customers), data stored in archival systems must be linked back to online applications. As such, there must be processes in place to clean, transform, and move large segments of data back into a setting where they can be accessed by Web mining applications quickly and continuously. This often means removing extraneous fields and converting textual identifiers (names, products, etc.) into numerical identifiers so that large amounts of transactional data can be processed quickly. For instance, segmenting data into one-month intervals can cut down on the computing resources expended and help ensure that relevant trends are identified by Web mining techniques, provided there is sufficient transactional activity. Additionally, databases that store such intermediate calculations, to reduce repeated computation, have to be developed. Web mining is often computationally expensive, so efforts to maximize efficiency are important.
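A minimal pandas sketch of this kind of preparation is shown below; the column names and sample rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw transaction extract from an archival system.
raw = pd.DataFrame({
    "timestamp": pd.to_datetime(["2008-01-03", "2008-01-17", "2008-02-02"]),
    "customer":  ["alice", "bob", "alice"],
    "product":   ["laptop", "mouse", "laptop_bag"],
    "note":      ["gift wrap", "", "expedite"],   # extraneous field
})

prepared = raw.drop(columns=["note"])                         # remove extraneous fields
prepared["customer_id"] = prepared["customer"].astype("category").cat.codes
prepared["product_id"] = prepared["product"].astype("category").cat.codes
prepared["month"] = prepared["timestamp"].dt.to_period("M")   # one-month segments

# Monthly transaction counts per product, ready for fast downstream mining.
monthly = prepared.groupby(["month", "product_id"]).size()
print(monthly)
```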
4.2 Under-utilization of domain knowledge repositories
Businesses have long made use of domain knowledge repositories to store information about business processes, policies, and projects; if these repositories are to be utilized in a Web mining setting, managing them well becomes all the more important. For instance, corporate intranets provide a wealth of information that is useful for expert-oriented recommendations (e.g., customer service) and duplication reduction, but the repository itself must be kept up-to-date and properly maintained. One of the best ways to ensure an intranet's ‘‘freshness’’ is to maintain it with a content management system (CMS) that allows non-professionals to update the website, distributing the responsibility among internal stakeholders.
4.3 Under-utilization of Web log data
Most companies track the Web browsing behavior of employees by collecting Web logs, mostly for security purposes. However, as previous successful applications of Web mining techniques to such data have shown, companies could use this information to better serve their employees. For example, one of the key issues usually dealt with by a human resources department is keeping employees motivated and retaining them. A common approach is to offer perks and bonuses in various forms to satisfy employees. However, most such policies are ‘‘corporate-centric’’ rather than ‘‘employee-centric.’’ With advances in Web mining techniques, it is now possible to understand employees' interests in a better way. Two kinds of techniques can be employed. The first is to mine the behavior of employees on the company policy and benefits website, in order to understand what employees are looking for. For example, employees browsing retirement-benefits pages could be offered a better retirement package; other examples include a tuition waiver for an employee looking to pursue a professional development course, or a travel package deal for an employee who has shown interest in traveling. A different dimension is to use trend analysis to see what is new and popular in the market, such as a new MP3 player, and offer perks in the form of such products.
Of course, a key issue with such techniques is privacy. Privacy-preserving data mining is currently an active area of research. It has also been shown, from examples such as Amazon, that people are willing to give up a certain level of privacy to gain the benefits offered.
5 Looking ahead: The future of Web mining in business
We believe that the future of Web mining is entwined with the emerging needs of businesses, and with the development of techniques fueled by the recognition of gaps or areas of improvement in existing techniques. This section examines what is on the horizon for Web mining, the nascent areas currently under research, and how they can help in a business computing setting.
5.1 Microformats
It is important not only to present the right content on a website, but also to present it in the right format. For example, a first step in formatting for the Web was the use of HTML, which gave browsers the ability to parse and present text in a more readable and presentable format. However, researchers soon developed formats with higher semantics and presentability, for example, XML and CSS, for more efficient processing of content and extraction of useful information. XML is used to store data in formats such that automatic processing can be done to extract meaningful information (not just for presenting it on a website). Today, the trend is moving more towards ‘‘microformats,’’ which capture the best of XML and CSS. Microformats are design principles for formats, not another new language. They provide a way of thinking about data that gives humans a better understanding of it. They are currently widely used on websites such as blogs. With such new structured data, there arises a need for NLP and Web content mining techniques such as data extraction, information integration, knowledge synthesis, template detection, and page segmentation. This suggests that businesses should decide on the right kind of format to best utilize their data for processing, analysis, and presentation.
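As a small illustration, the sketch below extracts contact data marked up with the hCard microformat (a real, widely used microformat) from a made-up page fragment, using the BeautifulSoup HTML parser.

```python
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

# A blog page fragment marked up with standard hCard microformat classes.
html = """
<div class="vcard">
  <span class="fn">Jane Doe</span>
  <span class="org">Example Corp</span>
  <a class="url" href="http://blog.example.com">blog</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for card in soup.find_all(class_="vcard"):
    contact = {
        "name": card.find(class_="fn").get_text(strip=True),
        "org": card.find(class_="org").get_text(strip=True),
        "url": card.find(class_="url")["href"],
    }
    print(contact)
```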
5.2 Mining and incorporating sentiments
Even though automated conceptual discovery from text is still relatively new, difficult, and imperfect, accurately connecting that knowledge to sentiment information—how someone feels about something—is even harder. Natural language processing techniques, melded with Web mining, hold great promise in this area. To understand how someone feels about a
particular product, brand, or initiative, and to project that level of understanding across all customers would give the business a more accurate representation of what customers think to date. Applied to the Web, one could think of an application that collects such topic/sentiment information from the Internet, and returns that information to a business. Accomplishing this would open up many marketing possibilities.
5.3 e-CRM to p-CRM
Traditionally, brick-and-mortar stores have been organized in a product-oriented manner, with aisles for various product categories. However, the success of e-CRM initiatives in building customer loyalty online has not gone unnoticed by CRM practitioners in the physical world, whose practice we refer to as p-CRM for clarity in this chapter. Additionally, the significance of physical stores has motivated a number of online businesses to open physical stores to serve ‘‘real people’’ (Earle, 2005). Many businesses have also moved from running their online and physical stores separately to integrating both, in order to better serve their customers (Stuart, 2000). Carp (2001) points out that although the online presence of a business does affect its physical division, people still find entertainment value in shopping in malls and other physical stores. Finally, people prefer to get a feel for products before purchase, and hence prefer to go out to shop instead of shopping online. From these observations, it is evident that physical stores will continue to be the preferred means of conducting consumer commerce for quite some time. However, their margins will be under pressure as they must adapt to compete with online stores. These observations led us to posit the following in our previous study (Mane et al., 2005):
Given that detailed knowledge of an individual customer's habits can provide insight into his/her preferences and psychology, which can be used to develop a much higher level of trust in a customer-vendor relationship, the time is ripe for revisiting p-CRM to see what lessons learned from e-CRM are applicable.
Until recently, a significant roadblock to achieving this vision has been the difficulty of collecting and analyzing detailed customer data in the physical world, both from a cost and a customer-sensitivity perspective, as Underhill's seminal study (Underhill, 1999) showed. With advancements in pervasive computing technologies such as mobile Internet access, third-generation wireless communication, RFIDs, handheld devices, and Bluetooth, there has been a significant increase in the ability to collect detailed customer data. This raises the possibility of bringing e-CRM-style real-time, personalized customer relationship functions to the physical world. For a more detailed study on this, refer to our previous work (Mane et al., 2005).
5.4 Other directions
We have mentioned some of the key issues that should be noted by businesses as they proceed to adopt Web mining techniques to improve their business intelligence. However, as noted earlier, this is by no means an exhaustive list. There are various other issues that need to be addressed from a technical perspective in order to determine the framework necessary to make these techniques more widely applicable to businesses. For example, there are a host of open research areas in Web mining, such as the extraction of structured data from unstructured data, the ranking of Web pages by integrating semantic relationships between documents, and the automatic derivation of user sentiment. Businesses must also focus on the types of data that need to be collected for many Web usage mining techniques to be possible. The design of website content also plays a crucial role in deciding what kinds of data can be collected. For example, one viewpoint is that pages with Flash-based content, though attractive, are more broadcast in nature and do not easily facilitate the collection of information about customer behavior. However, recent advances in technologies such as AJAX, which enhance customer/website interaction, not only allow corporations to collect data, but also give the customer a ‘‘sense of control,’’ leading to an enriched user experience.
6 Conclusion
This chapter examines how technology, such as Web mining, can aid businesses in gaining extra information and intelligence. We provide an introduction to Web mining and the various techniques associated with it, and briefly update the reader on state-of-the-art research in this area. We then show how this class of techniques can be used effectively to aid various business functions and provide example applications to illustrate their applicability. These examples provide evidence of Web mining's potential, as well as its existing successes, in improving business intelligence. Finally, we point out gaps in existing technologies and elaborate on future directions that should be of interest to the business community at large. In doing so, we also note that we have intentionally left out specific technical details of existing and future work, given the introductory nature of this chapter.
Acknowledgments
This work was supported in part by AHPCRC contract number DAAD19-01-2-0014, by NSF Grant ISS-0308264 and by ARDA grant F30602-03-C-0243. This work does not necessarily reflect the position or policy of the government and no official endorsement should be inferred.
We would like to thank the Data Mining Research Group at the University of Minnesota for providing valuable feedback.
References

Adomavicius, G., A. Tuzhilin (2005). Towards the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17, 734–749.
Arsham, H. (2006). Time-critical decision making for business administration. Available at http://home.ubalt.edu/ntsbarsh/stat-data/Forecast.htm. Retrieved in 2006.
Bennet, J. (2007). The Cinematch system: Operation, scale coverage, accuracy impact. Available at http://blog.recommenders06.com/wp-content/uploads/2006/09/bennett.pdf. Retrieved on July 30.
Bose, A., K. Beemanapalli, J. Srivastava, S. Sahar (2006). Incorporating concept hierarchies into usage-based recommendations, in: Proceedings of WebKDD, Philadelphia, PA, USA.
Box, G.E., G.M. Jenkins (1994). Time Series Analysis: Forecasting and Control. 3rd ed. Prentice Hall PTR.
Carp, J. (2001). Clicks vs. bricks: Internet sales affect retail properties. Houston Business Journal.
Cooley, R., B. Mobasher, J. Srivastava (1997). Web mining: Information and pattern discovery on the World Wide Web, in: 9th IEEE ICTAI.
Cooley, R., B. Mobasher, J. Srivastava (1999). Data preparation for mining World Wide Web browsing patterns. Knowledge and Information Systems 1(1), 5–32.
DeLong, C., P. Desikan, J. Srivastava (2005). USER (User Sensitive Expert Recommendation): What non-experts NEED to know, in: Proceedings of WebKDD, Chicago, IL.
Desikan, P., J. Srivastava, V. Kumar, P.N. Tan (2002). Hyperlink analysis: Techniques and applications. Technical Report 2002-0152, Army High Performance Computing and Research Center.
Earle, S. (2005). From clicks to bricks . . . online retailers coming back down to earth. Feature story. Available at http://www.specialtyretail.net/issues/december00/feature_bricks.htm. Retrieved in 2005.
Etzioni, O. (1996). The World Wide Web: Quagmire or gold mine? Communications of the ACM 39(11), 65–68.
Franses, Ph.H.B.F. (1994). Modeling new product sales: An application of co-integration analysis. International Journal of Research in Marketing.
Kapser, C., M.W. Godfrey (2003). Toward a taxonomy of clones in source code: A case study, in: International Workshop on Evolution of Large-scale Industrial Software Applications, Amsterdam, The Netherlands.
Keogh, E. (2004). Data mining in time series databases tutorial, in: Proceedings of the IEEE International Conference on Data Mining.
Kleinberg, J.M. (1998). Authoritative sources in a hyperlinked environment, in: 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 668–677.
Kosala, R., H. Blockeel (2000). Web mining research: A survey. SIGKDD Explorations 2(1), 1–15.
Liu, B., K.C.C. Chang (2004). Editorial: Special issue on Web content mining. SIGKDD Explorations, special issue on Web content mining.
Mane, S., P. Desikan, J. Srivastava (2005). From clicks to bricks: CRM lessons from e-commerce. Technical Report 05-033, Department of Computer Science, University of Minnesota, Minneapolis, USA.
Mobasher, B. (2005). Web usage mining and personalization, in: M.P. Singh (ed.), Practical Handbook of Internet Computing. CRC Press.
Mobasher, B., R. Cooley, J. Srivastava (2000). Automatic personalization based on Web usage mining. Communications of the ACM.
Page, L., S. Brin, R. Motwani, T. Winograd (1998). The PageRank citation ranking: Bringing order to the Web. Stanford Digital Library Technologies.
Perkowitz, M., O. Etzioni (1997). Adaptive Web sites: An AI challenge, in: IJCAI.
Pirolli, P., J.E. Pitkow (1999). Distribution of surfers' paths through the World Wide Web: Empirical characterization. World Wide Web 1, 1–17.
Rajapakse, D.C., S. Jarzabek (2005). An investigation of cloning in Web applications, in: Fifth International Conference on Web Engineering, Sydney, Australia.
Sarukkai, R.R. (1999). Link prediction and path analysis using Markov chains, in: Proceedings of the 9th World Wide Web Conference.
Srivastava, J., P. Desikan, V. Kumar (2004). Web mining: Concepts, applications and research directions, in: Data Mining: Next Generation Challenges and Future Directions. MIT/AAAI.
Srivastava, J., B. Mobasher (1997). Panel discussion on ‘‘Web Mining: Hype or Reality?’’, in: ICTAI.
Stuart, A. (2000). Clicks and bricks. CIO Magazine.
Underhill, P. (1999). Why We Buy: The Science of Shopping. Simon and Schuster, New York.
Zhu, J., J. Hong, J.G. Hughes (2002). Using Markov chains for link prediction in adaptive Web sites, in: Proceedings of ACM SIGWEB Hypertext.
Chapter 3
Current Issues in Keyword Auctions
De Liu
455Y Gatton College of Business and Economics, University of Kentucky, Lexington, KY 40506, USA
Jianqing Chen
Haskayne School of Business, The University of Calgary, Calgary, AB T2N 1N4, Canada
Andrew B. Whinston
McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, USA
Abstract
Search engines developed a unique advertising model a decade ago that matches online users with short-text advertisements based on the users' search keywords. These keyword-based advertisements, also known as ‘‘sponsored links,’’ are the flagship of today's thriving Internet advertising business. Relatively unknown to online users, however, is the fact that slots for search engine advertisements are sold through a special kind of auction, which we call ‘‘keyword auctions.’’ As the most successful online auctions since eBay's business-to-consumer auctions, keyword auctions form the backbone of the multibillion-dollar search advertising industry. Owing to their newness and significance, keyword auctions have captured the attention of researchers from information systems, computer science, and economics. Many questions have been raised, including how best to characterize keyword auctions, why keyword auctions, rather than other selling mechanisms, are used, and how to design keyword auctions optimally. The purpose of this chapter is to summarize the current efforts in addressing these questions. In doing so, we highlight the last question, that is, how to design effective auctions for allocating keyword advertising resources. As keyword auctions are still new, many issues remain outstanding; we point out several such issues for future research, including the click-fraud problem associated with keyword auctions.
1 Introduction
Keyword advertising is a form of targeted online advertising. A basic variation of keyword advertising is ‘‘sponsored links’’ (also known as ‘‘sponsored results’’ and ‘‘sponsored search’’) on search engines. Sponsored links are advertisements triggered by search phrases entered by Internet users on search engines. For example, a search for ‘‘laptop’’ on Google will bring up both the regular search results and advertisements from laptop makers and sellers. Figure 1 shows such a search-result page with sponsored links at the top and on the side of the page. Another variation of keyword advertising is ‘‘contextual advertising’’ on content pages. Unlike sponsored links, contextual advertisements are triggered by certain keywords in the content. For example, a news article about ‘‘Cisco’’ is likely to be displayed with contextual advertisements from Cisco network equipment sellers and Cisco training providers. Both sponsored links and contextual advertisements can target online users who are most likely interested in seeing the advertisements. Because of its superior targeting ability, keyword advertising has quickly gained popularity among marketers, and has become a leading form of online advertising. According to a report by the Interactive Advertising Bureau (2007) and PricewaterhouseCoopers, keyword advertising in the United States reached $6.8 billion in total revenue in 2006. eMarketer (2007) predicts the market for online advertising will rise from $16.9 billion in 2006 to $42 billion in 2011, with keyword advertising accounting for about 40% of the total revenue.
Fig. 1. Search-based keyword advertising.
Fig. 2. Google's Adwords and AdSense programs.
A typical keyword advertising market consists of advertisers and publishers (i.e., websites), with keyword advertising providers (KAPs) in between. There are three key KAPs in the U.S. keyword advertising market: Google, Yahoo!, and MSN adCenter. Figure 2 illustrates Google's keyword-advertising business model. Google has two main advertising programs, Adwords and AdSense. Adwords is Google's flagship advertising program that interfaces with advertisers. Through Adwords, advertisers can submit advertisements, choose keywords relevant to their businesses, and pay for the cost of their advertising campaigns. Adwords has separate programs for sponsored search (Adwords for search) and for contextual advertising (Adwords for content). In each case, advertisers can choose to place their advertisements on Google's site only or on publishers' sites that are part of Google's advertising network. Advertisers can also choose to display text, image, or, more recently, video advertisements. AdSense is another Google advertising program that interfaces with publishers. Publishers from personal blogs to large portals such as CNN.com can participate in Google's AdSense program to monetize the traffic to their websites. By signing up with AdSense, publishers agree to publish advertisements and receive payments from Google. Publishers may choose to display text, image, and video advertisements on their sites. They receive payments from Google on either a per-click or per-thousand-impressions basis.1 AdSense has become the single most important revenue source for many Web 2.0 companies.
1 Google is also beta-testing a per-action-based service in which a publisher is paid each time a user carries out a certain action (e.g., a purchase).
This chapter focuses on keyword auctions, which are used by KAPs in selling their keyword advertising slots to advertisers. A basic form of
keyword auction is as follows. Advertisers choose their willingness-to-pay for a keyword phrase either on a per-click (pay-per-click) or on a per-impression (pay-per-impression) basis. An automated program ranks advertisers and assigns them to available slots whenever a user searches for the keyword or browses a content page deemed relevant to the keyword. The ranking may be based on advertisers' pay-per-click/pay-per-impression only. It may also include other information, such as their historical click-through rate (CTR), namely the ratio of the number of clicks on the advertisement to the number of times the advertisement is displayed. Almost all major KAPs use automated bidding systems, but their specific designs differ from each other and change over time. Keyword auctions are another multibillion-dollar application of auctions in electronic markets since the celebrated eBay-like business-to-consumer auctions. Inevitably, keyword auctions have recently gained attention among researchers. Questions have been raised regarding what a keyword auction is, why keyword auctions should be used, and how keyword auctions should be designed. Some of these questions have been addressed over time, but many are still open. The purpose of this chapter is to summarize the current efforts in addressing these questions. In doing so, we focus mainly on the third question, that is, how to design effective auctions for allocating keyword advertising resources. We also point out several issues for future research. We will examine keyword auctions from a theoretical point of view. The benefits of conducting a rigorous theoretical analysis of real-world keyword auctions are two-fold. On one hand, we hope to learn what makes this new auction format popular and successful. On the other hand, by conducting a theoretical analysis of keyword auctions, we may be able to recommend changes to the existing designs. The rest of the chapter is organized as follows. Next, we discuss the research context by briefly reviewing the history of keyword advertising and keyword auctions. In Section 3, we introduce a few popular models of keyword auctions. In Sections 4 and 5, we focus on two design issues in keyword auctions, namely, how to rank advertisers and how to package advertising resources. In Section 6, we discuss a threat to the current keyword-advertising model—click fraud. We conclude this chapter in Section 7.
2 A historical look at keyword auctions
Keyword advertising and keyword auctions were born out of practice. They were fashioned to replace earlier, less efficient market mechanisms and are still being shaped by the accumulated experience of the keyword advertising industry. In this section, we chronicle the design of keyword advertising markets and keyword auctions, and show how they have evolved.
2.1 Early Internet advertising contracts
In early online advertising, advertising space was sold through advance contracts. These contracts were negotiated on a case-by-case basis. As such negotiations were time-consuming, advertising sales were limited to large advertisers (e.g., those paying at least a few thousand dollars per month). These advertising contracts were typically priced in terms of the number of thousand-page-impressions (cost-per-mille, or CPM). CPM pricing was borrowed directly from off-line advertising, such as TV, radio, and print, where advertising costs are measured on a CPM basis. The problem with CPM pricing is that it provides no indication as to whether users have paid attention to the advertisement. Advertisers may be concerned that their advertisements are pushed to web users without necessarily generating any impact. The lack of accountability is reflected in the saying among marketing professionals: ‘‘I know that I waste half of my advertising budget. The problem is I don't know which half.’’
2.2 Keyword auctions by GoTo.com
In 1998, a startup company called GoTo.com demonstrated a new proof-of-concept search engine at a technology conference in Monterey, California. At that time, all other search engines sorted search results based purely on algorithm-assessed relevancy. GoTo.com, on the other hand, devised a plan to let advertisers bid on top positions of the search results. Specifically, advertisers can submit their advertisements on chosen words or phrases (‘‘search terms’’) together with their pay-per-click on these advertisements. Once the submitted advertisements are validated by GoTo.com's editorial team, they will appear among the search results. The highest-bidding advertiser will appear at the top of the result list, the second-highest advertiser will appear at the second place of the result list, and so on. Each time a user clicks on an advertisement, the advertiser will be billed the amount of the bid. GoTo.com's advertising model contains several key innovations. First, advertising based on user-entered search terms represents a new form of targeted advertising that is based on users' behavior. For example, a user who searches for ‘‘laptop’’ is highly likely to be in the process of buying a laptop. Keyword-based search engine advertising opened a new era of behaviorally targeted advertising. Second, by billing advertisers only when users click on the advertisements, GoTo.com provided a partial solution to the longstanding issue of lack of accountability. Clicking on an advertisement indicates online users' interest. Therefore, pay-per-click represents a significant step toward more accountable advertising. The ability to track behavioral outcomes such as clicks is a crucial difference between online advertising and its off-line counterparts. The act
of clicking on an advertisement provides an important clue on advertising effectiveness. Accumulated information on clicking behavior can be further used to fine-tune advertisement placement and content. In such a sense, pay-per-click is a significant leap from the CPM scheme and signifies the huge potential of online advertising. Finally, the practice of using auctions to sell advertising slots on a continuous, real-time basis is another innovation. These real-time auctions allow advertisements to go online a few minutes after a successful bidding. As there is no pre-set minimum spending, auction-based advertising has the advantage of tapping into the ‘‘long tail’’ of the advertising market, that is, advertisers who have small spending budgets and are more likely to ‘‘do-it-yourself.’’ GoTo.com was re-branded as Overture Services in 2001 and acquired by Yahoo! in 2003. During the process, however, the auction mechanism and the pay-per-click pricing scheme remained largely unchanged.
2.3 Subsequent innovations by Google
Google, among others, made several key innovations to the keyword advertising business model. Some of these have become standard features of today's keyword advertising. In the following, we briefly review these innovations.
2.3.1 Content vs. advertising
The initial design by GoTo.com features a ‘‘paid placement’’ model: paid advertising links are mixed with organic search results so that users cannot tell whether a link is paid for. Google, when introducing its own keyword advertising in 1998, promoted a ‘‘sponsored link’’ model that distinguished advertisements from organic search results. In Google's design, advertisements are displayed on the side or on top of the result page with a label ‘‘sponsored links.’’ Google's approach has been welcomed by the industry and policy-makers and has now become standard practice.
2.3.2 Allocation rules
Google introduced a new allocation rule in 2002 in its ‘‘Adwords Select’’ program, in which listings are ranked not only by bid amount but also by CTR (later termed ‘‘quality score’’). Under such a ranking rule, paying a high price alone cannot guarantee a high position. An advertiser with a low CTR will get a lower position than advertisers who bid the same (or slightly lower) but have higher CTRs. In 2006, Google revised its quality score calculation to include not only advertisers' past CTRs but also the quality of their landing pages. Advertisers with low quality scores are required to pay a high minimum bid or they become inactive.
Google's approach to allocation gradually gained acceptance. At the beginning of 2007, Yahoo! conducted a major overhaul of its online advertising platform that considers both the CTRs of an advertisement and other undisclosed factors. Microsoft adCenter, which came into use only at the beginning of 2006, used a ranking rule similar to Google's Adwords. Before that, all of the advertisements displayed on the MSN search engine were supplied by Yahoo!
2.3.3 Payment rules
In the keyword auctions used by GoTo.com, bidders pay the amount of their bids. This way, any decrease in one's bid will result in a lower payment. As a result, bidders have incentives to monitor the next-highest bids and make sure their own bids are only slightly higher. The benefits from constantly adjusting one's bid create undesirable volatility in the bidding process. Perhaps as a remedy, Google used a different payment rule in its Adwords Select program. In Adwords Select, bidders do not pay the full amount of their bids. Instead, they pay the lowest amount possible to remain above the next-highest competitor. If the next-highest competitor's bid drops, Google automatically adjusts the advertiser's payment downward. This feature, termed the ‘‘Adwords Discounter,’’ is essentially an implementation of second-price auctions in a dynamic context. One key advantage of such an arrangement is that bidders' payments are no longer directly linked to their bids. This reduces bidders' incentive to game the system. Recognizing this advantage, Yahoo! (Overture) also switched to a similar payment rule. We discuss the implications of different payment rules further in Section 3.
2.3.4 Pricing schemes
As of now, Google's Adwords for search offers only pay-per-click advertising. On the other hand, Adwords for content allows advertisers to bid either pay-per-click or pay-per-thousand-impressions. Starting in spring 2007, Google began beta-testing a new billing metric called pay-per-action with its Adwords for content. Under the pay-per-action metric, advertisers pay only for completed actions of choice, such as a lead, a sale, or a page view, after a user has followed the advertisement through to the publisher's website.
2.4 Beyond search engine advertising
The idea of using keywords to place the most relevant advertisements is not limited to search engine advertising. In 2003, Google introduced an ‘‘AdSense’’ program that allows web publishers to generate advertising revenue by receiving advertisements served by Google. AdSense analyzes publishers' web pages to generate a list of the most relevant keywords, which are subsequently used to select the most appropriate advertisements for these pages.
Fig. 3. Context-based keyword advertising.
Figure 3 shows an example of contextual advertising in Gmail. The order of advertisements supplied to a page is determined by Adwords auctions. The proceeds of these advertisements are shared between Google and the web publishers. Yahoo! has a similar program called the Yahoo! Publisher Network. KAPs have also actively sought expansion to domains other than Internet advertising, such as mobile devices, video, print, and TV advertising. Google experimented with classified advertising in the Chicago Sun-Times as early as fall 2005. In February 2006, Google announced a deal with global operator Vodafone to include its search engine on the Vodafone Live! mobile Internet service. In April 2007, Google struck a deal with radio broadcaster Clear Channel to start supplying less than 5% of the advertising inventory across Clear Channel's 600+ radio stations. During the same month, Google signed a multiyear contract with satellite-TV provider EchoStar to sell TV advertisement spots on EchoStar's Dish service through auctions.
3 Models of keyword auctions
In this section we discuss several models of keyword auctions. The purpose of these models is not to propose new auction designs for keyword-advertising
settings but to capture the essence of keyword auctions accurately. The value of these models lies in that they allow real-world keyword auctions to be analyzed in a simplified theoretical framework. We start by describing the problem setting. There are n advertisers bidding for m (m ≤ n) slots on a specific keyword phrase. Let c_{ij} denote the number of clicks generated by advertiser i on slot j. In general, c_{ij} depends both on the relevance of the advertisement and on the prominence of the slot. In light of this, we decompose c_{ij} into an advertiser (advertisement) factor q_i and a slot factor d_j:

c_{ij} = d_j q_i    (1)
We interpret the advertiser factor q_i as advertiser i's CTR. For example, everything else being equal, a brand-name advertiser may attract more clicks and thus have a higher CTR than a non-brand-name advertiser. We interpret the slot factor d_j as the click potential of the slot. For example, a slot at the top of a page has a higher click potential than a slot at the bottom of the same page. Each advertiser has a valuation-per-click v_i. As in most research, we assume that advertisers know their own valuation-per-click; in reality, though, advertisers may have to learn their valuation-per-click over time from the outcomes of their keyword advertising. Each advertiser submits a bid b that is the advertiser's maximum willingness-to-pay per click for the keyword phrase. Each time a user initiates a search for the keyword phrase or requests a content page related to the keyword phrase, the auctioneer (KAP) examines the bids from all participating advertisers and determines which advertisements should be displayed and in which order according to an allocation rule. If a user clicks on a particular advertisement, the advertiser is charged a price determined by the payment rule of the keyword auction (which we explain shortly). The allocation rule and the payment rule used in keyword auctions differ across KAPs. For example, until recently, Yahoo! ranked advertisers strictly by the prices they bid, and advertisers paid the amount they bid. On the other hand, Google ranks advertisers based on both their prices and their CTRs, and advertisers pay the lowest price that keeps them above the next highest-ranked advertiser. We distinguish the following models of keyword auctions by the different allocation or payment rules used.
3.1 Generalized first-price auction
In the early days of keyword auctions, bidders paid the price they bid. Such a format is termed a ‘‘generalized first-price (GFP)’’ auction because it essentially extends the first-price auction to a multiple-object
setting. However, people soon discovered that GFP auctions could be unstable in a dynamic environment where bidders can observe and react to other bidders' latest bids as often as they like. For instance, assume that there are two slots and two advertisers, 1 and 2, with valuations per click of $2 and $1, respectively. Assume the minimum bid is $0.10 and slot 1 generates twice the number of clicks that slot 2 generates. Obviously, it is best for advertiser 1 to bid one cent higher than advertiser 2. It is also best for advertiser 2 to bid one cent higher than advertiser 1 until advertiser 1's bid reaches $0.55 or higher, at which point advertiser 2 is better off bidding just the minimum bid of $0.10. The two advertisers thus form a bidding cycle that escalates from the minimum bid up to $0.55 and then starts over. Zhang and Feng (2005) and Asdemir (2005) show that the cyclic bidding pattern illustrated here existed in Overture's auctions. Cyclic bidding is harmful in three ways. First, the frequent revision of bids requires additional computing resources that can slow down the entire auction system. Second, as shown by Zhang and Feng (2005), the oscillation of prices caused by the bidding cycle can dramatically reduce the KAP's revenue. Third, GFP auctions are biased toward bidders who can attend to and revise their bids more often. Such a bias may be perceived as unfair.
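The bidding cycle just described can be reproduced with a small best-response simulation. The sketch below is illustrative only: the valuations, minimum bid, and 2:1 click ratio follow the example in the text, while the one-cent increment and the alternating timing of bid revisions are our own simplifying assumptions.

# Best-response dynamics for the two-advertiser GFP example above.
# Assumptions beyond the text: bids move in one-cent steps and the two
# advertisers revise their bids in alternating turns.

MIN_BID, STEP = 0.10, 0.01
CLICKS = {1: 2.0, 2: 1.0}          # slot 1 draws twice the clicks of slot 2
VALUES = {1: 2.00, 2: 1.00}        # valuation per click of advertisers 1 and 2

def best_response(value, rival_bid):
    """Best GFP bid (pay-your-own-bid) against the rival's standing bid."""
    win_top = CLICKS[1] * (value - (rival_bid + STEP))   # outbid by one cent
    take_bottom = CLICKS[2] * (value - MIN_BID)          # settle for slot 2
    return round(rival_bid + STEP, 2) if win_top > take_bottom else MIN_BID

bids = {1: MIN_BID, 2: MIN_BID}
for t in range(100):
    mover = 1 if t % 2 == 0 else 2
    rival = 2 if mover == 1 else 1
    bids[mover] = best_response(VALUES[mover], bids[rival])
    print(t, bids)   # bids climb penny by penny to about $0.55, collapse, and repeat

Running the loop shows the escalation-and-collapse pattern that Zhang and Feng (2005) and Asdemir (2005) document empirically.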
3.2 Generalized second-price auction

Edelman et al. (2007) and Varian (2007) study a generalized second-price (GSP) auction that captures Google's payment rule. In GSP auctions, advertisers are arranged in descending order by their pay-per-click bids. The highest-ranked advertiser pays a price that equals the bid of the second-ranked advertiser plus a small increment; the second-ranked advertiser pays a price that equals the bid of the third-ranked advertiser plus a small increment, and so on. For example, suppose there are two slots and three advertisers {1, 2, 3} who bid $1, $0.80, and $0.75, respectively. Under the GSP rule, advertiser 1 wins the first slot, and advertiser 2 wins the second slot. Assuming that the minimum increment is negligible, advertisers 1 and 2 should pay $0.80 and $0.75 per click, respectively (Table 1).

Table 1
Payments under generalized second-price auctions

Advertiser   Bid ($)   Slot assigned   Pay-per-click ($)
1            1.00      1               0.80
2            0.80      2               0.75
3            0.75      –               –
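A minimal sketch of the GSP payment rule applied to the Table 1 example (the minimum increment is ignored, as in the text):

# GSP: rank advertisers by bid; each winner pays the bid of the advertiser
# ranked immediately below it.

bids = {1: 1.00, 2: 0.80, 3: 0.75}   # advertiser -> pay-per-click bid
num_slots = 2

ranking = sorted(bids, key=bids.get, reverse=True)
for rank, adv in enumerate(ranking[:num_slots]):
    price = bids[ranking[rank + 1]]  # next-ranked advertiser's bid
    print(f"advertiser {adv}: slot {rank + 1}, pays ${price:.2f} per click")
# advertiser 1: slot 1, pays $0.80 per click
# advertiser 2: slot 2, pays $0.75 per click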
Table 2
Payments under the VCG mechanism

Advertiser   Bid ($)   Slot assigned   Pay-per-click ($)
1            1.00      1               0.775
2            0.80      2               0.75
3            0.75      –               –
One notable feature of GSP auctions is that advertisers' payments are not directly affected by their own bids. This feature is also present in the well-known Vickrey–Clarke–Groves (VCG) mechanism. Under the VCG mechanism, each player's payment equals the opportunity cost that the player imposes on other players. To illustrate, in the earlier example, the VCG payment of advertiser 1 should equal the reduction in advertisers 2 and 3's total valuation caused by advertiser 1's participation. Let us assume that all advertisers have the same CTR (normalized to 1), and that all bids in Table 2 reflect advertisers' true valuation-per-click. Let us also assume that the first slot has a (normalized) click potential of 1, and the second slot has a click potential of 0.5. Without advertiser 1, advertisers 2 and 3 would be assigned to the two slots, generating a total valuation of 0.8 × 1 + 0.75 × 0.5 = 1.175. With advertiser 1, advertiser 2 is assigned to the second slot and advertiser 3 is not assigned a slot, so advertisers 2 and 3 generate a total valuation of 0.8 × 0.5 = 0.4. The VCG mechanism therefore suggests that advertiser 1 should pay (1.175 − 0.4)/1 = 0.775 per click. Similarly, we can calculate the VCG payment for advertiser 2. Table 2 illustrates the slot allocation and payments under the VCG rule. Clearly, GSP is not a VCG mechanism. Advertisers (except the lowest-ranked winner) pay higher prices under GSP than under VCG, provided that they bid the same prices.2 Edelman et al. (2007) show that GSP has no dominant-strategy equilibrium and that truth-telling is not an equilibrium. However, the corresponding generalized English auction has a unique equilibrium, and in that equilibrium bidders' strategies are independent of their beliefs about other bidders' types. These findings suggest that GSP auctions do offer a certain degree of robustness against opportunism and instability. The above results are obtained under the somewhat restrictive assumption that bidders differ on a single dimension (valuation per click). In reality, advertisers differ at least in both valuation per click and CTR. This fact has motivated Google, and later Yahoo! and MSN adCenter, to rank advertisers based on both bid prices and CTRs. In this sense, GSP is accurate about Google's payment rule but not its allocation rule. In the next subsection, we discuss an auction framework that captures the allocation rules of keyword auctions.

2 This is not to say that GSP generates higher revenue than VCG, because advertisers may bid differently under the two mechanisms.
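The VCG payments in Table 2 can be verified with a short computation under the same assumptions (CTRs normalized to 1, click potentials 1 and 0.5, and bids equal to true valuations per click); this is an illustrative sketch of the example, not a general VCG implementation.

VALUES = {1: 1.00, 2: 0.80, 3: 0.75}   # valuation per click (all CTRs = 1)
D = [1.0, 0.5]                          # click potentials of slots 1 and 2

def value_to_others(excluding, participants):
    """Total valuation of advertisers other than `excluding` when
    `participants` are ranked by value and matched to slots top-down."""
    ranked = sorted(participants, key=VALUES.get, reverse=True)
    return sum(d * VALUES[a] for d, a in zip(D, ranked) if a != excluding)

def vcg_price_per_click(adv):
    with_adv = value_to_others(adv, list(VALUES))                    # adv participates
    without_adv = value_to_others(adv, [a for a in VALUES if a != adv])
    ranked_all = sorted(VALUES, key=VALUES.get, reverse=True)
    clicks = D[ranked_all.index(adv)]                                 # clicks adv receives
    return (without_adv - with_adv) / clicks

print(round(vcg_price_per_click(1), 3))   # 0.775, as in Table 2
print(round(vcg_price_per_click(2), 3))   # 0.75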
3.3 Weighted unit–price auction

The weighted unit–price auction (WUPA) has been studied in Liu and Chen (2006) and Liu et al. (2009). The WUPA is motivated by the fact that Google allocates slots based on a score that is a function of advertisers' unit–price bids. While Google does not fully disclose its scoring formula, Search Engine Watch reported that the formula used by Google is (Sullivan, 2002)

Score = willingness-to-pay per click × CTR    (2)
In 2006, Google updated its ranking rule by replacing CTR in the above formula with a more comprehensive ''quality score,'' which takes into account advertisers' CTRs as well as other information such as the quality of their landing pages. Under the updated ranking rule, CTR remains a primary component of an advertiser's ''quality score.'' The idea of using a scoring rule to synthesize multidimensional criteria is not new. ''Scoring auctions'' have been used in procurement settings (Asker and Cantillon, 2008; Che, 1993), where suppliers (bidders) submit multidimensional bids, such as price p and quality q, and are ranked by a scoring function of the form s(p, q) = f(q) − p. A score rule is also used in ''tender auctions'' (Ewerhart and Fieseler, 2003), where a buyer (the auctioneer) requests suppliers to bid a unit price for each input factor (e.g., labor and materials) and ranks suppliers by the weighted sum of their unit–price bids. However, a weighted unit–price score rule had never been used on such a large scale, and the scoring rule used in keyword auctions also differs from those in procurement and tender auctions. Therefore, the weighted unit–price auction as a general scoring auction format had not been studied previously. The specifics of the WUPA framework are as follows. The auctioneer assigns each advertiser a score s based on the advertiser's bid and information on the advertiser's future CTR:

s = wb    (3)
where w is a weighting factor assigned to the advertiser based on their expected future CTRs. The auctioneer allocates the first slot to the advertiser with the highest score, the second slot to the advertiser with the second-highest score, and so on. As with the price-only allocation rule, WUPA can also have ‘‘first-score’’ and ‘‘second-score’’ formats. Under the ‘‘first-score’’ rule, each advertiser pays a price that ‘‘fulfills’’ the advertiser’s score. This is equivalent to saying that advertisers need to pay the prices they bid. Under the ‘‘second-score’’ payment rule, each advertiser pays the lowest price that keeps the advertiser above the next highest advertiser’s score. For example, suppose there are only two types of expected CTRs, high and low. Suppose also the weighting factor for high-CTR advertisers is 1 and for low-CTR advertisers is 0.5.
Table 3
Payments under first- and second-score WUPAs

Advertiser   Bid ($)   CTR    Score   Slot assigned   Pay-per-click (first-score)   Pay-per-click (second-score)
1            1.00      Low    0.50    –               –                             –
2            0.80      High   0.80    1               0.80                          0.75
3            0.75      High   0.75    2               0.75                          0.50
Continuing with the earlier examples, we assume the expected CTRs of advertisers 1, 2, and 3 are low, high, and high, respectively. Table 3 illustrates the winning advertisers under the WUPA and the price they pay per click under the first-score and second-score rules, respectively. Liu et al. (2009) show that in an incomplete-information setting (i.e., advertisers do not know other advertisers' valuation-per-click or expected CTRs), the first-score and second-score WUPAs are equivalent in expected revenue. The first-score WUPAs have a unique Bayesian–Nash equilibrium, and the equilibrium can be explicitly solved. As in GSP auctions, the second-score WUPAs do not have a truth-telling equilibrium except when there is only one slot. In Section 4, we discuss the implications of different ranking rules.
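The scores and payments in Table 3 follow mechanically from the rules just described; a small sketch, using the weights (1 and 0.5) and bids from the example:

BIDS    = {1: 1.00, 2: 0.80, 3: 0.75}
WEIGHTS = {1: 0.5, 2: 1.0, 3: 1.0}     # advertiser 1 has a low expected CTR
NUM_SLOTS = 2

scores  = {a: WEIGHTS[a] * BIDS[a] for a in BIDS}        # s = w * b
ranking = sorted(scores, key=scores.get, reverse=True)    # -> [2, 3, 1]

for rank, adv in enumerate(ranking[:NUM_SLOTS]):
    first_score  = BIDS[adv]                              # pay what you bid
    next_score   = scores[ranking[rank + 1]]
    second_score = next_score / WEIGHTS[adv]              # lowest bid that still keeps the rank
    print(adv, rank + 1, first_score, round(second_score, 2))
# 2 1 0.8 0.75
# 3 2 0.75 0.5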
4 How to rank advertisers

Yahoo is strictly capitalistic—pay more and you are number one. Google has more socialist tendencies. They like to give their users a vote.
—Dana Todd, SiteLab International
This quote summarizes an interesting dichotomy in Yahoo! and Google's approaches to advertiser ranking. Yahoo! (Overture), the pioneer in keyword auctions, ranked advertisers purely based on their willingness-to-pay. On the other hand, Google, the now-leading player, invented a design that ranks advertisers by the product of the per-click prices they bid and their historical CTRs. What exactly is the significance of different ranking approaches? Vise and Malseed (2005), authors of The Google Story, noted that Google displays a ''socialist tendency'' because in Google's approach, advertisements that Internet users frequently click on are more likely to show up in top positions. Authors from academia, on the other hand, have searched for answers along the lines of revenue generation and resource-allocation efficiency. A few authors, such as Feng et al. (2007) and Lahaie (2006), studied Google's and Yahoo!'s approaches strictly as a ranking issue. Liu and Chen (2006) embedded the ranking problem in a larger issue of how to use bidders' past-performance information. After all, what
Google uses is the information on advertisers' past CTRs, which essentially signals their abilities to generate clicks in the future. In fact, KAPs can also impose differentiated minimum bids for advertisers with different historical CTRs, which is what Google is doing now. This latter use of past-performance information is studied in Liu et al. (2009). Three main questions are associated with different ranking rules. What is the impact of adopting different ranking rules on advertisers' equilibrium bidding? On the KAP's revenue? And on resource-allocation efficiency? The revenue and efficiency criteria may matter at different stages of the keyword advertising industry. At the developing stage of the keyword advertising market, it is often sensible for KAPs to use efficient designs to maximize the ''total pie.'' After all, if advertisers see high returns from their initial use, they are likely to allocate more budget to keyword advertising in the future. On the other hand, as the keyword advertising market matures and market shares stabilize, KAPs will more likely focus on revenue. Several authors in economics, information systems, and computer science have approached these questions. Feng et al. (2007) are the earliest to formally compare the ranking rules of Google and Yahoo!. One focus of their study is the correlation between advertisers' pay-per-click and the relevance of their advertisements to the keywords. With numerical simulation, they find that Google's ranking mechanism performs well and robustly across varying degrees of correlation, while Yahoo!'s performs well only if pay-per-click and relevance are positively correlated. Those observations are sensible. Intuitively, an advertiser's contribution to the KAP's revenue is jointly determined by the advertiser's pay-per-click and the number of clicks the advertiser can generate (i.e., relevance). When pay-per-click is negatively correlated with relevance, ranking purely based on pay-per-click tends to select advertisers with low revenue contribution, which can result in a revenue loss for KAPs. However, their study has certain limitations. Instead of solving the auction model, they simplify by assuming that bidders will bid truthfully under Google's mechanism. Lahaie (2006) compares Google's and Yahoo!'s ranking rules based on an explicit solution to the auction-theoretic model. He finds that Google's ranking rule is efficient while Yahoo!'s is not. Yahoo!'s ranking is inefficient because, as we mentioned earlier, a high bid does not necessarily mean high total valuation, since total valuation also depends on relevance. He also shows that no revenue ranking of Google's and Yahoo!'s ranking mechanisms is possible given an arbitrary distribution over bidder values and relevance. His findings are consistent with results derived in a weighted unit–price auction framework by Liu and Chen (2006) and Liu et al. (2009), which we discuss next. While both Feng et al. (2007) and Lahaie (2006) focus on two specific cases (Yahoo!'s price-only ranking rule and Google's ranking rule), Liu
and Chen (2006) and Liu et al. (2009) study weighted unit–price auctions (WUPAs), which encompass Yahoo! and Google's ranking rules. Under WUPAs, advertisers bid their willingness-to-pay per click (or unit–price), and the auctioneer weighs unit–price bids based on advertisers' expected CTRs. Liu and Chen (2006) consider a single-slot setting. Liu et al. (2009) extend to a more general multiple-slot setting and study both ranking rules and minimum-bid rules. As in Section 3, advertiser i, if assigned to slot j, will generate c_{ij} = d_j q_i clicks, where d_j is a deterministic coefficient that captures the prominence of slot j, with d_1 ≥ d_2 ≥ ... ≥ d_m and d_1 = 1, and q_i is a stochastic number that captures advertiser i's CTR. A key assumption of the WUPA framework is that the KAP has information on advertisers' future CTRs. This assumption is motivated by the fact that e-commerce technologies allow KAPs to track advertisers' past CTRs and predict their future CTRs. The KAP can make the ranking of advertisers depend on both their pay-per-click and their expected CTRs. In particular, the KAP assigns each advertiser a score s = wb, where the weighting factor w is determined by the advertiser's expected CTR. If the advertiser has a high expected CTR, the weighting factor is 1. If the advertiser has a low expected CTR, the weighting factor is w_l. Liu et al. (2009) study WUPAs in an incomplete-information setting. They assume that each advertiser has a private valuation-per-click v, v ∈ [v̲, v̄]. The distributions of valuation-per-click, F_h(v) (for high-CTR advertisers) and F_l(v) (for low-CTR advertisers), are known to all advertisers and the KAP. The probability of being a high-CTR advertiser, a, and of being a low-CTR one, 1 − a, are also known to all advertisers and the KAP. Furthermore, we denote Q_h and Q_l as the expected CTRs of a high-CTR advertiser and a low-CTR advertiser, respectively. It is assumed that Q_h > Q_l. Suppose advertisers' payoff functions are additive in their total valuation and the payment. Under the first-score payment rule (see Section 3),3 the payoffs of a low-CTR advertiser and a high-CTR advertiser are, respectively,

U_l(v, b) = Q_l (v − b) \sum_{j=1}^{m} d_j Pr{w_l b ranks jth}    (4)

U_h(v, b) = Q_h (v − b) \sum_{j=1}^{m} d_j Pr{b ranks jth}    (5)

3 In this model setting, a first-score auction is revenue-equivalent to a second-score auction.
Liu et al.'s analysis generates several insights. First, it illustrates how ranking rules affect equilibrium bidding. The ranking rule affects how low- and high-CTR advertisers match up against each other in equilibrium. Specifically, the weighting factors for low- and high-CTR
advertisers determine the ratio of valuation-per-click between a low-CTR advertiser and a high-CTR advertiser who tie in equilibrium. For example, if low-CTR advertisers are given a weighting factor of 0.6 and high-CTR advertisers a weighting factor of 1, a low-CTR advertiser with valuation-per-click 1 will tie with a high-CTR advertiser with valuation-per-click 0.6 in equilibrium. Furthermore, high-CTR advertisers with valuation-per-click higher than 0.6 out-compete all the low-CTR advertisers and therefore compete only with other high-CTR competitors. As a result, these high-CTR advertisers will bid markedly less aggressively than high-CTR advertisers with valuation-per-click lower than 0.6. Second, they identify the efficient ranking rule under the WUPA framework. Here efficiency is measured by the total expected valuation realized in the auction. The efficient ranking rule under the WUPA framework is remarkably simple: the KAP should weigh advertisers' pay-per-click by their expected CTRs, as if they bid their true valuation-per-click (while in fact they generally do not). Third, they characterize the revenue-maximizing ranking rule under WUPAs. The revenue-maximizing ranking rule may favor low- or high-CTR advertisers relative to the efficient ranking rule. When the distribution of valuation-per-click is the same for high- and low-CTR advertisers, the revenue-maximizing ranking rule should always favor low-CTR advertisers (relative to the efficient design). But when the valuation distribution of low-CTR advertisers becomes less ''disadvantaged,'' the revenue-maximizing ranking rule may instead favor high-CTR advertisers (relative to the efficient design). Besides the above-mentioned research on ranking rules, Weber and Zheng (2007) study the revenue-maximizing ranking rule in a setting that resembles ''paid placement.'' They study a model in which two competing firms can reach their customers through sponsored links offered by a search engine intermediary. Consumers differ in ''inspection cost'' (the cost incurred when they click on a sponsored link to find out the surplus they can get from purchasing the product). Thus, some consumers may inspect only the first link, while others inspect both before making a purchase decision. To get the higher position, firms can offer a payment b to the search engine each time a consumer clicks on their product link (their ''bids''). The search engine's problem is to choose how to rank the firms, given the consumer surplus u generated by the two firms (assumed known to the search engine) and their bids b. The authors study an additive ranking rule

s(b, u; β) = βu + (1 − β)b    (6)

where the parameter β is the focal design factor. They find that the revenue-maximizing ranking design should put nonzero weight on firms' bids b. In other words, search engines have an incentive to accept ''bribes'' from advertisers to bias the ranking of product links.
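For concreteness, the additive rule in Eq. (6) can be evaluated directly. The surplus and bid numbers below are hypothetical and only illustrate how the design parameter β shifts the ranking from bid-driven to surplus-driven:

def score(u, b, beta):
    """Additive ranking rule of Eq. (6): beta weights consumer surplus u,
    (1 - beta) weights the firm's per-click payment b."""
    return beta * u + (1 - beta) * b

# Firm A offers higher surplus; firm B offers a higher payment (hypothetical values)
firms = {"A": (0.9, 0.5), "B": (0.4, 0.9)}   # firm -> (u, b)
for beta in (0.2, 0.8):
    ranked = sorted(firms, key=lambda f: score(*firms[f], beta), reverse=True)
    print(beta, ranked)   # 0.2 -> ['B', 'A'] (bid-driven); 0.8 -> ['A', 'B'] (surplus-driven)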
5 How to package resources
In keyword auctions, bidders simply bid their willingness-to-pay per click (or per thousand impressions, or per action) and are assigned to slots by an automatic algorithm, with higher-ranked advertisers receiving a better slot (more exposure). This is different from a fixed-price scheme, where sellers specify a menu of price–quantity pairs for buyers to choose from, and from traditional divisible-good auctions, where sellers need not specify anything and buyers bid both the price and the quantity they desire. In a sense, keyword auctions strike a middle ground between the fixed-price scheme and traditional divisible-good auctions: in keyword auctions, the buyers (advertisers) specify the prices they desire, and the seller (the KAP) decides the quantities to offer. Given this unique division of tasks in keyword auction settings, how to package resources for auctioning becomes an important issue facing KAPs. Before we address the issue of resource packaging, it is useful to clarify what we mean by a resource in keyword auctions and why resource packaging is a practical issue. What KAPs sell to advertisers is impressions. Each time a page is requested by an Internet user, all advertisements on this page get an impression. Though keyword advertising is often priced by the number of clicks or ''actions'' (e.g., purchases), KAPs can always reallocate impressions from one advertiser to another but cannot do the same with clicks or actions. Therefore, the impression is the ultimate resource controlled by KAPs. Although slots on the same page generate the same number of impressions, they may not be equal in the eyes of advertisers. For example, an advertising slot is noticed more often if it is at the top of a page than at the bottom of the page. Other factors can also affect how often a slot is noticed, such as its geometric size, the time of day it is displayed, and whether the deployment website is frequented by shoppers. One way to address these differences in page impressions is to substitute raw impressions with a standardized effective exposure, which weighs impressions differently based on how much value they deliver to an average advertiser. For example, if the effective exposure generated by one page impression at the top of a page is 1, then the effective exposure generated by one page impression at the bottom of the page might be 0.3. In the following we study the packaging of effective exposure rather than raw page impressions.4

4 A recommendation based on effective exposure can be transparently translated into a recommendation based on raw page impressions. This is because KAPs can always tailor the exposure allocated to an advertisement by randomizing its placement among different slots, varying the timing and length of its appearance, and/or selecting the number of websites on which the advertisement appears.

With the notion of effective exposure, a keyword auction goes like this. The KAP packages the available effective exposure into several shares,
ordered from large to small. Advertisers will be assigned to shares by their rankings, with the highest-ranked advertiser receiving the largest share, the second-highest-ranked advertiser receiving the second-largest share, and so on. A resource packaging problem in such a setting is to decide how many shares to provide and the size of each share to maximize total revenues. We call this problem a share-structure design problem. The share-structure design problem is relevant to KAP’s day-to-day operations. The demand and supply of keyword advertising resources are highly dynamic. On one hand, the supply of advertising resources fluctuates as new websites join KAPs’ advertising network, and existing websites may lose their draw of online users. On the other hand, the demand for advertising on particular keywords shifts constantly in response to changes in underlying market trends. Therefore, KAPs must constantly adjust their share structures to maximize their total revenue. To do so, KAPs need a good understanding of the share structure design problem. Given that KAPs have become managers of tremendous advertising resources, the issue of share structure design is critical to their success.
5.1 The revenue-maximizing share structure problem

Chen et al. (2006, 2009) address the issue of revenue-maximizing share structures with the following specifications. There are n risk-neutral advertisers (bidders). The KAP (auctioneer) packages the total effective exposure (normalized to 1) into as many as n shares arranged in descending order, s_1 ≥ s_2 ≥ ... ≥ s_n. A share structure refers to the vector s = (s_1, s_2, ..., s_n). Table 4 shows some examples of share structures and their interpretations. A bidder's valuation for a share is determined by the size of the share (s) and a private parameter (v), called the bidder's type. v is distributed according to a cumulative distribution function F(v) on [v̲, v̄], with density f(v). A bidder's valuation of a share takes the form vQ(s), where Q(·) is an increasing function. Bidders are invited to bid their willingness-to-pay per unit of exposure (or unit price), and all shares are allocated at once by a rank-order of bidders' unit–price bids.5 Bidders pay the price they bid.6 Each bidder's expected payoff is the expected valuation minus the expected payment to the auctioneer. Denote p_j(b) as the probability of winning share j by placing bid b.

5 Google ranks advertisers by the product of their willingness-to-pay per click and a click-through-rate-based quality score, which can be loosely interpreted as advertisers' willingness-to-pay per impression (see Liu and Chen (2006) for a more detailed discussion). Yahoo! used to rank advertisers by their willingness-to-pay per click only, but recently switched to a format similar to Google's. Our assumption that bidders are ranked by their willingness-to-pay per unit of exposure is consistent with both Google's approach and Yahoo!'s new approach.
6 The expected revenue for the auctioneer is the same if bidders pay the next highest bidder's price.
Table 4
Examples of share structures

s                            Interpretation
(1, 0, 0, 0)                 The highest bidder gets all effective exposure
(0.25, 0.25, 0.25, 0.25)     The top 4 bidders each get one-fourth of the total effective exposure
(0.4, 0.2, 0.2, 0.2)         The top bidder gets 40% of the total effective exposure; the 2nd–4th highest bidders each get 20%
The expected payoff of a bidder of type v is

U(v, b) = \sum_{j=1}^{n} p_j(b) [v Q(s_j) − b s_j]    (7)

The auctioneer's revenue is the expected total payment from all bidders:

π = n E[ b \sum_{j=1}^{n} p_j(b) s_j ]    (8)
Bidders maximize their expected payoff by choosing a unit price b. The auctioneer maximizes the expected revenue by choosing a share structure s.

5.2 Results on revenue-maximizing share structures

Chen et al. (2009) showed that the auctioneer's expected revenue in the above setting is written as

π = n \sum_{j=1}^{n} Q(s_j) \int_{v̲}^{v̄} P_j(v) [v − (1 − F(v))/f(v)] f(v) dv    (9)

where

P_j(v) = \binom{n−1}{n−j} F(v)^{n−j} (1 − F(v))^{j−1}    (10)

is the equilibrium probability for a bidder of type v to win share j. We denote

a_j ≡ n \int_{v̲}^{v̄} P_j(v) [v − (1 − F(v))/f(v)] f(v) dv,   j = 1, 2, ..., n    (11)
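The return coefficients in Eq. (11) are straightforward to evaluate numerically for any given type distribution. The sketch below uses a uniform distribution on [1, 3] purely as a stand-in; the midpoint quadrature and grid size are implementation choices, not part of the model:

from math import comb

def return_coefficients(n, F, f, v_lo, v_hi, grid=4000):
    """Numerically evaluate a_j of Eq. (11) by midpoint quadrature."""
    step = (v_hi - v_lo) / grid
    vs = [v_lo + (i + 0.5) * step for i in range(grid)]
    coeffs = []
    for j in range(1, n + 1):
        total = 0.0
        for v in vs:
            P_j = comb(n - 1, n - j) * F(v) ** (n - j) * (1 - F(v)) ** (j - 1)
            total += P_j * (v - (1 - F(v)) / f(v)) * f(v) * step
        coeffs.append(n * total)
    return coeffs

# Stand-in type distribution: uniform on [1, 3]
F = lambda v: (v - 1) / 2
f = lambda v: 0.5
print([round(a, 3) for a in return_coefficients(6, F, f, 1.0, 3.0)])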
The expected revenue (Eq. (9)) can be written as

π = \sum_{j=1}^{n} a_j Q(s_j)    (12)
Here a_j is interpreted as the return coefficient for the jth share. Chen et al. (2009) showed that the revenue-maximizing share structures may consist of plateaus—a plateau is a set of consecutively ranked shares with the same size. For example, the third example in Table 4 has two plateaus: the first plateau consists of the first share (of size 0.4); the second plateau consists of the second to the fourth shares (of size 0.2). Chen et al. (2009) showed that the starting and ending ranks of the plateaus in the revenue-maximizing share structure are determined only by the distribution of bidders' types. Based on their analysis, the optimal starting/ending ranks of the plateaus and the optimal sizes of the shares in each plateau can be computed using the following algorithm.

1. Compute the return coefficients {a_j}, j = 1, ..., n.
2. Let j_k denote the ending rank of the k-th plateau. Set j_0 ← 0 and k ← 1.
3. Given j_{k−1}, compute j_k ← arg max_{j ∈ {j_{k−1}+1, ..., n}} { (1/(j − j_{k−1})) \sum_{l=j_{k−1}+1}^{j} a_l }.
4. If j_k = n, set K ← k (K denotes the total number of plateaus) and continue to step 5. Otherwise, set k ← k + 1 and go to step 3.
5. Compute the average return coefficient ā_k ← (1/(j_k − j_{k−1})) \sum_{l=j_{k−1}+1}^{j_k} a_l for each plateau k = 1, ..., K.
6. Solve the following nonlinear programming problem for the sizes of the shares (z_1, z_2, ..., z_K) in the plateaus:

   max \sum_{k=1}^{K} (j_k − j_{k−1}) ā_k Q(z_k)

   subject to \sum_{k=1}^{K} (j_k − j_{k−1}) z_k = 1 and z_1 ≥ z_2 ≥ ... ≥ z_K ≥ 0
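Steps 2–5 of the algorithm translate directly into code. The sketch below takes the return coefficients as given (the values used are hypothetical) and returns the ending ranks and average return coefficients of the plateaus; step 6's sizing problem is left to a separate solver.

def find_plateaus(a):
    """Steps 2-5: split ranks 1..n into plateaus with decreasing average coefficients."""
    n = len(a)
    ends, j_prev = [], 0
    while j_prev < n:
        # step 3: pick the ending rank that maximizes the block's average coefficient
        j_k = max(range(j_prev + 1, n + 1),
                  key=lambda j: sum(a[j_prev:j]) / (j - j_prev))
        ends.append(j_k)
        j_prev = j_k                      # step 4: stop once the last rank is reached
    # step 5: average return coefficient of each plateau
    avgs, prev = [], 0
    for j in ends:
        avgs.append(sum(a[prev:j]) / (j - prev))
        prev = j
    return ends, avgs

# Hypothetical return coefficients for six shares
ends, avgs = find_plateaus([0.60, 0.30, 0.34, 0.10, 0.02, 0.01])
print(ends)                               # [1, 3, 4, 5, 6]: ranks 2 and 3 form one plateau
print([round(x, 3) for x in avgs])        # [0.6, 0.32, 0.1, 0.02, 0.01]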
A share structure becomes steeper if we allocate more resources to high-ranked shares and less to low-ranked ones. In Table 4, the steepest share structure is (1, 0, 0, 0), followed by (0.4, 0.2, 0.2, 0.2), and then by (0.25, 0.25, 0.25, 0.25). Chen et al. (2009) obtained several results on how the revenue-maximizing share structure should change in steepness when the underlying demand or supply factors change. First, as bidders' demands become less price-elastic (as the valuation function Q(·) becomes more concave), the auctioneer should use a less steep share structure. When bidders have perfectly elastic demand (i.e., the bidder's valuation Q(·) is a
linear function), the auctioneer should use the steepest share structure, winner-take-all. The following example illustrates this finding.

Example 1. Let the number of bidders be six and the type distribution be a (truncated) exponential distribution on [1, 3]. When Q(s) = s, the revenue-maximizing share structure is (1, 0, 0, 0, 0, 0) (winner-take-all). When Q(s) = √s, the revenue-maximizing share structure is (0.51, 0.25, 0.13, 0.07, 0.03, 0.01). When Q(s) = s^{1/4}, the revenue-maximizing share structure is (0.40, 0.25, 0.16, 0.10, 0.06, 0.03). Figure 4 plots the first to the sixth shares under the three different valuation functions. The figure shows that the revenue-maximizing share structure becomes flatter as bidders' demand becomes less price-elastic.

A change in the type distribution affects the revenue-maximizing share structure through the return coefficients a_j. In the case of ''scaling'' (all bidders' valuations are multiplied by a common factor), all return coefficients are also scaled, and the revenue-maximizing share structure remains the same. When the type distribution is ''shifted'' to the right (i.e., every bidder's v increases by the same amount), the return coefficient for a low-ranked share increases by a larger proportion than the return coefficient for a high-ranked share does, and thus the revenue-maximizing share structure becomes less steep.

Example 2. Continue with Example 1 and fix Q(s) = √s. When the type distribution is shifted to [5, 7], the revenue-maximizing share structure becomes (0.24, 0.19, 0.17, 0.15, 0.13, 0.12). Figure 5 shows that the revenue-maximizing share structure becomes flatter when the type distribution is shifted from [1, 3] to [5, 7].
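For the square-root valuation used in Examples 1 and 2, step 6 of the algorithm admits a simple closed form: with Q(z) = √z, the first-order conditions give plateau sizes proportional to the squared average return coefficients, and the monotonicity constraint is then automatically satisfied because the plateau averages are decreasing. The sketch below is our own illustration under that special case; it uses the hypothetical plateau data from the earlier sketch and does not attempt to reproduce the exact numbers of Example 1, which depend on the truncated exponential coefficients.

def plateau_sizes_sqrt(ends, avgs):
    """Step 6 for Q(z) = sqrt(z): z_k proportional to avg_k ** 2,
    scaled so that plateau widths times sizes sum to 1."""
    widths = [j - i for i, j in zip([0] + ends[:-1], ends)]
    raw = [a * a for a in avgs]
    scale = sum(w * r for w, r in zip(widths, raw))
    return [r / scale for r in raw]

# Hypothetical plateau structure from the earlier sketch
ends, avgs = [1, 3, 4, 5, 6], [0.60, 0.32, 0.10, 0.02, 0.01]
sizes = plateau_sizes_sqrt(ends, avgs)
print([round(z, 3) for z in sizes])   # share size of each plateau, largest first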
Fig. 4. Effect of price elasticity of demand.
Fig. 5. Effect of type distribution.
Another factor studied in Chen et al. (2009) is the effect of an increase in the total resources available. They showed that when the total resource increases, all shares increase, but whether the share structure (in terms of percentages of the total resources) becomes flatter or steeper depends on how bidders' price elasticity changes with the resources assigned. When bidders' price elasticity increases in the amount of resources allocated, the KAP should increase high-ranked shares by a larger percentage. When bidders' price elasticity of demand decreases, the KAP should increase low-ranked shares by a larger percentage. The above results highlight the importance of advertisers' price elasticity of demand and the competitive landscape (as determined by the distribution of bidders' types). Generally speaking, when bidders become more price-elastic, the share structure should be steeper; when the competition between bidders is fiercer, the share structure should be flatter.
5.3 Other issues on resource packaging

The resources managed by KAPs have expanded significantly since the advent of keyword advertising. Leading KAPs have developed vast advertising networks of thousands of websites. Meanwhile, they are also actively seeking expansion to other media, including mobile devices, radio, and print advertising. The issue of resource packaging will only become more important as KAPs manage more advertising resources.
The earlier research addressed only a small part of a larger resource-packaging problem. There are a few interesting directions for future research on the issue of resource packaging. First, Chen et al.'s (2009) framework assumes bidders share a common valuation function Q. A more general setting is one in which bidders' valuation functions also differ. For example, bidders with highly elastic demand and bidders with inelastic demand may coexist. Feng (2008) studies a setting in which bidders differ in price elasticities, but her focus is not on share structure design. Another interesting direction is to compare keyword auctions with alternative mechanisms for divisible goods, such as the conventional discriminatory-price and uniform-price auctions (Wang and Zender, 2002; Wilson, 1979), in which bidders bid not only on prices but also on the quantity desired. The study of revenue-maximizing share structures facilitates such a comparison, because one would need to pick a revenue-maximizing share structure for keyword auctions to make the comparison meaningful. Also, it is interesting to study the optimal mechanism for allocating keyword-advertising resources. Different mechanisms may be evaluated along the lines of the auctioneer's revenue, the allocation efficiency, and whether the mechanism encourages bidders to reveal their true valuations. Bapna and Weber (2006) study a mechanism that allows bidders to specify their ''demand curves,'' rather than just one price. They consider a more general setting in which multiple divisible goods are offered and bidders may have multidimensional private information. More specifically, they consider n bidders that have valuations for fractional allocations of m slots. For a fractional allocation x_i = (x_i^1, ..., x_i^m), bidder i's utility is v_i(x_i; Z_i), where Z_i represents bidder i's private information, or type. The auctioneer first announces its mechanism, which includes a fixed m-dimensional price vector p = (p_1, ..., p_m). Then each bidder submits a bid function b_i(·; Z_i). The bidder's bids are treated as discounts that will be subtracted from the payment implied by the posted price schedule. Under such a setting, Bapna and Weber show that this mechanism has a dominant-strategy incentive-compatible equilibrium in which a bidder's equilibrium bids do not depend on knowledge of the type distribution, the number of bidders, or other bidders' payoff functions.

6 Click fraud
The keyword advertising industry has been extraordinarily successful in the past few years and continues to grow rapidly. However, its core ''pay-per-click'' advertising model faces a threat known as ''click fraud.'' Click fraud occurs when a person, automated script, or computer program imitates a legitimate user of a web browser clicking on an advertisement, for the purpose of generating a click with no real interest in the target link. The
consequences of click fraud include depleting advertisers' budgets without generating any real returns, increasing uncertainty in the cost of advertising campaigns, and creating difficulty in estimating the impact of keyword advertising campaigns. Click fraud can ultimately harm KAPs because advertisers can lose confidence in keyword advertising and switch to other advertising outlets. Both industry analysts and KAPs have cited click fraud as a serious threat to the industry. A Microsoft AdCenter spokesperson stated, ''Microsoft recognizes that invalid clicks, which include clicks sometimes referred to as 'click fraud,' are a serious issue for pay-per-click advertising.''7 In its IPO document, Google warned that ''we are exposed to the risk of fraudulent clicks on our ads.''8 While no consensus exists on how click fraud should be measured, ''most academics and consultants who study online advertising estimate that 10% to 15% of advertisement clicks are fake, representing roughly $1 billion in annual billings'' (Grow and Elgin, 2006).

7 http://news.com.com/2100-10243-6090939.html
8 http://googleblog.blogspot.com/2006/07/let-click-fraud-happen-uh-no.html

Click fraud has created a lingering tension between KAPs and advertisers. Because advertisers pay for the valid clicks they receive, it is critical for advertisers not to pay for clicks that are invalid or fraudulent. The tension arises when advertisers and KAPs cannot agree on which clicks are valid. KAPs often do not inform advertisers which clicks are fraudulent, citing the concern that click spammers may use such information against KAPs and undermine KAPs' efforts to fight click fraud. Also, KAPs may have financial incentives to charge advertisers for invalid clicks to increase their revenues. Such incentives may exist at least in the short run. A few events illustrate the tension between advertisers and KAPs. In June 2005, Yahoo! settled a click-fraud lawsuit and agreed to pay the plaintiffs' $5 million legal bills. In July 2006, Google settled a class-action lawsuit over alleged click fraud by offering a maximum of $90 million in credits to marketers who claim they were charged for invalid clicks. Before we proceed, it is useful to clarify the two main sources of fraudulent clicks. The first is competing advertisers. Knowing that most advertisers have a daily spending budget, an advertiser can initiate a click-fraud attack on competitors to drain their daily budgets. Once the competitors' daily budgets are exhausted, their advertisements will be suspended for the rest of the day, so the attacker can snag a high rank at lower cost. The second and more prevalent source of click fraud is publishers who partner with KAPs to display keyword advertisements. Many publishers earn revenue from KAPs on a per-click basis. Therefore, they have incentives to inflate the number of clicks on the advertisements displayed on their sites. This became a major form of click fraud after KAPs expanded keyword advertising services to millions of websites,
including many small and obscure websites that are often built solely for advertising purposes. One argument is that click fraud is not a real threat. This line of argument underlies current Google CEO Eric Schmidt's comment on click fraud:9

Let's imagine for purposes of argument that click fraud were not policed by Google and it were rampant . . . Eventually, the price that the advertiser is willing to pay for the conversion will decline, because the advertiser will realize that these are bad clicks, in other words, the value of the ad declines, so over some amount of time, the system is in fact self-correcting. In fact, there is a perfect economic solution which is to let it happen.

9 http://googleblog.blogspot.com/2006/07/let-click-fraud-happen-uh-no.html
Research also shows that Google's keyword auction mechanisms resist click fraud (Immorlica et al., 2005; Liu and Chen, 2006). The reason is that advertisers who suffer from click fraud also gain in their CTR rating, which works in their favor in future auctions (recall that Google's ranking mechanism favors advertisers with high historical CTRs). While the above arguments have merit, they also have flaws. The first argument works best when the click-fraud attack is predictable. When the attack is unpredictable, advertisers cannot effectively discount its impact. Also, unpredictable click fraud creates uncertainty for advertisers, which can make keyword advertising unattractive. As to the second argument, while receiving fraudulent clicks has positive effects under the current system, it is unclear whether the positive effects dominate the negative ones. In what follows, we discuss measures to detect and to prevent click fraud. Detection efforts such as online filtering and off-line detection reduce the negative impact of fraudulent clicks. Preventive measures such as alternative pricing or a rental approach can reduce or eliminate the incentives to conduct click fraud.

6.1 Detection

6.1.1 Online filtering

A major tool used in combating click fraud is an automatic algorithm called a ''filter.'' Before charging advertisers, major KAPs use automatic filter programs to discount suspected fraudulent clicks as they occur. Such filters are usually rule-based. For example, if a second click on an advertisement occurs immediately after a first click, the second click (''the doubleclick'') is automatically marked as invalid, and the advertiser will not pay for it. KAPs may deploy multiple filters so that if one filter misses a fraudulent click, another may still have a chance to catch it. Tuzhilin (2006) studied the filters used by Google and stated that Google's effort in filtering out invalid clicks is reasonable, especially after Google started to treat doubleclicks as invalid clicks in 2005.
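A toy version of the kind of rule-based online filter just described is easy to sketch. The click record format, the two-second window, and keying on IP address are all illustrative assumptions, not a description of any KAP's actual filter:

DOUBLECLICK_WINDOW = 2.0   # seconds; an assumed threshold

def billable_clicks(clicks):
    """clicks: list of (timestamp, user_ip, ad_id) tuples sorted by timestamp.
    Returns the clicks that survive a simple doubleclick rule."""
    last_seen, kept = {}, []
    for ts, ip, ad in clicks:
        key = (ip, ad)
        if key in last_seen and ts - last_seen[key] <= DOUBLECLICK_WINDOW:
            pass                        # immediate repeat: marked invalid, not billed
        else:
            kept.append((ts, ip, ad))
        last_seen[key] = ts
    return kept

clicks = [(0.0, "10.0.0.1", "ad42"), (0.8, "10.0.0.1", "ad42"), (35.0, "10.0.0.1", "ad42")]
print(len(billable_clicks(clicks)))     # 2: the click at t=0.8 is filtered out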
While some fraudulent clicks are easy to detect (e.g., doubleclicks), others are very difficult. For example, it is virtually impossible to determine whether a click is made by a legitimate Internet user or by a laborer hired cheaply in India to click on competitors' advertisements.10 The current filters are still simplistic (Tuzhilin, 2006). More sophisticated and time-consuming methods are not used in online filters because they do not work well in real time. As a result, current filters may miss sophisticated and less-common attacks (Tuzhilin, 2006). The fact that advertisers have requested refunds or even pursued lawsuits over click fraud indicates that filter programs alone cannot satisfactorily address the click-fraud problem.

10 http://timesofindia.indiatimes.com/articleshow/msid-654822,curpg-1.cms

6.1.2 Off-line detection

Off-line detection methods do not have the real-time constraint. Therefore an off-line detection team can deploy more computationally intensive methods and consider a larger set of click data and many other factors (such as conversion data). Off-line detection can be automatic or manual. Google uses automated off-line detection methods to generate fraud alerts and to terminate publishers' accounts for fraudulent click patterns. Automatic off-line detection methods are pre-programmed; thus they cannot react to new fraud patterns. Google also uses manual off-line detection to inspect click data questioned by advertisers, alert programs, or internal employees. While such manual detection is powerful, it is hardly scalable. Unlike online filtering, off-line detection does not automatically credit advertisers for invalid clicks. However, if a case of click fraud is found, advertisers will be refunded.

6.2 Prevention

First of all, KAPs may prevent click fraud by increasing the cost of conducting it. KAPs have taken several steps to discourage click spammers, including (Tuzhilin, 2006):

- making it hard for publishers to create duplicate accounts or open new accounts after their old accounts are terminated,
- making it hard for publishers to register using false identities, and
- automatically discounting fraudulent clicks so that click spammers are discouraged.

All of the above prevention efforts rely on a powerful click-fraud detection system. However, a powerful and scalable click-fraud detection system is very difficult, if not impossible, to develop. The above prevention efforts are undermined if sophisticated click spammers can evade detection.
6.2.1 Alternative pricing

Pay-per-click advertising is susceptible to click fraud because clicks can be easily falsified. Recognizing this, some suggest different pricing metrics, such as pay-per-action (e.g., pay-per-call and pay-per-purchase), as a remedy for click fraud. Because purchases and calls are much more costly to falsify, switching to a pay-per-action or pay-per-call pricing scheme would overcome the click-fraud problem. Pay-per-action pricing, however, is unlikely to be a remedy for all advertisers. Sometimes outcome events such as purchases are hard to track or define (e.g., should KAPs count a purchase if it is made the day after the customer visited the link?). Other times, advertisers may be reluctant to share purchase information with KAPs. Finally, different advertisers may be interested in different outcome measures. For example, direct marketers are more interested in sales, while brand advertisers may be interested in the time Internet users spend on their websites. One may suggest going back to the pay-per-impression model to prevent click fraud. However, pay-per-impression is subject to fraud of its own kind: knowing that advertisers are charged on a per-impression basis, a malicious attacker can request the advertising pages many times to exhaust the advertisers' budgets; similarly, publishers can recruit viewers to their websites to demand higher revenue from KAPs. Goodman (2005) proposed a pricing scheme based on percentage of impressions. The assumption is that if attackers systematically inflate impressions, advertisers will pay the same amount because they still receive the same percentage of all impressions. While this proposed pricing scheme addresses the click-fraud problem to a large extent, it also has drawbacks. For example, such a pricing scheme will not automatically adjust to changes in overall legitimate traffic. As a result, web publishers have no incentive to increase the popularity of their websites. Also, pay-per-percentage-of-impressions pricing imposes all risks on advertisers. In general, advertisers are more risk-averse than KAPs, and it is often revenue-maximizing for KAPs to absorb some of the risks.

6.2.2 Rental model

Another possible remedy is a rental model in which advertisers bid on how much they are willing to pay per hour of exposure. Clearly, such a pricing model is immune to the click-fraud problem. The rental model can be implemented in different ways. One way is to ask each advertiser to bid on each slot, with the KAP assigning each slot to the highest bidder. Alternatively, the KAP can ask advertisers to bid on the first slot only, provided that they agree to receive other slots at a discounted price proportional to their bid for the first slot. Such a rental model can be valuable when advertisers have a reasonable idea of how much exposure they can get from a particular slot. Of course, when the outcome is highly uncertain, a rental model also exposes advertisers to grave risks.
In sum, a single best solution to the click-fraud problem may not exist. While alternatives to pay-per-click advertising may remove the incentives to conduct click fraud, they often come with other costs and limitations. Clearly, future keyword auction designs must take the click-fraud problem into account.
7 Concluding remarks
In this chapter, we review current research on keyword advertising auctions. Our emphasis is on keyword-auction design. Keyword auctions are born out of practice and have unique features that the previous literature has not studied. Keyword auctions are still evolving, giving us an opportunity to influence future keyword-auction designs. Given the central position of search and advertising in the online world, research on keyword auctions holds important practical value. It is worth noting that keyword auctions as a mechanism for allocating massive resources in real time are not limited to online advertising settings. Other promising areas of application of keyword auctions include grid-computing resources, Internet bandwidth, electricity, radio spectrum, and some procurement areas. In fact, Google filed a proposal with the Federal Communications Commission on May 21, 2007, calling for the use of keyword-auction-like mechanisms to allocate radio spectrum. In the proposal, Google argued that a keyword-auction-like real-time mechanism would improve the fairness and efficiency of spectrum allocation and create a market for innovative digital services. As keyword auctions are proposed and tested as a general mechanism in other settings, several important questions arise. For example, what conditions are required for keyword auctions to perform well? And what needs to be changed for keyword auctions to apply in new settings? This chapter focuses on design issues within keyword advertising settings. It would also be interesting to compare keyword auctions with various alternative mechanisms in different settings. It is not immediately clear whether keyword auctions are superior to, for instance, dynamic pricing or a uniform-price auction in which bidders bid both price and quantity. More research must be done to integrate keyword auctions into the existing auction literature. We believe research in this direction will yield new theoretical insights and contribute to the existing auction literature.
References

Asdemir, K. (2005). Bidding Patterns in Search Engine Auctions, Working Paper, University of Alberta School of Business.
Asker, J.W., E. Cantillon (2008). Properties of scoring auctions. RAND Journal of Economics 39(1), 69–85.
Bapna, A., T.A. Weber (2006). Efficient Allocation of Online Advertising Resources, Working Paper, Stanford University.
Che, Y.-K. (1993). Design competition through multidimensional auctions. RAND Journal of Economics 24(4), 668–680.
Chen, J., D. Liu, A.B. Whinston (2006). Resource packaging in keyword auctions, in: Proceedings of the 27th International Conference on Information Systems, December, Milwaukee, WI, pp. 1999–2013.
Chen, J., D. Liu, A.B. Whinston (2009). Auctioning keywords in online search. Forthcoming in Journal of Marketing.
Edelman, B., M. Ostrovsky, M. Schwarz (2007). Internet advertising and the generalized second price auction: Selling billions of dollars worth of keywords. American Economic Review 97(1), 242–259.
eMarketer (2007). Online advertising on a rocket ride. eMarketer News Report, November 7.
Ewerhart, C., K. Fieseler (2003). Procurement auctions and unit-price contracts. RAND Journal of Economics 34(3), 569–581.
Feng, J. (2008). Optimal mechanism for selling a set of commonly ranked objects. Marketing Science 27(3), 501–512.
Feng, J., H. Bhargava, D. Pennock (2007). Implementing sponsored search in web search engines: Computational evaluation of alternative mechanisms. INFORMS Journal on Computing 19(1), 137–148.
Goodman, J. (2005). Pay-per-percentage of impressions: An advertising method that is highly robust to fraud. Workshop on Sponsored Search Auctions, Vancouver, BC, Canada.
Grow, B., B. Elgin (2006). Click fraud: The dark side of online advertising. Business Week, October 2.
Immorlica, N., K. Jain, M. Mahdian, K. Talwar (2005). Click fraud resistant methods for learning click-through rates. Workshop on Internet and Network Economics.
Interactive Advertising Bureau (2007). Internet advertising revenues grow 35% in '06, hitting a record close to $17 billion. Interactive Advertising Bureau News Press Release, May 23.
Lahaie, S. (2006). An analysis of alternative slot auction designs for sponsored search, in: Proceedings of the 7th ACM Conference on Electronic Commerce, Ann Arbor, MI, ACM Press.
Liu, D., J. Chen (2006). Designing online auctions with performance information. Decision Support Systems 42(3), 1307–1320.
Liu, D., J. Chen, A.B. Whinston (2009). Ex-ante information and the design of keyword auctions. Forthcoming in Information Systems Research.
Sullivan, D. (2002). Up close with Google AdWords. Search Engine Watch Report.
Tuzhilin, A. (2006). The Lane's Gifts v. Google Report. Available at http://googleblog.blogspot.com/pdf/TuzhilinReport.pdf. Retrieved on December 25, 2007.
Varian, H.R. (2007). Position auctions. International Journal of Industrial Organization 25(6), 1163–1178.
Vise, A.D., M. Malseed (2005). The Google Story. New York, NY.
Wang, J.J.D., J.F. Zender (2002). Auctioning divisible goods. Economic Theory 19(4), 673–705.
Weber, T.A., Z. Zheng (2007). A model of search intermediaries and paid referrals. Information Systems Research 18(4), 414–436.
Wilson, R. (1979). Auctions of shares. The Quarterly Journal of Economics 93(4), 675–689.
Zhang, X., J. Feng (2005). Price cycles in online advertising auctions, in: Proceedings of the 26th International Conference on Information Systems, December, Las Vegas, NV, pp. 769–781.
Adomavicius & Gupta, Eds., Handbooks in Information Systems, Vol. 3. Copyright © 2009 by Emerald Group Publishing Limited
Chapter 4
Web Clickstream Data and Pattern Discovery: A Framework and Applications
Balaji Padmanabhan Information Systems and Decision Sciences Department, College of Business, University of South Florida, 4202 E. Fowler Avenue, CIS 1040, Tampa, FL 33620, USA
Abstract There is tremendous potential for firms to make effective use of Web clickstream data. In a perfect world a firm will be able to optimally manage online customer interactions by using real-time Web clickstream data. Here it may be possible to proactively serve users by determining customer interests and needs before they are even expressed. Effective techniques for learning from this data are needed to bridge the gap between the potential inherent in clickstream data and the goal of optimized customer interactions. Techniques developed in various fields including statistics, machine learning, databases, artificial intelligence, information systems and bioinformatics can be of value here, and indeed several pattern discovery techniques have been proposed in these areas in the recent past. In this chapter we discuss a few applications of pattern discovery in Web clickstream data in the context of a new pattern discovery framework presented here. The framework is general and we note other applications and opportunities for research that this framework suggests.
1 Background
From a business perspective, the Web is widely recognized to be a key channel of communication between firms and their current and potential customers, suppliers and partners. Early on, several firms used the medium to provide information to customers, a large number of whom still transacted offline. However, this has rapidly changed for a number of reasons, as noted below. A factor that facilitated this transition has been the steady increase in customer comfort in using the Web to transact. Web user surveys
conducted annually reflect this trend clearly.1 For instance, one of the questions in a 1996 survey asked users to provide their degree of agreement with the statement ''Providing credit card information through the Web is just plain foolish.''2 In 1996, survey respondents were divided on this, although slightly more of them disagreed. Responses to such security questions over the years suggest a trend toward greater comfort in transacting online. Improved security technology, such as strong encryption standards, has clearly played a significant role here. If payment information such as credit card data had to be sent in plain text, then the cost of battling online fraud would be prohibitive. This of course did not apply to the pre-Web days in which users submitted credit card data over the telephone, due to two differences. First, telephone networks mostly went through switches operated by large telecom firms, and physically tapping into such a network was hard. Second, telephones, unlike personal computers, did not run software applications, some of which may be malicious programs that can intercept and re-direct data. Equally important, comfort with policies that relate to online transactions has also increased. Consumers today are, for instance, more savvy about procedures for online returns. A recent poll by Harris Interactive3 revealed that 85% of respondents indicated that they are not likely to shop again with a direct retailer if they found the returns process inconvenient. Online retailers have recognized the importance of this and have substantially eased the returns process for most goods purchased online. However, one area that remains an issue today is online privacy. While firms have online privacy policies, it is not clear to what extent consumers read, understand and explicitly accept some of the data use and data sharing policies currently in place. Other key reasons for the transition to transacting online are increased product variety (Brynjolfsson et al., 2003) and the overall convenience of online shopping. Online auctions such as eBay have enabled every item in every home to potentially be available online. Online services contribute to this as well. Services such as Google Answers provide consumers with often-immediate access to experts on a wide variety of issues ranging from troubleshooting computers to picking the right school for children. In terms of convenience, a book, not required immediately, can be purchased online in minutes, compared to the many hours it could have otherwise taken to check its availability and purchase it from a retail bookstore. This increased use of the Web is also evident from macroeconomic indicators released by the U.S. Census Bureau.4

1 http://www-static.cc.gatech.edu/gvu/user_surveys/
2 http://www-static.cc.gatech.edu/gvu/user_surveys/survey-10-1996/questions/security.html
3 See ''Return to Sender: Customer Satisfaction Can Hinge on Convenient Merchandise Returns'', Business Wire, Dec 13, 2004.
4 http://www.census.gov/mrts/www/ecomm.html

From being virtually
From being virtually non-existent a decade or so ago, U.S. retail e-commerce sales in 2004 were $71 billion, accounting for 2% of total retail sales in the economy. More recently, in the third quarter of 2007, retail e-commerce sales were estimated at $32.2 billion, accounting for 3.2% of total retail sales.
2 Web clickstream data and pattern discovery
As this trend has played out, firms have invested significant resources in tracking, storing and analyzing data about customer interactions online. A recent commercial report5 indicates that the worldwide Business Intelligence (BI) tools market grew to $6.25 billion in 2006. Note that this amount only captures expenditure on BI software purchased by firms and does not include internally developed tools or the cost of labor.

Compared to other channels of interaction with customers, a unique characteristic of the Web is that every item viewed or piece of content seen by a user is captured in real time by Web servers. This results in massive amounts of detailed Web clickstream data captured at various servers. Further, the two component terms in "clickstream" have both changed meaningfully over the years. When Web servers were first used, the hypertext transfer protocol determined what was captured every time a user clicked on a link online. Typically the tracked information was captured from HTTP headers: the time of access, the Internet Protocol (IP) address of the user's computer and the file name of the page requested. Today, firms capture a large amount of more specific data, such as the content that was shown on the user's screen before the user clicked on a link. For instance, a single user click on some page at an online content site translates into a large number of variables that describe the environment at the time of this activity. Examples of these variables include the list of all products that were displayed, the specific online advertisements that were shown on the page and the specific categories of products that appeared on the user's screen at that time. Further, Internet use has increased significantly in the last several years, perhaps even disproportionately to the number of major content sites that are accessed. Hence the "stream" part of "clickstream" has also grown significantly for the major online firms. Some reports, for instance, put the number of unique users at Yahoo! at more than 400 million in 2007. The rate at which data streams in from such a large user base contributes to several terabytes of data collected each day. Firms are naturally interested in leveraging such a resource, subject to stated privacy policies.

5 Worldwide Business Intelligence Tools 2006 Vendor Shares, Vesset and McDonough, IDC Inc., June 2007.
Toyota, for instance, may prefer to have its online advertisements shown to users who are tagged as "likely auto buyers" rather than to an urban family that may have no interest in cars. The granularity at which clickstream data is collected today has enabled online firms to build much more accurate customer models, such as one that might score a user as a potential auto buyer. In the context of customer relationship management (CRM), in a perfect world a firm would be able to optimally manage online customer interactions by using real-time Web clickstream data to determine customer interests and needs and to proactively serve users. Between Web clickstream data and the implementation of models to manage online interactions lies the critical component of learning from this data.

One approach to learning from Web clickstream data is to use pattern discovery techniques. As defined in Hand et al. (2001), we use the term "pattern" to mean some local structure that may exist in the data. This is in contrast to "models" (Hand et al., 2001), which represent global structure. Models are also built to make specific predictions, unlike pattern discovery techniques, which are often used for exploratory analysis. However, models may also be informed by the pattern discovery process. For instance, pattern discovery may help unearth a previously unknown behavioral pattern of a Web user, such as a specific combination of content that this user consumes. New features constructed from such patterns may help build better prediction models learned from clickstream data.

There is a large amount of research in the interdisciplinary data mining literature on pattern discovery from Web clickstream data (see Srivastava et al. (2000) and Kosala and Blockeel (2000) for reviews). Having a framework for understanding the contributions of the different papers in this literature can help in making sense of this large (and growing) body of work. The purpose of this chapter is not to survey this literature but to discuss one framework for pattern discovery under which some of this research can be better understood. Certainly there can be many approaches for grouping the various contributions relating to pattern discovery from Web clickstream data. One approach, for instance, might be based on the application, where different research papers are grouped by the specific application addressed (e.g., online advertising, product recommender systems, dynamic Web page design). Another approach for grouping may be based on the specific pattern discovery technique used. Yet another approach may be based on the reference literature from which the pattern discovery techniques come, given that pattern discovery has been addressed in various areas. In the next section, we discuss one framework for pattern discovery that is general and can be applied to provide a useful perspective on specific pattern discovery papers in the literature. Another application of the framework may be to group the literature based on dimensions specific to this framework. We present examples in different domains to show how this framework helps in understanding research in pattern discovery. To motivate the application of this framework in the Web clickstream
context we will use this framework to explain two different approaches taken in the literature to segment online users using patterns discovered from Web clickstream data. We discuss the relationship between these two specific segmentation applications and conclude with a discussion of other opportunities that this framework suggests.
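To make the raw material of this chapter concrete before turning to the framework, note that the header fields mentioned in this section (time of access, IP address of the user's computer, file requested) are commonly recorded in Web server access logs. The short Python sketch below parses one line in the widely used Common Log Format into these fields; it is an illustrative sketch only, and the example log line, field names and regular expression are our own assumptions rather than anything specific to the systems discussed in this chapter.

    import re

    # Common Log Format: host ident authuser [date] "request" status bytes
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
    )

    def parse_log_line(line):
        """Extract the basic clickstream fields discussed above from one log line."""
        match = LOG_PATTERN.match(line)
        if match is None:
            return None
        record = match.groupdict()
        record["status"] = int(record["status"])
        return record

    # Hypothetical example line, for illustration only.
    example = '192.0.2.7 - - [10/Oct/2007:13:55:36 -0700] "GET /catalog/books.html HTTP/1.1" 200 2326'
    print(parse_log_line(example))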
3 A framework for pattern discovery
As noted in Section 2, the data mining literature has a large body of work on pattern discovery from Web clickstream data. One characteristic of the data mining area is a focus on pattern discovery (Hand, 1998). In such cases the focus is usually not on prediction but on learning interesting "local" patterns that hold in a given database. Taking a different perspective, there has also been substantial research in this literature on learning models from large databases. A primary goal of model building in this literature is prediction in very large databases. These models are usually computationally intensive and are evaluated mainly on predictive accuracy.

Pattern discovery techniques can be completely described based on three choices—the representation chosen for the pattern, the method of evaluation by which a specific pattern is deemed "interesting" and, finally, an algorithm for learning interesting patterns in this representation. Below we discuss these three choices and present examples.

3.1 Representation

First, pattern discovery techniques have to make an explicit choice or assumption regarding what forms a pattern can take. Specifically, a representational form has to be chosen. Some examples of representations considered in the pattern discovery literature in data mining are itemsets and association rules (Agrawal et al., 1993), quantitative rules (Aumann and Lindell, 1999), sequences (see Roddick and Spiliopoulou, 2002) and temporal logic expressions (Padmanabhan and Tuzhilin, 1996). An itemset is a representation for a set of items {i1, i2, . . . , ik} that occur together in a single transaction. While the initial application for this was market basket analysis, there have been other applications, such as learning the set of Web pages that are frequently accessed together during a single session. An association rule is represented as I1 → I2, where I1 and I2 are both itemsets and I1 ∩ I2 = {}. Unlike itemsets, the association rule representation is not used to convey a notion of mutual co-occurrence; rather, it is used to indicate that if I1 exists in a transaction then I2 also exists. For instance it may even be the case that {I1, I2} does not occur often, but whenever I1 occurs in a transaction
then I2 also occurs.6 Depending on what captures the notion of a "pattern" in a specific application, one or both of these representations may be useful. The "items" in itemsets are usually based on categorical attributes (although they have been used for continuous attributes based on discretization). Quantitative rules extend the representation of typical association rules to one where the right-hand side of the rule is a quantitative expression such as the mean or variance of a continuous attribute (Aumann and Lindell, 1999). A sequence is yet another example of a representation. Srikant and Agrawal (1996) defined a sequence as an ordered list of itemsets ⟨I1, I2, . . . , Ik⟩. The ordering is important and is used to represent a pattern where a series of itemsets follow one another (usually in time, where transactions have time stamps associated with them). Such a representation is useful where patterns relating to the order of occurrences are relevant.

6 Standard association rule discovery algorithms, however, use itemset frequency constraints for practical as well as computational reasons.

3.2 Evaluation

Given a representation, what makes a specific pattern in this representation interesting? Some examples of evaluation criteria considered in pattern discovery include support and confidence measures for association rules (Agrawal et al., 1993) and frequency (for sequences and temporal logic expressions). For the association rule I1 → I2, support is the percentage of all transactions in the data set that contain {I1, I2}. Confidence is defined, based on a measure of conditional probability, as the percentage of the transactions containing I1 in which I2 is also present. Frequency for sequences is defined as the number of times a specific sequence occurs in a database. The main point here is that these measures—support, confidence, frequency—are all different evaluation criteria for patterns in a given representation. Further, the criteria are specific to each representation—i.e., it is meaningful to compute the support of an itemset, but confidence only applies to rules and not to individual itemsets.
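As a concrete illustration of the evaluation measures just defined, the following sketch computes support and confidence for a candidate association rule by brute force over a toy collection of transactions; efficient search, discussed next, is about avoiding exactly this kind of exhaustive computation at scale. The transactions and items are invented for illustration and are not drawn from any data set discussed in this chapter.

    def support(transactions, itemset):
        """Fraction of transactions that contain every item in `itemset`."""
        itemset = set(itemset)
        hits = sum(1 for t in transactions if itemset <= set(t))
        return hits / len(transactions)

    def confidence(transactions, antecedent, consequent):
        """Support of (antecedent union consequent) divided by support of the antecedent."""
        both = support(transactions, set(antecedent) | set(consequent))
        ante = support(transactions, antecedent)
        return both / ante if ante > 0 else 0.0

    # Toy transactions, e.g., sets of pages viewed in a session (illustrative only).
    transactions = [
        {"home", "sports", "finance"},
        {"home", "sports"},
        {"home", "finance"},
        {"sports", "finance"},
    ]
    print(support(transactions, {"home", "sports"}))       # 0.5
    print(confidence(transactions, {"home"}, {"sports"}))  # 0.666...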
3.3 Search

Given a representation and a method of evaluation, search is the process of learning patterns in that representation that meet the specified evaluation criteria. The development of efficient search algorithms is a critical component given the size and high dimensionality of the databases that these methods are designed for.
The Apriori algorithm (Agrawal et al., 1995) and its many improvements (see Hipp et al., 2000) are examples of efficient search algorithms for learning frequent (evaluation) itemsets (representation) and association rules (representation) with high support and confidence (evaluation). The GSP (Generalized Sequential Patterns) algorithm (Srikant and Agrawal, 1996) is a search technique that learns all frequent (evaluation) sequential patterns (representation) subject to specific time constraints (also evaluation) regarding the occurrences of itemsets in the sequence. The time constraint, for instance, can specify that all the itemsets in the sequence have to occur within a specified time window. These additional constraints can be considered part of the evaluation criteria for a pattern (i.e., a pattern is considered "good" if it is frequent and satisfies each additional constraint).

Pattern discovery is also often application-driven. In some cases the context (the domain plus the specific application) drives the choice of the specific representation and evaluation criteria. Search is well defined given specific choices of representation and evaluation, and hence it is, in this sense, only indirectly application-driven, if at all.

3.4 Discussion and examples

The process of making choices in the representation-evaluation-search (R-E-S) dimensions also helps identify specific differences between the inductive methods developed in the data mining literature and those developed in other areas such as statistics. Compared to other literatures, the data mining area has developed and studied a different set of representations for what constitutes a pattern, developed and studied different evaluation criteria in some cases, and developed and studied various search algorithms for pattern discovery. While it is difficult to compare different representations and evaluation criteria developed across disciplines, studying multiple plausible representations (and evaluations plus search) is by itself a critical component of the process of understanding what constitutes real structure in observed data. Engineers often use the term "reverse engineering" to describe understanding the principles of how something works by observing its operations; much research in various data-driven fields is similar in spirit and is often guided by (necessary) inductive bias in the R-E-S dimensions (particularly representation). In principle such methods contribute to the inductive process in scientific reasoning. Below we discuss a few examples that illustrate these three choices (Figs. 1–3 summarize the framework and the examples pictorially).
Fig. 1. Three steps in pattern discovery.
Fig. 2. Context-motivated choices for representation, evaluation and search.
Fig. 3. Examples of specific choices for representation, evaluation and search. Clockwise from top-left these are from Padmanabhan and Tuzhilin (1996), Padmanabhan and Tuzhilin (1998) and Swanson (1986), respectively.

Padmanabhan and Tuzhilin (1996) addressed the problem of learning patterns in sequences (such as genetic sequences, or a series of discrete system events captured about network behavior). Prior work (Mannila et al., 1995) had used episodes as a representation for a pattern in a
sequence. An episode was defined as a directed graph in which the links between nodes represented the observation that one event in the sequence occurred before the other event did. Padmanabhan and Tuzhilin (1996) extended the episodes representation to a more general form using a temporal logic representation. An example of such a temporal logic expression is A Until_K B, capturing the occurrence of event A zero to K times just before event B occurs, where Until is a temporal logic operator. For instance, the sequence ⟨C, A, B, D, C, A, A, B, A, B, A⟩ contains the pattern A Until_2 B three times.
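Under one simple reading of the operator (runs of zero to K consecutive occurrences of A immediately followed by a B), which is our own simplification rather than the formal temporal logic semantics in Padmanabhan and Tuzhilin (1996), occurrences of such a pattern can be counted as in the sketch below; it reproduces the count of three for the example sequence above.

    def count_until_k(sequence, a, b, k):
        """Count occurrences of 'a Until_k b': at most k consecutive a's followed by b.
        A simplified reading of the temporal operator, used only for illustration."""
        count = 0
        run_of_a = 0
        for event in sequence:
            if event == a:
                run_of_a += 1
            elif event == b:
                if run_of_a <= k:
                    count += 1
                run_of_a = 0
            else:
                run_of_a = 0
        return count

    # The example sequence from the text.
    sequence = ["C", "A", "B", "D", "C", "A", "A", "B", "A", "B", "A"]
    print(count_until_k(sequence, "A", "B", k=2))  # 3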
In this case, the directed graph representation considered earlier in Mannila et al. (1995) was extended, since the temporal logic approach permitted more general approaches for reasoning about patterns in time. The evaluation for patterns—both for the episodes approach as well as the temporal logic approach—was a simple measure of frequency (counting the number of occurrences of a specific pattern in a sequence). In both cases, the papers also presented new search techniques—a method for efficiently discovering episodes in large sequences in Mannila et al. (1995) and a method for learning specific classes of temporal logic patterns in sequences in Padmanabhan and Tuzhilin (1996). In this example, the representation and search dimensions are the ones in which the papers discussed made their main contributions.

The next example focuses specifically on work where the contribution is mainly in the evaluation dimension. In the late 1990s there was a lot of work in the data mining area on developing fast algorithms for learning association rules in databases. Much research, as well as applications in industry, suggested that most of the patterns discovered by these methods, while considered "strong" based on existing evaluation metrics, were in reality obvious or irrelevant. If strong patterns are not necessarily interesting, what makes patterns interesting, and can such patterns be systematically discovered? Padmanabhan and Tuzhilin (1998) developed a new evaluation criterion for the interestingness of patterns. Specifically, they defined an association rule to be interesting if it was unexpected with respect to prior knowledge. This approach requires starting from a set of rules that capture prior domain knowledge, elicited from experts or from rules embedded in operational systems used by firms. A discovered association rule is deemed interesting if it satisfies threshold significance criteria and if it contradicts a rule in the existing knowledge base. For instance, in a subset of retail scanner data relating to the purchase of beverages (categorized as regular or diet), prior knowledge may be represented by a rule such as female → diet beverages. A rule that satisfies threshold significance criteria and contradicts the prior knowledge, such as female, advertisement → regular beverages, is defined to be unexpected. The definition presented is based on contradiction in formal logic, and Padmanabhan and Tuzhilin (1998, 2000) present efficient algorithms to learn all unexpected patterns defined in this manner.

In this example, the representation for patterns (association rules) was not new. In contrast, the evaluation criterion developed was the main contribution and was one that focused specifically on the fundamental problem of what makes patterns interesting. In this case, rather than using the evaluation criterion as a "filter" to select rules generated by an existing technique, new search algorithms were proposed to learn only the unexpected rules, and hence the contribution is along two dimensions (evaluation and search).

In the previous two examples the choice of representation, evaluation and search did not depend in any meaningful way on the application domain in
which it was used. In contrast to this, consider the following example. In the field of information science, Swanson (1986) made a seminal contribution in a paper on identifying "undiscovered public knowledge". Swanson was particularly interested in learning potential treatments for medical conditions from publicly available information. A well-known example of a discovery facilitated by Swanson (1986) was that fish oil may be a potential treatment for Raynaud's disease. This was identified as a potential undiscovered treatment since:

1. the Medline literature had numerous published scientific articles about blood viscosity and Raynaud's disease—the disease apparently was correlated with higher blood viscosity,
2. the literature also had numerous published articles about fish oil and blood viscosity (these articles frequently noted that fish oil helped lower blood viscosity) and
3. the literature had few or no articles that discussed fish oil and Raynaud's disease directly, suggesting that this was not a well-known link.

Note here that (a) the original work was not a completely automated approach and (b) the work was in a different area (information science) and was presented even before the field of data mining gained momentum. However, this is an excellent example of the potential power of inductive approaches such as data mining in a world in which an increasingly large amount of information is automatically captured.

In the R-E-S framework, the representation for patterns such as the one discovered in Swanson (1986) is a triple ⟨A, B, C⟩ where A, B and C are phrases. For instance, ⟨fish oil, blood viscosity, Raynaud's disease⟩ is one specific such triple (pattern). The evaluation is a measure with two components. (1) The first component requires A, B and C to represent a potential treatment, a disease condition and a disease, respectively; this requires background knowledge such as domain ontologies. (2) The second component is a binary indicator based on the counts of documents that contain the pairwise terms. Specifically, this component may be defined to be one if count(A, B) is high, count(B, C) is high and count(A, C) is low, where count(X, Y) is the number of Medline documents in which the phrases X and Y co-occur. Search is then a matter of designing efficient algorithms for learning all such triples. In this example too the main contribution is in the evaluation, but this is an instance where the choice along the three dimensions, from a pattern discovery perspective, is driven by the specific application.
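The second evaluation component above (the document co-occurrence counts) is easy to sketch; the first component, which needs domain ontologies, is omitted. In the toy Python sketch below, a triple ⟨A, B, C⟩ is flagged when count(A, B) and count(B, C) are high while count(A, C) is low. The "documents", thresholds and counts are invented for illustration, and the code is not Swanson's actual procedure.

    from itertools import combinations

    def cooccurrence_counts(documents):
        """For each unordered pair of phrases, count the documents mentioning both."""
        counts = {}
        for phrases in documents:
            for pair in combinations(sorted(set(phrases)), 2):
                counts[pair] = counts.get(pair, 0) + 1
        return counts

    def count(counts, x, y):
        return counts.get(tuple(sorted((x, y))), 0)

    def is_candidate_triple(counts, a, b, c, high=2, low=0):
        """Flag <A, B, C> when count(A, B) and count(B, C) are high but count(A, C) is low."""
        return (count(counts, a, b) >= high and
                count(counts, b, c) >= high and
                count(counts, a, c) <= low)

    # Toy 'documents', each reduced to a set of phrases (purely illustrative).
    docs = [
        {"fish oil", "blood viscosity"},
        {"fish oil", "blood viscosity"},
        {"blood viscosity", "Raynaud's disease"},
        {"blood viscosity", "Raynaud's disease"},
    ]
    counts = cooccurrence_counts(docs)
    print(is_candidate_triple(counts, "fish oil", "blood viscosity", "Raynaud's disease"))  # True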
4 Online segmentation from clickstream data
In the previous section we presented a framework for pattern discovery and discussed how specific examples of pattern discovery approaches may be
viewed in this framework. In this section and the next we continue to examine this link, but specifically for pattern discovery applications that arise in the context of Web clickstream data. The examples discussed in this section relate to learning interesting user segments from Web clickstream data.

Zhang et al. (2004) present an approach motivated by the problem of learning interesting market share patterns in online retail. An example of a pattern discovered by this method is

Region = South and household size = 4 ⇒ market share(xyz.com) = 38.54%, support = 5.4%

The data set consists of book purchases at a subset of leading online retailers. Each record in the data set consists of one online purchase of books. The discovered rule highlights one customer segment (which covers 5.4% of all records) in which xyz.com has an unusually low market share. Generalizing from this, Zhang et al. (2004) defined a new class of patterns called statistical quantitative rules (SQ rules) in the following manner. Given (i) sets of attributes A and B, (ii) a data set D and (iii) a function f that computes a desired statistic of interest on any subset of data from the B attributes, an SQ rule was defined in Zhang et al. (2004) as a rule of the form

X ⇒ f(D_X) = statistic, support = sup

where X is an itemset (a conjunction of conditions) involving attributes in A only, D_X is the subset of D satisfying X, the function f computes some statistic from the values of the B attributes in the subset D_X, and the support is the percentage of transactions in D satisfying X. This representation built on prior representations (association rules and quantitative rules) in the data mining literature. In association rules the antecedent and consequent are conjunctions of conditions, whereas in quantitative rules the consequent is a quantitative measure such as the mean of some attribute. In SQ rules the consequent is defined to be a more general function (possibly involving several attributes) of the specific segment considered by the rule. These rules were evaluated based on statistical significance, specifically on whether the computed quantitative measure for a segment was different from values that might be expected by chance alone. As such, the evaluation criterion was not novel (standard statistical significance). However, to construct the needed confidence intervals, Zhang et al. (2004) use randomization to create data sets in which the attributes pertaining to the computed function are made independent of the others. Given the high computational complexity of creating several randomized data sets—particularly when the size of the data is very large—Zhang et al. (2004) present an efficient computational technique that exploits specific problem
characteristics for learning interesting market share rules from the data. Hence the search technique developed—a computational method based on randomization—was a contribution here as well. Along the three dimensions, the representation and search dimensions are the main dimensions in which Zhang et al. (2004) makes contributions to the literature. We note two characteristics of the above application. First, it learns purchase-based segments, i.e., segments defined based on dollar volumes spent at competing online retailers. Second, it uses Web clickstream data gathered on the client side. Such data are available from data vendors such as comScore Networks and track the Web activity of users across multiple sites. In contrast to this, we next describe another pattern discovery method for online segmentation that discovers behavioral segments as opposed to purchase-based segments, and that can be used by online retailers directly on the Web clickstream data that they individually observe (i.e., it does not need user clickstream data across firms).

Yang and Padmanabhan (2005) present a segmentation approach based on pattern discovery that is motivated by grouping Web sessions into clusters such that the behavioral patterns learned from one cluster are very different from the behavioral patterns learned from other clusters. This motivation is similar to standard cluster analysis, but the difference is in how behavioral patterns are defined. In their approach, a behavioral pattern is defined as an itemset such as:

{day = Saturday, most_visited_category = sports, time_spent = high}

The evaluation for each such pattern is a count of how often it occurs in a set of Web sessions, and in any given cluster the set of all such patterns can be learned efficiently using standard techniques in the literature. Given that any cluster is characterized by the set of behavioral patterns learned from it, Yang and Padmanabhan (2005) develop a distance metric that computes the difference between two clusters based on how different the behavioral patterns learned from them are. Based on this distance metric they develop a greedy hierarchical clustering algorithm that learns pattern-based clusters. Hence, given a set of user sessions at an online retailer, this approach learns clusters such that online user behavior is very different across different clusters. In this sense the approach is a behavioral segmentation approach developed specifically for Web sessions. Interestingly, the result of this analysis can, in some cases (where the number of different users is very small), identify individual users. That is, even if the user ID is ignored, the segments learned sometimes end up isolating different users. In most cases the method does not do this but instead isolates different behaviors—which is the main objective of the approach. Yang and Padmanabhan (2005) also showed how this approach can be used to learn explainable clusters in real Web clickstream data.
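To convey the flavor of comparing clusters by the behavioral patterns they support, the sketch below mines frequent attribute combinations from each cluster of sessions and uses a simple Jaccard-style distance between the two pattern sets. This is a deliberately simple stand-in, not the actual distance metric or the GHIC algorithm of Yang and Padmanabhan (2005); the sessions, attribute names and thresholds are invented.

    from itertools import combinations

    def frequent_patterns(sessions, min_count=2, max_size=2):
        """Attribute combinations (size 1..max_size) occurring in at least min_count sessions."""
        counts = {}
        for session in sessions:
            items = sorted(session)
            for size in range(1, max_size + 1):
                for combo in combinations(items, size):
                    counts[combo] = counts.get(combo, 0) + 1
        return {p for p, c in counts.items() if c >= min_count}

    def pattern_distance(cluster_a, cluster_b, **kwargs):
        """1 minus the Jaccard similarity of the two clusters' frequent pattern sets."""
        pa = frequent_patterns(cluster_a, **kwargs)
        pb = frequent_patterns(cluster_b, **kwargs)
        if not pa and not pb:
            return 0.0
        return 1.0 - len(pa & pb) / len(pa | pb)

    # Invented behavioral attributes per session.
    weekend_sports = [{"day=Sat", "cat=sports"}, {"day=Sat", "cat=sports"}, {"day=Sun", "cat=sports"}]
    weekday_finance = [{"day=Mon", "cat=finance"}, {"day=Tue", "cat=finance"}, {"day=Mon", "cat=finance"}]
    print(pattern_distance(weekend_sports, weekday_finance))  # 1.0: completely different behavior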
In this approach there is no new contribution along the R-E-S dimensions in terms of what constitutes an individual behavioral pattern online. The contributions along these dimensions instead lie at the cluster level: a new representation of a cluster (as a set of behavioral patterns), a new objective function for clustering (i.e., the evaluation dimension) that takes into account differences between the patterns in different clusters, and a new greedy heuristic for learning such clusters.

While the approaches for online segmentation discussed above are generally viewed as applications of pattern discovery to Web clickstream data, viewing them in the R-E-S framework helps in appreciating precisely the more general contributions that they make to the pattern discovery literature. In both segmentation examples described in this section the dimensions were largely motivated by the specific application domain (segmentation from Web clickstream data). Yet the choices for the R-E-S dimensions were not standard, and existing pattern discovery methods could not be used directly. Instead the papers developed new approaches, thereby making more general contributions to the pattern discovery literature. More generally, while the examples in this section and in Section 3.4 do not prove the value of the R-E-S framework, they provide evidence that the framework is useful for identifying the more general contributions made by applied pattern discovery research. Further, the Web segmentation applications show how clickstream data can motivate interesting pattern discovery problems that result in broader contributions and generalizable pattern discovery techniques. While this section focused on examples in online segmentation, the next section briefly identifies other problems to which pattern discovery from Web clickstream data has been applied and discusses connections to the R-E-S framework.
5 Other applications
Below we briefly discuss other applications of pattern discovery to Web clickstream data and highlight the R-E-S dimensions of importance in these cases. The applications and R-E-S possibilities below are not exhaustive. They are mainly intended to illustrate how this framework can be used to make clear the specific choices made along the different dimensions and to identify where the main contributions are.

There has been substantial research (e.g., Perkowitz and Etzioni, 2000; Srikant and Yang, 2001) on automatically reconfiguring Web sites based on learning patterns related to how users access specific content. For instance, if most users who visit a certain site navigate several levels before they reach some commonly accessed page, the site design might conceivably be improved to make it easier to access this content. Here the pattern representation considered may be a sequence (of pages). Learning the set of
all frequent sequences of pages accessed can help in understanding the popular paths visitors take at a Web site. The evaluation is usually based on a count of how often sequences occur, and existing search algorithms such as GSP/Apriori can be used directly.

In the same context of improving Web site design, some authors (Srikant and Yang, 2001) have studied backtracking patterns. Such patterns are argued to be important since they suggest cases where users locate content only after some trial and error that involves backtracking (going back to previous pages to follow new links). Here the goal is to learn specific types of sequences where users visit the same page again and branch in a different direction; a small illustrative check is sketched at the end of this discussion. One example of this is the work of Srikant and Yang (2001). The representation again is sequences; the evaluation is based on counts and on whether there exists a backtracking event in a given sequence. In this case a new algorithm was developed (Srikant and Yang, 2001) for learning such patterns efficiently as well.

Rules learned from Web clickstream data can also be used to make recommendations of products or types of content a user may be interested in. There is a large literature on recommender systems and on learning user profiles based on such rules (e.g., Adomavicius and Tuzhilin, 2001; Aggarwal et al., 1998; Mobasher et al., 2002). The representations of these rules may or may not have an explicit temporal component. For instance, rules of the form "if (A, B, C) then (D, E)" may indicate that most users who access (like) A, B and C also access (like) D and E. Such rules are easily learned from the matrices used for collaborative filtering. These rules can also be modified to the form "if A, B, C in (0, t) then D, E in (t, t + k)", thereby explicitly having a temporal component that captures the fact that the behavior in the consequent should only occur after the behavior in the antecedent. In this specific example, the content accessed in the consequent of the rule is within k time units after a user accesses A, B and C. The literature in sequential pattern discovery (Roddick and Spiliopoulou, 2002; Srikant and Agrawal, 1996) addresses such pattern discovery methods. The research problems here center on developing appropriate new representations and search algorithms.

Adomavicius and Tuzhilin (2001) note that rules generated from Web clickstream data may need to be validated by domain experts before they are used in making specific recommendations. However, validation of individual rules may be impractical given that rule discovery methods can learn thousands of these rules for each user. To address this, Adomavicius and Tuzhilin (2001) present a system that facilitates this rule validation process by using validation operators that permit experts to select groups of rules for simultaneous validation. From the R-E-S perspective the rule validation process can be viewed as a critical evaluation component for discovered patterns. The evaluation of patterns is not solely based on traditional strength measures, but on user-defined criteria that the validation system described in Adomavicius and Tuzhilin (2001) facilitates.
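Returning to the backtracking patterns mentioned above, the following sketch flags sessions in which a user returns to a previously visited page and then follows a link to a page not seen before. It is a rough illustrative check, not the method of Srikant and Yang (2001); the page names and sessions are hypothetical.

    def has_backtracking(pages):
        """True if the session revisits an earlier page and then branches to a new one,
        a rough proxy for the trial-and-error navigation described above."""
        seen = set()
        backtracked = False
        for page in pages:
            if page in seen:
                backtracked = True        # returned to an earlier page
            else:
                if backtracked:
                    return True           # ...and then followed a new link
                seen.add(page)
        return False

    # Hypothetical sessions.
    print(has_backtracking(["home", "laptops", "home", "cameras"]))  # True
    print(has_backtracking(["home", "laptops", "checkout"]))         # False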
More generally, beyond product recommendations, future systems may need to make process recommendations for online users intelligently. As firms increasingly develop their online presence and as customers increasingly use this channel to transact, it will be important to develop proactive methods for assisting customers, just as a sales clerk may come over in a physical store when a customer appears to be in need of help. There is the potential to do this automatically (i.e., determine from observed real-time clickstream data that a customer is in need), but this has not been studied as yet. The R-E-S framework raises questions that can be useful here in building such methods: What is the representation for patterns indicating that a user is in need? What is the evaluation criterion, and how can such patterns be learned?

Methods for detecting online fraud may also use patterns learned from Web clickstream data. These methods broadly fall into two classes. In the first, the methods must be able to determine that some sequence of Web activity is unusual. This requires a definition of usual or normal behavior. One possibility is to define user profiles based on behavioral patterns, as done in Adomavicius and Tuzhilin (2001). A new stream of clicks can then be evaluated against an existing user profile to determine how likely it is that the access comes from the given user. The second class of methods builds explicit representations of what fraudulent activity may look like. Approaches that do this appear in online intrusion detection (Lee et al., 1999), where the goal is to determine hacks or security compromises in computer networks. One example is a (malicious) program attempting to connect on specific port numbers in sequence. If the behavior of such a program is known (based on experts who study how networks get hacked into), then specific patterns in that representation may be learned. In both these examples, the contributions can be in all three R-E-S dimensions: pattern representations may be novel, the evaluation criteria (what is usual or unusual) are critical, and methods for learning typical (or unusual) patterns are important.
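To make the first class of methods a little more concrete, the sketch below scores a new sequence of clicks by the fraction of its page-to-page transitions that appear in a stored profile of a user's past transitions; low scores suggest unusual activity. This is a deliberately simple proxy for the idea, not the profile representation of Adomavicius and Tuzhilin (2001) or an actual intrusion detection method, and the history, sessions and page names are invented.

    def transition_profile(sessions):
        """Set of (page, next_page) transitions observed in a user's past sessions."""
        profile = set()
        for pages in sessions:
            profile.update(zip(pages, pages[1:]))
        return profile

    def typicality_score(profile, pages):
        """Fraction of transitions in a new click sequence that match the profile."""
        transitions = list(zip(pages, pages[1:]))
        if not transitions:
            return 1.0
        return sum(t in profile for t in transitions) / len(transitions)

    # Invented browsing history and new sessions.
    history = [["home", "sports", "scores"], ["home", "sports", "news"]]
    profile = transition_profile(history)
    print(typicality_score(profile, ["home", "sports", "scores"]))       # 1.0 (typical)
    print(typicality_score(profile, ["account", "payment", "payment"]))  # 0.0 (unusual)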
6 Conclusion
As firms increasingly use the Web to interact with customers, Web clickstream data becomes increasingly valuable, since it captures information pertaining to every interaction a customer has with a firm. This naturally presents opportunities for leveraging the data using pattern discovery approaches, and there has been substantial research on various topics related to pattern discovery from Web clickstream data. This chapter presented a framework for pattern discovery and showed how it can be used both to understand different pattern discovery techniques proposed in the literature and to understand research on applications of these techniques to Web clickstream data. Examples in a
variety of applications such as online segmentation, Web site design, online recommendations and online fraud highlight both the value that pattern discovery techniques can provide and the value of the R-E-S framework as a tool to better understand the pattern discovery approaches developed for these problems.
References

Adomavicius, G., A. Tuzhilin (2001). Using data mining methods to build customer profiles. IEEE Computer 34(2).
Aggarwal, C., Z. Sun, P.S. Yu (1998). Online generation of profile association rules, in: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, August, New York, NY.
Agrawal, R., T. Imielinski, A. Swami (1993). Mining association rules between sets of items in large databases, in: Proceedings of the 1993 ACM SIGMOD Conference on Management of Data, Washington, DC, pp. 207–216.
Agrawal, R., H. Mannila, R. Srikant, H. Toivonen, A.I. Verkamo (1995). Fast discovery of association rules, in: U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA.
Aumann, Y., Y. Lindell (1999). A statistical theory for quantitative association rules, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, pp. 261–270.
Brynjolfsson, E., Y. Hu, M.D. Smith (2003). Consumer surplus in the digital economy: estimating the value of increased product variety at online booksellers. Management Science 49(11), 1580–1596.
Hand, D. (1998). Data mining: statistics and more. The American Statistician 52, 112–118.
Hand, D.J., H. Mannila, P. Smyth (2001). Principles of Data Mining. The MIT Press, Cambridge, MA.
Hipp, J., U. Guntzer, G. Nakhaeizadeh (2000). Algorithms for association rule mining—a general survey and comparison. SIGKDD Explorations 2(1), 58–64.
Kosala, R., H. Blockeel (2000). Web mining research: a survey. SIGKDD Explorations 2(1), 1–15.
Lee, W., S.J. Stolfo, K.W. Mok (1999). A data mining framework for building intrusion detection models, in: Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, pp. 120–132.
Mannila, H., H. Toivonen, A.I. Verkamo (1995). Discovering frequent episodes in sequences, in: Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Montreal, Canada, August, pp. 210–215.
Mobasher, B., H. Dai, T. Luo, M. Nakagawa (2002). Discovery and evaluation of aggregate usage profiles for web personalization. Data Mining and Knowledge Discovery 6(1), 61–82.
Padmanabhan, B., A. Tuzhilin (1996). Pattern discovery in temporal databases: a temporal logic approach, in: Proceedings of KDD 1996, Portland, OR, pp. 351–355.
Padmanabhan, B., A. Tuzhilin (1998). A belief-driven method for discovering unexpected patterns, in: Proceedings of KDD 1998, New York, NY, pp. 94–100.
Padmanabhan, B., A. Tuzhilin (2000). Small is beautiful: discovering the minimal set of unexpected patterns, in: Proceedings of KDD 2000, Boston, MA, pp. 54–64.
Perkowitz, M., O. Etzioni (2000). Adaptive web sites. Communications of the ACM 43(8), 152–158.
Roddick, J.F., M. Spiliopoulou (2002). A survey of temporal knowledge discovery paradigms and methods. IEEE Transactions on Knowledge and Data Engineering 14(4), 750–767.
Srikant, R., R. Agrawal (1996). Mining sequential patterns: generalizations and performance improvements, in: Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology, March 25–29, Avignon, France.
Srikant, R., Y. Yang (2001). Mining web logs to improve website organization, in: Proceedings of the 10th International Conference on World Wide Web (WWW '01), Hong Kong, May 01–05. ACM Press, New York, NY, pp. 430–437.
Srivastava, J., R. Cooley, M. Deshpande, P. Tan (2000). Web usage mining: discovery and applications of usage patterns from Web data. SIGKDD Explorations Newsletter 1(2), 12–23.
Swanson, D.R. (1986). Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine 30, 7–18.
Yang, Y., B. Padmanabhan (2005). GHIC: a hierarchical pattern based clustering algorithm for grouping web transactions. IEEE Transactions on Knowledge and Data Engineering 17(9), 1300–1304.
Zhang, H., B. Padmanabhan, A. Tuzhilin (2004). On the discovery of significant statistical quantitative rules, in: Proceedings of KDD 2004, Seattle, WA, pp. 374–383.
Adomavicius & Gupta, Eds., Handbooks in Information Systems, Vol. 3 Copyright © 2009 by Emerald Group Publishing Limited
Chapter 5
Customer Delay in E-Commerce Sites: Design and Strategic Implications
Deborah Barnes Market Research and Member Insights, 9800 Fredericksburg Road, San Antonio, TX 78240, USA
Vijay Mookerjee Information Systems & Operations Management, The School of Management, The University of Texas at Dallas, PO Box 860688 SM 33, Richardson, TX 75080-0688, USA
Abstract This chapter explores how e-commerce firms consider potential delays to their consumers as a component of their overall profitability. A successful e-commerce strategy must incorporate the impact of the potential delays into important firm decisions such as: IT capacity and allocation, advertising dollars spent, the quality of service (i.e., more or less delay) provided to consumers, and the ease at which pricing information can be discovered at competing web sites.
Opportunities to conduct business in the online environment have been considered a vital component of traditional information systems development; however, the need for speed (efficient processing mechanisms) is magnified in the e-business context. System users in this context are clients and customers, and thus increasing delays hold a double threat: inefficiency (a symptom found in the traditional context) and potential customer and revenue loss. In order to manage the efficiency of the system and reduce potential customer loss, e-business firms may approach the delay problem from several points of view. One consideration is to manage the demand arriving at the site. For example, the objective of spending valuable dollars on advertising is to generate traffic to the firm's web presence; however, what if the demand generated exceeds the capacity of the web site and delays (perhaps in excess of the consumer's tolerance) are therefore experienced?
The firm may want to jointly consider the budgets allocated to advertising and IT capacity in order to manage delays. Another delay management technique might be more short term: creating some sort of "Express Lane" in the online environment. In a traditional store, salespersons do not necessarily provide the same quality of service to all customers. For example, an important customer may get more attention. In an e-business environment, it is not clear if web sites are designed to accommodate differentiated service. One reason for providing differentiated service is that customers may exhibit different amounts of impatience, i.e., the point at which they leave if they are made to wait. A customer's impatience could be a function of what the customer intends to purchase, the intended purchase value, and so on. For example, it is quite reasonable to expect that delay tolerance (the opposite of impatience) will increase with intended purchase value. This suggests the equivalent of an "express" check-out lane in a grocery store. On the other hand, it is also reasonable to expect that the processing requirements of a high-value transaction will also be high. In an e-business site, the average time waited depends on the processing requirements of a transaction. This feature is unlike a grocery store, where the average amount of time waited (total time in the queue minus actual service time) at a check-out counter does not depend on the items purchased. Thus attention should be allocated across customers after a careful analysis of several effects: (1) the likelihood of the customer leaving due to impatience, (2) the revenue generated from the sale if it is successfully completed, and (3) the processing requirements of the transaction.

Finally, how does a firm manage quality of service and delay in a competitive environment? If the customer has the option to go elsewhere, how will the firm's strategy change? In addition, are there any circumstances in which a firm may want to intentionally delay a customer; that is, can delay be used strategically? A firm's delay management plan may not always be to reduce delay, but instead to employ delay strategically. We see examples of built-in delay in travel sites when the site is "trying to find the best deal". Why might a firm build in well-managed delays such as this (perhaps for competitive reasons, or to block shop-bots), and how do consumers' search behaviors impact the use of strategic delay?

The main outcome of interest in this chapter is the management of delay to optimally benefit the firm. Firms may seek to reduce customer and revenue loss or to use delay to increase market share. The focus of this chapter is how to modify firm decisions such as IT capacity and its allocation, advertising dollars spent, service differentiation technique, and competitive strategy in order to maximize the benefits derived from the firm's web presence.
Fig. 1. Simplified e-commerce site structure.
The chapter is organized in five sections. Section 1 introduces the assumed e-commerce environment and consumer behaviors, Section 2 focuses on balancing demand generation (advertising dollars) with the website's capacity to support customers, Section 3 looks at how to provide differentiated services (i.e., the online equivalent of an "Express Lane"), Section 4 examines how delay should be managed in a competitive environment, and Section 5 concludes the chapter.
1 E-commerce environment and consumer behavior
1.1 E-commerce environment

For simplicity we will consider the structure in Fig. 1 for e-commerce sites. It is important to note that the browsing and buying behaviors of users are separated onto two functional servers. The Catalog server is for browsing activities, while the Transaction server is for buying activities. Requests are submitted to the Catalog server and processed according to the needs of the consumers.

1.2 Demand generation and consumer behaviors

The arrival rate of consumer requests to the Catalog server (the demand level) is a counting process characterized by a Poisson distribution. The Poisson distribution applies to the probability of discrete events occurring when the probability is unchanging in time. Using the Poisson distribution to model customer arrivals implies that the time between arrivals is exponentially distributed.1,2 This property states that if the last arrival has not occurred for some time t, then the density that the next arrival will occur in t further time units is the same as the exponential density, i.e., it does not depend on t. Hence the system does not hold any memory.

There are three possible outcomes for users currently being managed by the Catalog server: (1) the consumer may choose to purchase items and is therefore transferred to the Transaction server, (2) the consumer may browse the site only and continue to be served by the Catalog server, or (3) the consumer may exit the site early. This last, "Early Quit" scenario may be due to a variety of reasons, including consumer impatience or expiring consumer time budgets. While consumers may leave the Transaction server before the transaction has been completed, we consider this scenario much less likely, as the user has already invested in the process. That is, a user who has provided shipping information and a payment method is not as likely to quit before purchase. Therefore, our focus on customer loss examines the impact of delays experienced on the Catalog server.

1 See Ross (2000), Chapter 5 for an in-depth explanation of the Poisson and Exponential distributions.
2 For further justification of this assumption, see Burman (1981), O'Donovan (1974), and Sakata et al. (1971).

1.3 System processing technique

The system needs to be able to support multiple requests simultaneously while preventing too much delay for any of the separate requests. The system needs to allocate the processing power equitably across all requests. Round-Robin processing is one of the simplest time-shared processing techniques. In this technique, all users are given an equal share of the processing time in turn. This technique is well suited to environments where the jobs to be executed are relatively similar in size. By using this method, we can control expected waiting times for users based on the possible distribution of jobs and the quantum allocated to jobs. The time waited for each job will be proportional to the processing time attained. A new customer or user enters the system at the end of the queue and must wait some unknown amount of time before receiving their allotted processing unit. It is possible that a customer leaves the queue before receiving their processing unit. If the customer's processing needs are met after one processing unit, their session ends and they exit the system; if not, the customer returns to the end of the queue. The customer repeats this process until his processing needs are met, at which time the customer exits the system.

For example, a time slot or quantum could be 50 milliseconds per user. Imagine a queue with three users: user 1 has a 75 millisecond job, user 2 has a 25 millisecond job and user 3 has a 150 millisecond job. The first user would use all 50 milliseconds of processing time and then return to the end of the queue. The second user would self-terminate after the job was complete after 25 milliseconds and exit the system. The third user would use all 50 milliseconds of processing time and then return to the end of the queue. At this time there are only two users in the queue: user 1 and user 3. A new user arrives, user 4, with a 40 millisecond job and is added to the end of the queue. This time user 1 completes his job, self-terminates after 25 milliseconds and exits the system. Again, user 3 uses all 50 milliseconds of processing time and returns to the end of the queue. User 4 completes his job, self-terminates after 40 milliseconds and exits the system. Finally, user 3, the only remaining user in the system, is able to complete his job after 50 milliseconds and exits the system. (The simulation sketch following the next paragraph reproduces this example.)
In addition, instead of using a fixed quantum, the processor could allocate time based on the number of jobs in the queue. For example, if the number of jobs is n, then each user will receive (1/n) of the processing capacity of the resource.
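The following Python sketch simulates the round-robin discipline just described and reproduces the 50-millisecond-quantum example from the previous paragraph. The exact instant at which user 4 joins (here, immediately after user 3's first turn) is our own assumption, chosen to mirror the narrative; the sketch is illustrative rather than a model of any particular server.

    from collections import deque

    def round_robin(initial_jobs, quantum, arrivals=None):
        """Simulate round-robin processing.
        initial_jobs: list of (user, processing_ms) in initial queue order.
        arrivals: dict mapping 'number of turns already served' to a list of
                  (user, processing_ms) jobs that join the end of the queue then.
        Returns (user, clock_ms_at_completion) pairs in completion order."""
        arrivals = arrivals or {}
        queue = deque(initial_jobs)
        clock, turns, finished = 0, 0, []
        while queue:
            user, remaining = queue.popleft()
            served = min(quantum, remaining)
            clock += served
            remaining -= served
            turns += 1
            if remaining > 0:
                queue.append((user, remaining))   # back to the end of the queue
            else:
                finished.append((user, clock))    # job done, user exits the system
            for job in arrivals.get(turns, []):   # newly arrived users join at the end
                queue.append(job)
        return finished

    # The worked example: 50 ms quantum; user 4 (40 ms job) arrives after the third turn.
    jobs = [("user 1", 75), ("user 2", 25), ("user 3", 150)]
    print(round_robin(jobs, quantum=50, arrivals={3: [("user 4", 40)]}))
    # [('user 2', 75), ('user 1', 150), ('user 4', 240), ('user 3', 290)]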
2 The long-term capacity planning problem
2.1 Allocating spending between advertising and information technology in electronic retailing3

Firms must make decisions regarding the capacity of the servers supporting e-retail sites. Generally, this is a long-term decision and cannot be adjusted daily to meet the varying traffic demands of the e-retail site. If the firm does not provide enough capacity to support the site, consumers will experience delays and possibly leave the site, causing a loss in potential revenue for the firm. However, additional capacity is not cheap, and the firm does not want to overspend on capacity, as this will reduce profits. Therefore, it is important that the e-retailer choose an appropriate level of capacity to support the e-commerce site.

Demand for the site can be affected by many factors. Seasonal trends, item popularity, and advertising campaigns can all influence demand for products sold on an e-commerce site. While a firm does not have explicit control over the demand (traffic) for the e-commerce site, the e-retailer may influence traffic patterns by using advertising campaigns to promote items on the site. In order to address this complex relationship between advertising's demand stimulation and the corresponding server capacities, we must first examine a few related topics: (1) coordinated decision-making among functional units of the firm and (2) advertising response curves, which relate advertising and the stimulation of demand.

2.1.1 Coordinated decision-making

There has been a separation of the functional components of firms into distinct units: marketing, production, accounting, payroll, information technology, and so on. Each of these distinct units has performance objectives to satisfy, which may be influenced by factors internal and external to the firm. In addition, these objectives may not be aligned globally across departmental units, which may cause problems for the firm at large. It may be in the interest of the firm to enforce coordination schemes across functional units in order to ensure the global objectives of the firm are met. For example, suppose the marketing department has launched an enormous campaign promoting the widgets sold by Firm XYZ. In the production department of the same firm, the machine used to make widgets has been experiencing problems, lowering production. The marketing campaign boosts demand for the firm's widgets; however, the firm cannot fulfill the orders due to the production problems and thus loses out on potential profits and wastes the advertising budget spent on the campaign.
3 For additional information, see Tan and Mookerjee (2005).
Fig. 2. Advertising response as an S-curve.
If instead the advertising campaign budget had been determined in conjunction with the production capabilities of the firm, Firm XYZ would have reduced the size of the campaign (if running one at all) to suit the level of production available.
Although a simple example, it illustrates the need for coordination across departments. Functional units optimizing local objectives do not guarantee an optimal global solution for the firm. The levels of demand satisfying the individual, local objectives of each department most likely do not match, and therefore the firm will incur some kind of loss due to this mismatch. If production levels are too high, the firm will incur inventory-holding costs; whereas, if the marketing campaign generates demand that is too high, some of the marketing budget will have been wasted.

2.1.2 Advertising response functions

In order to increase demand for products and services, firms often use sales promotions (in various forms such as coupons, price discounts, and other mechanisms) to entice customers to purchase their products. Studies of the relationship between advertising and market share show that there is a maximum saturation level (the total market size) beyond which additional advertising dollars will not further increase market share. This relationship between advertising and market share has been described as an S-shaped curve,4 where very low and very high advertising expenditures have little impact on market share, while advertising expenditures in a "middle" range show clear improvements in market share. Figure 2 illustrates the S-curve graphically.

2.1.3 Departmental roles

The relationship between marketing and production has been well analyzed5; however, the relationship between marketing and IT capacities has not been bridged. Advertising expenditures are known to stimulate demand up to a threshold level.

4 See Carpenter et al. (1988), Johansson (1979), Little (1975), Mahajan and Muller (1986), and Villas-Boas (1993).
5 See Eliashberg and Steinberg (1993), Fauli-Oller and Giralt (1995), Ho et al. (2002), Morgan et al. (2001), and Shapiro (1977).
Assuming the simple e-commerce site structure in Fig. 1, the advertising-IT problem can now be presented. Similar to the production problem discussed regarding Firm XYZ and the widgets, in an e-commerce setting flooding an e-retailer's site with requests can cause exorbitant wait times and possibly denial of service. In fact, in most firms IT departments are demand takers, meaning that, based on the budget allocated for advertising campaigns, IT resources must provide support for the demand stimulated at the e-retailer's site. It may be in the best interest of the firm to evaluate the costs of allowing IT to function as a demand taker, such as overspending on capacity. If coordination between marketing and IT can be achieved, the e-retailer may be better off.

Marketing department. The marketing department launches advertising campaigns in an attempt to increase demand. With no advertising the existing demand is λ0 (λ0 > 0). By spending on ad campaigns, demand can be increased following the S-curve presented in Fig. 2. The specific functional form relating the demand level λ and advertising spending is

A = α ln[(λ∞/λ0 − 1) / (λ∞/λ − 1)]    (1)
where A is the advertising spending, a the advertising cost parameter that measures advertising effectiveness, l0 the initial demand level, and lN the total market size. IT department. The IT department is responsible for processing customer requests during a session. m represents the capacity of the IT resource (the catalog server). The larger the value of m the faster the processing of customer session requests. When modeling the loss of customer mathematically, there are many influencing variables such as the ratio of arrival rate to processing time, the service rate, and the maximum number of sessions where the services rate and maximum number of sessions both represent site capacity parameters. If the site is experiencing response problems due to maximum number of sessions then the site would most likely be experiencing a denial of service attack, which is not the focus of this discussion. If the maximum number of sessions is infinite then the number of sessions is not a constraint and the response time/service rate(m) can be the object of focus. 2.1.4 Site performance assumptions In order to define the site performance characteristics, several assumptions regarding demand, processing time, customer impatience, and customer time budgets are put in place. As described earlier, customers arrive according to a Poisson distribution. In addition, we focus on the capacity needs of the Catalog server not the Transaction server as illustrated in Fig. 1.
124
D. Barnes and V. Mookerjee
Generic processing time. Each session requires a mean processing time 1/m and a generic distribution (this means that no specific probability distribution has been assumed). The processing time for a session is the total time required to process all requests generated during a given session. Due to differing processing time per request there can be a variable number of requests in a session; therefore, allowing the total processing time distribution for a session to be generic provides robustness. Customer impatience. Although customers remain impatient, the server does not immediately account for customers who have left due to impatience. Therefore, the loss of impatient customers does not relieve congestion on the server. Customer time budgets exponentially distributed. Characterizing the time budgets as exponential takes advantage of the aforementioned property that the time incurred thus far does not impact the likelihood of the time budget expiring during a discrete future time interval. In general, this scenario can be described as a M/G/1/K/PS queue (Exponential, General, One-Server, K-Connections, Processor-Sharing). M refers to the memory-less nature of the arrivals to the queue expressed by the exponential inter-arrival times characterized in Poisson counting processes. The remaining elements are expressed as follows: (G) processing times are generic, (1) one-server, (K) possible sessions at a time and (PS) a shared processor. 2.1.5 Centralized planner case In a centralized setting, the firm is a profit maximizer, and hopes to balance the revenues generated from the site, the cost of additional capacity, and the cost of advertising. The e-retailer’s profit function can be written as: p Sðl; mÞ ðg0 þ g1 mÞ AðlÞ
(2)
where S(l,m) is the net revenue based on the value of arriving customers less the lost customers (h(lL)). m is a variable in the revenue function as the processing capacity will determine how many customers are processed and also those that are lost. (g0 þ g1m) is the IT cost and A(l) is the advertising cost based on the S-curve response function discussed earlier in (Eq. (1)). Neither the IT department or the marketing department have access to all of this information; therefore, a centralized planner must solve this problem collecting the IT capacity costs from the IT department, the advertising costs from the marketing department, and compute the revenue function. By evaluating the profits, demand levels, and processing capacity with respect to the change of advertising and capacity costs using the partial
Ch. 5. Customer Delay in E-Commerce Sites
125
derivatives, we find the following: @F o0 @a
and
@F o0 @g1
(3)
where F ¼ p, l, m the optimal profit, demand level, and processing capacity respectively. From the first partial derivative ((@F/@a)o0) we can see that as advertising becomes more costly IT capacity should be reduced, and from the second (@F/@g1)o0 that as IT capacity becomes more costly advertising should be reduced. With an increase in the cost of advertising or IT capacity, it is intuitive that both the optimal demand level and the IT capacity should decrease. For example, if the advertising cost parameter (a) is held constant while the IT capacity cost l1 is increased, the e-retailer should decrease capacity. In addition, this has a negative impact on the number of sessions that can be processed and completed. Therefore, the level of advertising should also be adjusted downward, even though the cost of advertising has been held constant. This example highlights the importance of coordinating the marketing and IT functions. In a decentralized setting, marketing chooses a demand level, based purely on the advertising cost parameter, a. However, it is clear from the above argument that IT capacity cost should be considered by marketing to arrive at the optimal level of advertising. 2.1.6 Uncoordinated case If IT is set up as a demand taker then the marketing department will choose a demand level l and derive value (hlA) without considering the capacity issues the IT department may face based on this traffic perturbation. There is inherent asymmetry in this setting: marketing chooses demand locally, whereas IT reacts to this demand with an optimal choice for the capacity. This case is uncoordinated as there is no cooperation across the IT and Marketing departments. Based on the given demand level, the IT department chooses a capacity level m and derives value (hL(l, m)g0g1m) balancing the value of lost customers with the cost of capacity. Marketing overspends on advertising to attract more demand than what is optimal, causing IT to incur too much cost for increasing capacity. In this case, over-advertisement will result in the loss of profit. In fact, profit in decentralized case becomes worse as cost of IT capacity increases. This shows additional support for the need to coordinate the marketing and IT decisions. When an appropriate coordination scheme is imposed, the optimal demand level and IT capacity can be achieved.
126
D. Barnes and V. Mookerjee
2.1.7 IT as a cost center In the uncoordinated case, the capacity costs of the IT department are ignored by the marketing department. In this case, the marketing department chooses a demand level which is sub-optimal and too high. In order to achieve the optimal profits that would be attained if there was a central planner, the e-retailer must make the IT department’s capacity costs a part of the marketing departments advertising decisions. Currently, the marketing department makes advertising decisions based on h, the average per session value of the site. In order to adjust the decision-making process of the marketing department, a reduced session value x can be used by the marketing department to make advertising budget decisions. If the session value is reduced, then the marketing department will choose to spend less on advertising, and thus reduce the demand burden on the IT department. The objective of the e-retailer is to find x such that the IT demand levels and the marketing demand levels are equivalent. Although not optimal, an approximate solution for determining the reduced session value x follows: Step 1: IT assumes that a ¼ 0 and estimates x as rffiffiffiffiffiffiffiffiffiffiffiffiffiffi n hg x h g1 l1 1 Step 2: Marketing chooses a demand level l based on the reduced session value x provided. Step 3: IT chooses capacity: m ¼ c(l). In determining x, the IT department uses parameters known to the IT department such as the average per session value h, the marginal cost of capacity g1, the customer impatience level n, and the maximum demand level lN. While this approach does not allow for the exact profits gained in a centralized case, it does approximate those profits very well. In addition, this policy allows a simple coordination of the marketing and IT departments which yields higher profits than a strictly uncoordinated case. 2.1.8 IT as a profit center If the IT department operates as a profit center a processing fee is charged to marketing becomes a source of revenue to the IT department. Marketing is required to pay this processing fee as there is no alternate source of IT resources within the firm. In this setup, Marketing chooses a demand level lM deriving value (hlMZ(lM)A), and Z(lM) is the processing contract of the IT department for use of resources. The process contract specifies the cost of using the IT resources. In equilibrium the two demand levels will match. Likewise, IT chooses capacity m and demand level lIT and derives value Z(lIT)hL(lIT, m)(g0þg1m). In equilibrium, lIT ¼ lM. Furthermore, a processing contract can be derived such that the equilibrium is reached.
Ch. 5. Customer Delay in E-Commerce Sites
127
An additional property of this processing contract is that as more capacity is required the price charged to the marketing department will be reduced (i.e., quantity discounts are in effect). 2.1.9 Practical implications By examining possible coordination and cooperation between marketing and IT departments, advertising campaigns can be added as an additional factor in the customer loss web. In addition, an implicit connection between demand and processing capacity is made based on the IT department’s ability to choose both demand and processing capacity levels. Advertising spending by the marketing department stimulates demand levels. While IT acts as a demand taker and adjusts capacity in order to prevent waiting intolerances and customer loss, the capacity level is sub-optimal. In addition, advertising costs and IT capacity costs can be recognized as additional factors determining the IT capacity and advertising campaign decisions of the e-retailer. By enforcing a coordination scheme between the IT department and the Marketing department, an e-retailer can avoid over-stimulation of demand which causes high capacity costs. Although the increased capacity prevents customer reneging, the benefit from retaining the customers is lessened due to the capacity costs required to keep them. Firms must carefully balance the advertising campaigns put in place with the capacity of the servers available to support the e-commerce site functionality.
3
The short-term capacity allocation problem
3.1 Optimal processing policies for an e-commerce web server6 E-retailers must make long-term decisions balancing their advertising expenditures with the demand capacity available on their web sites. However, once the capacity decision is made there are additional concerns regarding customer impatience. It may be possible to provide differentiated services to customers based on known information regarding the customer. Using differentiated services, the e-retailer can allocate additional capacity to more consumers more sensitive to delay, and less capacity to those consumers less sensitive to delay. Given a fixed capacity level, the e-retailer can allocate processing time to customers, based on customer shopping characteristics. The e-commerce environment is very amenable to this type of discrimination as customers do not witness the service provided to others. Priority processing. Priority processing considers that different users may have a higher level of priority in the system and should in turn receive more processing time based on their needs. 6
For more information, see Tan et al. (2005).
128
D. Barnes and V. Mookerjee
That is, based on a user’s priority class, the users will be allocated different quantum. For example, the higher the priority, the longer the processing time allocated to the user. Implementation of priority processing systems can vary. For example, a modified Round-Robin processing system can be used where the time slot is modified by the priority class level. This type of priority processing requires that the priority level of jobs are known ex ante. While priority processing schemes have been put in place for resources shared by a firm’s internal members, in the e-commerce context, where the users are external to the firm’s boundaries, it is more difficult to implement effective priority processing schemes. The firm’s objectives are tied to unknown consumer characteristics such as the consumer’s willingness to buy, the amount the consumer will spend, the probability that the consumer will renege (leave the site before purchase), and other factors not determined by the firm. The firm wishes to keep the consumers who will spend the most while balancing the processing required for those users. Therefore, establishing a priority-processing scheme in this e-commerce context will be more difficult for the firm than the traditional internal scenario. As aforementioned, the objective of the firm is to implement a priorityprocessing scheme which will allocate capacity to optimize some firm goals. In a super market we see express lanes designed to reduce delay for customers with fewer items. This is an example of a priority scheme. Consumers with fewer items are more likely to leave without purchasing their items; therefore, providing them priority check out allows the firm to retain these consumers. Likewise, e-commerce firms need to implement similar priority processing systems based on the characteristics of consumers. Instead of using a server which allocates equal quantum based on Round-Robin approach, firms should implement a server which uses consumer characteristics to assign priority classes and time slots. 3.2 Environmental assumptions In order to define the environment, several assumptions are made regarding the expected purchase value of consumers, the time to process an order and impatience of consumers, the rate at which consumers arrive at the site, and finally that a Round-Robin processing scheme is in place. Static purchase value. Specifically, the purchase value of a given consumer is static throughout the course of shopping. This means that the current behavior of the consumer does not drive the assigned purchase value. Using historical data, the firm can compute averages of previous visits to the site, or assign a pre-determined value such as the market-average for first time visitors. The predicted purchase value of a customer (denoted by h) follows a distribution f(h). In fact, the e-retailer can determine both the processing
Ch. 5. Customer Delay in E-Commerce Sites
129
time and the delay tolerance (customer impatience) based on the static purchase value assigned to them initially. The only information needed by the e-retailer is the static purchase value of the customer. Value-dependent delay tolerance and processing time. Customer impatience may lead to the customer leaving the system if their tolerance for waiting (delay tolerance) is exceeded. Again referring to the express lane example, customers with a low purchase value (generally fewer items) have a lower delay tolerance and therefore should be given priority while customers with a high purchase value (generally more items) have a higher delay tolerance and can be made to wait. Specifically, customers with a static purchase value h will be willing to wait a random time that is exponentially distributed with a mean w(h).7 Exponential processing time. Characterizing the processing time (t(h)) as exponential implies that the lifetime of the consumer in the system is exponentially distributed. This implies that the probability of a consumer leaving the system in a discrete future time interval, given they have been in the system 1 millisecond or an infinite time, is just as likely; therefore, the processing time incurred does not impact the likelihood of the processing completing during a discrete future time interval. Round-Robin processing. Round-Robin processing is commonly used in prominent e-commerce server software and takes advantage of the idle time between user requests. As mentioned earlier, this type of processing makes better use of processing power by dividing the processing time into processing units allocated to each user. 3.3 Priority processing scheme By using the predicted purchase value (h) of a consumer to determine the priority class k, we can assign processing time g(h)kQ to each priority class (g(h)kZ0), where g(h)k is a priority scheme assigning weights adjusting a fixed quantum Q. The e-retailer is concerned with the loss incurred due to intolerable delays on the e-commerce site. The loss function density (l(h)) defined as the number of customers lost per unit time per unit value for customers with value h is lðhÞ ¼
tðhÞðgðhÞ þ 1Þ lf ðhÞ tðhÞðgðhÞ þ 1Þ þ wðhÞgðhÞ
(4)
t(h) is the processing time per customer. w(h) is the mean delay tolerance. g(h) is the priority assigned. l is the arrival rate of consumers to the catalog server. f(h) is the value distribution of h. We now want to examine how the 7
For further justification of this assumption, see Ancker and Gafarian (1962).
130
D. Barnes and V. Mookerjee
priority scheme (g(h)k) impacts this loss. We will examine two possible priority schemes: a Profit-focused policy and a Quality of Service (QoS) focused policy. 3.4 Profit-focused policy In this problem the e-retailer determines the priority weights by maximizing profits. Priority weights will be assigned such that the e-retailer’s profits will be the highest. The e-retailer’s problem is to choose capacity and processing weight to maximize expected profit per unit time. The total expected revenue (per unit time) is Z S¼ hðlf ðhÞ lðhÞÞdh (5) h2H
where (lf(h)l(h)) are the Net Customers with value h. In addition, the serviceable set (the set of customers worth retaining) is h S H ¼ h: c; h 2 H (6) tðhÞ This expression denotes that the ratio of the static value to the processing time should be greater than a threshold value c.8 Therefore, only customers who exceed this threshold value will be allotted processing time. This demonstrates that the e-retailer is interested in balancing the predicted purchase value (value gained) with the amount of time needed to gain that value (processing time). The priority weights need to be determined. In this case, continuous priority weights are determined based on the delay tolerance, predicted purchase value, and processing time of a given user. The expression for the priority scheme is as follows: 8 ! 1=2 > 1 1 h wðhÞ > < 1 ; h 2 HS 1þ 1 1þ 1 þ wðhÞ=tðhÞ c tðhÞ tðhÞ g ðhÞ ¼ > > : 0; heH S (7) Therefore, given any h value the firm first decides whether or not the individual will be serviced based on the determined serviceable set HS and then assigns a priority weight using the above expression to determine the customer’s capacity allocation. By examining the partial derivatives of the 8
For details regarding the derivation of these expressions see Tan et al. (2005).
Ch. 5. Customer Delay in E-Commerce Sites
131
priority weights with respect to the rate of revenue realization (h/t(h)) and priority to processing ratio (w(h)/t(h)) we find i: ii:
@g ðhÞ
40 @ h=tðhÞ @g ðhÞ
o0 @ wðhÞ=tðhÞ
(8)
These results can be interpreted as follows: (i.) customers with a higher rate of revenue realization receive more processing and (ii.) customers with a higher value of priority to patience can tolerate more delay and hence receive less processing time. Profits may not be the only variable of interest to an e-retailer. While profits are certainly important to all firms, some may wish to focus on customer satisfaction and reducing customer loss due to poor service quality rather than on a policy that treats the customer as a dollar sign. 3.5 Quality of service (QoS) focused policy In this policy, the e-retailer assigns priority weights such that lost customers are minimized. With QoS as the performance objective, the e-retailers’ problem becomes Z min L lðhÞdh (9) h2H
when (Eq. (4)) defines l(h). This means that the objective of the e-retailer is to minimize the loss of customers. The optimal processing allocation can be obtained as 8 1=2 > wðhÞ tc 1 < 1 þ tðhÞ 1 1 þ tðhÞ 1 ; tðhÞotc g ðhÞ ¼ 1þwðhÞ=tðhÞ (10) > : 0; tðhÞ tc It is obvious from the nature of the above equation that customers who require more processing (with higher values of 1/t(h)) will be assigned less priority.More patient buyers (with higher w(h)/t(h) ratio) also receive less processing time. The value threshold hc is given by t(hc) ¼ tc, and can be found by substituting the above into (Eq. (1)). Assuming t(h) increases with the value h, customers with value above the value threshold hc will not receive any processing capacity. Because of the nature of the policies, the profit-focused policy will outperform the QoS-focused policy when profit is the performance variable and vice versa when only a single-period performance variable is considered.
132
D. Barnes and V. Mookerjee
However, e-retailers generally do not operate in a single period and must consider future profits, and how decisions and policies in earlier periods may impact those future profits. 3.6 Practical implications In order to implement these priority processing schemes for a practical server, the e-retailer would need to gather information regarding delay tolerances (w(h)) and expected processing time (t(h)). Click-stream behavior can be analyzed to derive insights regarding these functional forms. Currently, information is commonly collected regarding purchases, customer click behavior, and so on while little information has been gathered regarding customer delay tolerances. This illustrates the need for firms to collect information regarding delay tolerances and expected processing time as it relates to the predicted purchase value. Using real-time click-stream information as well as shopping cart information may allow the implementation of a dynamic policy which uses the current shopping cart value to determine the priority weight assigned to users. One possible drawback of using current shopping cart information instead of a static predicted purchase value is that users may attempt to gain higher priority by adding items to their cart and later removing those items, essentially, users would be ‘‘gaming’’ the system. 4
The effects of competition
In the previous considerations, the e-retailer’s decisions were being made in a monopolistic setting where competition from other e-retailers was not considered. One approximation to the competitive nature of the environment is to consider a multiperiod model, where poor QoS early on may cost the e-retailer the continuing business of a customer. This is a crude approximation, as losing the customer implies that there is another provider of the services or products. 4.1 A multiperiod approach to competition for capacity allocation When considering that customers who are lost rarely come back, an e-retailer considering future profits may have a different outlook. In fact, there is an externality effect that demand growth depends on QoS provided in previous periods. It may be a better policy to first build a solid customer base by focusing on QoS rather than focus on the value of orders. The multiperiod model considers multiple periods indexed by j. The QoS in the earlier periods impacts the demand (lj) in later periods. E-retailers can increase the processing capacity in later periods to match the demand
Ch. 5. Customer Delay in E-Commerce Sites
133
generated. Several new factors come into play in this model: (1) the probability of a dissatisfied customer returning (pj), (2) word of mouth impact on demand, and (3) capacity costs decreasing over time. QoS in earlier periods will impact demand in later periods; therefore, the demand in period j is modeled as follows: ljþ1 ¼ Lj pj þ ðlj Lj Þrj
(11)
where (ljLj) represents the word of mouth impact of QoS in the first period. rj is the growth due to satisfied customers, Lj represents the customers lost, and pj the probability of unsatisfied customers returning. Increasing capacity (Cj) reduces processing time(t(h)) and therefore we modify the processing time expression: tðhÞ ¼ t0 ðhÞ=C j
(12)
Acquiring additional capacity comes with some cost (gj) per unit of capacity. The cost of new capacity acquisition is gj(Cjþ1Cj). This cost must be considered in the profit function. In addition, capacity costs (gj) decrease over time as processing power becomes less expensive. 4.2 Practical implications The specific policy is not shown, but attributes of the policy are discussed. As the discount factor (d) increases, the value of future period profits becomes more valuable in the current period. It will be to the advantage of firms to start with a loss in the first period in order to boost demand and profits in later periods when the discount factor is high enough. That is, initially the firm follows a policy mimicking the earlier QoS-focused policy. In fact, while the policy appears to operate as a QoS policy in initial periods, it is optimizing the long-run profits of the firm. However, if the discount factor is relatively small, then the current period is of greater importance than future periods and a policy like the profit-focused policy will be followed. In the multiperiod problem, the e-retailer may lose money in the first period in order to provide better service and growth in future demand and the server capacity is determined optimally in each period. 4.3 Long-term capacity planning under competition With the introduction of competition, e-retailers may begin to see a new perspective on delay and integrate delay strategically based on known pricing information. In contrast to the more traditional views of delay where delays act as an impediment to the functioning of e-commerce, a new view of delay is explored where a firm may intentionally impose delays. Previously, we focused on the catalog server and the browsing behavior of
134
D. Barnes and V. Mookerjee
the consumer, in this case the browsing behavior is also considered; however users are seeking specific information regarding product prices that can be found at multiple e-retailers. Waiting at a web site for a consumer may not have a negative impact on a user’s evaluation of the site if the waiting is well-managed. Consumers tend not to get as frustrated if the waiting occurs at expected positions like before the web text appears on the screen and not in the middle of interaction. Thus, a ‘‘well managed’’ delay is one that is not in the middle of a sensitive transaction, like the transmission of personal information including financial information. Thus it is conceivable that an e-retailer design its site in a way to engage the customer and make the experience enjoyable enough so that the probability of a purchase increases despite taking a longer time. This Strategic Engagement Model in attempts to capture the strategic motives of e-retailers that are associated with building delay into their interaction with potential customers. There are two e-firms, firm 1 and firm 2 selling an identical product. The firms have identical engagement costs c(ti), where dc(ti)/dtW0. Engagement costs increase with the delay or engagement time (ti) that is built into the browsing process by firm j (where j ¼ 1 or 2) at its web site.9 Thus an e-firm needs to incur higher costs (computing and personnel resources, operating costs, etc.) in order to keep customers engaged for a longer time. The consumer side of the market comprises of homogenous agents, who have unit demand for the product and always buy from the firm with the lower price if they discover that price. A representative consumer does not browse indefinitely in order to discover the price, but has a search time budget t that is known to both firms, and may be defined as the maximum time spent for the purpose of price discovery. The consumer randomly chooses one of the web sites to begin the search for the lower price. If the time budget is expended without discovering the price offered by the second site, a purchase is made from the first site. No purchase is made if neither price is discovered. If both prices are discovered, then the product is purchased from the lower priced firm. The prices (p1Wp2) charged are common knowledge to the firms. The firms interact strategically, setting delay times in a simultaneous game. We calculate the firms’ best response delay functions and solve for the Nash equilibrium in delays. For the lower priced firm 2, it is good strategy to display its price upfront without any delay or engagement time. If it is the first firm to be visited by the consumer, all firm 2 needs to do is to display its price in the shortest time. A longer period of price discovery will merely augment its costs without increasing its expected profit. If firm 2 is the second to be visited then too its optimal strategy is to provide the smallest possible price discovery time as any longer than the minimum time would increase the 9
This is also the time taken by a potential customer for price discovery at a firm.
Ch. 5. Customer Delay in E-Commerce Sites
Fig. 3.
135
Pure strategy Nash equilibrium of the engagement model.
chances of the consumer’s time budget (t) being exceeded. If e (eW0) is the smallest time period that a firm can use to display its price, then the optimal engagement time for firm 2 (t 2 ) is t 2 ¼
(13)
The higher priced firm 1 knows that the only way it can affect a sale is by not allowing the other (lower priced) firm to display its price to the consumer. Thus its optimal strategy is to set its engagement time an e above tt2, the remainder of the time budget after subtracting firm 2’s choice of delay.10 t 1 ¼ t t2 þ
(14)
Mapping out the best response functions for firms 1 and 2 in Fig. 3 we get the pure strategy Nash equilibrium (NE) given by (t, e) which are the NE strategies for firms 1 and 2 respectively. Thus the cheaper vendor makes its price discovery almost instantaneous while the more expensive provider attempts to engage the potential customer and exhaust the search time budget. 10 Notice that the higher priced firm can never win the consumer if it is visited second. The higher priced firm makes a sale only if it is visited first and sets its delay so that the consumer exhausts the time budget before discovering the lower priced firm’s price. However, the higher priced firm should not build delays any higher than e above the ‘‘available’’ time (tt2) since engagement time is costly.
136
D. Barnes and V. Mookerjee
4.4 Practical applications and future adaptations The basic wisdom from the theoretical result above, namely cheaper firms affect quicker price discovery and do not attempt to invest in engaging a customer is seen in e-commerce interfaces all over the World Wide Web. If we compare car loan web sites with more established loan agencies like Chase Automotive Finance, we find that the former, which typically offer lower APRs and a ‘‘bad credit, no problem’’ attitude has a much quicker price (in this case the interest rate) discovery than the latter. Organizations like Chase invest in making the customer go through many ‘‘personalized’’ steps (ostensibly for benefiting the customer), but in reality may just be attempting to engage customers long enough in order to induce them to sign them up for a loan. A similar pattern is observed in the online provision of health insurance. Lower premium e-tailers like Affordable Health Insurance Quotes have a quicker price discovery than larger and more established insurance firms. Several extensions can be made to the current model. It may be useful to consider a model where the cost of engagement time includes a technology cost component that reduces with time, for example, the total cost of engagement could be expressed as: c(t) ¼ a(t)þb(t), where a(t) increases with time whereas b(t) (technology cost) decreases with time. This may lead to interior solutions to the game. Another extension is to consider n firms to see if the basic intuition (high priced firms deliberately delay customers) carries through in the general case. It may also be useful to model consumers as being impatient, i.e., they may leave a slow web site with some probability even if their time budget has not been exhausted. Finally, it will be interesting to conduct an experiment to see if the model predicts correctly with human search agents. 5
Conclusions and future research
There are many issues and perspectives related to customer loss caused by delays in e-commerce settings. In the last section, delay is examined as a strategic tool for an e-retailer; however, this tool can only be enabled in a competitive environment. An e-retailer’s view of delay may change based on the competitive context. Therefore the firm’s competitive environment which could be monopolistic or competitive becomes an additional consideration for customer loss. In addition, forcing coordination between IT departments and the Marketing department is in a firm’s best interest. If IT is treated as a demand taker, the Marketing department over spends on advertising forcing the IT department to make sub-optimal capacity decisions. While these sub-optimal capacity decisions are made in an effort to prevent customer delay and eventually customer loss, the cost of the capacity is not
Ch. 5. Customer Delay in E-Commerce Sites
137
compensated. Using contracts from the IT department for IT resources, the Marketing department sees IT capacity as a cost and the IT department gains revenues from providing resources. This contract enables the departments to work in cooperation to set demand levels in an optimal fashion, and aligns the incentives of the departments. Alternatively, the IT department can generate a reduced average session value for the marketing department to consider. This reduced average session value helps account for the capacity costs experienced by the IT department and drives the demand level decisions for each department towards an optimal choice. Capacity adjustments in e-retailer servers are essential to providing quality service. In fact, quality of service is more important to an e-retailer trying to build a customer base in early periods. By implementing priority processing schemes that focus on maintaining the customer base initially, the e-retailer builds his market share. Later, once the market share is stationary, the e-retailer can maintain his market share using profit-focused priority processing schemes. In addition to an e-commerce environment, an online decision support system environment such as an online help desk may also have queuing effects as discussed in Sections 2 and 3. Capacity planning and allocation will be important decisions in this domain as well. We will also investigate the issue of queuing externalities in the context of decision-oriented systems. The analysis and design of such systems has been greatly facilitated by the use Sequential Decision Models (SDM). These models provide a powerful framework to improve system operation. The objective is to optimize cost or value over a horizon of sequential decisions. In sequential decision-making, the decision maker is assumed to possess a set of beliefs about the state of nature and a set of payoffs about alternatives. The decision maker can either make an immediate decision given current beliefs or make a costly observation to revise current beliefs. The next observation or input to acquire depends on the values of previously acquired inputs. For example, a physician collects relevant information by asking questions or conducting clinical tests in an order that depends on the specific case. Once enough information has been acquired, the decision maker selects the best alternative. Consider an example of an e-business information system where customers log on to a web site to obtain advice on health related matters. Because of queuing effects, customers may have to wait before and/or during the consulting session. The operating policy could adjust (increases or decreases) the quality of the advice offered by the system depending on the length of the queue. Such a policy may aim to optimize an objective such as the total expected cost incurred, namely, the sum of the expected error cost (associated with the advice offered by the system) and the expected waiting cost. The examination of delay experienced in an e-retail environment is of utmost important in the modern age where much business is transacted
138
D. Barnes and V. Mookerjee
online. Just as firms have carefully planned the logistics of their brick and mortar stores, so must they pay special attention to the logistics of their web presence. Customers who are impatient may leave the e-store with the slightest delay; therefore firms must carefully examine the value that may be lost for any given customer type. Understanding the behaviors and value of the individual customer will allow the firms to strategically design the web presence with an appropriate amount of delay. References Ancker, C.J., A.V. Gafarian (1962). Queuing with impatient customers who leave at random. Journal of Industrial Engineering 13, 84–90. Burman, D. (1981). Insensitivity in queuing systems. Advances in Applied Probability 13, 846–859. Carpenter, G.S., L.G. Cooper, D.M. Hanssens, D.F. Midgley (1988). Modeling asymmetric competition. Marketing Science 7, 393–412. Eliashberg, J., R. Steinberg (1993). Marketing-production joint decision making, in: J. Elizashberg, G.L. Lilien (eds.), Marketing, DHandbooks in Operations Research and Management Science, Vol. 5. Elsevier, North Holland. Fauli-Oller, R., M. Giralt (1995). Competition and cooperation within a multidivisional firm. Journal of Industrial Economics XLIII, 77–99. Ho, T.-H., S. Savin, C. Terwiesch (2002). Managing demand and sales dynamics in new product diffusion under supply constraint. Management Science 48(4), 402–419. Johansson, J.K. (1979). Advertising and the S-curve: A new approach. Journal of Marketing Research XVI 346–354. Little, J.D.C. (1975). BRANDAID: A marketing mix model part 1: Structure. Operations Research 23, 628–655. Mahajan, V., E. Muller (1986). Advertising pulsing policies for generating awareness for new products. Marketing Science 5, 86–106. Morgan, L.O., R.L. Daniels, P. Kouvelis (2001). Marketing/manufacturing tradeoffs in product line management. IIE Transactions 33 949–962. O’Donovan, T.M. (1974). Direct solution of M/G/1 processor sharing models. Operations Research 22, 1232–1235. Ross, S.M. (2000). Introduction to Probability Models. 7th ed. Harcourt Academic Press. Sakata, M., S. Noguchi, J. Oizumi (1971). An analysis of the M/G/1 queue under round-robin scheduling. Operations Research 19, 371–385. Shapiro, B.P. (1977). Can marketing and manufacturing coexist? Harvard Business Review 55, 104–114. Tan, Y., K. Moionzaeh, V.S. Mookerjee (2005). Optimal processing policies for an e-commerce web server. Informs Journal on Computing 17(1), 99–110. Tan, Y., V.S. Mookerjee (2005). Allocating spending between advertising and information technology in electronic retailing. Management Science 51(8), 1236–1249. Villas-Boas, J.M. (1993). Predicting advertising pulsing policies in and oligopoly: A model and empirical test. Marketing Science 12, 88–102.
Part II Computational Approaches for Business Processes
This page intentionally left blank
Adomavicius & Gupta, Eds., Handbooks in Information Systems, Vol. 3 Copyright r 2009 by Emerald Group Publishing Limited
Chapter 6
An Autonomous Agent for Supply Chain Management
David Pardoe and Peter Stone Department of Computer Sciences, The University of Texas at Austin, 1 University Station CO500, Austin, TX 78712-0233, USA
Abstract Supply chain management (SCM) involves planning for the procurement of materials, assembly of finished products from these materials, and distribution of products to customers. The Trading Agent Competition Supply Chain Management (TAC SCM) scenario provides a competitive benchmarking environment for developing and testing agent-based solutions to SCM. Autonomous software agents must perform the above tasks while competing against each other as computer manufacturers: each agent must purchase components such as memory and hard drives from suppliers, manage a factory where computers are assembled, and negotiate with customers to sell computers. In this chapter, we describe TacTex-06, the winning agent in the 2006 TAC SCM competition. TacTex-06 operates by making predictions about the future of the economy, such as the prices that will be offered by component suppliers and the level of customer demand, and then planning its future actions in order to maximize profits. A key component of TacTex-06 is the ability to adapt these predictions based on the observed behavior of other agents. Although the agent is described in full, particular emphasis is given to agent components that differ from the previous year’s winner, TacTex-05, and the importance of these components is demonstrated through controlled experiments.
1
Introduction
In today’s industrial world, supply chains are ubiquitous in the manufacturing of many complex products. Traditionally, supply chains have been created through the interactions of human representatives of the various companies involved. However, recent advances in autonomous agent 141
142
D. Pardoe and P. Stone
technologies have sparked an interest, both in academia and in industry, in automating the process (Chen et al., 1999; Kumar, 2001; Sadeh et al., 2001). Creating a fully autonomous agent for supply chain management (SCM) is difficult due to the large number of tasks such an agent must perform. In general, the agent must procure resources for, manage the assembly of, and negotiate the sale of a completed product. To perform these tasks intelligently, the agent must be able to plan in the face of uncertainty, schedule the optimal use of its resources, and adapt to changing market conditions. One barrier to SCM research is that it can be difficult to benchmark automated strategies in a live business environment, both due to the proprietary nature of the systems and due to the high cost of errors. The Trading Agent Competition Supply Chain Management (TAC SCM) scenario provides a unique testbed for studying and prototyping SCM agents by providing a competitive environment in which independently created agents can be tested against each other over the course of many simulations in an open academic setting. A particularly appealing feature of TAC is that, unlike in many simulation environments, the other agents are real profit-maximizing agents with incentive to perform well, rather than strawman benchmarks. In a TAC SCM game, each agent acts as an independent computer manufacturer in a simulated economy. The agent must procure components such as CPUs and memory; decide what types of computers to manufacture from these components as constrained by its factory resources; bid for sales contracts with customers; and decide which computers to deliver to whom and by when. In this chapter, we describe TacTex-06 , the winner of the 2006 TAC SCM competition. In particular, we describe the various components that make up the agent and discuss how they are combined to result in an effective SCM agent. Emphasis is given to those components that differ from the previous year’s winner, TacTex-05, and the importance of these components is demonstrated through controlled experiments. The remainder of the chapter is organized as follows. We first summarize the TAC SCM scenario, and then give an overview of the design of TacTex-06 . Next, we describe in detail the individual components: three predictive modules, two decisionmaking modules that attempt to identify optimal behavior with respect to the predictions, and two methods of adapting to opponent behavior based on past games. Finally, we examine the success of the complete agent, through both analysis of competition results and controlled experiments. 2
The TAC SCM scenario
In this section, we provide a summary of the TAC SCM scenario. Full details are available in the official specification document (Collins et al., 2005).
Ch. 6. An Autonomous Agent for Supply Chain Management
143
In a TAC SCM game, six agents act as computer manufacturers in a simulated economy that is managed by a game server. The length of a game is 220 simulated days, with each day lasting 15 s of real time. At the beginning of each day, agents receive messages from the game server with information concerning the state of the game, such as the customer requests for quotes (RFQs) for that day, and agents have until the end of the day to send messages to the server indicating their actions for that day, such as making offers to customers. The game can be divided into three parts: (i) component procurement, (ii) computer sales, and (iii) production and delivery as explained below and illustrated in Fig. 1. 2.1 Component procurement The computers are made from four components: CPUs, motherboards, memory, and hard drives, each of which come in multiple varieties. From these components, 16 different computer configurations can be made. Each component has a base price that is used as a reference point by suppliers making offers. Agents wanting to purchase components send RFQs to suppliers indicating the type and quantity of components desired, the date on which they should be delivered, and a reserve price stating the maximum amount
Fig. 1.
The TAC SCM Scenario (Collins et al., 2005).
144
D. Pardoe and P. Stone
the agent is willing to pay. Agents are limited to sending at most five RFQs per component per supplier per day. Suppliers respond to RFQs the next day by offering a price for the requested components if the request can be satisfied. Agents may then accept or reject the offers. Suppliers have a limited capacity for producing components, and this capacity varies throughout the game according to a random walk. Suppliers base their prices offered in response to RFQs on the fraction of their capacity that is currently free. When determining prices for RFQs for a particular component, a supplier simulates scheduling the production of all components currently ordered plus those components requested in the RFQs as late as possible. From the production schedule, the supplier can determine the remaining free capacity between the current day and any future day. The price offered in response to an RFQ is equal to the base price of the component discounted by an amount proportional to the fraction of the supplier’s free capacity before the due date. Agents may send zero-quantity RFQs to serve as price probes. Due to the nature of the supplier pricing model, it is possible for prices to be as low when components are requested at the last minute as when they are requested well in advance. Agents thus face an interesting tradeoff : they may either commit to ordering while knowledge of future customer demand is still limited (see below), or wait to order and risk being unable to purchase needed components. To prevent agents from driving up prices by sending RFQs with no intention of buying, each supplier keeps track of a reputation rating for each agent that represents the fraction of offered components that have been accepted by the agent. If this reputation falls below a minimum acceptable purchase ratio (75% for CPU suppliers and 45% for others), then the prices and availability of components are affected for that agent. Agents must therefore plan component purchases carefully, sending RFQs only when they believe it is likely that they will accept the offers received. 2.2 Computer sales Customers wishing to buy computers send the agents RFQs consisting of the type and quantity of computer desired, the due date, a reserve price indicating the maximum amount the customer is willing to pay per computer, and a penalty that must be paid for each day the delivery is late. Agents respond to the RFQs by bidding in a first-price auction: the agent offering the lowest price on each RFQ wins the order. Agents are unable to see the prices offered by other agents or even the winning prices, but they do receive a report each day indicating the highest and lowest price at which each type of computer sold on the previous day. Each RFQ is for between 1 and 20 computers, with due dates ranging from 3 to 12 days in the future, and reserve prices ranging from 75% to
Ch. 6. An Autonomous Agent for Supply Chain Management
145
125% of the base price of the requested computer type. (The base price of a computer is equal to the sum of the base prices of its parts.) The number of RFQs sent by customers each day depends on the level of customer demand, which fluctuates throughout the game. Demand is broken into three segments, each containing about one-third of the 16 computer types: high, mid, and low range. Each range has its own level of demand. The total number of RFQs per day ranges between roughly 80 and 320, all of which can be bid upon by all six agents. It is possible for demand levels to change rapidly, limiting the ability of agents to plan for the future with confidence. 2.3 Production and delivery Each agent manages a factory where computers are assembled. Factory operation is constrained by both the components in inventory and assembly cycles. Factories are limited to producing roughly 360 computers per day (depending on their types). Each day an agent must send a production schedule and a delivery schedule to the server indicating its actions for the next day. The production schedule specifies how many of each computer will be assembled by the factory, while the delivery schedule indicates which customer orders will be filled from the completed computers in inventory. Agents are required to pay a small daily storage fee for all components in inventory at the factory. This cost is sufficiently high to discourage agents from holding large inventories of components for long periods.
3
Overview of TacTex-06
Given the detail and complexity of the TAC SCM scenario, creating an effective agent requires the development of tightly coupled modules for interacting with suppliers, customers, and the factory. The fact that each day’s decisions must be made in less than 15 s constrains the set of possible approaches. TacTex-06 is a fully implemented agent that operates within the TAC SCM scenario. We present a high-level overview of the agent in this section, and full details in the sections that follow. 3.1 Agent components Figure 2 illustrates the basic components of TacTex-06 and their interaction. There are five basic tasks a TAC SCM agent must perform: (1) Sending RFQs to suppliers to request components; (2) Deciding which offers from suppliers to accept;
146
D. Pardoe and P. Stone
Supplier Model
Supply Manager plan for component purchases negotiate with suppliers
offers and deliveries projected inventory and costs
projected component use
Demand Manager bid on customer RFQs produce and deliver computers
offers and deliveries
Customers
computer RFQs and orders
Demand Model
Suppliers
component RFQs and orders
Offer Acceptance Predictor TacTex
Fig. 2.
An overview of the main agent components.
(3) Bidding on RFQs from customers requesting computers; (4) Sending the daily production schedule to the factory; (5) Delivering completed computers. We assign the first two tasks to a Supply Manager module, and the last three to a Demand Manager module. The Supply Manager handles all planning related to component inventories and purchases, and requires no information about computer production except for a projection of future component use, which is provided by the Demand Manager. The Demand Manager, in turn, handles all planning related to computer sales and production. The only information about components required by the Demand Manager is a projection of the current inventory and future component deliveries, along with an estimated replacement cost for each component used. This information is provided by the Supply Manager. We view the tasks to be performed by these two managers as optimization tasks: the Supply Manager tries to minimize the cost of obtaining
Ch. 6. An Autonomous Agent for Supply Chain Management
147
the components required by the Demand Manager, whereas the Demand Manager seeks to maximize the profits from computer sales subject to the information provided by the Supply Manager. In order to perform these tasks, the two managers need to be able to make predictions about the results of their actions and the future of the economy. TacTex-06 uses three predictive models to assist the managers with these predictions: a predictive Supplier Model, a predictive Demand Model, and an Offer Acceptance Predictor. The Supplier Model keeps track of all information available about each supplier, such as TacTex-06’s outstanding orders and the prices that have been offered in response to RFQs. Using this information, the Supplier Model can assist the Supply Manager by making predictions concerning future component availability and prices. The Demand Model tracks the customer demand in each of the three market segments, and tries to estimate the underlying demand parameters in each segment. With these estimates, it is possible to predict the number of RFQs that will be received on any future day. The Demand Manager can then use these predictions to plan for future production. When deciding what bids to make in response to customer RFQs, the Demand Manager needs to be able to estimate the probability of a particular bid being accepted (which depends on the bidding behavior of the other agents). This prediction is handled by the Offer Acceptance Predictor. On the basis of past bidding results, the Offer Acceptance Predictor produces a function for each RFQ that maps bid prices to the predicted probability of winning the order. The steps taken each day by TacTex-06 as it performs the five tasks described previously are presented in Table 1.
4
The Demand Manager
The Demand Manager handles all computation related to computer sales and production. This section describes the Demand Manager, along with the Demand Predictor and the Offer Acceptance Predictor upon which it relies.
4.1 Demand Model When planning for future computer production, the Demand Manager needs to be able to make predictions about future demand in each market segment. For example, if more RFQs are expected for high range than low range computers, the planned production should reflect this fact. The Demand Model is responsible for making these predictions.
148
D. Pardoe and P. Stone
Table 1 Overview of the steps taken each day by TacTex-06 Record information received from the server and update prediction modules. The Supply Manager takes the supplier offers as input and performs the following: decide which offers to accept, update projected future inventory, update replacement costs. The Demand Manager takes customer RFQs, current orders, projected inventory,and replacement costs as input and performs the following:
predict future customer demand using the Demand Model, use the Offer Acceptance Predictor to generate acceptance functions for RFQs, schedule production several days into the future, extract the current day’s production, delivery, and bids from the schedule, update projected future component use.
The Supply Manager takes the projected future component use as input andperforms the following: determine the future deliveries needed to maintain a threshold inventory, use the Supplier Model to predict future component prices, decide what RFQs need to be sent on the current day.
To explain its operation, further detail is required about the customer demand model. The state of each demand segment (high, mid, and low range computers) is represented by parameters Qd and td (both of which are internal to the game server). Qd represents the expected number of RFQs on day d, and td is the trend in demand (increasing or decreasing) on day d. The actual number of RFQs is generated randomly from a Poisson distribution with Qd as its mean. The next day’s demand, Qdþ1, is set to Qdtd, and tdþ1 is determined from td according to a random walk. To predict future demand, the Demand Manager estimates the values of Qd and td for each segment using an approach first used by the agent DeepMaize in 2003 (Kiekintveld et al., 2004). Basically, this is a Bayesian approach that involves maintaining a probability distribution over (Qd, td) pairs for each segment. The number of RFQs received each day from the segment represents information that can be used to update this distribution, and the distribution over (Qdþ1, tdþ1) pairs can then be generated based on the game’s demand model. By repeating this last step, the expected value of Qi can be determined for any future day i and used as the number of RFQs predicted on that day. Full details of the approach are available in Kiekintveld et al. (2004).1 1 The DeepMaize team has released their code for this approach: http://www.eecs.umich.edu/ Bckiekint/downloads/DeepMaize_CustomerDemand_Release.tar.gz
4.2 Offer Acceptance Predictor

(This section presents a significant addition to the previous agent, TacTex-05.)

To bid on customer RFQs, the Demand Manager needs to be able to predict the orders that will result from the offers it makes. A simple method of prediction would be to estimate the winning price for each RFQ, and assume that any bid below this price would result in an order. Alternatively, for each RFQ the probability of winning the order could be estimated as a function of the current bid. This latter approach is the one implemented by the Offer Acceptance Predictor. For each customer RFQ received, the Offer Acceptance Predictor generates a function mapping the possible bid prices to the probability of acceptance. (The function can thus be viewed as a cumulative distribution function.) This approach involves three components: a particle filter used to generate initial predictions, an adaptive means of revising the predictions to account for the impact of an RFQ's due date, and a learned predictor that predicts how the prices of computers will change in the future.

A visual inspection of each day's winning prices for each type of computer in a typical completed game suggests that these prices tend to follow a normal distribution. To estimate these distributions during a game, the Offer Acceptance Predictor makes use of a separate particle filter [specifically a Sampling Importance Resampling filter (Arulampalam et al., 2002)] for each computer type. A particle filter is a sequential Monte Carlo method that tracks the changing state of a system by using a set of weighted samples (called particles) to estimate a posterior density function over the possible states. The weight of each particle represents its relative probability, and particles and weights are revised each time an observation (conditioned on the current state) is received. In this case, each of the 100 particles used per filter represents a normal distribution (indicating the probability that a given price will be the winning price for the computer) with a particular mean and variance. At the beginning of each game, weights are set equally and each particle is assigned a mean and variance drawn randomly from a distribution that is generated by analyzing the first-day prices from a large data set of past games. (The source of this data set will be described below.) Each succeeding day, a new set of particles is generated from the old. For each new particle to be generated, an old particle is selected at random based on weight, and the new particle's estimate of mean and variance are set to those of the old particle plus small changes, drawn randomly from the distribution of day-to-day changes seen in the data set of past games. The new particles are then reweighted, with the weight of each particle set to the probability of the previous day's price-related observations occurring according to the distribution represented. These observations consist of the reported highest and lowest winning
prices and the acceptance or rejection of each offer made to a customer for the given type of computer. Finally, the weights are normalized to sum to one. The distribution of winning prices predicted by the particle filter is simply the weighted sum of the individual particles' distributions, and from this distribution the function mapping each possible bid price to a probability of acceptance can be determined.

These functions are then modified using values we call day factors, which are designed to measure the effect of the due date on offer acceptance. The due dates for RFQs range from 3 to 12 days in the future, and a separate day factor is learned for each day in this range. Each day factor is set to the ratio of actual orders received to orders expected based on the linear heuristic, for all recent offers made. When an offer is made on an RFQ, the Offer Acceptance Predictor computes the probability of an order by multiplying the initial prediction by the corresponding day factor. The day factors therefore serve both as a means of gauging the impact of due dates on computer prices and as a mechanism for ensuring that the number of orders received is roughly the number expected.

To maximize revenue from the computers sold, the Demand Manager needs to consider not only the prices it will offer in response to the current day's RFQs, but also what computers it will wish to sell on future days. In fact, the Demand Manager plans ahead for 10 days and considers future as well as current RFQs when making offers, as will be described in the next section. It is therefore important for the Offer Acceptance Predictor to be able to predict future changes in computer prices. To illustrate why this is important, Fig. 3 shows the prices at which one type of computer sold during a single game of the 2006 finals. For each day, points representing one standard deviation above and below the average price are plotted. On most days, there is clearly little variance between the winning prices, but prices often change drastically over the course of a few days. This fact suggests that it may be even more valuable to be able to predict future changes in price than to predict the distribution of winning prices on a single day. By simply selling a computer a few days earlier or later, it might be possible for the Demand Manager to significantly increase the price it obtains.

To make these predictions of price changes, the Offer Acceptance Predictor performs machine learning on data from past games. Each training instance consists of 31 features representing data available to the agent during the game, such as the date, estimated levels of customer demand and demand trend, and current and recent computer prices. The label for each instance is the amount by which the average price changes in 10 days. Once the Offer Acceptance Predictor has learned to predict this quantity, it can predict the change in average price for any day between zero and ten days in the future through linear interpolation. No effort is made to predict changes in the shape of the distribution, i.e., the variance.
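Before moving on, here is a minimal sketch of the SIR particle-filter update described earlier in this section. Each particle is a (mean, standard deviation) hypothesis about the winning-price distribution for one computer type; the prior, the day-to-day jitter, and the simplified likelihood (which uses only the reported lowest and highest winning prices) are illustrative assumptions, since the agent derives these quantities from its data set of past games and also weights on the outcomes of its own offers.

```python
# Sketch of a Sampling Importance Resampling filter tracking the winning-price
# distribution for one computer type. Priors, jitter sizes, and the likelihood
# are assumptions standing in for quantities the agent learns from past games.
import math
import numpy as np

rng = np.random.default_rng(0)
N_PARTICLES = 100

def init_particles():
    means = rng.normal(2000.0, 200.0, N_PARTICLES)   # assumed first-day prior
    stds = rng.uniform(30.0, 120.0, N_PARTICLES)
    weights = np.full(N_PARTICLES, 1.0 / N_PARTICLES)
    return means, stds, weights

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def step(means, stds, weights, reported_low, reported_high):
    """One daily update: resample, perturb, and reweight on observed prices."""
    # 1. Resample particles in proportion to their weights.
    idx = rng.choice(N_PARTICLES, N_PARTICLES, p=weights)
    means, stds = means[idx], stds[idx]
    # 2. Perturb each particle by a small random day-to-day change.
    means = means + rng.normal(0.0, 25.0, N_PARTICLES)
    stds = np.clip(stds + rng.normal(0.0, 5.0, N_PARTICLES), 10.0, None)
    # 3. Reweight by the likelihood of the day's reported low and high prices.
    weights = normal_pdf(reported_low, means, stds) * \
              normal_pdf(reported_high, means, stds) + 1e-300
    return means, stds, weights / weights.sum()

def acceptance_probability(bid, means, stds, weights):
    """P(an offer at `bid` is accepted) = P(the winning price exceeds the bid)."""
    z = (bid - means) / (stds * np.sqrt(2.0))
    cdf = 0.5 * (1.0 + np.array([math.erf(v) for v in z]))
    return float(np.sum(weights * (1.0 - cdf)))

if __name__ == "__main__":
    means, stds, weights = init_particles()
    for low, high in [(1850, 2050), (1800, 2000), (1780, 1990)]:   # toy reports
        means, stds, weights = step(means, stds, weights, low, high)
    for bid in (1700, 1850, 2000):
        print(bid, round(acceptance_probability(bid, means, stds, weights), 3))
```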
Fig. 3. Average prices at which one type of computer sold during one game of the 2006 finals. One standard deviation above and below the average is shown (sales price vs. game day).

Thus, to generate an offer acceptance function for a future RFQ, the Offer
Acceptance Predictor simply shifts the predicted distribution over winning prices up or down depending on the predicted change in average price, and bases the acceptance function on this modified distribution.

To train the price change predictor, a learning algorithm and source of training data must be chosen. After experimenting with various algorithms from the WEKA machine learning package (Witten and Frank, 1999), we selected additive regression with decision stumps, an iterative method in which a decision stump is repeatedly fit to the residual from the previous step. (M5 regression trees gave nearly identical performance, but the models generated were significantly larger.) For training data, we could have used data from games in the competition, but instead we ran a large number of games of our own using both variations of TacTex-06 and other agents taken from the TAC Agent Repository (http://www.sics.se/tac/showagents.php), a collection of agents provided by the teams involved in the competition. Doing so allowed us to generate separate training and testing data sets for various combinations of six agents, which we then used to test whether predictors trained on data from games with one set of agents would generalize to games involving a different set of agents. In particular, for four different groups of six agents, we ran 40 games, and we generated training data using 30 games and testing data with the other 10. We then trained a separate predictor on each training set. Fortunately, generalization was good: for each of the four testing data sets, all four predictors were reasonably accurate. In other words, in order to predict price changes in a game with a particular group of agents, it was not absolutely necessary to have trained on data specific to
those agents. We thus chose to train a single predictor on the entire set of data from these games, and use the same predictor throughout the competition. (In our post-competition analysis, we found that this was a reasonable decision given the limited number of games that would have been available during the competition to use for training. In more recent work, however, we explore methods of making use of both sources of data, games from the competition and games run on our own, and show that improvements in predictor accuracy are possible; see Pardoe and Stone, 2007.)

4.3 Demand Manager

The Demand Manager is responsible for bidding on customer RFQs, producing computers, and delivering them to customers. All three tasks can be performed using the same production scheduling algorithm. As these tasks compete for the same resources (components, completed computers, and factory cycles), the Demand Manager begins by planning to satisfy existing orders, and then uses the remaining resources in planning for RFQs. The latest possible due date for an RFQ received on the current day is 12 days in the future, meaning the production schedule for the needed computers must be sent within the next 10 days. The Demand Manager thus always plans for the next 10 days of production. Each day, the Demand Manager (i) schedules production of existing orders, (ii) schedules production of predicted future orders, and then (iii) extracts the next day's production and delivery schedule from the result. The production scheduling algorithm, these three steps, and the means of predicting production beyond 10 days are described in the following sections.

4.3.1 Production scheduling algorithm

The goal of the production scheduler is to take a set of orders and to determine the 10-day production schedule that maximizes profit, subject to the available resources. The resources provided are:
- a fixed number of factory cycles per day;
- the components in inventory;
- the components projected to be delivered;
- completed computers in inventory.
The profit for each order is equal to its price (if it could be delivered) minus any penalties for late delivery and the replacement costs for the components involved, as specified by the Supply Manager. The scheduling algorithm used by the Demand Manager is a greedy algorithm that attempts to produce each order as late as possible. Orders are sorted by profit, and the scheduler tries to produce each order using cycles and components from the latest possible dates. If any part of the order cannot be produced, the needed computers will be taken from the
existing inventory of completed computers, if possible. The purpose of scheduling production as late as possible is to preserve resources that might be needed by orders with earlier due dates. A record is kept of what production took place on each day and how each order was filled.

It should be noted that the scheduling problem at hand lends itself to the use of linear programming to determine an optimal solution. We initially experimented with this approach, using a linear program similar to one designed for a slightly simplified scenario by Benisch et al. (2004a). However, due to the game's time constraints (15 s allowed per simulated day), the need to use the scheduler multiple times per day (and in a modified fashion for bidding on customer RFQs, as described below), and the fact that the greedy approach is nearly optimal [observed in our own experiments and confirmed by Benisch et al. (2006a)], we chose to use the greedy approach.

4.3.2 Handling existing orders

The Demand Manager plans for the production of existing orders in two steps. Before starting, the production resources are initialized using the values provided by the Supply Manager. Then the production scheduler is applied to the set of orders due in one day or less. All orders that can be taken from inventory (hopefully all of them, to avoid penalties) are scheduled for delivery the next day. The production scheduler is next applied to the remaining orders. No deliveries are scheduled at this time, because there is no reward for early delivery.

4.3.3 Bidding on RFQs and handling predicted orders

The goal of the Demand Manager is now to identify the set of bids in response to customer RFQs that will maximize the expected profit from using the remaining production resources for the next 10 days, and to schedule production of the resulting predicted orders. The profit depends not only on the RFQs being bid on the current day, but also on RFQs that will be received on later days for computers due during the period. If these future RFQs were ignored when selecting the current day's bids, the Demand Manager might plan to use up all available production resources on the current RFQs, leaving it unable to bid on future RFQs. One way to address this issue would be to restrict the resources available to the agent for production of the computers being bid on (as in Benisch et al., 2004a). Instead, the Demand Manager generates a predicted set of all RFQs that will be received for computers due during the period, using the levels of customer demand predicted by the Demand Model, and chooses bids for these RFQs at the same time as the actual RFQs from the current day.

Once the predicted RFQs are generated, the Offer Acceptance Predictor is used to generate an acceptance prediction function for every RFQ, both real and predicted. The acceptance prediction functions for predicted RFQs are shifted based on the price changes predicted, as described in Section 4.2.
The Demand Manager then considers the production resources remaining, the set of RFQs, and the set of acceptance prediction functions and simultaneously generates a set of bids on RFQs and a production schedule that produces the expected resulting orders, using the following modification of the greedy scheduler. If we were considering only a single RFQ and had no resource constraints, the expected profit resulting from a particular bid price would be:

    Expected profit = P(order | price) · (price − cost)    (1)
The optimal bid would be the value that maximized this quantity. Computing the expected profit from a set of bids when resource constraints are considered is much more difficult, however, because the profit from each RFQ cannot be computed independently. For each possible set of orders in which it is not possible to fill all orders, the profit obtained depends on the agent's production and delivery strategy. For any nontrivial production and delivery strategy, precise calculation of the expected profit would require separate consideration of a number of possible outcomes that is exponential in the number of RFQs. If we were guaranteed that we would be able to fill all orders, we would not have this problem. The expected profit from each RFQ could be computed independently, and we would have:

    Expected profit = Σ_{i ∈ all RFQs} P(order_i | price_i) · (price_i − cost_i)    (2)
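As a concrete illustration of Eq. (1), the sketch below searches a discretized price grid for the bid that maximizes expected profit for a single RFQ, given an acceptance-probability function such as the one produced by the Offer Acceptance Predictor. The logistic acceptance curve and the numbers are toy assumptions for the example only.

```python
# Toy illustration of Eq. (1): pick the bid maximizing P(order|price)*(price-cost).
import numpy as np

def optimal_bid(acceptance_prob, cost, reserve_price, step=5.0):
    """Search a discretized price grid for the bid maximizing Eq. (1)."""
    prices = np.arange(cost, reserve_price + step, step)
    expected = np.array([acceptance_prob(p) * (p - cost) for p in prices])
    best = int(expected.argmax())
    return float(prices[best]), float(expected[best])

if __name__ == "__main__":
    # Toy acceptance curve: near 1 well below the going price, near 0 above it.
    going_price, spread = 2000.0, 60.0
    accept = lambda bid: 1.0 / (1.0 + np.exp((bid - going_price) / spread))
    bid, profit = optimal_bid(accept, cost=1600.0, reserve_price=2300.0)
    print(f"best bid {bid:.0f}, expected profit {profit:.1f}")
```

When resources are not binding, summing this maximized quantity over all RFQs recovers Eq. (2).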
Our bidding heuristic is based on the assumption that the expected number of computers ordered for each RFQ will be the actual number ordered. In other words, we pretend that it is possible to win a part of an order, so that instead of winning an entire order with probability p, we win a fraction p of an order with probability 1. This assumption greatly simplifies the consideration of filling orders, since we now have only one set of orders to consider, while leaving the formulation of expected profit unchanged. As long as it is possible to fill the partial orders, (2) will hold, where the probability term now refers to the fraction of the order won. It would appear that this approach could lead to unfilled orders when the agent wins more orders than expected, but in practice, this is not generally a problem. Most of the RFQs being bid on are the predicted RFQs that will be received on future days, and so the agent can modify its future bidding behavior to correct for an unexpectedly high number of orders resulting from the current day’s RFQs. TacTex-06 indeed tends to have very few late or missed deliveries using this bidding strategy. By using this notion of partial orders, we can transform the problem of bid selection into the problem of finding the most profitable set of partial orders that can be filled with the resources available, and we can solve this
problem using the greedy production scheduler. All bids are initially set to be just above the reserve price, which means we begin with no orders. The scheduler then chooses an RFQ and an amount by which its bid will be lowered, resulting in an increased partial order for that RFQ. The scheduler simulates filling this increase by scheduling its production as described previously. This process is repeated until no more production is possible or no bid can be reduced without reducing the expected profit.

Because we are working with resource constraints, the goal of the greedy production scheduler at each step is to obtain the largest possible increase in profit using the fewest possible production resources. At each step, the scheduler considers each RFQ and determines the bid reduction that will produce the largest increase in profit per additional computer. The scheduler then selects the RFQ for which this value is the largest. In many cases, however, the most limited resource is production cycles, and not components. In such cases, the increase in profit per cycle used is a better measure of the desirability of a partial order than the increase in profit per additional computer, so we divide the latter quantity by the number of cycles required to produce the type of computer requested by the RFQ and use the resulting values to choose which RFQ should be considered next. We consider cycles to be the limiting factor whenever the previous day's production used more than 90% of the available cycles. The range of possible bid prices is discretized for the sake of efficiency. Even with fairly fine granularity, this bidding heuristic produces a set of bids in significantly less time than the 15 s allowed per simulated game day. The complete bidding heuristic is summarized in Table 2.

4.3.4 Completing production and delivery

After applying the production scheduler to the current orders and RFQs, the Demand Manager is left with a 10-day production schedule, a record of how each order was filled, and a set of bids for the actual and predicted RFQs. The bids on actual RFQs can be sent directly to customers in their current form, and computers scheduled for delivery can be shipped. The Demand Manager then considers modifications to the production schedule to send to the factory for the next day. If there are no cycles remaining on the first day of the 10-day production schedule, the first day can be sent unchanged to the factory. Otherwise, the Demand Manager shifts production from future days into the first day so as to utilize all cycles, if possible.

4.3.5 Production beyond 10 days

The components purchased by the Supply Manager depend on the component use projected by the Demand Manager. If we want to allow the possibility of ordering components more than 10 days in advance, the Demand Manager must be able to project its component use beyond the 10-day period for which it plans production.
Table 2 The bidding heuristic

For each RFQ, compute both the probability of winning and the expected profit as a function of price.
Set the bid for each RFQ to be just above the reserve price.
Repeat until no RFQs are left in the list of RFQs to be considered:
- For each RFQ, find the bid lower than the current bid that produces the largest increase in profit per additional computer ordered (or per additional cycle required during periods of high factory utilization).
- Choose the RFQ and bid that produce the largest increase.
- Try to schedule production of the partial order resulting from lowering the bid.
- If it cannot be scheduled, remove the RFQ from the list.
- If the production was scheduled, but no further decrease in the bid will lead to an increase in profit, remove the RFQ from the list.
Return the final bid for each RFQ.

One possibility we considered
was to extend this period and predict RFQs farther into the future. Another was to predict future computer and component prices by estimating our opponents' inventories and predicting their future behavior. Neither method provided accurate predictions of the future, and both resulted in large swings in projected component use from one day to the next. The Demand Manager thus uses a simple and conservative prediction of future component use.

The Demand Manager attempts to predict its component use for the period between 11 and 40 days in the future. Before 11 days, the components used in the 10-day production schedule are used as the prediction, and situations in which it is advantageous to order components more than 40 days in advance appear to be rare. The Demand Model is used to predict customer demand during this period, and the Demand Manager assumes that it will win, and thus need to produce, some fraction of this demand. This fraction ranges from zero during times of low demand to 1/6 during times of moderate or high demand, although the Demand Manager will not predict a higher level of component use than is possible given the available factory cycles. While this method of projecting component use yields reasonable results, improving the prediction is a significant area for future work.
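A minimal sketch of this conservative projection, assuming the agent wins a fixed fraction of predicted demand capped by factory capacity, is shown below; the cycle counts are illustrative stand-ins for the game's parameters.

```python
# Sketch of the conservative component-use projection for days 11-40.
# Cycle counts and cycles-per-computer are assumptions, not the game's values.
DAILY_CYCLES = 2000
CYCLES_PER_COMPUTER = 5          # assumed average requirement

def projected_component_use(predicted_rfqs_per_day, demand_fraction):
    """Computers (hence component sets) the agent plans to build each day."""
    capacity = DAILY_CYCLES // CYCLES_PER_COMPUTER
    return [min(int(rfqs * demand_fraction), capacity)
            for rfqs in predicted_rfqs_per_day]

if __name__ == "__main__":
    demand = [300] * 30                     # predicted RFQs for days 11-40
    low = projected_component_use(demand, 0.0)      # times of low demand
    high = projected_component_use(demand, 1 / 6)   # moderate or high demand
    print(low[:3], high[:3])
```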
5
The Supply Manager
The Supply Manager is responsible for purchasing components from suppliers based on the projection of future component use provided by the Demand Manager, and for informing the Demand Manager of expected component deliveries and replacement costs. In order to be effective, the
Supply Manager must be able to predict future component availability and prices. The Supplier Model assists in these predictions.

5.1 Supplier Model

The Supplier Model keeps track of all information sent to and received from suppliers. This information is used to model the state of each supplier, allowing predictions to be made. The Supplier Model performs three main tasks: predicting component prices, tracking reputation, and generating probe RFQs to improve its models.

5.1.1 Price prediction

To assist the Supply Manager in choosing which RFQs to send to suppliers, the Supplier Model predicts the price that a supplier will offer in response to an RFQ with a given quantity and due date. The Supplier Model requires an estimate of each supplier's existing commitments in order to make this prediction. Recall that the price offered in response to an RFQ requesting delivery on a given day is determined entirely by the fraction of the supplier's capacity that is committed through that day. As a result, the Supplier Model can compute this fraction from the price offered. If two offers with different due dates are available, the fraction of the supplier's capacity that is committed in the period between the first and second date can be determined by subtracting the total capacity committed before the first date from that committed before the second. With enough offers, the Supplier Model can form a reasonable estimate of the fraction of capacity committed by a supplier on any single day. For each supplier and supply line, the Supplier Model maintains an estimate of free capacity, and updates this estimate daily based on offers received. Using this estimate, the Supplier Model is able to make predictions on the price a supplier will offer for a particular RFQ.

5.1.2 Reputation

When deciding which RFQs to send, the Supply Manager needs to be careful to maintain a good reputation with suppliers. Each supplier has a minimum acceptable purchase ratio, and the Supply Manager tries to keep this ratio above the minimum. The Supplier Model tracks the offers accepted from each supplier and informs the Supply Manager of the quantity of offered components that can be rejected from each supplier before the ratio falls below the minimum.

5.1.3 Price probes

The Supply Manager will often not need to use the full five RFQs allowed each day per supplier line. In these cases, the remaining RFQs can be used
as zero-quantity price probes to improve the Supplier Model’s estimate of a supplier’s committed capacity. For each supplier line, the Supplier Model records the last time each future day has been the due date for an offer received. Each day, the Supply Manager informs the Supplier Model of the number of RFQs available per supplier line to be used as probes. The Supplier Model chooses the due dates for these RFQs by finding dates that have been used as due dates least recently.
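A small sketch of this probe-selection rule follows. Only the rule itself, probing the due dates that have gone longest without an offer, comes from the text; the bookkeeping structure and the lead-time bounds are assumptions.

```python
# Sketch of choosing due dates for zero-quantity price probes.
# The dictionary layout and lead-time bounds are assumptions for illustration.
def choose_probe_due_dates(last_offer_day, current_day, num_probes,
                           min_lead=5, max_lead=40):
    """Pick due dates probed least recently; dates never probed come first."""
    candidates = range(current_day + min_lead, current_day + max_lead + 1)
    ranked = sorted(candidates, key=lambda day: last_offer_day.get(day, -1))
    return ranked[:num_probes]

if __name__ == "__main__":
    history = {112: 98, 115: 101, 120: 90}   # due date -> last day an offer arrived
    print(choose_probe_due_dates(history, current_day=105, num_probes=3))
```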
5.2 Supply Manager

The Supply Manager's goal is to obtain the components that the Demand Manager projects it will use at the lowest possible cost. This process is divided into two steps: first the Supply Manager decides what components will need to be delivered, and then it decides how best to ensure the delivery of these components. These two steps are described below, along with an alternative means of obtaining components.

5.2.1 Deciding what to order

The Supply Manager seeks to keep the inventory of each component above a certain threshold. This threshold (determined experimentally) is 800, or 400 in the case of CPUs, and decreases linearly to zero between days 195 and 215. Each day the Supply Manager determines the deliveries that will be needed to maintain the threshold on each day in the future. Starting with the current component inventory, the Supply Manager moves through each future day, adding the deliveries from suppliers expected for that day, subtracting the amount projected to be used by the Demand Manager for that day, and making a note of any new deliveries needed to maintain the threshold. The result is a list of needed deliveries that we will call intended deliveries.

When informing the Demand Manager of the expected future component deliveries, the Supply Manager will add these intended deliveries to the actual deliveries expected from previously placed component orders. The idea is that although the Supply Manager has not yet placed the orders guaranteeing these deliveries, it intends to, and is willing to make a commitment to the Demand Manager to have these components available. Because prices offered in response to short-term RFQs can be very unpredictable, the Supply Manager never makes plans to send RFQs requesting delivery in less than five days. (One exception is discussed later.) As discussed previously, no component use is projected beyond 40 days in the future, meaning that the intended deliveries fall in the period between 5 and 40 days in the future.
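The walk through future days described above can be sketched as follows; the threshold is treated as constant here, whereas the agent's threshold also decays to zero late in the game.

```python
# Sketch of computing intended deliveries so inventory stays above a threshold.
def intended_deliveries(current_inventory, expected_deliveries, projected_use,
                        threshold):
    """Return {day: quantity} of additional deliveries needed, day 0 = tomorrow."""
    needed = {}
    inventory = current_inventory
    for day in range(len(projected_use)):
        inventory += expected_deliveries.get(day, 0)
        inventory -= projected_use[day]
        if inventory < threshold:
            needed[day] = threshold - inventory
            inventory = threshold      # the intended delivery is assumed placed
    return needed

if __name__ == "__main__":
    use = [120] * 10                   # projected daily component use
    arriving = {2: 300, 6: 300}        # deliveries already on order
    print(intended_deliveries(current_inventory=600, expected_deliveries=arriving,
                              projected_use=use, threshold=400))
```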
5.2.2 Deciding how to order

Once the Supply Manager has determined the intended deliveries, it must decide how to ensure their delivery at the lowest possible cost. We simplify this task by requiring that for each component and day, that day's intended delivery will be supplied by a single order with that day as the due date. Thus, the only decisions left for the Supply Manager are when to send the RFQ and which supplier to send it to. For each individual intended delivery, the Supply Manager predicts whether sending the RFQ immediately will result in a lower offered price than waiting for some future day, and sends the RFQ if this is the case.

To make this prediction correctly, the Supply Manager would need to know the prices that would be offered by a supplier on any future day. Although this information is clearly not available, the Supplier Model does have the ability to predict the prices that would be offered by a supplier for any RFQ sent on the current day. To enable the Supply Manager to extend these predictions into the future, we make the simplifying assumption that the price pattern predicted on the current day will remain the same on all future days. In other words, if an RFQ sent on the current day due in i days would result in a certain price, then sending an RFQ on any future day d due on day d + i would result in the same price. This assumption is not entirely unrealistic due to the fact that agents tend to order components a certain number of days in advance, and this number generally changes slowly. Essentially, we are saying, "Given the current ordering pattern of other agents, prices are lowest when RFQs are sent x days in advance of the due date, so plan to send all RFQs x days in advance."

The resulting procedure followed by the Supply Manager is as follows. For each intended delivery, the Supplier Model is asked to predict the prices that would result from sending RFQs today with various due dates requesting the needed quantity. A price is predicted for each due date between 5 and 40 days in the future. (Each price is then modified slightly according to a heuristic that will be presented in the next section.) If there are two suppliers, the lower price is used. If the intended delivery is needed in i days, and the price for ordering i days in advance is lower than that of any smaller number of days, the Supply Manager will send the RFQ. Any spare RFQs will be offered to the Supplier Model to use as probes.

The final step is to predict the replacement cost of each component. The Supply Manager assumes that any need for additional components that results from the decisions of the Demand Manager will be felt on the first day on which components are currently needed, i.e., the day with the first intended delivery. Therefore, for each component's replacement cost, the Supply Manager uses the lowest price found when considering the first intended delivery of that component, even if no RFQ was sent. For each RFQ, a reserve price somewhat higher than the expected offer price is used. Because the Supply Manager believes that the RFQs it sends
are the ones that will result in the lowest possible prices, all offers are accepted. If the reserve price cannot be met, the Supplier Model's predictions will be updated accordingly and the Supply Manager will try again the next day.

5.2.3 Waiting to order in certain cases

(This section presents a significant addition to the previous agent, TacTex-05.)

When prices are lower for long-term orders than for short-term orders, the Supply Manager faces an interesting tradeoff. Waiting to order an intended delivery in the short term is expected to increase costs, but by waiting the agent might gain a clearer picture of its true component needs. For example, if customer demand suddenly drops, the agent may be better off if it has waited to order and can avoid unnecessary purchases, even if prices are somewhat higher for those components which the agent does purchase. Using the ordering strategy of the previous section, however, the Supply Manager would always choose to place long-term orders no matter how small the expected increase in cost would be if it waited. A number of experiments using the previous version of the agent, TacTex-05, suggest that agent performance would improve if the Supply Manager were to postpone ordering in such situations (Pardoe and Stone, 2006). One possible way of ensuring this behavior would be to modify the current strategy so that instead of sending a request as soon as the predicted price is at its lowest point, the request is only sent when it is believed to be unlikely that a reasonably close price can still be obtained. In TacTex-06, the Supply Manager implements an approximation of this strategy using a straightforward heuristic: predictions of offer prices are increased by an amount proportional to how far away the requested due date is. In particular, the predicted price for a requested due date d days away, 5 ≤ d ≤ 40, is multiplied by 1 + x_d, where x_d = 0.1(d − 5)/35. Predicted prices are thus increased by between 0% and 10%, values chosen through experimentation. As a result, the Supply Manager will wait to order when long-term prices are only slightly lower than short-term prices.

5.2.4 2-Day RFQs

As mentioned previously, the prices offered in response to RFQs requesting near-immediate delivery are very unpredictable. If the Supply Manager were to wait until the last minute to send RFQs in hopes of low prices, it might frequently end up paying more than expected or be unable to buy the components at all. To allow for the possibility of getting low-priced short-term orders without risk, the Supply Manager sends RFQs due in 2 days, the minimum possible, for small quantities in addition to what is required by the intended deliveries. If the prices offered are lower than those expected from the normal RFQs, the offers will be accepted.
The size of each 2-day RFQ depends on the need for components, the reputation with the supplier, and the success of past 2-day RFQs. Because the Supply Manager may reject many of the offers resulting from 2-day RFQs, it is possible for the agent’s reputation with a supplier to fall below the acceptable purchase ratio. The Supplier Model determines the maximum amount from each supplier that can be rejected before this happens, and the quantity requested is kept below this amount. The Supply Manager decides whether to accept an offer resulting from a 2-day RFQ by comparing the price to the replacement cost and the prices in offers resulting from normal RFQs for that component. If the offer price is lower than any of these other prices, the offer is accepted. If the quantity in another, more expensive offer is smaller than the quantity of the 2-day RFQ, then that offer may safely be rejected. The 2-day RFQs enable the agent to be opportunistic in taking advantage of short-term bargains on components without being dependent on the availability of such bargains.
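Putting the rules of Sections 5.2.2 and 5.2.3 together, a sketch of the send-now-or-wait decision might look like the following; the predicted price curve is a toy input standing in for the Supplier Model's predictions.

```python
# Sketch of the RFQ timing rule: inflate predicted prices by 1 + x_d and send
# today only if the full lead time is (still) the cheapest option.
def adjusted_price(predicted_price, lead_days):
    """Apply the markup 1 + x_d with x_d = 0.1 * (d - 5) / 35, for 5 <= d <= 40."""
    x = 0.1 * (lead_days - 5) / 35.0
    return predicted_price * (1.0 + x)

def send_rfq_today(predicted_prices, lead_days):
    """predicted_prices[d] = price predicted for an RFQ sent today, due in d days.
    Send now iff the adjusted price at the full lead time is no higher than at
    any shorter lead time (ties favor sending now)."""
    prices = {d: adjusted_price(p, d) for d, p in predicted_prices.items()
              if 5 <= d <= lead_days}
    return prices[lead_days] <= min(prices.values())

if __name__ == "__main__":
    # Toy curve: long-term prices only slightly cheaper than short-term ones.
    curve = {d: 1000.0 - 0.5 * d for d in range(5, 41)}
    print(send_rfq_today(curve, lead_days=30))   # markup makes waiting preferable
```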
6
Adaptation over a series of games
The predictions made by the predictive modules as described above are based only on observations from the current game. Another source of information that could be useful in making predictions is the events of past games, made available in log files kept by the game server. During the final rounds of the TAC SCM competition, agents are divided into brackets of six and play a number of games (16 on the final day of competition) against the same set of opponents. When facing the same opponents repeatedly, it makes sense to consider adapting predictions in response to completed games. TacTex-06 makes use of information from these games in its decisions during two phases of the game: buying components at the beginning of the game (impacting mainly the behavior described in Section 5.2), and selling computers at the end of the game (impacting the behavior in Section 4.2). In both cases, only past games within a bracket are considered, and default strategies are used when no game logs are yet available. We chose to focus on these areas for two reasons. Behavior during these two phases varies significantly from one agent to another, possibly due to the fact that these phases are difficult to reason about in general and may thus be handled using special-case heuristic strategies by many agents. At the same time, each agent’s behavior remains somewhat consistent from game to game (e.g. many agents order the same components at the beginning of each game). This fact is critical to the success of an adaptive strategy—the limited number of games played means that it must be possible to learn an effective response from only a few past games.
6.1 Initial component orders

At the beginning of each game, many agents place relatively large component orders (when compared to the rest of the game) to ensure that they will be able to produce computers during the early part of the game. Prices for some components may also be lower on the first day than they will be afterwards, depending on the due date requested. Determining the optimal initial orders to place is difficult, because no information is made available on the first day of the game, and prices depend heavily on the orders of other agents. TacTex-06 addresses this issue by analyzing component costs from past games and deciding what components need to be requested on the first two days in order to ensure a sufficient supply of components early in the game and to take advantage of low prices. The process is very similar to the one described in Section 5.2, except that predictions of prices offered by suppliers are based on past games. First, the components needed are identified, then the decision of which components should be requested is made, and finally the RFQs are generated.

The Supply Manager begins by deciding what components will be needed. On the first day, when no demand information is available (customers begin sending RFQs on the second day), the Supply Manager assumes that it will be producing an equal number of each type of computer, and projects the components needed to sustain full factory utilization for 80 days. On the second day, the Supply Manager projects future customer demand as before and assumes it will receive orders for some fraction of RFQs over each of the next 80 days. The projected component use is converted into a list of intended deliveries as before. (The Supply Manager makes no projections beyond the first 80 days, because we have not observed instances where it would be worthwhile to order components so far in advance.)

Next, the Supply Manager must decide which components should be requested on the current day (the first or second day of the game). As in Section 5.2.2, the Supply Manager must determine which intended deliveries will be cheapest if they are requested immediately. At the beginning of the game, the Supplier Model will have no information to use in predicting prices, and so information from past games is used. By analyzing the log from a past game and modeling the state of each supplier, it is possible to determine the exact price that would have been offered in response to any possible RFQ. Predictions for the current game can be made by averaging the results from all past games. When modeling the states of suppliers, RFQs and orders from TacTex-06 are omitted to prevent the agent from trying to adapt to its own behavior. If the initial component purchasing strategies of opponents remain the same from game to game, these average values provide a reasonable means of estimating prices.
At the beginning of the game, the Supply Manager reads in a table from a file that gives the average price for each component for each pair of request date and due date. Using this table, the Supply Manager can determine which intended deliveries will cost less if requested on the current day than on any later day. Intended deliveries due within the first 20 days are always requested on the first day, however, to avoid the possibility that they will be unavailable later. If opponents request many components on the first day of the game but few on the second, the prices offered in response to RFQs sent on the second day will be about the same as if the RFQs had been sent on the first day. Since information about customer demand is available on the second day of the game but not on the first, it might be beneficial to wait until the second day to send RFQs. For this reason, the Supply Manager will not send a request for an intended delivery if the price expected on the second day is less than 3% more than the price expected on the first. Once the Supply Manager has decided which intended deliveries to request, it must decide how to combine these requests into the available number of RFQs (five, or ten if there are two suppliers). In Section 5.2.2, this problem did not arise, because there were typically few requests per day. On the first two days, it is possible for the number of intended deliveries requested to be much larger than the number of RFQs available. Intended deliveries will therefore need to be combined into groups, with delivery on the earliest group member’s delivery date. The choice of grouping can have a large impact on the prices offered. When there is only one supplier, the Supply Manager begins by dividing the 80-day period into five intervals, defined by six interval endpoints, with a roughly equal number of intended deliveries in each interval. Each interval represents a group of intended deliveries that will have delivery requested on the first day of the interval. One at a time, each endpoint is adjusted to minimize the sum of expected prices plus storage costs for those components delivered early. When no more adjustments will reduce the cost, the Supply Manager sends the resulting RFQs. When there are two suppliers, 10 intervals are used, and intervals alternate between suppliers.
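A rough sketch of this grouping step is given below: intended deliveries are split into intervals, each covered by one RFQ due at the interval's start, and interior endpoints are adjusted one at a time while the estimated cost keeps decreasing. The price curve and the per-unit storage charge are toy assumptions.

```python
# Sketch of grouping intended deliveries into RFQ intervals and locally
# adjusting the interval endpoints to reduce expected price plus storage cost.
def interval_cost(deliveries, start, end, price_of, storage_per_unit_day):
    """Cost of covering deliveries due on days [start, end) with one RFQ due on
    day `start`: expected purchase price plus storage for early arrival."""
    total = 0.0
    for day, qty in deliveries.items():
        if start <= day < end:
            total += qty * (price_of(start) + storage_per_unit_day * (day - start))
    return total

def optimize_endpoints(deliveries, n_intervals, horizon, price_of, storage=0.2):
    days = sorted(deliveries)
    # Initial endpoints: roughly equal numbers of intended deliveries per interval.
    endpoints = [days[int(i * len(days) / n_intervals)] for i in range(n_intervals)]
    endpoints.append(horizon)

    def total_cost(eps):
        return sum(interval_cost(deliveries, eps[i], eps[i + 1], price_of, storage)
                   for i in range(n_intervals))

    improved = True
    while improved:                        # adjust one interior endpoint at a time
        improved = False
        for i in range(1, n_intervals):
            for candidate in range(endpoints[i - 1] + 1, endpoints[i + 1]):
                trial = endpoints[:i] + [candidate] + endpoints[i + 1:]
                if total_cost(trial) < total_cost(endpoints) - 1e-9:
                    endpoints = trial
                    improved = True
    return endpoints, total_cost(endpoints)

if __name__ == "__main__":
    needs = {d: 100 for d in range(5, 80, 5)}            # toy intended deliveries
    price = lambda day: 900.0 + 2.0 * max(0, 30 - day)   # toy early-game price curve
    print(optimize_endpoints(needs, n_intervals=5, horizon=80, price_of=price))
```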
6.2 Endgame sales

Near the end of each game, some agents tend to run out of inventory and stop bidding on computers, whereas other agents tend to have surplus computers, possibly by design, that they attempt to sell up until the last possible day. As a result, computer prices on the last few days of the game are often either very high or very low. When endgame prices will be high, it can be beneficial to hold on to inventory so as to sell it at a premium during the last days. When prices will be low, the agent should deplete its inventory earlier in the game. TacTex-06 adapts in response to the behavior of its
competitors in past games by adjusting the predictions of the Offer Acceptance Predictor (Section 4.2) during the last few days of each game. TacTex-06’s endgame strategy is essentially to reserve only as many computers for the final few days as it expects to be able to sell at high prices. In particular, from day 215 to 217, the Demand Manager will always respond to a customer RFQ (if it chooses to respond) by offering a price slightly below the reserve. For RFQs received on these days, the probability predicted by the Offer Acceptance Predictor is set to the fraction of computers that would have sold at the reserve price on that day in past games. When the Demand Manager plans for a period of production that includes one of these days, these acceptance probabilities will hopefully result in an appropriate number of computers being saved for these three days.
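One plausible reading of this adjustment is sketched below: the acceptance probability for a near-reserve offer on an endgame day is taken to be the fraction of that day's requested computers that sold at roughly the reserve price in the bracket's past games. The data layout and the choice of denominator are assumptions made for the illustration.

```python
# Sketch of deriving endgame acceptance probabilities from past games in the
# same bracket. Field names and the exact definition of the fraction are
# assumptions; only the overall idea comes from the text.
ENDGAME_DAYS = (215, 216, 217)

def endgame_acceptance(past_games, day):
    """Average, over past games, of the fraction of that day's requested
    computers that ended up selling at roughly the reserve price."""
    fractions = [g[day]["sold_near_reserve"] / max(1, g[day]["requested"])
                 for g in past_games if day in g]
    return sum(fractions) / len(fractions) if fractions else 0.0

if __name__ == "__main__":
    games = [{215: {"requested": 400, "sold_near_reserve": 90},
              216: {"requested": 380, "sold_near_reserve": 40}},
             {215: {"requested": 420, "sold_near_reserve": 70},
              216: {"requested": 300, "sold_near_reserve": 30}}]
    for d in ENDGAME_DAYS:
        print(d, round(endgame_acceptance(games, d), 2))
```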
7
2006 Competition results
Out of 21 teams that participated in the final round of the 2006 TAC SCM competition, held over three days at AAMAS 2006, six advanced to the final day of competition. After 16 games between these agents, TacTex-06 had the highest average score, $5.9 million, followed by PhantAgent with $4.1 million and DeepMaize with $3.6 million. (Competition scores are available at http://www.sics.se/tac/scmserver.) Both PhantAgent and DeepMaize were much improved over their 2005 counterparts, and would very likely have beaten the previous year's champion, TacTex-05, if it had competed unchanged. It thus appears that the improvements present in TacTex-06 were an important part of its victory.

Although it is difficult to assign credit for an agent's performance in the competition to particular components, we can make some observations that support this hypothesis. Figure 4 shows the average, over all 16 games on the final day of the competition, of the profit earned per game day for the top three agents. Daily profit is computed by determining what computers were delivered to customers each day and which components in inventory went into those computers, and then subtracting costs from revenue. TacTex-06 clearly had the highest daily profits over the first 70 days of the game, and after this point profits were roughly equal for all three agents. The difference in profits appears to be accounted for by higher revenue per computer. During the first 70 days of each game, TacTex-06 sold about as many computers as PhantAgent and DeepMaize while paying roughly the same costs for components, but TacTex-06 almost always had a much higher average sales price for each type of computer. After day 70, TacTex-06 still had somewhat higher average computer prices, but these were offset by higher component costs than the other two agents paid.
Fig. 4. Daily profits for the top three agents (TacTex-06, PhantAgent, and DeepMaize) on the final day of the 2006 competition, averaged over all 16 games (profit vs. game day).
The ability of TacTex-06 to sell computers at higher prices appears to be due to its attempt to predict future changes in computer prices and react accordingly. During the competition, TacTex-06 could often be seen building up its inventory of completed computers before prices rose or selling off its inventory as prices peaked, while such behavior among other agents was less visible. This behavior can explain not only the fact that TacTex-06 sold computers at higher prices, but also the fact that the advantage was especially large in the first portion of each game (to see why, consider Fig. 3). For this particular game and computer type, prices began very high, then fell rapidly before recovering somewhat. This pattern is actually very common. Agents begin with no components or computers in inventory, and the supply of computers is thus much smaller than the demand in the beginning of each game. As agents obtain components and begin selling computers, prices usually drop rapidly. Due to the rapid changes in computer prices and the predictability of this pattern, the attempts by TacTex-06 to predict and exploit changes in prices are particularly effective in this period of the game. To get a clearer picture of how the improvements in TacTex-06 contribute to its performance, we perform a series of controlled experiments in the following section.
8
Experiments
We now present the results of controlled experiments designed to measure the impact of individual components of TacTex-06 on its overall
performance. In each experiment, two versions of TacTex-06 compete: one unaltered agent that matches the description provided previously, and one agent that has been modified in a specific way. Each experiment involves 30 games. The other four agents competing, Mertacor, DeepMaize, MinneTAC, and PhantAgent (all versions from 2005), are taken from the TAC Agent Repository. (Experiments against different combinations of agents appear to produce qualitatively similar results.)

Experimental results are shown in Table 3. Each experiment is labeled with a number. The columns represent the averages over the 30 games of the total score (profit), percent of factory utilization over the game (which is closely correlated with the number of computers sold), revenue from selling computers to customers, component costs, and the percentage of games in which the altered agent outscored the unaltered agent. In every experiment, the difference between the altered and unaltered agent is statistically significant with 99% confidence according to a paired t-test. The first row, experiment 0, is provided to give perspective to the results of other experiments. In experiment 0, two unaltered agents are used, and all numbers represent the actual results obtained. In all other rows, the numbers represent the differences between the results of the altered agent and the unaltered agent (from that experiment, not from experiment 0). In general, the results of the unaltered agents are close to those in experiment 0, but there is some variation due to differences between games (e.g. customer demand), and due to the effects of the altered agent on the economy.

8.1 Supply price prediction modification

As described in Section 5.2.3, the Supply Manager slightly increases the predictions of prices that will be offered for components by an amount proportional to the number of days before the requested due date. This addition to TacTex-06 is designed to cause the agent to favor short-term component orders over long-term orders if the difference in price is small. In experiment 1, an agent that does not use this technique is tested. Compared to the unaltered agent, this agent has increased component purchases and factory utilization, but the increase in revenue is not enough to offset the higher costs, and the final score is lower than that of the unaltered agent. It appears that the unaltered agent is able to avoid purchasing unprofitable components in some cases by waiting longer to place its orders.

8.2 Offer Acceptance Predictor

We now consider the impact of the improvements to the Offer Acceptance Predictor described in Section 4.2.
Table 3 Experimental results

Experiment  Description                               Score    Utilization (%)  Revenue   Costs    Win %
0           No changes                                $7.28M   83               $104.7M   $94.5M   —
1           No component price prediction increase    −1.42    +3               +3.51     +4.79    23
2           No computer price change prediction       −3.51    −1               −4.50     −0.70    0
3           No particle filter                        −1.97    −7               −10.05    −8.03    0
4           No particle filter or prediction          −3.93    −6               −10.99    −6.83    0
5           Heuristic price change prediction         −1.74    0                −1.14     −0.64    13

Note: In each experiment, one altered version of TacTex-06 and one unaltered version compete in 30 games, along with four additional agents. Columns represent the total score, percent of factory utilization, revenue from customers, component costs, and how often the altered agent outscored the unaltered agent. Dollar figures are in millions. In experiment 0, provided to place other experiments' results in perspective, no alteration is made to TacTex-06, and numbers represent the actual results. In all other experiments, numbers represent the difference between the altered and unaltered agent. In each experiment, the difference between the altered and unaltered agent is statistically significant with 99% confidence according to a paired t-test.

In experiment 2, the altered
agent always predicts that future computer prices will remain unchanged. Not surprisingly, the result is a large decrease in revenue and score. The decrease in score is almost twice as large as the margin of victory for TacTex-06 in the 2006 competition ($1.8 million), adding more weight to the claim of Section 7 that the prediction of future price changes played a large role in the winning performance.

In experiment 3, the particle filter used to generate predictions of offer acceptance is replaced with a simpler heuristic that was used in TacTex-05. This heuristic used linear regression over the results of the past five days' offers to generate a linear function used for offer acceptance predictions and was originally used by the agent Botticelli in 2003 (Benisch et al., 2004a). The experiment shows that the particle filter approach is an improvement over this heuristic. The large drop in factory utilization in the altered agent is surprising.

Experiment 4 shows the result when the changes of experiments 2 and 3 are combined: the agent makes no predictions of future price changes and uses the linear heuristic instead of the particle filter. The score is only slightly worse than in experiment 2, suggesting that the benefits of using the particle filter are more pronounced when price changes are predicted. It is possible that the more detailed and precise predictions of offer acceptance generated from the particle filter are necessary for the agent to effectively make use of the predictions of future price changes.

In experiment 5, the learned predictor of price changes is replaced with a heuristic that performs linear regression on the average computer price over the last 10 days, and extrapolates the trend seen into the future to predict price changes. Although the heuristic's predictions are reasonably accurate, the performance of the altered agent is about midway between that of the unaltered agent and that of the agent from experiment 2 that makes no predictions at all, demonstrating the value of learning an accurate predictor.
9
Related work
Outside of TAC SCM, much of the work on agent-based SCM has focused on the design of architectures for distributed systems in which multiple agents throughout the supply chain must be able to communicate and coordinate (Fox et al., 2000; Sadeh et al., 2001). These systems may involve a static supply chain or allow for the dynamic formation of supply chains through agent negotiation (Chen et al., 1999). Other work has focused on general solutions to specific subproblems such as procurement or delivery. TAC SCM appears to be unique in that it represents a concrete domain in which individual agents must manage a complete supply chain in a competitive setting.
A number of agent descriptions for TAC SCM have been published presenting various solutions to the problem. At a high level, many of these agents are similar in design to TacTex-06: they divide the full problem into a number of smaller tasks and generally solve these tasks using decision-theoretic approaches based on maximizing utility given estimates of various values and prices. The key differences are the specific methods used to solve these tasks.

The problem of bidding on customer RFQs has been addressed with a wide variety of solutions. Southampton-SCM (He et al., 2006) takes a fuzzy reasoning approach in which a rule base is developed containing fuzzy rules that specify how to bid in various situations. PSUTAC (Sun et al., 2004) takes a similar knowledge-based approach. DeepMaize (Kiekintveld et al., 2004) performs a game-theoretic analysis of the economy to decide which bids to place. RedAgent (Keller et al., 2004) uses a simulated internal market to allocate resources and determine their values, identifying bid prices in the process. The approach described in this chapter, where probabilities of offer acceptance are predicted and then used in an optimization routine, is also used in various forms by several other agents: CMieux (Benisch et al., 2006b) makes predictions using a form of regression tree that is trained on data from past games; Foreseer (Burke et al., 2006) uses a form of online learning to learn multipliers (similar to the day factors used in TacTex-06) indicating the impact of various RFQ properties on prices; and Botticelli (Benisch et al., 2004a) uses the heuristic described in Section 8.2.

Like TacTex-06, many agents use some form of greedy production scheduling, but other, more sophisticated approaches have been studied. These include a stochastic programming approach, in which expected profit is maximized through the use of samples generated from a probabilistic model of possible customer orders (Benisch et al., 2004b), and an approach treating the bidding and scheduling problems as a continuous knapsack problem (Benisch et al., 2006a). In the latter case, an ε-optimal solution is presented which is shown to produce results similar to the greedy approach of TacTex-06, but in significantly less time for large problems.

Attention has also been paid to the problem of component procurement, although much of it has focused on an unintended feature of the game rules (eliminated in 2005) that caused many agents to purchase the majority of their components at the very beginning of the game (Kiekintveld et al., 2005). Most agents now employ approaches that involve predictions of future component needs and prices and are somewhat similar to the approach described in this chapter. These approaches are often heuristic in nature, although there are some exceptions; NaRC (Buffett and Scott, 2004) models the procurement problem as a Markov decision process and uses dynamic programming to identify optimal actions.

Although several agents make efforts to adapt to changing conditions during a single game, such as MinneTAC (Ketter et al., 2005) and
Southampton-SCM (He et al., 2005), to our knowledge methods of adaptation to a set of opponents over a series of games in TAC SCM have not been reported on by any other agent. [Such adaptation has been used in the TAC Travel competition, however, both during a round of competition (Stone et al., 2001), and in response to hundreds of previous games (Stone et al., 2003).]
10 Conclusions and future work

In this chapter, we described TacTex-06, a SCM agent consisting of predictive, optimizing, and adaptive components. We analyzed its winning performance in the 2006 TAC SCM competition, and found evidence that the strategy of exploiting predicted changes in computer prices to increase revenue played a significant role in this performance. Controlled experiments verified the value of a number of improvements made to TacTex-05, the previous winner.

A number of areas remain open for future work. There is room for improvement in many of the predictions, possibly through additional uses of learning. Also, by looking farther ahead when planning offers to customers, it may be possible for the agent to better take advantage of the predicted changes in future prices. In addition, there is the question of what would happen if several agents attempted to utilize such a strategy for responding to price changes, and what the proper response to this situation would be.

The most important area for improvement, in both TacTex-06 and other TAC SCM agents, is likely increasing the degree to which agents are adaptive to ensure robust performance regardless of market conditions. While developing TacTex-06, we had the opportunity to carefully tune agent parameters (such as inventory thresholds) and to test various agent modifications during several rounds of competition and in our own experiments with the available agent binaries. In addition, we were able to implement learning-based approaches that took advantage of data from past games. When developing agents for real-world supply chains, such sources of feedback and experience would be reduced in quantity or unavailable. Although it would still be possible to test agents in simulation, the market conditions encountered upon deployment might differ significantly from the simulated conditions. Designing agents that can adapt quickly given limited experience is therefore a significant part of our future research agenda. Ultimately, this research drives both towards understanding the implications and challenges of deploying autonomous agents in SCM scenarios, and towards developing new machine-learning-based complete autonomous agents in dynamic multiagent domains.
Acknowledgments

We would like to thank Jan Ulrich and Mark VanMiddlesworth for contributing to the development of TacTex, the SICS team for developing the game server, and all teams that have contributed to the agent repository. This research was supported in part by NSF CAREER award IIS-0237699.
References

Arulampalam, S., S. Maskell, N. Gordon, T. Clapp (2002). A tutorial on particle filters for on-line nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50(2), 174–188.
Benisch, M., A. Greenwald, I. Grypari, R. Lederman, V. Naroditskiy, M. Tschantz (2004a). Botticelli: a supply chain management agent, in: Third International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), New York, NY, 3, 1174–1181.
Benisch, M., A. Greenwald, V. Naroditskiy, M. Tschantz (2004b). A stochastic programming approach to scheduling in TAC SCM, in: Fifth ACM Conference on Electronic Commerce, New York, NY, 152–159.
Benisch, M., J. Andrews, N. Sadeh (2006a). Pricing for customers with probabilistic valuations as a continuous knapsack problem, in: Eighth International Conference on Electronic Commerce, Fredericton, New Brunswick, Canada.
Benisch, M., A. Sardinha, J. Andrews, N. Sadeh (2006b). CMieux: adaptive strategies for competitive supply chain trading, in: Eighth International Conference on Electronic Commerce, Fredericton, New Brunswick, Canada.
Buffett, S., N. Scott (2004). An algorithm for procurement in supply chain management, in: AAMAS 2004 Workshop on Trading Agent Design and Analysis, New York, NY.
Burke, D.A., K.N. Brown, B. Hnich, A. Tarim (2006). Learning market prices for a real-time supply chain management trading agent, in: AAMAS 2006 Workshop on Trading Agent Design and Analysis / Agent Mediated Electronic Commerce, Hakodate, Japan.
Chen, Y., Y. Peng, T. Finin, Y. Labrou, S. Cost (1999). A negotiation-based multi-agent system for supply chain management, in: Workshop on Agent-Based Decision Support in Managing the Internet-Enabled Supply-Chain, at Agents '99, Seattle, Washington.
Collins, J., R. Arunachalam, N. Sadeh, J. Eriksson, N. Finne, S. Janson (2005). The supply chain management game for the 2006 trading agent competition. Technical report. Available at http://www.sics.se/tac/tac06scmspec_v16.pdf
Fox, M.S., M. Barbuceanu, R. Teigen (2000). Agent-oriented supply-chain management. International Journal of Flexible Manufacturing Systems 12, 165–188.
He, M., A. Rogers, E. David, N.R. Jennings (2005). Designing and evaluating an adaptive trading agent for supply chain management applications, in: IJCAI 2005 Workshop on Trading Agent Design and Analysis, Edinburgh, Scotland, UK.
He, M., A. Rogers, X. Luo, N.R. Jennings (2006). Designing a successful trading agent for supply chain management, in: Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, Hakodate, Japan, 1159–1166.
Keller, P.W., F.-O. Duguay, D. Precup (2004). RedAgent—winner of TAC SCM 2003. SIGecom Exchanges: Special Issue on Trading Agent Design and Analysis 4(3), 1–8.
Ketter, W., J. Collins, M. Gini, A. Gupta, P. Schrater (2005). Identifying and forecasting economic regimes in TAC SCM, in: IJCAI 2005 Workshop on Trading Agent Design and Analysis, Edinburgh, Scotland, UK, 53–60.
Kiekintveld, C., M. Wellman, S. Singh, J. Estelle, Y. Vorobeychik, V. Soni, M. Rudary (2004). Distributed feedback control for decision making on supply chains, in: Fourteenth International Conference on Automated Planning and Scheduling, Whistler, British Columbia, Canada.
Kiekintveld, C., Y. Vorobeychik, M.P. Wellman (2005). An analysis of the 2004 supply chain management trading agent competition, in: IJCAI 2005 Workshop on Trading Agent Design and Analysis, Edinburgh, Scotland, UK.
Kumar, K. (2001). Technology for supporting supply-chain management. Communications of the ACM 44(6), 58–61.
Pardoe, D., P. Stone (2006). Predictive planning for supply chain management, in: Sixteenth International Conference on Automated Planning and Scheduling, Cumbria, UK.
Pardoe, D., P. Stone (2007). Adapting price predictions in TAC SCM, in: AAMAS 2007 Workshop on Agent Mediated Electronic Commerce, Honolulu, HI.
Sadeh, N., D. Hildum, D. Kjenstad, A. Tseng (2001). MASCOT: an agent-based architecture for dynamic supply chain creation and coordination in the Internet economy. Journal of Production, Planning and Control 12(3), 211–223.
Stone, P., M.L. Littman, S. Singh, M. Kearns (2001). ATTac-2000: an adaptive autonomous bidding agent. Journal of Artificial Intelligence Research 15, 189–206.
Stone, P., R.E. Schapire, M.L. Littman, J.A. Csirik, D. McAllester (2003). Decision-theoretic bidding based on learned density models in simultaneous, interacting auctions. Journal of Artificial Intelligence Research 19, 209–242.
Sun, S., V. Avasarala, T. Mullen, J. Yen (2004). PSUTAC: a trading agent designed from heuristics to knowledge, in: AAMAS 2004 Workshop on Trading Agent Design and Analysis, New York, NY.
Witten, I.H., E. Frank (1999). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA.
Chapter 7
IT Advances for Industrial Procurement: Automating Data Cleansing for Enterprise Spend Aggregation
Moninder Singh and Jayant R. Kalagnanam IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598, USA
Abstract

The last few years have seen tremendous changes in IT applications targeted towards improving the procurement activities of an enterprise. The significant cost savings generated by such changes have in turn led to an even greater focus on, and investments in, the development of tools and systems for streamlining enterprise procurement. While the earliest changes dealt with the development of electronic procurement systems, subsequent developments involved an increased shift to strategic procurement functions, and consequently towards the development of tools such as eRFPs and auctions as negotiation mechanisms. A recent trend is the move towards outsourcing part or all of the procurement function, especially for non-core procurement pieces, to emerging intermediaries who then provide the procurement function for the enterprise. In this practice, called Business Transformation Outsourcing (BTO), such third parties can substantially reduce procurement costs, in part by doing procurement on behalf of several different enterprises. An essential aspect of managing this outsourced procurement function is the ability to aggregate and analyze the procurement-spend across one or more enterprises, and rationalize this process. This in turn requires a new set of IT tools that are able to manage unstructured data and provide ways to efficiently aggregate and analyze spend information across potentially several enterprises. Typically, these data cleansing tasks are done manually using rudimentary data analysis techniques and spreadsheets. However, a significant amount of research has been conducted over the past couple of decades in various fields, such as databases, statistics and artificial intelligence, on the development of various data cleansing techniques, and their application to a broad range of applications and domains. This chapter provides a brief survey of these techniques and applications, and then discusses how some of these methods can be adapted to automate the various cleansing activities needed for spend data aggregation. Moreover, the chapter
provides a detailed roadmap for the development of such an automated system, one that enables spend aggregation, especially across multiple enterprises, to be done in an efficient, repeatable and automated manner.
1 Introduction
By streamlining its procurement activities, an enterprise can realize substantial cost savings that directly impact the bottom line. Additionally, rapid developments in information technology (IT) have made this streamlining process significantly faster and cheaper than was possible just a few years ago. As such, more and more enterprises are recognizing this to be strategically essential and are devoting considerable effort and resources to improving their procurement activities, both in terms of reducing the total procurement spend as well as using what is spent more effectively. Towards this end, enterprises have been increasingly using IT tools targeted primarily at their procurement activities. Over the past few years, these tools have gradually become more and more sophisticated, both from a technological and a functional standpoint. Initially, the focus was primarily on the development of electronic systems to assist daily procurement activity at an operational level. These were the early "procurement systems" that focused largely on managing the business process dealing with operational buying, streamlining it so that it follows procedures and authorizations, and handling requisitioning and payment by electronic means. Thereafter, tool development moved to tackle some of the strategic functions of procurement, such as strategic sourcing. This led to an increased interest in the use of tools, such as eRFPs and auctions, as a way of negotiating price and non-price aspects of the requirements and subsequently led to the use of various auction mechanisms for negotiation using electronic exchanges and portals. The latest trend, however, is towards outsourcing non-core parts of the procurement function to emerging intermediaries who then provide the procurement function, especially for non-directs. An essential aspect of managing this outsourced procurement function (as well as of strategic sourcing for procurement that is kept in-house) is the ability to analyze the procurement-spend of a company (along various dimensions such as suppliers and commodities) and rationalize this process. Owing to this, one of the most important activities that an enterprise has to undertake prior to doing strategic sourcing or outsourcing its procurement functions is to develop a single, aggregated view of its procurement-spend across the entire enterprise. Since procurement activities normally take place across an enterprise, spanning multiple back-end systems and/or geographic and
functional areas and often using multiple procurement applications, spend aggregation becomes necessary to understand where the money is being spent, and on what. Once an aggregated, enterprise view of spend is developed, it can be used by the enterprise for various strategic activities such as consolidating suppliers and negotiating better volume-based prices. Spend aggregation becomes an even more essential activity in cases of procurement outsourcing. In such cases, which are a form of business transformation outsourcing (BTO), a third party (referred to henceforth as a BTO service provider) takes over the procurement functions of one or more enterprises (referred to henceforth as BTO clients). However, in order to do the procurement efficiently, the BTO service provider needs to aggregate spend across all these enterprises (i.e. the BTO clients plus the BTO service provider itself) so as to develop a consistent supplier base and a consistent commodity base, resulting in an accurate cross-enterprise view of exactly what is being procured and from whom. Using this view, the BTO service provider too can do significant strategic sourcing (similar to an enterprise doing strategic sourcing with its internal spend but on a much larger scale), such as evaluating all suppliers from which a particular commodity is acquired, and negotiating better deals with one or more of them based on the combined volume of that commodity across all the BTO clients. Procurement outsourcing can lead to significant savings for an enterprise, especially since procurement accounts for a major part of enterprise costs. This is due to several reasons. First, by delegating the procurement function (generally a non-core business activity) to a third party, an enterprise can focus more on its core business operations, streamline its business processes and reduce the complexity and overhead of its operations (by eliminating an activity in which it does not have much expertise). Second, procurement outsourcing allows an enterprise to immediately shrink its cost structure by reducing/eliminating procurement-related resources, including headcount as well as hardware and procurement applications. Third, the cost to acquire goods by an enterprise falls since the BTO service provider passes on some of the savings it generates via the bigger (volume-based) discounts it is able to get by aggregating spend over all its BTO clients, thereby generating higher commodity volumes, and directing it to fewer, preferred suppliers. Moreover, the magnitude of the savings that can be generated by the BTO service provider is typically higher than what an enterprise could achieve by doing similar activities (such as volume aggregation, supplier consolidation, etc.) while keeping its procurement activities in-house. This can be attributed to three main reasons. First, the BTO service provider normally has significant expertise in procurement, and can utilize specialized and more efficient procurement processes. Second, taking on the procurement of multiple enterprises allows the service provider to take advantage of economies of scale. Third, the volume-based discounts that a service provider can negotiate with its suppliers are much higher than what any of the client enterprises could get by itself, since the
service provider can generate significantly higher commodity volumes by aggregating the procurement-spend across all of the clients and combining it with its own spend. For a BTO service provider that itself has significant procurement spend, this allows even greater savings to be negotiated. Figure 1 illustrates the savings that can thus be generated by procurement BTO. Consider an enterprise that typically buys a certain volume, v1, of a given commodity under a pricing contract, c1, that it has negotiated with its supplier. Contract c2 corresponds to the BTO service provider, which has been able to negotiate a better deal by guaranteeing a larger minimum volume, v2. Now, even by moving the above enterprise to its current pricing contract, the BTO service provider can generate significant savings (volume v = v1 + v2 on contract c2). However, the BTO service provider may now be able to negotiate an even better deal, say, c3, due to the further increase in the volume of the given commodity, which allows even greater savings to be generated. However, spend data within an enterprise generally resides in multiple, disparate data sources often distributed across several functional and geographic organizations. Moreover, data in these repositories comes from a variety of sources and applications, such as invoices, purchase orders, account ledgers, and payments. As such, this data is generally inconsistent, with no cross-indexing between transactions, and different naming conventions used for suppliers and commodities, resulting in the same supplier or commodity being described differently in different transactions and/or systems. Consequently, spend aggregation typically requires a significant
Fig. 1. Example showing cost savings in procurement BTO. (The figure plots cost per unit against volume, showing pricing contracts c1, c2 and c3 at volumes v1, v2 and v = v1 + v2.)
amount of effort since the spend data has to be cleansed and rationalized so that discrepancies between multiple naming conventions get resolved, transactions get mapped to a common spend/commodity taxonomy, etc. Clearly, the level of difficulty, and the effort needed, to do this across multiple enterprises, as required for procurement BTO, gets progressively higher since different supplier bases as well as multiple commodity taxonomies have to be reconciled and mapped. This has led to a renewed focus on the development of new tools and methodologies for managing unstructured content inherent in spend data (e.g. commodity descriptions) and cleansing the data to enable spend aggregation, especially across multiple enterprises, to be done in an efficient, repeatable and automated manner. Data cleansing has long been studied in various fields, such as statistics, databases and machine learning/data mining, resulting in a host of data cleansing techniques that have been applied to a multitude of different problems and domains, such as duplicate record detection in databases/data warehouses and linkage of medical records belonging to the same individual in different databases. Often, data cleansing has been a labor-intensive task requiring substantial human involvement. Automation has generally been addressed only recently, and even then only in limited cases. Moreover, many of the problems tackled have been of a very specific nature and fairly domain specific. Nevertheless, the underlying techniques behind the solutions developed have generally been quite similar. Also, some of the problems addressed (e.g. duplicate detection) have much in common with some of the cleansing tasks needed for aggregation of enterprise spend. As such, Section 2 provides an overview of various techniques for data cleansing that have been developed, and applied to various cleansing tasks, over the past few decades. Section 2.1 provides a broad, albeit brief, survey of the main data cleansing techniques and applications, while Sections 2.2–2.4 take three of these techniques that are quite useful for developing automated spend aggregation systems and discuss them in detail as well as highlight their pros and cons for various data cleansing activities. Subsequently, Section 3 deals with the automation of data cleansing for spend aggregation, with Section 3.1 detailing the various data cleansing tasks that must be carried out to facilitate effective spend aggregation within and across enterprises and Section 3.2 providing a detailed roadmap for developing an automated system for carrying out those tasks, using the data cleansing techniques discussed in Section 2. Finally, we conclude and summarize this discussion in Section 4.
2 Techniques for data cleansing
As discussed previously, data cleansing has been studied in various fields and applied to several different problems and domains. Section 2.1 provides
a brief survey of some of the data cleansing literature. Section 2.2 then takes a closer look at some of the types of algorithms underlying the commonly used data cleansing techniques.
2.1 Overview of data cleansing approaches

The data cleansing problem has been studied over several decades under various names, such as record linkage (Fellegi and Sunter, 1969; Winkler, 2002, 2006), duplicate detection (Bitton and Dewitt, 1983; Wang and Madnick, 1989), record matching (Cochinwala et al., 2001), the merge/purge problem (Hernandez and Stolfo, 1995), etc.1 This task, in general, refers to the identification of duplicates that may be present in data due to a variety of reasons, such as errors, different representations or notations, inconsistencies in the data, etc. While substantial work around this issue has been conducted in the statistics community with a focus on specific problems, such as record linkage in medical data for identifying medical records for the same person in multiple databases (Jaro, 1995; Newcombe, 1988) or for matching people across census or taxation records (Alvey and Jamerson, 1997; Jaro, 1989), a large body of literature also exists, especially in the database literature, on more general, domain-independent data cleaning, especially in the context of data integration and data warehousing (Bright et al., 1994; Dey et al., 2002; Lim and Chiang, 2004; Monge and Elkan, 1997). From an algorithmic point of view, the techniques that have been studied for addressing the data cleansing problem can be broadly categorized into text similarity methods (Cohen, 2000; Hernandez and Stolfo, 1995; Monge and Elkan, 1997), unsupervised learning approaches, such as clustering (Cohen and Richman, 2002; Fellegi and Sunter, 1969), and supervised learning approaches (Bilenko and Mooney, 2003; Borkar et al., 2000; Winkler, 2002). Winkler (2006) provides an extensive and detailed survey of data cleansing approaches that have been developed using methods in one or more of these categories. A detailed discussion of this subject matter is beyond the scope of this chapter, and the interested reader is referred to Winkler's paper cited above, as well as numerous other survey articles (Cohen et al., 2003; Rahm and Do, 2000). Nevertheless, in the following section, we discuss, in some level of detail, a few of the most commonly used techniques that are especially suited for the development of automated data cleansing techniques for enterprise spend aggregation.
1 We refer to this ‘‘classic’’ data cleansing problem as the ‘‘duplicate detection’’ problem in the rest of this chapter.
2.2 Text similarity methods

Some of the most commonly used methods for data cleansing have their roots in the information retrieval literature (Baeza-Yates and Ribeiro-Neto, 1999). Generally referred to as string or text similarity methods, these techniques often measure the "similarity" between different strings (with identical strings considered to be the most similar) on the basis of some metric that provides a quantitative measure of the "distance" between multiple strings; the higher the distance between them, the lesser the similarity and vice versa.2 One class of such functions comprises the so-called edit distance functions, which measure the distance between strings as a cost function based on the minimum number of operations (character insertions, deletions and substitutions) needed to transform one string into the other. The Levenshtein distance (LD) (Levenshtein, 1966) is a basic edit distance that assumes a unit cost for each such operation. Several different variations that use different costs for the various operations, as well as extensions of the basic edit distance, have also been proposed (Cohen et al., 2003; Navarro, 2001).

Computation of the LD between two strings can be done using dynamic programming based on a set of recurrences, as described below. Consider the calculation of the LD between two strings, say S and T, with lengths n and m, respectively. Let S[1...i] (and T[1...j]) and S[i] (and T[j]) represent the substring of the first i (and j) characters and the ith (and jth) character of S (and T), respectively. Moreover, let LD(S[1...i], T[1...j]) be the distance between the substrings comprised of the first i characters of S and the first j characters of T. Then LD(S,T) is given by LD(S[1...n], T[1...m]). It is easy to see that this computation can be done recursively by looking at the three different ways of transforming S[1...i] to T[1...j]. These are (i) converting S[1...i-1] to T[1...j-1] followed by converting S[i] to T[j], leading to a cost of LD(S[1...i-1], T[1...j-1]) plus the cost of replacing S[i] by T[j], which is either 0 (if same) or 1 (if different), (ii) converting S[1...i-1] to T[1...j] and deleting S[i], leading to a cost of LD(S[1...i-1], T[1...j]) + 1, and (iii) converting S[1...i] to T[1...j-1] and inserting T[j], leading to a cost of LD(S[1...i], T[1...j-1]) + 1. The cost of converting S to T is then given by the minimum of these three costs, thus

$$
LD(S[1\ldots i], T[1\ldots j]) = \min
\begin{cases}
LD(S[1\ldots i-1], T[1\ldots j-1]) + C_{sub} \\
LD(S[1\ldots i-1], T[1\ldots j]) + 1 \\
LD(S[1\ldots i], T[1\ldots j-1]) + 1
\end{cases}
$$

where C_sub is either 0 or 1 as described in (i) above.

2 We use the terms "similarity" and "distance" interchangeably, depending upon the interpretation that is more commonly used in the literature, with "higher similarity" analogous to "smaller distance" and vice versa.
The form of these recurrence relations leads to a dynamic programming formulation of the LD computation as follows: let C be an (n+1) by (m+1) array where C[i,j] represents LD(S[1...i], T[1...j]). Then LD(S,T) = LD(S[1...n], T[1...m]) = C[n,m] is calculated by successively computing C[i,j] based on the recurrence relations above, as follows:

Initialization:

$$C[0,0] = 0, \qquad C[i,0] = i, \;\; 1 \le i \le n, \qquad C[0,j] = j, \;\; 1 \le j \le m$$

Calculation: compute C[i,j] for all n ≥ i ≥ 1, m ≥ j ≥ 1 using the formula

$$C[i,j] = \min\left(C[i-1,j-1] + C_{sub},\; C[i-1,j] + 1,\; C[i,j-1] + 1\right)$$

where C_sub = 1 if S[i] ≠ T[j] and 0 otherwise.

The advantage of using edit distance measures is that they are fairly robust to spelling errors and small local differences between strings. However, computation of edit distances, as shown above, can be computationally expensive, especially when it has to be done repeatedly for comparing a large number of strings.

Another kind of common distance-based similarity method works by breaking up strings into bags of tokens, and computing the distance between the strings based on these tokens. Tokens can be words (using white space and punctuation as delimiters), n-grams (consecutive n-character substrings), etc. The simplest way to then measure the similarity between the strings is to determine the number of tokens in common between the two strings, the higher the count the greater the similarity. However, since this generally favors longer strings, it is better to use normalized measures such as Jaccard similarity (Jaccard, 1912), Dice similarity (Dice, 1945) or Cosine similarity (Salton and Buckley, 1987). A common way to represent such similarity measures is by considering the strings, say S and T, as vectors in a multi-dimensional vector space and representing them as weight vectors of the form S = {s_1, ..., s_n} and T = {t_1, ..., t_n}, where s_i, t_i are the weights assigned to the ith token (in the collection of all n tokens present in the system) for the strings S and T, respectively. Then the vector product of the two weight vectors, $\sum_{i=1}^{n} s_i t_i$, measures the number of tokens that are common to the two strings, and the above-mentioned similarity measures can be expressed as follows:

$$\mathrm{Jaccard}(S,T) = \frac{\sum_{i=1}^{n} s_i t_i}{\sum_{i=1}^{n} s_i^2 + \sum_{i=1}^{n} t_i^2 - \sum_{i=1}^{n} s_i t_i} \tag{1}$$

$$\mathrm{Dice}(S,T) = \frac{2\sum_{i=1}^{n} s_i t_i}{\sum_{i=1}^{n} s_i^2 + \sum_{i=1}^{n} t_i^2} \tag{2}$$

$$\mathrm{Cosine}(S,T) = \frac{\sum_{i=1}^{n} s_i t_i}{\sqrt{\sum_{i=1}^{n} s_i^2}\,\sqrt{\sum_{i=1}^{n} t_i^2}} \tag{3}$$
In this formulation, if the weights s_i, t_i are assigned such that their value is 1 if the ith token is present in the corresponding string, and 0 otherwise, then the Jaccard similarity can be seen to be the number of tokens in common between the two strings, normalized by the total number of unique tokens in the two strings (union), whereas the Dice similarity can be seen to be the number of tokens in common between the two strings, normalized by the average of the number of tokens in the two strings. Cosine similarity is slightly different in that a vector length normalization factor is used where the weight of each token depends on the weights of the other tokens in the same string. Accordingly, in the above formulation (Salton and Buckley, 1987), the similarity may be considered to be a vector product of the two weight vectors, with the individual weights being $s_i / \sqrt{\sum_{i=1}^{n} s_i^2}$ and $t_i / \sqrt{\sum_{i=1}^{n} t_i^2}$ (instead of s_i and t_i, respectively).

However, these methods do not distinguish between different terms (tokens) in the strings being compared, both in terms of their importance to the strings containing those tokens or their ability to discriminate such strings from other strings not containing those tokens. The TF/IDF (Term Frequency/Inverse Document Frequency) (Salton and Buckley, 1987) approach uses a cosine distance-based similarity measure where each token in a string is assigned a weight representing the importance of that term to that particular string as well as relative to all other strings to which it is compared. While this approach is commonly used for document retrieval, it can also be used to measure similarity between different strings in a given set of strings. In this case, the weight assigned to a token consists of three components: (i) a term frequency component measuring the number of
times the token occurs in the string, (ii) an inverse document frequency component that is inversely proportional to the number of strings in which that token occurs and (iii) a normalization component, typically based on the length of the string vector. While the term frequency component measures the importance of a term to the string in which it is contained, the inverse document frequency component measures its ability to discriminate between multiple strings, and the normalization component ensures the longer strings are not unfairly preferred over smaller strings (since longer strings, with more tokens, would otherwise have higher likelihood of having more tokens in common with the string being compared with, as opposed to smaller strings). Thus, typically, we define

Term Frequency (tf) = number of times the token occurs in the string

Inverse Document Frequency (idf) = log(N/n)

where n is the number of strings in which the token occurs in the entire collection of N strings under consideration. Then, for a string S with a weight vector of the form S = {s_1, ..., s_n}, the weight of the ith token is specified as

$$s_i = tf_{s_i} \cdot idf_{s_i} = tf_{s_i} \log(N/n_i) \tag{4}$$

Then, the TF-IDF similarity (Salton and Buckley, 1987) between two strings, say S and T, represented as weight vectors S = {s_1, ..., s_n} and T = {t_1, ..., t_n}, respectively, is given by

$$\mathrm{TF/IDF}(S,T) = \frac{\sum_{i=1}^{n} s_i t_i}{\sqrt{\sum_{i=1}^{n} s_i^2}\,\sqrt{\sum_{i=1}^{n} t_i^2}} = \frac{\sum_{i=1}^{n} tf_{s_i}\, tf_{t_i} \left(\log(N/n_i)\right)^2}{\sqrt{\sum_{i=1}^{n} \left(tf_{s_i}\log(N/n_i)\right)^2}\,\sqrt{\sum_{i=1}^{n} \left(tf_{t_i}\log(N/n_i)\right)^2}} \tag{5}$$
As can be seen from Eq. (3), this is equivalent to the cosine similarity between the two strings (with the token weights defined as in Eq. (4)). Several different variations of this have been studied as well (Salton and Buckley, 1987).
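To make the preceding measures concrete, the following minimal Python sketch implements the Levenshtein tabulation described above and a TF-IDF-weighted cosine similarity in the spirit of Eq. (5). It is an illustration under our own assumptions rather than the implementation of any system cited in this chapter; the helper names and the small supplier-name examples are hypothetical.

```python
import math
import re
from collections import Counter

def levenshtein(s, t):
    """Dynamic-programming Levenshtein distance, following the recurrence above."""
    n, m = len(s), len(t)
    C = [[0] * (m + 1) for _ in range(n + 1)]   # C[i][j] holds LD(S[1..i], T[1..j])
    for i in range(1, n + 1):
        C[i][0] = i
    for j in range(1, m + 1):
        C[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c_sub = 0 if s[i - 1] == t[j - 1] else 1
            C[i][j] = min(C[i - 1][j - 1] + c_sub,   # substitution (or match)
                          C[i - 1][j] + 1,           # deletion
                          C[i][j - 1] + 1)           # insertion
    return C[n][m]

def tokenize(text):
    """Lower-case word tokens, using white space and punctuation as delimiters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_cosine(s, t, corpus):
    """TF-IDF cosine similarity of strings s and t, with idf taken from a corpus of strings."""
    N = len(corpus)
    docs = [set(tokenize(d)) for d in corpus]
    def weights(text):
        tf = Counter(tokenize(text))
        return {tok: freq * math.log(N / (sum(tok in d for d in docs) or 1))
                for tok, freq in tf.items()}
    ws, wt = weights(s), weights(t)
    dot = sum(ws[tok] * wt.get(tok, 0.0) for tok in ws)
    norm = math.sqrt(sum(v * v for v in ws.values())) * math.sqrt(sum(v * v for v in wt.values()))
    return dot / norm if norm else 0.0

corpus = ["IBM Corp", "International Business Machines", "IBM Corporation",
          "Hewlett Packard", "HP Corp"]
print(levenshtein("IBM Corp.", "IBMCorp"))                 # a small edit distance
print(tfidf_cosine("IBM Corp", "IBM Corporation", corpus)) # nonzero, driven by the shared token "IBM"
```

The idf term here follows Eq. (4); production systems would typically add the smoothing and normalization variants discussed by Salton and Buckley (1987).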
2.3 Clustering methods

Another class of methods that is especially useful for the cleansing activities comprises clustering techniques. The aim of clustering is to partition a given dataset into a set of groups such that the data items within a group are "similar" in some way to each other, but "dissimilar" from data items belonging to the other groups. This implies that a good clustering of a dataset corresponds to high intra-cluster similarity and low inter-cluster similarity, and as such depends on how such similarity is measured, as well as implemented, by a clustering method. Clustering has been used over the years in various domains and applications such as pattern recognition, image processing, marketing, information retrieval, etc. (Anderberg, 1973; Jain et al., 1999; Jain and Dubes, 1988; Salton, 1991) and a number of different algorithms have been devised to do such analysis. Here, we describe some of the most commonly used methods and discuss their relative advantages and disadvantages; the interested reader is referred to Jain et al. (1999) for a more general review and discussion of different clustering techniques.

Arguably the simplest and the most widely used clustering technique is the k-means algorithm (McQueen, 1967). The aim of the k-means clustering algorithm is to partition the dataset into a set of k clusters, the value of k being assigned a priori. The k-means algorithm starts with an initial partition of the data into k clusters (this could be done randomly, for example), and uses a heuristic to search through the space of all possible clusters by cycling through steps (i) and (ii) as follows: (i) for each cluster, the centroid (mean point) is calculated using the data points currently assigned to the cluster and (ii) each data point is then re-assigned to the cluster whose centroid is the least distance from it. This process is continued until some convergence criterion is satisfied (e.g. there is no movement of any data point to a new cluster). Since this is essentially a greedy approach, it generally terminates in a local optimum. The method is very popular due to the fact that it is easy to implement and is fairly computationally efficient. However, it assumes that the number of clusters is known beforehand and, since the method converges to a locally optimal solution, the quality of the clusters found is very sensitive to the initial partition (Selim and Ismail, 1984).

While the k-means algorithm belongs to a wider class of clustering algorithms called partitioning algorithms (since they construct various partitions of the dataset and evaluate them using some criterion), another popular family of clustering techniques is that of hierarchical clustering algorithms, which work by creating a hierarchical decomposition (tree) of the dataset using some criterion (normally distance based). There are two types of hierarchical clustering methods: agglomerative and divisive. While agglomerative methods start with each data item being placed in its own cluster and then successively merge the clusters until a termination
condition is reached, divisive methods work along the opposite direction, starting with a single cluster consisting of the entire dataset and successively splitting the clusters until a stopping criterion is satisfied. The majority of hierarchical algorithms are agglomerative, differing primarily in the distance measure used and the method of measuring similarity (distance) between clusters; divisive methods are rarely used and we do not discuss them further. For measuring the distance between individual data items, any of an extensive array of distance measures can be used, including those that are based on similarity measures as described previously, as well as various other distance metrics that have been used for clustering and similar tasks in the literature, such as Euclidean distance, the Minkowski metric, Manhattan (or L1) distance, etc. (Cohen et al., 2003; Jain et al., 1999). For measuring the distance between clusters, the two most commonly used methods are the single-link and complete-link approaches, though other methods have also been used (Jain et al., 1999). In the single-link case, the distance between two clusters is defined as the shortest distance (or maximum similarity) between any member of one cluster and any member of the other cluster. In the complete-link (maximum linkage) case, however, the distance between two clusters is defined as the maximum distance between any member of one cluster and any member of the other cluster. Of these, the complete-link method generally leads to a higher degree of intra-cluster homogeneity for a given number of clusters. Once a choice of the distance measure as well as the method of determining inter-cluster distance is made, agglomerative clustering proceeds as described above: starting with singleton clusters, the pair of clusters (including clusters that have been created by previous merge steps) that have the least distance between them are successively merged until a stopping criterion, such as a maximum cluster size threshold or a maximum intra-cluster distance threshold, is reached. Although hierarchical clustering techniques have the advantage (over k-means) that the number of clusters does not have to be specified a priori, these methods are not as computationally efficient and do not allow any merge (or split) decision taken earlier on to be reversed later.
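As an illustration of how such an agglomerative procedure might be applied to spend data, the hedged sketch below greedily merges groups of supplier-name strings under complete-link distance, reusing the levenshtein function from the earlier sketch. The length normalization, the 0.4 threshold and the example strings are our own assumptions, not a prescribed implementation.

```python
def normalized_distance(a, b):
    """Levenshtein distance scaled to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b)) or 1
    return levenshtein(a.lower(), b.lower()) / longest

def complete_link(cluster_a, cluster_b):
    """Complete-link (maximum) distance between two clusters of strings."""
    return max(normalized_distance(a, b) for a in cluster_a for b in cluster_b)

def agglomerative_cluster(strings, max_distance=0.4):
    """Repeatedly merge the closest pair of clusters until no pair is within max_distance."""
    clusters = [[s] for s in strings]              # start from singleton clusters
    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), complete_link(clusters[i], clusters[j]))
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: pair[1])
        if best > max_distance:                    # stopping criterion
            break
        clusters[i].extend(clusters.pop(j))        # merge the closest pair
    return clusters

# Illustrative only: with this threshold the IBM variants and the HP variants
# end up in two separate clusters.
print(agglomerative_cluster(["IBM Corp", "IBMCorp", "I.B.M. Corp", "HP Corp", "H.P. Corp."]))
```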
Yet another class of popular clustering algorithms, called model-based clustering methods, assumes certain models for the clusters and attempts to optimize the fit between these models and the data. The most common of this class of methods assumes that each of the clusters can be modeled by a Gaussian distribution (Banfield and Raftery, 1993), and thus the entire data can be modeled by a mixture of Gaussian distributions. The task of identifying the clusters then boils down to the estimation of the parameters of the individual Gaussians. The EM algorithm (Dempster et al., 1997) is commonly used for this parametric estimation. AutoClass (Cheeseman and Stutz, 1996) is another approach that takes a mixture-of-distributions approach in addition to using Bayesian statistics to estimate the most probable number of clusters, given the data. While model-based clustering allows the use of established statistical techniques, it differs from the approaches described earlier in that, unlike the k-means and hierarchical clustering approaches that are purely data driven, it requires prior assumptions regarding the component distributions. Additionally, as in the case of k-means, the number of clusters also has to be specified a priori (except for AutoClass, which estimates it).

Irrespective of the type of clustering method used, computational efficiency and scalability become very important issues when clustering is applied to problems that are characterized by large datasets. This can occur due to a large number of records in the dataset, high dimensionality of the feature space, or a large number of underlying clusters into which the data needs to be split up. In such situations, directly applying any of the previously discussed clustering approaches can become highly computationally intensive, and practically infeasible, especially when the dataset being clustered is large for all these reasons at the same time. Recently, however, new techniques have been developed for performing clustering efficiently on precisely these kinds of high-dimensional datasets. The main idea behind such techniques is to significantly reduce the number of times exact similarity (or distance) measures have to be computed during the clustering process, thereby reducing the computational complexity of the process. One such method is the two-stage clustering technique developed by McCallum et al. (2000). In this method, the first stage is a quick and dirty stage in which cheap and approximate distance measures are used to divide the dataset into a set of overlapping subsets called canopies. This is followed by a more rigorous stage where expensive, exact distance calculations are made only between data items that occur within the same canopy. By ensuring that the canopies are constructed such that only data items that exist in a common canopy can exist in the same cluster (i.e. clusters cannot span canopies), substantial computational savings are attained by eliminating the exact distance computations between any pair of points that does not belong to the same canopy. Moreover, this allows any of the standard clustering techniques described previously to be used during the second stage; essentially, that clustering approach is used repeatedly to cluster smaller datasets corresponding to the canopies, as opposed to performing clustering on the entire dataset as required when using a traditional clustering approach directly.
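A minimal sketch of this two-stage idea is given below, again under our own assumptions rather than as a faithful rendering of McCallum et al. (2000): a cheap measure (shared word tokens, via the tokenize helper above) forms the canopies, and the expensive agglomerative routine from the previous sketch is run only within each canopy. For brevity the sketch partitions the data, whereas the original method uses two thresholds and allows canopies to overlap.

```python
def make_canopies(strings, min_shared_tokens=1):
    """Stage 1: cheaply group strings into canopies by counting shared word tokens."""
    remaining = list(strings)
    canopies = []
    while remaining:
        center = remaining.pop()                       # pick an arbitrary canopy center
        center_tokens = set(tokenize(center))
        canopy = [center]
        for s in remaining[:]:
            if len(center_tokens & set(tokenize(s))) >= min_shared_tokens:
                canopy.append(s)
                remaining.remove(s)
        canopies.append(canopy)
    return canopies

def two_stage_cluster(strings):
    """Stage 2: run the expensive (exact-distance) clustering only within each canopy."""
    clusters = []
    for canopy in make_canopies(strings):
        clusters.extend(agglomerative_cluster(canopy))
    return clusters
```

Because exact distances are computed only inside canopies, the quadratic comparison cost is paid per canopy rather than over the entire supplier file, which is what makes this kind of approach attractive for the large record sets typical of spend data.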
2.4 Classification methods

A third category of methods, as described previously, that can sometimes prove useful for the data cleansing tasks is classification techniques (also commonly referred to as supervised learning methods). Since data cleansing often involves mapping and manipulation of textual data, fields such as information retrieval and natural language processing offer a plethora of machine learning techniques that have been found effective in such domains (e.g. maximum entropy (Nigam et al., 1999), support vector machines (Joachims, 1998) and Bayesian methods (McCallum and Nigam, 1998)).

However, classification methods need "labeled" data in order to build/train classifiers which could then be used for the mapping tasks needed for spend aggregation, such as supplier name normalization and commodity mapping. Such labeled data is, however, not always available. This is in stark contrast to the methods described previously (string similarity methods as well as clustering techniques), which have no such requirement and, hence, are generally used instead of the classification techniques. As such, we do not discuss these approaches in detail but refer the interested reader to the above-mentioned references. Nevertheless, we do highlight in subsequent sections where classification approaches could be applied, especially in the context of data cleansing for spend aggregation in a procurement-BTO setting, since cleansed data for a given enterprise could potentially provide the labeled data needed for cleansing the data for other enterprises, especially those in the same industrial sector.

Irrespective of the actual approach adopted, two steps are involved in using classification methods for data cleansing: (i) learning classification models for predicting one or more attributes of interest ("target" attributes) based on the values of other attributes and (ii) applying these models to the unmapped data to determine the appropriate values of the target attributes. Winkler (2006) provides an extensive list of citations to work in the data cleansing literature based on the use of supervised learning techniques for data cleansing, and the interested reader is referred to the same. One particular area, though, in which supervised learning techniques may be relevant in the context of data cleansing for spend aggregation is the automatic parsing and element extraction from free-text address data (Borkar et al., 2000; Califf, 1998). For this specific task, it may be easier to get labeled data by some combination of standard address databases, manual tagging and labeling, as well as the incremental data cleansing activities that would be performed during procurement BTO as subsequent enterprise repositories encounter free-text addresses that have already been cleansed for previous enterprises.
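As an illustration of the two steps just described, the hedged sketch below trains a very simple multinomial Naive Bayes classifier on commodity descriptions that have already been mapped (the labeled data) and applies it to an unmapped transaction description. The tiny label set and training examples are hypothetical and are not drawn from any real taxonomy or enterprise.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Same word tokenizer as in the earlier sketches."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayesMapper:
    """Step (i): learn a model from labeled (description, category) pairs.
    Step (ii): apply it to unmapped descriptions."""

    def fit(self, descriptions, categories):
        self.category_counts = Counter(categories)
        self.token_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, cat in zip(descriptions, categories):
            for tok in tokenize(text):
                self.token_counts[cat][tok] += 1
                self.vocabulary.add(tok)
        return self

    def predict(self, description):
        total = sum(self.category_counts.values())
        best_cat, best_score = None, float("-inf")
        for cat, count in self.category_counts.items():
            score = math.log(count / total)                  # log prior
            denom = sum(self.token_counts[cat].values()) + len(self.vocabulary)
            for tok in tokenize(description):                # add-one smoothed likelihoods
                score += math.log((self.token_counts[cat][tok] + 1) / denom)
            if score > best_score:
                best_cat, best_score = cat, score
        return best_cat

# Hypothetical labeled data, e.g. from an enterprise whose spend has already been cleansed.
mapper = NaiveBayesMapper().fit(
    ["high-end server maintenance", "enterprise software licenses", "hazardous waste removal"],
    ["H/W Support", "Software", "Hazardous waste management"])
print(mapper.predict("S/W maintenance & licenses"))          # maps to "Software" in this toy setup
```

In a BTO setting, descriptions already mapped for one client could serve as training data for another client in the same industry, which is precisely the situation highlighted above.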
3 Automating data cleansing for spend aggregation
We now turn our attention to the specific task of automating the cleansing and rationalization of spend across an enterprise so that it can be aggregated and analyzed in a meaningful way. In the case of BTO, this data cleansing has to span multiple enterprises, thus leading to a significantly higher level of complexity. Spend aggregation has traditionally been done manually, generally employing spreadsheets and rudimentary data analysis
techniques for mapping and normalizing the data prior to aggregation. However, this is an extremely costly and time-intensive process, especially for larger enterprises where the volume and complexity of spend data make it all the more difficult. Owing to the slow, error-prone and expensive nature of manual cleansing, coupled with the increased focus on this area and the aforementioned rapid developments in various fields such as databases, data mining and information retrieval, there has been a steady shift towards the development of methods and tools that automate at least some aspects of this cleansing activity. While some enterprises have turned to in-house development of automated solutions for cleansing and aggregating their spend data, others use solutions provided by independent software vendors (ISVs) such as Emptoris, VerticalNet and Zycus to address their spend aggregation needs. Some of these are pure consulting solutions in which the ISV takes the spend data from the enterprise, cleanses, aggregates (automatically and/or manually) and analyzes it, and returns aggregate spend reports back to the enterprise for further action. On the other end are automated spend analysis solutions that are deployed and configured to work directly with the enterprise's spend data repositories and systems to cleanse, aggregate and analyze the data on a continual basis. However, most of these solutions are primarily for aggregating intra-company spend (traditionally referred to as spend analysis); there are few solutions that deal explicitly with inter-company spend aggregation, which presents many challenges not encountered while aggregating intra-company spend (Singh and Kalagnanam, 2006). In Section 3.1, we discuss in detail the various cleansing tasks that must be carried out in order to convert spend data to a form where effective aggregation is possible, and the issues that must be addressed in order to enable this cleansing to be done in an automated manner. We specifically highlight the similarities of some of these cleansing tasks with the classic duplicate detection problem and also point out the key points where the two differ. Section 3.2 then provides a rough roadmap towards the development of a simple automated spend-aggregation solution using some of the techniques discussed in Section 2. While many of the techniques and methods discussed in Section 2 can be used to create such an automated solution, we focus on only some of the most commonly used such techniques, such as string comparisons and clustering, and address how the various issues that arise during spend-data cleansing activities can be effectively addressed using these methods.
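Before turning to the individual tasks, the following sketch previews how string comparison and grouping might be combined for one such task, supplier name normalization (described in Section 3.1 below), by blending name similarity with address similarity. All field names, weights and thresholds are our own illustrative assumptions, not those of any deployed spend-analysis product.

```python
import re

def tokenize(text):
    """Word tokenizer, as in the earlier sketches."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_jaccard(a, b):
    """Jaccard similarity (Eq. (1) with 0/1 weights) over word tokens."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

def record_similarity(rec_a, rec_b, name_weight=0.5, address_weight=0.5):
    """Blend supplier-name and address similarity; names alone can be ambiguous."""
    return (name_weight * token_jaccard(rec_a["name"], rec_b["name"])
            + address_weight * token_jaccard(rec_a["address"], rec_b["address"]))

def normalize_suppliers(records, threshold=0.5):
    """Greedily assign each record to the first group whose representative is similar enough;
    the longest name in each group is then used as the canonical supplier name."""
    groups = []
    for rec in records:
        for group in groups:
            if record_similarity(rec, group[0]) >= threshold:
                group.append(rec)
                break
        else:
            groups.append([rec])
    return [(max((r["name"] for r in group), key=len), group) for group in groups]

# Hypothetical records; real spend data would come from invoices, POs and payment systems.
records = [
    {"name": "IBM Corp",        "address": "Armonk, NY"},
    {"name": "IBM Corporation", "address": "Armonk, NY"},
    {"name": "Hewlett Packard", "address": "3000 Hanover St, Palo Alto, CA"},
    {"name": "HP",              "address": "3000 Hanover St, Palo Alto, CA"},
]
for canonical, group in normalize_suppliers(records):
    print(canonical, [r["name"] for r in group])
```

In practice such greedy grouping would be preceded by canopy-style blocking (Section 2.3) to avoid comparing every pair of records, and the canonical name would more likely come from a reference supplier master (e.g. Dun & Bradstreet data) than from string length.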
3.1 Data cleansing tasks for spend aggregation

Regardless of the techniques adopted, three main tasks generally need to be performed for cleansing spend data to facilitate effective spend aggregation and analysis, and the development of automated solutions to
perform these tasks brings forth several technical issues that need to be addressed satisfactorily. One of the cleansing tasks that needs to be performed is the normalization of supplier names to enable the development of a consistent supplier base across all the data repositories and systems. This has to be done both for analyzing intra-company spend and for aggregating spend across multiple enterprises for procurement BTO. The normalization of supplier names involves the mapping of multiple names for the same supplier to a single, common, standard name for that supplier. Multiple names may arise due to various reasons, including errors (e.g. IBM Corp. and IBMCorp.), different locations (e.g. IBM Canada, IBM TJ Watson Research Center, IBM India, IBM Ireland Ltd., etc.), different businesses undertaken by the same supplier (e.g. IBM Software Group, IBM Global Services, IBM Daksh Business Process Services, etc.), parent–child relationships due to acquisitions (e.g. IBM, Tivoli Systems, Lotus Software, Ascential Corporation, etc.) as well as different terminologies and naming conventions employed by an enterprise for its suppliers in different geographic or functional locations (e.g. IBM, I.B.M, I B M, IBM Corporation, IBM Corp., Inter Bus Machines, International Business Machines, etc.). Clearly, the number of possible ways a supplier may be represented within the spend data may be fairly large, and unless they are all normalized to a single, unique name instance, the procurement-spend corresponding to that supplier will be significantly underestimated by any aggregation exercise. Moreover, even if the name of a supplier in multiple transactions or systems is the same, other attributes for that supplier, such as address and supplier id, may have differences, again due to the reasons described above (e.g. IBM, 1101 Kitchawan Rd, Yorktown, NY; IBM, Rt 134, Yorktown Heights, NY; IBM, 365 Maiden Lane, New York, NY, etc.). As such, to properly normalize supplier names, it is imperative to compare not only the suppliers' names but also other information such as address and contact information that may be available. This is especially true for enterprises that do world-wide procurement since different suppliers in different countries may in fact have the same, or fairly similar, names. This is more likely in the case of suppliers that have fairly common words in their names. The complexity of supplier normalization increases rapidly in the case of procurement outsourcing: as the number of BTO clients increases, so does the size of the supplier base that needs to be normalized, as well as the noise and variability in the data. Figure 2 shows highly simplified views of the procurement spend for three enterprises (ENT 1, ENT 2 and ENT 3). The view for each enterprise shows the procurement-spend aggregated over suppliers and commodities before any data-cleaning activity has been undertaken. In each case, there are multiple name variations of the same supplier. For ENT 1, it would seem that the total procurement amount from IBM is of the order of 1 million. Moreover, no supplier would seem to account for more than 2.5 million.
Fig. 2. Example demonstrating the need for data cleansing for spend aggregation. (The figure shows simplified spend tables for ENT 1, ENT 2 and ENT 3, listing supplier name variants such as IBM, IBMCorp, International Business Machines, HP and Hewlett-Packard, together with addresses, commodity descriptions and amounts, alongside the consolidated view obtained after supplier normalization and commodity mapping, in which IBM accounts for about 16 million and HP for about 10 million across H/W Support, Software and Consulting Services.)
However, the true picture is quite different and is obvious after normalizing the supplier names, which shows that IBM actually accounts for about 6 million in all. A similar picture can be observed for the other enterprises as well. Without supplier name normalization, one would assume that there would be 11 different suppliers, none of them accounting for more than 4 million in spend across the three enterprises. However, as the view on the right shows, there are only two suppliers, with IBM accounting for about 16 million and HP accounting for 10 million. Clearly, data cleansing allows a much clearer, cross-enterprise picture of procurement-spend than is available by simple aggregation without any cleansing.

As illustrated in Fig. 2, another cleansing activity that needs to be carried out is to make the spend categorization of the entire procurement-spend consistent, both within as well as across enterprises, to enable spend aggregation across commodities. Most enterprises label their spend transactions with an associated commodity and/or spend category. Moreover, the commodities and various spend categories are arranged in a taxonomy, with higher levels corresponding to general categories of spend and lower categories representing more specific ones (and the lowest level corresponding to commodities). However, there are often multiple taxonomies in use across an enterprise, resulting in multiple ways of representing the same commodity. One reason may be that the enterprise may have no specific, enterprise-wide taxonomy that is used to categorize the procurement spend; instead different geographic and functional organizations develop and use their own taxonomies for categorizing the procurement spend. Another reason could be that there is no formal taxonomy in place, either at the enterprise level or at a functional/geographic organization level, and spend is categorized in an ad hoc manner by multiple people spread across the enterprise and using various procurement functions and applications (such as requisitions, supplier catalogs and payments). Clearly, this leads to multiple taxonomies within the enterprise with substantial inconsistencies, disparity, noise and errors in the way the same commodity is referred to across the enterprise. This is especially true in the latter case where multiple descriptions may be used for the same commodities based on different naming conventions and styles, errors and terminologies. Before a meaningful and accurate spend aggregation can be done across the enterprise, all the taxonomies in use across the enterprise must be mapped to/consolidated into a single taxonomy that uses a normalized, consistent commodity base (e.g. hazardous waste handling, pollution control expense, hazardous waste management, HAZMAT removal, etc. need to be mapped to the same commodity, say hazardous waste management). This taxonomy may be one of those currently in use, built by consolidating several taxonomies in use, or may be a standard classification code, such as the United Nations Standard Products and Services Code, or UNSPSC (Granada Research, 2001; UNSPSC). Nowhere is the importance and the complexity of this
mapping more apparent than in the case of procurement BTO, where the spend categories of all the involved enterprises (BTO clients and BTO service provider) need to be mapped to a uniform taxonomy in order to accurately determine the total procurement volume of any given commodity. Thus, not only do multiple taxonomies within an enterprise need to be reconciled but also taxonomies across several enterprises have to be mapped to a single, consistent taxonomy across all enterprises to enable a uniform, cross-enterprise view of commodity spend to be developed. In such cases, a standard taxonomy such as the UNSPSC is best suited since it spans multiple industry verticals, thus enabling a BTO service provider to host procurement for all kinds of clients. As in the case of supplier name normalization, Fig. 2 illustrates the impact that commodity taxonomy mapping has on spend aggregation, especially across multiple enterprises. As the example shows, the same commodities are referred to in several different forms across the different enterprises, and it is only after mapping them to a common taxonomy (shown on the right side of the figure) that the true picture emerges, i.e. three commodities account for the entire spend, with h/w support and software accounting for most of it. Thus, by performing supplier normalization and commodity mapping across the various enterprises and then aggregating the spend, the BTO service provider can see that there are significant volumes associated with the commodities being procured, which in turn enables it to negotiate better deals with the suppliers involved; without the cleansing, the view available to the BTO service provider would be quite distorted and would not be amenable to such analysis. Finally, besides supplier name normalization and commodity mapping, individual spend transactions may also need to be consolidated and mapped to a normalized commodity taxonomy. This generally happens when an enterprise either does not have a formal taxonomy for spend categorization, or does not require its usage for categorizing individual spend transactions. In such cases, the individual spend transactions have to be mapped to a common commodity taxonomy (either the enterprise taxonomy, if it exists, or a standard taxonomy such as the UNSPSC), based on unstructured textual descriptions in the transactions (such as from invoices or purchase orders). Another case where such transactional mapping is needed is when the enterprise spend taxonomy is not specific enough, i.e. spend is categorized at a much more general level than the commodity level needed for the aggregation. In such cases, the transaction-level descriptions may provide more information about the actual commodities purchased to allow such mapping to be done. As Singh et al. discuss, there are several factors that make taxonomy, as well as transactional, mapping far more complex and difficult in the case of spend aggregation for procurement BTO, as opposed to aggregation of spend within a single enterprise (Singh et al., 2005; Singh and Kalagnanam, 2006). These arise primarily due to the need to map the procurement-spend
of each of the participating enterprises (via mapping of taxonomies and/or transactions) to a common, multi-industry, standard taxonomy (primarily the UNSPSC), since the participating enterprises may be from different industries with little or no overlap between the commodities they deal with. One issue is that, though cross-industry standards like the UNSPSC are fairly broad and cover all industry sectors, there are often cases where the taxonomy is not very specific within an industry (i.e. commodities are quite general). However, enterprise-specific taxonomies, though generally smaller (in terms of the number of commodities covered), may have a narrower but more specific coverage of commodities. Many times, the inverse also holds true, where the UNSPSC is more specific but the enterprise taxonomy's commodities are more general. In the former case, multiple enterprise commodities will end up getting mapped to a single UNSPSC commodity. This entails a loss of information during spend aggregation unless the UNSPSC is extended to include more detailed commodities. In the latter case, an enterprise commodity will correspond to multiple UNSPSC commodities, which requires the enterprise commodity either to be mapped to a more general UNSPSC class (group of commodities), or the use of transactional descriptions in mapping individual transactions (rather than commodities) to appropriate UNSPSC commodities. A second issue is the fact that while the UNSPSC taxonomy is a true hierarchical taxonomy, in which an "is-a" relationship exists across different levels, enterprise taxonomies are seldom organized as such and, more often than not, reflect functional/operational organizations or spend categorizations (such as business travel expenses, direct procurement related, etc.). This implies that multiple, related commodities in an enterprise taxonomy (i.e. they have the same parent in the taxonomy) may map to very different areas of the UNSPSC taxonomy, and vice versa. As such, it is not possible to take advantage of the taxonomy structure during the mapping process, since mapping a higher-level item in a taxonomy to a similar high-level item in the UNSPSC taxonomy does not imply that all children of that item in the enterprise taxonomy will also map to children of the corresponding UNSPSC commodity; rather, they could be anywhere in the taxonomy. Consequently, mapping generally has to be done at the commodity level, one commodity at a time.
3.2 Automating data cleansing tasks for spend aggregation
As mentioned previously, automating the various data cleansing activities often requires the use of multiple methods and techniques, based on the specific cleansing task as well as the quality of the available data. The best approach depends upon several factors: the cleansing task at hand, the availability of labeled data, the clustering method best suited for the
available data in the absence of labeled data, the similarity measures to be used in computing similarity between different terms, the data attributes available, etc. Also, the data cleansing literature (as discussed in Section 2.1) offers a variety of techniques that have been successfully applied to such activities. At first glance, the supplier normalization task seems to be identical to the extensively studied duplicate detection problem. Clearly, there are significant similarities between the two, especially when enterprises have multiple representations in the data due to errors, different terminologies or nomenclatures, etc. As such, many of the techniques that have been developed in the past for tackling the duplicate detection problem can be adapted for use in the supplier normalization task. However, there is one aspect specific to the supplier normalization task that makes it different from, and often more difficult than, the duplicate detection problem. This arises from the fact that duplicate detection (in the context of supplier normalization) can be considered to be the task of checking whether the name/address data for two enterprises is syntactically different but semantically equivalent, i.e. both entities actually represent the same enterprise that has been represented differently due to errors, etc., and at least one of the entries is erroneous and must be fixed. However, in enterprise spend data, it is often the case that supplier information is both syntactically and semantically different, but in fact refers to the same enterprise. Moreover, each representation is in fact correct and must be maintained that way; at the same time, for spend aggregation, they must be considered equivalent and spend must be aggregated over them and associated with a single enterprise. This can arise for different reasons. One reason is acquisitions and divestitures. In such cases, different parts of the same enterprise may have completely different names/addresses (e.g. Lotus Software and IBM Corp.). Another reason is that many enterprises operate on a global scale, often with several business units (e.g. IBM Software Group, IBM Systems and Technology Group) and/or subsidiaries (e.g. IBM India Pvt Ltd) across multiple geographies, each conducting business directly with its customers, suppliers, etc. In either case, there may be invoice and/or account-payable transactions that have substantially different names and addresses that are entirely correct and must be maintained as such (since, e.g. payments may be made to the individual business units directly). However, for spend aggregation, the individual entities do need to be mapped to a common (parent) enterprise. In the former case, the only solution is to purchase such information from companies such as Dun & Bradstreet, or to cull news sources for such information and build up a repository of acquisitions/divestitures for use during the supplier normalization task. In the latter case, the differences between names and addresses are often far more significant than those introduced by errors or naming conventions, etc. Hence, as we discuss subsequently in Section 3.2.1, standard similarity methods as used in
duplicate detection tasks cannot be directly applied and more elaborate schemes must be devised. Commodity taxonomy mapping and commodity transactional mapping are much more general cleansing problems than the classic duplicate matching problem, since they involve unstructured, noisy data and the direct application of similarity methods is not enough, as we discuss in Section 3.2.2. In the following paragraphs, we discuss the pros and cons of using various techniques for the three data cleansing tasks described previously, and describe how they can be used for successfully automating these tasks.
3.2.1 Supplier name normalization
One common limiting factor in spend aggregation is the absence of any "mapped" data. For example, there is generally no data that explicitly labels different supplier names as being variations of the same physical enterprise's name. Similarly, there is often no transactional spend data that is labeled with the appropriate normalized commodity codes. In the absence of such labeled data, the first step in a supplier name normalization exercise for an enterprise is to perform a clustering exercise on the enterprise's supplier data in order to partition it into a set of groups, each group representing a unique supplier. An alternative approach is to compare the data for each supplier in the dataset with the data for (potentially) every other supplier in the data, possibly several times. For example, starting with an empty list of normalized suppliers, this approach would entail comparing each supplier in the dataset with every supplier in the list of normalized suppliers, and either mapping it to one already on the list or adding it to the list as a new normalized supplier. This process would be continued until every supplier is mapped to some supplier on the normalized list. Compared to this approach, a clustering-based approach is usually significantly more computationally efficient, especially in cases where the size of the supplier base is fairly large. Once the clustering is done, each cluster has to be assigned a normalized name, which can be done using any of a variety of heuristics, such as the most common supplier name in the cluster, the shortest name in the cluster, or the shortest common prefix among the names in the cluster. Since multiple enterprises are involved in a procurement BTO setting, the task of normalizing the supplier base across all the involved enterprises can then be conducted in two steps: first, the supplier base of each participating enterprise is normalized, and then all the normalized supplier bases are merged together to yield a single, consistent, cross-enterprise, normalized supplier base. To perform supplier name normalization using the cluster-based approach, two decisions need to be made. First, an appropriate clustering method has to be selected. Second, an appropriate similarity measure has to be selected for comparing data items and/or clusters.
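The cluster-then-label idea can be illustrated with a minimal sketch. The crude blocking key that stands in for the clustering step, and the field names, are assumptions made purely for illustration; a real system would use the canopy-based clustering and similarity rules discussed below.

```python
# Illustrative sketch only: the "clustering" here is a placeholder
# (grouping on a crude key), not the canopy/hierarchical scheme
# described in this section; the field names are assumptions.
from collections import Counter, defaultdict

def crude_cluster_key(record):
    # Placeholder blocking key: first alphanumeric token of the name,
    # lower-cased. A real system would apply the similarity rules and
    # canopy-based clustering discussed in Section 3.2.1.
    name = record["name"].lower()
    tokens = [t for t in name.replace(".", " ").split() if t.isalnum()]
    return tokens[0] if tokens else name

def normalize_suppliers(records):
    clusters = defaultdict(list)
    for rec in records:
        clusters[crude_cluster_key(rec)].append(rec)
    normalized = {}
    for key, members in clusters.items():
        # Heuristic from the text: use the most common name in the cluster
        # as the normalized name (shortest name or shortest common prefix
        # are alternatives).
        names = Counter(r["name"] for r in members)
        normalized[key] = names.most_common(1)[0][0]
    return clusters, normalized

suppliers = [{"name": "IBM Corp"}, {"name": "IBM Corporation"},
             {"name": "IBM Corp"}, {"name": "HP Inc"}]
clusters, canonical = normalize_suppliers(suppliers)
print(canonical)   # {'ibm': 'IBM Corp', 'hp': 'HP Inc'}
```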
There are certain characteristics of this problem that affect the choice of the methods used. One, an enterprise can have tens of thousands of suppliers. Two, in order to do effective similarity matching between supplier names, it is necessary to tokenize the names, which greatly increases the dimensionality of the feature space. Three, the number of normalized supplier names (corresponding to the clusters) is also usually quite large, corresponding to a significant part of the non-normalized supplier base. Owing to this, methods such as k-means are generally not suited for clustering the data, since it is difficult to estimate beforehand the number of clusters into which the supplier base will eventually be partitioned. Hierarchical methods are better suited since they require no a priori assumptions, either about the number or the distribution of the clusters. More importantly, as discussed in Section 3.2, for these very reasons the dataset is often quite large (in the number of data items, feature dimensionality as well as the number of clusters), which makes it computationally quite inefficient to do clustering using a straightforward application of any of the standard clustering techniques. As such, a more suitable approach is to use a clustering technique meant for large datasets, such as the two-stage canopy-based clustering technique (McCallum et al., 2000) discussed previously, in conjunction with a hierarchical (agglomerative) clustering method. As in the case of the clustering technique, several issues also need to be considered while deciding on the similarity approach to use to measure the distance between the data items (and the clusters). While, theoretically, string similarity metrics can be used for directly comparing supplier names by calculating the similarity (or distance) between them, several issues arise that make this practically infeasible. One, edit distance (e.g. Levenshtein distance) calculation is computationally expensive, and its usage on real data with tens of thousands of supplier names can make the normalization process prohibitively expensive. Two, similarity metrics are position invariant. That is, they only provide a quantitative measure of the difference between strings, but no indication of where the differences are. This is especially important in the case of supplier names, which often consist of multiple words. Consider the following: "IBM Corp", "IBMCorp" and "ABM Corp". The Levenshtein distance between "IBM Corp" and "IBMCorp" is 1, as is the distance between "IBM Corp" and "ABM Corp". However, while the former represents a variation (due to an error) in the name of the same enterprise, the latter corresponds to the names of different enterprises. Similarly, consider "Texas Instruments Incorporated", "Texas Instruments Inc", "Westbay Instruments Incorporated" and "Western Instruments Incorporated". Here, the Levenshtein distance between the first and second names is 9, but the distance between the first and third is only 5 (and between the first and fourth, 6), which implies that the first name is more "similar" to the obviously incorrect names than it is to the correct one.
Or consider "IBM", "IBM Corporation" and "CBS". Whereas the first and second are obviously variations of the same enterprise's name and the third name is obviously different, the edit distance between the first two is 12 while the distance between the first and third is only 2. Part of this problem, however, can be alleviated by tokenizing the names and performing similarity checks on tokens instead of the entire string. Moreover, a token-based similarity technique, such as the TF/IDF approach discussed previously, has the advantage of making similarity comparisons between strings while taking into account distinctions between the various terms (tokens) in those strings, both in terms of their importance to the strings containing them as well as their ability to discriminate these strings from other strings not containing these tokens. However, like the other similarity methods, the TF/IDF approach does not differentiate between the positions of tokens that are dissimilar; it simply considers each string as a "bag" of tokens and calculates the similarity between the strings based on those tokens, without distinguishing between differences at different positions in the compared strings. Yet, as the above examples show, it is often the case that differences towards the beginning of supplier names are more significant than differences towards the end of the names. Moreover, it does not take into account the order of the tokens, merely the similarity (or dissimilarity) between the tokens. Thus, names containing the same words but in different positions (e.g. "Advanced Technology Systems Inc" and "Advanced Systems Technology Inc") are considered identical. Other issues arise as well. First, an appropriate tokenization scheme has to be selected. Using a tokenization scheme that produces too many small tokens (such as character n-grams with a small "n") introduces too much noise, while schemes with too few tokens (such as word-grams or sequences of words) reduce detection of local differences and/or make the process more computationally intensive. According to Singh et al. (2005) and Singh and Kalagnanam (2006), a word-based tokenization scheme (that uses white space and punctuation for tokenization) generally provides a suitable compromise between detecting local differences and computational complexity when comparing supplier names. Second, in addition to supplier names, data such as address and contact information is often available, and helpful, for supplier-name normalization. As such, similarity checks also need to be done for various other data attributes such as street address, PO Box number, city, state, etc., and all these comparisons may yield conflicting or inconclusive information that needs to be resolved. To complicate matters further, address information may not be available as attribute-value pairs but simply as unstructured textual data. In that case, values for various attributes such as street name, city, zip code, etc. need to be extracted before
similarity checks can be performed. While several different techniques can be used for extracting this information, a common and efficient method involves the use of regular expressions (Hopcroft et al., 2006) to define patterns corresponding to various attributes and then searching the text for those patterns to find the corresponding attribute values. Regular expressions are textual expressions that are used to concisely represent sets of strings, without enumerating all the members of the set, according to certain syntax rules. For example, street addresses, such as "1101 Mount Kisco Avenue", often consist of three parts: a numeric part (street number), a sequence of words (street name) and a keyword (such as "avenue", "street" or "lane"). A corresponding regular expression to match this could then be defined (using appropriate syntax) as a string consisting of a numeric prefix, followed by white space, then one or more non-numeric, alphabetic words, then some more white space and finally a keyword. Using such a regular expression would allow the street address to be extracted from unstructured address data, and broken up into its constituents to allow similarity matching. Given the type and quality of data, regular expressions may need to be defined for several different attributes, and may also need to be refined several times. This process is best undertaken by taking a suitable sample of the data and using a trial-and-error method on that data to create and subsequently refine the needed regular expressions. Third, no similarity method can directly handle differences in supplier names due to different naming conventions, acronyms, punctuation, formatting, etc. These must be addressed before any clustering/similarity exercises can be successfully carried out. Owing to all the reasons cited above, similarity comparisons between suppliers using a direct application of standard string similarity techniques are likely to yield unsatisfactory results for supplier name normalization. Rather, these comparisons typically have to be carried out by applying a variety of different methods to various supplier data attributes. One way is via a repository of rules based on the application of various string similarity methods to tokenized supplier names, as proposed by Singh and Kalagnanam (2006). Such an approach provides a simple, straightforward way of addressing all the issues raised earlier. For example, by constructing and prioritizing rules of varying complexity, and including exact as well as fuzzy matches on whole or parts of supplier name and address attributes, it is possible to limit the use of computationally intensive tasks (such as extensive edit distance calculations) as well as satisfactorily address issues such as the position-invariant nature of string similarity methods. The former can be achieved by using simpler rules (such as exact name match, exact zip code match, etc.) first and using complex rules (involving distance computations) only if the simpler rules are not sufficient to make a decision regarding the level of similarity of the entities being compared, and even then only on some of the data attributes (such as a subset of name tokens, or on
the street name). The latter, in turn, can be addressed by building rules that specifically target this issue, such as rules that consider differences towards the beginning of names as being more significant than differences towards the end of the names. This rule-based similarity approach can be further enhanced by using various techniques such as stop word elimination, formatting and special character removal, number transformations, abbreviation generation and comparison, etc., which help preprocess and "standardize" the supplier data to enable a better comparison between them. Furthermore, rules can be designed to use information from dictionaries and standard company name databases (such as the Fortune 500 list) to assign different weights to different words and tokens in a name, thereby enhancing the normalization process further. For example, a non-dictionary word that occurs in the Fortune 500 list, such as "Intel", can be considered to be more significant in similarity measurements than other words. Constructing the repository of similarity rules is generally an iterative process, involving a good deal of manual trial and error initially. As in the case of regular expressions, it is often helpful to use a sample of data to help create these rules initially and then refine them as more data is analyzed. However, as more and more supplier bases are normalized, the rule repository incrementally grows to encompass more and more types of situations, and less and less manual intervention is required. As such, the canopy-based clustering technique with hierarchical (agglomerative) clustering using rule-based similarity measurement provides a suitable, efficient approach for supplier name normalization, the idea being to first use computationally cheap methods to make some loose clusters, called canopies, followed by more computationally intensive methods to refine the canopies further into appropriate clusters. Owing to the extremely large supplier bases encountered for many enterprises, this clustering approach is particularly attractive. To create canopies, cheap methods including zip code matches, phone number matches and name and/or address token matches are used in various rules. Once the canopies have been formed, more expensive techniques consisting of the elaborate similarity measurements are used to form clusters. Once the supplier base of the enterprise under consideration has been clustered, the same process is repeated for each of the other enterprises involved in the procurement BTO exercise. Once all the individual enterprises are normalized, they can be merged into a single, cross-enterprise normalized supplier base. Note, however, that if the enterprise being normalized is a new enterprise being brought on board an existing BTO platform, then it would need to be merged with the cumulative normalized supplier base formed from all previously normalized clients' data. In either case, the merger can be easily done using agglomerative clustering in conjunction with the same set of similarity rules used previously, with the sets of clusters in the two supplier bases being the starting clusters from which
the agglomeration process starts (see Section 3.2). A side advantage of this process is that, by incrementally building up such a repository of normalized suppliers and mining the repository for subsequent clients' normalization tasks, the accuracy and performance of the system can be progressively improved with each additional client. The supplier-normalization approach described above can be summarized as follows:
1. Pre-process the supplier data (name, address, etc.) by eliminating stop words, removing special characters, transforming numbers to a uniform format, etc.
2. Define regular expression-based extractors to break up address fields into more specific information such as street name, street number, PO Box number, etc. It may be helpful to take a sample of the data and use that to define the regular expressions (as discussed previously); a brief sketch is given after this list.
3. Create/augment the similarity-rules repository. As more and more supplier bases get normalized, the incremental changes needed to this repository decrease.
4. Segment the supplier base of the current enterprise:
   a. Create canopies using cheap similarity rules such as zip code matches, phone number matches, first-token matches (using n-grams) as well as exact and inclusion name matches.
   b. Use more stringent, computationally intensive similarity rules to create clusters from the canopies. Use cheaper rules first, followed by more expensive rules. Rules include checking for non-dictionary words, Fortune 500 words, similarity in name and address fields, abbreviation matches, etc.
5. Merge the set of clusters with the current normalized supplier base consisting of all enterprises' data that has already been normalized. This can be easily done using agglomerative clustering with the same repository of similarity rules, as described above.
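As a concrete illustration of step 2, the following is a minimal sketch of a regular expression-based extractor. The pattern, the named fields and the small set of street keywords are illustrative assumptions, not the extractors used in an actual deployment.

```python
# Minimal sketch of a regular-expression extractor for step 2; the
# pattern covers only the simple "number + street name + keyword" form
# mentioned in the text and would need refinement on real data.
import re

STREET_RE = re.compile(
    r"(?P<number>\d+)\s+"                      # numeric street number
    r"(?P<name>(?:[A-Za-z]+\s+)+?)"            # one or more words (street name)
    r"(?P<kind>Avenue|Ave|Street|St|Lane|Ln|Road|Rd)\b",
    re.IGNORECASE,
)

def extract_street(text):
    m = STREET_RE.search(text)
    return m.groupdict() if m else None

print(extract_street("IBM Corp, 1101 Mount Kisco Avenue, Armonk NY 10504"))
# {'number': '1101', 'name': 'Mount Kisco ', 'kind': 'Avenue'}
```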
3.2.2 Commodity taxonomy and commodity transactional mapping
Like supplier name normalization, commodity taxonomy mapping is also typically limited by the absence of mapped data. This is especially true in the case of taxonomy mapping for BTO procurement, as it often involves mapping the taxonomy of one enterprise to a totally different taxonomy. At the transactional level too, there is often no transactional data that is labeled with appropriate commodity codes, either due to the absence of a formal commodity taxonomy for spend categorization, or simply due to a lack of strict enforcement of such labeling for all transactions. In the limited cases in which labeled data is available within an enterprise (by way of descriptions in transactions that are also labeled with appropriate commodity codes), it does not associate the individual transactions with an external taxonomy, as is the case for BTO procurement. As a result, systems for automating commodity mapping, both taxonomy as well as transactional, are once again mostly limited to unsupervised methods, such as the similarity and clustering techniques discussed earlier, although, in some cases, classification techniques can play a useful role, as we discuss later in this section. Moreover, even clustering techniques are often of little use, especially for commodity taxonomy mapping, since each commodity in a taxonomy is generally different (except in some cases where the taxonomy is built on an ad hoc basis, which may result in some duplication), leaving similarity methods as one of the most viable techniques for building effective commodity mapping solutions. Furthermore, as in the case of supplier-name normalization, several issues have to be considered while deciding on the specific approach for both commodity taxonomy mapping and commodity transactional mapping. Some of these are the same issues that also affect supplier-name normalization, such as the computational expense of edit distance methods and the absence of positional differentiation in all the string similarity methods. Others are specific to the task of commodity mapping, though they affect taxonomy mapping and transactional mapping to different extents. Nevertheless, they must be considered and addressed appropriately for commodity mapping to be done successfully. One, different words (or their various grammatical forms) may be used in different descriptions to represent the same entity, both in taxonomies as well as in transactions. Examples are synonyms, tenses, plurals, etc. Two, commodity taxonomy descriptions are normally very short and concise. As such, each word is significant, albeit to different degrees. However, distinguishing between different potential matches becomes correspondingly harder, since the items in a taxonomy often number in the tens of thousands, from which the best one has to be selected based on a couple of words. Moreover, taxonomy descriptions may still have errors such as spelling mistakes, as the taxonomy may have been generated on the fly during the categorization of spend transactions in day-to-day procurement activities. Three, commodity descriptions often contain significant amounts of domain-specific terminology. Four, the order of words in commodity descriptions becomes an important issue, one that is not considered in traditional information retrieval methods that use a bag-of-words approach. For example, "tax software" and "software tax" are considered similar by a token-based similarity metric such as TF/IDF (a minimal numerical illustration is given after this list). Five, in cases where transactional mapping needs to be done, for the reasons highlighted earlier, the problems are compounded by the fact that transactional descriptions are often noisier than taxonomy descriptions, often have substantially more domain-specific terminology, and also entail the need for resolving potentially conflicting matches resulting from multiple descriptions in the same transaction (arising from different sources such as POs and invoices).
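To make the fourth point concrete, the following minimal sketch (a plain bag-of-tokens cosine, with IDF weighting omitted for brevity) shows that token-based similarity is blind to word order; it also previews the synonym problem discussed below. It is an illustration only, not the similarity measure used in an actual system.

```python
# Tiny illustration of the order-insensitivity noted in point four:
# a bag-of-tokens cosine similarity scores "tax software" and
# "software tax" as identical, and has no notion of synonyms.
from collections import Counter
from math import sqrt

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

print(cosine("tax software", "software tax"))   # 1.0 -- word order is lost
print(cosine("safety shoes", "safety boots"))   # 0.5 -- no synonym awareness
```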
Compounding all this is the fact that the source and target taxonomy may have wide structural differences. As a case in point, consider the UNSPSC code. It has roughly 20K commodities in a four-level taxonomy. However, while the taxonomy is very broad, and includes commodities and services in almost all industrial sectors, it is not very deep in any given sector. Company taxonomies, in contrast, are not very broad but are generally far more specific in terms of commodities, especially in the case of items used in production. For example, while the UNSPSC has commodity codes for desktop and notebook computers, companies are much more specific in terms of the specific types of desktop and notebook computers. This is more so in the case of production parts, but also occurs in the case of services. As such, there is often a many-to-many mapping that needs to be done between the two taxonomies. Another important factor is the need to determine exactly what the commodity description is referring to. For example, "software tax" is a sort of tax while "tax software" is a type of software. As pointed out earlier, token-based string similarity methods cannot distinguish between these two phrases. More importantly, though, they do not distinguish between tokens based on their semantic significance to the description, but only on their discriminative ability on the basis of token and document frequencies. The problem is that, while mapping taxonomies, it is quite common for a sizeable list of possible candidates to be evaluated as being similar to the description being considered on the basis of common tokens, but an accurate match cannot be made unless it is known what the specific object of the description is. To enable the mapping to be done properly, various techniques from the classical information retrieval literature, including stop word removal, stemming, tokenization using words and grams, etc., can be used in conjunction with dictionaries and domain-specific vocabulary. Moreover, lexical databases, such as WordNet (WordNet, 1998), enable the use of synonyms, sense determination, morphological analysis and part-of-speech determination in the creation of rules and methods for better identifying the main keyword(s) in a description and ranking the results of mapping better, as well as providing a means of enhancing similarity calculation on the basis of synonyms, different word forms, etc. instead of just the token similarity provided by vanilla string similarity methods. For example, consider the descriptions "safety shoes" and "safety boots". With a similarity measure like TF/IDF, they would be considered similar to some extent (due to the common token "safety") but there is no way to know that "shoes" and "boots" are also similar. Use of a system such as WordNet enables such mappings to be correctly made. Finally, a set of post-filtering and ranking rules needs to be created that assigns weights to tokens in the queries and candidate descriptions based on such importance, and re-ranks the candidate results to produce a more accurate match list. This is necessary since often exact matches are not found; rather, a list of potential matches is found with different sets of tokens in common with the description being
mapped, and a decision needs to be made as to which of these is the best match. Thus, for mapping commodity taxonomies, an appropriate mapping method would be to use token-based string similarity methods (TF/IDF) augmented with a lexical database, such as WordNet, and rules based on positional differences between tokens in the query and candidate descriptions. Another step that can prove to be quite beneficial is to mine previously mapped enterprise taxonomies for similarities to the commodity description in question, and to use that to acquire the proper UNSPSC mapping when substantial similarities are found. This approach (Singh and Kalagnanam, 2006) can be described as follows (an illustrative sketch of steps 2 and 3 is given after the list):
1. Pre-process the commodity descriptions for the target (UNSPSC) taxonomy, previously mapped enterprise taxonomies and the to-be-mapped taxonomy by eliminating stop words and applying transformations such as stemming and term normalization; generate synonyms for the descriptions using a lexical database such as WordNet; and generate TF/IDF indexes for each taxonomy.
2. Define/augment the weighting rules repository for identifying the best match for a given description.
   a. Define rules for identifying the main object of the description, as well as the related qualifiers. Thus, for "software tax", the object would be "tax" and the qualifier would be "software". For "application software tax", there would be an additional qualifier, "application".
   b. Define rules to rank prospective matches based on the objects, qualifiers and their relative positions. The general idea is as follows: in addition to the presence/absence of various tokens in the query and candidate match, weights are assigned to tokens based on their relative and absolute position, as well as their importance to the query (object, immediate qualifier, distant qualifier, etc.). Thus, for example, if the objects matched in value and position, the candidate would be ranked higher than a candidate in which the tokens matched but their relative positions did not. Thus, if the query was "software tax", then a candidate "tax" would be ranked higher than a candidate "tax software" even though the latter is a perfect token-based match. Similarly, "application software tax" would be ranked higher than "tax software" but lower than "software tax".
3. For each commodity description in the to-be-mapped taxonomy:
   a. Check for an exact match with a description in the target taxonomy or a previously mapped taxonomy. If found, stop and use the matches. Otherwise, proceed.
   b. Use the TF/IDF similarity method to generate a candidate list of possible matches.
   c. For the query description, identify the main object and qualifiers, use the weighting rules to rank the possible matches, and map to the highest-ranked description.
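The following is a hedged sketch of steps 2 and 3. The token-overlap score stands in for a real TF/IDF index, and the "object" of a description is crudely taken to be its last token; both are simplifying assumptions rather than the published method.

```python
# Sketch of the mapping loop above. Candidate generation uses a plain
# token-overlap score in place of a TF/IDF index, and the "object" of a
# description is assumed to be its last token; both are simplifications.
def tokens(s):
    return s.lower().split()

def overlap(query, cand):
    q, c = set(tokens(query)), set(tokens(cand))
    return len(q & c) / len(q | c) if q | c else 0.0

def rank(query, candidates):
    q_obj = tokens(query)[-1]                 # assumed head noun, e.g. "tax" in "software tax"
    def score(cand):
        s = overlap(query, cand)
        if tokens(cand)[-1] == q_obj:         # reward matching objects in matching position
            s += 1.0
        return s
    return sorted(candidates, key=score, reverse=True)

taxonomy = ["tax software", "tax", "application software tax", "software"]
print(rank("software tax", taxonomy))
# ['application software tax', 'tax', 'tax software', 'software']
```

Consistent with the ranking rules described in step 2b, "tax" outranks the perfect token-based match "tax software", since the latter has the wrong object in the object position.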
For transactional mapping, the same techniques and algorithms as described above for taxonomy mapping can be used with some extensions. First, the same clustering technique as used for supplier name normalization (canopy-based clustering in conjunction with hierarchical, agglomerative clustering, using rule-based similarity measurement) can be applied to cluster together similar transactions based on the transactional descriptions. Second, the taxonomy mapping algorithm described above is extended to use transactional descriptions from previously mapped companies' data as well. Third, simple methods (such as majority rule) are used to combine mapping results arising from multiple descriptions, either for the same transaction or for different transactions in the same cluster. Fourth, better repositories are built and improved techniques for filtering out the noise from such descriptions, mainly using stop words and better keyword indices, are designed. By the sheer nature of the task, this step will almost always necessitate extensive human intervention to get the mapping right, primarily due to domain-specific terminology and conventions. However, as more and more data is mapped, subsequent mapping exercises for other enterprises, especially those in the same industry as the ones mapped earlier, should require less and less human involvement and enable more automation. In this regard, classification methods can also prove useful, as models can be induced from the previously mapped data and used to map transactions from newer enterprises, especially in the same industry.
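As a minimal sketch of this classification idea (assuming scikit-learn is available), one could induce a text classifier from previously mapped transaction descriptions and apply it to a new client's transactions. The tiny training set, the placeholder commodity codes and the choice of TF-IDF plus Naive Bayes are illustrative assumptions, not a production configuration.

```python
# Hedged sketch: train a text classifier on previously mapped
# transaction descriptions and use it to suggest commodity codes for a
# new client's transactions. Codes and data below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Descriptions from already-mapped enterprises, labeled with normalized
# commodity codes.
train_text = ["laptop computer lease", "desktop pc purchase",
              "hazmat removal service", "hazardous waste disposal"]
train_code = ["43211503", "43211507", "76121900", "76121900"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_text, train_code)

new_transactions = ["notebook computer rental", "waste disposal vendor"]
print(model.predict(new_transactions))   # suggested codes, to be reviewed by an analyst
```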
4 Conclusion
This chapter discussed how enterprise spend aggregation can be automated using data cleansing techniques. Over the past few years, more and more enterprises have been heavily investing in IT tools that can significantly improve their procurement activities. An earlier step in this direction was the move towards addressing strategic procurement functions, such as strategic sourcing, that require aggregation of spend across the entire enterprise. A recent trend is the move towards the outsourcing of procurement functions (especially non-core procurement pieces) to third-party providers who then provide the procurement function for the enterprise. This practice, called Business Transformation Outsourcing, can generate significant financial benefits for the enterprises involved, but requires spend aggregation to be done on a much larger scale than before, often across multiple enterprises. Before such spend aggregation can be done, however, the spend data has to be cleansed and rationalized across, and within, the multiple enterprises, an activity that is typically done manually using rudimentary data analysis techniques and spreadsheets. At the same time, a significant amount of research has been conducted over the past couple of decades in various fields, such as databases, statistics and artificial
intelligence, on the development of various data cleansing techniques and their application to a broad range of applications and domains. This chapter provides a brief survey of these techniques and applications, and then discusses how some of these methods can be adapted to automate the various cleansing activities needed for spend data aggregation. Moreover, the chapter provides a detailed roadmap to enable the development of such an automated system for spend aggregation.
References Alvey, W., B. Jamerson (eds.) (1997). Record linkage techniques, in: Proceedings of an International Record Linkage Workshop and Exposition, March 20–21, Arlington, Virginia. Also published by National Academy Press (1999) and available at http://www.fcsm.gov under methodology reports. Anderberg, M.R. (1973). Cluster Analysis for Applications. Academic Press, New York. Baeza-Yates, R., B. Ribeiro-Neto (1999). Modern Information Retrieval. Addison-Wesley, Boston, MA. Banfield, J.D., A.E. Raftery (1993). Model based gaussian and non-gaussian clustering. Biometrics 49, 803–821. Bilenko, M., R.J. Mooney (2003). Adaptive duplicate detection using learnable string similarity metrics, in: Proceedings of ACM Conference on Knowledge Discovery and Data Mining, Washington, DC, pp. 39–48. Bitton, D., D.J. DeWitt (1983). Duplicate record elimination in large data files. ACM Transactions on Database Systems 8(2), 255–265. Borkar, V., K. Deshmukh, S. Sarawagi (2000). Automatically extracting structure from free text addresses. Bulletin of the Technical Committee on Data Engineering 23(4), 27–32. Bright, M.W., A.R. Hurson, S. Pakzad (1994). Automated resolution of semantic heterogeneity in multidatabases. ACM Transactions on Database Systems 19(2), 212–253. Califf, M.E. (1998). Relational learning techniques for natural language information extraction. Unpublished doctoral dissertation, University of Texas at Austin, Austin, TX, USA. Cheeseman, P., J. Stutz (1996). Bayesian classification (AutoClass): theory and results, in: U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA, pp. 153–180. Cochinwala, M., S. Dalal, A.K. Elmagarmid, V.S. Verykios (2001). Record matching: past, present and future. Available as Technical Report CSD-TR #01-013, Department of Computer Sciences, Purdue University. Available at http://www.cs.purdue.edu/research/technical_reports/2001/TR%2001-013.pdf Cohen, W.W. (2000). Data integration using similarity Joins and a word-based Information representation language. ACM Transactions on Information Systems 18(3), 288–321. Cohen, W.W., P. Ravikumar, S.E. Fienberg (2003). A comparison of string distance metrics for namematching tasks, in: Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, Acapulco, Mexico, pp. 73–78. Cohen, W.W., J. Richman (2002). Learning to match and cluster large high-dimensional data sets for data integration, in: Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining, pp. 475–480. Dempster, A.P., N.M. Laird, D.B., Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical. Society, Series B 39(1), 1–38. Dey, D., S. Sarkar, P. De (2002). A distance-based approach to entity reconciliation in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering 14(3), 567–582. Dice, L.R. (1945). Measures of the amount of ecologic association between species. Ecology 26(3), 297–302.
Fellegi, I.P., A.B. Sunter (1969). A theory of record linkage. Journal of the American Statistical Society 64, 1183–1210. Granada Research (2001). Using the UNSPSC – United Nations Standard Products and Services Code White Paper. Available at http://www.unspsc.org/ Hernandez, M.A., S.J. Stolfo (1995) The Merge/Purge problem for large databases, in: Proceedings of the ACM SIGMOD Conference, San Jose, CA. Hopcroft, J.E., R. Motwani, J.D. Ullman (2006). Introduction to automata theory, languages and computation. 3rd ed. Addison-Wesley, Boston, MA. Jaccard, P. (1912). The distribution of flora in the alpine zone. New Phytologist 11, 37–50. Jain, A.K., M.N. Murty, P.J. Flynn (1999). Data clustering: a review. ACM Computing Surveys 31(3). Jain, A.K., R.C. Dubes (1988). Algorithms for Clustering Data. Prentice Hall, Saddle River, NJ. Jaro, M.A. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 89, 414–420. Jaro, M.A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine 14, 491–498. Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features, in: C. Nedellec, C. Rouveirol (eds.), Lecture Notes in Computer Science: Proceedings of the 10th European Conference on Machine Learning. Springer, London, UK. Levenshtein, V.I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10(8), 707–710. Lim, E., R.H.L. Chiang (2004). Accommodating instance heterogeneities in database integration. Decision Support Systems 38(2), 213–231. McCallum, A., K. Nigam (1998). A comparison of event models for Naive Bayes text classification, in: AAAI-98 Workshop on Learning for Text Categorization. McCallum, A., K. Nigam, L.H. Ungar (2000). Efficient clustering of high-dimensional data sets with application to reference matching, in: Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, pp. 169–178. McQueen, J. (1967). Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, pp. 281–297. Monge, A.E., C. Elkan (1997). An efficient domain-independent algorithm for detecting approximately duplicate database records, in: Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, AZ. Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31–88. Newcombe, H.B. (1988). Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business. Oxford University Press, New York, NY. Nigam, K., J. Lafferty, A. McCallum (1999). Using maximum entropy for text classification, in: IJCAI99 Workshop on Machine Learning for Information Filtering, pp. 61–67. Rahm, E., H.H. Do (2000). Data cleaning: problems and current approaches. Bulletin of the Technical Committee on Data Engineering 23(4), 3–13. Salton, G., C. Buckley (1987). Term weighting approaches in automatic text retrieval. Technical Report No. 87-881, Department of Computer Science, Cornell University, Ithaca, New York. Salton, G. (1991). Developments in automatic text retrieval. Science 253, 974–980. Singh, M., J. Kalagnanam (2006). Using data mining in procurement business transformation outsourcing. 
12th ACM SIGKDD Conference on Knowledge Discovery and Data Mining – Workshop on Data Mining for Business Applications, Philadelphia, PA, pp. 80–86. Singh, M., J. Kalagnanam, S. Verma, A. Shah, S. Chalasani (2005). Automated cleansing for spend analytics. CIKM ’05-ACM 14th Conference on Information and Knowledge Management, Bremen, Germany. Selim, S.Z., M.A. Ismail (1984). K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 81–87.
UNSPSC, The United Nations Standard Products and Services Code. Available at http:// www.unspsc.org Wang, Y.R., S.E. Madnick (1989). The inter-database instance identification problem in integrating autonomous systems, in: Proceedings of the 5th International Conference on Data Engineering, Los Angeles, CA, pp. 46–55. Winkler, W.E. (2002). Record linkage and Bayesian networks, in: Proceedings of the Section on Survey Research Methods, American Statistical Association, Washington, DC. Winkler, W.E. (2006). Overview of record linkage and current research directions. Research Report Series: Statistics #2006-2, Statistical Research Division, U.S. Census Bureau, Washington, DC 20233. Available at http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf WordNet (1998). A lexical database for the English language. Cognitive Science Laboratory, Princeton University, Princeton, NJ. Available at http://wordnet.princeton.edu
Chapter 8
Spatial-Temporal Data Analysis and Its Applications in Infectious Disease Informatics
Daniel Zeng Department of Management Information Systems, The University of Arizona 1130 E. Helen Street, Rm 430, Tucson, AZ 85721-0108, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
James Ma and Hsinchun Chen Department of Management Information Systems, The University of Arizona 1130 E. Helen Street, Rm 430, Tucson, AZ 85721-0108, USA
Wei Chang Katz Graduate School of Business, The University of Pittsburgh 343 Mervis Hall, Pittsburgh, PA 15213, USA
Abstract
Recent years have witnessed significant interest in spatial-temporal data analysis. In this chapter, we introduce two types of spatial-temporal data analysis techniques and discuss their applications in public health informatics. The first technique focuses on clustering or hotspot analysis. Both statistical and machine learning-based analysis techniques are discussed in off-line (retrospective) and online (prospective) data analysis contexts. The second technique aims to analyze multiple data streams and identify significant correlations among them. Both classical spatial correlation analysis methods and new research on spatial-temporal correlation are presented. To illustrate how these spatial-temporal data analysis techniques can be applied in real-world settings, we report case studies in the domain of infectious disease informatics.
1 Introduction
Recent years have witnessed significant interest in spatial-temporal data analysis. The main reason for this interest is the availability of datasets
containing important spatial and temporal data elements across a wide spectrum of applications ranging from public health (disease case reports), public safety (crime case reports), search engines (search keyword geographical distributions over time), transportation systems (data from Global Positioning Systems (GPS)), to product lifecycle management (data generated by Radio Frequency Identification (RFID) devices), and financial fraud detection (financial transaction tracking data) (Sonesson and Bock, 2003). The following central questions of great practical importance have arisen in spatial-temporal data analysis and related predictive modeling: (a) How to identify areas having exceptionally high or low measures (hotspots)? (b) How to determine whether the unusual measures can be attributed to known random variations or are statistically significant? In the latter case, how to assess the explanatory factors? (c) How to identify any statistically significant changes in a timely manner in geographic areas? (d) How to identify significant correlations among multiple data streams with spatial and temporal data elements? Questions (a)–(c) can be tackled by spatial-temporal clustering analysis techniques, also known as hotspot analysis techniques. Two types of clustering methods have been developed in the literature. The first type of approach falls under the general umbrella of retrospective models (Yao, 2003; Kulldorff, 1997). It is aimed at testing statistically whether events (e.g., disease cases) are randomly distributed over space and time in a predefined geographical region during a predetermined time period. In many cases, however, this static perspective is inadequate as data often arrive dynamically and continuously, and in many applications there is a critical need for detecting and analyzing emerging spatial patterns on an ongoing basis. The second type of approach, prospective in nature, aims to meet this need with repeated time periodic analyses targeted at identification of statistically significant changes in an online context (Rogerson, 2001). Alerts are usually disseminated whenever such changes are detected. In the first part of this chapter, we present introductory material on both retrospective and prospective spatial-temporal data analysis techniques, and illustrate their applications using public health datasets (Chang et al., 2005; Kulldorff, 2001). To answer question (d), one has to study relationships among multiple datasets. Current correlation analysis is mainly applied in fields such as forestry (Stoyan and Penttinen, 2000), acoustics (Tichy, 1973; Veit, 1976), entomology (Cappaert et al., 1991), or animal science (Lean et al., 1992; Procknor et al., 1986), whose practices focus mostly on either time series or spatial data. One of the widely adopted definitions for spatial correlation analysis is Ripley’s K(r) function (Ripley, 1976, 1981). In order to analyze
the data sets with both spatial and temporal dimensions, in our recent research, we have extended the traditional K(r) definition by adding a temporal parameter t. By analyzing real-world infectious disease-related data sets, we found that the extended definition K(r, t) is more discriminating than the K(r) function and can discover causal events whose occurrences induce those of other events. In the second part of this chapter, we introduce both Ripley’s K(r) function and its extended form K(r, t), and discuss a case study applying them to a public health dataset concerning mosquito control. The remainder of this chapter is structured as follows. In Section 2 we introduce the retrospective and prospective spatial clustering techniques. Section 3 focuses on spatial and spatial-temporal correlation analysis methods. In Section 4 we conclude by summarizing the chapter.
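As a concrete illustration of the definitions above, the following minimal sketch computes naive estimates of K(r) and the space-time extension K(r, t) for simulated points in a rectangular study area, ignoring the edge corrections a careful analysis would include; it is not the implementation used in the case studies reported later.

```python
# Illustrative sketch: naive estimators of Ripley's K(r) and the
# space-time K(r, t) for n points observed in an area of size |A|
# over a period of length T. Edge corrections are omitted.
import numpy as np

def ripley_k(xy, r, area):
    n = len(xy)
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    pairs = np.sum((d <= r) & ~np.eye(n, dtype=bool))
    return area * pairs / (n * n)

def ripley_k_rt(xy, times, r, t, area, period):
    n = len(xy)
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    dt = np.abs(times[:, None] - times[None, :])
    pairs = np.sum((d <= r) & (dt <= t) & ~np.eye(n, dtype=bool))
    return area * period * pairs / (n * n)

rng = np.random.default_rng(0)
xy = rng.uniform(0, 10, size=(200, 2))        # locations in a 10 x 10 area
times = rng.uniform(0, 30, size=200)          # event days over a month
print(ripley_k(xy, r=1.0, area=100.0))
print(ripley_k_rt(xy, times, r=1.0, t=3.0, area=100.0, period=30.0))
```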
2 Retrospective and prospective spatial clustering
We first review major types of retrospective and prospective surveillance approaches in Section 2.1. Section 2.2 introduces recently developed spatial-temporal data analysis methods based on a robust support vector machine (SVM)-based spatial clustering technique. The main technical motivation behind such methods is the lack of hotspot analysis techniques capable of detecting unusual geographical regions with arbitrary shapes. Section 2.3 summarizes several computational experiments based on simulated datasets. This experimental study includes a comparative component evaluating the SVM-based approaches against other methods in both retrospective and prospective scenarios. In Section 2.4, we summarize a case study applying spatial-temporal clustering analysis technologies to real-world datasets.
2.1 Literature review
In this section, we first introduce retrospective spatial-temporal data analysis, and then present representative prospective surveillance methods, many of which were developed as extensions to retrospective methods.
2.1.1 Retrospective spatial-temporal data analysis
Retrospective approaches determine whether observations or measures are randomly distributed over space and time for a given region. Clusters of data points or measures that are unlikely under the random distribution assumption are reported as anomalies. A key difference between retrospective analysis and standard clustering lies in the concept of "baseline" data. For standard clustering, data points are grouped together directly
based on the distances between them. Retrospective analysis, on the other hand, is not concerned with such clusters. Rather, it aims to find out whether unusual clusters formed by the data points of interest exist relative to the baseline data points. These baseline data points represent how the normal data should be spatially distributed given the known factors or background information. Clusters identified in this relative sense provide clues about dynamic changes in spatial patterns and indicate the possible existence of unknown factors or emerging phenomena that may warrant further investigation. In practice, it is the data analyst's responsibility to separate the dataset into two groups: baseline data and data points of interest, typically with events corresponding to the baseline data preceding those corresponding to the data points of interest. As such, retrospective analysis can be conceptualized as a spatial "before and after" comparison. For example, Fig. 1 shows certain disease incidents in a city. Asterisks indicate the locations where the disease incidents usually occur in normal situations (corresponding to the baseline cases). Crosses are the recently confirmed incidents (cases of interest). Comparing the distribution of the cases of interest with that of the baseline, one could identify an emerging area containing dense disease incidents, indicative of a possible outbreak. In Fig. 1, this emerging area is identified with an irregularly shaped area close to the center. Later we discuss two major types of retrospective analysis methods: scan statistic-based and clustering-based. A comparative study of these two types of retrospective approaches can be found in Zeng et al. (2004).
Fig. 1. An example of retrospective analysis.
2.1.1.1 Scan statistic-based retrospective analysis. Various types of scan statistics have been developed in the past four decades for surveillance and monitoring purposes in a wide range of application contexts. For spatial-temporal data analysis, a representative method is the spatial scan statistic approach (Kulldorff, 1997). This method has become one of the most popular methods for the detection of geographical disease clusters and is widely used by public health departments and researchers. In this approach, the number of events, for example, disease cases, may be assumed to be either Poisson or Bernoulli distributed. Algorithmically, the spatial scan statistic method imposes a circular window on the map under study and lets the center of the circle move over the area so that at different positions the window includes different sets of neighboring cases. Over the course of data analysis, the method creates a large number of distinct circular windows (other shapes, such as rectangles and ellipses, have also been used), each with a different set of neighboring areas within it and each a possible candidate for containing an unusual cluster of events. A likelihood ratio is defined on each circle to compute how unlikely it is that the number of cases of interest falling into that circle is due to pure chance. The circles with high likelihood ratios are in turn reported as spatial anomalies or hotspots.
2.1.1.2 Clustering-based retrospective analysis. Despite the success of the spatial scan statistic and its variations in spatial anomaly detection, the major computational problem faced by this type of method is that the scanning windows are limited to simple, fixed symmetrical shapes for analytical and search efficiency reasons. As a result, when the real underlying clusters do not conform to such shapes, the identified regions are often not well localized. Another problem is that it is often difficult to customize and fine-tune the clustering results using scan statistic approaches. For different types of analysis, the users often have different needs as to the level of granularity and the number of the resulting clusters, and they have different degrees of tolerance regarding outliers. These problems have motivated the use of alternative modeling approaches based on clustering. Risk-adjusted nearest neighbor hierarchical clustering (RNNH) (Levine, 2002) is a representative of such approaches. Developed for crime hotspot analysis, RNNH is based on the well-known nearest neighbor hierarchical (NNH) clustering method, combining hierarchical clustering (Johnson, 1967) capabilities with kernel density interpolation techniques (Levine, 2002). The standard NNH approach identifies clusters of data points that are close together (based on a threshold distance). Many such clusters, however, are due to some background or baseline factors (e.g., the population, which is not evenly distributed over the entire area of interest). RNNH is primarily motivated to identify clusters of data points relative to the baseline factor. Algorithmically, it dynamically adjusts the threshold distance in inverse proportion to some density measure of the baseline factor (e.g., the threshold should be shorter in regions where
the population is high). Such density measures are computed using kernel density estimation based on the distances between the location under study and some or all other data points. The key steps of the RNNH approach can be summarized as follows: (1) define a grid over the area of interest, calculate the kernel density of baseline points for each grid cell, and rescale these density measures using the total number of cases; (2) calculate the threshold distances between data points for hierarchical clustering purposes and perform the standard NNH clustering based on these distance thresholds. RNNH has been shown to be a successful tool in detecting spatial-temporal criminal activity patterns (Levine, 2002). We argue that its built-in flexibility in incorporating any given baseline information, together with its computational efficiency, also makes it a good candidate for analyzing spatial-temporal data in other applications. In Section 2.2.1, we will introduce another clustering-based method, called risk-adjusted support vector clustering (RSVC) (Zeng et al., 2005), the result of our recent attempt to combine the risk adjustment idea of RNNH with a modern, robust clustering mechanism based on support vector machines (SVMs) to improve the quality of hotspot analysis.
2.1.2 Prospective spatial-temporal surveillance
A major advantage that prospective approaches have over retrospective approaches is that they do not require a separation between baseline cases and cases of interest in the input data. Such a separation is required in retrospective analysis and is a major source of confusion and difficulty for end users. Prospective methods bypass this problem and process data points continuously in an online context. Two types of prospective spatial-temporal data analysis approaches have been developed in the statistics literature (Kulldorff, 2001; Rogerson, 1997, 2001). The first type segments the surveillance data into chunks by arrival time and then applies a spatial clustering algorithm to identify abnormal changes. In essence, this type of approach reduces a spatial-temporal surveillance problem to a series of spatial surveillance problems. The second type explicitly considers the temporal dimension and clusters data points directly based on both spatial and temporal coordinates. We briefly summarize representative approaches of both types, including Rogerson's methods and the space-time scan statistic.
2.1.2.1 Rogerson's methods. Rogerson has developed CUSUM-based surveillance methods to monitor spatial statistics such as the Tango and Knox statistics, which capture spatial distribution patterns existing in the surveillance data (Rogerson, 1997, 2001). CUSUM is a univariate surveillance approach that monitors the number of events in a fixed
interval. Let $C_t$ be the spatial statistic (e.g., Tango or Knox) at time $t$. The surveillance variable is defined as $Z_t = (C_t - E(C_t \mid C_{t-1}))/\sigma(C_t \mid C_{t-1})$. Refer to Rogerson (1997, 2001) for the derivation of the conditional expectation $E(C_t \mid C_{t-1})$ and the corresponding standard deviation $\sigma(C_t \mid C_{t-1})$. Following the CUSUM surveillance approach, the deviations $Z_t$ are accumulated over time; when the accumulated deviation exceeds a threshold value, the system reports an anomaly (which typically triggers an alarm in public health applications). Rogerson's methods have successfully detected the onset of Burkitt's lymphoma in Uganda during 1961–1975 (Rogerson, 1997).
2.1.2.2 Space-time scan statistic. Kulldorff has extended his retrospective two-dimensional spatial scan statistic to a three-dimensional space-time scan statistic, which can be used as a prospective analysis method (Kulldorff, 2001). The basic intuition is as follows. Instead of using a moving circle to search the area of interest, one can use a cylindrical window in three dimensions. The base of the cylinder represents space, exactly as with the spatial scan statistic, whereas the height of the cylinder represents time. For each possible circle location and size, the algorithm considers every possible starting and ending time. The likelihood ratio test statistic for each cylinder is constructed in the same way as for the spatial scan statistic. After a computationally intensive search process, the algorithm identifies abnormal clusters together with their corresponding geographic locations and time periods. The space-time scan statistic has successfully detected an increased rate of male thyroid cancer in Los Alamos, New Mexico during 1989–1992 (Kulldorff, 2001).
2.2 Support vector clustering-based spatial-temporal data analysis
In this section, we present two recently developed robust spatial-temporal data analysis methods. The first is a retrospective hotspot analysis method called RSVC (Zeng et al., 2005). The second is a prospective analysis method called prospective support vector clustering (PSVC), which uses RSVC as its clustering engine (Chang et al., 2005).
2.2.1 Risk-Adjusted Support Vector Clustering (RSVC)
RSVC is the result of our recent attempt to combine the risk adjustment idea of RNNH with a modern, SVM-based robust clustering mechanism to improve the quality of hotspot analysis. SVM-based clustering (SVC) (Ben-Hur et al., 2001) is a well-known extension of SVM-based classification. However, the standard version of SVC does not take baseline data points into consideration and therefore cannot be directly used in spatial-temporal data analysis. As such, we have developed a risk-adjusted variation, called RSVC, based on ideas similar to those in RNNH. First, using only the baseline points, a density map is constructed
using standard approaches such as kernel density estimation. Second, the case data points are mapped implicitly to a high-dimensional feature space defined by a kernel function (typically the Gaussian kernel). The width parameter of the Gaussian kernel determines the dimensionality of the feature space: the larger the width parameter, the harder it is for data points in the original space to form a cluster, and hence data points are more likely to belong to smaller clusters. Our algorithm dynamically adjusts the width parameter based on the kernel density estimates obtained in the previous step. The basic intuition is as follows: when the baseline density is high, a larger width value is used to make it harder for points to be clustered together. Third, following the SVM approach, RSVC finds a hypersphere of minimal radius in the feature space that contains most of the data points. The problem of finding this hypersphere can be formulated as a quadratic or linear program, depending on the distance function used. Fourth, the function estimating the support of the underlying data distribution is constructed using the kernel function and the parameters learned in the third step. When projected back to the original data space, the identified hypersphere is mapped to (possibly multiple) clusters, which are then returned as the output of RSVC.
2.2.2 Prospective support vector clustering
Although well grounded theoretically, both Rogerson's methods and the space-time scan statistic have major computational limitations. Rogerson's approaches can monitor a given target area but cannot search for problematic areas or identify the geographic shape of these areas. The space-time scan statistic method performs poorly when the true abnormal areas do not conform to simple shapes such as circles. Below we introduce the basic ideas behind our approach, called PSVC, and summarize its main algorithmic steps.
Our PSVC approach follows the design of the first type of spatial-temporal surveillance method discussed in Section 2.1.2, which involves repeated spatial clustering over time. More specifically, the time horizon is first discretized based on the specific characteristics of the data stream under study. Whenever a new batch of data arrives, PSVC treats the data collected during the previous time frame as the baseline data and runs the retrospective RSVC method. After obtaining a potential abnormal area, PSVC attempts to determine how statistically significant the identified spatial anomaly is. Many indices have been developed to assess the significance of the results of clustering algorithms in general (Halkidi et al., 2002a,b). However, all these criteria assess clustering in an absolute sense, without considering the baseline information, and thus are not readily suitable for prospective spatial-temporal data analysis. Kulldorff's (1997) likelihood ratio L(Z), as defined in the following equation, is to the best of our knowledge the only statistic that explicitly takes
the baseline information into account:

$$L(Z) = \left(\frac{c}{n}\right)^{c} \left(1-\frac{c}{n}\right)^{n-c} \left(\frac{C-c}{N-n}\right)^{C-c} \left(1-\frac{C-c}{N-n}\right)^{(N-n)-(C-c)} \qquad (1)$$
In this definition, C and c are the number of cases in the entire dataset and the number of cases within the scanned area Z, respectively; N and n are the total number of cases and baseline points in the entire dataset and within Z, respectively. Since the distribution of the statistic L(Z) is unknown, we use a standard Monte Carlo simulation approach to calculate statistical significance as measured by the p-value. Specifically, we first generate T replications of the dataset, assuming that the data are randomly distributed. We then calculate the likelihood ratio L(Z) over the same area Z for each replication. Finally, we rank these likelihood ratios; if the observed L(Z) takes the Xth position, the p-value is set to X/(T+1).
Note that in a straightforward implementation of the algorithmic design described above, anomalies are identified (or, equivalently, alerts are triggered) only when adjacent data batches exhibit significant changes in spatial distribution. This localized, myopic view, however, may lead to significant delays in alarm triggering or even to false negatives, because in some circumstances unusual changes manifest gradually. In such cases, there may not be any significant changes between adjacent data batches, yet the accumulated changes over several consecutive batches can be significant and should trigger an alarm. This observation suggests that a more ‘‘global’’ perspective beyond comparing adjacent data batches is needed.
It turns out that the CUSUM approach provides a suitable conceptual framework for designing a computational approach with such a global perspective. The analogy is as follows. In the CUSUM approach, cumulative deviations from the expected value are explicitly tracked. In prospective analysis, it is difficult to design a single one-dimensional statistic that captures what the normal spatial distribution should look like and measures the extent to which deviations occur. However, conceptually, the output of a retrospective surveillance method such as RSVC can be viewed as the differences or discrepancies between two data batches, with the baseline data representing the expected data distribution. In addition, cumulative discrepancies can be computed by running RSVC with an appropriately chosen separation of baseline and case data. For an efficient implementation, we use a stack as a control data structure to keep track of RSVC runs, which now include comparisons beyond adjacent single periods. The detailed control strategy is described below. When clusters generated in two consecutive RSVC runs overlap, we deem the areas covered by these clusters to be risky areas. We use the stack to store these clusters along with the data batches from which the risky clusters were identified. Then we run RSVC to compare the current data
batch with each element (in the form of a data batch) of the stack, sequentially from top to bottom, to examine whether significant spatial pattern changes have occurred. Stacks whose top data batch is not the current data batch under examination can be emptied, since the areas they represent no longer show a trend toward any significant distribution change. This operation resembles the step in the CUSUM calculation in which the accumulated deviation is reset to 0 when the monitored variable is no longer within the risky range. We now explain the main steps of the PSVC algorithm as shown in Fig. 2. Each cluster stack represents a candidate abnormal area, and the array clusterstacks holds the cluster stacks that keep track of all candidate areas under consideration. Initially (line 1), clusterstacks is empty. The steps from lines 3 to 35 are executed whenever a new data batch enters the system. First, the retrospective RSVC method is executed (line 3) to compare the spatial distribution of the new data batch with that of the previous data batch. The resulting abnormal clusters are saved in rsvcresult. Any statistically significant cluster in rsvcresult immediately triggers an alert (line 5).
1   clusterstacks = []
2   Whenever a new data batch arrives {
3       rsvcresult = RSVC(previousdate, currentdate)
4       For each cluster C recorded in rsvcresult {
            /* C records the identified cluster, its p-value, and the
               date of the associated data batch. */
5           If (C.p-value
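To illustrate the significance test used in this loop, the Python sketch below computes the logarithm of Kulldorff's likelihood ratio from Eq. (1) and derives a Monte Carlo p-value by ranking the observed value against T replications, following the X/(T+1) rule described earlier. This is a hedged illustration rather than the authors' implementation: the random relabeling of cases via a hypergeometric draw, and the restriction to elevated case rates inside Z, are assumptions made here for concreteness; in PSVC the test is applied to the clusters produced by RSVC and managed through the cluster stacks of Fig. 2.

# Illustrative sketch of the significance test based on Eq. (1); not the
# authors' implementation. c, n = cases and total points inside the scanned
# area Z; C, N = cases and total points in the entire dataset (0 < n < N).
import numpy as np

def log_likelihood_ratio(c, n, C, N):
    def xlogy(x, y):
        return 0.0 if x == 0 else x * np.log(y)
    p_in = c / n                      # case rate inside Z
    p_out = (C - c) / (N - n)         # case rate outside Z
    if p_in <= p_out:                 # assumption: only elevated rates inside Z count
        return 0.0
    return (xlogy(c, p_in) + xlogy(n - c, 1.0 - p_in)
            + xlogy(C - c, p_out) + xlogy((N - n) - (C - c), 1.0 - p_out))

def monte_carlo_p_value(c_obs, n, C, N, T=999, rng=None):
    # Null model (an assumption here): the C case labels are spread uniformly
    # at random over the N points, so the case count inside Z (which holds
    # n points) follows a hypergeometric distribution.
    rng = rng or np.random.default_rng()
    observed = log_likelihood_ratio(c_obs, n, C, N)
    replicated = [log_likelihood_ratio(rng.hypergeometric(C, N - C, n), n, C, N)
                  for _ in range(T)]
    rank = 1 + sum(r >= observed for r in replicated)   # position X of the observed value
    return rank / (T + 1)                                # p-value = X / (T + 1)

With T = 999, for example, an observed ratio exceeding every replicated ratio receives a p-value of 1/1000, and a cluster would trigger an alert at line 5 of Fig. 2 whenever its p-value falls below the chosen significance level.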