Designing a Data Warehouse: Supporting Customer Relationship Management
Chris Todman
Publisher: Prentice Hall PTR
First Edition, December 2000
ISBN: 0-13-089712-4, 360 pages
Today's next-generation data warehouses are being built with a clear goal: to maximize the power of customer relationship management. To make CRM-focused data warehousing work, you need new techniques and new methodologies. In Designing a Data Warehouse, Dr. Chris Todman, one of the world's leading data warehousing consultants, delivers the first start-to-finish methodology for defining, designing, and implementing CRM-focused data warehouses. Todman covers all this, and more:

- A new look at data warehouse conceptual models, logical models, and physical implementation
- Project management: deliverables, assumptions, risks, and team building, including a full breakdown of work
- Data warehouse futures: temporal databases, OLAP SQL extensions, active decision support, integrating external and unstructured data, search agents, and more
Designing a Data Warehouse: Supporting Customer Relationship Management

List of Figures

Preface
  FIRST-GENERATION DATA WAREHOUSES
  SECOND-GENERATION DATA WAREHOUSES AND CUSTOMER RELATIONSHIP MANAGEMENT
  WHO SHOULD READ THIS BOOK

Acknowledgments

1. Customer Relationship Management
  THE BUSINESS DIMENSION
  BUSINESS GOALS
  BUSINESS STRATEGY
  THE VALUE PROPOSITION
  CUSTOMER RELATIONSHIP MANAGEMENT
  SUMMARY

2. An Introduction to Data Warehousing
  INTRODUCTION
  WHAT IS A DATA WAREHOUSE?
  DIMENSIONAL ANALYSIS
  BUILDING A DATA WAREHOUSE
  PROBLEMS WHEN USING RELATIONAL DATABASES
  SUMMARY

3. Design Problems We Have to Face Up To
  DIMENSIONAL DATA MODELS
  WHAT WORKS FOR CRM
  SUMMARY

4. The Implications of Time in Data Warehousing
  THE ROLE OF TIME
  PROBLEMS INVOLVING TIME
  CAPTURING CHANGES
  FIRST-GENERATION SOLUTIONS FOR TIME
  VARIATIONS ON A THEME
  CONCLUSION TO THE REVIEW OF FIRST-GENERATION METHODS

5. The Conceptual Model
  REQUIREMENTS OF THE CONCEPTUAL MODEL
  THE IDENTIFICATION OF CHANGES TO DATA
  DOT MODELING
  DOT MODELING WORKSHOPS
  SUMMARY

6. The Logical Model
  LOGICAL MODELING
  THE IMPLEMENTATION OF RETROSPECTION
  THE USE OF THE TIME DIMENSION
  LOGICAL SCHEMA
  PERFORMANCE CONSIDERATIONS
  CHOOSING A SOLUTION
  FREQUENCY OF CHANGED DATA CAPTURE
  CONSTRAINTS
  EVALUATION AND SUMMARY OF THE LOGICAL MODEL

7. The Physical Implementation
  THE DATA WAREHOUSE ARCHITECTURE
  CRM APPLICATIONS
  BACKUP OF THE DATA
  ARCHIVAL
  EXTRACTION AND LOAD
  SUMMARY

8. Business Justification
  THE INCREMENTAL APPROACH
  THE SUBMISSION
  SUMMARY

9. Managing the Project
  INTRODUCTION
  WHAT ARE THE DELIVERABLES?
  WHAT ASSUMPTIONS AND RISKS SHOULD I INCLUDE?
  WHAT SORT OF TEAM DO I NEED?
  SUMMARY

10. Software Products
  EXTRACTION, TRANSFORMATION, AND LOADING
  OLAP QUERY TOOLS
  DATA MINING
  CAMPAIGN MANAGEMENT
  PERSONALIZATION
  METADATA TOOLS
  SORTS

11. The Future
  TEMPORAL DATABASES (TEMPORAL EXTENSIONS)
  OLAP EXTENSIONS TO SQL
  ACTIVE DECISION SUPPORT
  EXTERNAL DATA
  UNSTRUCTURED DATA
  SEARCH AGENTS
  DSS AWARE APPLICATIONS

A. Wine Club Temporal Classifications

B. Dot Model for the Wine Club

C. Logical Model for the Wine Club

D. Customer Attributes
  HOUSEHOLD AND PERSONAL ATTRIBUTES
  BEHAVIORAL ATTRIBUTES
  FINANCIAL ATTRIBUTES
  EMPLOYMENT ATTRIBUTES
  INTERESTS AND HOBBY ATTRIBUTES

References
Designing a Data Warehouse: Supporting Customer Relationship Management

Library of Congress Cataloging-in-Publication Data
Todman, Chris.
  Designing a data warehouse: in support of customer relationship management / Chris Todman.
    p. cm.
  Includes bibliographical references and index.
  ISBN 0-13-089712-4
  1. Data warehousing. I. Title.
  HD30.2.T498 2001
  658.4/03/0285574 21
  1220202534 CIP

Credits
Editorial/Production Supervisor: Kerry Reardon
Project Coordinator: Anne Trowbridge
Acquisitions Editor: Jill Pisoni
Editorial Assistant: Justin Somma
Manufacturing Buyer: Maura Zaldivar
Manufacturing Manager: Alexis Heydt
Marketing Manager: Dan DePasquale
Art Director: Gail Cocker-Bogusz
Cover Designer: Nina Scuderi
Cover Design Director: Jerry Votta
Manager, HP Books: Patricia Pekary
Editorial Director, Hewlett Packard Professional
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia, Pte. Ltd.
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
List of Figures

1.1 Just who are our best customers?
1.2 CRM in an organization.
1.3 The components of CRM.
1.4 The number of communication channels is growing.
2.1 Fragment of data model for the Wine Club.
2.2 Three-dimensional data cube.
2.3 Wine sales dimensional model for the Wine Club.
2.4 Data model showing multiple join paths.
2.5 The main components of a data warehouse system.
2.6 General state transition diagram.
2.7 State transition diagram for the orders process.
2.8 Star schema showing the relationships between facts and dimensions.
2.9 Stratification of the data.
2.10 Snowflake schema for the sale of wine.
2.11 Levels of summarization in a data warehouse.
2.12 Modified data warehouse structure incorporating summary navigation and data mining.
3.1 Star schema for the Wine Club.
3.2 Third normal form version of the Wine Club dimensional model.
3.3 Confusing and intimidating hierarchy.
3.4 Common organizational hierarchy.
3.5 Star schema for the Wine Club.
3.6 Sharing information.
3.7 General model for customer details.
3.8 General model for a customer with changing circumstances.
3.9 Example model showing customer with changing circumstances.
3.10 The general model extended to include behavior.
3.11 The example model extended to include behavior.
3.12 General conceptual model for a customer-centric data warehouse.
3.13 Wine Club customer changing circumstances.
3.14 Wine Club customer behavior.
3.15 Derived segment examples for the Wine Club.
4.1 Fragment of operational data model.
4.2 Operational model with additional sales fact table.
4.3 Sales hierarchy.
4.4 Sales hierarchy with sales table attached.
4.5 Sales hierarchy showing altered relationships.
4.6 Sales hierarchy with intersection entities.
4.7 Sales hierarchy with data.
4.8 Simple general business hierarchy.
4.9 Graphical representation of temporal functions.
4.10 Types of temporal query.
4.11 Traditional resolution of m:n relationships.
4.12 Representation of temporal attributes by attaching them to the dimension.
4.13 Representation of temporal hierarchies by attaching them to the facts.
4.14 Representation of temporal attributes by attaching them to the facts.
5.1 Example of a two-dimensional report.
5.2 Example of a three-dimensional cube.
5.3 Simple multidimensional dot model.
5.4 Representation of the Wine Club using a dot model.
5.5 Customer-centric dot model.
5.6 Initial dot model for the Wine Club.
5.7 Refined dot model for the Wine Club.
5.8 Dot modeling worksheet showing Wine Club sales behavior.
5.9 Example of a hierarchy.
6.1 ER diagram showing new relationships to the time dimension.
6.2 Logical model of part of the Wine Club.
7.1 The EASI data architecture.
7.2 Metadata model for validation.
7.3 Integration layer.
7.4 Additions to the metadata model to include the source mapping layer.
7.5 Metadata model for the VIM layer.
7.6 Customer nonchanging details.
7.7 The changing circumstances part of the GCM.
7.8 Behavioral model for the Wine Club.
7.9 Data model for derived segments.
7.10 Daily partitions.
7.11 Duplicated input.
8.1 Development abstraction shown as a roadmap.
9.1 Classic waterfall approach.
9.2 Example project team structure.
10.1 Extraction, transformation, and load processing.
10.2 Typical OLAP architecture.
10.3 Descriptive field distribution.
10.4 Numeric field distribution using a histogram.
10.5 Web plot that relates gender to regions.
10.6 Rule induction for wine sales.
10.7 Example of a multiphase campaign.
Preface

The main subject of this book is data warehousing. A data warehouse is a special kind of database that, in recent years, has attracted a great deal of interest in the information technology industry. Quite a few books have been published about data warehousing generally, but very few have focused on the design of data warehouses. There are some notable exceptions, and these will be cited in this book, which concentrates principally on the design aspects of data warehousing.

Data warehousing is all about making information available. No one doubts the value of information, and everyone agrees that most organizations have a potential “Aladdin's Cave” of information that is locked away within their operational systems. A data warehouse can be the key that opens the door to this information.

There is strong evidence to suggest that our early foray into the field of data warehousing, what I refer to as first-generation data warehouses, has not been entirely successful. As is often the case with new ideas, especially in the information technology (IT) industry, the IT practitioners were quick to spot the potential, and they tried hard to secure the competitive advantage for their organizations that the data warehouse promised. In doing so, I believe, two points were overlooked.

The first point is that, at first sight, a data warehouse can appear to be quite a simple application. In reality it is anything but simple. Quite apart from the basic issue of sheer scale (data warehouse databases are among the largest on earth) and the consequent performance difficulties this presents, the data structures are inherently more complex than the early pioneers of these systems realized. As a result, there was a tendency to oversimplify the design so that, although the database was simple to understand and use, many important questions could not be asked.

The second point is that data warehouses are unlike operational systems in that it is not possible to define the requirements precisely. This is at odds with conventional systems, where it is the specification of requirements that drives the whole development lifecycle. Our approach to systems design is still largely founded on a thorough understanding of requirements: the “hard” systems approach. In data warehousing we often don't know what the problems are that we are trying to solve. Part of the role of the data warehouse should be to help organizations understand what their problems are.

Ultimately it comes down to design and, again, there are two main points to consider. The first concerns the data warehouse itself: just how do we ensure that the data structures will enable us to ask the difficult questions? The second is that the hard systems approach has been shown to be too restrictive, and a softer technique is required. So not only do we need to improve our design of data warehouses, we also need to improve the way in which we approach the design. It is in response to these two needs that this book has been written.
FIRST-GENERATION DATA WAREHOUSES

Historically, the first-generation data warehouses were built on certain principles that were laid down by gurus in the industry. This author recognizes two great pioneers in data warehousing: Bill Inmon and Ralph Kimball. These two chaps, in my view, have done more to advance the development of data warehousing than any others. Although many claim to have been “doing data warehousing long before it was ever called data warehousing,” Inmon and Kimball can realistically claim to be the founders because they alone laid down the definitions and design principles that most practitioners are aware of today. Even if their guidelines are not followed precisely, it is still common to refer to Inmon's definition of a data warehouse and Kimball's rules on slowly changing dimensions.

Chapter 2 of this book is an introduction to data warehousing. In some respects it should be regarded as a scene-setting chapter, as it introduces data warehouses from first principles by describing the following:

- Need for decision support
- How data warehouses can help
- Differences between operational systems and data warehouses
- Dimensional models
- Main components of a data warehouse

Chapter 2 lays the foundation for the evolution to the second-generation data warehouses.
SECOND-GENERATION DATA WAREHOUSES AND CUSTOMER RELATIONSHIP MANAGEMENT

Before the introduction to data warehousing, we take a look at the business issues in a kind of rough guide to customer relationship management (CRM). Data warehousing has been waiting for CRM to appear. Without it, data warehouses were still popular but, very often, the popularity was as much in the IT domain as anywhere else. IT management was quick to see the potential of data warehouses, but the business justification was not always the main driver, and this has led to the failure of some data warehouse projects. There was often a reluctance on the part of business executives to sponsor these large and expensive database development projects, and those that were sponsored by IT just didn't hit the spot. The advent of CRM changed all that. CRM cannot be practiced in business without a major source of information, which, of course, is the data warehouse's raison d'être. Interest in data warehousing has been revitalized, and this time it is the business people who are firmly in the driving seat.

Having introduced the concept of CRM and described its main components, we explore, with the benefit of hindsight, the flaws in the approach to designing first-generation data warehouses and propose a method for the next generation. We start by examining some of the design issues and pick our way carefully through the more sensitive areas in which the debate has smoldered, if not raged a little, over the past several years. One of the fundamental issues surrounds the representation of time in our designs. There has been very little real support for this, which is a shame, since data warehouses are true temporal applications that have become pervasive in all kinds of businesses. In formulating a solution, we reintroduce, from the mists of time, the old conceptual, logical, and physical approach to building data warehouses. There are good reasons why we should do this and, along the way, these reasons are aired.

We have a short chapter on the business justification. The message is clear: if you cannot justify the development of the data warehouse, then don't build it. No one will thank us for designing and developing a beautifully engineered, high-performing system if, ultimately, it cannot pay for itself within an appropriate time. Many data warehouses can justify themselves several times over, but some cannot. We do not want to add to the list of failed projects. Ultimately, no one benefits from this, and we should be quite rigorous in the justification process.
Project management is a crucial part of a data warehouse development. The normal approach to project management doesn't work. There are many seasoned, top-drawer project managers who, in the beginning, are very uncomfortable with data warehouse projects. The uncertainty of the deliverables and the imprecise nature of the acceptance criteria send them howling for the safety net of the famous system specification. It is hoped that the chapter on project management will provide some guidance.

People who know me think I have a bit of a “down” on software products, and if I'm honest I suppose I do. I get a little irritated when the same old query tools get dusted off and relaunched as though they were new products each time some new thing comes along. Once upon a time a query tool was a query tool. Now it's a data mining product, a segmentation product, and a CRM product as well. OK, these vendors have to make a living but, as professional consultants, we have to protect our customers, particularly the gullible ones, from some of these vendors. Some of the products do add value and some, while being astronomically expensive, don't add much value at all. The chapter on software products sheds some light on the types of tools that are available, what they're good at, what they're not good at, and what the vendors won't tell you if you don't ask.
WHO SHOULD READ THIS BOOK

Although there is a significant amount of technical material in the book, the potential audience is quite wide.

For anyone wishing to learn the principles of data warehousing, Chapter 2 has been adapted from undergraduate course material. It explains, in simple terms:

- What data warehouses are
- How they are used
- The main components
- The data warehouse “jargon”

There is also a description of some of the pitfalls and problems faced in the building of data warehouses.

For consultants, the book contains a method for ensuring that the business objectives will be met. The method is a top-down approach using proven workshop techniques. There is also a chapter devoted to assisting in the building of the business justification.

For developers of data warehouses, the book contains a massive amount of material about the design, especially in the area of the data model, the treatment of time, and the conceptual, logical, and physical layers of development. The book contains a complete methodology that provides assistance at all levels in the development. The focus is on the creation of a customer-centric model that is ideal for supporting the complex requirements of customer relationship management.

For project managers there is an entire chapter that provides guidelines on the approach together with:

- Full work breakdown structure (WBS)
- Project team structure
- Skills needed
- The “gotchas”
Acknowledgments

I would like to thank my wife, Chris, for her unflagging support during the past twenty years. Chris has been a constant source of encouragement, guidance, and good counsel. I would also like to thank Malcolm Standring and Richard Getliffe of Hewlett Packard Consulting for inviting me to join their Data Warehousing practice in 1995. Although I was already deeply involved in database systems, the role at HP has opened doors to many new and exciting possibilities. Thank you, Mike Newton and Prof. Pat Hall of the Open University, for your technical guidance over several years.

As far as the book is concerned, thanks are due to Chris, again, for helping to tighten up the grammar. Thanks especially to Jim Meyerson of Hewlett Packard for a rigorous technical review and helpful suggestions. Finally, I am grateful to Jill Pisoni and the guys at Prentice Hall for publishing this work.
Chapter 1. Customer Relationship Management

  THE BUSINESS DIMENSION
  BUSINESS GOALS
  BUSINESS STRATEGY
  THE VALUE PROPOSITION
  CUSTOMER RELATIONSHIP MANAGEMENT
  SUMMARY
THE BUSINESS DIMENSION

First and foremost, this book is about data warehousing. Throughout the book, we will be exploring ways of designing data warehouses with a particular emphasis on the support of a customer relationship management (CRM) strategy. This chapter provides a general introduction to CRM and its major components. Before that, however, we'll take a short detour and review what has happened in the field of data warehousing from a business perspective.

Although data warehousing has received a somewhat mixed reception, it really has captured the imagination of business people. In fact, it has become so popular in industry that it is cited by more than half of information technology (IT) executives as their highest-priority postmillennium project. It has been estimated that, as far back as 1997 (Menefee, 1998), $15 billion was spent on data warehousing worldwide. Recent forecasts (Business Wire, August 31, 1998) expect the market to grow to around $113 billion by the year 2002. A study carried out by the Meta Group (Meyer and Cannon, 1998) found that 95 percent of the companies surveyed intended to build a data warehouse.

Data warehousing is being taken so seriously by the industry that the Transaction Processing Performance Council (TPC), which has defined a set of benchmarks for general databases, introduced an additional benchmark specifically aimed at data warehousing applications, known as TPC/D, followed up in 1999 by further benchmarks (TPC/H and TPC/R). As a further indication of the “coming of age” of data warehousing, a consortium has developed an adjunct to the TPC benchmark called “The Data Warehouse Challenge” as a means of assisting prospective users in the selection of products.

The benefits of building a data warehouse can be significant. For instance, increasing knowledge within an organization about customers' trends and business can provide a significant return on the investment in the warehouse. There are many documented examples of huge increases in revenue and profits as a result of decisions taken based upon information extracted from data warehouses. So if someone asked you the following question:
How many data warehouse projects ultimately are regarded as having failed?
How would you respond? Amazingly, research has shown that it's over 70 percent! This is quite staggering. Why is it happening, and how will we know whether we are being successful? It's all about something we've never really been measured on in the past: business benefit.

Data warehouses are different in quite a few ways from other, let us say traditional, IT projects. In order to explain one of these differences, we have to delve back into the past a little. Another charge that has historically been leveled at the IT industry is that the solution that is finally delivered to the customer, or users, is not the solution they were expecting. This problem was caused by the methods that were used by IT departments and system integration companies. Having identified that there was a problem to be solved, they would send in a team of systems analysts to analyze the current situation, interview the users, and recommend a solution. This solution would then be built and tested by a system development team and delivered back to the users when it was finished. The system development lifecycle that was adopted consisted of a set of major steps:

1. Requirements gathering
2. Systems analysis
3. System design
4. Coding
5. System testing
6. Implementation

The problem with this approach was that each step had to be completed before the next could really begin. It has been called the waterfall approach to systems development, and its other problem was that the whole process was carried out away from the sight of the users, who just continued with their day jobs until, one day, the systems team descended upon them with their new, completed system. This process could have taken anything from six months to two years or more.

So, when the systems team presents the new system to the users, what happens? The users say, “Oh! But this isn't what we need.” And the systems project leader exclaims, “But this is what you asked for!” and it all goes a bit pear-shaped after that. Looking back, the problems and issues are clear to see, but suffice it to say there were always lots of arguments. The users were concerned that their requirements had clearly not been understood or that they had been ignored. The systems team would get upset because they had worked hard, doing their level best to develop a quality system. Then there would be recriminations. The senior user, usually the one paying for the system, would criticize the IT manager. The systems analyst would be interrogated by the IT manager (a throat grip was sometimes involved in this), and eventually they would agree that there had been a communications misunderstanding.

It is likely that the users had not fully explained their needs to the analyst. There are two main reasons for this. The first is that the analyst may have misinterpreted the needs of the users and designed a solution that was simply inappropriate. This happened a lot and was usually the fault of the analyst, whose job it was to ensure that the needs were clearly spelled out. The second reason is more subtle and is due to the fact that businesses change over time as a kind of natural evolution. Simple things like new products in the catalog or people changing roles can cause changes in the everyday business processes. Even if the analyst and the users had not misunderstood each other, there is little hope that, after two years, the delivered system would match the needs of the people who were supposed to use it. Subsequently, the business processes would have to change again in order to accommodate the new system. After a little while, things would settle down, the “teething troubles” would be fixed, and the system would hopefully provide several years of service.

Anyway, switched-on IT managers had to figure out a way of ensuring that the users would not have any reason to complain in the future, and the problem was solved by the introduction of the now famous “system specification,” or simply “system spec.” Depending on the organization, this document had many different names, including system manual, design spec, design manual, system architecture, etc. The purpose of the system spec was to establish a kind of contract between the IT department, or systems integrator, and the users. The system spec contained a full and detailed description of precisely what would be delivered. Each input screen, process, and output was drawn up and included in the document. Both protagonists “signed off” the document, which reflected precisely what the IT department was to deliver and, theoretically at least, also reflected what the users expected to receive. So long as the IT department delivered to the system spec, they could no longer be accused of having ignored or misunderstood the requirements of the users.

So the system spec was a document that was invented by IT as a means of protecting themselves from impossibly ungrateful users. In this respect it was successful, and it is still the cornerstone of most development methods. When data warehouses started to be developed, the developers began using their tried and trusted methodologies to help build them. And why not? This approach of nailing down the requirements has proved to be
successful in the past, at least as far as IT departments were concerned. The problem is that data warehouses are different. Until now, IT development practitioners have been working almost exclusively on streamlining and improving business processes and business functions. These are systems that usually have predefined inputs and, to some extent at least, predefined outputs. We know that is true because the system spec said so. The traditional methods we use are sometimes referred to as “hard” systems development methods. This means that they are used to solve problems that are well defined. But, and this is where the story really starts, the requirements for a data warehouse are never well defined! These are the softer “We think there's a problem but we're not sure what it is” type of issue. It's actually very difficult to design systems to solve this kind of problem, and our familiar “hard” systems approach is clearly inappropriate. How can we write a system specification that nails down the problem and the solution when we can't even clearly define the problem?

Unfortunately, most practitioners have not quite realized that herein lies the crux of the problem and, consequently, the users are often forced to state at least some requirements and sign the inevitable system specification so that the “solution” can be developed. Then, once the document is signed off, the development can begin as normal, the usual systems development lifecycle kicks in, and away we go, folks. The associated risk is that we'll finish up contributing to the 70 percent failure statistic.

Is there a solution? Yes, there is. All it takes is to recognize this soft systems issue and to develop an approach that is sympathetic to it. The original question that was posed near the beginning of this section was “How will we know whether we are being successful?” If the main reason for failure is that we didn't produce sufficient business benefit, then the answer is to focus on the business: on what the business is trying to achieve, not on the requirements of the data warehouse.
BUSINESS GOALS

The phrase “focus on what the business is trying to achieve” refers to the overall business, or the part of the business that is engaging us in this project. This means going to the very top of the organization. Historically, it has been common for IT departments to invest in the building of data warehouses on a speculative basis, assuming that, once in place, the data warehouse will draw business users like bees to a honey pot. While the sentiments are laudable, this “build it and they will come” approach is generally doomed to fail from the start. The reason is that the warehouse is built around information that the IT department, rather than the business, thinks is important.

The main rule is quite simple. If you are hired by the CEO to solve the problem, you have to understand what the CEO is trying to achieve. If you are hired by the marketing director, then you have to find out what drives the marketing director. It all comes down to business goals. Each senior manager in an organization has goals. They may not always be written down. They may not be well known around the organization and, to begin with, even the manager may not be able to articulate them clearly, but they do exist. As data warehouse practitioners, we need some extra “soft” skills and techniques to help our customers express these soft system problems, and we explore this subject in detail in Chapter 5 when we build the conceptual model.

So what is a business goal? Well, it's usually associated with some problem that some executive has to solve. The success or failure of the executive in question may be measured in terms of their ability to solve this problem. Their salary level may depend on their performance in solving this problem, and ultimately their job may depend on it as well. In short, it's the one, two, or three things that sometimes keep them awake at night (jobwise, that is).

How is a business goal defined? Well, it's important to be specific. Some managers will say things like “We need to increase market share” or “We'd like to increase our gross margin” or maybe “We have to get customer churn down.” These are not bad for a start, but they aren't specific enough. Look at this one instead: “Our objective is to increase customer loyalty by 1 percent each year for the next five years.” This is a real goal from a real company, and it's perfect. The properties of a good business goal are that it should be:

1. Measurable
2. Time bounded
3. Customer oriented

This helps us to answer the question of how we'll know we've been successful. The managers will know whether they have been successful if they hit their measured goal targets within the stated time scale.

Just a point about number three on the list: it is not an absolute requirement, but it is a good check. There is a kind of edict inside Hewlett Packard that goes like this: “If you aren't doing it for a customer, don't do it!” In practice most business goals, in my experience, are customer oriented. Generally, as businesses we want to:

- Get more good customers
- Keep our better customers
- Maybe offload our worst customers
- Sell more to customers

People have started to wake up to the fact that the customer is king. It's the customer we have to identify, convince, and ultimately satisfy if we are to be really successful. It's not about products or efficient processes, although these things are important too. Without the customer we might as well stay at home. Anyway, once we know what a manager's goals are, we can start to talk strategy.
BUSINESS STRATEGY

So now we know our customer's goals. The next step in ensuring success is to find out, for each goal, exactly how they plan to achieve it. In other words, what is their business strategy? Before we begin, let's synchronize terminology. There is a risk at this point of sinking into a semantic discussion on what is strategic and what is tactical. For our purposes, a strategy is defined as one or more steps to be employed in pursuit of a business goal. After a manager has explained the business goal, it is reasonable to then follow up with the question “And what is your strategy for achieving this goal?”
THE VALUE PROPOSITION

Every organization, from the very largest down to the very smallest, has a value proposition. A company's value proposition is the thing that distinguishes its business offering from all the others in the marketplace. Most senior managers within an organization should be able to articulate their value proposition, but often they cannot. It is helpful, when dealing with these people, to discuss their business; anything they do should be in some way relevant to the overall enhancement of the value proposition of their organization. It is generally accepted that the value proposition of every organization falls into one of three major categories of value discipline (Treacy and Wiersema, 1993): customer intimacy, product leadership, and operational excellence. We'll just briefly examine these.
Customer Intimacy

We call this the customer intimacy discipline because companies that operate this type of value proposition really do try to understand their individual customers' needs and will try to move heaven and earth to accommodate their customers. For instance, in the retail clothing world, a bespoke tailor will know precisely how their customers like to have their clothes cut. They will specially order in the types and colors of fabric that the customer prefers and will always deal with the customer on a one-to-one, personal basis. These companies are definitely not cheap. In fact, their products are usually quite expensive, and this is because personal service is an expensive commodity. It is expensive because it usually has to be administered by highly skilled, and therefore expensive, people. However, their customers prefer to use them because they feel as though they are being properly looked after and their lives are sufficiently enriched to justify the extra cost.
Product Leadership

The product leaders are the organizations that could be described as “leading edge.” Their value proposition is that they can keep you ahead of the pack. This means that they are always on the lookout for new products and new ideas that they can exploit to keep their customers interested and excited. Technology companies are an obvious example of this type of organization, but they exist in almost every industry. Just as with the bespoke service, there is an example in the retail fashion industry. The so-called designer label clothes are a good example of the inventor type of value proposition. The people who love to buy these products appreciate the “chic-ness” bestowed upon them. Another similarity with the customer intimate service is that these products also tend to be very expensive. A great deal of research and development often goes into the production of these products, and the early adopters must expect to pay a premium.
Operational Excellence

This type of organization excels at operational efficiency. They are quick, efficient, and usually cheap. Mail order companies that offer big discounts and guaranteed same-day or next-day delivery fall into this category. They have marketing slogans like “It's on time or it's on us!” If you need something in a hurry and you know what you want, these are the guys who deliver. Don't expect a tailor-made service or much in the way of after-sales support, but do expect the lowest prices in town. Is there a fashion industry equivalent? Well, there have always been mail order clothes stores. Even some of the large department stores, if they're honest with themselves, would regard themselves as operationally efficient rather than being strong on personal service or product leadership.

So are we saying that all companies must somehow be classified into one of the three groups? Not exactly, but all companies tend to have a stronger affinity to one of the three categories than to the other two, and it is important for an organization to recognize where its strengths lie. The three categories have everything to do with the way in which the organization interacts routinely with its customers. It is just not possible for a company that majors on operational excellence to become a product leader or to provide a bespoke service without a major change in its internal organization and culture.

Some companies are very strong in two of the three categories, while others are working hard toward this. Marks and Spencer is a major, successful retail fashion company. Traditionally its products are sold through a branch network of large department stores all over the world. It also has a growing mail order business. That would seem to place it pretty squarely in the operational excellence camp. However, recently it has opened up a completely new range of products, called “Autograph,” that is aimed at providing a bespoke service to customers. Large areas of its biggest stores are being turned over to this exciting new idea. So here is one company that has already been successful with one value proposition, aiming to achieve excellence in a second. Oddly enough, the Autograph range of products has been designed, in part, by established designers, and so it might even claim to be nibbling at the edges of the product leadership category, too!
The point is this: an organization needs to understand:

1. How it interacts with its customers
2. How it would like to interact with its customers

You can then start to come up with a strategy to help to improve your customer relationship management.
CUSTOMER RELATIONSHIP MANAGEMENT

The world of business is changing rapidly as never before. Our customers are better informed and much more demanding than ever. The loyalty of our customers is something that can no longer be taken for granted, and the loss of customers, sometimes known as customer churn, is a subject close to the heart of most business people today. It is said that it can cost up to 10 times as much to recruit a new customer as it does to retain an existing customer.

The secret is to know who our customers are (you might be surprised how many organizations don't) and what it is that they need from us. If we can understand their needs, then we can offer goods and services to satisfy those needs, and maybe even go a little further and start to anticipate their needs so that they feel cared for and important. We need to think about finding products for our customers instead of finding customers for our products. Every business person is very keen to advance their share of the market and turn prospects into customers, but we must not forget that each of our customers is on someone else's list of hot prospects. If we do not satisfy their needs, there are many business people out there who will. The advent of the Internet intensifies this problem; our competitors are now just one mouse click away!

And the competition is appearing from the strangest places. As an example, U.K. supermarkets traditionally sold food and household goods, maybe a little stationery and odds and ends. The banks and insurance companies were shocked when the supermarket chains started offering a very credible range of retail financial services, and it hasn't stopped there. Supermarkets now routinely sell:

- Mobile phones
- White goods
- Personal computers
- Drugs
- Designer clothes
- Hi-fi equipment

The retail supermarket chains are ideally placed to penetrate almost all markets when products or services become commodity items, which, eventually, they almost always will. They have excellent infrastructure and distribution channels, not to mention economies of scale that most organizations struggle to compete with.
They also have something else: customers. The one-stop shop service, added to the extremely competitive prices offered by supermarkets, is irresistible to many customers, and as a result they are tending to abandon the traditional sources of these goods.

The message is clear. No organization can take its customers for granted. Any business executive who feels complacent about their relationship with their customers is likely to be heading for a fall.
So What Is CRM?

Customer relationship management is a term that has become very popular, and many businesses are investing heavily in this area. What do we mean by CRM? Well, it's really a concept, a sort of cultural and attitudinal thing. However, in order to enable us to think about it in terms of systems, we need to define it. A working definition is:

CRM is a strategy for optimizing the lifetime value of customers.

Sometimes, CRM is interpreted as a soft and fluffy, cuddly sort of thing where we have to be excessively nice to all our customers and then everything will become good. This is not the case at all. Of course, at the level of the customer-facing part of our organization, courtesy, honesty, and trustworthiness are qualities that should be taken for granted. However, we are in business to make a profit, and our management and shareholders will try to see to it that we do. Studies by the First Manhattan Group have indicated that while 20 percent of a bank's customers contribute 150 percent of the profits, 40 to 50 percent of customers eliminate 50 percent of the profits. Similar studies reveal the same information in other industries, especially telecommunications. The graph in Figure 1.1 shows this. Notice too that the best (i.e., most profitable) customers are twice as likely to be tempted away to other suppliers as the average customer.

So how do we optimize the lifetime value of customers? It is all about these two things:

1. Getting to know our customers better
2. Interacting appropriately with our customers

Figure 1.1. Just who are our best customers?
During the ordinary course of business, we collect vast amounts of information about customers that, if properly analyzed, can provide a detailed insight into the circumstances and behavior of our customers. As we come to comprehend their behavior, we can begin to predict it and perhaps even influence it a little.

We are all consumers. We all know the things we like and dislike. We all know how we would like to be treated by our suppliers. Well, surprise, surprise, our customers are just like that, too! We all get annoyed by blanket mail shots that have no relevance to us. Who has not been interrupted during dinner by an indiscriminate telephone call attempting to interest us in UPVC windows or kitchen makeovers? Organizations that continue to adopt this blanket approach to marketing do not deserve to succeed and, in the future, they and their methods will disappear.

Our relationship with our customers has to be regarded more in terms of a partnership where they have a need that we can satisfy. If we show real interest in our customers and treat them as the unique creatures they are, then the likelihood is that they will be happy to remain as customers. Clearly, a business that has thousands or even millions of customers cannot realistically expect to have a real personal relationship with each and every one. However, the careful interpretation of information that we routinely hold about customers can drive our behavior so that the customer feels that we understand their needs. If we can show that we understand them and can satisfy their needs, then the likelihood is that the relationship will continue. What is needed is not blanket marketing campaigns but targeted campaigns directed precisely at those customers who might be interested in the products on offer. The concept of personalized marketing, sometimes called “one-to-one” marketing, epitomizes the methods that are now starting to be employed generally in the marketplace, and we'll be looking at this and other components of CRM in the following sections.

It is well known that we are now firmly embedded in the age of information. As business people we have much more information about all aspects of our business than ever before. The first-generation data warehouses were born to help us capture, organize, and analyze this information so that we can make decisions about the future based on past behavior. The idea was that we would identify the data that needed to be gathered from our operational business systems, place it into the warehouse, and then ask questions of the data in order to derive valuable information. It will become clear, if indeed it is not already clear, that, in almost all cases, a data warehouse provides the foundation of a successful CRM strategy.

CRM is partly a cultural thing. It is a service that fits as a kind of layer between our ordinary products and services and our customers. It is the CRM culture within an organization that leads to our getting to know our customers better, by the accumulation of knowledge. Equally, the culture enables the appropriate interactions to be conducted between our organization and our customers. This is shown in Figure 1.2.

Figure 1.2. CRM in an organization.
Figure 1.2 shows how the various parts of an organization fit together to provide the information we need to understand our customers better and the processes to enable us to interact with our customers in an appropriate manner. The CRM culture, attitudes, and behaviors can then be built on top of this, hopefully enhancing our customers' experiences in their dealings with us.

In the remainder of this section, we will explore the various aspects of CRM. Although this book is intended to help with the design of data warehouses, it is important to understand the business dimension. This section should be enough to help you understand the business imperatives around CRM, and you should regard it as a kind of “rough guide” to CRM. The pie diagram in Figure 1.3 shows the components of CRM.

Figure 1.3. The components of CRM.
As you can see in Figure 1.3, there are many slices in the CRM pie, so let us review some of the major ones:
Customer Loyalty and Customer Churn

Customer loyalty and customer churn are just about the most important issues facing most businesses today, especially businesses that have vast numbers of customers. This is a particular problem for:

- Telecommunications companies
- Internet service providers
- Retail financial services
- Utilities
- Retail supermarket chains
First, let us define what we mean by customer churn. In simple terms, it relates to the number of customers lost to competitors over a period of time (usually a year). These customers are said to have churned. All companies have a churn metric: the number of customers lost, expressed as a percentage of the total number of active customers at the beginning of the period. So if the company had 1,000 active customers at the beginning of the year and during the year 150 of those customers defected to a competitor, then the churn metric for the company is 15 percent. Typically, the metric is calculated each month on a rolling 12-month moving average.

Some companies operate a kind of “net churn” metric. This is simply 100 minus the number of active customers at the end of the period expressed as a percentage of the number of active customers at the beginning of the period. So if the company starts the period with 1,000 customers and ends the period with 920 customers, then the calculation is:
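A reconstruction of the calculation, consistent with the example figures here and with the observation below that net churn goes negative when more customers are recruited than lost:

\[
\text{net churn} = 100 - \left(\frac{920}{1000} \times 100\right) = 100 - 92 = 8~\text{percent}
\]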
This method is sometimes favored by business executives for two reasons:

1. It's easy to calculate. All you have to do is count the number of active customers at the beginning and end of the period. You don't have to figure out how many active customers you've lost and how many you've recruited. These kinds of numbers can be hard to obtain, as we'll see in the chapters on design.

2. It hides the truth about real churn. The number presents a healthier picture than the reality. Also, with this method it's possible to end up with negative churn if you happen to recruit more customers than you lose.

Great care must be taken when considering churn metrics. Simple counts are okay as a guide but, in themselves, they reveal nothing about your organization, your customers, or your relationship with your customers. For instance, in describing customer churn, the term active was used several times to describe customers. What does this mean? The answer might sound obvious but, astonishingly, in most organizations it's a devil of a job to establish which customers are active and which are not.

There are numerous reasons why a customer might defect to another supplier. These reasons are called churn factors. Some common churn factors are (a small code sketch of the two churn calculations follows this list):
- The wrong value discipline. You might be providing a customer-intimate style of service. However, customers who want quick delivery and low prices are unlikely to be satisfied with the level of service you can provide in this area. Customers are unlikely to have the same value discipline requirements for all the products and services they use. For instance, one customer might prefer a customer intimacy type of supplier to do their automobile servicing but would prefer an efficient and inexpensive supplier to service their stationery orders.

- A change in circumstances. This is a very common cause of customer churn. Your customer might simply move out of your area. They may get a new job with a much bigger salary and want to trade up to a more exclusive supplier. Maybe they need to make economies and your product is one they can live without.

- Bad experience. Usually, one bad experience won't make us go elsewhere, especially if the relationship is well established. Continued bad experiences will almost certainly lead to customer churn. Bad experiences can include unkept promises, phone calls not returned, poor quality, brusque treatment, etc. It is important to monitor complaints from customers to measure the trends in bad experiences, although the behavior of customers in this respect varies from one culture to another. For instance, in the United Kingdom, it is uncommon for people to complain. They just walk.

- A better offer. This is where your competitors have you beat. These days it is easy for companies in some industries to leapfrog each other in the services they provide and, equally, it is easy for customers to move freely from one supplier to another. A good example of this is the prepay mobile phone business. When it first came out, it was attractive because there was no fixed contract, but it placed restrictions on the minimum number of calls you had to make in any one period, say, $50 per quarter. As more vendors entered this market, the restriction was driven down and down until it got to the stage where you only have to make one call in six months!

So how can you figure out which of your customers are active, which are not, and which you are at the most risk of losing? What you need is customer insight!
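To make the churn arithmetic above concrete, here is a minimal sketch in Python. It is purely illustrative: the function names and the final example figure of 1,080 customers are invented for this sketch, while the other figures come from the worked examples earlier in this section.

```python
def churn_rate(customers_lost: int, active_at_start: int) -> float:
    """Simple churn: customers lost, as a percentage of the active
    customers at the beginning of the period."""
    return customers_lost * 100 / active_at_start


def net_churn_rate(active_at_start: int, active_at_end: int) -> float:
    """Net churn: 100 minus end-of-period actives as a percentage of
    start-of-period actives. Negative when recruits outnumber losses."""
    return 100 - active_at_end * 100 / active_at_start


print(churn_rate(150, 1000))        # 15.0  (the 15 percent example)
print(net_churn_rate(1000, 920))    # 8.0   (the net churn example)
print(net_churn_rate(1000, 1080))   # -8.0  (negative churn: a net gain)
```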
Customer Insight

In order to be successful at CRM, you simply have to know your customers. In fact, some people define CRM as precisely that—knowing your customers. Actually, it is much more
than that, but if you don't know your customers, then you cannot be successful at CRM. It's obvious really, isn't it? In order for any relationship to be successful, both participants have to have some level of understanding so that they can communicate successfully and in a way that is rewarding for both parties.

Notice the use of the term both. Ultimately, a relationship consists of two people. Although we can say that we have a relationship with our friends, what we are actually saying is that we have a relationship with each of our friends, and that results in many relationships. Each one of those relationships has to be initiated and developed, sometimes with considerable care. We invest part of ourselves, sometimes a huge amount, into maintaining the relationships that we feel are the most important. In order for a relationship to be truly successful, it has to be built on a strong foundation, and that usually means knowing as much as we can about the other person. The more we know, the better we can be at responding to the other person's needs. Each of the parties involved has to get some return from their investment in order to justify continuing with it.

The parallels here are obvious. Customers are people, and if we are to build a sustained relationship with them, we have to make an investment. Whereas with personal relationships the investments we make are our time and emotion, and usually we do not measure these, business relationships involve time and money, and we can and should measure them. The purpose of a business relationship is profit. If we cannot profit from the business relationship, then we must consider whether the relationship is worth the investment.

So how do we know whether our relationship with a particular customer is profitable? It's all tied in with the notion of customer insight—knowing about our customers. Whereas our knowledge regarding our friends is mostly in our heads, our knowledge about customers is generally held in stored records. Every order, payment, inquiry, and complaint is a piece of a jigsaw puzzle that, collectively, describes the customer. If we can properly manage this information, then we can build a pretty good picture of our customers: their preferences and dislikes, their behavioral traits, and their personal circumstances. Once we have that picture, we can begin to develop customer insight. Let's look at some of the uses of customer insight, starting with segmentation.
Segmentation

Terms like knowing our customers and customer insight are somewhat abstract. As previously stated, a large company may have thousands or even millions of customers, and it is not possible to know each one personally in the same way as our friends and family. A proper relationship in that sense is not a practical proposition for most
organizations. It has been tried, however. Some banks, for instance, operate "personal" account managers. These managers are responsible for a relatively small number of customers, and their mission is to build and maintain the relationships by getting to know the customers in their charge. However, this is not a service that is offered to all customers. Mostly, it applies to the more highly valued customers. Even so, this service is expensive to operate and is becoming increasingly rare.

Our customers can be divided into categorized groups called segments. This is one way of getting to know them better. We automatically divide our friends into segments, perhaps without realizing that we are doing it. For instance, friends can be segmented as:

Males or females
Work mates
Drinking buddies
Football supporters
Evening classmates
Long-standing personal friends

Clearly, there are many ways of classifying and grouping people. The way in which we do it depends on our associations and our interests. Similarly, there are many ways in which we might want to segment our customers and, as before, the way we choose to do it depends on the type of business we are engaged in. There are three main types of segmentation that we can apply to customers: the customer's circumstances, their behavior, and derived information. Let's have a brief look at these three.

Circumstances
This is the term that I use to describe those aspects of the customer that relate to their personal details. Circumstances are the information that defines who the customer is. Generally speaking, this type of information is specific to the customer and has nothing to do with our relationship with them. It is the sort of information that any organization might wish to hold. Some obvious elements included in customer circumstances are:

Name
Date of birth
Sex
Marital status
Address
Telephone number
Occupation

A characteristic of circumstances is that they are relatively fixed. Some IT people might refer to this type of information as reference data, and they would be right. It is just that circumstances is a more accurate layperson's description. Some less obvious examples of circumstances are:

Hobbies
Ages of children
Club memberships
Political affiliations

In fact, there is almost no limit to the amount of information relating to circumstances that you can gather if you feel it would be useful. In the appendix to this book, I have included several hundred examples that I have encountered in the past. One retail bank, for some reason, would even like to know whether the customer's spouse keeps indoor potted plants.

Describing this type of information as relatively fixed means that it may well change. The reality is that some things will change and some will not; this type of data tends to change slowly over time. For instance, official government research shows that, in the United Kingdom, approximately 10 percent of people change their address each year. A date of birth, however, will never change unless it is to correct an error.

Theoretically, each of the data elements that we record about a customer could be used for segmentation purposes. It is very common to create segments of people by:

Sex
Age group
Income group
Geography

Behavioral Segmentation
Whereas a customer's circumstances tend not to relate to the relationship between us, a customer's behavior relates to their interaction with our organization. Behavior encompasses all types of interaction, such as:

Purchases—the products or services that the customer has bought from us.
Payments—payments made by the customer.
Contacts—where the customer has written, telephoned, or communicated in some other way. This might be an inquiry about a product, a complaint about service, perhaps a request for assistance, etc.

The kinds of segmentation that could be applied to this aspect of the relationship include:

Products purchased or groups of products. For instance, an insurance company might segment customers into major product groups such as pensions, motor insurance, household insurance, etc.
Spending category. Organizations sometimes segment customers by spending bands.
Category of complaint.

Derived Segmentation
The previous types of segmentation, relating to a customer's circumstances and their behavior, are quite straightforward to achieve because they require no interpretation of data about the customer. For instance, if you wish to segment customers by the amount of money they spend with you, then it is a simple matter of adding up the value of all the orders placed over the period in question, and the appropriate segment for any particular customer is immediately obvious.
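As an illustration of how straightforward this is, a spending-band segmentation might be computed along the following lines. This is a sketch only: the CustomerOrder table, its CustomerCode column, and the band boundaries are assumed for the example.

Select CustomerCode,
       Case
         When Sum(TotalCost) >= 1000 Then 'High spender'
         When Sum(TotalCost) >= 250  Then 'Medium spender'
         Else 'Low spender'
       End As Spending_Segment
From CustomerOrder
Where OrderDate >= '2001-01-01'
And OrderDate < '2002-01-01'
Group By CustomerCode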
Very often, however, we need to segment our customers in ways that require significant manipulation and interpretation of information. Such segmentation classifications may be derived from the customer's circumstances, their behavior, or a combination of both. As this book is essentially intended to assist in the development of data warehouses for CRM, we will return to the subject of derived segmentation quite frequently in the upcoming chapters. Some examples of derived segmentation are:
Lifetime value. Once we have recorded a representative amount of circumstantial and behavioral information about a customer, we can use it in models to assist in predicting future behavior and future value. For instance, there is a group of young people that have been given the label “young plastics” (YP). The profile of a YP is someone who has recently graduated from college and is now embarking on the first few years of their working life. They have often just landed their first job and are setting about securing the trappings of life in earnest. Their adopted lifestyle usually does not quite match their current earnings, and they are often debt laden, becoming quite dependent on credit. At first glance, these people do not look like a good proposition upon which to build a business relationship. However, their long-term prospects are, statistically, very good. Therefore, it may be more appropriate to consider the potential lifetime value of the relationship and “cut them some slack,” which means developing products and services that are designed for this particular segment of customers.
Propensity to churn. We have already talked about the problem of churn earlier in this chapter. If we can assess our customers in such a way as to grade them on a scale of, say, 1 to 10 where 1 is a safe customer and 10 is a customer who we are at risk of losing, then we would be able to modify our own behavior in our relationship with the customer so as to manage the risk.
Up-sell and cross-sell potential. By carefully analyzing customers' behavior, and possibly combining segments, it is possible to derive models of future behavior and even potential behavior. Potential behavior is a characteristic we all have; the term describes behavior we might engage in, given the right stimulus. Advertising companies stake their very existence on this. It enables us to identify opportunities to sell more products and services (up-selling) and different products and services (cross-selling) to our customers.
Entanglement potential. This is related to up-sell and cross-sell, and it applies to customers who might be encouraged to purchase a whole array of products from us. Over time, it can become increasingly difficult and bothersome for the customer to disentangle the relationship and go elsewhere. The retail financial services industry is good at this. The bank that manages our checking account encourages us to pay all our bills out of the account automatically each month. This is good because it means we don't have to remember to write the checks. Also, if the bank can persuade us to buy our house insurance, contents insurance, and maybe even our mortgage from it too, then it becomes a real big deal if we want to transfer, say, the checking account to another bank. "Householding" is another example of this and, again, it's popular with the banks. It works by enticing all the members of a family to participate in a particular product or service so that, collectively, they all benefit from special deals such as a reduced interest rate on overdrawn accounts. If any of the family members withdraws from the service, then the deal can be revoked. This, of course, has an effect on the rest of the family, and so there is an inducement to remain loyal to the bank.

Sometimes there are relationships between different behavioral components that would not be spotted by analysts and knowledge workers. In order to uncover these relationships, we have to employ different analytic techniques. The best way is to employ the services of a data mining product. The technical aspects of data mining will be described later in this book, but it is important to recognize that there are other ways of defining segments. As a rather obvious example, we can all comprehend the relationship between, say, soft fruit and ice cream, but how many supermarkets place soft fruit and ice cream next to each other? If there is a statistically significant relationship such that customers who purchase soft fruit also purchase ice cream, then a data mining product would be able to detect it. As I said, this is an obvious example, but there will be others. For instance, is there a relationship between:

Vacuum cleaner bags and dog food?
Toothpaste and garlic?
Diapers and beer? (Surely not!)

As we have seen, there are many ways in which we can classify our customers into segments. Each segment provides us with opportunities that might be exploited. Of course, it is the business people who must decide whether the relationships are real or merely coincidental.
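Counting a suspected co-occurrence is easy once the question has been thought of; the hard part, which data mining automates, is thinking of the question in the first place. A minimal sketch for the soft fruit and ice cream example, assuming a hypothetical Purchase table with one row per customer per product group:

Select Count(Distinct a.CustomerCode) As Bought_Both
From Purchase a, Purchase b
Where a.CustomerCode = b.CustomerCode
And a.ProductGroup = 'Soft fruit'
And b.ProductGroup = 'Ice cream'

Comparing this count with the number of customers who buy each group on its own indicates whether the association is statistically interesting.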
Once we have identified potential opportunities, we can devise a campaign to help us persuade our target customers of the benefits of our proposal. In order to do this, it might be helpful to employ another facet of CRM: campaign management.
Campaign Management

As I see it, there are essentially two approaches to marketing: defensive marketing and aggressive marketing. Defensive marketing is all about keeping what we already have. For instance, a strategy for reducing churn in our customers could be regarded as defensive because we are deploying techniques to keep our customers from being lured away by the aggressive marketing of our competitors. Aggressive marketing, therefore, is about getting more. By "more" we could be referring to:

Capturing more customers
Cross-selling to existing customers
Up-selling to existing customers

A well-structured strategy involving well-organized and managed campaigns is a good example of aggressive marketing. The concept of campaigns is quite simple and well known to most of us; we have all been the target of marketing campaigns at some point. There are three types of campaign:
Single-phase campaigns are usually one-off special offers. The company makes the customer, or prospect, an offer (often called a treatment), and if the customer accepts the offer (referred to as the response), then the campaign, in that particular case, can be regarded as having been successful. An example of this would be three months' free subscription to a magazine. The publishing company is clearly hoping the customer will enjoy the magazine enough to pay for a subscription at the end of the free period.
Multi-phase campaigns, as the name suggests, involve a set of treatments instead of just one. The first treatment might be the offer of a book voucher or store voucher if the customer, say, visits a car dealer and accepts a test drive in a particular vehicle. This positive response is recorded and may be followed up by a second treatment. The second treatment could be the offer
to lend that customer the vehicle for an extended period, such as a weekend. A positive response to this would result in further treatments that, in turn, provoke further responses, the purpose ultimately being to increase the sales of a particular model.
Recurring campaigns run continually. For example, if the customer is persuaded to buy the car, then shortly afterward they can expect to receive a "welcome pack" that makes them feel they have joined some exclusive club.

Campaigns can be very expensive to execute. Typically, they are operated under very tight budget constraints and are usually expected to show a profit. This means that the cost of running the campaign is expected to be recouped out of the extra profits made by the company as a result of the campaign. The problem is: how do you know whether the sale of your product was influenced by the campaign? The people who ended up buying the product might have done so without being targeted in the campaign. The approach that most organizations adopt in order to establish the efficacy of a campaign is to identify a "control" group, in much the same way as is done in clinical trials and other scientific experiments. The control group is identified as a percentage of the target population. This group receives no special treatments. At the end of the campaign, the two groups are compared. If, say, 2 percent of the control group purchase the product and 5 percent of the target group purchase the product, then it is assumed that the 3 percent difference was due to the influence of the campaign. The box on the following page shows how it's figured out.

One of the big problems with campaigns, as you can see, is the minuscule responses. If the response in our example had been 4 percent instead of 5, the campaign would have shown a loss of $19,000 instead of the healthy profit we actually got. The line between profit and loss is indeed a fine one to tread. It seems obvious that careful selection of the target population is critical to success. We can't just blitz the marketplace with a carpet-bombing approach. It's all about knowing, or having a good idea about, who might be in the market to buy a car. If you think about it, this is the most important part. And the scary thing is, it has nothing to do with campaign management. It has everything to do with knowing your customers. Campaign management systems are an important component of CRM, but the most important part, identifying which customers should be targeted, is outside the scope of most campaign management systems.
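The box itself is not reproduced in this extract, but the style of the calculation can be sketched with invented figures (they do not match the book's $19,000 example): 100,000 customers targeted, $4 per treatment, $150 profit per sale, a 5 percent target-group response, and a 2 percent control-group response. A Select with no From clause is accepted by many, but not all, SQL dialects.

-- Uplift attributed to the campaign = 5% - 2% = 3%.
Select (0.05 - 0.02) * 100000 * 150   -- extra profit:   $450,000
     - 100000 * 4                     -- treatment cost: $400,000
       As Campaign_Profit             -- net result:      $50,000

Rerun the same calculation with a 4 percent response and the uplift halves: 0.02 * 100,000 * 150 = $300,000 against the same $400,000 cost, a loss of $100,000. A one-point swing in response turns a comfortable profit into a heavy loss, which is exactly the fine line described above.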
It would be really good if we could make our campaigns a little more tailored to the individual people on our target list. If the campaign had a target of one, instead of thousands, and we could be sure that this one target is really interested in the product, our chances of success would be far greater.
Personalized Marketing

Personalized marketing is sometimes referred to as a "segment of one" or "one-to-one marketing" and is the ultimate manifestation of having gotten to know our customers. Ideally, we know precisely:

What they need
When they need it
How much they are willing to pay for it
Then, instead of merely having a set of products that we try to sell to customers, we can begin to tailor our offerings more specifically to them. Some of the Internet-based companies are beginning to develop applications that recognize customers as they connect to the site and present them with "content" that is likely to be of interest to them. Different customers, therefore, will see different things on the same Web page depending on their previous visits. This means that the customer need never be presented with material that is of no interest to them, which is a major step forward in responding to customers' needs. The single biggest drawback is that this approach is currently limited to a small number of Internet-based companies. However, as other companies shift their business onto the Internet, the capability for this type of solution will grow. It is worth noting that the success of these types of application depends entirely on information. It does not matter how sophisticated the application is; it is the information that underpins it that will determine success or failure.
Customer Contact

One of the main requirements in the implementation of our CRM strategy is to get to know our customers, and to do this we recognize the value of information. However, the information that we use tends to be collected in the routine course of daily business. For instance, we routinely store data regarding orders, invoices, and payments so that we can analyze the behavior of customers. Yet in virtually all organizations there exists a rich source of data that is often discarded. Each time a customer, or prospective customer, contacts the organization in any way, we should be thinking about the value of the information that could be collected and used to enhance our knowledge about the customer. Let's consider for a moment the two main types of contact that we encounter every day:

1. Enquiries. Every time an existing customer or prospect makes an inquiry about our products or services, we might reasonably conclude that they may be interested in purchasing that product or service. How many companies keep a managed list of prospects as a result? If we did, we would have a ready-made list for personalized campaign purposes.

2. Complaints. Customers complain for many reasons, usually good ones, and most quality companies have a purpose-built system for managing complaints. However, some customers are "serial" moaners, and it would be good to know, when faced with people like this, what segments they are
classified under and whether they are profitable or loss making. If the operators dealing with the communications had access to this information, they could respond appropriately. Remember that appropriate interaction with customers is the second main requirement in the implementation of our CRM strategy.

When we consider customer contact, we tend to think automatically of telephone contact. These days, however, there is a plethora of different media that a customer might use to contact us. Figure 1.4 shows the major channels of contact information.

Figure 1.4. The number of communication channels is growing.
There is quite a challenge in linking all these channels together into a cohesive system for the collection of customer contact information. There is a further point to this: each customer contact costs money to deal with. Remember that the overall objective of a CRM strategy is to optimize the value of a customer. Therefore, the cost of dealing with individual customers should be taken into account if we are to assess the value of customers accurately. Unfortunately, most organizations aren't able to figure out the cost of an individual customer. What tends to
happen is that they sum up the total cost of customer contact, divide it by the total number of customers, and use the result as the customer cost. This arbitrary approach is fine for general information, but it is unsatisfactory in a CRM system, where we really do want to know which individual customers cost us money to service.
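If contacts were captured with their handling costs, the per-customer figure would be simple to produce. A minimal sketch, assuming a hypothetical Contact table with one row per contact; the table and its columns are illustrative and not part of any schema shown in this book:

Select CustomerCode,
       Count(*) As Contacts,
       Sum(HandlingCost) As Total_Contact_Cost
From Contact
Where ContactDate >= '2001-01-01'
And ContactDate < '2002-01-01'
Group By CustomerCode
Order By Total_Contact_Cost Desc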
SUMMARY

Different people clearly have different perspectives of the world; six experts will give six different views as to what is meant by customer relationship management. In this chapter I have expressed my view as to what CRM means, its definition, and why it's important to different types of businesses. We have also explored the major components of CRM and looked at ways in which the information that we routinely hold about customers might be used to support a CRM strategy. This book, essentially, is about data warehousing. Now we have to figure out how to design a data warehouse that will support the kind of questions that we need to ask about customers in order to be successful at CRM.
Chapter 2. An Introduction to Data Warehousing

INTRODUCTION
WHAT IS A DATA WAREHOUSE?
DIMENSIONAL ANALYSIS
BUILDING A DATA WAREHOUSE
PROBLEMS WHEN USING RELATIONAL DATABASES
SUMMARY
INTRODUCTION

In this chapter we provide an introduction to data warehousing. It is sensible, as a starting point, to introduce data warehousing using "first generation" principles so that we can then go on to explore the issues and develop a "second generation" architecture. Data warehousing belongs to a branch of a general business subject known as decision support. So, in order to understand what data warehousing is all about, we must first understand the purpose of decision support systems (DSS) in general. Decision support systems have existed, in different forms, for many years. Long before the invention of any form of database management system (DBMS), information was being extracted from applications to assist managers in the more effective running of their organizations.
So what is a decision support system? The purpose of a decision support system is to provide decision makers in organizations with information. The information advances the decision makers' knowledge in some way so as to assist them in making decisions about the organization's policies and strategy. A DSS tends to have the following characteristics:

They tend to be aimed at the less well structured, underspecified problems that more senior managers typically face.
They possess capabilities that make them easy for noncomputer people to use interactively.
They are flexible and adaptable enough to accommodate changes in the environment and in the decision-making approach of the user.

The job of a DSS is usually to provide a factual answer to a question phrased by the user. For instance, a sales manager would probably be concerned if her actual product sales were falling short of the target set by her boss. The question she would like to be able to ask might be:
Why are my sales not meeting my targets?

There are, as yet, no computer systems available to answer such a question. Imagine trying to construct an SQL (Structured Query Language) query that did that! Her questioning has to be more systematic, such that the DSS can give factual responses. So the first question might be:
For each product, what are the cumulative sales and targets for the year?

A DSS would respond with a list of products and the sales figures. It is likely that some of the products are ahead of target and some are behind. A well-constructed report might highlight the offending products to make them easier to see; for instance, they could be displayed in red, or flashing. She could have asked:
What are the cumulative sales and targets for the year for those products where the actual sales are less than the target?

Having discovered those products that are not achieving the target, she might ask what the company's market share is for those products, and whether the market share is decreasing. If it is, maybe it's due to a recently imposed price rise. The purpose of the DSS is to respond to ad hoc questions like these, so that the user can ultimately come to a conclusion and make a decision.

A major constraint in the development of DSS is the availability of data—that is, having access to the right data at the right time. Although the proliferation of database systems, and the proper application of the database approach, enables us to separate data from applications and provides for data independence, the provision of data still represents a challenge. The introduction of sophisticated DBMSs has certainly eased the problems caused by traditional applications but, nonetheless, unavailability of data persists as a problem for most organizations. Even today, data remains "locked away" in applications. The main reason is that most organizations evolve over time. As they do, the application systems increasingly fail to meet the functional requirements of the organization. As a result, the applications are continually modified in order to keep up with the ever-changing business. There comes a time in the life of almost every application when it has been modified to the point where further modification becomes impossible or impractical. At this point a decision is usually made to redevelop the application. When this happens, it is usual for the developers to take advantage of whatever improvements in technology have occurred during the life of the application. For instance, the original
application may have used indexed sequential files because this was the most appropriate technology of the day. Nowadays, most applications obtain their data through relational database management systems (RDBMS). However, most large organizations have dozens or even hundreds of applications. These applications reach the end of their useful lives at various times and are redeveloped on a piecemeal basis. This means that, at any point in time, an organization is running applications that use many different types of software technology. Further, large organizations usually have their systems on diverse hardware platforms. It is very common to see applications in a single company spread over the following:

A large mainframe
Several mid-range multiprocessor machines
External service providers
Networked and stand-alone PCs

A DSS may need to access information from many of these applications in order to answer the questions being put to it by its users.
Introduction to the Case Study

To illustrate the issues, let us examine the operation of a fictitious organization that contains some of the features just described. The organization is a mail order wine club. With great originality, it is called the Wine Club. As well as its main products (wines), it also sells accessories to wines such as:

Glassware—goblets, decanters, glasses, etc.
Tableware—ice buckets, corkscrews, salvers, etc.
Literature—books and pamphlets on wine-growing regions, reviews, vintages, etc.

It has also recently branched out further into organizing trips to special events such as the Derby, the British Formula One Grand Prix, and the Boat Race. These trips generally involve the provision of a marquee in a prominent position with copious supplies of the
club's wines and a luxury buffet meal. These are mostly one-day events, but there is an increasing number of longer trips, such as coach tours that take in the French wine-growing regions.

The club's information can be modeled by an entity attribute relationship (EAR) diagram. A high-level EAR diagram of the club's data is shown in Figure 2.1.

Figure 2.1. Fragment of data model for the Wine Club.
Class(ClassCode, ClassName, Region)
Color(ColorCode, ColorDesc)
Customer(CustomerCode, CustomerName, CustomerAddress, CustomerPhone)
CustomerOrder(OrderCode, OrderDate, ShipDate, Status, TotalCost)
OrderItem(OrderCode, ItemCode, Quantity, ItemCost)
ProductGroup(GroupCode, Description)
Reservation(CustomerCode, TripCode, Date, NumberOfPeople, Price)
Shipment(ShipCode, ShipDate)
Shipper(ShipperCode, ShipperName, ShipperAddress, ShipperPhone)
Stock(LocationCode, StockOnHand)
Supplier(SupplierCode, SupplierName, SupplierAddress, SupplierPhone)
Trip(TripCode, Description, BasicCost)
TripDate(TripCode, Date, Supplement, NumberOfPlaces)
Wine(WineCode, Name, Vintage, ABV, PricePerBottle, PricePerCase)

The Wine Club has the following application systems in place:
Customer administration. This enables the club to add new customers. This is particularly important after an advertising campaign, typically in the Sunday color supplements, when many new customers join the club at the same time. It is important that the new customers' details, and their orders, are promptly dealt with in order to create a good first impression. This application also enables changes to a customer's address to be recorded, as well as removing ex-customers from the database. There are about 100,000 active customers.
Stock control. The goods inward system enables newly arrived stock to
be added to the stock records. The club carries about 2,200 different wines and 250 accessories from about 150 suppliers.
Order processing. The directors of the club place a high degree of importance on the fulfillment of customers' orders. Much emphasis is given to speed and accuracy. It is a stated policy that orders must be shipped within ten days of receipt of the order. The application systems that support order processing are designed to enable orders to be recorded swiftly so that they can be fulfilled within the required time. The club processes about 750,000 orders per year, with an average of 4.5 items per order.
Shipments. Once an order has been picked, it is packed and placed in a pre-designated part of the dispatch area. Several shipments are made every day.
Trip bookings. This is a new system that records customer bookings for planned trips. It operates quite independently of the other systems, although it shares the customer information held in the customer administration system.

The club's systems have evolved over time and have been developed using different technologies. The order processing and shipments systems are based on indexed-sequential files accessed by COBOL programs. The customer administration system is held on a relational database. All these systems run on the same mid-range computer. The stock control system is a software package that runs on a PC network. The trip bookings system is held on a single PC that runs a PC-based relational database system.

There is a general feeling among the directors and senior managers that the club is losing market share. Within the past three months, two more clubs have been formed, and their presence in the market is already being felt. Also, recently, more customers than usual appear to be leaving the club, and new customers are being attracted in smaller numbers than before. The directors have held meetings to discuss the situation. The information upon which the discussions are based is largely anecdotal. They are all certain that a problem exists but find it impossible to quantify. They also know that helpful information passes through their systems and should be available to answer questions. In reality, however, while it is not too difficult to get answers
to the day-to-day operational questions, it is almost impossible to get answers to more strategic questions.
Strategic and Operational Information

It is very important to understand the difference between the terms strategic and operational. In general, strategic matters deal with planning and policy making, and this is where a data warehouse can help. For instance, in the Wine Club, the decision as to when a new product should be launched would be regarded as a strategic decision. Examples pertaining to other types of organization include:

A telecommunications company deciding to introduce very cheap off-peak tariffs to attract callers away from the peak times, rather than installing extra equipment to cope with increasing demand.
A large supermarket chain deciding to open its stores on Sundays.
A general 20 percent price reduction for one month in order to increase market share.

Whereas strategic matters relate to planning and policy, operational matters are generally more concerned with the day-to-day running of a business or organization. Operations can be regarded as the implementation of the organization's strategy (its policies and plans). The day-to-day ordering of supplies, satisfying of customers' orders, and hiring of new employees are examples of operational procedures. These procedures are usually supported by computer applications and, therefore, the applications must be able to provide answers to operational questions such as:

How many unfulfilled orders are there?
On which items are we out of stock?
What is the position on a particular order?

Typically, operational systems are quite good at answering questions like these because
they are questions about the situation as it exists right now. You could add the words right now to the end of each of those questions and they would still make sense. Questions such as these arise out of the normal operation of the organization. The sort of questions the directors of the Wine Club wish to ask are:

1. Which product lines are increasing in popularity and which are decreasing?
2. Which product lines are seasonal?
3. Which customers place the same orders on a regular basis?
4. Are some products more popular in different parts of the country?
5. Do customers tend to purchase a particular class of product?

These, clearly, are not "right now" types of questions and, typically, operational systems are not good at answering them. Why is this? The answer lies in the nature of operational systems: they are developed to support the operational requirements of the organization. Let's examine the operational systems of the Wine Club and see what they actually do. Each application's role in the organization can usually be expressed in one or two sentences. The customer administration system contains details of current customers. The stock control system contains details of the stock currently held. The order processing system holds details of unfulfilled customer orders, and the shipments system records details of fulfilled orders awaiting delivery to the customers.

Notice the use of words like details and current in those descriptions. They underline the "right now" nature of operational systems. You could say that the operational systems represent a "snapshot" of an organization at a point in time. The values held are constantly changing. At any point in time, dozens or even hundreds of inserts, updates, and deletes may be executing on all, or any, parts of the systems. If you were to freeze the systems momentarily, they would provide an accurate reflection of the state of the organization at precisely that moment. One second earlier, or one second later, the situation would have changed.

Now let us examine the five questions that the directors of the Wine Club need to ask in
order to reach decisions about their future strategy. What is it that the five questions have in common? If you look closely, you will see that each of them is concerned with sales of products over time. Looking at the first question:

Which product lines are increasing in popularity and which are decreasing?

This is obviously a sensible strategic business question. Depending on the answer, the directors might:

Expand their range of some products and shrink their range of others
Offer a financial incentive on some products, such as reduced prices or discounts for volume purchases
Enhance the promotional or advertising techniques for the products that are decreasing in popularity

For the moment, let's focus on sales of wine and assess whether the information required to answer such a question is available to the directors. Have a look back at the EAR diagram at the beginning of the case study. Remember, we are looking for "sales of products over time." The only way we can assess whether a product line is increasing or decreasing in popularity is to trace its demand over time. If the order processing information were held in a relational database, we could devise an SQL query such as:
Select Name, OrderDate, Sum(Quantity), Sum(ItemCost) Sales
From CustomerOrder a, OrderItem b, Wine c
Where a.OrderCode = b.OrderCode
And b.ItemCode = c.WineCode
Group By Name, OrderDate
Order By Name, OrderDate

Question 4: Are some products more popular in different parts of the country?
This query shows, for each wine, both the number of orders and the total number of bottles ordered by area.
Select WineName, AreaDescription, Count(*) "Total Orders", Sum(Quantity) "Total Bottles"
From Sales S, Wine W, Area A, Time T
Where S.WineCode = W.WineCode
And S.AreaCode = A.AreaCode
And S.OrderTimeCode = T.TimeCode
And T.PeriodNumber Between 012001 and 122002
Group by WineName, AreaDescription
Order by WineName, AreaDescription

Question 5: Do customers tend to purchase a particular class of product?
This query presents us with a problem: there is no reference to the class of wine in the data warehouse. Information relating to classes does exist in the original EAR model, so it seems that the star schema is incomplete. What we have to do is extend the schema as shown in Figure 2.10.

Figure 2.10. Snowflake schema for the sale of wine.
Of course the Class information has to undergo the extraction and integration processing before it can be inserted into the database. A foreign key constraint must be included in the Wine table to refer to the Class table. The query can now be coded:
Select CustomerName, ClassName, Sum(Quantity) "TotalBottles"
From Sales S, Wine W, Customer Cu, Class Cl, Time T
Where S.WineCode = W.WineCode
And S.CustomerCode = Cu.CustomerCode
And W.ClassCode = Cl.ClassCode
And S.OrderTimeCode = T.TimeCode
And T.PeriodNumber Between 012001 and 122002
Group by CustomerName, ClassName
Having Sum(Quantity) > 2 * (Select AVG(Quantity)
    From Sales S, Wine W, Class C, Time T
    Where S.WineCode = W.WineCode
    And W.ClassCode = C.ClassCode
    And S.OrderTimeCode = T.TimeCode
    And T.PeriodNumber Between 012001 and 122002)
Order by CustomerName, ClassName

The query lists all customers and classes of wine where the customer has ordered that class of wine at more than twice the average quantity for all classes of wine. There are other ways that the query could be phrased. It is always a good idea to ask the directors precisely how they would define their questions in business terms before translating the questions into SQL.

There are any number of ways the directors can question the data warehouse in order to answer their strategic business questions. We have shown that the data warehouse supports these types of questions in a way that the operational applications could never hope to do. The queries show very clearly that arithmetic functions such as AVG() and, particularly, SUM() are used in just about every case. Therefore, a golden rule with respect to fact tables can be defined:
The nonkey columns in the fact table must be summable.

Data attributes such as Quantity and ItemCost are summable, whereas text columns such as descriptions are not. Unfortunately, it is not as straightforward as it seems. Care must be taken to ensure that the summation is meaningful. For some attributes the summation is meaningful only across
certain dimensions. For instance, ItemCost can be summed by product, customer, area, and time with meaningful results. Quantity sold can be summed by product but might be regarded as meaningless across other dimensions. Although this problem applies to the Wine Club, it is more easily explained with a different organization, such as a supermarket. While it is reasonable to sum sales revenue across products (e.g., the revenue from sales of apples added to the revenue from sales of oranges and other fresh fruit each contributes toward the sum of revenue for fresh fruit), adding the quantity of apples sold to the quantity of oranges sold produces a meaningless
result. Attributes that are summable across some dimensions, but not all, are referred to as semisummable attributes. Clearly they have a valuable role to play in a data warehouse, but their usage must be restricted to avoid the generation of invalid results.

So have we now completed the data warehouse design? Not quite. Remember that the fact table may grow to more than 62 million rows over time. There is the possibility, therefore, that a query might have to trawl through every single row of the fact table in order to answer a particular question. In fact, it is very likely that many queries will require a large percentage of the rows, if not the whole table, to be taken into account. How long will it take to do that? The answer is: quite a long time. Some queries are quite complex, involving multiple join paths, and this will seriously increase the time taken for the result set to be presented back to the user, perhaps to several hours. The problem is exacerbated when several people are using the system at the same time, each with a complex query to run.

If you were to join the 62-million-row fact table to the customer table and the wine table, how many rows would the Cartesian product contain?
In principle, there is no need for rapid responses to strategic queries, as they are very different from the kind of day-to-day queries that are executed while someone is hanging on the end of the telephone waiting for a response. In fact, it could be argued that, previously, the answer was impossible to obtain, so even if the query took several days to execute, it would still be worth it. That doesn't mean we shouldn't do what we can as designers to speed things up as much as possible. Indexes might help but, in a great many cases, the queries will need to access more than half the data, and indexes are much less efficient in those cases than a full sequential scan of the tables. No, the answer lies in summaries.

Remember, we said that almost all queries would be summing large numbers of rows together and returning a result set with a smaller number of rows. Well, if we can predict, to
some degree, the types of queries the users will mostly be executing, we can prepare some summarized fact tables so that the users can access those whenever they satisfy the requirements of the query. Where the aggregates don't supply the required data, the user can still access the detail. If we question the users closely enough, we should be able to come up with a set, maybe half a dozen or so, of summarized fact tables. The star schema and snowflake principles still apply, but the result is that we have several fact tables instead of just one. It should be emphasized that this is a physical design consideration only; its only purpose is to improve the performance of the queries. Some examples of summarization for the Wine Club might be:

Customers by wine for each month
Customers by wine for each quarter
Wine by area for each month
Wine by area for each quarter

Notice that the above examples are summarizing over time. There are other summaries, and you may like to try to think of some, but summarizing over time is a very common practice in data warehouses. Figure 2.11 shows the levels of summarization commonly in use.

Figure 2.11. Levels of summarization in a data warehouse.
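As an illustration, the "wine by area for each quarter" summary might be materialized along the following lines. This is a sketch only: Create Table ... As Select is not supported in this form by every RDBMS, and the Quarter column on the Time dimension is an assumption.

Create Table Wine_by_Area_by_Quarter As
Select S.WineCode, S.AreaCode, T.Quarter,
       Sum(S.Quantity) As Total_Bottles,
       Sum(S.ItemCost) As Total_Sales
From Sales S, Time T
Where S.OrderTimeCode = T.TimeCode
Group By S.WineCode, S.AreaCode, T.Quarter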
One technique that is very useful to people using the data warehouse is the ability to drill down from one summary level to a lower, more detailed level. For instance, you might observe that a certain range of products was doing particularly well or particularly badly. By drilling down to individual products, you can see whether the whole range, or maybe just one isolated product, is affected. Conversely, the ability to drill up would enable you to make sure, if you found one product performing badly, that the whole range is not affected. The abilities to drill down and drill up are powerful reporting capabilities provided by a data warehouse where summarization is used.

The usage of the data warehouse must be monitored to ensure that the summaries are being used by the queries that are exercising the database. If it is found that they are not being used, then they should be dropped and replaced by others that are of more use.
Summary Navigation

The introduction of summaries raises some questions:

1. How do users, especially noncomputer professionals, know which summaries are available and how to take advantage of them?
2. How do we monitor which summaries are, in fact, being used?

One solution is to use a summary navigation tool. A summary navigator is an additional layer of software, usually a third-party product, that sits between the user interface (the presentation layer) and the database. The summary navigator receives the SQL query from the user and examines it to establish which columns are required and the level of summarization needed.

How do summary navigators work? This is a prime example of the use of metadata. Remember, metadata is data about data. Summary navigators hold their own metadata within the data warehouse (or in a database separate from the warehouse). The metadata is used to provide a "mapping" between the queries formulated by the users and the data warehouse itself. Tables 2.2 and 2.3 are example metadata tables.
Table 2.2. Available Summary Tables for Aggregate Navigation (Summary_Tables)

Table_Name                      DW_Column
Sales_by_Customer_by_Year       Sales
Sales_by_Customer_by_Year       Customer
Sales_by_Customer_by_Year       Year
Sales_by_Customer_by_Quarter    Sales
Sales_by_Customer_by_Quarter    Customer
Sales_by_Customer_by_Quarter    Quarter

Table 2.3. Metadata Mapping Table for Aggregate Navigation (Column_Map)

User_Column    User_Value    DW_Column    DW_Value    Rating
Year           2001          Year         2001        100
Year           2001          Quarter      Q1_2001     80
Year           2001          Quarter      Q2_2001     80
Year           2001          Quarter      Q3_2001     80
Year           2001          Quarter      Q4_2001     80
The Summary_Tables table contains a list of all the summaries that exist in the data warehouse, together with the columns contained within them. The Column_Map table provides a mapping between the columns specified in the user's query and the columns that are available from the warehouse. Let's look at an example of how it works. We will assume that the user wants to see the sum of sales for each customer for 2001. The simple way to do this is to formulate the following query:
Select CustomerName, Sum(Sales) "Total Sales"
From Sales S, Customer C, Time T
Where S.CustomerCode = C.CustomerCode
And S.TimeCode = T.TimeCode
And T.Year = 2001
Group By C.CustomerName

As we know, this query would look at every row in the detailed Sales fact table in order to produce the result set and would very likely take a long time to execute. If, on the other hand, the summary navigator stepped in and grabbed the query before it was passed through to the RDBMS, it could redirect the query to the summary table called "Sales_by_Customer_by_Year." It does this by:

1. Checking that all the columns needed are present in the summary table. Note that this includes columns in the "Where" clause that are not necessarily required in the result set (such as "Year" in this case).
2. Checking whether there is a translation to be done between what the user has typed and what the summary is expecting.

In this particular case, no translation was necessary, because the summary table "Sales_by_Customer_by_Year" contained all the necessary columns. So the resultant query would be:
Select CustomerName, Sum(Sales) "Total Sales"
From Sales_by_Customer_by_Year S, Customer C
Where S.CustomerCode = C.CustomerCode
And S.Year = 2001
Group By C.CustomerName

If, however, "Sales_by_Customer_by_Year" did not exist as an aggregate table (but "Sales_by_Customer_by_Quarter" did), then the summary navigator would have more work to do. It would see that Sales by Customer was available and would have to refer to the Column_Map table to see whether the "Year" column could be derived. The Column_Map table shows that, when the user types "Year = 2001," this can be translated to:
Quarter in ("Q1_2001," "Q2_2001," "Q3_2001," "Q4_2001") So, in the absence of “Sales_by_Customer_by_Year,” the query would be reconstructed as follows:
Select CustomerName, Sum(Sales) "Total Sales"
From Sales_by_Customer_by_Quarter S, Customer C
Where S.CustomerCode = C.CustomerCode
And S.Quarter In ('Q1_2001', 'Q2_2001', 'Q3_2001', 'Q4_2001')
Group By C.CustomerName

Notice that the Column_Map table has a rating column. This tells the summary navigator that "Sales_by_Customer_by_Year" is summarized to a higher level than "Sales_by_Customer_by_Quarter" because it has a higher rating. This directs the summary navigator to select the most efficient path to satisfying the query.

You may think that the summary navigator itself adds an overhead to the overall processing time involved in answering queries, and you would be right. Typically, however, the added overhead is on the order of a few seconds, which is a price worth paying for the 1,000-fold improvements in performance that can be achieved using this technique.

We opened this section with two questions. The first asked how users, especially noncomputer professionals, know which aggregates are available and how to take advantage of them. It is interesting to note that, where summary navigation is used, the user never knows which fact table their queries are actually using. This means that they don't need to know which summaries are available or how to take advantage of them. If "Sales_by_Customer_by_Year" were to be dropped, the summary navigator would automatically switch to using "Sales_by_Customer_by_Quarter." The second question asked how we monitor which summaries are being used. Again, this is simple when you have a summary navigator. As it formulates the actual queries to be executed against the data warehouse, it knows which summary tables are being used and can record the information. Not only that, it can record:

The types of queries that are being run, to provide statistics so that new summaries can be built
Response times
Which users use the system most frequently

All kinds of useful statistics can be stored. Where does the summary navigator store this information? In its metadata tables. As a footnote to summary navigation, it is worth mentioning that several of the major RDBMS vendors have expressed the intention of building summary navigation into their products. This development has been triggered by the enormous growth in data warehousing over the past few years.
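Using the metadata tables shown earlier, the heart of the navigator's decision can itself be sketched as an ordinary query: find the summary tables that contain every column the user's query needs, then prefer the candidate with the highest rating. This is an illustration of the idea, not any vendor's actual algorithm.

-- Which summaries cover the columns Sales, Customer, and Year?
Select Table_Name
From Summary_Tables
Where DW_Column In ('Sales', 'Customer', 'Year')
Group By Table_Name
Having Count(Distinct DW_Column) = 3

If no table qualifies, the navigator falls back to Column_Map to see whether a missing column, such as Year, can be translated into one that is available, such as Quarter.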
Presentation of Information

The final component of a data warehouse is the method of presentation: how the warehouse is presented to the users. Most data warehouse implementations adopt a client-server configuration. The concept of client-server, for our purposes, can be viewed as the separation of the users from the warehouse, in that the users will normally be using a personal computer while the data warehouse resides on a remote host. The connection between the machines is controlled by a computer network.

There are many client products available for accessing relational databases, and you may already be familiar with some of them. Most of these products help the user by using the RDBMS schema tables to generate SQL. Similarly, most have the capability to present the results in various forms such as textual reports, pie charts, scatter diagrams, and two- and three-dimensional bar charts. The choice is enormous. Most products are now available on Web servers, so that all the users need is a Web browser to display their information.

There are, however, some specialized analysis techniques that have largely come about since the invention of data warehouses. The presence of large volumes of time-variant data, hitherto unavailable, has allowed the development of a new process called data mining. In our exploration of data warehousing and the ways in which it helps with decision support, the onus has always been placed on the user of the warehouse to formulate the queries and to spot any patterns in the results. This leads to more searching questions being asked as more information is returned. Data mining is a technique where the technology does more of the work.
The users describe the data to the data mining product by identifying the data types and the ranges of valid values. The data mining product is then launched at the database and, by applying standard pattern recognition algorithms, is able to present details of patterns in the data that the user may not be aware of. Figure 2.12 shows how a data mining tool fits into the data warehouse model.

Figure 2.12. Modified data warehouse structure incorporating summary navigation and data mining.
The technique has been used very successfully in the insurance industry, where a
particular insurance company wanted to decrease the number of proposals for life assurance that had to be referred to the company for approval. A data mining program was applied to the data warehouse and reported that men between the ages of 30 and 40 whose height-to-weight ratio was within a certain range had a risk probability of just 0.015. The company immediately included this profile in its automatic underwriting system, thereby increasing the level of automatic underwriting from 50 percent to 70 percent. Even with a data warehouse, it would probably have taken a human "data miner" a long time to spot that pattern, because a human would follow logical paths, whereas the data mining program is simply searching for patterns.
PROBLEMS WHEN USING RELATIONAL DATABASES

It has been stated that the relational model supports the requirements of data warehousing, and it does. There are, however, a number of areas where the relational model struggles to cope. As we are coming to the end of our introduction to data warehousing, we'll conclude with a brief look at some of these issues.
Problems Involving Time

Time variance is one of the most important characteristics of data warehouses. In the section on "Building the Data Warehouse," we commented on the fact that there appeared to be a certain amount of data redundancy in the warehouse because we were duplicating some of the information, for example, customers' details, which existed in the operational systems. The reason we have to do this is the need to record information over time.

As an example, when a customer changes address, we would expect that change to be recorded in the operational database. When we do that, we lose the old address. So when a query is next executed in which that customer's details are included, any sales of wine for that customer will automatically be attributed to the new address. If we are investigating sales by area, the results will be incorrect (assuming the customer moved to a different area) because many of the sales were made when the customer was in another area. That is also the reason why we don't delete customers' details from the data warehouse simply because they are no longer customers. If they have placed any orders at all, then they have to remain within the system.

True temporal models are very complex and are not well supported at the moment. We have to introduce techniques such as "start dates" and "end dates" to ensure that the data warehouse returns accurate results. The problems surrounding the representation of time in data warehousing are many. They are fully explored in Chapter 4.
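As a foretaste of Chapter 4, here is a sketch of the start-date/end-date technique as it might apply to customer addresses. The table and column names are illustrative only, not the design the book develops later; the query assumes the Sales fact carries a Sale_Date.

-- Each address a customer has ever had is kept as a row, together with
-- the period during which it applied. End_Date is null for the current one.
Create Table Customer_Address
( Customer_ID Integer      Not Null,
  Address     Varchar(100) Not Null,
  Start_Date  Date         Not Null,
  End_Date    Date,
  Primary Key (Customer_ID, Start_Date)
);

-- Sales are then attributed to the address (and hence the area) that was
-- current when each sale took place:
Select A.Address, Sum(S.Value) As Sales_Value
From   Sales S, Customer_Address A
Where  S.Customer_ID = A.Customer_ID
And    S.Sale_Date >= A.Start_Date
And   (A.End_Date Is Null Or S.Sale_Date < A.End_Date)
Group By A.Address;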
Problems With SQL
SQL is based on set theory. It treats tables as sets and returns its results in the form of a set. There are cases where the use of procedural logic would improve functionality and performance.

Ranking/Top (n)
While it is possible to get ranked output from SQL, it is difficult to do. It involves a correlated subquery, which is beyond the capability of most SQL users. It is also very time-consuming to execute. Some individual RDBMS vendors provide additional features to enable these types of queries to be executed, but they are not standardized. So what works on one RDBMS probably won't work on others.
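For illustration, here is the kind of correlated subquery required to get, say, the top five wines by sales value. It assumes a hypothetical summary table, Wine_Sales(Wine_Name, Sales_Value), invented purely for the example.

-- The subquery counts how many rows outrank the current row, and is
-- re-executed for every row of Wine_Sales, which is why this is so slow.
Select W1.Wine_Name,
       W1.Sales_Value,
       (Select Count(*) + 1
        From   Wine_Sales W2
        Where  W2.Sales_Value > W1.Sales_Value) As Rank
From   Wine_Sales W1
Where  (Select Count(*)
        From   Wine_Sales W2
        Where  W2.Sales_Value > W1.Sales_Value) < 5
Order By Rank;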
Top n Percent

It is not practically possible, for instance, to get a list of the top 10 percent of customers who place the most orders.
Running Balances

It is impossible, in practical terms, to get a report containing a running balance using standard SQL. If you are not clear what a running balance is, it's like a bank statement that lists the payments in one column, receipts in a second column, and the balance, as modified by the receipts and payments, in a third or subsequent column.
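To see why, here is the nearest standard SQL workaround, again using an illustrative table, Account_Tx(Account_ID, Tx_Date, Amount), in which receipts are positive amounts and payments negative. It also assumes at most one transaction per date.

-- The Balance column re-sums the account's entire history for every row
-- of the statement, so the cost grows with the square of the row count.
Select T1.Tx_Date,
       T1.Amount,
       (Select Sum(T2.Amount)
        From   Account_Tx T2
        Where  T2.Account_ID = T1.Account_ID
        And    T2.Tx_Date <= T1.Tx_Date) As Balance
From   Account_Tx T1
Where  T1.Account_ID = 1234
Order By T1.Tx_Date;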
Complex Arithmetic

Standard SQL provides basic arithmetic functions but does not support more complex functions. The different RDBMS vendors supply their own augmentations, but these vary. For instance, if it is required to raise a number by a power, in some systems the power has to be an integer, while in others it can be a decimal. Although data warehouses are used for the production of statistics, standard statistical formulas such as deviations and quartiles, as well as standard mathematical modeling techniques such as integral and differential calculus, are not available in SQL.

Variables
Variables cannot be included in a standard SQL query.
Almost all of these, and other, deficiencies can be resolved by writing 3GL programs in languages such as C or COBOL with embedded SQL. Also, most RDBMS vendors provide a procedural extension to their standard SQL product to assist in resolving the problems. However, the standard interface between the products that are available at the presentation layer and the RDBMS is a standard called ODBC, which stands for open database connectivity. ODBC, and the more recent JDBC (Java database connectivity), are very useful because they have forced the industry to adopt a standard approach. They do not, at the time of this writing, support the procedural extensions that the RDBMS vendors have provided. It is worth noting that some of these issues are being tackled. We explore the future in Chapter 11.
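For example, the running balance that defeats standard SQL takes only a few lines in a procedural extension such as Oracle's PL/SQL. This is a sketch, reusing the illustrative Account_Tx table from above:

Declare
  Running_Balance Number := 0;
Begin
  -- Walk the account's transactions in date order, accumulating as we go.
  For Rec In (Select Tx_Date, Amount
              From   Account_Tx
              Where  Account_ID = 1234
              Order By Tx_Date)
  Loop
    Running_Balance := Running_Balance + Rec.Amount;
    Dbms_Output.Put_Line(Rec.Tx_Date || '  ' || Running_Balance);
  End Loop;
End;

The catch, as noted above, is that such code is vendor specific and cannot be reached through ODBC from a generic presentation-layer tool.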
SUMMARY

Data warehouses are a special type of database, built for the specific purpose of getting information out rather than putting data in, which is the purpose of most application databases. The emphasis is on supporting questions of a strategic nature, to assist the managers of organizations in planning for the future. A data warehouse is:

Subject Oriented
Non Volatile
Integrated
Time Variant

Dimensional analysis is a technique used in identifying the requirements of a data warehouse, and this is often depicted using a star schema. The star schema identifies the facts and the dimensions of analysis. A fact is an attribute, such as sales value or call duration, that is analyzed across dimensions. Dimensions are things like customers and products over which the facts are analyzed. A typical query might be:

Show me the sales value of products by customer for this month and last month.

Time is always a dimension of analysis. The data warehouse is almost always kept separate from the application databases because:

1. Application databases are optimized to execute insert and update type queries, whereas data warehouses are optimized to execute select type queries.
2. Application databases are constantly changing, whereas data warehouses are quiet (nonvolatile).
3. Application databases have large and complex schemas, whereas data warehouses are simplified, often denormalized, structures.
4. Data warehouses need historical information, and this is usually missing from application databases.

There are five main components to a first-generation data warehouse:

1. Extraction of the source data from a variety of application databases. These source applications are often using very different technology.
2. Integration of the data. There are two types of integration. First there is format integration, where logically similar data types (e.g., dates) are converted so that they have the same physical data type. Second, there is semantic integration, so that the meaning of the information is consistent.
3. The database itself. The data warehouse database can become enormous as a new layer of fact data is added each day. The star schema is implemented as a series of tables. The fact table (the center of the star) is long and thin in that it usually has a large number of rows and a small number of columns. The fact columns must be summable. The dimension tables (the points of the star) are joined to the fact table through foreign keys. Where a dimension participates in a hierarchy, the model is sometimes referred to as a snowflake.
4. Aggregate navigation. This is a technique that enables users to have their queries automatically directed at aggregate tables without being aware that it is happening. This is very important for query performance.
5. Presentation of information. This is how the information is presented to the users of the data warehouse. Most implementations opt for a client-server approach, which gives them the capability to view their information in a variety of tabular or graphical formats.

Data warehouses are also useful data sources for applications such as data mining: software products that scan large databases searching for patterns and reporting the results back to the users. We review products in Chapter 10.

There are some problems that have to be overcome, such as the use of time. Care has to be taken to ensure that the facts in the data warehouse are correctly reported with respect to time. We explore the problems surrounding time in Chapter 4.
Also, many of the queries that users typically like to ask of a data warehouse cannot easily be translated into standard SQL queries, and work-arounds have to be used, such as procedural programs with embedded SQL.
Chapter 3. Design Problems We Have to Face Up To

In this chapter, we will be reviewing the traditional approaches to designing data warehouses. During the review we will investigate whether or not these methods are still appropriate now that the business imperatives have been identified. We begin this chapter by picking up on the introduction to data warehousing.
DIMENSIONAL DATA MODELS

In Chapter 2, we introduced data warehousing and described, at a high level, how we might approach a design. The approach we adopted follows the style of some of the major luminaries in the development of data warehousing generally. This approach to design can be given the general description of dimensional. Star schemas and snowflake schemas are both examples of a dimensional data model. The dimensional approach was adopted in our introduction for the following reasons:

Dimensional data models are easy to understand. Therefore, they provide an ideal introduction to the subject.
They are unambiguous.
They reflect the way that business people perceive their businesses.
Most RDBMS products now provide direct support for dimensional models.
Research shows that almost all the literature supports the dimensional approach.

Unfortunately, although the dimensional model is generally acclaimed, there are alternative approaches, and this has tended to result in "religious" wars (but no deaths as far as I know). Even within the dimensional model camp there are differences of opinion. Some people believe that a perfect star should be used in all cases, while others prefer to see the hierarchies in the dimensions and would tend to opt for a snowflake design.

Where deep-rooted preferences exist, it is not the purpose of this book to try to make "road to Damascus" style conversions by making nonbelievers "see the light." Instead, it is intended to present some ideas and systematic arguments so that readers of this book can make their own architectural decisions based on a sound understanding of the facts of the matter. In any case, I believe there are far more serious design issues that we have to consider once we have overcome these apparent points of principle. Principles aside, we also have to consider any additional demands that customer relationship management might place on the data architecture. A good objective for this chapter would be to devise a general high-level data architecture for data warehousing. In doing so, we'll discuss the following issues:
1. Dimensional versus third normal form (3NF) models
2. Stars versus snowflakes
3. What works for CRM
Dimensional Versus 3NF

There are two principal arguments in favor of dimensional models:
1. They are easy for business people to understand and use.
2. Retrieval performance is generally good.

The ease of understanding and ease of use bit is not in dispute. It is a fact that business people can understand and use dimensional models. Most business people can operate spreadsheets, and a dimensional model can be likened to a multidimensional spreadsheet. We'll be exploring this in Chapter 5 when we start to investigate the dot modeling methodology.

The issue surrounding performance is just as clear cut. The main RDBMS vendors have all tweaked their query optimizers to enable them to recognize and execute dimensional queries more efficiently, and so performance is bound to be good in most instances. Even so, where the dimension tables are massively large, as the customer dimension can be, joins between such tables and an even bigger fact table can be problematic. But this is not a problem that is peculiar to dimensional models. 3NF data structures are optimized for very quick insertion, update, and deletion of discrete data items. They are not optimized for massive extractions of data, and it is nonsensical to argue for a 3NF solution on the grounds of retrieval performance.
What Is Data Normalization?
Normalization is a process that aims to eliminate the unnecessary and uncontrolled duplication of data, often referred to as "data redundancy." A detailed examination of normalization is not within the scope of this book; however, a brief overview might be helpful (for more detail, see Bruce, 1992, or Batini et al., 1992). Normalization enables data structures to be made to conform to a set of well-defined rules. There are several levels of normalization, and these are referred to as first normal form (1NF), second normal form (2NF), third normal form (3NF), and so on. There are exceptions, such as Boyce-Codd normal form (BCNF), but we won't be covering these. Also, we won't explore 4NF and 5NF as, for most purposes, an understanding of the levels up to 3NF is sufficient.

In relational theory there exists a rule called the entity integrity rule. This rule concerns the primary key of any given relation and assigns to the key the following two properties:
1. Uniqueness. This ensures that all the rows in a relational table can be uniquely identified.
2. Minimality. The key will consist of one or more attributes. The minimality property ensures that the length of the key is no longer than is necessary to ensure that the first property, uniqueness, is guaranteed.
Within any relation, there are dependencies between the key attributes and the nonkey attributes. Take the following Order relation as an example:

Order
  Order number (Primary key)
  Item number (Primary key)
  Order date
  Customer ID
  Product ID
  Product description
  Quantity

Dependencies can be expressed in the form of "determinant" rules, as follows:
1. The Order Number determines the Customer ID.
2. The Order Number and Item Number determine the Product ID.
3. The Order Number and Item Number determine the Product Description.
4. The Order Number and Item Number determine the Quantity.
5. The Order Number determines the Order Date.

Notice that some of the items are functionally dependent on Order Number (part of the key), whereas others are functionally dependent on the combination of both the Order Number and the Item Number (the entire key). Where the dependency exists on the entire key, the dependency is said to be a fully functional dependency. Where all the attributes have at least a functional dependency on the primary key, the relation is said to be in 1NF. This is the case in our example. Where all the attributes are engaged in a fully functional relationship with the primary key, the relation is said to be in 2NF. In order to change our relation to 2NF, we have to split some of the attributes into a separate relation as follows:

Order
  Order number (Primary key)
  Order date
  Customer ID

Order Item
  Order number (Primary key)
  Item number (Primary key)
  Product ID
  Product description
  Quantity

These relations are now in 2NF, since all the nonkey attributes are fully functionally dependent on their primary key. There is one relationship that we have not picked up so far: the Product ID determines the Product Description. This is known as a transitive dependency, because the Product ID itself can be determined by the combination of Order Number and Item Number (see dependency 2 above). In order for a relation to be classified as a 3NF relation, all transitive dependencies must be removed. So now we have three relations, all of which are in 3NF:

Order
  Order number (Primary key)
  Order date
  Customer ID

Order Item
  Order number (Primary key)
  Item number (Primary key)
  Product ID
  Quantity

Product
  Product ID (Primary key)
  Product description
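To make the end point concrete, the three relations might be declared as follows. This is a sketch only: the data types are illustrative, and Order is renamed Orders because "order" is a reserved word in SQL.

Create Table Orders
( Order_Number Integer Not Null Primary Key,
  Order_Date   Date,
  Customer_ID  Integer
);

Create Table Product
( Product_ID          Integer Not Null Primary Key,
  Product_Description Varchar(50)
);

-- The transitive dependency has gone: Product_Description is now held
-- once per product in Product, rather than once per order line.
Create Table Order_Item
( Order_Number Integer Not Null References Orders,
  Item_Number  Integer Not Null,
  Product_ID   Integer References Product,
  Quantity     Integer,
  Primary Key (Order_Number, Item_Number)
);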
There is one major advantage in a 3NF solution, and that is flexibility. Most operational systems are implemented somewhere between 2NF and 3NF, in that some tables will be in 2NF, whereas most will be in 3NF. This adherence to normalization tends to result in quite flexible data structures. We use the term flexible to describe a data structure that is quite easy to change should the need arise. The changing nature of business requirements has already been described and, therefore, it must be advantageous to implement a data model that is adaptive in the sense that it can change as the business requirements evolve over time.

But what is the difference between dimensional and normalized? Let's have another look at the simple star schema for the Wine Club, from Figure 2.3, which we first produced in our introduction in Chapter 2 (see Figure 3.1).
Figure 3.1. Star schema for the Wine Club.
Some of the attributes of the customer dimension are:

Customer Dimension
  Customer ID (Primary key)
  Customer name
  Street address
  Town
  County
  Zip Code
  Account manager ID
  Account manager name

This dimension is currently in 2NF because, although all the nonprimary key columns are fully functionally dependent on the primary key, there is a transitive dependency in that the account manager name can also be determined from the account manager ID. So the 3NF version of the customer dimension above would look like this:
Customer Dimension
  Customer ID (Primary key)
  Customer name
  Street address
  Town
  County
  Zip Code
  Account manager ID

Account Manager Dimension
  Account Manager ID (Primary key)
  Account Manager Name

The diagram is shown in Figure 3.2.
Figure 3.2. Third normal form version of the Wine Club dimensional model.
So what we have done is convert a star into the beginnings of a snowflake and, in doing so, have started putting the model into 3NF. If this process is carried out thoroughly with all the dimensions, then we should have a complete dimensional model in 3NF. Well, not quite. We also need to look at the fact table. But first, let's go back to the original sources of the Sales fact table: the Order and Order Item tables. They have the following attributes:

Order
  Order number (Primary key)
  Customer ID
  Time

Order Items
  Order number (Primary key)
  Order item (Primary key)
  Wine ID
  Depot ID
  Quantity
  Value

Both these tables are in 3NF. If we were to collapse them into one table, called Sales, by joining them on the Order Number, then the resulting table would look like this:

Sales
  Order number (Primary key part 1)
  Order item (Primary key part 2)
  Customer ID
  Wine ID
  Time
  Depot ID
  Quantity
  Value

This table is now in 1NF because the Customer ID and Time, while being functionally dependent on the primary key, do not display the property of "full" functional dependency (i.e., they are not dependent on the Order Item). In our dimensional model, we have decided not to include the Order Number and Order Item details. If we remove them, is there another candidate key for the resulting table? The answer is, it depends! Look at this version of the Sales table:
Sales
  Customer ID (Primary key part 2)
  Wine ID (Primary key part 1)
  Time (Primary key part 3)
  Depot ID
  Quantity
  Value

The combination of Customer ID, Wine ID, and Time has emerged as a composite candidate key. Is this realistic? The answer is yes, it is. But it all depends on the Time. The granularity of time has to be sufficiently fine as to ensure that the primary key has the property of uniqueness. Notice that we did not include the Depot ID as part of the primary key. The reason is that it does not add any further to the uniqueness of the key and, therefore, its inclusion would violate the other property of primary keys, that of minimality. (A DDL sketch of this key follows the summary below.) The purpose of this treatise is not to try to convince you that all dimensional models are automatically 3NF models; rather, my intention is to show that it is erroneous to say that the choice is between dimensional models and 3NF models. The following sums up the discussion:

1. Star schemas are not usually in 3NF.
2. Snowflake schemas can be in 3NF.
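Here is the promised sketch of the composite key in DDL form. The data types are illustrative, and Time is represented by a Time_Code column whose granularity must be fine enough to guarantee uniqueness:

Create Table Sales
( Customer_ID Integer Not Null,
  Wine_ID     Integer Not Null,
  Time_Code   Integer Not Null,
  Depot_ID    Integer,           -- outside the key: it adds nothing to uniqueness
  Quantity    Integer,
  Value       Decimal(9,2),
  Primary Key (Wine_ID, Customer_ID, Time_Code)
);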
Stars and Snowflakes

The second religious war is being fought inside the dimensional model camp. This is the argument about star schemas versus snowflake schemas. Kimball (1996) proscribes the use of snowflake schemas for two reasons. The first is the effect on performance that has already been described. The second reason is that users might be intimidated by complex hierarchies. My experience has shown his assertion, that users and business people are uncomfortable with hierarchies, to be quite untrue. My experience is, in fact, the opposite. Most business people are very aware of hierarchies and are confused when you leave them out or try to flatten them into a single level. Kimball (1996) uses the hierarchy in Figure 3.3 as his example.
Figure 3.3. Confusing and intimidating hierarchy.
It is possible that it is this six-layer example in Figure 3.3 that is confusing and intimidating, rather than the principle. In practice, hierarchies involving so many layers are almost unheard of. Far more common are hierarchies like the one reproduced from the Wine Club case study, shown in Figure 3.4.
Figure 3.4. Common organizational hierarchy.
These hierarchies are very natural to managers because, in real-life scenarios, customers are organized into geographic areas or market segments, and the managers' desire is to be able to ask questions about the business performance of this type of segmentation. Similarly, managers are quite used to comparing the performance of one department against other departments, or one product range against other product ranges. The whole of the business world is organized in a hierarchical fashion. We live in a structured world. It is when we try to remove it, or flatten it, that business people become perplexed. In any case, if we present the users of our data warehouse with a star schema, all we have done, in most instances, is precisely that: We have flattened the hierarchy in a kind of "denormalization" process. So I would offer a counter principle with respect to snowflake schemas: There is no such thing as a true star schema in the eyes of business people. They expect to see the hierarchies.

Where does this get us? Well, in the dimensional versus 3NF debate, I suspect a certain amount of reading-between-the-lines interpretation is necessary and that what the 3NF camp is really shooting for is the retention of the online transaction processing (OLTP) schema in preference to a dimensional schema. The reason for this is that the OLTP model will more accurately reflect the underlying business processes and is, in theory at least, more flexible and adaptable to change. While this sounds like a great idea, the introduction of history usually makes this impossible to do. This is part of a major subject that we'll be covering in detail in Chapter 4.

It is worth noting that all online analytical processing (OLAP) products also implement a dimensional data model. Therefore, the terms OLAP and dimensional are synonymous.
This brings us neatly onto the subject of data marts. The question "When is it a data warehouse and when is it a data mart?" has also been the subject of much controversy. Often, it is the commercial interests of the software vendors that carry the greatest influence. The view is held, by some, that the data warehouse is the big, perhaps enterprise-wide, repository and that data marts are much smaller, maybe departmental, extractions from the warehouse that the users get to analyze. By the way, this is very closely associated with the previous discussion on 3NF versus dimensional models.

Even the most enthusiastic supporter of the 3NF/OLTP approach is prepared to recognize the value that dimensional models bring to the party when it comes to OLAP. In a data warehouse that has an OLAP component, it is that OLAP component that the users actually get to use. Sometimes it is the only part of the warehouse that they have direct access to. This means that the bit of the warehouse that the users actually use is dimensional, irrespective of the underlying data model. In a data warehouse that implements a dimensional model, the bit that the users actually use is, obviously, also dimensional. Therefore, everyone appears to agree that the part that the users have access to should be dimensional. So it appears that the only thing that separates the two camps is some part of the data warehouse that the users do not have access to.

Returning to the issue of data marts, we decline to offer a definition, on the basis that a subset of a data warehouse is still a data warehouse. There is a discussion on OLAP products generally in Chapter 10.
WHAT WORKS FOR CRM

Data warehousing is now a mature business solution. However, the evolution of business requires the evolution of data warehouses. That business people have to grasp the CRM nettle is an absolute fact. In order to do this, information is the key. It is fair to say that an organization cannot be successful at CRM without high-quality, timely, and accurate information. You cannot determine the value of a customer without information. You cannot personalize the message to your customer without information, and you cannot assess the risk of losing a customer without information. If you want to obtain such information, then you really do need a data warehouse.

In order to adopt a personalized marketing approach, we have to know as much as we can about our customers' circumstances and behavior. We described the difference between circumstances and behavior in the section on market segmentation in Chapter 1. The capability to accurately segment our customers is one of the important properties of a data warehouse that is designed to support a CRM strategy. Therefore, the distinction between circumstances and behavior, two very different types of data, is crucial in the design of the data warehouse. Let's look at the components of a "traditional" data warehouse to try to determine how the two different types of data are treated. The diagram in Figure 3.5 is our now familiar Wine Club example.
Figure 3.5. Star schema for the Wine Club.
It remains in its star schema form for the purposes of this examination, but we could just as easily be reviewing a snowflake model.
The first two questions we have to ask are whether or not it contains information about:
1. Customers' behavior
2. Customers' circumstances

Clearly it does. The Sales table (the Fact table) contains details of sales made to customers. This is behavioral information, and it is a characteristic of dimensional data warehouses and data marts that the Fact table contains behavioral information. Sales is a good example, probably the most common, but there are plenty more from all industries:

Telephone call usage
Shipments and deliveries
Insurance premiums and claims
Hotel stays
Aircraft flight bookings

These are all examples of the subject of a dimensional model, and they are all behavioral. The customer dimension is the only place where we keep information about customer circumstances. According to Ralph Kimball (1996), the principal purpose of the customer dimension, as with all dimensions in a dimensional model, is to enable constraints to be placed on queries that are run against the fact table. The dimensions merely provide a convenient way of grouping the facts and appear as row headers in the user's result set. We need to be able to slice and dice the Fact table data "any which way." A solution based on the dimensional model is absolutely ideal for this purpose. It is simply made for slicing and dicing.

Returning to the terms behavior and circumstances, a dimensional model can be described as behavior centric. It is behavior centric because its principal purpose is to enable the easy and comprehensive analysis of behavioral data. It is possible to make a physical link between Fact tables by the use of common dimension tables, as the diagram in Figure 3.6 shows.
Figure 3.6. Sharing information.
This "daisy chain" effect enables us to "drill across" from one star schema to another. This common dimension is sometimes referred to as a conformed dimension. We have seen previously how the first-generation data warehouses tended to focus on the analysis of behavioral information. The second generation needs to support big business issues such as CRM and, in order to do this effectively, we have to be able to focus not only on behavior, but on circumstances as well.
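Before moving on, here is a sketch of the drill-across idea in SQL. It assumes Wine_Sales and Trip_Sales fact tables that share the conformed Customer dimension; the fact table layouts are invented for the example.

-- Each fact table is aggregated separately and stitched together through
-- the shared dimension; joining both facts to Customer in a single
-- Group By query would double-count the rows.
Select C.Customer_Name,
       (Select Sum(W.Value)
        From   Wine_Sales W
        Where  W.Customer_ID = C.Customer_ID) As Wine_Spend,
       (Select Sum(T.Value)
        From   Trip_Sales T
        Where  T.Customer_ID = C.Customer_ID) As Trip_Spend
From   Customer C;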
Customer Behavior and Customer Circumstances: The Cause-and-Effect Principle

We have explored the difference between customers' circumstances and their behavior, but why is it important? Most of the time in data warehousing, we have been analyzing behavior. The Fact table in a traditional dimensional schema usually contains information about a customer's interaction with our business, that is, the way they behave toward us. In the Wine Club example we have been using, the Fact table contained information about sales. This, as has been shown, is the normal approach toward the development of data warehouses.

Now let us look again at one of the most pressing business problems, that of customer loyalty and its direct consequence, customer churn. For the moment, let us put ourselves in the place of a customer of a cellular phone company and think of some reasons why we, as a customer, may decide that we no longer wish to remain as a customer of this company:

Perhaps we have recently moved to a different area. Maybe the new area has poor reception for this particular company.
We might have moved to a new employer and have been given a mobile phone as part of the deal, making the old one surplus to requirements.
We could have a child just starting out at college. The costs involved might require economies to be made elsewhere, and the mobile phone could be the luxury we can do without.
Each of the above situations could be the cause for us, as customers, to appear in next month's churn statistics for this cellular phone company. It would be really neat if the phone company could have predicted that we are a high-risk customer. The only way to do that is to analyze the information that we have gathered and apply some kind of predictive model to the data that yields a score from, say, 1 for a very low-risk customer to 10 for a very high-risk customer. But what type of information is likely to give us the best indication of a customer's propensity to churn? Remember that, traditionally, data warehouses tend to be organized around behavioral systems. In a mobile telephone company, the most commonly used behavioral information is the call usage. Call usage provides information about:

Types of calls made (local, long distance, collect, etc.)
Durations of calls
Amount charged for the call
Time of day
Call destinations
Call distances

If we analyze the behavior of customers in these situations, what do you think we will find? I think we can safely predict that, just before the customer churned, they stopped making telephone calls! The abrupt change in behavior is the effect of the change in circumstances. The cause-and-effect principle can be applied quite elegantly to the serious problem of customer churn and, therefore, customer loyalty. What we are seeing when we analyze behavior is the effect of some change in the customer's circumstances. The change in circumstances, either directly or indirectly, is the cause of their churning. If we analyze their behavior, it is simply going to tell us something that we already know and is blindingly obvious: the customer stopped using the phone. By this time it is usually far too late to do anything about it.

In view of the fact that most dimensional data warehouses measure behavior, it seems reasonable to conclude that such models may not be much help in predicting the customers that we are at risk of losing. We need to be very much more rigorous in our approach to tracking changes in circumstances, rather than behavior. Thus, the second-generation data warehouses that are being built as an aid to the development of CRM applications need to be able to model more than just behavior. So instead of being behavior centric, perhaps they should be dimension centric or even circumstances centric. The preferred term is customer centric. Our second-generation data warehouses will be classified as customer centric.

Does this mean that we abandon behavioral information? Absolutely not! It's just that we need to
switch the emphasis so that some types of information that are absolutely critical to a successful CRM strategy are more accessible. So what does this mean for the great star schema debate? Well, all dimensional schemas are, in principle, behavioral in nature. In order to develop a customer-centric model, we have to use a different approach. If we are to build a customer-centric model, then it makes sense to start with a model of the customer. We know that we have two major information types: behavior and circumstances. For the moment, let's focus on the circumstances. Some of the kinds of things we might like to record about customers are:

Customer
  Name
  Address
  Telephone number
  Date of birth
  Sex
  Marital status

Of course, there are many, many more pieces of information that we could hold (check out Appendix D to see quite a comprehensive selection), but this little list is sufficient for the sake of example. At first sight, we might decide that we need a customer dimension as shown in Figure 3.7.
Figure 3.7. General model for customer details.
The customer dimension in Figure 3.7 would have some kind of customer identifier and a set of attributes like those listed above. But that won't give us what we want. In order to implement a data warehouse that supports CRM, one of the things our users must be able to do is analyze, measure, and classify the effect of changes in a customer's circumstances. As far as we, the data architects, are concerned, a change in circumstances simply means a change in the value of some attribute. But, ignoring error corrections, not all attributes are subject to change as part of the ordinary course of business. Some attributes change and some don't. Even if an attribute does change, it does not necessarily mean that the change is of any real interest to our business. There is a business issue to be resolved here.
We can illustrate these points if we look a little more closely at the simple list of attributes above. Ignoring error corrections, which are the attributes that can change? Well, in theory at least, with the exception of the date of birth, they can all change. Now, there are two types of change that we are interested in:
1. Changes where we need to be able to see the previous values of the attribute, as well as the new value
2. Changes where the previous values of the attribute can be lost

What we have to do is group the attributes into these two different types. So we end up with a model with two entities, like the one in Figure 3.8.
Figure 3.8. General model for a customer with changing circumstances.
We are starting to build a general conceptual model for customers. For each customer, we have a set of attributes that can change, as well as a set of attributes that either cannot change or whose previous values we do not need to know. Notice that the relationship has a cardinality of one to many. Please note this is not meant to show that there are many attributes that can change; it actually means that each attribute can change many times. For instance, a customer's address can change quite frequently over time. In the Wine Club, the name, telephone number, date of birth, and sex are customer attributes where the business feels that either the attributes cannot change or the old values can be lost. This means that the address and marital status are attributes where the previous values should be preserved. So, using the example, the model should look as shown in Figure 3.9.
Figure 3.9. Example model showing customer with changing circumstances.
So each customer can have many changes of address and marital status over time. Now, the other main type of data that we need to capture about customers is their behavior. As we have discussed previously, the behavioral information comes from the customers' interaction with our organization. The conceptual general model that we are trying to develop must include behavioral information. It is shown in Figure 3.10.
Figure 3.10. The general model extended to include behavior.
Again the relationship between customers and their behavior is intended to show that there are many behavioral instances over time. The actual model for the Wine Club would look something like the diagram in Figure 3.11.
Figure 3.11. The example model extended to include behavior.
Each of the behavioral entities (wine sales, accessories, and trips) would probably have been modeled previously as part of individual subject areas in separate star schemas or snowflake schemas. In our new model, guess what? They still could be! Nothing we have done so far means that we can't use some dimensional elements if we want to and, more importantly, if we can get the answers we need.

Some sharp-eyed readers at this point might be tempted into thinking, "Just hold on a second, what you're proposing for the customer is just some glorified form of common (or conformed) dimension, right?" Well, no. There is, of course, some resemblance in this model to the common dimension model that was described earlier on. But remember this: The purpose of a dimension, principally, is to constrain queries against the fact table. The main purpose of a common dimension is to provide a drill-across facility from one fact table to another. These are still behavior-centric models. It is not the same thing at all as a model that is designed to be inherently customer centric. The emphasis has shifted away from behavior, and more value is attached to the customer's personal circumstances. This enables us to classify our customers into useful and relevant segments. The difference might seem quite subtle, but it is, nevertheless, significant.

Our general model for a customer-centric data warehouse looks very simple, just three main entity types. Is it complete? Not quite. Remember that there were three main types of customer segmentation. The first two were based on circumstances and behavior. We have discussed these now at some length. The third type of segment was referred to as a derived segment. Examples of derived segments are
things like "estimated lifetime value" and "propensity to churn." Typically, the inclusion and classification of a customer in these segments is determined by some calculation process such as predictive modeling. We would not normally assign a customer a classification in a derived segment merely by assessing the value of some attribute. It is sensible, therefore, to modify our general model to incorporate derived segments, as shown in Figure 3.12.
Figure 3.12. General conceptual model for a customer-centric data warehouse.
So this is it. The diagram in Figure 3.12 is the boiled-down general model for a customer-centric data warehouse. In theory, it should be able to answer almost any question we might care to throw at it. I say "in theory" because, in reality, the model will be far more complex than this. We will need to be able to cope with customers whose classification changes. For example, we might have a derived segment called "lifetime value" where every customer is allocated an indicator with a value from, say, 1 to 20. Now, Lucie Jones might have a lifetime value indicator of "9." But when Lucie's salary increases, she might be allocated a lifetime value indicator of "10." It might be useful to some companies to invent a new segment called, say, "increasing lifetime values." This being the case, we may need to track Lucie's lifetime value indicator over time. When we bring time into our segmentation processes, the possibilities become endless. However, the introduction of time also brings with it some very difficult problems, and these will be discussed in the next chapter.

Our model can be described as a general conceptual model (GCM) for a customer-centric data warehouse. The GCM provides us with a template from which all our actual conceptual models in the future can be derived. While we are on the subject of conceptual models, I firmly believe it is high time that we reintroduce the conceptual, logical, and physical data model trilogy into our design process.
Whatever Happened to the Conceptual/Logical/Physical Trilogy?

In the old days there was a tradition in which we used a three-stage process for designing a database. The first model to be developed was called the conceptual data model, and it was usually represented by an entity relationship diagram (ERD). The purpose of the ERD was to provide an
abstraction that represented the data requirements of the organization. Most people with any experience of database design will be familiar with the ERD approach. One major characteristic of the conceptual model is that it ought to be able to be implemented using any type of database. In the 1970s, the relational database was not the most widely used type of DBMS. In those days, the databases tended to be:
1. Hierarchical databases
2. Network databases

The idea was that the DBMS to be ultimately deployed should not have any influence over the way in which the requirements were expressed. So the conceptual data model should not imply the technology to be used in implementing the solution.

Once the DBMS had been chosen, a logical model would be produced. The logical model was normally expressed as a schema in textual form. So, for instance, where the solution was to be implemented using a relational database, a relational schema would be produced. This consisted of a set of relations, the relationships expressed as foreign key constraints, and a set of domains from which the attributes of the relations would draw their values.

The physical data model consisted of the data definition language (DDL) statements that were needed to actually build, in a relational environment, the tables, indexes, and constraints. This is sometimes referred to as the implementation model. One of the strengths of the trilogy is that decisions relating to the logical and physical design of the database could be taken and implemented without affecting the abstract model that reflected the business requirements.

The astonishing dominance of relational databases since the mid-1980s has led, in practice, to a blurring of the boundaries between the three models, and it is not uncommon nowadays for a single model to be built, again in the form of an ERD. This ERD is then converted straight into a set of tables in the database. The conceptual model, logical model, and physical model are treated as the same thing. This means that any changes that are made to the design for, say, performance-enhancing reasons are reflected in the conceptual model as well as the physical model. The inescapable conclusion is that the business requirements are being changed for performance reasons.

Under normal circumstances, in OLTP-type databases for instance, we might be able to debate the pros and cons of this approach, because the business users don't ever get near the data model, and it is of no interest to them. They can, therefore, be shielded from it. But data warehouses are different. There can be no debate; the users absolutely have to understand the data in the data warehouse, at least that part of it that they use. For this reason, the conceptual data model, or something that can replace its intended role, must be reintroduced as a necessary part of the development lifecycle of data warehouses.

There is another reason why we need to reinvent the conceptual data model for data warehouse development. As we observed earlier, in the past 15 years the relational database has emerged as a
de facto standard in most business applications. However, to use the now well-worn phrase, data warehouses are different. Many of the OLAP products are nonrelational, and their logical and physical manifestations are entirely different from the relational model. So the old reasons for having a three-tier approach have returned, and we should respond to them.
The Conceptual Model and the Wine Club

Now that we have the GCM, we can apply its principles to our case study, the Wine Club. We start by defining the information about the customer that we want to store. In the Wine Club we have the following customer attributes:

Customer Information
  Title
  Name
  Address
  Telephone number
  Date of birth
  Sex
  Marital status
  Children's details
  Spouse details
  Income
  Hobbies and interests
  Trade or profession
  Employers' details

The attributes have to be divided into two types:
1. Attributes that are relatively static (or where previous values can be lost)
2. Attributes that are subject to change

Customer's Static Information
  Title
  Name
  Telephone number
  Date of birth
  Sex

Customer's Changing Information
  Address
  Marital status
  Children's details
  Spouse details
  Income
  Hobbies and interests
  Trade or profession
  Employers' details

We now construct a model like the one in Figure 3.13.
Figure 3.13. Wine Club customer changing circumstances.
This represents the customer's static and changing circumstances. The behavior model is shown in Figure 3.14.
Figure 3.14. Wine Club customer behavior.
Now, thinking about derived segments: these are likely to be very dynamic, in the sense that some will change over time, some will remain fairly static, and others still will appear for a short while and then disappear. Some examples, as they apply to the Wine Club, are:

Lifetime value. This is a great way of classifying customers, and every organization should try to do this. It is an example of a fairly static form of segmentation. We would not expect dramatic changes to customers' positions here. It would be good to know which customers are on the "generally increasing" and the "generally decreasing" scale.

Recently churned. This is an example of a dynamic classification that will be constantly changing. The ones that we lose that had good lifetime value classifications would appear in our "win back" derived segment.

Special promotions. These can be good examples where a kind of "one-off" segment can be used effectively. In the Wine Club there would be occasions when, for instance, it needs to sell off a particular product quickly. The requirement would be to determine the customers most likely to buy the product. This would involve examination of previous behavior as well as circumstances (e.g., income category in the case of an expensive wine). The point is that this is a "use once" segment.

Using the three examples above, our derived segments model looks as shown in Figure 3.15.
Figure 3.15. Derived segment examples for the Wine Club.
There is a design issue with segments generally, and that is their dynamic nature. The marketing organization will constantly want to introduce new segments. Many of them will be of the fairly static and dynamic types that will have long lives in the data warehouse. What we don't want is for the data warehouse administrator to have to get involved in the creation of new tables each time a new classification is invented. This, in effect, results in frequent changes to the data warehouse structure; it will draw the marketing people into complex change control procedures and might ultimately stifle creativity. So we need a way of allowing the marketing people to add new derived segments without involving the database administrators too much. Sure, they might need help in expressing the selection criteria, but we don't want to put too many obstacles in their path. This issue will be explored in more detail in Chapter 7, the physical model.
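One common way of achieving this is to make segments data rather than schema, so that a new segment is an Insert rather than a Create Table. The sketch below is illustrative only; the book develops its own physical design in Chapter 7.

Create Table Derived_Segment
( Segment_ID   Integer     Not Null Primary Key,
  Segment_Name Varchar(50) Not Null
);

Create Table Segment_Membership
( Segment_ID  Integer Not Null References Derived_Segment,
  Customer_ID Integer Not Null,
  Score       Integer,            -- e.g., a lifetime value indicator of 1 to 20
  Valid_From  Date    Not Null,   -- lets us track a customer's score over time
  Primary Key (Segment_ID, Customer_ID, Valid_From)
);

-- Marketing invents a new segment without any change to the schema:
Insert Into Derived_Segment Values (3, 'Increasing lifetime values');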
SUMMARY

I have a theory, which is that data warehousing has always been about customer relationships. It's just that previously we didn't entirely make the connection because, in the early days, CRM had not really been recognized as any kind of discipline at the systems level. The advent of CRM has put the spotlight firmly back on data warehousing. Data warehouses provide the technology to enable us to perform customer relationship analysis. The management of the relationship is the value that is added by the business. This is the decision-making part. The warehouse is doing what it has always done: providing decision support. That is why this book is about supporting customer relationship management.

In this chapter we have been looking at some of the design issues and have tried to quantify and rationalize some aspects of data warehouse design that have been controversial in the past. There are good reasons for using dimensional schemas, but there are cases where they can work against us. The best solution for CRM is to use dimensional models where they absolutely do add value, in modeling customers' behavior, but to use more "conventional" approaches when modeling customers' circumstances.

Toward the end of the chapter we developed a general conceptual model (Figure 3.12) for a customer-centric data warehouse. We will develop this model later on in this book by applying some of the needs of our case study, the Wine Club. First, however, the awfully thorny subject of time has to be examined.
Chapter 4. The Implications of Time in Data Warehousing

The principal subject of this book is the design of data warehouses. One of the least well understood issues surrounding data warehouse design is the treatment and representation of time. This chapter introduces the characteristics of time and the way that it is used in data warehousing applications. The chapter goes on to describe fully the problems encountered with time. We need to introduce some rigor into the way that time is treated in data warehousing, and this chapter lays out the groundwork to enable that to be achieved. We also examine the more prominent solutions that other major practitioners have proposed in the past and that have been used ubiquitously in first-generation data warehouses. We will see that there are some issues that arise when these methods are adopted.

The presence of time and the dependence upon it is one of the things that sets data warehouse applications apart from traditional operational systems. Most business applications are suited to operating in the present environment, where time does not require special treatment. In many cases, dates are no more than descriptive attributes. In a data warehouse, time affects the very structure of the system. The temporal requirements of a data warehouse are very different from those of an operational system, yet it is the operational system that feeds information about changed data to the data warehouse. In a temporal database management system, support for time would be implicit within the DBMS, and the query language would contain time-specific functions to simplify the manipulation of time. Until such systems are generally available, the data warehouse database has to be designed to take account of time. The support for time has to be explicitly built into the table structures and the queries.
THE ROLE OF TIME

In data warehousing the addition of time enables historical data to be held and queried upon. This means that users of data warehouses can view aspects of their enterprise at any specific point, or over any period, of time for which the historical data is recorded. This enables the observation of patterns of behavior over time so that we can make comparisons between similar or dissimilar periods (for example, this year versus last year, or seasonal trends). Armed with this information, we can extrapolate with the use of predictive models to assist us with planning and forecasting. We are, in effect, using the past to attempt to predict the future:

If men could learn from history, what lessons it might teach us! But passion and party blind our eyes, and the light which experience gives is a lantern on the stern, which shines only on the waves behind us! —(Coleridge, 1835)

Despite this gloomy warning from the nineteenth century, the use of information from past events and trends is commonplace in economic forecasting, social trend forecasting, and even weather forecasting. The value and importance of historical data are generally recognized. It has been observed that the ability to store historical data is one of the main advantages of data warehousing and that the absence of historical data, in operational systems, is one of the motivating factors in the development of data warehouses.

Some people argue that most operational systems do keep a limited amount of history, about 60–90 days. In fact, this is not really the case, because the data held at any one time in, say, an order processing system will be orders whose lifecycle has not been completed to the extent that the order can be removed from the system. This means that it may take, on average, 60–90 days for an order to pass through all its states from inception to deletion. Therefore, at any one time, some of the orders may be up to 90 days old with a status of "invoiced," while others will be younger, with different statuses such as "new," "delivered," "back ordered," etc. This is not the same as history in our sense.
Valid Time and Transaction Time

Throughout this chapter and the remainder of the book, we will make frequent reference to the valid times (Jensen et al., 1994) and transaction times of data warehouse records. These two times are defined in the field of temporal database research and have quite precise meanings, which are now explained.

The valid time associated with the value of, say, an attribute is the time when the value is true in modeled reality. For instance, the valid time of an order is the time that the order was taken. Such
values may be associated with:
1. A single instant. Defined to be a time point on an underlying time axis. An event is defined as an instantaneous fact that occurs at an instant.
2. Intervals (periods) of time. Defined to be the time between two instants.

The valid time is normally supplied by the user, although in some cases, such as telephone calls, the valid time can be provided by the recording equipment.

The transaction time associated with the value of, say, an attribute records the time at which the value was stored in the database and became available for retrieval. Transaction times are system generated and may be implemented using transaction commit times. Transaction times may also be represented by single instants or time intervals.

Clearly, a transaction time can provide only an approximation of the valid time. We can say, generally, that the transaction time that records an event will never be earlier than the true valid time of the same event.
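To make the distinction concrete, a table can carry both kinds of time explicitly. This is a sketch only, with invented table and column names; nothing here is prescribed by the definitions above:

create table customer_address
( customer_code integer      not null,
  address       varchar(100) not null,
  valid_from    date         not null,   -- when the customer actually lived there (valid time)
  valid_to      date,                    -- null while the address is still current
  recorded_at   timestamp    not null,   -- when the row was stored (transaction time)
  primary key (customer_code, valid_from)
);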
Behavioral Data

In a dimensional data warehouse, the source systems from which the behavioral data is derived are the organization's operational systems such as order processing, supply chain, and billing. The source systems are not usually designed to record or report upon historical information. For instance, in an order processing system, once an order has satisfactorily passed through its lifecycle, it tends to be removed from the system by some archival or deletion process. After this, for all practical purposes, the order will not be visible. In any case, it will have passed beyond the state that would make it eligible to be captured for information purposes.

The task of the data warehouse manager is to capture the appropriate entities when they achieve the state that renders them eligible to be entered into the data warehouse. That is, when the appropriate event occurs, a snapshot of the entity is recorded. This is likely to be before they reach the end of their lifecycle. For instance, an order is captured into the data warehouse when the order achieves a state of, say, "invoiced." At this point the invoice becomes a "fact" in the data warehouse. Having been captured from the operational systems, the facts are usually inserted into the fact table using the bulk insertion facility that is available with most database management systems. Once loaded, the fact data is not usually subject to change at all.

The recording of behavioral history in a fact table is achieved by the continual insertion of such records over time. Usually each fact is associated with a single time attribute that records the time the event occurred. The time attribute of the event would, ideally, be the "valid time," that is, when the event occurred in the real world. In practice, valid times are not always available and transaction times (i.e., the time the data was recorded) have to be used. The actual data type used to record the time of the event will vary from one application to another depending on how precise the time has to be (the granularity of time might be day, month, and year when recording sales of wine, but would need to be more precise in the case of telephone calls and
would probably include hours, minutes, and seconds).
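To make the shape of behavioral data concrete, a Wine Club sales fact table might look as follows. This is a sketch; the actual case study schema may differ in names and columns:

create table sales
( customer_code integer      not null,  -- foreign key to the customer dimension
  wine_code     integer      not null,  -- foreign key to the wine dimension
  time_code     integer      not null,  -- foreign key to the time dimension (day granularity)
  quantity      integer      not null,
  value         decimal(9,2) not null
);
-- Rows are only ever inserted, never updated. For telephone calls the time
-- attribute would be a timestamp, to capture hours, minutes, and seconds.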
Circumstantial Data

Operational data, from which the facts are derived, is accompanied by supporting data, often referred to as reference data. The reference data relates to entities such as customers, products, sales regions, etc. Its primary purpose, within the operational processing systems, is to enable, for instance, the right products and documentation to be sent to the right addresses. It is this data that is used to populate the dimensions and the dimensional hierarchies in the data warehouse.

In the same way that business transactions have a lifecycle, these reference entities also have a lifecycle. However, the lifecycle of reference data entities is somewhat different from that of transactions. Whereas business transactions, under normal circumstances, have a predefined lifecycle that starts at inception and proceeds through a logical path to deletion, the lifecycle of reference data can be much less clear. The existence of some entities can be discontinuous. This is particularly true of customer entities, who may switch from one supplier to another and back again over time. It is also true of some other reference information, such as products (e.g., seasonal products). Also, the attributes are subject to change due to people moving, changing jobs, etc.
PROBLEMS INVOLVING TIME

There are several areas in data warehousing where time presents a problem. We'll now explore those areas.
The Effect of Time on the Data Model

Organizations wishing to build a data warehouse have often already built a data model describing their operational business data. This model is sometimes referred to as the corporate data model. The database administrator's office wall is sometimes entirely obscured by a chart depicting the corporate data model. When building a data warehouse, practitioners often encounter the requirement to utilize the customer's corporate data model as the foundation of the warehouse model. The organization has invested considerably in the development of the model, and any new application is expected to use it as the basis for development. The original motivation for the database approach was that data should be entered only once and that it should be shared by any users who were authorized to have access to it.

Figure 4.1 depicts a simple fragment of a data model for an operational system. Although the Wine Club data model could be used, the model in Figure 4.1 provides a clearer example.
Figure 4.1. Fragment of operational data model.
Figure 4.1 is typical of most operational systems in that it contains very little historical data. If we are to introduce a data warehouse into the existing data model, we might consider doing so by the addition of a time-variant table that contains the needed history. Taking the above fragment of a corporate data model as a starting point, and assuming that the warehouse subject area is "Sales," a dimensional warehouse might be created as in Figure 4.2.
Figure 4.2. Operational model with additional sales fact table.
Figure 4.2 shows a dimensional model with the fact table (Sales) at the center and three dimensions of analysis. These are time, customer, and salesperson. The salesperson dimension participates in a dimensional hierarchy in which a department employs salespeople and a site contains many departments. Figure 4.2 further shows that the sales fact table is populated by the data contained in the orders table, as indicated by the dotted arrow (not part of standard notation). That is, all new orders that have achieved the state required to enable them to be classified as sales are inserted into the sales table and are appended to the data already contained in the table. In this way the history of sales can be built.

At first sight, this appears to be a satisfactory incorporation of a dimensional data warehouse into an existing data model. Upon closer inspection, however, we find that the introduction of the fact table "Sales" has had interesting effects. To explain the effect, the sales dimensional hierarchy is extracted as an example, shown in Figure 4.3. This hierarchy shows that a site may contain many departments and a department may employ many salespeople. This sort of hierarchy is typical of many such hierarchies that exist in all organizations.
Figure 4.3. Sales hierarchy.
The relationships shown here imply that a salesperson is employed by one department and that a department is contained in one site. These relationships hold at any particular point in time.
A fact table, which contains history, is now attached to the hierarchy as shown in Figure 4.4.
Figure 4.4. Sales hierarchy with sales table attached.
The model now looks like a dimensional model with a fact table (sales) and a single dimension (salesperson). The salesperson dimension participates in a dimensional hierarchy involving departments and sites.

Assuming that it is possible, during the course of ordinary business, for a salesperson to move from one department to another, or for a department to move from one site to another, then the cardinality (degree) of the relationships "Contains" and "Employs" no longer holds. The hierarchy, consisting of salespeople, departments, and sites, contains only the latest view of the relationships. Because sales are recorded over time, some of the sales made by a particular salesperson may have occurred when the salesperson was employed by a different department. Whereas the model shows that a salesperson may be employed by exactly one department, this is only true where the relationship is viewed as a "snapshot" relationship. A more accurate description is that a salesperson is employed by exactly one department at a time. Over time, a salesperson may be employed by one or more departments. Similarly, a department is contained by exactly one site at a time. If it is possible for departments to move from one site to another then, over time, a department may be contained by one or more sites.

The introduction of time variance, which is one of the properties of a data warehouse, has altered the degree of the relationships within the hierarchy, and they should now be depicted as many-to-many relationships as shown in Figure 4.5. This leads to the following observation:

The introduction of a time-variant entity into a time-invariant model potentially alters the degree of one or more of the relationships in the model.
A point worth noting is that it is the rules of the business, not a technical phenomenon, that cause these changes to the model. The degree to which this causes a problem will vary from application to application, but dimensions typically do contain one or more natural hierarchies. It seems reasonable to assume, therefore, that every organization intending to develop a data warehouse will have to deal with the problem of the degree of relationships being altered as a result of the introduction of time.

The above example describes the kind of problem that can occur in relationships that are able to change over time. In effect, the cardinality (degree) of the relationship has changed from "one to many" to "many to many" due to the introduction of time variance. In order to capture the altered cardinality of the relationships, intersection entities would normally be introduced, as shown in Figure 4.6.
Figure 4.6. Sales hierarchy with intersection entities.
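In relational terms, the intersection entities amount to tables such as the following, where each row records one period during which the relationship held. This is a sketch; the column names are illustrative, and an open-ended period is represented by a null end date:

create table employment
( sales_id   integer not null,  -- the salesperson
  dept_code  char(2) not null,  -- the department
  start_date date    not null,
  end_date   date,              -- null while the assignment is current
  primary key (sales_id, dept_code, start_date)
);

create table containment
( dept_code  char(2) not null,
  site_code  integer not null,
  start_date date    not null,
  end_date   date,
  primary key (dept_code, site_code, start_date)
);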
This brief introduction to the problem shows that it is not really possible to combine a time-variant data warehouse model with a non-time-variant operational model without some disruption to the original model. If we compare the altered data model to the original model, it is clear that the introduction of the time-variant sales entity has had some repercussions and has forced some changes to be made. This is one of the main reasons that forces data warehouses to be built separately from operational systems.

Some practitioners believe that the separation of the two is merely a performance issue, in that most database products are not able to be optimized to support the highly disparate nature of operational versus decision support queries. This is not the case. The example shows that the structure of the data is actually incompatible. In the future it is likely that operational systems will be
built with more “decision support awareness,” but any attempt to integrate decision support systems into traditional operational systems will not be successful.
The Effect of Time on Query Results

As these entities change over time, in operational processing systems, the new values tend to replace existing values. This gives the impression that the old, now replaced, value never existed. For instance, in the Wine Club example, if a customer moves from one address to another and, at the same time, switches to a new region, there is no reason within the order processing system to record the previous address as, in order to service orders, the new address is all that is required. It could be argued that to keep information about the old address is potentially confusing, with the risk that orders may be inadvertently dispatched to the wrong address.

In a temporal system such as a data warehouse, which is required to record and report upon history faithfully, it may be very important to be able to distinguish the orders placed by the customer while resident at the first address from the orders placed since moving to the new address. An example of where this information would be needed is where regional sales were measured by the organization. In the example described above, the fact that the customer, when moving, switched regions is important. The orders placed by the customer while they were at the previous address need to have that link preserved so that the previous region continues to receive the credit for those orders. Similarly, the new region should receive credit for any subsequent orders placed by the customer during their period of residence at the new address.

Clearly, when designing a data warehouse in support of a CRM strategy, such information may be very important. If we recall the cause-and-effect principle and how we applied it to changing customer circumstances, this is a classic example of precisely that. So the warehouse needs to record not only the fact that the data has changed but also when the change occurred. There is a conflict between the system supplying the data, which is not temporal, and the receiving system, which is. The practical problems surrounding this issue are dealt with in detail later on in this chapter.

The consequences of the problem can be explored in more detail by the use of some data. Figure 4.7 provides a simple illustration of the problem by building on the example given. We'll start by adding some data to the entities.
Figure 4.7. Sales hierarchy with data.
The example in Figure 4.7 shows a "Relational" style of implementation where the relationships are implemented using foreign key columns. In the data warehouse, the "Salesperson" dimension would be related directly to the sales fact table. Each sales fact would include a foreign key attribute that would contain the identifier of the salesperson who was responsible for the sale. In order to focus on the impact of changes to these relationships, time is omitted from the following set of illustrative queries.

In order to determine the value of sales by salesperson, the SQL query shown in Query Listing 4.1 could be written:
Listing 4.1 Total sales by salesperson.
Select name, sum(sales_value)
from sales s1, sales_person s2
where s1.sales_id = s2.sales_id
group by name

In order to determine the value of sales by department, the SQL query shown in Query Listing 4.2 could be written:
Listing 4.2 Total sales by department.
Select department_name, sum(sales_value)
from sales s1, sales_person s2, department d
where s1.sales_id = s2.sales_id
and s2.dept_id = d.dept_code
group by department_name

If the requirement was to obtain the value of sales attributable to each site, then the query in Query Listing 4.3 could be used:
Listing 4.3 Total sales by site.
Select address, sum(sales_value)
from sales s1, sales_person s2, department d, site s3
where s1.sales_id = s2.sales_id
and s2.dept_id = d.dept_code
and d.site = s3.site_code
group by address

The result sets from these queries would contain the sum of the sales value grouped by salesperson, department, and site.
The results will always be accurate so long as there are no changes in the relationships between the entities. However, as previously shown, changes in the dimensions are quite common. As an example, if Sneezy were to transfer from department "SW" to department "NW," the relationship between his salesperson entity and the department entity will have changed. If the same three queries are executed again, the results will be altered.

The results of the first query in Query Listing 4.1, which is at salesperson level, will be the same as before because the sales made by Sneezy are still attributed to him. However, in Query Listing 4.2, which is at the department level, all sales that Sneezy was responsible for when he worked in department "SW" will in future be attributed to department "NW." This is clearly an invalid result. The result from the query in Query Listing 4.3, which groups by site address, will still be valid because, although Sneezy moved from SW department to NW department, both SW and NW reside at the same address, Bristol. If Sneezy had moved from SW to SE or NE, then the Listing 4.3 results would be incorrect as well.

The example so far has focused on how time alters the cardinality of relationships. There is, equally, an effect on some attributes. If we look back at the salesperson entity in the example, there is an attribute called "Grade." This is meant to represent the sales grade of the salesperson. If we want to measure the performance of salespeople by comparing volume of sales against grades, this could be achieved by the following query:
Select grade, sum(sales_value)
from sales s1, sales_person s2
where s1.sales_id = s2.sales_id
group by grade

If any salesperson has changed their grade during the period covered by the query, then the results will be inaccurate because all their sales will be recorded against their current grade. In order to produce an accurate result, the periods of validity of the salesperson's grades must be kept. This might be achieved by the introduction of another intersection entity.

If no action is taken, the database will produce inaccurate results. Whether the level of inaccuracy is acceptable is a matter for the directors of the organization to decide. Over time, however, the information would become less and less accurate, and the value of the information is likely to become increasingly questionable. How do the business people know which are the queries that return accurate results and, more importantly, which ones are suspect? Unfortunately for our users, there is no way of knowing.
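For the grade example, such an intersection entity would hold the periods of validity, allowing each sale to be credited to the grade in force when the sale was made. A sketch, assuming the fact table carries a sale date; the table and column names here are invented for the illustration:

create table salesperson_grade
( sales_id   integer not null,
  grade      char(1) not null,
  start_date date    not null,
  end_date   date                -- null for the current grade
);

select g.grade, sum(s1.sales_value)
from   sales s1, salesperson_grade g
where  s1.sales_id = g.sales_id
and    s1.sale_date >= g.start_date
and    (g.end_date is null or s1.sale_date < g.end_date)
group by g.grade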
The Time Dimension

The time dimension is a special dimension that contains information about times. For every possible time that may appear in the fact table, an entry must exist in the time dimension table. This time attribute is the primary key to the time dimension. The non-key attributes are application specific and provide a method for grouping the discrete time values. The groupings can be anything that is of interest to the organization. Some examples might be:

Day of week
Weekend
Early closing day
Public holidays/bank holidays
24-hour opening day
Weather conditions
Week of year
Month name
Financial month
Financial quarter
Financial year

Some of the groupings listed above could be derived from date manipulation functions supplied by the database management system, whereas others, clearly, cannot.
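A minimal sketch of such a time dimension table, with one row per day; the grouping columns shown are examples only, not a fixed list:

create table time
( time_code         integer not null primary key,
  calendar_date     date    not null,
  day_of_week       char(9) not null,
  weekend           char(1) not null,   -- 'Y' or 'N'
  public_holiday    char(1) not null,
  week_of_year      integer not null,
  month             integer not null,   -- e.g., 200011, as used in queries later in this chapter
  financial_month   integer not null,
  financial_quarter integer not null,
  financial_year    integer not null
);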
The Effect of Causal Changes to Data

Upon examination, it appears that some changes are causal in nature, in that a change to the value of one attribute implies a change to the value of some other attribute in the schema. The extent of causality will vary from case to case, but the designer must be aware that a change to the value of a particular attribute, whose historical values have low importance to the organization, may cause a change to occur in the value of another attribute that has much greater importance. While this may be true in all systems, it is particularly relevant to data warehousing because of the disparate nature of the source systems that provide the data used to populate the warehouse.

It is possible, for instance, that the source system containing customer addresses may not actually hold information about sales areas. The sales area classification may come from, say, a marketing database or some kind of demographic data. Changes to addresses, which are detected in the operational database, must be
implemented at exactly the same time as the change to the sales area codes. Acknowledgment of the causal relationship between attributes is essential if accuracy and integrity are to be maintained. In the logical model it is necessary to identify the dependencies between attributes so that the appropriate physical links can be implemented.
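As an illustration, where a change of address implies a change of sales area, the two updates can be wrapped in a single transaction so that no query ever sees one without the other. This is a sketch assuming a simple overwrite style of maintenance; the sales area code shown is invented:

update customer
set    customer_address = '44 Sea View Terrace, West Bay, Bridport, Dorset',
       sales_area_code  = 'A21'   -- the causally dependent attribute changes in step
where  customer_code = 1136;

commit;  -- both changes become visible together, or not at all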
CAPTURING CHANGES

Let's now examine how changes are identified in the source systems and can be subsequently captured into the data warehouse, and the problems that can occur.

Capturing Behavior

As has been previously stated, the behavioral facts relate to the business transactions of the organization. Facts are usually derived from some entity having been "frozen" and captured at a particular status in its lifecycle. The process by which this status is achieved is normally triggered by an event.

What do we mean by the term event? There are two ways of considering the definition of an event. If the data warehouse is viewed in isolation so that the facts that it records are not perceived as related to the source systems from which they were derived, then they can be viewed purely as events that occurred at a single point in time. If, however, the data warehouse is perceived as part of the "enterprise" database systems, then the facts should be viewed within the wider context, and they become an entity preserved at a "frozen" state, having been triggered by an event. Either way, the distinguishing feature of facts is that they do not have a lifespan. They are associated with just one time attribute. For the purpose of clarity the following definition of facts will be adopted:

A fact is a single state entity that is created by the occurrence of some event.

In principle, the processes involved in the capture of behavior are relatively straightforward. The extraction of new behavioral facts for insertion into the data warehouse is performed on a periodic, very often daily, basis. This tends to occur during the time when the operational processing systems are not functioning. Typically this means during the overnight "batch" processing cycle. The main benefit of this approach is that all of the previous day's data can be collected and transferred at one time. The process of identifying the facts varies from one organization to another, from being very easy to almost impossible to accomplish. For instance, the fact data may come from:

Telephonic network switches or billing systems, in the case of telecommunications companies
Order processing systems, as in the case of mail order companies such as the Wine Club
Cash receipts, in the case of retail outlets

Once the facts have been identified, they are usually stored into sequential files or streams that are appended to during the day. As the data warehouse is usually resident on a hardware platform that is separate from the operational system, the files have to be moved before they can be processed further. The next step is to validate and modify each record to ensure that it conforms to the format and semantic integration rules that were described in Chapter 2. The actual loading of the data is usually performed using the "bulk" load utility that most database management systems provide.
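In outline, the nightly load often follows a staging pattern: bulk-load the day's file into a staging table, validate it in SQL, and then append it to the fact table. A sketch, with an invented staging table name and example validation rules:

-- after the extract file has been bulk-loaded into sales_staging
-- using the DBMS's bulk loader (SQL*Loader, bcp, COPY, or similar):
insert into sales (customer_code, wine_code, time_code, quantity, value)
select customer_code, wine_code, time_code, quantity, value
from   sales_staging
where  customer_code is not null   -- example format/integrity rules
and    quantity > 0;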
Once recorded, the values of fact attributes never change, so they should be regarded as single state or stateless. There is a time element that applies to facts, but it is simply the time that the event occurred. It is usually implemented in the form of a single timestamp. The timestamp will vary, in granularity, from one application to another. For instance, in the Wine Club, the timestamp of a sale records the date of the sale. In a telecommunications application, the timestamp would record not only the date but also the hour, minute, and second that the call was placed.

Capturing Circumstances

The circumstances and dimensions are derived from what has been referred to as the reference entities within the organization. This is information such as customer, product, and market segment. Unlike the facts in a data warehouse, this type of information does have a lifespan. For instance, products may have various states during their lifespan, from new to fast moving to slow moving to discontinued to deleted.

The identification and capture of new or changed dimensional information are usually quite different from the capture of facts. For instance, it is often the case that customer details are captured in the operational systems some time after the customer starts using the services of the organization. Also, the date at which the customer is enrolled as a customer is often not recorded in the system. Neither is the date when they cease to be a customer. Similarly, when a dimensional attribute changes, such as the address of a customer, the new address is duly recorded in such a way as to replace the existing address. The date of the change of address is often not recorded. The dates of changes to other dimensional attributes are also, usually, not recorded. This is only a problem if the attribute concerned is one for which there is a requirement to record the historic values faithfully. In the Wine Club, for instance, the following example attributes need to have their historic values preserved:

Customers' addresses
Customers' sales areas
Wine sales prices and cost prices
Suppliers of wines
Managers of sales areas

If the time of the change is not recorded in the operational systems, then it is impossible to determine the valid time that the change occurred. Where the valid time of a change is not available, it may be appropriate to try to ascertain the transaction time of the change event. This would be the time that the change was recorded in the database, as opposed to the time the change actually occurred. However, in the same way that the valid time of changes is not recorded, the transaction time of changes is usually not recorded explicitly as part of the operational application. In order for the data warehouse to capture the time of the changes, there are methods available that can assist us in identifying the transaction times:
Make changes to the operational systems. The degree to which this is possible is dependent on a number of factors. If the system has been developed specifically for the organization, either by the organization's own IT staff or by some third party, then as long as the skills are available and the costs and timescales are not prohibitive, the operational systems can be changed to accommodate the requirements of the data warehouse. Where the application is a standard package product, it becomes very much more difficult to make changes to the system without violating commercial agreements covering such things as upgrades and maintenance. If the underlying database management system supporting the application is relational, then it is possible to capture the changes by the introduction of such things as database triggers. Experience shows that most organizations are reluctant to alter operational applications in order to service informational systems requirements, for reasons of cost and complexity. Also, the placing of additional processing inside operational systems is often seen as a threat to the performance of those systems.

Interrogation of audit trails. Some operational applications maintain audit trails to enable changes to be traced. Where these exist, they can be a valuable source of information to enable the capture of transaction time changes.

Interrogation of DBMS log files. Most database management systems maintain log files for system recovery purposes. It is possible, if the right skills are available, to interrogate these files to identify changes and their associated transaction times. This practice is discouraged by the DBMS vendors, as log files are intended for internal use by the DBMS. If the files are damaged by unauthorized access, the ability of the DBMS to perform a recovery may be compromised. Also, the DBMS vendors always reserve the right to alter the format of the log files without notice. If this happens, processes that have been developed to capture changes may stop working or may produce incorrect results. Obviously, this approach is not available to non-DBMS applications.

File comparison. This involves the capture of an entire file, or table, of dimensional data and the copying of the file so that it can be compared to the data already held in the data warehouse. Any changes that are identified can then be applied to the warehouse. The time of the change is taken to be the system time of the detection of the change, that is, the time the file comparison process was executed.

Experience shows that the file comparison technique is the one most frequently adopted when data warehouses are developed. It is the approach that has the least impact on the operational environment, and it is the least costly to implement. It should also be remembered that some dimensions are created by the amalgamation of data from several operational systems and some external systems. This will certainly exacerbate an already complex problem. Where the dimensions in a dimensional model are large (some organizations have several million customers), the capture of the data followed by the transfer to the data warehouse environment and subsequent comparison is a process that can be very time-consuming. Consequently, most organizations place limits on the frequency with which this process can be executed. At best, the frequency is weekly. The processing can then take place over the weekend, when the systems are relatively quiet and the extra processing required to perform this exercise can be absorbed without too much of an impact on other processing.
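Where the DBMS supports a relational set-difference operator, the comparison step itself can be expressed compactly. A sketch, assuming the newly extracted copy is held in customer_extract and the warehouse's current view of the dimension in customer_current:

-- rows that are new, or whose attributes have changed, since the last run
select * from customer_extract
except                             -- MINUS in Oracle
select * from customer_current;

-- the run date of this process, not the true date of the change, is the
-- only transaction time available to record in the warehouse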
Many organizations permit only monthly updates to the dimensional data, and some are even less frequent than that. The problem is that the only transaction time available, against which the changes can be recorded, is the date upon which the change was discovered (i.e., the file comparison date). So, for example, let us assume that the frequency of comparison is monthly and the changes are captured at the end of the month. If a customer changes address, and geographic region, at the beginning of the month, then any facts recorded for the customer during the month will be credited permanently to the old, incorrect region.

It is possible that, during a single month, more than one change will occur to the same attribute. If the data is collected by the file comparison method, the only values that will be captured are those that are in existence at the time of capture. All intermediate changes will be missed completely. The degree to which this is a problem will vary from application to application. It is accepted that, in general, valid time change capture for dimensions is not, practically speaking, realistic. However, it is important that practitioners recognize the issue and try to keep the difference between transaction time and valid time as small as possible.

The fact that some data relating to time as well as other attributes is found to be absent from the source systems can come to dominate data warehouse developments. As a result of these problems, the extraction of data is sometimes the longest and riskiest part of a data warehouse project.

Summary of the Problems Involving Time

So far in this chapter we have seen that maintaining accuracy in a data warehouse presents a challenging set of problems, which are summarized below:

1. Identifying and capturing the temporal requirements. The first problem is to identify the temporal requirements. There is no method to do this currently. The present data warehousing modeling techniques do not provide any real support for this.

2. Capture of dimensional updates. What happens when a relationship changes (e.g., a salesperson moves from one department to another)? What happens when a relationship no longer exists (e.g., a salesperson leaves the company)? How does the warehouse handle changes in attribute values (e.g., a product was blue, now it is red)? Is there a need to be able to report on its sales performance when it was red or blue, as well as for the product throughout the whole of its lifecycle?

3. The timeliness of capture. It now seems clear that absolute accuracy in a data warehouse is not a practical objective. There is a need to be able to assess the level of inaccuracy so that a degree of confidence can be applied to the results obtained from queries.

4. Synchronization of changes. When an attribute changes, a mechanism is required for identifying dependent attributes that might also need to be changed. The absence of synchronization affects the credibility of the results.
We have seen that obtaining the changed data can involve complex processing and may require sophisticated design to implement in a way that provides for both accuracy of information and reasonable performance.

Also in this chapter we have explored the various problems associated with time in data warehousing. Some of these problems are inherent in the standard dimensional model, but it is possible to overcome them by making changes to the way dimensional models are designed. Some of the problems relate to the way data warehouses interact with operational systems. These problems are more difficult to solve and, sometimes, impossible to solve. Nevertheless, data warehouse designers need to be fully aware of the extent of the problems and familiar with the various approaches to solving them. These are discussed in the coming chapters.

The biggest set of problems lies in the areas of the capture and accurate representation of historical information. The problem is most difficult when changes occur in the lifespan of dimensions and the relationships within dimensional hierarchies, and also where attributes change their values and there is a requirement to faithfully reflect those changes through history.

Having exposed the issues and established the problems, let's have a look at some of the conventional ways of solving them.
FIRST-GENERATION SOLUTIONS FOR TIME

We now go on to describe some solutions to the problems of the representation of time that have been used in first-generation data warehouses.

One of the main problems is that the business requirements with respect to time have not been systematically captured at the conceptual level. This is largely because we are unfamiliar with the temporal semantics, having not previously encountered temporal applications. Logical models systematically follow conceptual models, and so a natural consequence of the failure to define the requirements in the conceptual model is that the requirements are also absent from the logical and physical implementation. As a result, practitioners have subsequently found themselves faced with problems involving time, and some have created solutions. However, the solutions have been developed on a somewhat ad hoc basis and are by no means comprehensive. The problem is sufficiently large that we really do need a rigorous approach to solving it.

As an example of the scale of the problem, there is, as previously mentioned, evidence in a government-sponsored housing survey that, in the United Kingdom, people change their addresses, on average, every 10 years. This means that an organization can expect to have to implement address details changes for about 10 percent of its customers each year. Over a 10-year period, if an organization has one million customers, it can expect to have to deal with one million changes of address. Obviously, some people will not move, but others will move more than once in that period. This covers only address changes. There are other attributes relating to customers that will also change, although perhaps not with the same frequency as addresses.

One of the major contributors to the development of solutions in this area is Ralph Kimball (1996). His collective term for changes to dimensional attributes is slowly changing dimensions. The term has become well known within the data warehouse industry and has been adopted generally by practitioners. He cites three methods of tracking changes to dimensional attributes with respect to time. These he calls simply Types 1, 2, and 3. Within the industry, practitioners are generally aware of the three types and, where any support for time is provided in dimensional models, these are the approaches that are normally used. It is common to refer to products and methods as being consistent with Kimball's Type 1, 2, or 3. In a later work, Kimball (1998) recognizes a Type 0 that represents dimensions that are not subject to change.
The Type 1 Approach

The first type of change, known as Type 1, is to replace the old data values with the new values. This means that there is no need to preserve the previous value. The advantage of this approach, from a system perspective, is that it is very easy to implement. Obviously there is no temporal support being offered in this solution. However, this method sometimes offers the most appropriate solution. We
don't need to track the historical values of every single database element and, sometimes, overwriting the old values is the right thing to do. In the Wine Club example, attributes like the customer's name are best treated in this way. This is an attribute for which there is no requirement to retain historical values. Only the latest value is deemed by the organization to be useful. All data warehouse applications will have some attributes where the correct approach is to overwrite the previous values with the new values.

It is important that the updating is effected on a per attribute basis rather than a per row basis. Each table will have a mixture of attributes, some of which will require the Type 1 replacement approach, while others will require a more sophisticated approach to the treatment of value changes over time. The worst scenario is a full table replacement approach where the dimension is, periodically, completely overwritten. The danger here is that any rows that have been deleted in the operational system may be deleted in the data warehouse. Any rows in the fact table that refer to rows in the dimension that have been deleted will cause a referential integrity violation and will place the database in an invalid state. Thus, the periodic update of dimensions in the data warehouse must involve only inserts and updates. Any logical deletions (e.g., where a customer ceases to be a customer) must be processed as updates in the data warehouse. It is important to know whether a customer still exists as a customer, but the customer record must remain in the database for the whole lifespan of the data warehouse or, at the very least, as long as there are fact table records that refer to the dimensional record.

Because Type 1 is the simplest approach, it is often used as the default approach. Practitioners will sometimes adopt a Type 1 solution as a short-term expedient, where the application really requires a more complete solution, with the intention of providing proper support for time at a later stage in the project. Too often, the pressures of project budgets and implementation deadlines force changes to the scope of projects, and the enhancements are abandoned. Sometimes, Type 1 is adopted due to inadequate analysis of the requirements with respect to time.
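In SQL terms, a Type 1 change is nothing more than an in-place update. A sketch, using the case study's customer table and an invented new name:

-- Type 1: overwrite the attribute; the old value is lost
update customer
set    customer_name = 'L. Jones-Smith'
where  customer_code = 1136;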
The Type 2 Approach

The second solution to slowly changing dimensions is called Type 2. Type 2 is a more complex solution than Type 1 and does attempt to faithfully record historical values of attributes by providing a form of version control. Type 2 changes are best explained with the use of an example.

In the case study, the sales area in which a customer lives is subject to change when they move. There is a requirement to faithfully reflect regional sales performance over time. This means that the sales area prevailing at the time of the sale must be used when analyzing sales. If the Type 1 approach was used when recording changes to the sales area, the historical values will appear to have the same sales area as current values. A method is needed, therefore, that enables us to reflect history faithfully.

The Type 2 method attempts to solve this problem by the creation of new records. Every time an attribute's value changes, if faithful recording of history is required, an entirely new record is created with all the unaffected attributes unchanged. Only the affected attribute is changed to reflect its new value. The obvious problem with this approach is that it would immediately compromise the
uniqueness property of the primary key, as the new record would have the same key as the previous record. This can be turned into an advantage by the use of surrogate keys. A surrogate key is a system-generated identifier that introduces a layer of indirection into the model. It is a good practice to use surrogate keys in all the customer and dimensional data. The main reason for this is that the production key is subject to change whenever the company reorganizes its customers or products, and this would cause unacceptable disruption to the data warehouse if the change had to be carried through. It is better to create an arbitrary key to provide the property of uniqueness. So each time a new record is created, following a change to the value of an attribute, a new surrogate key is assigned to the record. Sometimes, a surrogate approach is forced upon us when we are attempting to integrate data from different source systems where the identifiers are not the same.

There are two main approaches to assigning the value of the surrogate:
1. The identifier is lengthened by a number of version digits. So a customer having an identifier of “1234” would subsequently have the identifier “1234001.” After the first change, a new row would be created that would have an identifier of “1234002.” The customer would now have two records in the dimension. Most of the attribute values would be the same. Only the attribute, or attributes, that had changed would be different.
2. The identifier could be truly generalized and bear no relation to the previous identifiers for the customer. So each time a new row is added, a completely new identifier is created.

In a behavioral model, the generalized key is used in both the dimension table and the fact table. Constraining queries using a descriptive attribute, such as the name of the customer, will result in all records for the customer being retrieved. Constraining or grouping the results by use of the name and, say, the sales area attribute will ensure that history is faithfully reflected in the results of queries, assuming, of course, that uniqueness in the descriptive attribute can be guaranteed. The Type 2 approach, therefore, will ensure that the fact table will be correctly joined to the dimension and the correct dimensional attributes will be associated with each fact. Ensuring that the facts are matched to the correct dimensional attributes is the main concern.

An obvious variation of this approach is to construct a composite identifier by retaining the previous number "1234" and adding a new counter or "version" attribute that, initially, would be "1." This approach is similar to approach 1, above. The initial identifier is allocated when the customer is first entered into the database. Subsequent changes require the allocation of new identifiers. It is the responsibility of the data warehouse administrator to control the allocation of identifiers and to maintain the version number in order to know which version number, or generalized key, to allocate next.

In reality, the requirement would be to combine Type 1 and Type 2 solutions in the same logical row. This is where we have some attributes that we do want to track and some that we do not. An example of this occurs in the Wine Club where, in the customer's circumstances, we wish to trace the history of attributes like the address and, consequently, the sales area, but we are not interested in the history of the customer's name or their hobby code. So in a single logical row, an attribute like address would need to be treated as Type 2, whereas the name would be treated as Type 1. Therefore, if the customer's name changes, we would wish to overwrite it. However, there may be many records in existence for this customer, due to previous changes to other attributes. Do we have to go back and overwrite the previous records? In practice, it is likely that only the latest record would be updated. This implies that, in dimensions where Type 2 is implemented, attributes for which the Type 1 approach would be preferred will be forced to adopt an approach that is nearly, but not quite, Type 2.
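In SQL terms, a Type 2 change under the composite-identifier variation might be applied as follows. This is a sketch only: the version attribute is called counter here to match a later example, and the new address and sales area values are invented for the Lucie Jones illustration that follows.

-- Type 2: overwrite nothing; create a new version of the logical row
insert into customer
       (customer_code, counter, customer_name, customer_address, sales_area_code, hobby_code)
select customer_code,
       counter + 1,                                        -- next version number
       customer_name,
       '44 Sea View Terrace, West Bay, Bridport, Dorset',  -- the changed attribute
       'A21',                                              -- causally linked sales area (invented code)
       hobby_code
from   customer
where  customer_code = 1136
and    counter = (select max(counter) from customer
                  where customer_code = 1136);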
In describing the problems surrounding time in data warehousing, we saw how the results of a query could change due to a customer moving. The approach taken was to simply overwrite the old addresses and the sales area codes with the new values. This is equivalent to Kimball's Type 1 approach. If we implement the same changes using the Type 2 method, the results would not be disrupted, as new records would be created with a new surrogate identifier. Future insertions to the sales fact table will be related to the new identifying customer codes, and so the segmentation will remain consistent with respect to time for the purposes of this particular query.

One potential issue here is that, by making use of generalized keys, it becomes impossible to recognize individual customers by use of the identifying attribute. As each subsequent change occurs, a new row is inserted and is identified by a key value that is in no way associated with the previous key value. For example, Lucie Jones's original value for the customer code attribute might be, say, 1136, whereas the value of the customer code for the newly inserted row could be anything, say, 8876, being the next available key in the domain range. This means that, if it was required to extract information on a per customer basis, the grouping would have to be on a nonidentifying attribute, such as the customer's name, that is:
select customer_name "Name", sum(quantity) "Bottles", sum(value) "Revenue"
from sales s, customer c
where c.customer_code = s.customer_code
group by customer_name
order by customer_name

Constraining and grouping queries using descriptive attributes like names is clearly risky, since names are apt to be duplicated and erroneous results could be produced. Another potential issue with this approach is that, if the keys are truly generalized, as with key hashing, it may not be possible to identify the latest record by simply selecting the highest key. Also, the use of generalized keys means that obtaining the history of, say, a customer's details may not be as simple as ordering the keys into ascending sequence.

One solution to this problem is the addition of a constant descriptive attribute, such as the original production key, that is unique to the logical row. Alternatively, a variation as previously described, in which the original key is retained but is augmented by an additional attribute to define the version,
would also provide the solution to this.

The Type 2 method does not allow the use of date columns to identify when changes actually take place. For instance, this means that it is not possible to establish with any accuracy when a customer actually moved. The only date available to provide any clue to this is the transaction date in the fact table. There are some problems associated with this. A query such as "List the names and addresses of all customers who have purchased more than twelve bottles of wine in the last three months" might be useful for campaign purposes. Such a query will, however, result in incorrect addresses being returned for those customers who have moved but have not since placed an order. The query in Query Listing 4.4 shows this.
Listing 4.4 Query to produce a campaign list.
select c.customer_code, customer_name, customer_address, sum(quantity)
from customer c, sales s, time t
where c.customer_code = s.customer_code
and s.time_code = t.time_code
and t.month in (200010, 200011, 200012)
group by c.customer_code, customer_name, customer_address
having sum(quantity) > 12

Table 4.1 shows a subset of the result set from the query in Listing 4.4.
Table 4.1. List of Customers to be Contacted

Customer Code   Customer Name    Customer Address                           Sum (Quantity)
1332            A.J. Gordon      82 Milton Ave, Chester, Cheshire           49
1315            P. Chamberlain   11a Mount Pleasant, Sunderland             34
2131            Q.E. McCallum    32 College Ride, Minehead, Somerset        14
1531            C.D. Jones       71 Queensway, Leeds, Yorks                 31
1136            L. Jones         9 Broughton Hall Ave, Woking, Surrey       32
2141            J.K. Noble       79 Priors Croft, Torquay, Devon            58
4153            C. Smallpiece    58 Ballard Road, Bristol                   21
1321            D. Hartley       88 Ballantyne Road, Minehead, Somerset     66
The row for customer code 1136 is an example of the point. Customer L. Jones has two entries in the database. Because Lucie has not purchased any wine since moving, the incorrect address was returned by the query. The result of a simple search is shown in Table 4.2.
Table 4.2. Multiple Records for a Single Customer

Customer Code | Customer Name | Customer Address
1136          | L. Jones      | 9 Broughton Hall Ave, Woking, Surrey
8876          | L. Jones      | 44 Sea View Terrace, West Bay, Bridport, Dorset
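To make the mechanism concrete, the second row in Table 4.2 could have been produced by a change-handling step along the following lines. This is only a sketch: the key value 8876 is simply assumed to be the next available surrogate key, and the column list is abbreviated for the illustration.

insert into customer (customer_code, customer_name, customer_address)
values (8876, 'L. Jones', '44 Sea View Terrace, West Bay, Bridport, Dorset');
-- No update is issued against the existing row 1136; under Type 2,
-- history is preserved by adding rows, never by overwriting them.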
If it can be assumed that the generalized key is always ascending, then the query could be modified, as in the following query, to select the highest value for the key.
select customer_code, customer_name, customer_address
from customer
where customer_code = (select max(customer_code)
                       from customer
                       where customer_name = 'L. Jones')

This query would return the second of the two rows listed in Table 4.2. Using the other technique to implement the Type 2 method, we could have altered the customer code from “1136” to “113601” for the original row and, subsequently, to “113602” for the new row containing the changed address and sales area. In order to return the correct addresses, the query in Listing 4.5 has to be executed.
Listing 4.5 Obtaining the latest customer's details using Type 2 with extended identifiers.
select c1.customer_code, customer_name, customer_address, sum(quantity)
from customer c1, sales s, time t
where c1.customer_code = s.customer_code
and c1.customer_code = (select max(customer_code)
                        from customer c2
                        where substr(c1.customer_code,1,4) = substr(c2.customer_code,1,4))
and s.time_code = t.time_code
and t.month in (200010, 200011, 200012)
group by c1.customer_code, customer_name, customer_address
having sum(quantity) > 12

The query in Listing 4.5 is another correlated subquery and contains the following line:
where substr(c1.customer_code,1,4) = substr(c2.customer_code,1,4)

The query matches the generic parts of the customer code by use of a “substring” function, provided
by the query processor. It is suspected that this type of query may be beyond the capability of most users. This approach also depends on all the codes having the same fundamental format, that is, four digits plus a suffix. If the codes instead ranged from 1 through 9,999, this technique could not be adopted, because the substring function would not produce the right answer. The obvious variation on the above approach is to add an extra attribute to distinguish versions. The identifier then becomes the composite of two attributes instead of a single attribute. In this case, the original attribute remains unaltered, and the new attribute is incremented, as shown in Table 4.3.
Table 4.3. A Modification to Type 2 Using Composite Identifiers

Customer Code | Version Number
1136          | 01
1136          | 02
1136          | 03
Using this technique, the following query is executed:
select c1.customer_code, customer_name, customer_address, sum(quantity)
from customer c1, sales s, time t
where c1.customer_code = s.customer_code
and s.counter = c1.counter
and c1.counter = (select max(counter)
                  from customer c2
                  where c1.customer_code = c2.customer_code)
and s.time_code = t.time_code
and t.month in (200010, 200011, 200012)
group by c1.customer_code, customer_name, customer_address
having sum(quantity) > 12

The structure of this query is the same as in Listing 4.5. However, this approach does not require the use of substrings to make the comparison. This means that the query will always produce the right answer irrespective of the consistency, or otherwise, of the encoding procedures within the organization. These solutions do not resolve the problem of pinpointing when a change occurs. Due to the absence of dates in the Type 2 method, it is impossible to determine precisely when changes occur. The only way to extract any form of alignment with time is via a join to the fact table. This, at best, will give an approximate time for the change. The degree of accuracy is dependent on the frequency of fact table entries relating to the dimensional entry concerned. The more frequent the entries in the fact table, the more accurate will be the traceability of the history of the dimension, and vice versa. It is also not
possible to record gaps in the existence of dimensional entries. For instance, in order to track precisely the discontinuous existence of, say, a customer, there must be some kind of temporal reference to record the periods of existence.
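One possible shape for such a temporal reference, sketched here purely as an illustration (the table and column names are not part of the case study schema), is a separate existence table holding one row per period of activity:

create table customer_existence
( customer_code integer not null,
  start_date    date    not null,
  end_date      date            -- null while the period of activity is still open
);
-- A customer who lapses and later rejoins acquires a second row, so a
-- discontinuous lifespan is recorded directly rather than inferred.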
Problems With Hierarchies
So far in this chapter, attention has focused on so-called slowly changing dimensions and how these might be supported using the Type 2 solution. Now we will turn our attention to what we shall call slowly changing hierarchies. As an example, we will use the dimensional hierarchy illustrated in Figure 3.4. The attributes, using a surrogate key approach, are as follows:

Sales_Area(Sales_Area_Key, Sales_Area_Code, Manager_Key, Sales_Area_Name)
Manager(Manager_Key, Manager_Code, Manager_Name)
Customer(Customer_Key, Customer_Code, Customer_Name, Customer_Address, Sales_Area_Key, Hobby_Code, Date_Joined)

Let's say the number of customers and the spread of sales areas in the case study database is as shown in Table 4.4.
Table 4.4. Customers Grouped by Sales Area

Sales Area | Count
North East | 18,967
North West | 11,498
South East | 39,113
South West | 28,697
We will assume that we have implemented the Type 2 solution to slowly changing dimensions. If sales area SW were to experience a change of manager from M9 to M12, a new sales area record would be inserted with an incremented counter, together with the new manager code. So if the previous record was (1, SW, M9, “South West”), the new record, with its new key, assumed to be 5, would contain (5, SW, M12, “South West”). However, that is not the end of the change. Each of the customers in the “SW” sales area still has a foreign key reference pointing to the original sales area record containing the reference to the old manager (surrogate key 1). Therefore, we also have to create an entire set of new records for those customers, with each of their sales area key values set to “5”. In this case, there are 11,498 new records to be created. It is not valid to simply update the foreign keys with the new value, because the old historical link would be lost.
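The following sketch shows the shape of the work involved, following the attribute lists given above. The next_key placeholder stands for whatever surrogate-key allocator is in use (a sequence, for example); it is not a standard SQL construct, and the column names are assumptions based on the example tuples.

-- Insert the new version of the sales area with the new manager.
insert into sales_area (sales_area_key, sales_area_code, manager_key, sales_area_name)
values (5, 'SW', 'M12', 'South West');

-- Re-insert every customer currently pointing at the old version (key 1),
-- redirecting the foreign key to the new version (key 5).
insert into customer (customer_key, customer_code, customer_name, customer_address,
                      sales_area_key, hobby_code, date_joined)
select next_key, customer_code, customer_name, customer_address,
       5, hobby_code, date_joined
from customer
where sales_area_key = 1;   -- 11,498 rows in this example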
Where there are complex hierarchies involving more levels and more rows, it is not difficult to imagine very large volumes of inserts being generated. For example, in a four-level hierarchy where the relationship is just 1:100 in each level, a single change at the top level will cause over a million new records to be inserted (100 new rows at the second level, 10,000 at the third, and 1,000,000 at the fourth). A relationship of 1:100 is not inordinately high when there are many data warehouses in existence with several million customers in the customer dimension alone. The number of extraneous insertions generated by this approach could cause the dimension tables to grow at a rate that, in time, becomes a threat to performance. For the true star schema advocates, we could try flattening the hierarchies into a single dimension (de-normalizing). This converts a snowflake schema into a star schema. If this approach is taken, the effect is that, in the four-level 1:100 example, the number of insertions reduces from 1.01 million to 1 million. So reducing the number of insertions is not a reason for flattening the hierarchy.
Browse Queries
The Type 2 approach does cause some problems when it comes to browsing. It is generally reckoned that some 80 percent of data warehouse queries are dimension-browsing queries. This means that they do not access any fact table. A typical browse query we might wish to perform is to count the number of occurrences. For instance, how many customers do we have? The standard way to do this in SQL is shown in the following query:
select count(*)
from <table>
where <predicate>

In a Type 2 scenario, this will produce the wrong result. This is because for each logical record, there are many physical records that result in a number of “duplicated” rows. Take the example of a sales exec entity and a customer entity shown in Figure 4.8.
Figure 4.8. Simple general business hierarchy.
In order to count the number of customers that a sales exec is responsible for, a nonexpert user might express the query as shown in Listing 4.6.
Listing 4.6 Nonexpert query to count customers.
select count(*)
from Sales_Exec S, Customer C
where S.SalesExecNum = C.SalesExecNum
and S.Name = 'Tom Sawyer'

When using the Type 2 approach to allow for changes to a customer attribute, this will produce the wrong result. This is because for each customer there may be many rows, resulting from the duplicates created when an attribute value changes. With more careful consideration, it would seem that the query should instead be expressed as follows:
select count(distinct <column>)
from <table>
where <predicate>

In our example, it translates to the following query:
select count(distinct CustNum)
from Sales_Exec S, Customer C
where S.SalesExecNum = C.SalesExecNum
and S.Name = 'Tom Sawyer'

Unfortunately, this query does not give the correct result either, because the result set contains all the customers that this sales executive has ever been responsible for at any time in the past. The query includes the customers that are no longer the responsibility of Tom Sawyer. When comparing sales executives, this would result in customers being double counted. On further examination, it might appear that this problem can be resolved by selecting the latest entry for each customer to ensure that they are counted as a customer for only their current sales executive. Assuming an incremented general key, this can be expressed by the following query:
select count(*)
from Sales_Exec S, Customer C1
where S.SalesExecNum = C1.SalesExecNum
and S.Name = 'Tom Sawyer'
and C1.Counter = (select max(Counter)
                  from Customer C2
                  where C1.CustNum = C2.CustNum)

In fact, this query also gives invalid results because it still includes customers that are no longer the responsibility of Tom Sawyer. That is, the count includes customers who are not currently the responsibility of any sales executive, because they are no longer customers, but whose records must remain because they are referenced by rows of the fact table.
This example should be seen as typical of data warehouses, and the problem just described is a general one. That is, using the Type 2 approach, there are simple dimensional queries that cannot be answered correctly. It is reasonable to conclude that the Type 2 approach does not fully support time in the dimensions. Its purpose, simply, is to ensure that fact entries are related to the correct dimensional entries such that each fact, when joined to any dimension, displays the correct set of attribute values.
Row Timestamping
In a much later publication, The Data Warehouse Lifecycle Toolkit, Kimball (1998) recognizes the problem of dimensional counts and appears to have changed his mind about the use of dates. His solution is that the Type 2 method is “embellished” by the addition of begin and end timestamps to each dimension record. This approach, in temporal database terms, is known as row timestamping. Using this technique, Kimball says it is possible to determine precisely how many customers existed at any point in time. The query that achieves this is shown here:
select count(*)
from Sales_Exec S, Customer C1
where S.SalesExecNum = C1.SalesExecNum
and S.Name = 'Tom Sawyer'
and C1.EndDate is NULL

For the sake of example, the null value in the end date is assumed to represent the latest record, but other values could be used, such as the maximum date that the system will accept, for example, 31 December 9999. In effect, this approach is similar to the previous example because it does not necessarily identify ex-customers. So instead of answering the question “How many customers is Tom Sawyer responsible for?” it may be answering “How many customers has Tom Sawyer ever been responsible for?” This method will produce the correct result only if the end date is updated when the customer becomes inactive. The adoption of a row timestamping approach can provide a solution to queries involving counts at a point in time. However, it is important to recognize that the single pair of timestamps is being used in a multipurpose way to record:
1. Changes in the active state of the dimension
2. Changes to values in the attributes
3. Changes in relationships

Therefore, this approach cannot be used where there is a requirement to implement discontinuous existences where, say, a customer can become inactive for a period, because it is not possible to determine when they were inactive. The only way to determine inactivity is to try to identify two temporally consecutive rows where there is a time gap between the ending timestamp of the earlier
row and the starting timestamp of the succeeding row. This is not really practical using standard SQL. Even where discontinuous existences are not present, the use of row timestamping makes it difficult to express queries involving durations, because a single duration, such as the period of residence at a particular address or the period that a customer had continuously been active before closing their account, may be spread over many physical records. For example, a query to determine how many of the customers of the Wine Club had been customers for more than a year before leaving during 2001 could be expressed as follows:
select '2001' as year, count(*)
from customer c1, customer c2
where c1.start_date = (select min(c3.start_date)
                       from customer c3
                       where c3.customer_code = c1.customer_code)
and c2.end_date = (select max(c4.end_date)
                   from customer c4
                   where c4.customer_code = c2.customer_code)
and c2.end_date between '2001/01/01' and '2001/12/31'
and c1.customer_code = c2.customer_code
and c2.end_date - c1.start_date > 365
group by year

This query contains a self-join and two correlated subqueries, so the same table is used four times in the query. It is likely that the customer dimension is the one that would be subjected to this type of query most often, and organizations that have several million customers are, therefore, likely to experience poor browsing performance. This problem is exacerbated when row timestamping is used with dimensions that are engaged in hierarchies because, like the Type 2 solution, changes tend to cause cascades of extraneous rows to be inserted. The second example has a further requirement, which is to determine the length of time that a customer had resided at the same address. This requires the duration to be limited by the detection of change events on attributes of the dimension. This is not possible to do with absolute certainty because circumstances, having changed, might later revert to the previous state. For instance, students leave home to attend university but might well return to the previous address later. In the Wine Club, a supplier might be reinstated as the supplier of a wine they had previously supplied. This is, in effect, another example of a discontinuous existence and cannot be detected using standard SQL.
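To illustrate the difficulty, the sketch below shows the kind of query needed merely to find where a period of inactivity begins: a row whose end date is not met by the start date of any later row for the same customer. The date arithmetic is illustrative and varies between databases, and even this query says nothing about when, or whether, the existence resumed.

select c1.customer_code, c1.end_date
from customer c1
where c1.end_date is not null
and not exists
  (select 1
   from customer c2
   where c2.customer_code = c1.customer_code
   and c2.start_date = c1.end_date + 1)   -- no row "meets" this one
-- Each qualifying row marks the start of a gap; pairing it with the row
-- that eventually restarts the existence would require yet more joins.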
The Type 3 Approach

The third type of change solution (Type 3) involves recording the current value, as well as the original value, of an attribute. This means that an additional column has to be created to contain the extra
value. In this case, it makes sense to add an effective date column as well. The current value column is updated each time a change occurs. The original value column does not change. Intermediate values, therefore, are lost. In terms of its support for time, Type 3 is rather quirky and does not add any real value, and we will not be considering it further.
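For completeness, a Type 3 customer dimension might look like the following sketch. The extra columns shown are illustrative, not part of the case study schema; the essential point is that there is exactly one "old value" slot, so intermediate values are lost.

create table customer
( customer_code     integer,
  customer_name     varchar(40),
  customer_address  varchar(100),   -- current value; overwritten on each change
  original_address  varchar(100),   -- set when the row is created; never updated
  address_effective date            -- date the current value took effect
);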
TSQL2

TSQL2 has been proposed as a standard for the inclusion of temporal extensions to standard SQL but has not so far been adopted by any of the major RDBMS vendors. It introduces some functions that a temporal query language must have. In this book we are concerned with providing a practical approach that can be implemented using today's technology, so we cannot dwell on potential future possibilities. However, some of the functions are useful and can be translated into the versions of SQL that are available right now. These functions enable us to make comparisons between two periods (P1 and P2) of time (a period is any length of time; i.e., it has a start time and an end time). There are four main temporal functions:
1. Precedes. For this function to return “true,” the end time of P1 must be earlier than the start time of P2.
2. Meets. For this to be true, the end time of P1 must be one chronon earlier than the start time of P2, or vice versa. A chronon is the smallest amount of time allowed in the system. How small this is will depend on the granularity of time in use. It might be as small as a microsecond or as large as one day.
3. Overlaps. In order for this function to return “true,” any part of P1 must be the same as any part of P2.
4. Contains. The start of P1 must be earlier than the start of P2 and the end of P1 must be later than the end of P2. The diagram in Figure 4.9 illustrates these points.
Figure 4.9. Graphical representation of temporal functions.
Interpreting the TSQL2 functions such as Precedes, Overlaps, and so on in standard (nontemporal) SQL turns out to be straightforward, because, upon closer examination, the requirements in a dimensional data warehouse are quite simple. The “contains” function is useful because it enables queries such as “How many customers did the Wine Club have at the end of 2000?” to be asked. What this is really asking is whether the start and end dates of a customer's period of existence contain the date 31-12-2000. This can be restated to ask whether the date 31-12-2000 is between the start and end dates. So the “contains” function from TSQL2 can be written almost as easily using the “between” function. Other functions such as “meets” can be useful when trying to detect dimensional attribute changes in an implementation that uses row timestamps. The technique is to perform a self-join on the dimension where the end date of one row meets the start date of another row and to check whether, say, the addresses or supplier codes are different.
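Assuming a row-timestamped customer dimension with start_date and end_date columns (and a high end date such as 31 December 9999 on open rows, as suggested earlier), the two translations might be sketched as follows. The date arithmetic for “meets” is illustrative and varies between databases.

-- "Contains": how many customers did the Wine Club have at the end of 2000?
select count(*)
from customer
where date '2000-12-31' between start_date and end_date;

-- "Meets": detect address changes by pairing temporally adjacent rows.
select c1.customer_code
from customer c1, customer c2
where c1.customer_code = c2.customer_code
and c2.start_date = c1.end_date + 1          -- c1 meets c2, one chronon apart
and c1.customer_address <> c2.customer_address;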
Temporal Queries

While we are on the subject of temporal extensions to existing technology, it is worth mentioning, though not dwelling upon, the considerable research that has been undertaken in the field of temporal databases. Although more than 1,000 papers have been published on the subject, a solution is not close at hand. However, some temporal definitions have already proved useful in this book. For instance, valid time and transaction time illustrate the difference between the time an event occurred in real life and the time that the event becomes known to the database. Another useful result is that it is now possible to define three principal types of temporal query that are highly relevant to data warehousing (see Snodgrass, 1997):
1. State duration queries. In this type of query, the predicate contains a clause that specifies a period. An example of such a query is: “List the customers who lived in the SW sales area for a duration of at least one year.” This type of query selects particular values of the dimension, utilizing a predicate associated with a group definition that mentions the duration of the row's period.
2. Temporal selection queries. This involves a selection based on a group definition of the time dimension. So the following query would fall into this category: “List the customers who lived in SW region in 1998.” This would involve a join between the customer dimension and the time dimension, which is not encouraged. Kimball's reasoning is that the time constraints for facts and dimensions are different.
3. Transition detection queries. In this class of query, we are aiming to detect a change event such as: “List the customers who moved from one region to another region.” This class of query has to be able to identify consecutive periods for the same dimension. The query is similar to the state duration query in that, in order to write it, it is necessary to compare row values for the same customer. We'll be using these terms quite a bit, and the diagram in Figure 4.10 is designed as an aide-mémoire for the three types of temporal query.
Figure 4.10. Types of temporal query.
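As a concrete illustration, each of the three types might be expressed against a row-timestamped customer dimension along these lines. The schema (start_date, end_date, and a sales_area_code held on the customer row) and the date arithmetic are assumptions made for the sake of the sketches.

-- 1. State duration: customers who lived in the SW sales area for at least a year.
select distinct customer_code
from customer
where sales_area_code = 'SW'
and end_date - start_date >= 365;

-- 2. Temporal selection: customers who lived in the SW area during 1998.
select distinct customer_code
from customer
where sales_area_code = 'SW'
and start_date <= date '1998-12-31'
and end_date >= date '1998-01-01';

-- 3. Transition detection: customers who moved from one sales area to another.
select distinct c1.customer_code
from customer c1, customer c2
where c1.customer_code = c2.customer_code
and c2.start_date = c1.end_date + 1
and c1.sales_area_code <> c2.sales_area_code;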
VARIATIONS ON A THEME

One way of ensuring that the data warehouse correctly joins fact data to dimensions and dimensional hierarchies is to detach the superordinate dimensions from the hierarchy and reattach them directly to the fact table. Pursuing the example presented earlier, in which a salesperson is able to change departments and a department is able to move from one site to another, what follows is the traditional approach of resolving many-to-many relationships by the use of intersection entities, as the diagram in Figure 4.11 shows.

Figure 4.11. Traditional resolution of m:n relationships.
This approach need not be limited to the resolution of time-related relationships within dimensional hierarchies. It can also be used to resolve the problem of time-varying attributes within a dimension. So if it were required, for instance, to track a salesperson's salary over time, a separate dimension could be created as shown in Figure 4.12.

Figure 4.12. Representation of temporal attributes by attaching them to the dimension.
The identifier for the salary dimension would be the salesperson's identifier concatenated with a date. The salary amount would be a nonidentifying attribute.
Salary(Salesperson_Id, StartDate, EndDate, Salary_Amount)

The same approach could be used with all time-varying attributes. Another approach is to disconnect the hierarchies and then attach the dimensions to the fact table directly, as is shown in Figure 4.13.

Figure 4.13. Representation of temporal hierarchies by attaching them to the facts.
This means that the date that is attached to each fact will, automatically, apply to all the levels of the dimensional hierarchy. The facts would have a foreign key referring to each of the levels of the hierarchy.
We are left with a choice as to how we treat the changing attributes. The salary attribute in Figure 4.12 could be treated in the same way as before (i.e., as a separate dimension related to the salesperson dimension). Alternatively, it could be attached directly to the fact table in the same way as the other dimensions, as shown in Figure 4.14.

Figure 4.14. Representation of temporal attributes by attaching them to the facts.
A further variation on this approach is to include the salary amount as a nonidentifying attribute of the fact table, as follows:
Sales(Site_Id, Dept_Id, Salesman_Id, Time_Id, Salary_Amount, Sales_Quantity, Sales_Value)

This approach eliminates dimensional hierarchies and, therefore, removes the Type 2 problem of extraneous cascaded inserts when changes occur in the hierarchy. However, this is by no means a complete solution, as it does nothing to resolve the problems of time within the hierarchical structures. The question “How many customers do we have?” is not addressed by this solution. It is presented as a means of implementing Type 2 without incurring the penalties associated with slowly changing hierarchies. There is a further drawback with this approach. The tables involved in the hierarchy are now related only via the fact table. Therefore, the hierarchy cannot be reconstructed for any period in which no sales records exist to link the dimensions concerned.
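The drawback is easy to demonstrate. With the hierarchy held only in the fact table, the relationship between, say, departments and sites can be recovered only from whatever fact rows happen to exist:

-- A sketch of reconstructing the department-to-site assignment from the facts.
select distinct Dept_Id, Site_Id
from Sales;
-- Any department that recorded no sales in the stored facts simply
-- disappears from the reconstructed hierarchy.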
CONCLUSION TO THE REVIEW OF FIRST-GENERATION METHODS

In this chapter we focused on the subject of time. Time has a profound effect on data warehouses because data warehouses are temporal applications. The temporal property of data warehouses was never really acknowledged in first-generation data warehouses and, consequently, the representation of time was, in most cases, not adequate. In dimensional models the representation of time is restricted largely to the provision of a time dimension. This enables the fact tables to be accurately partitioned with respect to time, but it does not provide much assistance with the representation of time in the dimensional hierarchies. The advent of CRM has highlighted the problem and reinforced the need for a systematic approach to dealing with time. In order to design and build a truly customer-centric data warehouse that is capable of assisting business people in the management of serious business problems such as customer churn, we absolutely have to make sure that the problems posed by the representation of time are properly considered and appropriately addressed.

We saw how the first-generation data warehouse design struggles to answer the most basic of questions, such as “How many customers do we have?” With this kind of limited capability it is impossible to calculate churn metrics or any other measures that are influenced by changes in circumstances. We also examined how the traditional approach to solving the problem of slowly changing dimensions can cause the database to generate hundreds or thousands of extraneous new records when dealing with simple hierarchies that exist in all organizations. We then went on to explore some of the different types of temporal query that we need to be able to ask of our data. Temporal selection, transition detection, and state duration queries will enable us to analyze changes in the circumstances of our customers so that we might gain some insight into their subsequent behavior. None of these types of query can be accommodated in the traditional first-generation data warehouse.

It is reasonable to conclude that, in first-generation data warehouse design, the attributes of the dimensions are really regarded as further attributes of the facts. When we perform a join of the fact table to any dimension we are really just extending the fact attributes with those of the dimension. The main purpose of the Type 2 solution appears to be to ensure that each fact table entry joins up with the correct dimension table entry. In this respect it is entirely successful. However, it cannot be regarded as the solution to the problem of the proper representation of time in data warehouses. The next generation, the customer-centric solution that supports CRM, has to be capable of far more.

This concludes our exploration of the issues surrounding the representation of time in data warehouses. We now move on to the development of our general conceptual model. In doing so we will reintroduce the traditional conceptual, logical, and physical stages. At the same time, we will attempt to overcome the problems we have so far described.
Chapter 5. The Conceptual Model

This chapter focuses on the conceptual modeling part of our data warehouse solution. We intend to adhere to the definition of the conceptual model as one that specifies the information requirements. For a data warehouse design that uses our general conceptual model (GCM; see Chapter 3), we need to be able to provide the following components:

Customer-centric diagram
Customer's changing and nonchanging circumstances
Customer behavior
Derived segments

In the previous chapter we explored, in some detail, the issues surrounding the representation of time, and so our conceptual model needs to be able to capture the temporal requirements of each data element. We also looked at the causal changes and the dependencies within the data items that we need to capture.
REQUIREMENTS OF THE CONCEPTUAL MODEL

Before proceeding, the general requirements of a conceptual data model for data warehousing are restated succinctly. The model must:
1. Be simple to understand and use by nontechnical people
2. Support the GCM
3. Support time

The most widely used data models are entity relationship (ER) or, sometimes, extended entity relationship (EER) methods. These models are widely understood by IT professionals but are not easy for non-IT people to understand. A properly produced ER model contains quite a lot of syntax when you think about it. We have rules regarding cardinality, participation conditions, inclusive and exclusive relationships, n-ary relationships, and entity supertypes and subtypes. These syntax rules are necessary for models of operational systems but, in many respects, the diagrammatic requirements for dimensional data warehouses are simpler than those of traditional ER models. Take, for the moment, the rules for dimensional models. These are what we will use to model customer behavior in the GCM.
1. The structure of a dimensional model is predictable. There is a single fact table at the center of the model. The fact table has one or more dimension tables related to it. Each dimension will have zero, one, or more hierarchical tables related to it.
2. The relationships are not usually complex. Relationships are always “one to many” in a configuration where the dimension is at the “one” end of the relationship and the fact table is the “many” end. Where dimensional hierarchies exist, the outer entity (farthest from the fact table) is the “one” end and the inner entity (nearest to the fact table) is the “many” end.
3. “One-to-one” and “many-to-many” relationships are rare, although this changes when “time” is introduced. There is no real need to model the cardinality (degree) of the relationships.
4. The participation conditions do not need to be specified. The dimension at the “one” end of the relationship always has optional participation. The participation condition for the dimension, or fact, at the “many” end is always mandatory.
5. Entity super/subtypes do not feature in dimensional models. 6. There is no requirement to name or describe the relationships as their meaning is implicit. It is important to show how the dimensional hierarchies are structured, but that is the only information that is needed to describe relationships.
7. There is no requirement for the fact table rows to have a unique identifier.
8. There is no requirement to model inclusive or exclusive relationships.

The additional rules and notations that are required to support the features in the list above are, therefore, not appropriate for dimensional data warehouses.

There is a further consideration. We know that data warehouses are not designed to support operational applications such as order processing or stock control. They are designed to assist business people in decision making. It is important, therefore, that the data warehouse contains the right information. Often, the business people are unable to express their requirements clearly in information terms. It is frequently the case that they feel they have a problem but are unsure where the problem lies. This issue was brought out in the introduction to the Wine Club case study in Chapter 2. Most business managers have a set of business objectives. These can be formally defined key performance indicators (KPIs), against which their performance is measured, or they can be more informal, self-imposed objectives. A data warehouse can be designed to help them achieve their business objectives if they are able to express those objectives clearly and to describe the kind of information they need to help make better decisions in pursuit of them. One method that leads business managers through a process of defining objectives and subsequent information requirements is described later in this chapter.

What is needed is an abstraction that allows the business requirements to be focused upon in a participative fashion. The business people must be able to build, validate, modify, or even replace the model themselves. However, in addition, the model must be powerful enough to enable the technical requirements of each data object to be specified so that the data warehouse designers can go on to develop the logical model. Later on we will introduce the dot modeling notation as a model for capturing information requirements in a way that business people can understand.

There exists a fundamental requirement that the people who use the data warehouse must understand how it is structured. Some business applications need to be supported by complex data models. Data modeling specialists are comfortable with this complexity. Their role in the organization, to some extent, depends on their being able to understand and navigate complex models. The systems are usually highly parametric in nature, and users tend to be shielded from the underlying complexity by the human-computer interface. The users of data warehouses are business people such as marketing consultants. Their usage of the warehouse is often general and, therefore, unpredictable, and it is neither necessary nor desirable to shield them from its structure. The conceptual model should be easy to understand by nontechnical people to the extent that, with very little training, such people could produce their own models.

If it is accepted that the dimensional model, due to its simplicity, is an appropriate method to describe the information held in a data warehouse, then it would be sensible to ensure that the simplicity is maintained and that the model does not add complexity. Another requirement of the conceptual model is that it should retain its dimensional shape. Again, having achieved a model that is readily understood, we should try to ensure that the essential radial shape is retained even in relatively complex examples.
Also, there is a need within the model to record business semantic information about each of the attributes. This means that some additional, supporting information about attributes will have to be recorded in any case. For the purpose of keeping the conceptual model simple, it seems sensible to incorporate the temporal aspects into the same supporting documents.
The Treatment of Behavior

The temporal requirements of behavioral data are very straightforward. Once recorded, they do not change. They do not, therefore, have any form of lifespan. The behavioral facts are usually derived from an entity that has attained a particular state. The attainment of the state is caused by an event, and the event will occur at a particular point in time. So one of the attributes of the fact will be a time attribute. The time attribute records the time of the occurrence of the event. The time attribute will be used in two ways:
1. As part of the selection constraint in a query
2. As an aid to joining the dimensions to higher-level groupings

For each event or change there is, as has been said, an associated time. Different applications have different requirements with respect to the grain of time. Some applications require only a fairly coarse grain, such as day. These might include:

Companies selling car insurance
Banks analyzing balances
Supermarkets analyzing product-line sales

Some other applications might wish to go to a finer grain, perhaps hours. These might include:

Supermarkets analyzing customer behavior
Transport organizations monitoring traffic volumes
Meteorological centers studying weather patterns

Other applications would require a still finer grain of time, say, seconds. An example of this is the telecommunication industry monitoring telephone calls. Another requirement of the data model, therefore, is that it needs to show clearly the granularity of time pertaining to the model. Some applications require more than a single grain of time. Consider the examples above. The supermarket example shows a different requirement with respect to time for product analysis as distinct from customer behavior analysis. In any case, almost all organizations conducting dimensional analysis will require the ability to summarize information, from whatever the granularity of the base event, to a higher (coarser) level in order to extract trend information.
The Treatment of Circumstances—Retrospection

In a first-generation dimensional data warehouse, the way in which an attribute of a dimension is treated, with respect to historical values, depends entirely upon the requirements to partition the facts in historical terms. The key role of dimensional attributes is to serve as the source of constraints in a query. The requirements for historical partitioning of dimensional attributes for, say, dimension browsing have been regarded as secondary considerations. Principally, the dimensional attributes exist to constrain queries about the facts. Even though some 80 percent of queries executed are dimension-browsing queries, the main business purpose for this is as a process of refinement of the fact table constraints. It is only comparatively recently, since the advent of CRM, that the dimensions themselves have been shown to yield valuable information about the business, such as the growth in customers. In some industries (especially telecommunications and retail banking), this is seen as the largest business imperative at the present time.

Our GCM, being a customer-centric model, enables amazingly rich and complex queries to be asked about customers' circumstances, far beyond a traditional dimensional model. The GCM, therefore, imposes the need for a much greater emphasis on the nondimensional aspects of the customer, that is, the circumstances and derived segments. In recognition of the need to place more emphasis on the treatment of history within the GCM, we have to examine the model in detail in order to assess how each of the various elements should be classified. Each component of the model that is subject to change will be evaluated to assess the way in which past (historical) values should be treated. When we refer to a component we mean:

Entity: a set of circumstances or a dimension (e.g., customer details or product dimension)
Relationship: for example, a hierarchy
Attribute: for example, the customer's address

Each component will then be given a classification. This is called the retrospection of the component. (Retrospection means literally “looking back into the past.”) Retrospection has three possible values:

1. True
2. False
3. Permanent

True retrospection means that the object will reflect the past faithfully. It enables queries to return temporal subsets of the data reflecting the partitioning of historical values. Each dimension, relationship, and attribute value will, in effect, have a
lifespan that describes the existence of the object. An object may have a discontinuous lifespan, that is, many periods of activity punctuated by periods of inactivity. True retrospection is the most accurate portrayal of the life of a data warehouse object.

False retrospection means that the view of history will be altered when the object's value changes. In simple terms, when changes occur, the old values will be overwritten and are, therefore, lost. It is as though the old values had never existed.

Permanent retrospection means that the value of the object will not change over time.

Let us now explore how the various values for retrospection apply to dimensions, relationships, and attributes.
Retrospection in Entities
So far as entities are concerned, the value for retrospection relates to the existence of the entity. For instance, the existence of a customer starts when the customer first orders a product from the Wine Club, and the existence ends when the customer becomes inactive.

Retrospection = true for entities means that the lifespan of the entity consists of one or more time intervals. A single customer may not have a single, continuous lifespan. The same is true of other entities, such as the wine dimension. A wine may be available for intervals of time spanning many years, or the entire lifespan may be punctuated by periods when the wine is not available. An example of this would be Beaujolais Nouveau, which, for some reason, is very popular when first available in November each year but must be consumed quickly, as it soon deteriorates. As this wine is not available for 10 months out of each year, it is reasonable to say that its lifespan is discontinuous.

Retrospection = false for entities means that only the current state of the entity's existence is recorded. An example from the Wine Club would be the supplier dimension. There may be a need to distinguish between current suppliers and previous suppliers, but there is no requirement to record the intervals of time when a supplier was actually supplying wine to the Wine Club as distinct from the intervals of time when they were not.

Retrospection = permanent for entities means that the entity exists forever. The concept of existence does not, therefore, apply. An example from the Wine Club would be the region dimension. Regions, which represent the wine-growing areas of the world, are unlikely to disappear once created.
Retrospection in Relationships
In a dimensional model, the degree of snapshot relationships is always “one to many.” When the relationship becomes temporal, due to the need for true retrospection on the relationship, the degree of the relationship may change to “many to many.” This has been described in detail in Chapter 4. It is important that this information is recorded in the model without introducing significant complexity. The essential simplicity of the model must not be dominated by time, while at the same time it needs to be straightforward for designers to determine, quickly and precisely, the degree to which support for temporal relationships is required. The situation with relationships is similar to that of entities. It is the requirement with respect to the existence, and lifespan, of a relationship that determines the value for retrospection in relationships.

Retrospection = true for relationships means that the lifespan of each relationship must be recorded and kept so that the results from queries will faithfully reflect history. An example of this in the Wine Club is the relationship between customer and sales area. If a customer moves from one sales area to another, it is important that the previous relationships of that customer with sales areas are retained.

Retrospection = false for relationships means that only the current relationship needs to be recorded. There is no need for the system to record previous relationships. A true view of history is not required. An example of this, within the Wine Club, is the relationship between a hobby and a customer. If Lucie Jones informs the club, through the periodic data update process, that her hobby has changed from, say, horse riding to choral singing, then the new hobby replaces the old hobby and all record of Lucie's old hobby is lost.

Retrospection = permanent for relationships means that the relationship is never expected to change. No change procedures have to be considered. An example of this kind of relationship, in the Wine Club, is the relationship between a wine and a growing region. A wine is produced in a region. A particular wine will always be produced in the same region, so it is reasonable to take the view that this relationship will not change.
Retrospection in Attributes
Each attribute in the model must be assessed to establish whether or not it needs temporal support. Therefore, one of the properties of an attribute is its value for retrospection. As with the other data objects, the recording of the temporal requirements for attributes should not introduce significant complexity into the model. The situation with respect to attributes and retrospection appears, at first, to be somewhat different to the requirements for other objects. In reality, the situation is very similar to that of relationships. If we consider an attribute to be engaged in a relationship with a set of values from a domain, it becomes easy to use precisely the same approach with attributes as with relationships.
Retrospection = true for attributes means that we need to faithfully record the values associated with the attribute over time. An example of this is the cost of a bottle of wine. As this cost changes over time, we need to record the new cost price without losing any of the previous cost prices.

Retrospection = false for attributes means that only the latest value for the attribute should be kept. When the value of the attribute changes, the new value replaces the old value such that the old value is lost permanently. An example of this in the Wine Club is the alcohol by volume (ABV) value for the wine. If the ABV changes, then the previous ABV is replaced by the new ABV.

Retrospection = permanent for attributes means that the value is not expected to change at all. Changes to the values of this type of attribute do not have to be considered. An example of this type of attribute in the Wine Club is the hobby name. Once a hobby has been given a name, it is never expected to change.

There is a rule that can be applied to identifying attributes. With great originality, it is called the identifying attribute rule and simply states that all identifying attributes have a value for retrospection of permanent. This is because identifying attributes should not change.

There is an implicit inclusion of an existence attribute for entities, relationships, and attributes where the value for retrospection is true. The status of true retrospection will direct the logical database designers to provide some special treatment, with respect to time, to the affected object. The type of treatment will vary on a per-case basis and will, to some extent, depend on the type of database management system that is to be deployed. The inclusion of an existence attribute is also implicit for entities, but not relationships and attributes, where the value for retrospection is false. The provision of time support where retrospection is false is usually simpler to implement than where retrospection is true. For relationships and attributes, it is simply a case of replacing the previous value with a new value—in other words, a simple update.
Retrospection and the Case Study
Table 5.1 now lists a handful of the data elements, entities, attributes, and relationships for the Wine Club. For each, the value for retrospection is given that satisfies the requirements of the Wine Club with regard to the representation of time. In accordance with the previous point about their implicit nature, there is no explicit mention of existence attributes in Table 5.1. A complete list can be found in Appendix A. So far, we have analyzed the requirements with respect to time and have identified three cases that can occur. The Wine Club data model has been examined and each object classified accordingly. The requirement that follows on from this is to develop a method that enables the classification to be incorporated into the model so that a solution can be designed. It is important that the requirements, at
this conceptual level, do not prescribe a solution. The designers will have additional points to consider such as the volumes of data, frequency of access, and overall performance. As always, some compromises are likely to have to be made.
Table 5.1. Wine Club Entities, Relationships, and Attributes

Object Name | Type | Retrospection | Reason
Customer | Entity | True | Customers may have more than one interval of activity. It is important to the Wine Club that it monitors the behavior of customers over time. There is a requirement, therefore, to record the full details of the existence of customers.
Sales_Area | Entity | False | Latest existence only is required. Sales areas may be combined or split. Only the latest structure is of interest. Note that some organizations might wish to compare old regional structures to the new one.
Hobby | Entity | Permanent | The hobby details, once entered, will exist forever.
Sales Area→Customer | Relationship | True | There is a requirement to monitor the performance of sales areas. As customers move from one area to another, we need to retain the historical record of where they lived previously, so that sales made to those customers can be attributed to the area in which they lived at the time.
Hobby→Customer | Relationship | False | A customer's hobby is of interest to the Wine Club. Only the current hobby is required to be kept.
Customer→Sales | Relationship | Permanent | The relationship of a particular sale to the customer involved in the sale will never change.
Customer.Customer_Code | Attribute | Permanent | Identifying attribute rule.
Customer.Customer_Name | Attribute | False | The latest value only is sufficient.
Customer.Customer_Address | Attribute | True | Requirement to analyze by detailed area down to town/city level.
Customer.Date_Joined | Attribute | False | The latest value only is sufficient.
THE IDENTIFICATION OF CHANGES TO DATA

Causality

It would be very helpful if, during the analysis of the kinds of changes that can occur, it is made clear whether the changes are causal in nature. It would be sufficient to identify causal changes only and to assume that all unidentified changes are noncausal. This will provide assistance to the designers. Some changes, as has been previously indicated, can occur to attributes that have the property of false retrospection but that, because they are determinants, have a “knock-on” effect on other attributes that might have the property of true retrospection.

The capture of changes has to be developed into an automated process. Some mechanism is created that enables changes that have occurred to be identified by examining the organization's operational systems, as these are, invariably, the source of data for the data warehouse. The source system will not share the data warehouse data model and will not be aware of the effect of changes. For instance, in the Wine Club, there is a relationship between the address of a customer and the sales area that contains the address. So it could be said that the address determines the sales area. This determinant relationship is identical to that used in the process of normalization. However, the purpose here is quite different and has more to do with the synchronization of the timing of changes to attribute values, to ensure temporal consistency, than with the normalization of relations. Thus, the term causality has been adopted in order to distinguish this requirement, as it is unique to data warehousing.

The operational system that records customers' details may not be aware of the sales area hierarchy. When a customer moves, the fact that a change in sales area might have occurred would not normally be apparent. It becomes the responsibility of the data warehouse designer to manage this problem. Generally, there is little recognition of the fact that logically connected data may need to be extracted from different data files which, in turn, might belong to various operational systems. That there may be a need to implement a physical link between data sources, due to the causal nature of the relationship, is also not recognized. This stitching together of data from various sources is very important in data warehousing. Apart from operational systems, the sources can also be external to the organization. For instance, an external attribute relating to a customer's economic classification might be added to the customer's record. This is a good example of causality. What is the trigger that generates a change in the economic classification when a change in the customer's circumstances is implemented, so that temporal consistency is maintained? Without such a facility, the data warehouse may be recording inconsistent information.

If a customer's address changes, then the sales area code must be checked and updated, if necessary, at the same time as the address change. Where the data relating to addresses and sales areas is derived from different source systems, the temporal synchronization of these changes may be difficult to implement. If temporal synchronization is not achieved, then any subsequent query involving the history of these attributes may produce inaccurate results. The main point is to recognize the problem and ensure that the causal nature of changes is covered during the requirements analysis.
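In practice, this synchronization is handled in the extract and load process. The following sketch shows the general idea; the staging_address_changes and sales_area_lookup tables, and the town column used to derive the area, are hypothetical and exist only for the illustration.

-- Apply the dependent sales area in the same load step as the address
-- change, so both changes carry the same valid time. (Shown as a simple
-- update; a Type 2 implementation would insert new versions instead.)
update customer
set sales_area_code =
  (select l.sales_area_code
   from sales_area_lookup l
   where l.town = customer.town)
where customer_code in
  (select customer_code from staging_address_changes);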
The Frequency of Capture of Changes

Associated with the identification of changes is the timing with which changes to data are captured into the data warehouse. In this respect, the behavior of “behavioral data” is different from the behavior of “circumstances.” The frequency of capture for the fact data is usually as close as possible to the granularity of the valid time event. For instance, in the Wine Club example, the granularity of time of a sale is “day,” and the sales are captured into the data warehouse on a daily basis. There are exceptions, such as telecommunications, where the granularity of time is “seconds” but the frequency of capture is, typically, still daily. Nevertheless, the time assigned to the fact can usually be regarded as the valid time.

The granularity of time for recording changes to the dimensions adopts an appearance that is often misleading. The most frequently used method for identifying changes to dimensions is the file comparison approach, as outlined in the previous chapter. The only time that can be used to determine when the change occurred will be the time that the change was detected (i.e., the time the file comparison process was executed). The time recorded on the dimensional record can be at any level of grain, for example, day. In this example, the granularity of time for the changed data capture appears to be daily because the date that the change was captured will be used to record the change. However, this is a record of the transaction time so far as the data warehouse is concerned. It is not actually a record of the transaction time that the change was recorded in the originating source system. The granularity is related to the frequency with which the changed data is captured. If the changes are detected and captured into the data warehouse on a monthly basis, then the transaction time frequency should be recorded as monthly.

In practical situations, different parts of the model are usually updated at differing frequencies. Some changed data is captured daily, some weekly, and some monthly. The frequency of capture is often dependent on the processing cycle of the source systems. As with the previous section on causality, the valid time and transaction time should be the same, if possible. Where such synchronization is not possible, the difference between the two times should be recorded so that the potential error can be estimated. Our modeling method should provide a means of capturing the true granularity of time on a per-attribute basis.
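A minimal sketch of the file comparison approach, assuming today's source extract has been loaded into a hypothetical staging_customer table with the same columns as the warehouse dimension dw_customer, is:

-- Rows present in today's extract but absent from the dimension are either
-- new customers or changed versions of existing ones. (Some databases
-- spell EXCEPT as MINUS.)
select customer_code, customer_name, customer_address
from staging_customer
except
select customer_code, customer_name, customer_address
from dw_customer;
-- The date this comparison runs becomes the warehouse transaction time,
-- which may lag the valid time of the change in the source system.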
DOT MODELING

The remainder of this chapter is devoted to a methodology for developing conceptual models for data warehouses, called dot modeling. Dot modeling is based on the simplified requirements for dimensional models that were described in the introduction to this chapter. It is a complete methodology that enables nontechnical people to build their own conceptual model reflecting their personal perception of their organization in dimensional terms. It also provides a structured way of constructing a logical (currently relational) model from the conceptual one. The method was invented in July 1997 and has been used in real projects since then, receiving positive reviews from nontechnical people in the environments where it has been deployed. The name was given by a user; dot is not an acronym. It comes from the characteristic that the center of the behavioral part of the model, the facts, is represented by a dot. The method grew out of dimensional concepts and has since evolved to meet the requirements of the customer-centric GCM.

We start by modeling behavior. Figure 5.1 represents the design of a two-dimensional tabular report. This kind of report is familiar to everyone and is a common way of displaying information, for example, as a spreadsheet.
Figure 5.1. Example of a two-dimensional report.
The intersection of the axes in this example, as shown by the dot, would yield some information about the sale of a particular product to a particular customer. The information represented by the dot is usually numeric. It could be an atomic value, such as the monetary value of the sale, or it could be complex and include other values such as the unit quantity and the profit on the sale. Where there is a requirement to include a further dimension, such as time, in the report, one might envisage the report being designed as several pages, where each page represents a different time period. This could be displayed as shown in Figure 5.2.
Figure 5.2. Example of a three-dimensional cube.
Now the dot represents some information about the sale of a particular product to a particular customer at a particular time. The information contained in the dot is still the same as before: it is either atomic or complex, and it is usually numeric. All that has changed is that there are more dimensions by which the information represented by the dot may be analyzed. It follows, therefore, that the dot will continue to represent the same information irrespective of how many dimensions are needed to analyze and report upon it. However, it is not possible to represent more than three dimensions diagrammatically using this approach; in effect, the dot is "trapped" inside the three-dimensional diagram. To enable further dimensions of analysis to be represented diagrammatically, the dot must be moved to a different kind of structure where such constraints do not apply. This is the rationale behind the development of the dot modeling methodology. In dot modeling the dot is placed in the center of the diagram and the dimensions are arranged around it, as shown in Figure 5.3.
Figure 5.3. Simple multidimensional dot model.
The model readily adopts the well-understood radial symmetry of the dimensional star schema.
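For readers who want to see the correspondence in relational terms, here is a minimal, illustrative star schema for a dot model like the one in Figure 5.3. The names and data types are assumptions based on the Wine Club example, not a prescribed design.

    -- The dimensions radiate from the fact table just as they radiate
    -- from the dot in the diagram.
    CREATE TABLE customer_dim (
        customer_id INTEGER PRIMARY KEY,
        name        VARCHAR(25),
        address     VARCHAR(75)
    );

    CREATE TABLE wine_dim (
        wine_id     INTEGER PRIMARY KEY,
        wine_name   VARCHAR(40),
        bottle_cost NUMERIC(7,3)
    );

    CREATE TABLE time_dim (
        time_id     INTEGER PRIMARY KEY,
        day_date    DATE
    );

    -- The dot itself: one row per sale, analyzable by every dimension.
    CREATE TABLE sales_fact (
        customer_id INTEGER REFERENCES customer_dim,
        wine_id     INTEGER REFERENCES wine_dim,
        time_id     INTEGER REFERENCES time_dim,
        quantity    INTEGER,
        value       NUMERIC(9,2)
    );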
The Components of a Behavioral Dot Model

There are three basic components to a dot model diagram:

Dot. The dot represents the facts. The name of the subject area of the dimensional model is applied to the facts. In the Wine Club, the facts are represented by "sales."

Dimension names. Each of the dimensions is shown on the model and is given a name.

Connectors. Connectors are placed between the facts and dimensions to show first-level dimensions. Similarly, connectors are placed between dimensions and groupings to show the hierarchical structure.

Emphasis has been placed on simplicity, so there are virtually no notational rules on the main diagram. It is sensible to place the dot near the center of the diagram and to let the dimensions radiate from it; this encourages a readable dimensional shape to emerge. The behavioral dot model for the Wine Club is reproduced in Figure 5.4. The attributes for the facts and dimensions are not shown on the diagram; they are described on supporting worksheets. Similarly, the temporal requirements are represented on supporting worksheets rather than on the diagram.

The method uses a set of worksheets, which are included in the appendices. Some of the worksheets are completed during the conceptual design stage of the development and some during the logical design stage. The first worksheet is the data model worksheet itself. It contains the following:

Name of the application, or model (e.g., The Wine Club-Sales)

Diagram, as shown in Figure 5.4
Figure 5.4. Representation of the Wine Club using a dot model.
List of the "fact" attributes (i.e., "quantity" and "value" in the Wine Club)

For each fact attribute, some information describing the fact is recorded under what is commonly known as "metadata." Its purpose is to document the business definition of the attribute, to solve the problem of different people within an organization having differing views about the semantics of particular attributes. The descriptions should be phrased in business terms.

A second worksheet, the entities worksheet, is used to record the following:

Behavioral dimensions

Customer circumstances

Derived segments

This part of the method holds some of the more complex information in the model. The model name is given on each page to ensure that parts of the document set are not mistakenly mixed up with other models' documents. The purpose of the entities worksheet is to help the designers of the system understand the requirements and so assist them in the logical design. For each entity the following items of information are recorded:

Name of the dimension as it is understood by the business people. For example, "customer."

Retrospection of the entity's existence.

Existence attribute for the entity. For entities with permanent retrospection, an example of which might be "region" in the Wine Club, there is no requirement to record the existence of an entity because, once established, the entity will exist as long as the database exists. With other entities, however, an attribute to represent existence is needed so that, for instance, the Wine Club would be able to determine which wines were currently stocked.

Frequency of the capture of changes to the existence of the dimension. This will help to establish whether the dimension will be subject to errors of temporal synchronization.

For each dimension, a set of attributes is also defined on a separate worksheet. The existence attribute has already been described; the following description refers to the properties of the other attributes. For each attribute, the following information is recorded:

Name of the dimension that "owns" it.

Name of the attribute. This is the name as the business (nontechnical) people would refer to it.

Retrospection. Whether or not the historical values of this attribute should be faithfully recorded.

Frequency. The frequency with which the data is recorded in the data warehouse. This is an important component in determining the accuracy of the data warehouse.

Dependency. This relates to causality and identifies the other attributes that this attribute depends upon.

Identifying attribute. This indicates whether the attribute is the identifying attribute, or whether it forms part of a composite identifying attribute.

Metadata. A business description of the attribute.

Source. A mapping back to the source system, describing where the attribute actually comes from.

Transformations. Any processing that must be applied to the attribute before it is eligible to be brought into the data warehouse. Examples of transformations are the restructuring of dates into the same format and the substitution of default values in place of nulls or blanks (a short sketch follows at the end of this section).

Data type. The data type, and precision, of the attribute.

Information about dimensional hierarchies is captured on the hierarchies worksheet. Pictorially, the worksheet shows the names of the higher and lower components of the hierarchy. The following information is also captured:

Retrospection of the hierarchy

Frequency of capture

Metadata describing the nature of the hierarchy
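As promised above, here is a brief sketch of the kinds of transformations the worksheet might record. The source table, column names, and date format are hypothetical, and TO_DATE is shown in its Oracle-style form.

    -- Typical attribute transformations applied during extraction.
    SELECT customer_id,
           -- Restructure a source date string into the warehouse standard.
           TO_DATE(date_joined_text, 'DD/MM/YYYY')        AS date_joined,
           -- Substitute a default value where the source holds nulls or blanks.
           COALESCE(NULLIF(TRIM(hobby_code), ''), 'UNKN') AS hobby_code
    FROM   source_customer_admin;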
Dot and the GCM

An interesting development occurred when working with a major telecommunications company in the United Kingdom whose business objective was to build a customer-centric information data model covering the entire enterprise. There were several different behavioral dot models:

Call usage. The types of phone calls made, the duration, cost, etc.

Payments. Whether the customer paid on time, how often they had to be chased, etc.

Recurring revenue. Covered insurance, itemized billing, etc.

Nonrecurring revenue. Accessories and other one-off services requested.

Order fulfillment. How quickly orders placed by customers were satisfied by the company.

Service events. Customers reporting when a fault has occurred in their equipment or service.

Contacts. Each contact made to a customer through offers, campaigns, etc.

In dimensional modeling terms this means several dimensional models, each having a different subject area. During a workshop session with this customer I was able to show how the whole model might look using a single customer-centric diagram, an approach I now refer to as joining the dots. The diagram is shown in Figure 5.5.
Figure 5.5. Customer-centric dot model.
Figure 5.5 shows seven separate dimensional models that share some dimensions. It illustrates that, even in very complex situations, it is still easy to pick out the individual dimensional models using the dot modeling notation, because the radial shape of each individual model remains discernible.

The use of the dot model, in conjunction with the business-focused workshops described next, enables the "softer" business requirements, in the form of business objectives or key performance indicators, to be expressed in information terms so that the data warehouse can be designed to provide precisely what the business managers need. A further requirement is that the model should enable the business people to build the conceptual abstraction themselves. They should be able to construct the diagrams, debate them, and replace them. My own experiences, and those of other consultants in the field, are that business people have found dot models relatively easy to construct and use.
DOT MODELING WORKSHOPS

The purpose of the remainder of this chapter is to provide assistance to data warehouse practitioners in the delivery of data warehouse applications. This practitioner's guide is subdivided into two main sections: conceptual design and logical design. The conceptual design stage consists of the development of a data model using the dot modeling method described in this chapter. The logical model is developed as an extension of the dot modeling method and will be described in the next chapter.
The Dot Modeling Methodology

The methodology, as outlined earlier in this chapter, will now be described in detail. As you might expect, there are a fair number of processes associated with the construction of a complete dot model. Experience shows that these processes can be conducted through the use of two highly structured workshops. The use of joint application development (JAD) workshops is now a well-accepted practice, and holding one or more of these workshop sessions is a good way of gathering user requirements. The purpose of JAD workshops is to enable the development of better systems by bringing together the business people and the system developers, so that the system design is a joint effort. These workshops require skilled facilitation.

Each of the two workshops can be completed in two days.
The Information Strategy Workshop

The objective of the information strategy workshop is for the business people within the customer's organization to develop their own dot model, one that reflects their own perception of their business. The session should last for approximately two days. The emphasis is on the word workshop: every participant must expect to work, not to be lectured to.
Participants—Practitioners
There should be at least two consultants present at the workshop sessions. The ideal combination is one highly organized workshop facilitator and one business consultant who understands both the methodology and the customer's business. It is very useful to have a third person who is able to record the proceedings of the sessions. A substantial proportion of the work is done in teams, so it is useful to have an extra person who can wander from team to team, checking that everyone understands what is required and that progress is being made.

The success or failure of these sessions often has very little to do with the technical content of the workshop and everything to do with softer issues such as:

Ambience—are they enjoying actually being there?

Relationships between the participants

Friendliness of the presenters

Comfort of the participants

It is helpful to have consultants who are sensitive to these issues and can detect and respond to body language, suggest breaks at the right time, and so on. Also, where the native language of the participants is different from that of the consultants, great care must be taken to ensure that the message, in both directions, is clear. Ideally, at least one of the consultants should be fluent in the language of the participants. If that is not possible, then someone from the customer organization must step in to act as the interpreter. If an interpreter is required to translate every statement by the presenters, the workshop will generally take about 50 percent longer than normal.
Participants—Customer
The information strategy workshop requires a mixed attendance of business people and IT people. The ideal proportions are two-thirds from the business and one-third from IT. It is very important that the session is not dominated by IT staff.
Getting several senior people together for two whole days can be a virtually impossible task. The only way to achieve it is to ensure that:

You have senior executive commitment

You book their diaries in advance

It's no good waiting until the week before; you won't get them to attend. There must be some representation from the senior levels of management. As a test of seniority, we would describe a senior person as one who is measured on the performance of the business. This is someone whose personal income is, to some extent, dependent on the business being successful. In any case, they must be able to influence the business strategy and direction.

The key to being successful in data warehousing is to focus on the business, not the technology. There must be a clear business imperative for constructing the data warehouse. It is not enough merely to state that information is strategic. Many practitioners share this view that business requirements are fundamental to success. Any information system should exist solely to support the firm's business objectives, and it is critical that the business goals are clearly defined. One of the major goals, initially, is to build and validate the business case, and we will be covering this in Chapter 8.
The Information Strategy Workshop Process
The steps involved in running the workshop are now described. Some of the subsequent sections should be regarded as guidance only; different consultants have differing approaches to facilitating workshops, and those variations in technique are perfectly legitimate. Here is one method that has been found to be successful in the past. First, the room organization:
1. The tables in the room should be organized in a horseshoe shape rather than a classroom style, as this greatly encourages interaction.
2. Plenty of writing space should be available. I personally favor white boards with at least two flip charts to record important points.
3. Try to work without an overhead projector, as projectors tend to introduce a lecturing or presentation tone to the proceedings, whereas the atmosphere we want to encourage is that of a workshop where everyone contributes.

We'll now go through the workshop step by step. For each step I'll give an estimate of how long it ought to take. Bear in mind that different people might prefer a faster or slower pace.
Workshop Introduction
Estimated time: 60 minutes

Explain why the team is assembled and what the objectives are.
The objective: by the end of the information strategy workshop we will have built a conceptual model of the information requirements needed to support the business direction of the organization.

Do not assume that everyone in the room will know who all the others are. I have yet to attend a workshop session where everyone knew one another. Get everybody to introduce themselves. This is, obviously, a good ice-breaking method that has been used successfully for many years. Ask them for:

Name (pronunciation is very important if the language is foreign to you)

Position in the organization

Brief description of their responsibilities

Also, ask them to tell us one thing about themselves that is not work related, maybe a hobby or something. Breaking the ice is very important in workshops, so it should not be rushed. There are a number of different approaches that you may already be familiar with. One way is to get people to spend 10 minutes talking to the person sitting next to them and then to introduce each other. A good variation on this theme is to hang a large piece of paper on the wall and get them to draw a picture of the person they have introduced. If this sounds juvenile, well, I suppose it is, but it really does get people relaxed.
I also like to make a note of which of them are business people and which are IT people. It is a good idea to introduce yourself first and for your colleagues to follow you; this takes the pressure off the participants.

Next come the ground rules. These are important, and here are some suggestions. I think they are just common sense, but you never know whom you are dealing with, so it is best to be explicit:

Silence means assent. If people do not voice objections to decisions, then they are deemed to have accepted them. It is of no use to anyone if, during the wrap-up session on day two, someone says something like, "Well, I didn't agree with your basic ideas in the first place!"

The sessions start on time. It is incumbent on the facilitator to state clearly and precisely when the next session is due to start. It is the responsibility of all attendees to be there at that time. So it is absolutely vital that the consultants are never late to sessions.

Mobile telephones are not allowed.

Personal criticisms are not allowed. It's OK for them to criticize the consultants, or the method, but not each other.

Only one person is allowed to be speaking at a time.

No stupid questions. This does not mean that you are not allowed to ask stupid questions. It actually means that no question should be regarded as stupid.

Always give them the collective authority to change the rules or add others.

The next thing to do is to outline the agenda for the whole of the workshop. People like to know when their breaks are, what time lunch will be served, and roughly at what time they might expect to be leaving in the afternoon. It is always worth asking whether anyone has any time constraints (flights, etc.) on the last day. Try to accommodate these as far as you are able by manipulating the agenda. Now we move into the workshop proper. The first session is on business goals.
Business Goals
Estimated time: 30 minutes

This is the first real business session. The following question should be posed: Does your organization have business goals? This is a directed question in the sense that it should be asked of the senior business people present. You need to pick up the body language quickly, because some people may feel:

1. Embarrassed that they don't know what their business goals are

2. Uncomfortable about sharing their business goals with outsiders

Experience shows that the former is more common than the latter. Most people have a vague idea about business goals and would offer things like:

Increase market share

Increase customer loyalty

Increase profitability

What we need is to be able to articulate business goals that have the following properties:

1. Measurable

2. Time bounded

3. Customer focused

The third of these is not an absolute requirement, but most business goals do seem to relate to customers in some way. This is particularly true if the data warehouse is being built in support of a CRM initiative. If that is the case, try to guide the participants toward customer-related business goals. So a well-articulated business goal might be:

To increase customer loyalty by 5 percent per annum for the next three years

If you can get one of these types of goals from your business people, that is good. If you can get two or three, that's excellent. The most likely result is that you won't get any. That's OK too. It's just that, if the organization has business goals and is prepared to share them, then it is sensible to use them.
Thinking About Business Strategy
Estimated time: 1 hour

This is the first real workshop exercise. We split the participants into groups of three or four people. To get the best out of this exercise, the composition of the groups is important; this is why it is useful to know who the business people are and who the IT people are. We need one senior business person, if possible, in each group. The remaining business people should be spread as evenly as possible throughout the groups, and the IT people likewise. Don't allow the IT people to become dominant in any one group; IT people have a tendency to become "joined at the hip" in these sessions if allowed.
One way of achieving an even spread of business and IT people is as follows: During the introduction session earlier, the nonspeaking facilitator should note the names and job titles of each of the participants and can form the groups independently. The attendees can then be simply informed as to which group they have been assigned. The objectives of the exercise at this stage are:
1. Decide upon the business goals (assuming none was forthcoming previously), or use the business goals previously presented. Each team, therefore, is working toward its own set of goals, rather than the entire class having common goals. This is perfectly OK. Each of the business people in the room may have their own goals that will be different from the others' goals. One or two goals are sufficient. Each goal must possess the three properties listed above.
2. Think about business strategy to help them to achieve the goals. This is a prioritized set of initiatives that the organization might need to implement. It is unlikely that a single item, by itself, would be enough to make the goal achievable.
3. What steps would they need to take in order to put the strategy into operation? The steps will be driven by the prioritization of the strategic components.

Be careful to allow enough time for this. This process invariably leads to serious discussion in the groups, as almost everyone has a view on the single thing their organization needs to do in order to become more successful. These discussions are, in the main, a good thing and should be allowed to run their course. About one hour should be allowed for this part, and a break should be included. It is worth recommending that each person spend the first 15 minutes thinking about the problem, so that each has a contribution to make before the group convenes. Make sure that you have sufficient room for the group discussions to be conducted without the groups affecting each other. The groups must record their decisions as, later on, they will be expected to present them to the other groups.
The Initial Dot Model
Estimated time: 1–2 hours

The class should reconvene for this teaching session. The objective is to get them to understand how to use the dot modeling system. This is quite easy, but the IT people will understand it more quickly than the business people. It is sensible to start by focusing on customer behavior, as this is usually the easiest part for people to comprehend. The best way to explain how to make a dot model is as follows: First, on a flip chart or a whiteboard, go through the spreadsheet explanation that we saw earlier in this chapter. Everyone knows what a spreadsheet looks like and can relate to the axes, the cells, etc. Then build a model in front of them, like the one in Figure 5.6, which represents the Wine Club.
Figure 5.6. Initial dot model for the Wine Club.
It is vital that at least some of the members of the team understand how to do this. It is useful to ask questions, especially of the IT people, to convince yourself that each group is capable of developing a rudimentary model. At this stage, we do not expect perfect models and there is a refinement stage later on. However, it is important that they know the difference between a fact and a dimension. This is obvious to some people, but experience shows it is not obvious to others. It helps to draw on the star schema analogy. Most IT people will be able to relate to it straight away. The business people will pick it up, albeit more slowly.
Behavior
Estimated time: 15 minutes

We begin by explaining the meaning of the dot, that is, the behavior. These are the attributes of the dot itself, and they represent the focus of the model. Any data warehousing practitioner would be expected to be able to explain this. Typical facts are sales values and quantities. We must be careful to include derived facts such as profit or return on investment (ROI). Each fact attribute must have the properties of being:

1. Numeric

2. Summable, or semi-summable

Each measurable fact must earn its place in the warehouse. Whoever suggests a fact should demonstrate that it has the capability to contribute toward the business strategy. In addition, questions should be posed, in the form of queries (in natural language, not SQL), to show how the fact will be used.
The measurable fact name, as it will be known in the data warehouse, should be recorded. It is vital that everyone has a clear idea as to the meaning of the facts, and it makes sense to spend a little time focusing on the semantics of the information. In the "sales" scenario, the question "What is a sale?" may result in a discussion about the semantics. For instance, an accountant may say that a sale has occurred when an invoice has been produced. A salesperson might say that a sale has occurred when an order is received. A warehouse manager might say that a sale has occurred when a delivery note is signed by the customer. Everyone has to be clear about the meaning of the data before moving on.

We also need to know the time granularity of the measurable fact. This should, ideally, be the lowest level of grain of time that is practical. The best possible level is known as the valid time. Valid time means the time that the event occurred in real life. This usually translates to the transaction level. In a telecommunications application where the data warehouse is to record telephone calls, the valid time relates to the time the call was placed, right down to the second. In other applications, the time granularity might be set to "day." Usually, all the facts will share the same time granularity. However, it is important to know where this is not the case, so we record the time granularity for each measurable fact attribute.

For each fact attribute, metadata supporting the attribute should also be recorded. This means a precise description of the attribute in business terms.
The Dimensions
Estimated time: 15 minutes

We now describe the meaning of dimensions. Again, it is expected that data warehousing practitioners understand this term, so the intention here is to give guidelines on how to explain it to nontechnical people. One way of achieving this is to draw a two-dimensional spreadsheet with axes of, say, customer and product and explain that this is an example of a two-dimensional view of sales. Each intersection, or cell, in the spreadsheet represents the sale of a particular product to a particular customer. I usually draw an intersection using lines and a dot where they intersect. The dot represents the measurable fact and, by using it, you reinforce the dot notation in the minds of the participants. The two-dimensional view can be extended to a three-dimensional view by transforming the diagram into a cube and labeling the third axis as, say, "time." Now each intersection of the three dimensions, the dot, represents the sale of a particular product to a particular customer at a particular time. We have now reached the limit of what is possible to draw. So we need a diagrammatic method that releases the dot from its three-dimensional limitation, and this is where the dot model helps. By describing the dimensions as axes of multidimensional spreadsheets, the idea is usually fairly easy to understand.
Creating the Initial Dot Model
Estimated time: 1 hour

Taking the business goals, the strategy, and the steps they decided upon in the earlier group session, the groups now have to do the following:
1. Using their highest-priority strategy, decide what information they need to help them. They will need guidance on this at first. Lead them through an example.
2. Formulate some questions that they would like to be able to ask.

3. Create a dot model that is able to answer those questions.

You need to allow at least one hour for this exercise and about half an hour for the teaching session. It is useful to include a major break, such as the lunch break, or overnight. The facilitators must be on hand all the time to assist where required. At this stage you will always get questions like "Is this a fact or a dimension?" Also, the more "switched-on" IT people will already start making assumptions about things like:

Availability of data from source systems

Target database architectures

Data quality issues, etc.

They should be discouraged from these lines of thought. It is important that they do not constrain the innovative thinking processes. We are trying to establish the information we need, not what is currently available.
Group Presentations
Estimated time: 30 minutes per group

Each group must now present, to the other groups:

1. Business goals

2. Steps in their strategy

3. Information needed to support the steps

4. Dot model to support the information requirements

5. Some example questions that the model supports
6. How each question is relevant to their strategy

The information should be presented on a flip chart or overhead projector. The groups should each elect a spokesperson, although experience shows that some groups prefer to have two people, or even the whole group, participating in the presentation. The other groups should provide feedback, and the facilitators must provide feedback. The feedback at this stage must always be positive, with some suggestions as to how the model could be enhanced. Do not comment on the accuracy or completeness of the models; the refinement process will resolve these issues.
The Refinement Process
Estimated time: 1 hour

We now have a classroom session about refinement of the model. First, we have to decide whether the groups should continue to work on their own models or whether all groups should adopt the same, perhaps composite, model. There are no real preferences either way at this time. In either case, the work is still undertaken in groups. Then we explain about:

1. Combination of dimensions

2. Dimensional hierarchies

3. Inclusion of others' suggestions

Point 1 above concerns dimensions on the original model that are not really dimensions at all. There is an instance of this in the example provided, in that the "vintage" dimension is really no more than an attribute of "wine." After discussion, it is sometimes wise to leave these apparently extraneous dimensions alone if no agreement can be reached. The creation of dimensional hierarchies is a fairly straightforward affair. The refined example dot model is shown in Figure 5.7.
Figure 5.7. Refined dot model for the Wine Club.
The groups should now be re-formed to refine the models. The facilitators should wait for about 10 to 15 minutes and then approach the groups to see if they need help. There will usually be some discussion and disagreement about how, and whether, changes should be incorporated.
Presenting the Refined Models
Estimated time: 15 minutes per group

The refined models are re-presented to the other groups. The team has to show:

1. How the model has evolved from the original model as a result of the review process

2. How any enhancements, from the previous feedback, have been incorporated

Documenting the Models
Estimated time: 1 hour

The models are documented, in their refined state, using the data model worksheet and the first part of the entities and segmentation worksheet (the remaining parts of the entities and segmentation worksheet are completed during the second workshop). The entities include the customers' circumstances and the dimensions from the behavioral dot models. The segmentation relates to the derived segments, which form the final part of our GCM. Sharing a worksheet for these two components of the model is a good idea because it can be difficult for business people to separate the two things in their minds. Encourage each group member to complete their own worksheets. The facilitators must make sure they take a copy of each group's model; these models provide the primary input to the component analysis workshop. Some examples of the data model and entities worksheets for the Wine Club are shown in Figure 5.8 (a fuller set of dot model worksheets is included in Appendix B).
Figure 5.8. Dot modeling worksheet showing Wine Club sales behavior.
The following are examples of the entities and dimensions worksheet. The first, predictably, is the customer entity. At this stage, we are really concerned with describing only the entity itself, not the attributes.

Dot Modeling Entities and Segments

Entity Name: Customer    Retrospection: True    Frequency: Monthly

Metadata: The customer entity contains all customer details. We need to monitor discontinuous existences and to be able to identify active and nonactive customers. An active customer is defined to be one who has placed an order within the past 12 months. Any customer not having done so will be classified as inactive. Subsequent orders enable the customer to be reclassified as active, but the periods of inactivity must be preserved.

Attribute details (name, retrospection, PK?, frequency, dependency, metadata, source, transformation, and data type) are to be completed during the component analysis workshop.

The second example is the wine dimension. This will be used, principally, as a dimension in one or more behavioral models.

Dot Modeling Entities and Segments

Entity Name: Wine    Retrospection: True    Frequency: Monthly

Metadata: The wine dimension holds information about the wines sold by the club. Details include prices and costs.

Attribute details are to be completed during the component analysis workshop.
All that we are recording at this stage is information relating to the entity itself, not to any of the attributes belonging to the entity.
Workshop Wrap-Up
Estimated time: 30 minutes

It is important to summarize the process that has been conducted over the previous two days and to congratulate the participants on having taken the first steps toward the creation of a business-led information model for the organization:
1. Explain how the model is just the first step and how it becomes the input to the next stage. Briefly describe the main components of the component analysis workshop.
2. Try to secure commitment from as many people as possible to attend the second workshop.
3. The arrangements for the second workshop should be confirmed, if that is possible.

4. Hand out the attendee feedback form and ask each person to complete it. We all want to improve on the service we offer, so we need as much feedback as we can get.

While they are filling in the feedback forms, explain about dimensional attributes and try to encourage them to start thinking about these for the next workshop.
The Component Analysis Workshop

The principal objective of the component analysis workshop is to put some substance into the model that was created in the information strategy workshop.
Participants—Practitioners
As far as possible, the same consultancy people as were present in the first workshop should participate in the second. This exercise is more technical than the first and so some additional skills are required but continuity of personnel should be preserved.
Participants—Customer
Similarly, as far as is feasible, the same people as before should be present. However, as has been stated, this is a more technical workshop, and some of the more senior business people should be permitted to send fully authorized deputies to work in their place. The business goals and the first part of the dot model have been established, so their part of the task is complete. The proportions of business people to IT people can be reversed in this workshop, so maybe two-thirds IT people is OK this time. It is, however, vital that some business people are present. What is also important is continuity, so I would reiterate the point that we need as many people as we can get from the first workshop to attend this second workshop.
The Component Analysis Workshop Process
The organization of this second workshop is very similar to that of the first workshop. The room layout and writing facilities are the same. We do not need extra rooms for group work this time.
Review of Previous Model
Estimated time: 30 minutes

The purpose of this first exercise is to refresh the minds of the participants as to the state of the dot model at the end of the first workshop. Since the previous workshop, it is quite likely that the participants will have thought of other refinements to the model. Whether these refinements should be adopted depends on the authority of the persons present. As long as any refinements have the full backing of the business, and can be demonstrated to be genuinely in pursuit of business goals, they should be discussed and adopted where appropriate. The extent of refinement is difficult to assess in general terms. Ideally, the lead consultant on the project will have kept in touch with developments inside the customer organization and will be able to estimate the time needed for this part of the exercise. If no changes are required, it should take no more than 30 minutes.
Defining Attributes
Estimated time: 3–4 hours

This can be quite a time-consuming process. The objective is to build a list of attributes for each dimension in the model.
Make sure lots of breaks are included in this session, and encourage people to walk around.

The supporting dot modeling entities worksheet should be used for this exercise. We started completing the entities worksheet in the first workshop; now we can complete it by adding the details of all the attributes. It is recommended that one of the facilitators maintains these worksheets as the session proceeds. The examples of this worksheet that were previously shown are reproduced here with further examples of some attributes. Note that, although the forms are shown as completed, some of the information can be added later, in the logical modeling stage. Right now you should be aiming to record just the names of the attributes and the business metadata. As previously, we have completed a couple of examples for the customer's details:

Dot Modeling Entities and Segments

Entity Name: Customer    Retrospection: True    Frequency: Monthly

Metadata: The customer entity contains all customer details. We need to monitor discontinuous existences and to be able to identify active and nonactive customers. An active customer is defined to be one that has placed an order within the past 12 months. Any customer not having done so will be classified as inactive. Subsequent orders enable the customer to be reclassified as active, but the periods of inactivity must be preserved.

Attribute Name: Customer Name    PK?: N
Retrospection: False    Frequency: Monthly    Dependency: None
Metadata: The name of the customer in the form of surname preceded by initials
Source: Customer Admin    Transformation: None    Data Type: Char 25

Attribute Name: Customer Address    PK?: N
Retrospection: True    Frequency: Monthly    Dependency: Sales area hierarchy
Metadata: The customer's address
Source: Customer Admin    Transformation: None    Data Type: Char 75

Attribute Name: Lifetime Value Indicator    PK?: N
Retrospection: True    Frequency: Monthly    Dependency: None
Metadata: The calculated lifetime value for the customer. Values are from 1 to 20 in ascending value
Source: Customer Admin    Transformation: SQL package crm_life_value.sql    Data Type: Numeric (2)

The second example again is the wine entity, which is used as a dimension in one or more dimensional dot models.

Dot Modeling Entities and Segments

Entity Name: Wine    Retrospection: True    Frequency: Monthly

Metadata: The wine dimension holds information about the wines sold by the club. Details include prices and costs.

Attribute Name: Wine.Bottle_Cost    PK?: N
Retrospection: True    Frequency: Monthly    Dependency: None
Metadata: The cost of one bottle of wine, net of all discounts received and including all taxes, duties, and transportation charges.
Source: Stock (Goods Inwards)    Transformation: None    Data Type: Numeric (7,3)

Attribute Name: Wine.ABV    PK?: N
Retrospection: False    Frequency: Monthly    Dependency: None
Metadata: The alcohol content of a wine expressed as a percentage of the total volume, to one decimal place.
Source: Stock (Goods Inwards)    Transformation: None    Data Type: Numeric (3,1)
One worksheet, or set of worksheets, is completed for each entity in the model. This includes the customers' details, products (e.g., wine), and market segments. The identifying attribute should be included where known. The attribute name, as it will be known in the data warehouse, should be recorded; the name must make it clear, to business people, what the attribute represents. For each attribute, supporting metadata should also be recorded. At this level we are looking for a concise but precise description of the attribute in business terms.

The facilitator should call for nominations of attributes from the workshop participants. Each attribute must earn its place in the model. Whoever suggests an attribute should demonstrate that it has the capability to contribute toward the business strategy. In addition, questions should be posed, in the form of queries (in natural language, not SQL), that will show how the attribute will be used. As has been stated, this exercise can be very time-consuming and somewhat tedious, so it should be punctuated with frequent breaks.
Dimensional Analysis of the Facts
Estimated time: 30 minutes

Next, for each measurable fact attribute, we examine each dimension to determine the extent to which the standard arithmetic functions can be applied. This enables us to distinguish between fully summable facts and semi-summable facts. For instance:

Supermarket quantities can be summed by product but not by customer.
Bank balances can be summed at a particular point in time but not across time.

Return on investment, and any other percentage, cannot be added at all, but maximums, minimums, and averages may be acceptable.

So-called "factless" facts, such as attendance, can be counted but cannot be summed or averaged.

So each of the measurable facts should undergo an examination to determine which of the standard arithmetical functions can be safely applied to which dimensions. The worksheet used to perform this exercise is called the fact usage worksheet, and an example is shown below (again, a fuller model is contained in Appendix B).

Dot Modeling Fact Usage

Model Name: Wine Club Sales
Fact Name: Value    Frequency: Daily

Dimensions     Sum   Count   Ave   Min   Max
1. Customer     ?      ?      ?     ?     ?
2. Hobby        ?      ?      ?     ?     ?
3. Wine         ?      ?      ?     ?     ?
4. Supplier     ?      ?      ?     ?     ?
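To make the distinction concrete, the following sketch uses a hypothetical balance_fact table to show why a semi-summable fact such as a bank balance must be treated differently along the time dimension:

    -- Correct: balances can be summed across accounts at one point in time.
    SELECT SUM(balance) AS total_held
    FROM   balance_fact
    WHERE  snapshot_date = DATE '2000-12-01';

    -- Correct: across time, use an average (or minimum/maximum) per account.
    SELECT account_id,
           AVG(balance) AS average_balance
    FROM   balance_fact
    GROUP  BY account_id;

    -- Incorrect: SUM(balance) per account across all snapshot dates would
    -- count the same money once for every snapshot.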
Hierarchies and Groupings
Estimated time: 1 hour

This exercise simply provides some metadata to describe the relationships between the different levels in a dimensional hierarchy. As before, the metadata is required to provide a precise description of the relationship in business terms. The worksheet used to capture this information is called the hierarchies and groupings worksheet, and an example is shown in Figure 5.9.
Figure 5.9. Example of a hierarchy.
The hierarchy or grouping is identified by recording the names of the dimensions in the hierarchy diagram at the top of the worksheet. The example above shows the Sales Area → Customer grouping. The higher (senior) level in the hierarchy is always placed at the top of the diagram, and the lower (junior) level is placed at the bottom.
Retrospection
Estimated time: 1–2 hours

We covered retrospection earlier in this chapter. However, we need to be able to present the concept to the workshop attendees so that they can assign values for retrospection to each of the data warehouse components. Whereas it is important for us to comprehend fully the design issues surrounding retrospection, it is usually not necessary for our customers to have the same level of understanding. We should present something that enables them to grasp the business significance of the subject without becoming embroiled in the technical detail. Anyway, here's a recap. Retrospection is the ability to look backward into the past. Each data warehouse object can take one of three possible values for retrospection:

1. True. True retrospection means that the data warehouse must faithfully record and report upon the changing values of an object over time. This means that true temporal support must be provided for the data warehouse object in question.

2. False. False retrospection means that, while the object can change its value, only the latest value is required to be held in the data warehouse. Any previous values are permanently lost. This means that temporal support is not required.

3. Permanent. Permanent retrospection means that the values will not change during the life of the data warehouse. This means that temporal support is not required.

The meaning of retrospection varies slightly when applied to different types of warehouse objects. When applied to dimensions, the value for retrospection relates to the existence of the dimension in question. For instance, if we need to know how many customers we have right now, then we must be able to distinguish between current customers and previous customers. If we want to be able to ask, "How many customers do we have today compared to exactly one year ago?" then we need to know exactly when a customer became a customer and when they ceased being a customer. Some customers may have a discontinuous lifespan: for some periods of time they may be active customers, and during the intervening periods they may not be. If it is required to track this information faithfully, then retrospection = true would apply to the customer dimension. Another example might be the supplier dimension. While we might be interested to know whether a supplier is a "current" or "dead" supplier, it is perhaps not so crucial to know when they became a supplier and when they ceased. So, in this case, retrospection = false may well be more appropriate. Still another example of dimensions is region. In some applications, the regions might be expected always to exist. In these cases, there is no point in tracking the existence of the dimensions, so we allow a status of retrospection = permanent to accommodate this situation.

Insofar as hierarchies and attributes are concerned, retrospection relates to the values that are held by these objects. If a customer moves from one region to another, or changes some important attribute such as, say, number of children, then, depending on the application, it may be important to be able to trace the history of these changes to ensure that queries return accurate results. If this is the case, then retrospection = true should be applied to the attribute or hierarchy. In other cases it may be required to record only the latest value, such as the customer's spouse's name; here, retrospection = false would apply. In still other cases the values may never change. An example of this is date of birth. In these cases a value of retrospection = permanent would be more applicable.

The value given to each data warehouse object for retrospection will become very important when the project moves into the design phase. It must be remembered that the requirements relating to retrospection are business, and not technical, requirements. It is all about the accuracy of the results of queries submitted by the users.
The measurable fact attributes, once entered into the warehouse, never change.
These attributes have an implied value of:

Retrospection = Permanent

Therefore, we have to examine each dimension, dimensional attribute, and hierarchy to establish which value for retrospection should be applied. This means that the worksheets previously completed must be revisited.
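Although the implementation belongs to the design phase, a short sketch may help to show what is at stake when retrospection = true is assigned. One common relational device (all names here are hypothetical) is a pair of dates bounding each value's period of validity:

    -- Each change closes the current row and opens a new one, so history
    -- is preserved faithfully.
    CREATE TABLE customer_address_hist (
        customer_id INTEGER     NOT NULL,
        address     VARCHAR(75) NOT NULL,
        valid_from  DATE        NOT NULL,
        valid_to    DATE        NOT NULL,  -- open rows use a high date, e.g. 9999-12-31
        PRIMARY KEY (customer_id, valid_from)
    );

    -- "Which address applied on 30 June 1999?" -- the kind of question
    -- that retrospection = false or permanent could never answer.
    SELECT address
    FROM   customer_address_hist
    WHERE  customer_id = :id
    AND    DATE '1999-06-30' >= valid_from
    AND    DATE '1999-06-30' <  valid_to;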
How to Determine the Value for Retrospection
In order to establish the value for retrospection of each data warehouse object, the designer must investigate the use of the object within the organization. One way of doing this is to question the appropriate personnel as follows: "If this object were to change, would you want to be able to track the history accurately?" Experience has shown that the response to this type of question is almost invariably in the affirmative. As a result, every attribute becomes a candidate for special treatment with respect to time, and every dimension, relationship, and attribute will be assigned a value for retrospection of true. Because the provision of temporal support is expensive in terms of resources and adversely affects performance, it is very important that such support is provided only where it will truly add value to the information provided by the data warehouse.

So we need an approach to ascertaining the real need for temporal support on an object-by-object basis. The method should, as far as possible, provide an objective means of evaluation. One approach is to ask questions that do not invite a simple yes or no response. Try asking the following questions.

For entities and dimensions: How would the answer to the following question help you in making decisions regarding your business objectives: How many customers do we have today compared to the same time last year? Other questions along similar lines, such as: How many customers have we had for longer than one year, or two years? should also be phrased, and the answers should be expressed in terms that show how the ability to answer such questions would provide a clear advantage.

For relationships, the questions might be: How many customers do we have in each sales area today compared to the same time last year?

For attributes: How many of our customers have moved in the past year?

The questions need to focus on the time requirements, preferably with a time-series type of answer. The responder should be able to demonstrate how the answers to the questions would be helpful in the pursuit of their goals.
Granularity and the Dot_Time Table
Estimated time: 1–2 hours We have discussed the time granularity of the measurable facts. Now we must discuss the time granularity of the dimensions, hierarchies, and attributes. We also have to decide on the contents of the Dot_Time table. For dimensions, hierarchies, and attributes, the time granularity relates to the frequency with which changes to values, and changes to existence, are notified and applied to the data warehouse. So if, for example, customer updates are to be sent to the warehouse on a monthly basis, then the time granularity is monthly. In order for complete accuracy to be assured, the granularity of time for the capture of changes in the dimensions must be the same as the granularity of time for the capture of the measurable fact events. In practice, this is hardly ever possible. There is an inevitable delay in time between the change event and the capture of that change in the source system. There is, usually, a further delay between the implementation of the change in the source system and the subsequent capture of the change in the data warehouse. The challenge for data warehouse designers is to minimize the delays in the capture of changes. The Dot Time worksheet is simply a list of all the headings, relating to time, by which the users will need to group the results from their queries. The requirements will vary from one application to another. Below is an example of the worksheet. Dot Modeling—Time Model Name: Example Name
Heading              Description                         Data Type
Day name             Standard day names                  String
Day number           Day number 1–7 (Monday = 1)         Numeric
Bank holiday flag    Y = bank holiday                    Character (Y/N)
Month number         Standard months 01–12               Numeric
Month name           Standard month names                String
Quarter              Standard quarters Q1, Q2, Q3, Q4    Numeric
Year                 Year numbers                        Numeric
Weekend day          Is it a Saturday or Sunday?         Character (Y/N)
Fiscal month no.     April = 01, March = 12              Numeric
Fiscal quarter       Q1 = April – June                   Numeric
24-hour opening      Y = open for 24 hours               Character (Y/N)
Weather conditions   Indicator for weather               Character
It is also important to establish how much history the application will wish to hold. This has two main benefits:
1. It tells the designers how many records will have to be created in order to populate the Dot Time table.
2. It gives the designers some indication of the ultimate size of the database when fully populated.

Also, check whether future data will be needed, such as forecasts and budgets, as these will have to be built into the Dot Time table as well.
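As a rough illustration of where the worksheet leads, the example above might translate into a table definition along the following lines. This is a minimal sketch only: the table name follows the dot_time convention used later in this chapter, and the column names and data types are assumptions; each application would adjust the columns to match its own worksheet.

create table dot_time (
    time_code           date not null,   -- one row per day at daily granularity
    day_name            varchar(9),      -- standard day names
    day_number          smallint,        -- 1-7, Monday = 1
    bank_holiday_flag   char(1),         -- 'Y' = bank holiday
    month_number        smallint,        -- standard months 01-12
    month_name          varchar(9),      -- standard month names
    quarter             char(2),         -- Q1, Q2, Q3, Q4
    year                smallint,        -- year number
    weekend_day         char(1),         -- 'Y' if Saturday or Sunday
    fiscal_month_no     smallint,        -- April = 01, March = 12
    fiscal_quarter      char(2),         -- Q1 = April-June
    open_24_hours       char(1),         -- 'Y' = open for 24 hours
    weather_conditions  char(1),         -- indicator for weather
    primary key (time_code)
);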
Workshop Wrap-Up
Estimated time: 30 minutes

It is important to summarize the process that has been conducted over the previous two days. We now have a complete, business-driven, conceptual information model that the designers can use to help build the data warehouse, together with plenty of instructive data to help us move forward and develop the logical model, which is the next step in the process.
SUMMARY

We need a conceptual model for data warehousing in which we can include all the complex aspects of our design, without making the model too complex for users to comprehend. One of the major strengths of the first-generation solutions was their inherent simplicity. However, as we have seen, the industry has moved on, and the next-generation data warehouse must evolve to support the kinds of information our users will demand.

For a data warehouse to support CRM properly, we need a customer-centric model which, while retaining the strengths and benefits of the old behavior-centric solutions, is much more in tune with customers' circumstances, especially their changing circumstances. As we have seen in previous chapters, it is often these changing circumstances that cause a change in customer behavior. We need to be able to analyze these underlying causes if we are to be successful at CRM.

In this chapter we introduced the concept of retrospection. Retrospection provides us with a way of classifying each component of the customers' circumstances so that we can build a full picture of each customer over time. We also introduced dot modeling. Dot modeling helps us to build a picture of the data warehouse starting with the behavior, the most familiar part, followed by the incorporation of the circumstances, retrospection, dependencies, and everything else we need to describe the business requirements correctly. The main part of the model can be built by the business people themselves under our guidance. We worked through a workshop scenario to show precisely how this might be achieved.

Ultimately we end up with a customer-centric model, our GCM, that can be transformed into a logical model. This transformation is the subject of the next chapter.
Chapter 6. The Logical Model

In this chapter we explore solutions to the implementation of the general conceptual model (GCM) in data warehouses. One of the key components of our model is the concept of retrospection, which was introduced in the previous chapter. Various options as to how it might be implemented are explored in this chapter. In connection with retrospection, the subject of existence, which was discussed in Chapter 3, is developed and used to show how some of the queries that were very difficult, if not impossible, to express using the Type 2 method in Chapter 4 can be written successfully. Also in this chapter, the solution is developed further to show how it can be transformed into a relational logical model.

In practice, the logical modeling stage of the conceptual-logical-physical progression is now usually omitted, and designers move from a conceptual model straight to a physical model. This involves writing the data definition language (DDL) statements directly from the conceptual model. This practice has evolved over time because relational databases have become the assumed implementation platform for virtually all database applications. This applies to operational systems as well as informational systems such as data warehouses. However, many data warehouses are implemented not on relational database systems but on dimensional database systems that are generally known as online analytical processing (OLAP) systems. There are currently no standards in place to assist designers in the production of logical models for proprietary OLAP database systems as there are for relational systems. Even so, in order to produce a practitioner's guide to developing data warehouses, the logical design process must pay some regard to nonrelational database management systems. Where the target database management system is not relational, in the absence of any agreed-on standards, it is not possible to produce a physical model unless the designer has intimate knowledge of the proprietary DDL or physical implementation processes of the DBMS in question.

In this chapter we'll also consider the performance tradeoff. The temporal solutions that are put forward will have an impact on performance. We'll briefly consider the ramifications of this and suggest some solutions. The chapter then provides recommendations as to the circumstances in which each of the available temporal solutions should be adopted. Finally, the chapter defines some
constraints that have to be applied to ensure that the implementation of the representation of time does not compromise the integrity of the data warehouse.
LOGICAL MODELING

When reviewing these problems with the objective of trying to formulate potential solutions, the following are the main requirements that need to be satisfied:

Accurate reporting of the facts. It is very important that, in a dimensional model, whenever a fact entry is joined to a dimension entry, the fact must join to the correct dimension with respect to time. Where a snowflake schema exists, the fact entry must join to the correct dimension entry no matter where that dimension entry appears in the hierarchy. In reality, it is recognized that this can never be wholly achieved in all cases, as deficiencies in the capture of data from the operational systems of the organization will impinge on our ability to satisfy this requirement. We can hope to satisfy this requirement entirely only insofar as the warehouse model itself is concerned, and to minimize, as far as possible, the negative effect caused by the operational systems.

Accurate recording of the changes in entities to support queries involving customer circumstances and dimension browsing. Queries such as these represent a very significant component of data warehouse usage. It is important to ensure that the periods of existence (validity) of dimensions, relationships, and attributes are recorded accurately with respect to time where this has been identified as a business requirement. Again, the ability to do this is constrained by the accuracy and quality of data supplied by the operational systems. It is important that the warehouse does not compound the problem.
THE IMPLEMENTATION OF RETROSPECTION

Introduction

We begin this section with a general rule: Every query executed in a data warehouse must have a time constraint. If an executed query does not have an explicit time constraint, then the inferred time period is "for all time." Queries that embrace all of time, insofar as the data warehouse is concerned, can generally be regarded as nonsensical, because "for all time" simply means the arbitrary length of time that the database has been in existence. Whereas it may be sensible to aggregate across all customers or all products in order to ascertain some information about, say, total revenue for a period of time, it does not make sense to apply the same approach to time under normal circumstances.

The following query has been quoted extensively: How many customers do we have? Even this query has an implicit time constraint in that it is really asking: How many customers do we have at this precise point in time, or how many customers currently exist?

It is likely that readers of this book may be able to think of circumstances where the absence of a time constraint makes perfect sense, which is why the axiom is offered as a general rule rather than a fixed one. The principle still holds.
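To make the rule concrete, consider how the implicit constraint might be made explicit. The following sketch is illustrative only; the table and column names are assumptions, and the existence attribute it relies on is developed properly in the next section:

-- Without a time constraint: counts every customer row ever recorded ("for all time")
Select count(*)
from Customer;

-- With the implicit "now" made explicit: counts only currently existing customers
Select count(*)
from Customer C
where C.Existence = TRUE;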
The Use of Existence Attributes

We'll now explore the use of existence attributes as one approach to the implementation of retrospection. The concept of existence has already been raised, but its general application to all data warehouse components might not be entirely clear. The reasoning is as follows:

The temporal requirement for any entity relates to its existence. The existence of an entity may, in some circumstances, be discontinuous. For instance, a wine may be sold by the Wine Club for a period of time and then be withdrawn. Thereafter, it may again come into existence for future periods of time. This discontinuity in the lifecycle of the wine may be important when queries such as "How many product lines do we sell currently?" are asked.

A further example concerns customers. The regaining of lost customers is increasingly becoming an important business objective, especially in the telecommunications and financial services industries. When customers are "won back," the companies prefer to reinstate them with their previous history intact. It is now becoming more common for customers to have discontinuous existences.
A relationship that requires temporal support can be modeled as an m:n relationship. Using the entity relationship (ER) standard method for resolving m:n relationships, a further entity, an intersection entity, would be placed between the related entities. So there now exists another entity (a weak entity). As the temporal requirement for an entity relates to the existence of the entity, the treatment of relationships is precisely the same as the treatment of entities. It is a question of existence once again.

An attribute should really be regarded as a property of an entity. The entity is engaged in a relationship with a domain of values. Over time, that relationship would necessarily be modeled as an m:n relationship. An attribute that requires temporal support, therefore, would be resolved into another intersection entity between the entity and the domain, although the domain is not actually shown on the diagram. The treatment of attributes is, therefore, the same as for entities and relationships and relates to the existence of the value of the attribute for a period of time.

It is proposed that temporal support for each element, where required, within customer circumstances, segmentation, or the dimensional structure of a data warehouse should be implemented by existence attributes. Various ways of representing such an existence attribute may be considered. At the simplest level, for an entity, an existence attribute could be added to record whether each occurrence is currently active or not. However, if full support for history is needed, then this requires a composite existence attribute consisting of a start time and an end time that records each period of existence. Some special value should be assigned to the end time to denote a currently active period. Each element may have such an existence attribute implemented as follows:
1. For the temporal support of an entity that may have a discontinuous existence, a separate table is required, consisting of the primary key of the entity and an existence period. If discontinuous existence is not possible, the existence period may be added to the entity type.
2. For the temporal support of a relationship between entities, a separate table is required, consisting of the primary keys of both participating entities together with an existence period.
3. For the temporal support of an attribute of an entity, a separate table is required, consisting of the primary key of the entity, an existence period, and the attribute value.

It should be noted that the concept of existence attributes is not new. In fact, the approach could be described as a kind of "selective" attribute timestamping. However, use of attribute timestamping has largely been limited to research into temporal database management systems and has not before been associated with data warehousing. It is the selective adoption of attribute timestamps, in the form of existence attributes, for data warehousing purposes that is new.

The use of existence attributes solves the problem of cascaded extraneous inserts into the database caused by the use of the Type 2 solution with a slowly changing hierarchy that was described in Chapter 4. The reason is that there is no need to introduce a generalized key for a dimension, because changes to an attribute are kept in a separate table. However, the performance and usability of the data warehouse need to be considered. The use of existence attributes in the form of additional tables does add to the complexity, so it should be allowed only for those elements where there is a
clearly identified need for temporal support. It is for the decision makers to choose which elements are sufficiently important to justify such treatment.

To explore the benefits of existence attributes, let us consider that the user wishes to know the number of customers that a particular sales executive is responsible for. The intuitive query, reproduced from Query Listing 4.6 in Chapter 4, is shown in Query Listing 6.1.
Listing 6.1 Nonexpert query to count customers (reproduced from 4.6).
Select count(*)
from Sales_Exec S, Customer C
where S.SalesExecNum = C.SalesExecNum
And S.Name = 'Tom Sawyer'

If Type 2 is implemented, the result would definitely be wrong. Each customer would have one or more rows in the table, depending on the number of changes that had occurred to the customer's record. The result would substantially overstate the real situation. Type 2 works by creating a new dimension record with a different generalized key.

One simple solution is to give the dimension an additional attribute that signifies whether the row is the latest row for the customer. The value of this attribute declares the existence of the dimension and remains true until the row is superseded. In its simplest form, the existence of a dimension can be implemented using a Boolean attribute. This means that when a change is implemented, the previous latest row is updated and its existence attribute is set to false. The new row has its existence attribute set to true. The new query to determine the number of customers that a particular sales executive is responsible for is as follows:
Select count(*)
from Sales_Exec S, Customer C
where S.SalesExecNum = C.SalesExecNum
And S.Name = 'Tom Sawyer'
And C.Existence = TRUE

The end-user query software could be configured to add this last line to all dimension tables, so the user need not be aware of it. However, this does not entirely solve the problem, because it is not answering the question: How many customers is Tom Sawyer responsible for now? Rather, it is answering the question: How many customers has Tom Sawyer ever been responsible for?

One method toward solving this problem would be to also set the existence attribute to false when the customer ceased to be a customer. Whether or not this is possible depends on the ability of the data warehouse processing to detect that a customer's status had become inactive. For instance, if the customer's record in the source system were to be deleted, the data warehouse processing could infer that the customer's existence attribute in the data warehouse should be updated to false.
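The maintenance that this implies might be sketched as follows. This is a hedged illustration, not a prescribed load routine; the column names and the :cust style placeholders are assumptions:

-- When a Type 2 change arrives, retire the previous latest row...
update Customer
set    Existence = FALSE
where  CustomerCode = :cust
and    Existence = TRUE;

-- ...and insert the new version as the current row
insert into Customer (CustomerSurrogate, CustomerCode, CustomerName, Existence)
values (:new_surrogate, :cust, :new_name, TRUE);

-- If the source system shows that the customer has gone, no new row is needed:
update Customer
set    Existence = FALSE
where  CustomerCode = :cust
and    Existence = TRUE;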
Another variation is to make use of null values in the existence attribute, as follows:

Null for not existing
Not null for existing (i.e., current)

If the column were called "existence," then to identify all customers who are still active, the query would be as shown here:
Select count(existence)
from Sales_Exec S, Customer C
where S.SalesExecNum = C.SalesExecNum
And S.Name = 'Tom Sawyer'

Due to the way nulls are treated (i.e., they are ignored unless explicitly coded for), this expression of the query is almost as simple as the original, intuitive query phrased by the user in Query Listing 6.1. Furthermore, if the true value of existence was 1, then the following query would also return the correct result:
Select sum(existence)
from Sales_Exec S, Customer C
where S.SalesExecNum = C.SalesExecNum
And S.Name = 'Tom Sawyer'

This appears to effectively solve the problem of determining who the current customers are. Thus, even the simplest existence attribute can improve the original Type 2 method considerably.

The existence attribute could be implemented using a single "effective date." This has the advantage that we can determine when a change occurred. However, such a method does not allow us to answer the previous query (How many customers is Tom Sawyer responsible for now?), because, again, there is no means of determining inactive customers. The use of row timestamping, using a pair of dates, does enable the question to be answered, so long as the end date is updated when the customer becomes inactive. However, there are many questions, such as state duration and transition detection questions, that are very difficult to express, and even some that are impossible to express, using this approach.

We have already explored the concept of customer churn. The loss of valuable customers is a subject close to the hearts of many, many business people. It is a problem that could equally apply to the Wine Club. It is now becoming a common practice to contact customers who have churned in sales campaigns in an attempt to attract them back. Where organizations are successful in doing this, the customers are reinstated with their previous identifiers so that their previous history is available to the customer service staff. The incidence of discontinuous existences is, therefore, becoming more common.
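A won-back customer would then appear in the existence table with two periods, for example (the customer code and dates here are invented for illustration):

CustomerCode   ExistenceStart   ExistenceEnd
C1011          1996/05/14       1999/02/28     -- first spell as a member
C1011          2000/07/01       Now            -- reinstated, history intact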
The need to monitor churn and to establish the reasons for it tends to create a need for queries that return results in the form of a time series. The following exemplifies the type of questions we would like to express:
1. How many customers did we lose during the last quarter of 2000, compared to 1999 and 1998? The result of such a query would be a time series containing three periods and a number attached to each period. This is an example of a temporal selection query.
2. Of the customers who were lost, how many had been customers continuously for at least one year? The loss of long-standing customers might be considered to be as a result of worsening service. The result from this question is also a time series, but the query contains an examination of durations of time. So this query is a state duration query.
3. How many of these customers experienced a change of administration because they moved in the year that they left? Perhaps they are unhappy with the level of service provided by the new area. This question is concerned with the existence of the relationship between the customer and the sales area. It is also an example of a transition detection query.
4. How many price changes did they experience in the year that they left? Perhaps they were unhappy with the number of price rises imposed. This is similar to question 3 in that it is a transition detection query, but it is applied to the value of an attribute, in this case the selling price of a bottle of wine, instead of a relationship.

These requirements cannot be satisfied using a single Boolean attribute to describe the existence of a dimension, as there is a requirement to make comparisons between dates. Neither can the queries be expressed using a single date attribute, for the reason previously explained. It seems clear that the expression of such queries requires a composite existence attribute that is, logically, a period comprising a start time and an end time. It has been shown that row timestamping can provide a solution in many cases, but not all, and the resulting queries are complex to write. A simpler solution is sought.

The approach to be adopted will be to use a separate existence attribute, in the form of a composite start time and end time, for each dimension, relationship, and attribute where retrospection has been defined to be true. It is assumed that these existence attributes will be held in separate tables. So the existence attribute for customers' existence will be held in a table called "CustomerExist" and the existence attribute for the period during which a customer lives in a sales area will be called "CustomerSalesAreaExist." Using the composite existence attribute, the first query can be expressed as in Query Listing 6.2.
Listing 6.2 Count of customers who have left during Q4 year on year.
select 'Q4 2000' as quarter, count(*)
from CustomerExist ce
where ce.ExistenceEnd between '2000/10/01' and '2000/12/31'
union
select 'Q4 1999' as quarter, count(*)
from CustomerExist ce
where ce.ExistenceEnd between '1999/10/01' and '1999/12/31'
union
select 'Q4 1998' as quarter, count(*)
from CustomerExist ce
where ce.ExistenceEnd between '1998/10/01' and '1998/12/31'

The query in Listing 6.2 could have been answered using row timestamping only if the end timestamp was updated to show that the customer was no longer active. The distinction is made between the existence attribute and the row timestamp because the existence attribute is a single-purpose attribute that purely records the existence of the customer. The row timestamp is, as has been stated, a multipurpose attribute that records other types of changes as well. In order to express the query using row timestamps, it would have to be written as a correlated subquery to ensure that only the latest record for the customer was evaluated. This means that discontinuous existences could not be detected.

The second query can be expressed as in Query Listing 6.3.
Listing 6.3 Count of long-standing customers lost.
select 'Q4 2000' as quarter, count(*)
from CustomerExist ce
where ce.ExistenceEnd between '2000/10/01' and '2000/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
union
select 'Q4 1999' as quarter, count(*)
from CustomerExist ce
where ce.ExistenceEnd between '1999/10/01' and '1999/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
union
select 'Q4 1998' as quarter, count(*)
from CustomerExist ce
where ce.ExistenceEnd between '1998/10/01' and '1998/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
In order to express the third query, it is assumed that there is a separate existence attribute for the relationship between the customer and the sales area. This is shown in Query Listing 6.4.
Listing 6.4 Lost customers who moved.
select 'Q4 2000' as quarter, count(*)
from CustomerExist ce, CustomerSalesAreaExist csa
where ce.ExistenceEnd between '2000/10/01' and '2000/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = csa.CustomerCode
and csa.ExistenceStart between '2000/01/01' and '2000/12/31'
union
select 'Q4 1999' as quarter, count(*)
from CustomerExist ce, CustomerSalesAreaExist csa
where ce.ExistenceEnd between '1999/10/01' and '1999/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = csa.CustomerCode
and csa.ExistenceStart between '1999/01/01' and '1999/12/31'
union
select 'Q4 1998' as quarter, count(*)
from CustomerExist ce, CustomerSalesAreaExist csa
where ce.ExistenceEnd between '1998/10/01' and '1998/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = csa.CustomerCode
and csa.ExistenceStart between '1998/01/01' and '1998/12/31'

The query in Listing 6.4 is an example of a combined state duration and transition detection query. As with the other queries, in order to express the fourth query, it is assumed that there is a separate existence attribute for the bottle price (Listing 6.5).
Listing 6.5 Lost customers affected by price increases.
select 'Q4 2000' as quarter, ce.CustomerCode,
count(distinct spe.WineCode)
from CustomerExist ce, SalesPriceExist spe, Sales s, Time t
where ce.ExistenceEnd between '2000/10/01' and '2000/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = s.CustomerCode
and s.WineCode = spe.WineCode
and s.TimeCode = t.TimeCode
and t.year = 2000
and spe.ExistenceStart between '2000/01/01' and '2000/12/31'
group by quarter, ce.CustomerCode
having count(distinct spe.WineCode) > 5
union
select 'Q4 1999' as quarter, ce.CustomerCode,
count(distinct spe.WineCode)
from CustomerExist ce, SalesPriceExist spe, Sales s, Time t
where ce.ExistenceEnd between '1999/10/01' and '1999/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = s.CustomerCode
and s.WineCode = spe.WineCode
and s.TimeCode = t.TimeCode
and t.year = 1999
and spe.ExistenceStart between '1999/01/01' and '1999/12/31'
group by quarter, ce.CustomerCode
having count(distinct spe.WineCode) > 5
union
select 'Q4 1998' as quarter, ce.CustomerCode,
count(distinct spe.WineCode)
from CustomerExist ce, SalesPriceExist spe, Sales s, Time t
where ce.ExistenceEnd between '1998/10/01' and '1998/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = s.CustomerCode
and s.WineCode = spe.WineCode
and s.TimeCode = t.TimeCode
and t.year = 1998
and spe.ExistenceStart between '1998/01/01' and '1998/12/31'
group by quarter, ce.CustomerCode
having count(distinct spe.WineCode) > 5

The query in Query Listing 6.5 shows customers who left the club in the last quarter of the year and who had experienced more than five price changes during that year. This is another example of a combined state duration and transition detection query.
In dealing with the issue of churn, this approach of trying to detect patterns of customer behavior is typical. It is accepted that the queries have drawbacks in that they are quite complex and would prove difficult for the average data warehouse user to write. Also, each query actually consists of a set of smaller queries, and each of the smaller queries is responsible for processing a discrete point in time or a discrete duration. Any requirement to increase the overall timespan, or to break the query into smaller discrete timespans, would result in many more queries being added to the set. So the queries cannot be generalized to deal with a wide range of times. In the next section, we explore and propose an approach for making the queries much easier to express and generalize.

In this section we have seen that the use of existence periods does provide a practical solution to the implementation of true retrospection. This enables time to be properly represented in the customer circumstances and dimensional structures of the data warehouse and, therefore, satisfies one of the major requirements in the design of data warehouses.
THE USE OF THE TIME DIMENSION

In this section let's explore how the use of the composite existence attribute that was introduced in the previous section, together with the time dimension, may allow the expression of some complex queries to be simplified. The purpose of the time dimension is to provide a mechanism for constraining and grouping the facts, as do the other dimensions in the model. We are examining methods for properly representing customers; this means providing support for time in the customer circumstances, dimensions, and dimensional hierarchies as well as the facts. The four queries listed in Query Listings 6.2–6.5 show that similar time constraints apply to the dimensions as to the facts. Therefore, it seems appropriate to allow the users of the data warehouse to express time constraints on other components using the same approach as they do with the facts.

Kimball proscribes the use of the time dimension with other dimensions because he is concerned that the semantics of time in the dimensions is different from that of facts and is potentially misleading. The view that the time dimension should not be used in dimensional browse queries is supported implicitly by the conventional star schema and snowflake schema data models, which show the time dimension as being related to the fact table alone. There is no relationship between the time dimension and any other dimension on any dimensional model that I have seen.

In considering this matter, two points emerge. First, the time dimension provides a simple interface to users when formulating queries. Preventing them from using the time dimension with other entities means that the users will be able to place time constraints by selecting terms such as 2nd Quarter 2000 in queries involving the fact table, but not in queries involving other entities such as customer circumstances and dimensions. In these types of queries, the explicit time values have to be coded. Second, some dimensional browsing queries are much easier to express if a join is permitted between these other entities and the time dimension. Further, I have discovered that some useful but complex queries, such as those in the previous section, can be generalized if a join to the time dimension is permitted. This is described below.

Referring to the queries expressed in Listings 6.2–6.5, the first query (Listing 6.2) is: How many customers did we lose during the last quarter of 2000, compared to 1999 and 1998? Using the time dimension, it can be expressed as shown in Listing 6.6.
Listing 6.6 Count of customers lost during Q4, using the time dimension.
select t.Quarter, count(*)
from CustomerExist ce, Time t
where ce.ExistenceEnd = t.TimeCode
and t.Quarter in ('Q42000', 'Q41999', 'Q41998')
group by t.Quarter

Changes to the temporal scope of the query can be effected simply by altering one line of the predicate, instead of creating additional discrete queries.

The query from Listing 6.3 is: Of the customers who were lost, how many had been customers continuously for at least one year? This can be expressed as follows:
Listing 6.7 Count of long-standing customers lost, using the time dimension.
select t.Quarter, count(*)
from CustomerExist ce, Time t
where ce.ExistenceEnd = t.TimeCode
and t.Quarter in ('Q42000', 'Q41999', 'Q41998')
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
group by t.Quarter

The third query, from Listing 6.4, is: How many of the customers experienced a change of administration because they moved in the year that they left? Using the same assumptions as before, the query can be expressed as in Query Listing 6.8.
Listing 6.8 Lost customers who moved, using the time dimension.
select t1.Quarter, count(*)
from CustomerExist ce, CustomerSalesAreaExist csa, Time t1, Time t2
where ce.ExistenceEnd = t1.TimeCode
and t1.Quarter in ('Q42000', 'Q41999', 'Q41998')
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = csa.CustomerCode
and csa.ExistenceStart = t2.TimeCode
and t2.Year = t1.Year
group by t1.Quarter

Finally, the fourth query, from Query Listing 6.5, is: How many price changes did they experience in the year that they left? This can be expressed as shown in Query Listing 6.9.
Listing 6.9 Lost customers affected by price increases, using the time dimension.
select t1.Quarter, ce.CustomerCode,
count(distinct spe.WineCode)
from CustomerExist ce, SalesPriceExist spe, Sales s, Time t1, Time t2, Time t3
where ce.ExistenceEnd = t1.TimeCode
and t1.Quarter in ('Q42000', 'Q41999', 'Q41998')
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = s.CustomerCode
and s.WineCode = spe.WineCode
and spe.ExistenceStart = t2.TimeCode
and s.TimeCode = t3.TimeCode
and t2.Year = t1.Year
and t3.Year = t2.Year
group by t1.Quarter, ce.CustomerCode
having count(distinct spe.WineCode) > 5

Thus, we can conclude that allowing the time dimension to be joined to other dimensions, when existence attributes are used, enables a simpler expression of some temporal queries.

In order to adopt a change of approach whereby joins are allowed between the time dimension and other dimensions, we have to alter the data model. There now exists a relationship between the time dimension and some of the other entities. Only those dimensions that need true support for time will be related to the time dimension. Part of the Wine Club model has been reproduced in Figure 6.1.
Figure 6.1. ER diagram showing new relationships to the time dimension.
Figure 6.1 shows time having a relationship with sales, as before, and also with the customer and wine dimensions. In fact, as Figure 6.1 shows, there are two relationships between the time dimension
and other dimensions, one for the start time of a period and one for the end time.

A problem caused by this change in the conventional approach is that dimensional models will immediately lose their simple shape, and the overall model will become horribly complex. Simplicity is one of the requirements placed on the model. In creating this new idea, I have introduced a further problem to be solved. It is important that the dimensional shape is not lost. Clearly, we cannot present the users with such a diagram.

The solution could be that the time dimension is removed altogether from the model. It has been said before that the time dimension is always included as part of a dimensional model because time is always a dimension of analysis in a data warehouse that records history. In further recognition of the fact that data warehouses are temporal databases, the explicit inclusion of a time dimension could be regarded as unnecessary. So, on the assumption that the time dimension is a given requirement, I have adopted the view that its inclusion is implicit. This means that the diagram does not need to model the time dimension explicitly. However, this causes a problem with the entity relationship modeling methodology, in that it would be misleading to have implicit entities that are deemed to exist but are excluded from the diagram. The dot modeling methodology does not have this limitation and will be adapted to accommodate the new requirement.

However, the time dimension does have attributes that are specific to each application, and so it is not something that can be ignored altogether. For instance, data warehouses in some types of organization require specific information about time, such as:

Half-day closing
Prevailing weather conditions
Effect of late opening due to staff training
Whether the store was open for 24 hours

This type of information cannot be obtained through any type of derivation. So there is a need for some means of specifying the attributes for time on a per-application basis. In the dot modeling methodology we can solve this problem by the introduction of a table that will satisfy the requirements previously handled by the explicit time dimension as well as the requirements covered in this section. The table could be given a standard table name for use in all applications. The use of Time as a name for the table is likely to conflict with some RDBMS reserved word lists so, for our purposes, the name dot_time will be used to describe the table. Each application will have its own requirements as to the columnar content of the dot_time table, although some columns, such as the following, would almost always be required:

Date
Day name
Week number
Month name
Month number
Quarter
Year

As practitioners, we could add value for our customers by bringing a "starter" dot_time table that might contain, say, 10 years of history and 10 years of future dates. This seems like a large amount of data, but in reality it is fewer than 8,000 rows where the granularity is daily. For finer levels of granularity, for example, seconds, it is sensible to provide two time tables. The first contains all the days required, as before, and the other contains an entry for each second of a single day (i.e., from 00:00:00 to 23:59:59). It is then a simple matter to join to one table, in the case of dimensional changes, or to both tables in the case of, say, telephone calls. In this way, multiple grains of time can be accommodated; a sketch of this two-table arrangement appears at the end of this section. Practitioners could also provide standard programmed procedures, perhaps using the user-defined functions capability that is available as an extension to some RDBMS products, to add optional columns such as weekends and bank holidays, although some customization of the dot_time table is almost inevitable for every application.

The removal of the explicit time dimension from the conceptual model to the logical model is a step forward in the approach to the design of data warehouses. It also goes some way toward recognizing that data warehouses are true temporal applications and that the support for time is implicit in the solution, rather than having to be made explicit on the data model.
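As a sketch of the two-table approach to finer grains described above (the table and column names here are assumptions, not a prescribed design): the dot_time table holds one row per day, and a second table holds one row per second of a single day. A call-level fact then joins to both.

create table dot_time_of_day (
    second_code    time not null,   -- 00:00:00 to 23:59:59, 86,400 rows in all
    hour_number    smallint,        -- 0-23, useful for time-of-day analysis
    minute_number  smallint,        -- 0-59
    primary key (second_code)
);

-- Telephone calls grouped by year and hour of day, joining to both time tables
select t.year, tod.hour_number, count(*)
from   calls c, dot_time t, dot_time_of_day tod
where  c.call_date = t.time_code
and    c.call_time = tod.second_code
group  by t.year, tod.hour_number;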
LOGICAL SCHEMA

The implementation of true retrospection for any entity, attribute, or relationship requires that the lifespan of the object in question is recorded. This must be in the form of a period marking the starting and ending times. The introduction of such periods changes the structure of the entities. For instance, the customer circumstances from the Wine Club model have a requirement for true retrospection on their own existence as well as on the relationship with the sales area and the customers' addresses. Each of these would be given its own existence period attributes, as illustrated in the diagram in Figure 6.2.

Figure 6.2 shows how the implementation of true retrospection using existence attributes results in the creation of new relations. The relational logical schema for the diagram is also shown. A more complete logical schema for the Wine Club is shown in Appendix C.

Figure 6.2. Logical model of part of the Wine Club.
Each of the relations in the diagram is now described:

Relation Customer
    Customer_Code
    Customer_Name
    Hobby_Code
    Date_Joined
    Primary Key (Customer_Code)
Relation Sales_Area
    Sales_Area_Code
    Sales_Area_Name
    Primary Key (Sales_Area_Code)

The logical schema has been included as an aid to clarity and is not intended to prescribe a physical model. Performance issues will have to be considered when implementing the solution, and some denormalization of the above schema may be appropriate. However, the above schema does present several examples of true retrospection.
True retrospection of an entity is exemplified in the existence of the customer by providing a separate relation containing just the customer code and the existence period. True retrospection of an attribute is shown by the customer address relation. True retrospection for a relationship is shown by the customer sales area relation, which records the relationship between the customer and the sales area. In each case the primary key includes the start time of the period attribute.

False retrospection is supported for the relationship between customers and hobbies by the inclusion of the hobby code as a foreign key in the customer relation. The customer name, for example, also has the property of false retrospection.
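A minimal sketch of how the existence relations just described might be declared is shown below. The table names follow the figure's wording, the data types are assumptions, and the special end-time value that denotes a currently active period (noted in the comments) is a design choice left to the implementer:

create table Customer_Exist (
    Customer_Code    char(6) not null,
    Existence_Start  date    not null,
    Existence_End    date,             -- a high date, or null, can mark the current period
    primary key (Customer_Code, Existence_Start)
);

create table Customer_Address (
    Customer_Code    char(6)      not null,
    Address          varchar(120) not null,   -- the attribute value itself
    Existence_Start  date         not null,
    Existence_End    date,
    primary key (Customer_Code, Existence_Start)
);

create table Customer_Sales_Area (
    Customer_Code    char(6) not null,
    Sales_Area_Code  char(4) not null,
    Existence_Start  date    not null,
    Existence_End    date,
    primary key (Customer_Code, Sales_Area_Code, Existence_Start)
);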
PERFORMANCE CONSIDERATIONS

This book is primarily concerned with the accuracy of information and not with performance. However, it is recognized that performance considerations have to be made at some point and that performance may be considered more important than accuracy when design decisions are taken. For this reason we will briefly explore the subject of performance.

A query involving the fact table, such as "the sum of sales for the year 2000 grouped by sales area," using a Type 2 approach with surrogate keys would be expressed as follows:
Select SalesAreaCode, sum(s.value)
From sales s, customer c, time t
Where s.CustomerSurrogate = c.CustomerSurrogate
And s.timecode = t.timecode
And t.year = 2000
Group by SalesAreaCode

Using our new schema, the same query requires a more complex join, as the following query shows:
Select SalesAreaCode, sum(s.Value)
From sales s, CustomerSalesArea csa, time t
Where s.customer_code = csa.customer_code
And s.time_code = t.time_code
And t.time_code between csa.start and csa.end
And t.year = 2000
Group by SalesAreaCode

The join between the time and customer sales area dimensions is not a natural join (it is known as a theta join) and is unlikely to be handled efficiently by most RDBMS query optimizers. A practical solution to the performance issue in these cases, while retaining the benefit of existence attributes, is to copy the sales area code to the fact table. As has been stated previously, the attributes of the fact table always have the property of "permanent" retrospection. So the accuracy of the results would not be compromised, and performance would be improved considerably, even better than the original, because a whole table is omitted from the join. This is shown by the next query:
Select SalesAreaCode, sum(s.Value)
From sales s, time t
where s.time_code = t.time_code
And t.year = 2000
Group by SalesAreaCode

The relationship between the customer and sales area dimensions must be left intact in order to
facilitate dimensional browsing. This is because the decomposition of the hierarchy is not reversible. In other words, the hierarchy cannot be reconstructed from the fact table, since, if there is no sale, there is no relationship between a customer and a sales area. So it would not be a nonloss decomposition.

The model enables dimensions to be queried and time-series types of analysis to be performed by using the time dimension to allow the sales area and customer dimensions to be grouped by any time attribute. Queries involving dimensions only, such as counting the number of customers by year, would be expressed as follows:
Select year, count(*)
From CustomerExistence ce, time t
Where t.timecode between ce.start and ce.end
And t.timecode in ('2000/12/31', '1999/12/31', '1998/12/31')
Group by year

This appears to be a reasonably efficient query. At this point it is worth reiterating that browse queries represent about 80 percent of the queries executed on the data warehouse.

At this level in the model, there is no reason to distinguish between the requirements of false and permanent retrospection. The main reason for these classifications concerns the population of the data warehouse and the capture of changed data values. In any case, the method handles these requirements quite naturally.
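Where between-style joins like these dominate the workload, conventional indexes on the existence periods can help; whether they pay their way is a physical design decision for the implementer, not a rule. A hedged sketch (the index names are mine, and some RDBMS products treat start and end as reserved words, which would force quoting or renaming):

create index csa_period_idx
    on CustomerSalesArea (customer_code, start, end);

create index ce_period_idx
    on CustomerExistence (start, end);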
CHOOSING A SOLUTION

The choice as to which solution is most appropriate depends upon the requirements and the type of data warehouse object under consideration. Retrospection provides a mechanism for freeing the circumstances and dimensions to be regarded as a source of information in their own right. If there is a requirement for true historical analysis to be performed on any components of the dimensional structure, then those components will have a classification for retrospection of true. If there is a requirement to perform state duration queries or transition detection queries, then the most appropriate choice is to provide an existence attribute, in the form of a start time and an end time, that is applied to the object in question.

As far as circumstances and dimensions are concerned, the need to establish discontinuous existence is the main reason for allocating true retrospection. At any point in time, a customer will or will not be active, owing to the discontinuous nature of their existence. True retrospection for circumstances and dimensions results from the need for time-series analysis relating to their existence. So the existence attribute approach is the only choice in this case.

False retrospection for circumstances and dimensions is similar. It enables the currently existing and no-longer-existing dimensions to be identified. So it is possible to determine which customers currently exist and which do not, but it is not possible to carry out any form of time-based analysis. The best approach for implementing false retrospection for circumstances and dimensions is to use an existence attribute, containing true or false values, as described earlier in this chapter.

Permanent retrospection requires no support at all, as the existence of the circumstances or dimension is considered never to change.

Support for relationships that implement dimensional hierarchies is entirely missing from the Type 2 solution, and any attempt to introduce a Type 2 or a row timestamp solution into a relationship may result in very large numbers of extraneous cascaded inserts, as was shown in Chapter 4. It is recommended that use of these techniques is avoided where there are implicit (as implemented by a star) or explicit (as implemented by a snowflake) hierarchies, unless all the superordinate objects in the hierarchy have the property of false or permanent retrospection.

True retrospection in relationships. When retrospection is true in a relationship, if state duration or transition detection analysis of the relationship is required, the relationship must be implemented using an existence attribute with a start time and an end time. There is a slight difference between the existence of relationships and the existence of dimensions. The subordinate customer will always be engaged in a relationship with a superordinate sales area, because the participation condition for the customer is mandatory. It is not, therefore, a question of existence versus
nonexistence, but rather a question of changing the relationship from one sales area to another and the existences pertaining to relationship instances. The discontinuous nature still exists but applies to changing relationships rather than gaps in durations. A row timestamp approach can be used where these requirements do not apply, as long as the cascade insertion problem is manageable.

False retrospection in relationships. False retrospection in relationships is best implemented by allowing the foreign key attribute to be overwritten in the subordinate dimension's row. No special treatment is required.

Permanent retrospection in relationships. With permanent retrospection, the foreign key attribute is not expected to change at all, so no special treatment is required.

True retrospection in attributes. The situation with regard to true retrospection in attributes is similar again. If state duration or transition detection analysis is required, then the simplest and most flexible solution is to use an existence attribute with a start time and an end time. Where these types of analysis are not needed, the use of row timestamps can be considered, as long as the problem relating to cascade insertions is manageable.

False retrospection in attributes. False retrospection in attributes requires no special support, as the attribute can be safely overwritten.

Permanent retrospection in attributes. Permanent retrospection also requires no support, as the attribute will not change.

The purpose of Table 6.1 is to provide a simplified guide to aid practitioners in the selection of the most appropriate solution for the representation of time in dimensions and circumstances.
Table 6.1. Choosing a Solution

Circumstances/Dimensions
    True:  State duration or transition detection analysis required?
           Yes: Existence period
           No:  Row time stamp
    False: Existence true/false attribute updated using Type 1 method
    Perm.: Will not change

Relationships
    True:  State duration or transition detection analysis required?
           Yes: Existence period
           No:  Does the hierarchy change regularly?
                Yes: Existence period
                No:  Row time stamp
    False: Type 1
    Perm.: Will not change

Attributes
    True:  State duration or transition detection analysis required?
           Yes: Existence period
           No:  Is this attribute at the lowest level in the hierarchy?
                Yes: Row time stamp
                No:  Existence period
    False: Type 1
    Perm.: Will not change
FREQUENCY OF CHANGED DATA CAPTURE

In the pursuit of accuracy relating to time, we need to know whether the data we are receiving for placement into the data warehouse accurately reflects the time that the change actually occurred. This was identified as another source of inaccuracy with respect to the representation of time in Chapter 4. This requirement cannot be fully satisfied by the data warehouse in isolation, as is now described.

So far as the facts are concerned, the time that is recorded against the event would be regarded, by the organization, as the valid time of the event. That means it truly reflects the time the sale occurred, or the call was made.

With the circumstances and dimensions, we are interested in capturing changes. As has been discussed, changes to some attributes are more important than others with respect to time. For the most important changes, we expect to record the times that the changes occurred. Some systems are able to provide valid-time changes to attributes, but most are not equipped to do this. So we are faced with the problem of deducing changes by some kind of comparison process that periodically examines current values and compares them to previous values to determine precisely what has changed and how. The only class of time available to us in this scenario is transaction time. Under normal circumstances, the transaction time is the time that the change is recorded in the operational system. Often, however, the transaction-time changes are not actually recorded anywhere by the application. Changes to, say, a customer's address simply result in the old address being replaced by the new address, with no record being kept as to when the change was implemented. Other systems attempting to detect the change, by a file comparison method, have no real way of knowing when the change occurred in the real world or when it was recorded in the system.

So, in a data warehouse environment, there are two time lags to be considered. The first is the lag between the time the change occurred in the real world, the valid time, and the time the change is recorded in an operational system, the transaction time. Usually, the organization is unaware of the valid time of a change event. In any case, the valid time is rarely recorded. The second time lag is the time it takes for a change, once it has been recorded in the operational system, to find its way into the data warehouse. The solution is to try to minimize the time lags inherent in this process. Although that is often easier said than done, the objective of the designers must be to identify and process changes as quickly as possible so that the temporal aspects of the facts and dimensions can be synchronized.
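Where the source system keeps no record of its changes, the comparison process described above can be sketched in SQL: retain the previous extract as a snapshot, compare it with the current one, and stamp whatever differs with the load date, the nearest available approximation to transaction time. The table and column names here are illustrative only:

-- Rows that are new, or whose tracked attributes have changed, since the previous extract
insert into customer_changes (customer_code, customer_name, address, change_detected)
select t.customer_code, t.customer_name, t.address, current_date
from   customer_today t
where  not exists
       (select 1
        from   customer_yesterday y
        where  y.customer_code = t.customer_code
        and    y.customer_name = t.customer_name
        and    y.address       = t.address);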
CONSTRAINTS

Part of the role of the logical model is to record constraints that need to be imposed on the model. The introduction of full support for time brings with it some additional requirements for the imposition of constraints.

Double-Counting Constraints

Double-counting occurs when the joining of tables returns more rows than should be returned. This problem is usually avoided if the general rules about the structure of dimensional models are followed. However, the introduction of existence attributes into the model increases the risk of error by changing the nature of the relationships in the warehouse data model from simple (1:n) to complex (m:n). The problem is best described by the use of an example of a sale of wine. Table 6.2 shows the bottle cost existence.
Table 6.2. Example of the Existence of a Wine Cost Entity

Wine Code   Start        End          Bottle Cost
4504        1997/02/27   1999/03/31   4.36
4504        1999/03/31   Now          4.79
The bottle cost has an existence attribute. The existence is continuous, but a change in the cost price of a bottle of the wine has occurred. This has resulted in the generation of a new row. Table 6.3 shows a fragment of the sales fact table detailing a sale of the wine above.
Table 6.3. A Single Sale of Wine

Wine Code   Day          Quantity   Value
4504        1999/03/31   5          24.95
Next, the following query is executed and is intended to show the sales value and costs:
Select w.wine_name, s.value "Revenue",
       sum(s.quantity * bce.bottle_cost) "Cost"
from Sales s, Wine w, BottleCostExistence bce
where w.wine_code = s.wine_code
and w.wine_code = bce.wine_code
and s.day between bce.start_date and bce.end_date
group by w.wine_name, s.value, bce.bottle_cost

The result set in Table 6.4 is returned.
Table 6.4. Example of Double-Counting

Wine Name          Revenue   Cost
Chianti Classico   24.95     21.80
Chianti Classico   24.95     23.95
The result is that the sale has been double-counted. The problem has occurred because of an overlap of dates that caused the sale, which was made on March 31, 1999, to join successfully to two rows. The date of the change may well be right: the old cost price ceased to be effective on March 31, 1999, and the new price took effect immediately, on the same day. As far as the query processing is concerned, the multiple join is also correct; the join criteria have been met. What is wrong is the granularity of time. There is an implicit constraint that states:

Time overlaps in existence are not permitted in a dimensional model.

Therefore, if both dates are correct, then the granularity is incorrect, and a finer grain, such as time of day, must be used instead. The alternative is to ensure that the end of the old period and the start of the new period actually "meet." This means that there is no overlap, as shown in Table 6.5.
Table 6.5. Wine Cost Entity Showing No Overlaps in Existence

Wine Code   Start        End          Bottle Cost
4504        1997/02/27   1999/03/30   4.36
4504        1999/03/31   Now          4.79
It is equally important to ensure that no gaps are inadvertently introduced, as in Table 6.6.
Table 6.6. Wine Cost Entity Showing Gaps in Existence

Wine Code   Start        End          Bottle Cost
4504        1997/02/27   1999/03/30   4.36
4504        1999/04/01   Now          4.79
If the data in Table 6.6 were used, then the result set would be empty, because the sale made on 1999/03/31 falls into the gap between the two existence periods.
The query used in the example was phrased to aid clarity. It is worth remembering that in data warehousing most queries involving joins to the fact table use arithmetical functions to aggregate the results. The chances of users identifying errors from the result sets of such queries are seriously reduced when large numbers of rows are aggregated. Therefore, in order for the concept of existence to work, temporal constraints must prevent any form of overlapping of periods; otherwise, there is a risk that double-counting might occur. The following query, using the temporal construct "overlaps," would detect such an occurrence.
Select R1.PK
From R1, R2
where R1.PK = R2.PK
and R1.Period <> R2.Period
and R1.Period overlaps R2.Period

R1 and R2, in the previous query, are synonyms for the same relation, and PK is the primary key. The relation is subjected to a self-join in order to identify temporal overlaps. This query can be rewritten, without using the temporal constructs, as in the following query:
Select R1.PK
from R1, R2
where R1.PK = R2.PK
and (R1.Start <> R2.Start or R1.End <> R2.End)
and R1.Start