DATA QUALITY
The Kluwer International Series on ADVANCES IN DATABASE SYSTEMS

Series Editor: Ahmed K. Elmagarmid, Purdue University, West Lafayette, IN 47907

Other books in the Series:

THE FRACTAL STRUCTURE OF DATA REFERENCE: Applications to the Memory Hierarchy, Bruce McNutt; ISBN: 0-7923-7945-4
SEMANTIC MODELS FOR MULTIMEDIA DATABASE SEARCHING AND BROWSING, Shu-Ching Chen, R.L. Kashyap, and Arif Ghafoor; ISBN: 0-7923-7888-1
INFORMATION BROKERING ACROSS HETEROGENEOUS DIGITAL DATA: A Metadata-based Approach, Vipul Kashyap, Amit Sheth; ISBN: 0-7923-7883-0
DATA DISSEMINATION IN WIRELESS COMPUTING ENVIRONMENTS, Kian-Lee Tan and Beng Chin Ooi; ISBN: 0-7923-7866-0
MIDDLEWARE NETWORKS: Concept, Design and Deployment of Internet Infrastructure, Michah Lerner, George Vanecek, Nino Vidovic, Dado Vrsalovic; ISBN: 0-7923-7840-7
ADVANCED DATABASE INDEXING, Yannis Manolopoulos, Yannis Theodoridis, Vassilis J. Tsotras; ISBN: 0-7923-7716-8
MULTILEVEL SECURE TRANSACTION PROCESSING, Vijay Atluri, Sushil Jajodia, Binto George; ISBN: 0-7923-7702-8
FUZZY LOGIC IN DATA MODELING, Guoqing Chen; ISBN: 0-7923-8253-6
INTERCONNECTING HETEROGENEOUS INFORMATION SYSTEMS, Athman Bouguettaya, Boualem Benatallah, Ahmed Elmagarmid; ISBN: 0-7923-8216-1
FOUNDATIONS OF KNOWLEDGE SYSTEMS: With Applications to Databases and Agents, Gerd Wagner; ISBN: 0-7923-8212-9
DATABASE RECOVERY, Vijay Kumar, Sang H. Son; ISBN: 0-7923-8192-0
PARALLEL, OBJECT-ORIENTED, AND ACTIVE KNOWLEDGE BASE SYSTEMS, Ioannis Vlahavas, Nick Bassiliades; ISBN: 0-7923-8117-3
DATA MANAGEMENT FOR MOBILE COMPUTING, Evaggelia Pitoura, George Samaras; ISBN: 0-7923-8053-3
MINING VERY LARGE DATABASES WITH PARALLEL PROCESSING, Alex A. Freitas, Simon H. Lavington; ISBN: 0-7923-8048-7
INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS, Elisa Bertino, Beng Chin Ooi, Ron Sacks-Davis, Kian-Lee Tan, Justin Zobel, Boris Shidlovsky, Barbara Catania; ISBN: 0-7923-9985-4
INDEX DATA STRUCTURES IN OBJECT-ORIENTED DATABASES, Thomas A. Mueck, Martin L. Polaschek; ISBN: 0-7923-9971-4
DATA QUALITY
by
Richard Y. Wang Mostapha Ziad Yang W. Lee
KLUWER ACADEMIC PUBLISHERS New York / Boston / Dordrecht / London / Moscow
eBook ISBN: 0-306-46987-1
Print ISBN: 0-792-37215-8
© 2002 Kluwer Academic Publishers
New York, Boston, Dordrecht, London, Moscow

All rights reserved. No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher.
Created in the United States of America
Visit Kluwer Online at: http://www.kluweronline.com
and Kluwer's eBookstore at: http://www.ebooks.kluweronline.com
To our families ...
Table of Contents

Preface

Chapter 1: Introduction
    Fundamental Concepts
    Data vs. Information
    Product vs. Information Manufacturing
    The Information Manufacturing System
    Information Quality is a Multi-Dimensional Concept
    The TDQM Cycle
    A Framework for TDQM
    Define IP
    Measure IP
    Analyze IP
    Improve IP
    Summary
    Book Organization

Chapter 2: Extending the Relational Model to Capture Data Quality Attributes
    The Polygen Model
    Architecture
    Polygen Data Structure
    Polygen Algebra
    The Attribute-based Model
    Data Structure
    Data Manipulation
    QI-Compatibility and QIV-Equal
    Quality Indicator Algebra
    Data Integrity
    Conclusion

Chapter 3: Extending the ER Model to Represent Data Quality Requirements
    Motivating Example
    Quality Requirements Identification
    Requirements Modeling
    Modeling Data Quality Requirements
    Data Quality Dimension Entity
    Data Quality Measure Entity
    Attribute Gerund Representation
    Conceptual Design Example
    Concluding Remarks

Chapter 4: Automating Data Quality Judgment
    Introduction
    Quality Indicators and Quality Parameters
    Overview
    Related Work
    Data Quality Reasoner
    Representation of Local Dominance Relationships
    Reasoning Component of DQR
    First-Order Data Quality Reasoner
    The Q-Reduction Algorithm
    Q-Merge Algorithm
    Conclusion

Chapter 5: Developing a Data Quality Algebra
    Assumptions
    Notation
    Definitions
    A Data Quality Algebra
    Accuracy Estimation of Data Derived by Selection
    Worst case when error distribution is non-uniform
    Best case when error distribution is non-uniform
    Accuracy Estimation of Data Derived by Projection
    Worst case when error distribution is non-uniform
    Best case when error distribution is non-uniform
    Illustrative Example
    Conclusion

Chapter 6: The MIT Context Interchange Project
    Integrating Heterogeneous Sources and Uses
    The Information Integration Challenge
    Information Extraction and Dissemination Challenges
    Information Interpretation Challenges
    Overview of the Context Interchange Approach
    Context Interchange Architecture
    Wrapping
    Mediation
    Conclusion

Chapter 7: The European Union Data Warehouse Quality Project
    Introduction
    The DWQ Project
    DWQ Project Structure
    DWQ Objectives
    Data Warehouse Components
    The Linkage to Data Quality
    DWQ Architectural Framework
    The Conceptual Perspective
    The Logical Perspective
    The Physical Perspective
    Relationships between the Perspectives
    Research Issues
    Rich Data Warehouse Architecture Modeling Languages
    Data Extraction and Reconciliation
    Data Aggregation and Customization
    Query Optimization
    Update Propagation
    Schema and Instance Evolution
    Quantitative Design Optimization
    Overview of the DWQ Demonstration
    Source Integration (Steps 1, 2)
    Aggregation and OLAP Query Generation (Steps 3, 4)
    Design Optimization and Data Reconciliation (Steps 5, 6)
    Summary and Conclusions

Chapter 8: The Purdue University Data Quality Project
    Introduction
    Background
    Related Work
    Searching Process
    Matching Process
    Methodology
    Data Pre-processing
    Sorting, Sampling & the Sorted Neighborhood Approach
    Clustering and Classification
    Decision Tree Transformation and Simplification
    Experiments and Observations
    Conclusions

Chapter 9: Conclusion
    Follow-up Research
    Other Research
    Future Directions

Bibliography

Index
PREFACE
If you would not be forgotten,
As soon as you are dead and rotten,
Either write things worth reading,
Or do things worth the writing.

Benjamin Franklin
This book provides an exposition of research and practice in data quality for technically oriented readers. It is based on the research conducted at the MIT Total Data Quality Management (TDQM) program and the work of other leading research institutions. It is intended for researchers, practitioners, educators, and graduate students in the fields of Computer Science, Information Technology, and other inter-disciplinary fields. This book describes some of the pioneering research results that form a theoretical foundation for dealing with advanced issues related to data quality. In writing this book, our goal was to provide an overview of the accumulated research results from the MIT TDQM research perspective as it relates to database research. As such, it can be used by Ph.D. candidates who wish to further pursue their research in the data quality area and by IT professionals who wish to gain an insight into theoretical results and apply them in practice. This book also complements the authors' other well-received book Quality Information and Knowledge, which deals with more managerially- and practice-oriented topics (Prentice Hall, 1999), and the book Journey to Data Quality: A Roadmap for Higher Productivity (in preparation).

The book is organized as follows. We first introduce fundamental concepts and a framework for total data quality management (TDQM) that encompasses data quality definition, measurement, analysis, and improvement. We then present the database-oriented work from the TDQM program, work that is focused on the data quality area in the database field. Next, we profile research projects from leading research institutions. These projects broaden the reader's perspective and offer the reader a snapshot of some of the cutting-edge research projects. Finally, concluding remarks and future directions are presented.

We thank Cambridge Research Group for sponsoring this project and providing a productive work environment. We are indebted to Professor Stuart Madnick at MIT for his continuous, unreserved support and encouragement. The support of Dean Louis Lataif, Senior Associate Dean Michael Lawson, and Professor John Henderson at Boston University's School of Management is appreciated. Also appreciated is the help and support from Dean John Brennan and Professors Warren
Briggs, Jonathan Frank, and Mohamed Zatet at Suffolk University. In addition, we appreciate the support from Dean Ira Weiss and Senior Associate Dean Jim Molloy at Northeastern University's College of Business Administration. We thank John F. Maglio and Tom J. Maglio for their assistance in fulfilling many of the book's requirements such as the index and format. Thanks are also due to our colleagues and researchers whose work is included in this book, including Don Ballou, Ahmed Elmagarmid, Chris Firth, Jesse Jacobson, Yeona Jang, Henry Kon, Matthias Jarke, Harry Pazer, Stuart Madnick, Leo Pipino, M. P. Reddy, Giri Kumar Tayi, Diane Strong, Veda Storey, Yannis Vassiliou, Vassilios Verykios, and Yair Wand.

We must also thank the many practitioners who have made significant contributions to our intellectual life. Although it is not possible to list all of them, we would like to express our gratitude to Jim Funk at S.C. Johnson, Kuan-Tsae Huang at TASKCo.com, Ward Page at DARPA's Command Post of the Future program, Sang Park at Hyundai Electronics, Brent Jansen at IBM, Jim McGill at Morgan Stanley Dean Witter, Mel Bergstein, Alan Matsumura, and Jim McGee at Diamond Technology Partners, Jim Reardon at the Under-Secretary's Office of the Defense Department for Human Affairs, Gary Ropp and David Dickason at U.S. Navy Personnel Research, Studies, and Technology, Terry Lofton and Frank Millard at Georgia State's Public Health Division, Bruce Davidson and Alein Chun at Cedars Sinai Medical Center, Frank Davis and Joe Zurawski at Firstlogic, Madhavan Nayar at Unitech Systems Inc., and Les Sowa and Beam Suffolk at Dow Chemical. In addition, we thank D.J. Chang at Dong-Nam Corporation, Kyung-Sup Lee at Korean Airlines, Dwight Scott and Karen O'Lessker at Blue Cross Blue Shield, Kary Burin at Hewlett Packard, Don Hinman and John Talburt at Axiom Corporation, Peter Kaomea at Citicorp, Lu-Yue Huang at Edgewave, Col. Lane Ongstad of the U.S. Air Force Surgeon General's Office, Lt. Col. Dave Corey of the U.S. Army Medical Command, Steve Hemsley, Dane Iverson and Raissa Katz-Haas at United Health Group, Sue Bies at the First Tennessee National Corporation, and Rita Kovac at Concept Shopping.

We are grateful to the contributing authors and publishers for granting us permission to reprint their materials in this book. Specifically, Chapter 1 is based, in part, on the paper, A Product Perspective on Total Data Quality Management, authored by Richard Wang and published in the Communications of the ACM in 1998. Chapter 2 is based, in part, on the paper, A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective, co-authored by Richard Wang and Stuart Madnick and published in the Proceedings of the International Conference on Very Large Data Bases in 1990. It is also based on the paper, reprinted from Decision Support Systems, 13, Richard Wang, M. P. Reddy, and Henry Kon, Toward quality data: An attribute-based approach, 349-372, 1995, with permission from Elsevier Science. Chapter 3 has been contributed by Veda Storey and Richard Wang based on the paper, An Analysis of Quality Requirements in Database Design, that they co-authored and published in the Proceedings of the 1998 Conference on Information
Quality. Chapter 4 is based, in part, on the M.I.T. CISL working paper, A Knowledge-Based Approach to Assisting In Data Quality Judgment, co-authored by Yeona Jang, Henry Kon, and Richard Wang. Chapter 5 is based, in part, on the M.I.T. TDQM working paper, A Data Quality Algebra for Estimating Query Result Quality. The MIT Context Interchange Project in Chapter 6 has been contributed by Stuart Madnick based on the paper, Metadata Jones and the Tower of Babel: The Challenge of Large-Scale Semantic Heterogeneity, published in the Proceedings of the 1999 IEEE Meta-Data Conference. The European Union Data Warehouse Quality Project in Chapter 7 has been contributed by Matthias Jarke and Yannis Vassiliou based on their paper "Data Warehouse Quality: A Review of the DWQ Project", published in the Proceedings of the 1997 Conference on Information Quality. The Purdue University Data Quality Project in Chapter 8 has been written by Vassilios S. Verykios under the direction of Ahmed Elmagarmid. It profiles one aspect of the ongoing Purdue University data quality initiative. Originally, additional chapters based on the work conducted at Telcordia (formerly Bellcore), George Mason University, The University of Queensland, Australia, the University of St. Gallen in Switzerland, Universidad de Buenos Aires in Argentina, and other institutions were planned. Unfortunately, we were unable to incorporate them into the current edition, but we certainly hope to do so in a future publication.

We appreciate Elizabeth Ziad for her love, support, and understanding; our children, Fori Wang, and Leyla, Ahmed, and Nasim Ziad, who bring so much joy and happiness to our lives; as well as Abdallah, Abdelkrim, Ali, and the Ziad family. In addition, we thank A. Russell and Arlene Lucid, Dris Djermoun, Kamel Youcef-Toumi, Hassan Raffa, Said Naili, Youcef Bennour, Boualem Kezim, Mohamed Gouali, Ahmed Sidi Yekhlef, Youcef Boudeffa, and Nacim Zeghlache for their help, support, and generosity. We also thank Jack and Evelyn Putman, and John, Mary Lou, Patrick, and Helen McCarthy for their love and support. Last, but not least, we would like to thank our parents who instilled in us the love of learning.

Richard Y. Wang
Boston University, Boston, Massachusetts & MIT TDQM program (http://web.mit.edu/tdqm/)
[email protected]

Mostapha Ziad
Suffolk University, Boston, Massachusetts
[email protected]

Yang W. Lee
Northeastern University, Boston, Massachusetts
[email protected]
CHAPTER 1

Introduction
Many important corporate initiatives, such as Business-to-Business commerce, integrated Supply Chain Management, and Enterprise Resource Planning, are at risk of failure unless data quality is seriously considered and improved. In one study of a major manufacturer, it was found that 70% of all orders had errors. Aspects of this problem have even gained the attention of the Wall Street Journal, which reported that "A major financial institution was embarrassed because of a wrong data entry of an execution order of $500 million...", "Some Northeast states are filing multi-million dollar suits against TRW for bad credit data...", and "Mortgage companies miscalculated home owners' adjustable rate monthly mortgage payments, totaling multi-billion dollars..." The major news networks have also reported that many patients died or became seriously ill because of prescription data errors, prompting the Clinton Administration to issue an executive order to address the problem. Anecdotes of high-stake data quality problems abound in the academic literature and news articles.
The field of data quality has witnessed significant advances over the last decade. Today, researchers and practitioners have moved beyond establishing data quality as a field to resolving data quality problems, which range from data quality definition, measurement, analysis, and improvement to tools, methods, and processes. As a result, numerous data quality resources are now available for the reader to utilize [10, 29]. Many professional books [11, 12, 19, 24, 26], journal articles, and conference proceedings have also been produced. Journal articles have been published in fields ranging from Accounting [6, 17, 21, 36] to Management Science [1, 5] to Management Information Systems [2, 25, 35] to Computer Science [3, 4, 13, 16, 18, 20, 23, 27, 31, 34] and Statistics [7, 14, 22]. For example, Communications of the ACM, a premier publication of the Association for Computing Machinery, featured a special section entitled "Examining Data Quality" [28].

Not-for-profit academic conferences and commercial conferences on data quality are part of today's data quality landscape. Back in 1996, the Total Data Quality Management (TDQM) program at the Massachusetts Institute of Technology pioneered the first academic conference for exchanging research ideas and results between researchers and practitioners. The overwhelmingly positive feedback from the conference participants led the organizers to hold the conference every year since. Today, it is a well-established tradition to hold the MIT Conference on Information Quality annually. Additionally, many information systems and computer science conferences now feature tracks or tutorials on data quality. Commercial conferences include those organized by Technology Transfer Institute [30] and others. These commercial conferences typically feature invited speakers as well as vendor and consultant workshops. In short, the field of data quality has evolved to such a degree that an exposition of research and practice in the field is both timely and valuable.
FUNDAMENTAL CONCEPTS

Before we present the theme of this book, we first introduce some fundamental concepts that we have found to be useful to researchers and practitioners alike.
Data vs. Information

Data and information are often used synonymously in the literature. In practice, managers intuitively differentiate between information and data, and describe information as data that have been processed. Unless specified otherwise, this book will use the term data interchangeably with the term information.
Product vs. Information Manufacturing

An analogy exists between quality issues in product manufacturing and those in information manufacturing, as shown in Table 1.1. Product manufacturing can be viewed as a processing system that acts on raw materials to produce physical products. Analogously, information manufacturing can be viewed as a system acting on raw data to produce information products. The field of product manufacturing has an extensive body of literature on Total Quality Management (TQM) with principles, guidelines, and techniques for product quality. Based on TQM, knowledge has been created for data quality practice.

Table 1.1: Product vs. Information Manufacturing
(Source: IEEE Transactions on Knowledge and Data Engineering [34])
An organization would follow certain guidelines to scope a data quality project, identify critical issues, and develop procedures and metrics for continuous analysis and improvement. Although pragmatic, these approaches have limitations that arise from the nature of raw materials, namely data, used in information manufacturing. Whereas data can be utilized by multiple consumers and not depleted, a physical raw material can only be used for a single physical product. Another dissimilarity arises with regard to the intrinsic property of timeliness of data. One could say that a raw material (e.g., copper) arrived just in time (in a timely fashion); however, one would not ascribe an intrinsic property of timeliness to the raw material. Other dimensions such as the believability of data simply do not have a counterpart in product manufacturing [5].
The Information Manufacturing System

We refer to an information manufacturing system as a system that produces information products. The concept of an information product emphasizes the fact that the information output from an information manufacturing system has value transferable to the consumer, either internal or external. We define three roles (called the three C's) in an information manufacturing system:

1. Information collectors are those who create, collect, or supply data for the information product.
2. Information custodians are those who design, develop, or maintain the data and systems infrastructure for the information product.
3. Information consumers are those who use the information product in their work.

In addition, we define information product managers as those who are responsible for managing the entire information product production process and the information product life cycle. Each of the three C's is associated with a process or task:

• Collectors are associated with data-production processes
• Custodians with data storage, maintenance, and security
• Consumers with data-utilization processes, which may involve additional data aggregation and integration

We illustrate these roles with a financial company's client account database. A broker who creates accounts and executes transactions has to collect, from his clients, the information necessary for opening accounts and executing these transactions, and thus is a collector. An information systems professional who designs, develops, produces, or maintains the system is a custodian. A financial controller or a client representative who uses the information system is a consumer. Finally, a manager who is responsible for the collection, manufacturing, and delivery of customer account data is an information product manager.
Information Quality is a Multi-Dimensional Concept

Just as a physical product (e.g., a car) has quality dimensions associated with it, an information product also has information quality dimensions. Information quality has been viewed as fitness for use by information consumers, with four information quality categories and fifteen dimensions identified [35]. As shown in Table 1.3, intrinsic information quality captures the fact that information has quality in its own right. Accuracy is merely one of the four dimensions underlying this category. Contextual information quality highlights the requirement that information quality must be considered within the context of the task at hand. Representational and accessibility information quality emphasize the importance of the role of information systems.

Table 1.3: Information Quality Categories and Dimensions

Information Quality Category            Information Quality Dimensions
Intrinsic Information Quality           Accuracy, Objectivity, Believability, Reputation
Accessibility Information Quality       Accessibility, Access security
Contextual Information Quality          Relevancy, Value-Added, Timeliness, Completeness, Amount of information
Representational Information Quality    Interpretability, Ease of understanding, Concise representation, Consistent representation, Ease of manipulation

(Source: Journal of Management Information Systems [35])
The TDQM Cycle

Defining, measuring, analyzing, and continuously improving information quality is essential to ensuring high-quality information products. In the TQM literature, the widely practiced Deming cycle for quality enhancement consists of: Plan, Do, Check, and Act. By adapting the Deming cycle to information manufacturing, we have developed the TDQM cycle [32], which is illustrated in Figure 1.1. The definition component of the TDQM cycle identifies important data quality dimensions [35] and the corresponding data quality requirements. The measurement component produces data quality metrics. The analysis component identifies root causes for data quality problems and calculates the impacts of poor quality information. Finally, the improvement component provides techniques for improving data quality. These components are applied along data quality dimensions according to requirements specified by the consumer.

Figure 1.1: The TDQM Cycle (Source: Communications of the ACM [32])
A FRAMEWORK FOR TDQM

In applying the TDQM framework, an organization must: (1) clearly articulate the information product (IP) in business terms; (2) establish an IP team consisting of a senior manager as a champion, an IP engineer who is familiar with the TDQM framework, and members from information collectors, custodians, consumers, and IP managers; (3) teach data quality assessment and data quality management skills to all the IP constituents; and (4) institutionalize continuous IP improvement. A schematic of the TDQM framework is shown in Figure 1.2. The tasks embedded in this framework are performed in an iterative manner. For example, an IP developed in the past may not fit today's business needs for private client services. This should have been identified in the continuous TDQM cycle. If not, it would be the IP team's responsibility to ensure that this need is met at a later phase; otherwise this IP will not be fit for use by the private client representatives.
Figure 1.2: A Schematic of the TDQM Framework
(Source: Communications of the ACM [32])
In applying the TDQM framework, one must first define the characteristics of the IP, assess the IP's data quality requirements, and identify the information manufacturing system for the IP [5]. This can be challenging for organizations that are not familiar with the TDQM framework. Our experience has shown, however, that after the previously mentioned tasks have been performed a number of times and the underlying concepts and mechanisms have been understood, it becomes relatively easy to repeat the work. Once these tasks are accomplished, those for measurement, analysis, and improvement can ensue.
Define IP

The characteristics of an IP are defined at two levels. At a higher level, the IP is conceptualized in terms of its functionalities for information consumers, just as when we define what constitutes a car it is useful to first focus on the basic functionalities and leave out advanced capabilities (e.g., optional features for a car such as A/C, stereo, or cruise control). Continuing with the client account database example, the functionalities are the customer information needed by information consumers to perform the tasks at hand. The characteristics for the client account database include items such as account number and stock transactions. In an iterative manner, the functionalities and consumers of the system are identified. The consumers include brokers, client representatives, financial controllers, accountants, and corporate lawyers (for regulatory compliance). Their perceptions of what constitutes important data quality dimensions need to be captured and reconciled.

At a lower level, one would identify the IP's basic units and components, along with their relationships. Defining what constitutes a basic unit for an IP is critical as it dictates the way the IP is produced, utilized, and managed. In the client account database, a basic unit would be an ungrouped client account. In practice, it is often necessary to group basic units together (e.g., eggs are packaged and sold by the dozen). A mutual-fund manager would trade stocks on behalf of many clients, necessitating group accounts; top management would want to know how much business the firm has with a client that has subsidiaries in Europe, the Far East, and Australia. Thus, careful management of the relationship between basic accounts and aggregated accounts, and of the corresponding processes that perform the mappings, is critical because of their business impacts. Components of the database and their relationships can be represented as an entity-relationship (ER) model. In the client account database, a client is identified by an account number. Company-stock is identified by the company's ticker symbol. When a client makes a trade (buy/sell), the date, quantity of shares, and trade price are stored as a record of the transaction. An ER diagram of the client account database is shown in Figure 1.3.
Figure 1.3: A Client Account Schema
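The schema just described is shown only as a diagram; as a concrete aid, the following is a minimal Python sketch of the same structure. It is our own illustration, not code from the book, and the field names are assumed for the example.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Client:
        account_number: str      # identifies the client entity
        name: str
        telephone: str

    @dataclass
    class CompanyStock:
        ticker_symbol: str       # identifies the company-stock entity
        company_name: str

    @dataclass
    class Trade:
        account_number: str      # references Client
        ticker_symbol: str       # references CompanyStock
        trade_date: date
        quantity: int            # number of shares traded
        price: float             # trade price per share
        buy_or_sell: str         # "buy" or "sell"

A relational design would carry the same information as three tables, with Trade holding foreign keys to Client and CompanyStock.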
With the characteristics of the IP specified, the next step is to identify data quality requirements from the perspectives of IP suppliers, manufacturers, consumers, and managers. We have developed an instrument for data quality assessment and corresponding software to facilitate the data quality assessment task. After data are collected from information collectors, custodians, consumers, and IP managers and entered into the survey database, the data quality assessment software tool performs the query necessary for mapping the item values in the surveys to the underlying data quality dimensions [9]. Figure 1.4 illustrates the capabilities of the software tool.
Figure 1.4: Dimensional Assessment of Data Quality Importance Across Roles
As can be seen from Figure 1.4, the result from the first dimension indicates that the custodian believes the IP to be largely free of error (score 7 on a scale from 0 to 10 with 10 being completely free of error), whereas the consumer does not think so (with a score of 4). Both the custodian and the consumer indicate that the IP contains data that are objective and relatively important (score 7). The biggest contrast shows up for Dimension 8, completeness. Although the custodian assesses the IP to have complete data (score 7.6), the consumer thinks otherwise (score 1)! From the IP characteristics and the data quality assessment results, the corresponding logical and physical design of the IP can be developed with the necessary quality attributes incorporated. Timeliness and credibility are two important data quality dimensions for an IP supporting trading operations. In Figure 1.5, timeliness on share price indicates that the trader is concerned with how old the data is. A special symbol, “√ inspection” is used to signify inspection requirements such as data verification.
Figure 1.5: Information Quality added to the ER Diagram
(Source: Proceedings - 9th International Conference on Data Engineering [33])
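The cross-role comparison read off Figure 1.4 can also be computed directly from the survey scores. The sketch below is a minimal illustration; the dimension names and numbers echo the discussion above and are not the actual output of the assessment tool.

    # Assessment scores on a 0-10 scale, keyed by data quality dimension,
    # for two of the roles discussed above (values are illustrative).
    custodian = {"free-of-error": 7.0, "objectivity": 7.0, "completeness": 7.6}
    consumer = {"free-of-error": 4.0, "objectivity": 7.0, "completeness": 1.0}

    # Rank dimensions by the custodian/consumer gap; large gaps flag the
    # dimensions where producers and consumers of the IP disagree the most.
    gaps = {dim: custodian[dim] - consumer[dim] for dim in custodian}
    for dim, gap in sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True):
        print(f"{dim:14s} custodian={custodian[dim]:.1f} "
              f"consumer={consumer[dim]:.1f} gap={gap:+.1f}")

Run on these numbers, completeness surfaces first, which is exactly the contrast highlighted in the text.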
The data quality requirements are further refined into more objective and measurable characteristics. As shown in Figure 1.6, these characteristics are depicted as dotted rectangles. For example, timeliness is redefined by age (of the data), and credibility of the research report is redefined by analyst name. The quality indicator collection method, associated with the telephone attribute, is included to illustrate that multiple data collection mechanisms can be used for a given type of data. Values for the collection method may include "over the phone" or "from an existing account".
The quality indicator media for the research report is used to indicate the multiple formats of database-stored documents, such as bit-mapped, ASCII, or Postscript. The quality indicators derived from "√ inspection" indicate the inspection mechanism desired to maintain data reliability. The specific inspection or control procedures may be identified as part of the application documentation. These procedures might include independent double entry of important data, front-end rules to enforce domain or update constraints, or manual processes for performing certification of the data.
Figure 1.6: A Quality Entity-Relationship Diagram
(Source: Proceedings - 9th International Conference on Data Engineering [33])
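One simple way to picture the quality indicators of Figure 1.6 is as metadata attached to individual attribute values. The sketch below is our own rendering, not the book's implementation; the indicator names follow the examples in the text (age, collection method, media, analyst).

    from dataclasses import dataclass, field
    from typing import Any, Dict

    @dataclass
    class QualityAttribute:
        """An attribute value together with its quality indicators."""
        value: Any
        indicators: Dict[str, Any] = field(default_factory=dict)

    # Share price carries an age indicator; the telephone number records how it
    # was collected; the research report records its media format and analyst.
    share_price = QualityAttribute(42.17, {"age_seconds": 15})
    telephone = QualityAttribute("+1-617-555-0100", {"collection_method": "over the phone"})
    report = QualityAttribute("report.ps", {"media": "Postscript", "analyst": "J. Smith"})

    def is_timely(attr: QualityAttribute, max_age_seconds: int) -> bool:
        """Check a timeliness requirement against the age indicator."""
        return attr.indicators.get("age_seconds", float("inf")) <= max_age_seconds

    print(is_timely(share_price, max_age_seconds=60))   # True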
Equally important to the task of identifying data quality dimensions is the identification of the information manufacturing system for the IP. Figure 1.7 illustrates an information manufacturing system which has five data units (DU1 - DU5) supplied by three vendors (VB1 - VB3). Three data units (DU6, DU8, DU10) are formed by having passed through one of the three data quality blocks (QB1 - QB3). For example, DU6 represents the impact of QB1 on DU2 [5].
Figure 1.7: An Illustrative Information Manufacturing System
(Source: Management Science [5])
There are six processing blocks (PB1 - PB6) and, accordingly, six data units (DU7, DU9, DU11, DU12, DU13, DU14) that are the output of these processing blocks. One storage block (SB1) is used both as a pass-through block (DU6 enters SB1 and is passed on to PB3) and as the source for database processing (DU1 and DU8 are jointly processed by PB4). The system has three consumers (CB1 - CB3), each of which receives some subset of the IP. The placement of a quality block following a vendor block (similar to acceptance sampling) indicates that the data supplied by vendors is deficient with regard to data quality. In the client account database example, identifying such an information manufacturing system would provide the IP team with the basis for assessing the values of data quality dimensions for the IP through the Information Manufacturing Analysis Matrix and for studying options in analyzing and improving the information manufacturing system [5].

The IP definition phase produces two key results: (1) a quality entity-relationship model that defines the IP and its data quality requirements, and (2) an information manufacturing system that describes how the IP is produced, along with the interactions among information suppliers (vendors), manufacturers, consumers, and IP managers.

With these results from the IP definition phase, an organization has two alternatives. First, the organization can develop a new information manufacturing system for the IP based on these results. The advantage of this approach is that many data quality requirements can be designed into the new information manufacturing system, resulting in quality-information-by-design analogous to that of quality-by-design in product manufacturing. Many of the data quality problems associated with a legacy system can also be corrected with the new system. The disadvantage is that it would require more initial investment and significant organizational change. Second, the organization can use these results as guidelines for developing mechanisms to remedy the deficiencies of the existing system. However, as the business environment changes, and with it the information needs of the consumers, a new information manufacturing system will ultimately need to be developed.
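A system like the one in Figure 1.7 can also be captured as a small directed graph of blocks and data units, which makes it easy to trace which vendors and quality blocks contributed to a given data unit. The fragment below encodes only the DU2/QB1/DU6 path described above; which vendor supplies DU2 and which processing block produces DU7 are not stated in the text, so that wiring is assumed for illustration.

    # Each entry maps a data unit to (producing block, input data units).
    flow = {
        "DU2": ("VB1", []),        # supplied by a vendor block (assumed VB1)
        "DU6": ("QB1", ["DU2"]),   # DU6 represents the impact of QB1 on DU2
        "DU7": ("PB3", ["DU6"]),   # assumed, for illustration, to be PB3's output
    }

    def upstream(data_unit: str) -> set:
        """Every block that contributed, directly or indirectly, to a data unit."""
        block, inputs = flow[data_unit]
        blocks = {block}
        for du in inputs:
            blocks |= upstream(du)
        return blocks

    print(sorted(upstream("DU7")))   # ['PB3', 'QB1', 'VB1']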
Measure IP

The key to measurement resides in the development of data quality metrics. These data quality metrics can be the basic data quality measures such as data accuracy, timeliness, completeness, and consistency [1]. In our client account database example, data quality metrics may be designed to track, for instance:

• The percentage of incorrect client address zip codes found in randomly selected client accounts (inaccuracy)
• An indicator of when client account data were last updated (timeliness or currency for database marketing and regulatory purposes)
• The percentage of non-existent accounts or the number of accounts with a missing value in the industry-code field (incompleteness)
• The number of records that violate referential integrity (consistency)

At a more complex level, there are business rules that need to be observed. For example, the total risk exposure of a client should not exceed a certain limit. This exposure needs to be monitored for clients who have many accounts. Conversely, a client who has a very conservative position in one account should be allowed to execute riskier transactions in another account. For these business rules to work, however, the IP team needs to develop a proper account linking method and the associated ontology to make the linkage.

There are also information-manufacturing-oriented data quality metrics. For example, the IP team may want to track:

• Which department made most of the updates in the system last week
• How many unauthorized accesses have been attempted (security)
• Who collected the raw data for a client account (credibility)
Other data quality metrics may measure the distribution of the data quality-related collective knowledge across IP roles. Whatever the nature of the data quality metrics, they are implemented as part of a new information manufacturing system or as add-on utility routines in an existing system. With the data quality metrics, data quality measures can be obtained along various data quality dimensions for analysis.
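To make the simpler metrics concrete, the following sketch computes an incompleteness and a consistency measure over a few illustrative records; the field names, sample values, and rules are invented for the example and are not taken from the book.

    accounts = [
        {"account_number": "A100", "zip": "02139", "industry_code": "6021"},
        {"account_number": "A101", "zip": "0213", "industry_code": None},   # missing code
        {"account_number": "A102", "zip": "10005", "industry_code": "6022"},
    ]
    trades = [
        {"account_number": "A100", "ticker": "IBM", "quantity": 100},
        {"account_number": "A999", "ticker": "IBM", "quantity": 50},   # no such account
    ]

    # Incompleteness: share of accounts with a missing industry-code value.
    incompleteness = sum(a["industry_code"] is None for a in accounts) / len(accounts)

    # Consistency: number of trades violating referential integrity on account_number.
    known = {a["account_number"] for a in accounts}
    violations = sum(t["account_number"] not in known for t in trades)

    print(f"incompleteness = {incompleteness:.2f}, referential violations = {violations}")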
Analyze IP

From the measurement results, the IP team investigates the root causes of potential data quality problems. The methods and tools for performing this task can be simple or complex. In the client account database example, one can introduce dummy accounts into the information manufacturing system to identify sources that cause poor data quality. Other methods include statistical process control (SPC), pattern recognition, and Pareto chart analysis for poor data quality dimensions over time.

We illustrate other types of analysis through the case of the Medical Command of the Department of Defense, which developed data quality metrics for the information in their Military Treatment Facilities (MTF). In this particular case [8], an analysis of the assumptions and rationale underlying the data quality metrics was conducted. Some of the issues raised were:

• What are the targeted payoffs?
• How do the data quality metrics link to the factors that are critical to the target payoffs?
• How representative or comprehensive are these data quality metrics?
• Are these data quality metrics the right set of metrics?

The target payoffs could be two-fold: (1) the delivery of ever-improving value to patients and other stakeholders, contributing to improved health care quality; and (2) improvement of overall organizational effectiveness, use of resources, and capabilities. It would be important to explicitly articulate the scope of these metrics in terms of the categories of payoffs and their linkages to the critical factors. To provide the best health care for the lowest cost, different types of data are needed: MTF commanders need cost and performance data, Managed Care Support contractors need to measure the quality and cost of their services, and patients need data they can use to know what kind of services they get from different health plans. The types of data needed can fall into several categories: patient, provider, type of care, use rate, outcome, and financial. Based on the targeted payoffs, the critical factors, and the corresponding types of data needed, one can evaluate how representative or comprehensive these data quality metrics are and whether these metrics are the right set of metrics.
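As a small illustration of the Pareto-style analysis mentioned above, the sketch below ranks error counts by root cause and reports each cause's cumulative share; the causes and counts are hypothetical.

    from itertools import accumulate

    # Hypothetical error counts attributed to root causes in one reporting period.
    errors_by_cause = {
        "manual data entry": 120,
        "vendor feed format change": 45,
        "stale reference table": 20,
        "duplicate account creation": 15,
    }

    total = sum(errors_by_cause.values())
    ranked = sorted(errors_by_cause.items(), key=lambda kv: kv[1], reverse=True)
    for (cause, count), cum in zip(ranked, accumulate(c for _, c in ranked)):
        print(f"{cause:28s} {count:4d}  cumulative {cum / total:5.1%}")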
Improve IP

Once the analysis phase is complete, the IP improvement phase can start. The IP team needs to identify key areas for improvement such as: (1) aligning information flow and workflow with infrastructure, and (2) realigning the key characteristics of the IP with business needs. As mentioned earlier, the Information Manufacturing Analysis Matrix [5] is designed for the above purposes. Also, Ballou and Tayi [3] have developed a framework for allocating resources for data quality improvement. Specifically, an integer programming model is used to determine which databases should be chosen to maximize data quality improvement given a constrained budget.
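The Ballou and Tayi model itself is not reproduced here; the sketch below is only a schematic stand-in that captures the flavor of the decision: choose the subset of databases whose estimated quality improvement is largest without exceeding the budget. A brute-force 0-1 selection is enough for a handful of candidates, and the names and figures are invented.

    from itertools import combinations

    # (database, estimated quality improvement, cleanup cost): illustrative figures.
    candidates = [
        ("client accounts", 8.0, 50),
        ("trades", 5.5, 30),
        ("research reports", 3.0, 25),
    ]
    budget = 60

    best_subset, best_gain = (), 0.0
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            cost = sum(c for _, _, c in subset)
            gain = sum(g for _, g, _ in subset)
            if cost <= budget and gain > best_gain:
                best_subset, best_gain = subset, gain

    print([name for name, _, _ in best_subset], best_gain)   # ['trades', 'research reports'] 8.5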
SUMMARY

In this chapter, we presented concepts and principles for defining, measuring, analyzing, and improving information products (IP), and briefly described a data quality survey software instrument for data quality assessment. We also introduced the Total Data Quality Management (TDQM) framework and illustrated how it can be applied in practice. The power of the TDQM framework stems from a cumulative multidisciplinary research effort and practice in a wide range of organizations. Fundamental to this framework is the premise that organizations must treat information as a product that moves through an information manufacturing system, much like a physical product moves through a manufacturing system, yet realize the distinctive nature of information products. The TDQM framework has been shown to be particularly effective in improving IP in organizations where top management has a strong commitment to a data quality policy. Organizations of the 21st century must harness the full potential of their data in order to gain a competitive advantage and attain their strategic goals. The TDQM framework has been developed as a step towards meeting this challenge.
BOOK ORGANIZATION

The remainder of the book is organized as follows. Chapter 2 presents two seminal research efforts, the Polygen Model and the Attribute-based Model, which capture data quality characteristics by extending the Relational Model. Chapter 3 exhibits a research effort that models data quality requirements by extending the Entity Relationship Model. Chapter 4 introduces a knowledge-based model that provides an overall measure that can be useful in automating the judgment of the quality of data in a complex system that uses data from various sources with varying degrees of quality. Chapter 5 describes an attempt to develop a data quality algebra that can be embedded in an extended Relational Model. These chapters summarize our technically oriented work in the database field within the context of data quality.

Other leading research institutions have also been working on issues related to data quality within the framework of database systems. Chapter 6 profiles the MIT Context Interchange project. The European Union's project on data warehouse quality is presented in Chapter 7. In Chapter 8, we profile the data quality project at Purdue University's database group. Finally, concluding remarks and future directions are presented in Chapter 9.
References

[1] Ballou, D.P. and H.L. Pazer, Modeling Data and Process Quality in Multi-input, Multi-output Information Systems. Management Science, 31(2), 1985, 150-162.
[2] Ballou, D.P. and H.L. Pazer, Designing Information Systems to Optimize the Accuracy-Timeliness Tradeoff. Information Systems Research, 6(1), 1995, 51-72.
[3] Ballou, D.P. and G.K. Tayi, Methodology for Allocating Resources for Data Quality Enhancement. Communications of the ACM, 32(3), 1989, 320-329.
[4] Ballou, D.P. and G.K. Tayi, Enhancing Data Quality in Data Warehouse Environment. Communications of the ACM, 42(1), 1999, 73-78.
[5] Ballou, D.P., R.Y. Wang, H. Pazer and G.K. Tayi, Modeling Information Manufacturing Systems to Determine Information Product Quality. Management Science, 44(4), 1998, 462-484.
[6] Bowen, P. (1993). Managing Data Quality in Accounting Information Systems: A Stochastic Clearing System Approach. Unpublished Ph.D. dissertation, University of Tennessee.
[7] CEIS. Commander's Data Quality Assessment Guide for CEIS Beta Test - Region 3. Proceedings of Corporate Executive Information System. 1216 Stanley Road, Fort Sam Houston, Texas 78234-6127, 1996.
[8] Corey, D., L. Cobler, K. Haynes and R. Walker. Data Quality Assurance Activities in the Military Health Services System. Proceedings of the 1996 Conference on Information Quality. Cambridge, MA, pp. 127-153, 1996.
[9] CRG, Information Quality Assessment (IQA) Software Tool. Cambridge Research Group, Cambridge, MA, 1997.
[10] DQJ (1999). Data Quality Journal.
[11] English, L.P., Improving Data Warehouse and Business Information Quality. John Wiley & Sons, New York, NY, 1999.
[12] Huang, K., Y. Lee and R. Wang, Quality Information and Knowledge. Prentice Hall, Upper Saddle River, NJ, 1999.
[13] Huh, Y.U., F.R. Keller, T.C. Redman and A.R. Watkins, Data Quality. Information and Software Technology, 32(8), 1990, 559-565.
[14] Huh, Y.U., R.W. Pautke and T.C. Redman. Data Quality Control. Proceedings of ISQE, Juran Institute 7A1-27, 1992.
[15] ICM (1999). ICM Group, Inc.
[16] Kahn, B.K., D.M. Strong and R.Y. Wang (1999). Information Quality Benchmarks: Product and Service Performance. Communications of the ACM, Forthcoming.
[17] Khang, S. W. (1990). An Accounting Model-Based Approach Toward Integration of Multiple Financial Databases (No. WP # CIS-90-15). Sloan School of Management, MIT.
[18] Laudon, K.C., Data Quality and Due Process in Large Interorganizational Record Systems. Communications of the ACM, 29(1), 1986, 4-11.
[19] Liepins, G.E. and V.R.R. Uppuluri, ed. Data Quality Control: Theory and Pragmatics. D. B. Owen. Vol. 112. 1990, Marcel Dekker, Inc., New York, 360 pages.
[20] Morey, R.C., Estimating and Improving the Quality of Information in the MIS. Communications of the ACM, 25(5), 1982, 337-342.
[21] Nichols, D.R., A Model of auditor's preliminary evaluations of internal control from audit data. 62(1), 1987, 183-190.
[22] Pautke, R.W. and T.C. Redman. Techniques to control and improve quality of data in large databases. Proceedings of Statistics Canada Symposium 90, Canada, pp. 319-333, 1990.
[23] Reddy, M.P. and R.Y. Wang. Estimating Data Accuracy in a Federated Database Environment. Proceedings of 6th International Conference, CISMOD (Also in Lecture Notes in Computer Science). Bombay, India, pp. 115-134, 1995.
[24] Redman, T.C., ed. Data Quality: Management and Technology. 1992, Bantam Books, New York, NY. 308 pages.
[25] Redman, T.C., Improve Data Quality for Competitive Advantage. Sloan Management Review, 36(2), 1995, 99-109.
[26] Redman, T.C., ed. Data Quality for the Information Age. 1996, Artech House, Boston, MA. 303 pages.
[27] Strong, D.M., Y.W. Lee and R.Y. Wang, Data Quality in Context. Communications of the ACM, 40(5), 1997, 103-110.
[28] Tayi, G.K. and D. Ballou, Examining Data Quality. Communications of the ACM, 41(2), 1998, 54-57.
[29] TDQM (1999). Total Data Quality Management Research Program.
[30] TTI (1999). Technology Transfer Institute, Inc. Website.
[31] Wand, Y. and R.Y. Wang, Anchoring Data Quality Dimensions in Ontological Foundations. Communications of the ACM, 39(11), 1996, 86-95.
[32] Wang, R.Y., A Product Perspective on Total Data Quality Management. Communications of the ACM, 41(2), 1998, 58-65.
[33] Wang, R.Y., H.B. Kon and S.E. Madnick. Data Quality Requirements Analysis and Modeling. Proceedings of the 9th International Conference on Data Engineering. Vienna, pp. 670-677, 1993.
[34] Wang, R.Y., V.C. Storey and C.P. Firth, A Framework for Analysis of Data Quality Research. IEEE Transactions on Knowledge and Data Engineering, 7(4), 1995, 623-640.
[35] Wang, R.Y. and D.M. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4), 1996, 5-34.
[36] Yu, S. and J. Neter, A Stochastic Model of the Internal Control System. Journal of Accounting Research, 1(3), 1973, 273-295.
CHAPTER 2

Extending the Relational Model to Capture Data Quality Attributes
Current commercial relational database management systems (e.g., IBM's DB2, Oracle's Oracle DBMS, and Microsoft's SQL Server) and their underlying relational model are based on the assumption that data stored in the databases are correct. This assumption, however, has some nontrivial ramifications. Consider a join operation in a SQL query. If the data used in the join are incorrect, the query results will most likely be incorrect as well. To what extent the query results would be incorrect, and what their impact would be, remains an open question, although researchers have started to investigate such issues [1, 22]. The fundamental assumption of the relational model, that "data stored in the underlying databases are correct," is not without merit. To ensure data integrity, the relational model has facilities such as data dictionaries, integrity rules, and edit checks. In practice, however, dirty data pervade databases for various reasons [14, 26]. Furthermore, the scope of data quality goes beyond accuracy and integrity as conceived by many in the database community. It is well established that other aspects of data quality such as believability and timeliness are equally, if not more, important from the end-user's perspective [2, 3, 15, 24, 27].

In this chapter, we present two early attempts at extending the relational model to capture data quality attributes: the Polygen Model [31]¹ and the Attribute-based Model [25]². We analyze these models from three aspects following the relational model [8, 11, p. 181]: (1) data structure (object types), (2) data integrity (integrity rules), and (3) data manipulation (operators). The polygen model is seminal research that precipitated the downstream work at the MIT TDQM program and Context Interchange research, such as [18-21]. The idea came from the realization that end-users needed to know where the data came from (i.e., the data sources) in order to assess the quality of the data (for example, the credibility of the data source) for the tasks at hand. The attribute-based model is follow-up research that defines a formal data structure and the corresponding algebra and integrity rules to include the quality aspects of data within the framework of the relational model.

¹ Adapted from the Proceedings of the 16th International Conference on Very Large Data Bases, Richard Wang and Stuart Madnick, A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective, pp. 519-538, 1990, Brisbane, Australia.
² Reprinted from Decision Support Systems, 13, Richard Wang, M. P. Reddy, and Henry Kon, Toward quality data: An attribute-based approach, 349-372, 1995, with permission from Elsevier Science.
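To see how a single incorrect value contaminates a join result, consider the following small sketch; the tables and the error are invented for illustration.

    # A client row carries a wrong branch code ("B7" was entered where "B2" was meant).
    clients = [{"account": "A100", "branch": "B7"}]
    branches = [
        {"branch": "B2", "region": "Northeast"},
        {"branch": "B7", "region": "Midwest"},
    ]

    # A nested-loop equijoin on branch: the erroneous value silently matches the
    # wrong branch row, so the error propagates into every joined tuple.
    joined = [{**c, **b} for c in clients for b in branches if c["branch"] == b["branch"]]
    print(joined)   # [{'account': 'A100', 'branch': 'B7', 'region': 'Midwest'}]

Nothing in the join itself signals that the result is wrong, which is precisely why the quality of the inputs, and their sources, matters.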
THE POLYGEN MODEL

Heterogeneous (distributed) database systems strive to encapsulate the heterogeneity of the underlying databases, for example, producing an illusion that all data resides in a single location, which is referred to as location transparency. Indeed, a set of transparency rules for distributed database systems has been proposed [23]. In contrast, the polygen model studies heterogeneous database systems from the multiple (poly) source (gen) perspective. Most end-users wish to know the source of their data (e.g., "Source: Reuters, Thursday, August 17, 2000"). This source knowledge may be important to them for many reasons. For example, it enables them to apply their own judgment to the credibility of the data. We call this the Data Source Tagging problem. A decision maker may need to know not only the sources of data but also the intermediate sources that helped in composing the data. We call this the Intermediate Source Tagging problem. By resolving the Data Source Tagging and Intermediate Source Tagging problems, the polygen model addresses critical issues such as "where did the data come from" and "which intermediate data sources were used to derive the data." In the context of heterogeneous database systems, a data source means the underlying database in which the data reside. By the same token, an intermediate data source means an intermediate database that contributes to a final query result.
Architecture

A query processor architecture for the polygen model is depicted in Figure 2.1. Briefly, the Application Query Processor translates an end-user query into a polygen query for the Polygen Query Processor (PQP) based on the user's application schema. The PQP in turn translates the polygen query into a set of local queries based on the corresponding polygen schema, and routes them to the Local Query Processors (LQP). The details of the mapping and communication mechanisms between an LQP and its local databases are encapsulated in the LQP. To the PQP, each LQP behaves as a local relational system. Upon return from the LQPs, the retrieved data are further processed by the PQP in order to produce the desired composite data.
Figure 2.1: The Query Processor Architecture
Many critical problems need to be resolved in order to provide a seamless solution to the end-user. These problems include source tagging, query translation, schema integration [4, 13], inter-database instance matching [30], domain mapping [12, 29], and semantic reconciliation [28]. We focus on the first problem - source tagging - and make the following assumptions:

• The local schemata and the polygen schema are all based on the relational model.
• Sources are tagged after data has been retrieved from each database.
• Schema integration has been performed, and the attribute mapping information is stored in the polygen schema.
• The inter-database instance identifier mismatching problem (e.g., IBM vs. I.B.M., or social security identification number vs. employee identification number) has been resolved and the data is available for the PQP to use.
• The domain mismatch problem such as unit ($ vs. ¥), scale (in billions vs. in millions), and description interpretation ("expensive" vs. "$$$", "Chinese Cuisine" vs. "Hunan or Cantonese") has been resolved in the schema integration phase and the domain mapping information is also available to the PQP.
Polygen Data Structure

Let PA be a polygen attribute in a polygen scheme P, LS a local scheme in a local database LD, and LA a local attribute in LS. Let MA be the set of local attributes corresponding to a PA, i.e., MA = {(LD, LS, LA) | (LD, LS, LA) denotes a local attribute of the corresponding PA}. A polygen scheme P is defined as P = <(PA1, MA1), ..., (PAn, MAn)> where n is the number of attributes in P. A polygen schema is defined as a set {P1, ..., PN} of N polygen schemes. A polygen domain is defined as a set of ordered triplets. Each triplet consists of three elements: the first is a datum drawn from a simple domain in an LQP. The second is a set of LDs denoting the local databases from which the datum originates. The third is a set of LDs denoting the intermediate local databases whose data led to the selection of the datum. A polygen relation p of degree n is a finite set of time-varying n-tuples, each n-tuple having the same set of attributes drawing values from the corresponding polygen domains. A cell in a polygen relation is an ordered triplet c = (c〈d〉, c〈o〉, c〈i〉) where c〈d〉 denotes the datum portion, c〈o〉 the originating portion, and c〈i〉 the intermediate source portion. Two polygen relations are union-compatible if their corresponding attributes are defined on the same polygen domain.
Note that P contains the mapping information between a polygen scheme and the corresponding local relational schemes. In contrast, p contains the actual time-varying data and their originating sources. Occasionally, a polygen scheme and a polygen relation may be used synonymously without confusion. The data and intermediate source tags for p are updated along the way as polygen algebraic operations are performed.
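A polygen cell is simply a datum bundled with two sets of source tags. The following is a minimal Python sketch of that structure, our own rendering rather than code from the paper.

    from dataclasses import dataclass
    from typing import Any, FrozenSet

    @dataclass(frozen=True)
    class PolygenCell:
        """A cell c = (c<d>, c<o>, c<i>): datum, originating sources, intermediate sources."""
        d: Any                               # the datum drawn from a local database
        o: FrozenSet[str] = frozenset()      # originating local databases
        i: FrozenSet[str] = frozenset()      # intermediate local databases

    # A tuple of a polygen relation is then a mapping from attribute names to cells.
    t = {
        "ticker": PolygenCell("IBM", frozenset({"LD1"})),
        "price": PolygenCell(112.5, frozenset({"LD2"})),
    }
    print(sorted(t["price"].o))   # ['LD2']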
Polygen Algebra

Let attrs(p) denote the set of attributes of p. For each tuple t in a polygen relation p, let t〈d〉 denote the data portion, t〈o〉 the originating source portion, and t〈i〉 the intermediate source portion. If x ∈ attrs(p) and X = {x1, ..., xj, ..., xJ} is a sublist of attrs(p), then let p[x] be the column in p corresponding to attribute x, let p[X] be the columns in p corresponding to the sublist of attributes X, let t[x] be the cell in t corresponding to attribute x, and let t[X] be the cells in t corresponding to the sublist of attributes X. As such, p[x]〈o〉 denotes the originating source portion of the column corresponding to attribute x in polygen relation p, while t[X]〈i〉 denotes the intermediate source portion of the cells corresponding to the sublist of attributes X in tuple t. On the other hand, p[x] denotes the column corresponding to attribute x in polygen relation p inclusive of the data, originating source, and intermediate source portions, while t[X] denotes the cells corresponding to the sublist of attributes X in tuple t inclusive of the data, originating source, and intermediate source portions. The five orthogonal algebraic primitive operators [5-9, 17] in the polygen model are defined as follows:
Projection

If p is a polygen relation and X = {x1, ..., xj, ..., xJ} is a sublist of attrs(p), then
p[X] = {t' | t' = t[X] IF t ∈ p ∧ t[X]〈d〉 is unique;
t'〈d〉 = ti[X]〈d〉, t'[xj]〈o〉 = ti[xj]〈o〉 ∪ ... ∪ tk[xj]〈o〉 ∀ xj ∈ X, t'[xj]〈i〉 = ti[xj]〈i〉 ∪ ... ∪ tk[xj]〈i〉 ∀ xj ∈ X, IF ti, ..., tk ∈ p ∧ ti[X]〈d〉 = ... = tk[X]〈d〉}.

Cartesian product

If p1 and p2 are two polygen relations, then
(p1 × p2) = {t1 o t2 | t1 ∈ p1 and t2 ∈ p2, where o denotes concatenation}.
Restriction

If p is a polygen relation, x ∈ attrs(p), y ∈ attrs(p), and θ is a binary relation, then
p[x θ y] = {t' | t'〈d〉 = t〈d〉, t'〈o〉 = t〈o〉, t'[w]〈i〉 = t[w]〈i〉 ∪ t[x]〈o〉 ∪ t[y]〈o〉 ∀ w ∈ attrs(p), IF t ∈ p ∧ t[x]〈d〉 θ t[y]〈d〉}.
Union
If p1 and p2 are two polygen relations of degree n, t1 ∈ p1, and t2 ∈ p2, then (p1 ∪ p2) = {t' | t' = t1 IF t1〈d〉 ∈ p1 ∧ t1〈d〉 ∉ p2; t' = t2 IF t2〈d〉 ∉ p1 ∧ t2〈d〉 ∈ p2; t'〈d〉 = t1〈d〉, t'〈o〉 = t1〈o〉 ∪ t2〈o〉, t'〈i〉 = t1〈i〉 ∪ t2〈i〉 IF t1〈d〉 = t2〈d〉}.
Difference
Let p〈o〉 denote the union of all the t〈o〉 sets in p, and p〈i〉 denote the union of all the t〈i〉 sets in p. If p1 and p2 are two polygen relations of degree n, then (p1 − p2) = {t' | t'〈d〉 = t〈d〉, t'〈o〉 = t〈o〉, t'[w]〈i〉 = t[w]〈i〉 ∪ p2〈o〉 ∪ p2〈i〉 ∀ w ∈ attrs(p1), IF t ∈ p1 ∧ t〈d〉 ∉ p2}.
The intermediate source portion t〈i〉 is updated by the Restriction and Difference operations. The Restriction operation selects the tuples in a polygen relation that satisfy the [x θ y] condition. As such, the originating local databases of the x and y attribute values are added to the t〈i〉 set in order to signify their mediating role. Since Select and Join are defined through Restriction, they also update t〈i〉. The Difference operation selects a tuple in p1 to be a tuple in (p1 − p2) if the data portion of the tuple in p1 is not identical to that of any tuple in p2. Since each tuple in p1 needs to be compared with all the tuples in p2, all the originating sources of the data in p2 should be included in the intermediate source set of (p1 − p2), as denoted by t'〈i〉 = t〈i〉 ∪ p2〈o〉 ∪ p2〈i〉. In contrast, the Projection, Cartesian Product, and Union operations do not involve intermediate local databases as mediating sources.
Other traditional operators can be defined in terms of the above five operators. The most common are Join, Select, and Intersection. Join and Select are defined as the restriction of a Cartesian product. Intersection is defined as the Projection of a Join over all the attributes in each of the relations involved in the Intersection. In order to process a polygen query, we also need to introduce the following new operators to the polygen model: Retrieve, Coalesce, Outer Natural Primary Join, Outer Natural Total Join, and Merge. A local database relation needs to be retrieved from a local database to the PQP before it is considered a PQP base relation. This is required in the polygen model because a polygen operation may require data from multiple local databases. Although a PQP base relation can be materialized dynamically like a view in a conventional database system, for conceptual purposes we define it to reside physically in the PQP. The Retrieve operation can be defined as an LQP Restriction operation without any restriction condition. The Coalesce and Outer Natural Join operations have been informally introduced by Date to handle a surprising number of practical applications. Coalesce takes two columns as input, and coalesces them into one column. An Outer Natural Join is an outer join with the join attributes coalesced [10]. An Outer Natural Primary Join is defined as an Outer Natural Join on the primary key of a polygen relation. An Outer Natural Total Join is an Outer Natural Primary Join with all the other polygen attributes in the polygen relation coalesced as well. Merge extends the Outer Natural Total Join to include more than two polygen relations. It can be shown that the order in which Outer Natural Total Join operations are performed over a set of polygen relations in a Merge is immaterial. Since Coalesce can be used in conjunction with the other polygen algebraic operators to define the Outer Natural Primary Join, Outer Natural Total Join, and Merge operations, it is defined as the sixth orthogonal primitive of the polygen model.
Coalesce
Let © denote the coalesce operator. If p is a polygen relation, x ∈ attrs(p), y ∈ attrs(p), z = attrs(p) − {x, y}, and w is the coalesced attribute of x and y, then p[x © y : w] = {t' | t'[z] = t[z], t'[w]〈d〉 = t[x]〈d〉, t'[w]〈o〉 = t[x]〈o〉 ∪ t[y]〈o〉, t'[w]〈i〉 = t[x]〈i〉 ∪ t[y]〈i〉, IF t[x]〈d〉 = t[y]〈d〉; t'[z] = t[z], t'[w]〈d〉 = t[x]〈d〉, t'[w]〈o〉 = t[x]〈o〉, t'[w]〈i〉 = t[x]〈i〉, IF t[y]〈d〉 = nil; t'[z] = t[z], t'[w]〈d〉 = t[y]〈d〉, t'[w]〈o〉 = t[y]〈o〉, t'[w]〈i〉 = t[y]〈i〉, IF t[x]〈d〉 = nil}. Note that in a heterogeneous distributed environment, the values to be coalesced may be inconsistent. It is assumed that inter-database instance mismatching problems will be resolved before the Coalesce operation is performed [30, 33]. We have presented the polygen model and the polygen algebra. The reader is referred to Wang and Madnick [31, 32] for a scenario that exemplifies the polygen model, a detailed presentation of a data-driven query translation mechanism for dynamically mapping a polygen query into a set of local queries, and an example that illustrates polygen query processing.
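As a final illustration of the polygen primitives, the three cases of the Coalesce definition above can be sketched over two cells, using the same illustrative structures as before (and assuming, as the text does, that mismatched non-nil values have already been reconciled):

```python
def coalesce_cells(cx, cy):
    """Coalesce two polygen cells into one, following the three cases of the definition."""
    if cy.datum is None:          # t[y]<d> = nil: keep x as is
        return cx
    if cx.datum is None:          # t[x]<d> = nil: keep y as is
        return cy
    if cx.datum == cy.datum:      # equal data: union the source tags
        return PolygenCell(cx.datum,
                           cx.origins | cy.origins,
                           cx.intermediates | cy.intermediates)
    raise ValueError("inconsistent values must be reconciled before Coalesce")
```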
THE ATTRIBUTE-BASED MODEL
The attribute-based data model facilitates cell-level tagging of data. It includes a definition that extends the relational model, a set of quality integrity rules, and a quality indicator algebra that can be used to process SQL queries augmented with quality indicator requirements. From these quality indicators, the user can make a better interpretation of the data and determine the believability of the data.
Given that zero-defect data may not be necessary or attainable, it would be useful to be able to judge the quality of data. This suggests that we tag data with quality indicators, which are characteristics of the data. From these quality indicators, the user can make a judgment of the quality of the data for the specific application at hand. For example, in making a financial decision to purchase stocks, it would be useful to know the quality of the data through quality indicators such as who originated the data, and when and how the data was collected.
Data structure
An attribute may have an arbitrary number of underlying levels of quality indicators. In order to associate an attribute with its immediate quality indicators, a mechanism must be developed to facilitate the linkage between the two, as well as between a quality indicator and the set of quality indicators associated with it. This mechanism is developed through the quality key concept. In extending the relational model, Codd made clear the need for uniquely identifying tuples through a system-wide unique identifier, called the tuple ID [7, 16]. This concept is applied in the attribute-based model to enable such linkage. Specifically, an attribute in a relation scheme is expanded into an ordered pair, called a quality attribute, consisting of the attribute and a quality key. This expanded scheme is referred to as a quality scheme. Correspondingly, each cell in a relational tuple is expanded into an ordered pair, called a quality cell, consisting of an attribute value and a quality key value. This expanded tuple is referred to as a quality tuple and the resulting relation is referred to as a quality relation. Each quality key value in a quality cell refers to the set of quality indicator values immediately associated with the attribute value. The quality indicator values are grouped together to form a kind of quality tuple called a quality indicator tuple. A quality relation composed of a set of these time-varying quality indicator tuples is called a quality indicator relation. The quality scheme that defines the quality indicator relation is referred to as the quality indicator scheme. The quality key thus serves as a foreign key that relates an attribute (or quality indicator) value to its associated quality indicator tuple. A quality scheme set is defined as the collection of a quality scheme and all the quality indicator schemes that are associated with it. We define a quality database as a database that stores not only data but also quality indicators. A quality schema is defined as a set of quality scheme sets that describes the structure of a quality database. Figure 2.2 illustrates the relationship among quality schemes, quality indicator schemes, quality scheme sets, and the quality schema.
Figure 2.2: Quality schemes, quality indicator schemes, quality scheme sets, and the quality schema
If qr1 is a quality relation and a is an attribute in qr1 that has quality indicators associated with it, then the quality key for a must be non-null. If qr2 is a quality indicator relation containing a quality indicator tuple for a, then all the attributes of qr2 are called level 1 quality indicators for a. Each attribute in qr2 can have a quality indicator relation associated with it. An attribute can have n levels of quality indicator relations associated with it (n ≥ 0). Following the constructs developed in the relational model, a domain is defined as a set of values of similar type. Let ID be the domain for a system-wide unique identifier. Let D be a domain for an attribute. Let DID be defined on the Cartesian product D × ID. Let id be a quality key value associated with an attribute value d, where d ∈ D and id ∈ ID. A quality relation (qr) of degree m is defined on the m+1 domains if it is a subset of the Cartesian product ID × DID1 × DID2 × ... × DIDm. Let qt be a quality tuple, which is an element in a quality relation. Then a quality relation qr is designated as: qr = {qt | qt = <id, did1, did2, ..., didm> where id ∈ ID, didj ∈ DIDj, j = 1, ..., m}.
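The quality-key linkage can be pictured with a small illustrative structure: each quality cell pairs an attribute (or indicator) value with a key, and the key points, like a foreign key, at a quality indicator tuple, which may itself contain further quality keys. All names and values here are ours, for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class QualityCell:
    value: Any                  # the attribute (or quality indicator) value
    quality_key: Optional[str]  # points to a quality indicator tuple, or None

# A quality indicator relation: quality_key -> quality indicator tuple, whose
# cells may in turn carry quality keys (level 2, level 3, ...).
qi_relation: Dict[str, Dict[str, QualityCell]] = {
    "k17": {"source":            QualityCell("DB#1", None),
            "collection_method": QualityCell("barcode", "k42")},
    "k42": {"inspector":         QualityCell("J. Doe", None)},
}

# One quality tuple of a quality relation: <tuple id, quality cells ...>.
trade = {"tuple_id": "t001",
         "price":    QualityCell(52.25, "k17"),   # price tagged with indicators at k17
         "symbol":   QualityCell("IBM", None)}
```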
Data manipulation
In order to present the attribute-based algebra formally, we first define two key concepts that are fundamental to the quality indicator algebra: QI-Compatibility and QIV-Equal.
QI-Compatibility and QIV-Equal
Let a1 and a2 be two application attributes. Let QI(ai) denote the set of quality indicators associated with ai. Let S be a set of quality indicators. If S ⊆ QI(a1) and S ⊆ QI(a2), then a1 and a2 are defined to be QI-Compatible with respect to S.3 For example, if S = {qi1, qi2, qi21}, then the attributes a1 and a2 shown in Figure 2.3 are QI-Compatible with respect to S, whereas if S = {qi1, qi22}, then the attributes a1 and a2 shown in Figure 2.3 are not QI-Compatible with respect to S.
Figure 2.3: QI-Compatibility Example
Let a1 and a2 be QI-Compatible with respect to S. Let w1 and w2 be values of a1 and a2 respectively. Let qi(w1) be the value of quality indicator qi for the attribute value w1, where qi ∈ S (e.g., qi2(w1) = v2 in Figure 2.4). Define w1 and w2 to be QIV-Equal with respect to S provided that qi(w1) = qi(w2) ∀ qi ∈ S, denoted as w1 =S w2. In Figure 2.4, for example, w1 and w2 are QIV-Equal with respect to S = {qi1, qi21}, but not QIV-Equal with respect to S = {qi1, qi31}, because qi31(w1) = v31 whereas qi31(w2) = x31.
Figure 2.4: QIV-Equal Example
3 It is assumed that the numeric subscripts (e.g., qi11) map the quality indicators to unique positions in the quality indicator tree.
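QI-compatibility and QIV-equality reduce to simple set and value comparisons. The sketch below assumes that each attribute value carries a flat mapping from quality indicator names (positions in the indicator tree) to indicator values; the representation and names are ours, chosen only to illustrate the two definitions.

```python
def qi_compatible(qi_a1, qi_a2, S):
    """a1 and a2 are QI-Compatible w.r.t. S if S is contained in both indicator sets."""
    return set(S) <= set(qi_a1) and set(S) <= set(qi_a2)

def qiv_equal(ind_w1, ind_w2, S):
    """w1 =S w2: the indicator values agree on every indicator in S."""
    return all(ind_w1.get(qi) == ind_w2.get(qi) for qi in S)

# Example mirroring Figure 2.4: equal on {qi1, qi21}, unequal once qi31 is included.
w1 = {"qi1": "v1", "qi21": "v21", "qi31": "v31"}
w2 = {"qi1": "v1", "qi21": "v21", "qi31": "x31"}
assert qiv_equal(w1, w2, {"qi1", "qi21"})
assert not qiv_equal(w1, w2, {"qi1", "qi31"})
```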
In practice, it is tedious to explicitly state all the quality indicators to be compared (i.e., to specify all the elements of S). To alleviate this situation, we introduce i-level QI-Compatibility (i-level QIV-Equal) as a special case of QI-Compatibility (QIV-Equal) in which all the quality indicators up to a certain level of depth in a quality indicator tree are considered. Let a1 and a2 be two application attributes, and let w1 and w2 be values of a1 and a2 respectively. Then a1 and a2 are defined to be i-level QI-Compatible if the following two conditions are satisfied: (1) a1 and a2 are QI-Compatible with respect to S, and (2) S consists of all quality indicators present within i levels of the quality indicator tree of a1 (and thus of a2). By the same token, i-level QIV-Equal between w1 and w2, denoted by w1 =i w2, can be defined. If i is the maximum level of depth in the quality indicator tree, then a1 and a2 are defined to be maximum-level QI-Compatible. Similarly, maximum-level QIV-Equal between w1 and w2, denoted by w1 =m w2, can also be defined. An example using the algebraic operations in the quality indicator algebra is given in Wang, Reddy, and Kon [25].
In order to illustrate the relationship between the quality indicator algebraic operations and the high-level user query, the SELECT, FROM, WHERE structure of SQL is extended with an extra clause, "with QUALITY." This extra clause enables a user to specify the quality requirements regarding an attribute referred to in a query. If the "with QUALITY" clause is absent from a user query, then the user has no explicit constraints on the quality of the data being retrieved. In that case, quality indicator values are not compared in the retrieval process; however, the quality indicator values associated with the application data are still retrieved.
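The effect of a "with QUALITY" clause can be pictured as an extra filter applied after the ordinary selection predicate. The sketch below is a conceptual illustration only, not the syntax or implementation of any actual SQL extension; the record layout and names are assumptions of ours.

```python
def retrieve(rows, predicate, quality_requirements=None):
    """Return rows satisfying the ordinary predicate; if quality requirements are
    given, additionally require the attached quality indicators to match them."""
    out = []
    for row in rows:
        if not predicate(row["data"]):
            continue
        if quality_requirements:
            qi = row["quality"]   # quality indicator values tagged to the row
            if not all(qi.get(k) == v for k, v in quality_requirements.items()):
                continue
        out.append(row)           # quality indicators are returned in either case
    return out

# Roughly: SELECT ... WHERE price > 50 with QUALITY source = 'DB#1' (illustrative)
rows = [{"data": {"symbol": "IBM", "price": 52.25},
         "quality": {"source": "DB#1", "collection_method": "barcode"}}]
print(retrieve(rows, lambda d: d["price"] > 50, {"source": "DB#1"}))
```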
Quality Indicator Algebra
Following the relational algebra, we define the five orthogonal quality relational algebraic operations, namely Selection, Projection, Union, Difference, and Cartesian product [17]. In the following operations, let QR and QS be two quality schemes and qr and qs be two quality relations associated with QR and QS respectively. Let a and b be two attributes in both QR and QS. Let t1 and t2 be two quality tuples and Sa be a set of quality indicators specified by the user for the attribute a; in other words, Sa is constructed from the specifications given by the user in the "with QUALITY" clause. Let the term t1.a = t2.a denote that the values of the attribute a in the tuples t1 and t2 are identical. Let t1.a =Sa t2.a denote that the values of attribute a in the tuples t1 and t2 are QIV-Equal with respect to Sa. In a similar fashion, let t1.a =i t2.a and t1.a =m t2.a denote i-level QIV-Equal and maximum-level QIV-Equal between the values of t1.a and t2.a respectively.
Selection
Selection is a unary operation that selects only a horizontal subset of a quality relation (and its corresponding quality indicator relations) based on the conditions specified in the Selection operation. There are two types of conditions in the Selection operation: regular conditions for an application attribute and quality conditions for the quality indicator relations corresponding to the application attribute. The Selection, σqC, is defined as follows:
σqC(qr) = {t | ∀ t1 ∈ qr, ∀ a ∈ QR, ((t.a = t1.a) ∧ (t.a =m t1.a)) ∧ C(t1)}, where C(t1) = e1 Φ e2 Φ ... Φ en Φ e1q Φ ... Φ epq; ei is in one of the forms (t1.a θ constant) or (t1.a θ t1.b); eiq is in one of the forms (qik = constant) or (t1.a =Sa,b t1.b) or (t1.a =i t1.b) or (t1.a =m t1.b); qik ∈ QI(a); Φ ∈ {∧, ∨, ¬}; θ ∈ {=, ≠, <, >, ≤, ≥}; and Sa,b is the set of quality indicators used during the comparison of t1.a and t1.b.
Projection
Projection is a unary operation that selects a vertical subset of a quality relation based on the set of attributes specified in the Projection operation. The result includes the projected quality relation and the corresponding quality indicator relations that are associated with the set of attributes specified in the Projection operation. Let PJ be the attribute set specified; then the Projection ΠqPJ(qr) is defined as follows:
ΠqPJ(qr) = {t | ∀ t1 ∈ qr, ∀ a ∈ PJ, ((t.a = t1.a) ∧ (t.a =m t1.a))}
Union
In this operation, the two operand quality relations must be QI-Compatible. The result includes (1) tuples from both qr and qs after elimination of duplicates, and (2) the corresponding quality indicator relations that are associated with the resulting tuples.
qr ∪q qs = qr ∪ {t | ∀ t2 ∈ qs, ∃ t1 ∈ qr, ∀ a ∈ QR, ((t.a = t2.a) ∧ (t.a =m t2.a) ∧ ¬((t1.a = t2.a) ∧ (t1.a =Sa t2.a)))}
In the above expression, "¬((t1.a = t2.a) ∧ (t1.a =Sa t2.a))" is meant to eliminate duplicates. Tuples t1 and t2 are considered duplicates provided that (1) there is a match between their corresponding attribute values (i.e., t1.a = t2.a) and (2) these values are QIV-Equal with respect to the set of quality indicators Sa specified by the user (i.e., t1.a =Sa t2.a).
Difference
In this operation, the two operand quality relations must be QI-Compatible. The result of the Difference operation consists of all tuples from qr that are not equal to tuples in qs. During this equality test, the quality indicators specified by the user for each attribute value in the tuples t1 and t2 are also taken into consideration.
qr −q qs = {t | ∀ t1 ∈ qr, ∃ t2 ∈ qs, ∀ a ∈ QR, ((t.a = t1.a) ∧ (t.a =m t1.a) ∧ ¬((t1.a = t2.a) ∧ (t1.a =Sa t2.a)))}
Just as in relational algebra, other algebraic operators such as Intersection and Join can be derived from these five orthogonal operators. Below is a list of the many different ways the attribute-based model can be applied:
• The ability of the model to support quality indicators at multiple levels makes it possible to retain the origin and intermediate data sources.
• A user can filter the data retrieved from a database according to quality requirements.
• Data authenticity and believability can be improved by data inspection and certification. A quality indicator value could indicate who inspected or certified the data and when it was inspected. The reputation of the inspector will enhance the believability of the data.
• The quality indicators associated with data can help clarify data semantics, which can be used to resolve semantic incompatibility among data items received from different sources. This capability is very useful in an interoperable environment where data in different databases have different semantics.
• Quality indicators associated with an attribute may facilitate a better interpretation of null values.
• In a data quality control process, when errors are detected, the data administrator can identify the source of error by examining quality indicators such as data source or collection method.
Data Integrity
A fundamental property of the attribute-based model is that an attribute value and its corresponding quality indicator values (including all descendants) are treated as an atomic unit. By atomic unit we mean that whenever an attribute value is created, deleted, retrieved, or modified, its corresponding quality indicators also need to be created, deleted, retrieved, or modified respectively. In other words, an attribute value and its corresponding quality indicator values behave atomically. We refer to this property as the atomicity property hereafter. This property is enforced by a set of quality referential integrity rules, as defined below (a sketch of the recursive cascade appears after the list).
• Insertion: Insertion of a tuple in a quality relation must ensure that for each non-null quality key present in the tuple (as specified in the quality schema definition), the corresponding quality indicator tuple is inserted into the child quality indicator relation. For each non-null quality key in the inserted quality indicator tuple, a corresponding quality indicator tuple must be inserted at the next level. This process continues recursively until no more insertions are required.
• Deletion: Deletion of a tuple in a quality relation must ensure that for each non-null quality key present in the tuple, the corresponding quality data are deleted from the table corresponding to the quality key. This process continues recursively until a tuple is encountered with all null quality keys.
• Modification: If an attribute value is modified in a quality relation, then the descendant quality indicator values of that attribute must be modified.
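Because quality keys form a tree of indicator tuples, the insertion and deletion rules are naturally recursive. The sketch below uses the illustrative layout from the earlier quality-relation sketch (qi_relation maps a quality key to an indicator tuple whose cells carry a quality_key field); it shows the cascading deletion, and insertion would mirror it in the opposite direction. Names are ours.

```python
def delete_quality_tree(quality_key, qi_relation):
    """Recursively delete the quality indicator tuple for `quality_key` and all of
    its descendant indicator tuples, enforcing the atomicity property."""
    indicator_tuple = qi_relation.pop(quality_key, None)
    if indicator_tuple is None:
        return
    for cell in indicator_tuple.values():
        if cell.quality_key is not None:      # non-null key -> child indicator tuple
            delete_quality_tree(cell.quality_key, qi_relation)

def delete_quality_tuple(qt, qi_relation):
    """Delete a quality tuple: cascade deletion to every indicator tuple it refers to."""
    for cell in qt.values():
        key = getattr(cell, "quality_key", None)
        if key is not None:
            delete_quality_tree(key, qi_relation)
```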
CONCLUSION
Both of the models presented in this chapter capture data quality by extending the relational model. The Polygen Model resolves the data source tagging and intermediate source tagging problems. It addresses issues in heterogeneous distributed database systems from the "where" perspective and thus enables us to interpret data from different sources more accurately. Furthermore, it follows the relational model by specifying the data structure and data manipulation components of the data model. However, it does not include the data integrity component. The Attribute-based Model, on the other hand, allows for the structure, storage, and processing of quality relations and quality indicator relations through a quality indicator algebra. In addition, it includes a description of its data structure, a set of data integrity constraints for the model, and a quality indicator algebra.
References
[1] Ballou, D., I. Chengalur-Smith and R. Y. Wang, A Sampling Procedure for Data Quality Auditing in the Relational Environment (No. TDQM-00-02), Massachusetts Institute of Technology, 2000.
[2] Ballou, D. P. and H. L. Pazer, "Designing Information Systems to Optimize the Accuracy-Timeliness Tradeoff," Information Systems Research, 6(1), 1995, pp. 51-72.
[3] Ballou, D. P., R. Y. Wang, H. Pazer and G. K. Tayi, "Modeling Information Manufacturing Systems to Determine Information Product Quality," Management Science, 44(4), 1998, pp. 462-484.
[4] Batini, C., M. Lenzerini and S. Navathe, "A comparative analysis of methodologies for database schema integration," ACM Computing Surveys, 18(4), 1986, pp. 323-364.
[5] Codd, E. F., "A Relational Model of Data for Large Shared Data Banks," Communications of the ACM, 13(6), 1970, pp. 377-387.
[6] Codd, E. F., "Relational completeness of data base sublanguages," in Data Base Systems, R. Rustin, Editor, Prentice Hall, 1972.
[7] Codd, E. F., "Extending the relational database model to capture more meaning," ACM Transactions on Database Systems, 4(4), 1979, pp. 397-434.
[8] Codd, E. F., "Relational database: A Practical Foundation for Productivity, the 1981 ACM Turing Award Lecture," Communications of the ACM, 25(2), 1982, pp. 109-117.
[9] Codd, E. F., "An evaluation scheme for database management systems that are claimed to be relational," in Proceedings of the Second International Conference on Data Engineering, Los Angeles, CA, pp. 720-729, 1986.
[10] Date, C. J., "The outer join," in Proceedings of the 2nd International Conference on Databases, Cambridge, England, pp. 76-106, 1983.
[11] Date, C. J., An Introduction to Database Systems, 5th ed., Addison-Wesley Systems Programming Series, Addison-Wesley, Reading, 1990.
[12] DeMichiel, L. G., "Performing operations over mismatched domains," pp. 36-45, Los Angeles, CA, 1989.
[13] Elmasri, R., J. Larson and S. Navathe, Schema integration algorithms for federated databases and logical database design, technical report, 1987.
[14] Huang, K., Y. Lee and R. Wang, Quality Information and Knowledge, Prentice Hall, Upper Saddle River, NJ, 1999.
[15] Kahn, B. K., D. M. Strong and R. Y. Wang, "Information Quality Benchmarks: Product and Service Performance," Communications of the ACM, 1999.
[16] Khoshafian, S. N. and G. P. Copeland, "Object Identity," in The Morgan Kaufmann Series in Data Management Systems, S. B. Zdonik and D. Maier, Eds., Morgan Kaufmann, San Mateo, CA, 1990.
[17] Klug, A., "Equivalence of relational algebra and relational calculus query languages having aggregate functions," Journal of the ACM, 29, 1982, pp. 699-717.
[18] Lee, T. and S. Bressen, "Multimodal Integration of Disparate Information Sources with Attribution," in Proceedings of the Entity Relationship Workshop on Information Retrieval and Conceptual Modeling, 1997.
[19] Lee, T., S. Bressen and S. Madnick, "Source Attribution for Querying Against Semistructured Documents," in Proceedings of the Workshop on Web Information and Data Management, ACM Conference on Information and Knowledge Management, 1998.
[20] Lee, T. and L. McKnight, "Internet Data Management: Policy Barriers to an Intermediated Electronic Market in Data," in Proceedings of the 27th Annual Telecommunications Policy Research Conference, 1999.
[21] Madnick, S. E., "Database in the Internet Age," Database Programming and Design, 1997, pp. 28-33.
[22] Reddy, M. P. and R. Y. Wang, "Estimating Data Accuracy in a Federated Database Environment," in Proceedings of the 6th International Conference, CISMOD (also in Lecture Notes in Computer Science), Bombay, India, pp. 115-134, 1995.
[23] Rob, P. and C. Coronel, Database Systems: Design, Implementation, and Management, 3rd ed., Course Technology, Boston, 1997.
[24] Strong, D. M., Y. W. Lee and R. Y. Wang, "Data Quality in Context," Communications of the ACM, 40(5), 1997, pp. 103-110.
[25] Wang, R. Y., M. P. Reddy and H. B. Kon, "Toward quality data: An attribute-based approach," Decision Support Systems (DSS), 13, 1995, pp. 349-372.
[26] Wang, R. Y., V. C. Storey and C. P. Firth, "A Framework for Analysis of Data Quality Research," IEEE Transactions on Knowledge and Data Engineering, 7(4), 1995, pp. 623-640.
[27] Wang, R. Y. and D. M. Strong, "Beyond Accuracy: What Data Quality Means to Data Consumers," Journal of Management Information Systems (JMIS), 12(4), 1996, pp. 5-34.
[28] Wang, Y. R. and S. E. Madnick, Connectivity among information systems, Vol. 1, Composite Information Systems (CIS) Project, MIT Sloan School of Management, Cambridge, MA, 1988.
[29] Wang, Y. R. and S. E. Madnick, "Facilitating connectivity in composite information systems," ACM Data Base, 20(3), 1989, pp. 38-46.
[30] Wang, Y. R. and S. E. Madnick, "The inter-database instance identification problem in integrating autonomous systems," in Proceedings of the Fifth International Conference on Data Engineering, Los Angeles, CA, pp. 46-55, 1989.
[31] Wang, Y. R. and S. E. Madnick, "A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective," in Proceedings of the 16th International Conference on Very Large Data Bases (VLDB), Brisbane, Australia, pp. 519-538, 1990.
[32] Wang, Y. R. and S. E. Madnick, "A Source Tagging Theory for Heterogeneous Database Systems," in Proceedings of the International Conference on Information Systems, Copenhagen, Denmark, pp. 243-256, 1990.
[33] Yuan, Y., The design and implementation of system P: A polygen database management system (No. CIS-90-07), MIT Sloan School of Management, Cambridge, MA, 1990.
CHAPTER 3
Extending the ER Model to Represent Data Quality Requirements
In the previous chapter, we described how data quality can be incorporated into the relational model. To achieve data-quality-by-design, it would be useful to incorporate data quality attributes at the conceptual design stage of a database application. Conventional conceptual data models and their corresponding design methodologies, however, have been developed to capture entities, relationships, attributes, and other advanced concepts such as is-a and component-of relationships. Data quality is not explicitly recognized. The task of incorporating data quality into the design of a database application is left to the designer. Consequently, some database applications capture data quality attributes whereas others don't. Indeed, Chen [1] stated that: "Most data models (including the ER model) assume that all data given is correct. We need a scientific way to assess the quality of the data in (or to be put into) the database. We also need to incorporate such quality measures into the Entity-Relationship (ER) model as well as to develop algebra- or calculus-based operations to manipulate such data quality descriptions."
This chapter presents a Quality Entity Relationship (QER) model4 proposed by Storey and Wang [6] that incorporates data quality requirements into the database design process. The procedure for doing so involves differentiating application quality requirements and data quality requirements from application requirements, and then developing the constructs and procedures necessary to represent the data quality requirements in an ER model.
MOTIVATING EXAMPLE
Consider a training database used to keep track of employee training for career development purposes. Data is needed on courses and the vendors who supply them, students who are enrolled and those who are wait-listed to take the classes, job skills, as well as course evaluations, proficiency levels, and other information. A course can be offered by different vendors, as shown in the ER model in Figure 3.1. The min/max cardinalities [8] indicate that, for each Course, there can be zero-to-many vendors who offer the course and one-to-many courses offered by each vendor. This conceptual model represents the traditional approach to database design, but does not capture quality requirements such as the reputation of the vendor or the standards for the course.
Figure 3.1: A training database example
4 Adapted from the Proceedings of the 1998 Conference on Information Quality. Storey, V. C. and R. Y. Wang, "An Analysis of Quality Requirements in Database Design," pp. 64-87, with permission from the MIT Total Data Quality Management (TDQM) Program.
Traditional user requirements as well as user quality requirements must be represented in the database design. To do so, user requirements are divided into three categories:
• Application Requirements: Widely accepted and practiced. Traditional application requirements analysis identifies data items such as employees and their ranks and salaries, or departments and their names and locations, that are fundamental to a database application [7]. These are the requirements that are usually discussed and represented when designing a database.
• Application Quality Requirements: Application quality requirements may arise at the input, process, and output stages associated with the production of a product [2-4]. Such requirements include measurements against standards and comparisons of final products to initial design specifications. Traditionally, these have not been recognized or captured during the design process.
• Data Quality Requirements: Data quality requirements correspond to the quality of the actual data in the database. They can be associated with either application requirements or application quality requirements. As previously mentioned, these requirements are multidimensional in nature.
In the training example, Course and Instructor and their relationships are application requirements. One may want to know what standards a course has. This is an application quality requirement. An example of a data quality requirement is the accuracy of the class attendance data.
There are various reasons why the three types of requirements are separated. First, quality requirements (both application quality and data quality requirements) differ from application requirements. Since they are often overlooked, this forces the user to identify them specifically during requirements analysis. Second, most existing systems were not built with quality requirements in mind. Therefore, the requirements can be elicited separately and added without requiring a completely new design. This is important for legacy systems. Finally, there is subjectivity in deciding what quality items to include and what measurement scales to use.
QUALITY REQUIREMENTS IDENTIFICATION
The identification of quality requirements must begin at the requirements analysis phase because quality requirements are both application- and user-dependent. During the identification of application requirements, the designer collects requirements and expresses them in English sentences [1]. Application quality requirements and data quality requirements can be identified in a similar manner; that is, by the designer working with the user to elicit the requirements and possibly suggesting some quality requirements based upon the designer's experience with similar or related application domains. Wang and Strong [12] provide a framework that categorizes data quality into four main categories: 1) intrinsic data quality (believability, accuracy, objectivity, reputation), 2) contextual data quality (value-added, relevancy, timeliness, completeness, appropriate amount of data), 3) representational data quality (interpretability, ease of understanding, representational consistency, concise representation), and 4) accessibility data quality (accessibility, access security). A designer, possibly with the assistance of an automated design tool, could use this framework during requirements elicitation as a checklist of data quality items, drilling down as appropriate to the underlying elements. Alternatively, the system could use them to try to verify that the user had chosen appropriate data quality measures.
REQUIREMENTS MODELING
The quality design process is summarized in Figure 3.2. Since the identification of traditional application requirements is well understood, it is the logical foundation upon which to analyze application quality requirements. The acquisition of quality requirements is shown as Steps 1 and 2. Since both application requirements and application quality requirements will ultimately be captured and stored as data in a database management system, a mechanism must be available to measure the quality of the data that are stored to represent these two types of requirements. This is the task of Step 3: identifying data quality requirements.
Figure 3.2: Quality Requirements Modeling Process
The arrow from Step 1 to Step 2 signifies that application requirements identification precedes application quality requirements identification. The arrows from Steps 1-2 to Step 3 indicate that application and application quality requirements are modeled before data quality requirements. As in conventional database design and software engineering practices, these steps may be iterated. The distinctions among application requirements, application quality requirements, and data quality requirements are important at both the requirements analysis and conceptual design stages. Both application requirements and application quality requirements are identified to represent a real-world system [5, 9]. They are modeled as entities, relationships, and attributes. Data quality requirements are identified to measure the quality of the attribute values (data) that are stored in a database that represents a real-world system. Therefore, new constructs are needed to model data quality requirements in a conceptual design.
It might be argued that the distinctions made among application requirements, application quality requirements, and data quality requirements are not necessary because a view mechanism can be applied to provide separate views to the user for application-oriented data, application quality-oriented data, and data quality-oriented data. We reject this argument outright. In order to make use of the view mechanism to provide separate views, an underlying schema containing all the data items that represent application requirements, application quality requirements, and data quality requirements is needed. However, developing such a schema is exactly why the distinctions are made in the first place. On the other hand, once a conceptual schema that incorporates application, application quality, and data quality requirements is obtained and the corresponding database system developed, then view mechanisms can be applied to restrict views that correspond to each of these types of requirements. This leads to the following principle:
The Data Quality (DQ) Separation Principle: Data quality requirements are modeled separately from application requirements and application quality requirements.
MODELING DATA QUALITY REQUIREMENTS
Underlying the Data Quality Separation Principle is the need to measure the quality of attribute values. One might try to model the quality of an attribute value as another attribute for the same entity. This would have undesirable results such as violation of normalization principles in the relational model (e.g., "address quality" is dependent upon "address," which is dependent upon "person-id"). Another approach is to model the quality of an attribute as a meta-attribute [10, 11]. However, no constructs at the conceptual design level are explicitly defined for capturing meta-attributes. We, therefore, introduce and define two types of data quality entities: 1) a Data Quality Dimension entity and 2) a Data Quality Measure entity.
To model the quality of attribute values, an attribute gerund representation is needed.
Data Quality Dimension Entity
A user needs to be able to capture different data quality dimensions for some (or all) attributes of an entity; for example, the accuracy and timeliness of the attribute "attendance" or the timeliness of the "re-certification period" (is it up-to-date?). Only one representation is desired, but it needs to be flexible enough to represent different data quality dimensions for different attribute values. A Data Quality Dimension entity is introduced for this purpose. Its primary key is the combination of Dimension Name (D-Name), which can take on values such as 'timeliness' and 'completeness', and Rating, which stores the corresponding value that is obtained on that dimension:
Data Quality Dimension: [Dimension-Name, Rating]
For example: [Accuracy, 1], [Accuracy, 2], [Timeliness, yes]
The values that appear in the populated entity are the set of values that exist in the database and will usually be a subset of all possible combinations. They represent the actual dimensions upon which an attribute's data quality is assessed, and the values obtained on those dimensions. Because both attributes are part of the key we are able, for example, to capture the fact that the attribute "actual cost" has an accuracy of '1' whereas the accuracy of attribute "attendance" is '2', using only one entity. This representation assumes that the accuracy for all attributes is rated on the same scale. If, for example, the accuracy scale can differ depending on the attribute, then this can be accommodated by adding the attribute name to the key as follows:
Data Quality Dimension: [Dimension-Name, Attribute, Rating]
For example: [Accuracy, class attendance, 1], [Accuracy, actual cost, good]
Data Quality Measure Entity
To completely capture the quality aspects, it is necessary to store information on the values that are assigned to a data quality dimension, e.g., the interpretation of "rating". A Data Quality Measure entity is introduced to capture the interpretation of the quality attribute values. It consists, for example, of a primary key, "Rating", that could have values '1', '2', and '3', and a textual non-key attribute, "description", which stores a description of what the rating values mean (e.g., 1 is excellent, 2 is acceptable, 3 is poor). The Data Quality Measure entity enables different data quality dimensions with heterogeneous measurement scales to be modeled (e.g., accuracy is measured between '1' and '3', and timeliness is measured as 'yes' or 'no'):
Data Quality Measure: [Rating, description]
For example: [1, excellent] (for accuracy); [yes, up-to-date] (for timeliness)
There is another reason for introducing this separate Data Quality Measure entity: we want to be able to capture all of the descriptions of "rating" whether or not they currently appear in the database. If "description" were a non-key attribute of the Data Quality Dimension entity, then only those that are instantiated would appear. The purpose of the Data Quality Measure entity is, in essence, for its corresponding relation to serve as a complete "look-up" table. Furthermore, including "description" in the Data Quality Dimension entity would lead to a second normal form violation in the relational model because "description" would depend on only part of the key ("rating"). Analogous to the above, the Data Quality Measure entity could be expanded to model different scales for different attributes for the same data quality measure. For example, suppose the accuracy rating for "attendance" has a different interpretation than the accuracy for "base cost"; then the key would be expanded to include both the dimension and attribute names:
Data Quality Measure: [Dimension-Name, Attribute, Rating, description]
For example: [Accuracy, attendance, 1, excellent], [Accuracy, base cost, 90%, 90% correct], [Timeliness, attendance, yes, up-to-date]
Attribute Gerund Representation
As shown in Figure 3.3, the application entity Class has "attendance" as an attribute for which the user might be interested in its data quality ratings. In the entity-relationship model there is no direct mechanism to associate an attribute ("attendance") of one entity (Class) with another entity (e.g., a Data Quality Dimension entity or even a Data Quality Measure entity). Furthermore, "attendance" is an application attribute and, by the DQ Separation Principle, an application requirement must remain part of the application specification, and not become an entity or attribute below it. This problem is resolved using the Attribute Gerund Representation. First the relationship, Class attendance-data-has Data Quality Dimension, is created, which includes "attendance" as part of its name. This data quality relationship can be represented as a gerund entity [1]. Then, a relationship between the gerund and the Data Quality Measure entity provides a mechanism for retrieving the descriptions of the data quality dimension values.
An alternative representation would be to make “rating” an attribute of the relationship Class attendance-data-has Data Quality Dimensions. This representation, however, does not capture additional data on rating such as descriptive data on its value. It would only be appropriate if the interpretation of rating is obvious, but this is not usually the case. Further analysis of this and alternative representations is provided in Storey and Wang [6]. In general, the Attribute Gerund Representation can be applied when an application entity attribute needs to be associated directly with a data quality entity that has various quality categories, each of which has a corresponding Rating and “description” (or other data).
Figure 3.3: Attribute Gerund Representation
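The gerund construction maps naturally onto a few relational tables. The rows and Python representation below are hypothetical, purely to illustrate how a rating on the attendance data would be retrieved and interpreted through the Data Quality Dimension and Data Quality Measure entities.

```python
# Data Quality Dimension: the (dimension, rating) pairs actually used in the database.
dq_dimension = {("Accuracy", "2"), ("Timeliness", "yes")}

# Data Quality Measure: a complete look-up table interpreting each rating value.
dq_measure = {"1": "excellent", "2": "acceptable", "3": "poor", "yes": "up-to-date"}

# Gerund for "Class attendance-data-has Data Quality Dimension":
# it links a Class occurrence to the dimension/rating of its attendance data.
class_attendance_dq = {("C101", "Accuracy"): "2", ("C101", "Timeliness"): "yes"}

# Retrieve and interpret the accuracy of class C101's attendance data.
rating = class_attendance_dq[("C101", "Accuracy")]
print(rating, "-", dq_measure[rating])   # 2 - acceptable
```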
Conceptual Design Example
This section extends the training database example to incorporate data on course standards along with other quality requirements. A user might want to know what standards (an application quality entity) a course (an application entity) might have. This can be modeled by the relationship, Course has Standard, as shown in Figure 3.4. The application quality entity Standard Rating is created with primary key [Std-Name, Value] and a textual non-key attribute "interpretation".
Figure 3.4: Course Standards
Each occurrence of the relationship Course has Standard has a standard rating value. However, in the entity-relationship model, an entity can only be associated with another entity. Therefore, this relationship must first be converted to an entity gerund [1], called Course-Standard. Since Course has Standard is a many-to-many relationship, the primary key of the gerund is the concatenation of the keys of Course and Standard. Then, a new relationship is created between this gerund and Standard-Rating; that is, Course-Standard has Standard-Rating. The final entity-relationship model is shown in Figure 3.5. The steps involved in the development of the final entity-relationship model are outlined in Figure 3.6 and are as follows. In the first step, identify user requirements. The user, with assistance from the designer, would be asked to identify the entities, attributes, and relationships of interest as well as any data quality aspects that need to be captured. Note, for example, that, analogous to the "attendance" attribute of Class, the "value" attribute of Standard Rating could have a data quality dimension. The attribute gerund representation would again be used. First the relationship, Standard Rating value-data-has DQ Dimension, would be created. This would be represented by the entity gerund Standard-Rating-Value-DQ-Dimension which, in turn, would be associated with the DQ Measure quality entity.
Figure 3.5: An extended conceptual design of the course database
Step 1: Identify user requirements:
What courses are available, their certification data and cost. What classes are offered for a course? Accuracy and timeliness measures of class attendance, along with a description of what the accuracy and timeliness values mean (data quality requirements). The instructor of a class. What standards are required for a course, and a course's rating? (Application quality requirements)
Step 2: Identify application entities:
Course: [CourseId, base cost, re-certification period, description]
Class: [ClassId, attendance, actual cost]
Instructor: [InstructorId, phone]
Step 3: Identify corresponding application quality entities:
Standard: [Std-Name, description]
Standard Rating: [Std-Name, Value, interpretation]
Step 4: Identify data quality entities:
DQ Dimension: [D-Name, Rating]
DQ Measure: [Rating, description]
Step 5: Associate application entities with application quality entities:
Course-Standard: [CourseID, Std-Name]
Step 6: Associate application attributes with data quality entities:
Class-Attendance-DQ-Dimension: [ClassID, D-Name]
Step 7: List the application relationships:
Course has Classes [0,N] [1,1]
Instructor teaches Class [1,N] [1,N]
Step 8: List the application quality relationships:
Course has Standards [1,N] [0,N]
Course-Standard has Standard Rating [1,1] [0,N]
Step 9: List the data quality relationships:
Class attendance-data-has DQ Dimension [1,N] [1,N]
Class-Attendance-DQ-Dimension has DQ Measure [1,1] [0,N]
Figure 3.6: Conceptual design for the course database
CONCLUDING REMARKS
We have presented how a DQ model can be developed through an extension of the entity-relationship model at the conceptual level. The reader is referred to Storey and Wang [6] for a detailed quality database design example adapted from a human resources system. This example shows the logical design and the corresponding relational database development given a quality-entity-relationship model.
References
[1] Chen, P. S., "The Entity-Relationship Approach," in Information Technology in Action: Trends and Perspectives, R. Y. Wang, Editor, Prentice Hall, Englewood Cliffs, 1993.
[2] Hauser, J. R. and D. Clausing, "The House of Quality," Harvard Business Review, 66(3), 1988, pp. 63-73.
[3] ISO, ISO 9000 International Standards for Quality Management, International Standard Organization, Geneva, 1992.
[4] Juran, J. M., Juran on Quality by Design: The New Steps for Planning Quality into Goods and Services, Free Press, New York, 1992.
[5] Kent, W., Data and Reality, North Holland, New York, 1978.
[6] Storey, V. C. and R. Y. Wang, "An Analysis of Quality Requirements in Database Design," in Proceedings of the 1998 Conference on Information Quality, Massachusetts Institute of Technology, pp. 64-87, 1998.
[7] Teorey, T. J., Database Modeling and Design: The Entity-Relationship Approach, Morgan Kaufmann Publishers, San Mateo, CA, 1990.
[8] Tsichritzis, D. and F. Lochovsky, Data Models, Prentice Hall, Englewood Cliffs, NJ, 1982.
[9] Wand, Y. and R. Weber, "On the Ontological Expressiveness of Information Systems Analysis and Design Grammars," 3(3), 1993, pp. 217-237.
[10] Wang, R. Y., H. B. Kon and S. E. Madnick, "Data Quality Requirements Analysis and Modeling," in Proceedings of the 9th International Conference on Data Engineering, Vienna, pp. 670-677, 1993.
[11] Wang, R. Y., M. P. Reddy and H. B. Kon, "Toward quality data: An attribute-based approach," Decision Support Systems (DSS), 13, 1995, pp. 349-372.
[12] Wang, R. Y. and D. M. Strong, "Beyond Accuracy: What Data Quality Means to Data Consumers," Journal of Management Information Systems (JMIS), 12(4), 1996, pp. 5-34.
CHAPTER 4
Automating Data Quality Judgment
In the previous two chapters, we have presented models to represent data quality requirements. At the conceptual level, Storey and Wang [4] propose a Quality Entity-Relationship (QER) model that incorporates data quality requirements. At the logical level, the Polygen Model [8, 9] and the Attribute-Based Model [6] extend the relational model to capture data quality attributes and data quality indicators. In this chapter, we present a knowledge-based model that derives an overall data quality value from local relationships among quality parameters.5 Because the model provides an overall measure of data quality, information consumers can use it for data quality judgment. This model was first proposed in 1991 by Jang and Wang [2]. Although inconclusive, this effort represented a first attempt at automating data quality judgment. It can provide a basis for implementing data quality in environments where the quality of data must be judged automatically by the system. The model was later extended in 1995 by Wang, Reddy, and Kon [6].
5 Adapted from the MIT CISL working paper (No. CISL-91-08), Jang, Y. and Y. R. Wang, Data Quality Calculus: A data-consumer-based approach to delivering quality data, with permission from the MIT Total Data Quality Management (TDQM) Program.
INTRODUCTION
A great deal of research in the area of data quality has concentrated on providing the data consumers with quality indicators (see Table 1) they can use in judging data quality. However, the data consumer must still grapple with the task of analyzing the "metadata" provided and making the appropriate conclusions as to the quality of the data. The knowledge-based model, on the other hand, was developed to assist data consumers in judging if the quality of data meets their requirements, by reasoning about information critical to data quality judgment. It does this by using a data quality reasoner to help in assessing, from the user's perspective, the degree to which data meets certain requirements. Because requirements of data depend to a large extent on the intended usage of the data, the model addresses the issue of how to deal with user- or application-specific quality parameters with a knowledge-based approach. It also addresses the issue of how to represent relationships among quality parameters and how to reason with such relationships to draw insightful knowledge about overall data quality. To do so, it assumes that data quality parameters, such as those shown in Table 2, are available for use.

Table 1: Data Quality Indicators

Indicator            data #1     data #2       data #3
Source               DB#1        DB#2          DB#3
Creation-time        6/11/92     6/9/92        6/2/92
Update frequency     Daily       Weekly        Monthly
Collection method    barcode     entry clerk   radio freq.
Table 2: Data Quality Parameters

Parameter       data #1   data #2   data #3
Credibility     High      Medium    Medium
Timeliness      High      Low       Low
Accuracy        High      Medium    Medium
The data quality reasoner is a simple data quality judgment model based on the notion of a "census of needs." It applies a knowledge-based approach in data quality judgment to provide flexibility advantages in dealing with the subjective, decision-analytic nature of data quality judgment. The data quality reasoner provides a framework for representing and reasoning with local relationships among quality parameters to produce an overall data quality level to assist data consumers in judging data quality, particularly when data involved in decision-making come from different and unfamiliar sources.
QUALITY INDICATORS AND QUALITY PARAMETERS
The essential distinction between quality indicators and quality parameters is that quality indicators are intended primarily to represent objective information about the data manufacturing process [7]. Quality parameters, on the other hand, can be user-specific or application-specific, and are derived from either the underlying quality indicators or other quality parameters. The model presented in this chapter uses a "quality hierarchy" in which a single quality parameter can be derived from n underlying quality parameters. Each underlying quality parameter, in turn, could be derived from either its underlying quality parameters or quality indicators. For example, a user may conceptualize the quality parameter Credibility as one depending on underlying quality parameters such as Source-reputation and Timeliness. The quality parameter Source-reputation, in turn, can be derived from quality indicators such as the number of times that a source supplies obsolete data. The model assumes that such derivations are complete, and that relevant quality parameter values are available.
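Such a derivation might be expressed as a simple user-supplied rule. The thresholds, names, and the choice of "take the weakest value" as the combination rule below are illustrative assumptions of ours, not part of the model.

```python
def source_reputation(obsolete_supply_count):
    """Derive the Source-reputation parameter from a quality indicator:
    the number of times the source has supplied obsolete data."""
    if obsolete_supply_count == 0:
        return "High"
    return "Medium" if obsolete_supply_count <= 3 else "Low"

def credibility(source_reputation_value, timeliness_value):
    """Derive Credibility from two underlying quality parameters by taking
    the weaker of the two values (one possible user-defined rule)."""
    ranks = {"Low": 0, "Medium": 1, "High": 2}
    return min((source_reputation_value, timeliness_value), key=ranks.get)

print(credibility(source_reputation(1), "High"))   # -> Medium
```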
OVERVIEW
In general, several quality parameters may be involved in determining overall data quality. This raises the issue of how to specify the degree to which each quality parameter contributes to overall data quality. One approach is to specify the degree, in some absolute terms, for each quality parameter. It may not, however, be practical to completely specify such values. Rather, people often conceptualize local relationships, such as "Timeliness is more important than the credibility of a source for this data, except when timeliness is low." In such a case, if timeliness is high and source-credibility is medium, the data may be of high quality. The model presented in this chapter provides a formal specification of such local "dominance relationships" between quality parameters.
The issue is, then, how to use these local dominance relationships between quality parameters, and what can be known about data quality from them. Observe that each local relationship between quality parameters specifies the local relative significance of quality parameters. One way to use local dominance relationships would be to rank and enumerate quality parameters in the order of significance implied by local dominance relationships. Finding a total ordering of quality parameters consistent with local relative significance, however, can be computationally intensive. In addition, a complete enumeration of quality parameters may contain too much information to convey to data consumers any insights about overall data quality. This chapter provides a model to help data consumers raise their levels of knowledge about the data they use, and thus make informed decisions. Such a process represents data quality filtering.
RELATED WORK
The decision-analytic approach (as summarized in [3]) and the utility analysis under multiple objectives approach (as summarized in [1]) describe solution approaches for specifying preferences and resolving multiple objectives. The preference structure of a decision-maker or evaluator is specified as a hierarchy of objectives. Through a decomposition of objectives using either subjectively defined mappings or formal utility analyses, the hierarchy can be reduced to an overall value. The decision-analytic approach is generally built around the presupposition of the existence of continuous utility functions. The approach presented in this chapter, on the other hand, does not require that dominance relations between quality parameters be continuous functions, or that their interactions be completely specified. It only presupposes that some local dominance relationships between quality parameters exist. The representational scheme presented in this chapter is similar to those used to represent preferences in subdisciplines of Artificial Intelligence, such as Planning [10, 11]. That research effort, however, has focused primarily on issues involved in representing preferences and much less so on computational mechanisms for reasoning with such knowledge.
DATA QUALITY REASONER
We now discuss the data quality reasoner (DQR). DQR is a data quality judgment model that derives an overall data quality value for a particular data element, based on the following information:
1. A set QP of underlying quality parameters qi that affect data quality: QP = {q1, q2, ..., qn}.
2. A set DR of local dominance relationships between quality parameters in QP.
In particular, we address the following fundamental issues that arise when considering the use of local relationships between quality parameters in data quality judgment:
• How to represent local dominance relationships between quality parameters.
• What to do with such local dominance relationships.
Representation of Local Dominance Relationships
This section discusses a representation of local dominance relationships between quality parameters. To facilitate further discussion, additional notation is introduced below. For any quality parameter qi, let Vi denote the set of values that qi can take. In addition, the following notation is used to describe value assignments for quality parameters. For any quality parameter qi, the value assignment qi := v (for example, Timeliness := High) represents the instantiation of the value of qi as v, for some v in Vi. Value assignments for quality parameters, such as qi := v, are called "quality-parameter value assignments". A quality parameter that has a particular value assigned to it is referred to as an instantiated quality parameter. For some quality parameters q1, q2, ..., qn, and for some integer n ≥ 1, q1 ∧ q2 ∧ ... ∧ qn represents a conjunction of quality parameters. Similarly, q1:=v1 ∧ q2:=v2 ∧ ... ∧ qn:=vn, for some vi in Vi, for all i = 1, 2, ..., n, represents a conjunction of quality-parameter value assignments. Note that the symbol ∧ used in the above statement denotes the logical conjunction, not the set intersection, of the events asserted by instantiating quality parameters. Finally, quality-merge notation is used to state that data quality is affected by quality parameters: the statement (q1 ∧ q2 ∧ ... ∧ qn) means that data quality is affected by quality parameters q1, q2, ..., and qn. This statement is called a quality-merge statement, and is read as "the quality merge of q1, q2, ..., and qn." The simpler notation (q1, q2, ..., qn) can also be used. A quality-merge statement is said to be instantiated if all quality parameters in it are instantiated to certain values. For example, the statement (q1:=v1 ∧ q2:=v2 ∧ ... ∧ qn:=vn) is an instantiated quality-merge statement of (q1, q2, ..., qn), for some vi in Vi and for all i = 1, 2, ..., n. The following defines a local dominance relationship among quality parameters.
Definition 1 (Dominance relation): Let E1 and E2 be two conjunctions of quality-parameter value assignments. E1 is said to dominate E2, denoted by E1 >d E2, if and
only if ⊕(E1, E2, +) is reducible to ⊕(E1, +), where "+" stands for the conjunction of value assignments for the rest of the quality parameters in QP, which appear neither in E1 nor in E2. Note that, as implied by "+", this definition assumes the context-insensitivity of reduction: ⊕(E1, E2, +) can be reduced to ⊕(E1, +) regardless of the values of the quality parameters in QP that are not involved in the reduction. Moreover, "+" implies that these uninvolved quality parameters in QP remain unaffected by the application of the reduction. For example, consider a quality-merge statement that consists of quality parameters Source-Credibility, Interpretability, Timeliness, and others. Suppose that when Source-Credibility and Timeliness are High and Interpretability is Medium, Interpretability dominates the other two. This dominance relationship can be represented as follows: "Interpretability := Medium >d Source-Credibility := High ∧ Timeliness := High". Then ⊕(Source-Credibility := High, Interpretability := Medium, Timeliness := High, +) is reducible to the quality-merge statement ⊕(Interpretability := Medium, +). The evaluation of the overall data quality for a particular data element requires information about a set of quality parameters that play a role in determining the overall quality, QP = {q1, q2, ..., qn}, and a set DR of local dominance relationships between quality parameters in QP. The information provided in QP is interpreted by DQR as follows: "The overall quality is the result of the quality merge of quality parameters q1, q2, ..., and qn, i.e., ⊕(q1, q2, ..., qn)." Local dominance relationships in DR are used to derive an overall data quality value. However, it may be unnecessary or even impossible to explicitly state each and every plausible relationship between quality parameters in DR. Assuming incompleteness of preferences in quality-parameter relationships, the model approaches the incompleteness issue with the following default assumption: for any two conjunctions of quality parameters, if no information on dominance relationships between them is available, then they are assumed to be in the indominance relation. The indominance relation is defined as follows:
Definition 2 (Indominance relation): Let E1 and E2 be two conjunctions of quality-parameter value assignments. E1 and E2 are said to be in the indominance relation if neither E1 >d E2 nor E2 >d E1. When two conjunctions of quality parameters are indominant, a data consumer may specify the result of their quality merge according to his or her needs.
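A local dominance relationship of this kind can be captured directly as data. The following is a minimal Python sketch of one possible encoding, with hypothetical names chosen for this illustration (value assignments as (parameter, value) pairs, conjunctions as frozensets, and DR as a set of dominance pairs); it illustrates the representation described above rather than reproducing anything from the book.

# A quality-parameter value assignment, e.g. Timeliness := High,
# is encoded here as a (parameter, value) pair.
interp_med = ("Interpretability", "Medium")
src_high = ("Source-Credibility", "High")
time_high = ("Timeliness", "High")

# A conjunction of value assignments is an unordered collection -> frozenset.
# DR holds pairs (E1, E2) meaning E1 >d E2.
DR = {
    (frozenset({interp_med}), frozenset({src_high, time_high})),
}

def relation(e1, e2, dr):
    """Classify two conjunctions: dominance one way, the other way, or,
    by the default assumption, the indominance relation."""
    e1, e2 = frozenset(e1), frozenset(e2)
    if (e1, e2) in dr:
        return "E1 >d E2"
    if (e2, e1) in dr:
        return "E2 >d E1"
    return "indominant"

print(relation({interp_med}, {src_high, time_high}, DR))  # E1 >d E2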
Reasoning Component of DQR
The previous section discussed how to represent local relationships between quality parameters. The next question that arises is how to derive overall data quality from such local dominance relationships, i.e., how to evaluate a quality-merge statement
based on such relationships. This task, simply referred to as the "data-quality-estimating problem," is summarized as follows:
Data-Quality-Estimating Problem: Let DR be a set of local dominance relationships between quality parameters q1, q2, ..., and qn. Compute ⊕(q1, q2, ..., qn), subject to the local dominance relationships in DR.
An instance of the data-quality-estimating problem is represented as a list of a quality-merge statement and a corresponding set of local dominance relationships, i.e., (⊕(q1, q2, ..., qn), DR). The rest of this section presents a framework for solving the data-quality-estimating problem, based on the notion of "reduction". The following axiom defines the data quality value when only one quality parameter is involved in the quality merge.
Axiom 1 (Quality Merge): For any quality-merge statement ⊕(q1, q2, ..., qn), if n = 1, then the value of ⊕(q1, q2, ..., qn) is equal to that of q1.
Quality-merge statements with more than one quality parameter are reduced to statements with a smaller number of quality parameters. As implied by Definition 1 and the default assumption, any two conjunctions of quality-parameter value assignments can be either in the dominance relation or in the indominance relation. Below, we define axioms which provide a basis for the reduction. Axiom 2 specifies that two conjunctions cannot be both in the dominance relation and in the indominance relation.
Axiom 2 (Mutual Exclusivity): For any two conjunctions E1 and E2 of quality-parameter value assignments, E1 and E2 are related to each other in exactly one of the following ways: 1. E1 >d E2; 2. E2 >d E1; 3. E1 and E2 are in the indominance relation.
Axiom 3 defines the precedence of the dominance relation over the indominance relation. This implies that while evaluating a quality-merge statement, quality parameters in the dominance relation are considered before those not in the dominance relation.
Axiom 3 (Precedence of >d): The dominance relation takes precedence over the indominance relation.
Reduction-Based Evaluation: A reduction-based evaluation scheme is any evaluation process in which the reduction operations take precedence over all other evaluation operations.
Definition 1 and Axiom 3 allow the reduction-based evaluation strategy to be used to solve the data-quality-estimating problem for quality-merge statements with more than one quality parameter. The use of dominance relationships to reduce a quality-merge statement raises the issue of which local dominance relationships should be applied first, i.e., the order in which local dominance relationships are applied. Unfortunately, the reduction of a quality-merge statement is not always well defined. In particular, a quality-merge statement can be reduced in more than one way, depending on the order in which the reduction is performed. For example, consider an instance of the data-quality-estimating problem, (⊕(q1, q2, q3, q4, q5, q6), DR), where DR consists of the following local dominance relationships: q1 >d q2, q2 >d q3, q4 >d q5, q5 >d q6, (q1 ∧ q4) >d (q3 ∧ q6), (q2 ∧ q5) >d (q1 ∧ q4). Then, the quality-merge statement ⊕(q1, q2, q3, q4, q5, q6) can be reduced to more than one irreducible quality-merge statement, as shown below:
• In case q1 >d q2, q4 >d q5, and (q1 ∧ q4) >d (q3 ∧ q6) are applied in that order, ⊕(q1, q2, q3, q4, q5, q6) is reducible to ⊕(q1, q4), as follows:
⊕(q1, q2, q3, q4, q5, q6) = ⊕(q1, q3, q4, q5, q6), by applying q1 >d q2,
= ⊕(q1, q3, q4, q6), by applying q4 >d q5,
= ⊕(q1, q4), by applying (q1 ∧ q4) >d (q3 ∧ q6).
• In case q2 >d q3, q5 >d q6, and (q2 ∧ q5) >d (q1 ∧ q4) are applied in that order, ⊕(q1, q2, q3, q4, q5, q6) is reducible to ⊕(q2, q5), as follows:
⊕(q1, q2, q3, q4, q5, q6) = ⊕(q1, q2, q4, q5, q6), by applying q2 >d q3,
= ⊕(q1, q2, q4, q5), by applying q5 >d q6,
= ⊕(q2, q5), by applying (q2 ∧ q5) >d (q1 ∧ q4).
As illustrated in this simple example, the reduction of a quality-merge statement is not always well defined.
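The order dependence in the example above can be replayed in a few lines of Python; the rule encoding (frozensets of parameter names) and the apply_rules helper below are assumptions introduced only for this illustration.

# Each rule (dominator, dominated) removes the dominated conjunction when
# both conjunctions are present in the current quality-merge statement.
rules = {
    "r1": (frozenset({"q1"}), frozenset({"q2"})),
    "r2": (frozenset({"q2"}), frozenset({"q3"})),
    "r3": (frozenset({"q4"}), frozenset({"q5"})),
    "r4": (frozenset({"q5"}), frozenset({"q6"})),
    "r5": (frozenset({"q1", "q4"}), frozenset({"q3", "q6"})),
    "r6": (frozenset({"q2", "q5"}), frozenset({"q1", "q4"})),
}

def apply_rules(statement, order):
    """Apply the named dominance rules in the given order, when applicable."""
    current = set(statement)
    for name in order:
        dominator, dominated = rules[name]
        if dominator <= current and dominated <= current:
            current -= dominated
    return current

qms = {"q1", "q2", "q3", "q4", "q5", "q6"}
print(apply_rules(qms, ["r1", "r3", "r5"]))  # {'q1', 'q4'}
print(apply_rules(qms, ["r2", "r4", "r6"]))  # {'q2', 'q5'}

The two application orders leave different irreducible sets, which is exactly the ambiguity the first-order restriction below is designed to rule out.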
FIRST-ORDER DATA QUALITY REASONER
This section investigates a simpler data quality reasoner that guarantees the well-defined reduction of quality-merge statements by making certain simplifying assumptions. To facilitate the next step of the derivation, an additional definition is introduced.
Definition 3 (First-Order Dominance Relation): For any two conjunctions E1 and E2 of quality parameters such that E1 and E2 are in the dominance relation, E1 and E2 are said to be in the first-order dominance relation if each of E1 and E2 consists of one and only one quality parameter.
The first-order data quality reasoner, in short called DQR1, is a data quality judgment model that satisfies the following:
First-Order Data Quality Reasoner (DQR1)
1. Axioms 1, 2, and 3 hold.
2. Only indominance and first-order dominance relationships are allowed.
3. Higher-order dominance relationships, such as (q1 ∧ q2) >d q4 or (q4 ∧ q5) >d (q1 ∧ q2), are not allowed.
In addition, the first-order data quality reasoner requires that the dominance relation be transitive. This implies that for any conjunctions of quality-parameter value assignments E1, E2, and E3, if E1 >d E2 and E2 >d E3, then E1 >d E3. Transitivity of the dominance relation implies the need for an algorithm to verify that, when presented with an instance of the data-quality-estimating problem (⊕(q1, q2, ..., qn), DR), dominance relationships in DR do not conflict with each other. Well-known graph algorithms can be used for performing this check [5]. Quality-merge statements can be classified into groups with respect to levels of reducibility, as defined below.
Definition 4 (Irreducible Quality-Merge Statement): For any instantiated quality-merge statement e = ⊕(q1:=v1, q2:=v2, ..., qn:=vn) such that n ≥ 2, for some vi in Vi and for all i = 1, 2, ..., n, e is said to be irreducible if, for any pair of quality-parameter value assignments in e (say qi:=vi and qj:=vj), qi:=vi and qj:=vj are in the indominance relation. Similarly, any quality-merge statement that consists of one and only one quality parameter is said to be irreducible.
Definition 5 (Completely-Reducible Quality-Merge Statement): For any instantiated quality-merge statement e = ⊕(q1:=v1, q2:=v2, ..., qn:=vn) such that n ≥ 2, for
some vi in Vi and for all i = 1, 2, ..., n, e is said to be completely reducible if, for any pair of quality-parameter value assignments in e, say qi:=vi and qj:=vj, qi:=vi and qj:=vj are in the dominance relation.
The next two sections discuss algorithms for evaluating a quality-merge statement in DQR1. This process is diagrammed in Figure 4.1. Algorithm Q-Merge is the top-level algorithm that receives as input a quality-merge statement and the corresponding quality-parameter relationship set. Within Algorithm Q-Merge there is a two-stage process. First, the given quality-merge statement is instantiated accordingly. It then calls Algorithm Q-Reduction to reduce the quality-merge statement into its corresponding irreducible form.
Figure 4.1. The Quality-Merge Statement (QMS) Evaluation Process
The Q-Reduction Algorithm
This section describes an algorithm, called Q-Reduction, for reducing a quality-merge statement into an irreducible quality-merge statement, according to local dominance relationships between quality parameters. We continue to assume that all dominance relationships are first-order.
Figure 4.2: Q-Reduction Algorithm
Algorithm Q-Reduction in Figure 4.2 takes as input an instantiated quality-merge statement e and DR, and returns as output an irreducible quality-merge statement of e. The instantiated quality-merge statement e is reduced as follows. For expository purposes, suppose that e = ⊕(q1:=v1, q2:=v2, ..., qn:=vn), for some vi in Vi, for all i = 1, 2, ..., n, and let Ω be a dynamic set of quality-parameter value assignments, which is initialized to {q1:=v1, q2:=v2, ..., qn:=vn}. For any pair of quality-parameter value assignments qi:=vi and qj:=vj in e, if qi:=vi >d qj:=vj is a member of DR, then e is reducible to a quality-merge statement with the quality parameters in Ω, less qj:=vj (by Definition 1). This allows us to remove qj:=vj from Ω, if both qi:=vi and qj:=vj are elements of Ω. The process of removing dominated quality parameters continues until no pair of the quality parameters in Ω is related in the dominance relation. Let Ω' denote the modified Ω produced at the end of this removal process. The quality merge of the quality parameters in Ω' is the corresponding irreducible quality-merge statement of e. The algorithm returns ⊕(Ω'). It is proven in [2] that the Q-Reduction algorithm shown in Figure 4.2 always produces a unique output in the first-order data quality reasoner, provided that all dominance relations are first-order.
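Since the body of Figure 4.2 does not reproduce well here, the following is a minimal Python sketch of the removal loop as described above, under the first-order assumption; the data layout (assignments as (parameter, value) pairs, DR as a set of first-order dominance pairs) is an illustrative choice, not the book's pseudocode.

def q_reduction(e, dr):
    """Reduce an instantiated quality-merge statement e to its irreducible form.

    e  : dict mapping quality parameter -> assigned value, e.g. {"Timeliness": "High"}
    dr : set of first-order dominance pairs ((qi, vi), (qj, vj)), meaning
         qi := vi  >d  qj := vj
    Returns the irreducible set of quality-parameter value assignments (Omega').
    """
    omega = {(q, v) for q, v in e.items()}   # dynamic set of value assignments
    changed = True
    while changed:                           # repeat until no dominance pair remains
        changed = False
        for dominator, dominated in dr:
            if dominator in omega and dominated in omega:
                omega.discard(dominated)     # remove the dominated assignment
                changed = True
    return omega                             # corresponds to the statement of Omega'

Because DQR1 restricts DR to first-order, transitive, conflict-free relationships, the set returned does not depend on the order in which pairs are examined, which is the uniqueness property cited from [2].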
Q-Merge Algorithm
When presented with an instance of the data-quality-estimating problem (⊕(q1, q2, ..., qn), DR) for some integer n, the Q-Merge algorithm first instantiates the given quality-merge statement. Then, the instantiated quality-merge statement is reduced until the reduction process results in another instantiated quality-merge statement that cannot be reduced any further (using the Q-Reduction algorithm). This raises the issue of how to evaluate an irreducible quality-merge statement. Unfortunately, the evaluation of an irreducible quality-merge statement is not always well defined. When evaluating an irreducible quality-merge statement, the number of orders in which the quality-merge operation can be applied grows exponentially with the number of quality parameters in the statement. In particular, certain quality-merge statements may be merged in more than one way, depending on the order in which the merge is performed, and the set of possible results might include every element of the Vi's. This model evades the problem by presenting the quality-parameter value assignments in the irreducible quality-merge statement returned by the Q-Reduction algorithm, so that a user may use the information presented according to his or her needs. Figure 4.3 summarizes the Q-Merge algorithm.
Q-Merge Algorithm
Input: (e, DR), where e = ⊕(q1, q2, ..., qn), for some integer n, and DR is a set of local dominance relationships between quality parameters q1, q2, ..., and qn.
Output: Overall data quality value produced by evaluating e.
1. Instantiate e.
   ;; Suppose that q1, q2, ..., and qn are instantiated as v1, v2, ..., and vn, respectively,
   ;; for some vi in Vi, for all i = 1, 2, ..., and n.
2. e' ← IF (n = 1) THEN e ELSE Q-Reduction(⊕(q1:=v1, q2:=v2, ..., qn:=vn), DR)
3. Present quality-value assignments in e'.
Figure 4.3: Q-Merge Algorithm
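A corresponding Python sketch of Q-Merge, reusing the q_reduction function from the earlier sketch, might look as follows; the instantiate callable is a hypothetical stand-in for however quality-parameter values are actually obtained (a question the chapter's conclusion leaves open).

def q_merge(parameters, dr, instantiate):
    """Evaluate a quality-merge statement over the given parameters under DR.

    parameters  : list of quality parameter names
    dr          : set of first-order dominance pairs, as in q_reduction
    instantiate : callable mapping a parameter name to its current value
                  (e.g. looked up from quality indicators)
    Returns the value assignments presented to the data consumer.
    """
    # Step 1: instantiate the quality-merge statement.
    e = {q: instantiate(q) for q in parameters}
    # Step 2: a single parameter needs no reduction (Axiom 1); otherwise reduce.
    if len(parameters) == 1:
        reduced = {(q, v) for q, v in e.items()}
    else:
        reduced = q_reduction(e, dr)
    # Step 3: present the irreducible quality-parameter value assignments.
    return dict(reduced)

# Illustrative use with made-up values and one first-order dominance relationship:
dr = {(("Interpretability", "Medium"), ("Source-Credibility", "High"))}
values = {"Interpretability": "Medium", "Source-Credibility": "High"}
print(q_merge(list(values), dr, values.get))   # {'Interpretability': 'Medium'}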
CONCLUSION
In this chapter, we have presented a knowledge-based model that tailors data quality judgment to data consumers' needs. The model provides flexibility in representing the specific requirements of data consumers by allowing them to specify local dominance relationships between quality parameters. It will then perform the reduction of quality-merge statements and derive a value for overall data quality, i.e., a measurement of data quality.
The model presented has limitations that have to be overcome in order for it to be of practical use. The following is a list of some of the limitations of the model and suggestions for its enhancement.
Higher-order data quality reasoner: The first-order data quality reasoner evades the problem of ill-defined reduction by prohibiting higher-order relationships. Real-world problems, however, often involve more complex relationships than first-order relationships between quality parameters. In order to deal with higher-order relationships, both the representational and algorithmic components of the first-order data quality reasoner would need to be extended.
Data acquisition / a hierarchy of quality indicators and parameters: This model assumed that values of quality parameters are available so that quality-merge statements can be instantiated properly. Issues of how to represent such values and how to streamline their delivery to the data quality reasoner, however, must be addressed. One approach to these issues would be to organize quality parameters and quality indicators in a hierarchy. Then, to each data element or type can be attached information about how to compute a value of a quality parameter. Such a hierarchy would allow the derivation of a quality-parameter value from its underlying quality parameters and quality indicators. A tool for automatically constructing such a hierarchy and computing quality-parameter values would enhance the utility of the data quality reasoner.
Knowledge acquisition: The capability of using local dominance relationships, which are typically user-specific or application-specific, allows us to build systems more adaptable to customers' needs. As application domains grow complex, however, it becomes increasingly difficult to state all the relationships that must be known. Such knowledge acquisition bottlenecks could be alleviated through the development of a computer program for guiding the process of acquiring relationships between quality parameters.
User interface for cooperative problem solving: The different orders in which quality parameters in an irreducible quality-merge statement are evaluated may result in different values. This model dealt with the need to evaluate an irreducible quality-merge statement by simply presenting information on the quality parameters in the irreducible quality-merge statement. Development of a user interface, which allows
evaluating irreducible quality-merge statements cooperatively with a data consumer, could lessen the problem.
The model presented in this chapter provides a first step toward the development of a system that can assist data consumers in judging whether data meet their requirements. This is particularly important when decision making involves data from different, foreign sources.
References
[1] Chankong, V. and Y. Y. Haimes, Multiobjective Decision Making: Theory and Methodology. North-Holland Series in System Science and Engineering, ed. A. P. Sage. Elsevier Science, New York, NY, 1983.
[2] Jang, Y. and Y. R. Wang, Data Quality Calculus: A data-consumer-based approach to delivering quality data (No. CISL-91-08). Composite Information Systems Laboratory, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02139, 1991.
[3] Keeney, R. L. and H. Raiffa, Decisions with Multiple Objectives: Preferences and Value Tradeoffs. John Wiley & Sons, New York, 1976.
[4] Storey, V. C. and R. Y. Wang, "An Analysis of Quality Requirements in Database Design," in Proceedings of the 1998 Conference on Information Quality. Massachusetts Institute of Technology: pp. 64-87, 1998.
[5] Cormen, T. H., C. E. Leiserson and R. L. Rivest, Introduction to Algorithms. MIT Press, Cambridge, MA, USA, 1990.
[6] Wang, R. Y., M. P. Reddy and H. B. Kon, "Toward quality data: An attribute-based approach," Decision Support Systems (DSS), 13, 1995, pp. 349-372.
[7] Wang, R. Y., M. P. Reddy and H. B. Kon, "An Attribute-based Model of Data for Data Quality Management," 1992.
[8] Wang, Y. R. and S. E. Madnick, "A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective," in Proceedings of the 16th International Conference on Very Large Data Bases (VLDB). Brisbane, Australia: pp. 519-538, 1990.
[9] Wang, Y. R. and S. E. Madnick, "Where Does the Data Come From: Managing Data Integration with Source Tagging Capabilities," 1990.
[10] Wellman, M. P., Formulation of Tradeoffs in Planning Under Uncertainty. Pitman and Morgan Kaufmann, 1990.
[11] Wellman, M. P. and J. Doyle, "Preferential Semantics for Goals," in Proceedings of the 9th National Conference on Artificial Intelligence, 1991.
CHAPTER 5
Developing a Data Quality Algebra
When accessing various data sources, it is important to know the meaning of the data [14, 15] and the quality of the data retrieved [11]. Despite the elegance of the relational theory [4, 5] and mechanisms such as integrity constraints [6, 7] to ensure that the database state reflects the real-world state [9], many databases contain deficient data [16]. If the data in the underlying base relations are deficient, then the query results obtained from these base relations may also contain deficient data, even if the query processing mechanism is flawless. These deficient data, in turn, may lead to erroneous decisions that can result in significant social and economic impacts. In this chapter, we present a mechanism based on the relational algebra to estimate the accuracy of data derived from base relations. We will introduce a data
quality algebra that estimates the quality of query results given the quality characteristics of the underlying base relations, in the context of a single relational database environment. We will focus on the accuracy dimension. Thus, the accuracy dimension will be used in this chapter to refer to data quality although, as we saw in Chapter 1, various dimensions such as interpretability, completeness, consistency, and timeliness have been identified as typical data quality dimensions [1, 2]. (This chapter is adapted from the M.I.T. TDQM working paper, A Data Quality Algebra for Estimating Query Result Quality, co-authored by M. P. Reddy and Richard Wang, with permission from the MIT Total Data Quality Management (TDQM) Program.)
Assumptions
We make the following assumptions:
Assumption A1: Query processing is flawless.
Assumption A2: Query processing operations are based on the relational algebra.
Assumption A3: Accuracy estimates of relevant base relations are given at the relational level.
Assumption A4: The accuracy of tuples in base relations can be validated.
The first two assumptions define the scope of the problem to be one that focuses on developing a mechanism based on the relational algebra for estimating the accuracy of data derived from base relations. The last two assumptions are elaborated on below.
In principle, accuracy information for the relevant base relations should be provided as the input, at a certain level of granularity, for the estimation of the accuracy of data derived from these base relations. In the relational model, a base relation must be at least in first normal form, which means every attribute value is atomic. Each atomic attribute value, however, can be either accurate or inaccurate and can be tagged accordingly. There are also emerging commercial tools for estimating the quality characteristics of a base relation; see, for example, QDB Solutions' QDB/Analyze. The attribute-based model presented in Chapter 2 can be used to store data accuracy values [17-19]; however, measurement and storage of the accuracy of each cell value is an expensive proposition. Similarly, it is also relatively expensive to measure and store accuracy information at the tuple level. Therefore, Assumption A3 states that accuracy estimates are available at the relational level and, in addition, these estimates are based on some tuple-level accuracy measures.
To measure the accuracy of a tuple, some validation procedure should be available, such as those described by Janson [8] and Paradice & Fuerst [12].
Notation
The notation used is summarized below.
At      The deterministic accuracy of a tuple t
N       The number of accurate tuples
M       The number of mismember tuples
P       The number of tuples with at least one inaccurate attribute value
AR      The ratio of accurate tuples in relation R to the total number of tuples in R
IMR     The ratio of mismember tuples in relation R to the total number of tuples in R
IAR     The ratio of tuples in R with at least one inaccurate attribute value to the total number of tuples in R
Ska     The accurate portion of relation Sk
Ski     The inaccurate portion of relation Sk
Skim    The inaccurate portion of relation Sk due to inaccurate membership
Skia    The inaccurate portion of relation Sk due to inaccurate attribute values
(The following notation is adopted from the relational model)
|R|     Cardinality of relation R
σc      Selection operation with selection condition c
ΠP      Projection operation with projection list given by the set P
×       Cartesian product operation
∪       Union operation
−       Difference operation
Definitions
For exposition purposes, let DEPT1_EMP (Table 1) and DEPT2_EMP (Table 2) be two employee relations for two departments, DEPT1 and DEPT2. Each employee relation consists of three attributes: EMP_ID, EMP_NAME, and EMP_SAL.
Table 1: DEPT1_EMP
EMP_ID   EMP_NAME   EMP_SAL
1        Henry      30,000
2        Jacob      32,000 *
3        Marshall   34,000
4        Alina      33,000
5        Roberts    50,000
6        Ramesh     45,000 *
7        Patel      46,000
8        Joseph     55,000 *
9        John       60,000
10       Arun       50,000

Table 2: DEPT2_EMP
EMP_ID   EMP_NAME   EMP_SAL
1        Henry      30,000
2        Jacob      32,000 *
5        Roberts    55,000 *
6        Ramesh     45,000
9        John       60,000
11       Nancy      39,000
12       James      37,000
13       Peter      46,000
14       Ravi       55,000
15       Anil       45,000 *
In the relational model, a relation represents a time-varying subset of a class of instances sharing the same domain values that correspond to its relational scheme. For example, the relation DEPT1_EMP is a subset or a complete set of the employees in DEPT1. Every tuple is implicitly assumed to correspond to a member of the class of instances of the same entity type as defined by a relational scheme. In reality, from a data quality perspective, this implicit assumption may not hold. For example, some tuples in DEPT1_EMP may not belong to DEPT1 (for example, a database administrator may, by mistake, insert into DEPT1 an employee tuple that belongs to another department, or may not have had time to delete a tuple that corresponded to a former employee who has left DEPT1). A relation containing one or more tuples that do not meet the above implicit assumption is said to have tuple mismembership. A tuple that does not satisfy the implicit assumption is referred to as a mismember tuple; otherwise, it is referred to as a member tuple. In order to determine the accuracy of an attribute value in a tuple, one must know whether the attribute value reflects the real-world state. We refer to an attribute value as being accurate if it reflects the real-world state; otherwise, it is inaccurate. To validate that the gender, age, and income data for a particular person are accurately stored in a tuple, it is necessary to know that the tuple indeed belongs to this person, and that the attribute values in the tuple reflect the real-world state. The concepts of member tuple and accurate attribute values lead to Definition D1 below.
Definition D1 - Deterministic Tuple Accuracy: A tuple t in a relation R is accurate if and only if it is a member tuple and every attribute value in t is accurate; otherwise, it is inaccurate. Let At denote the accuracy of t. If t is accurate, then the value of At is 1; if t is inaccurate, then the value of At is 0.
By Definition D1, a tuple can be inaccurate either because it is a mismember tuple or because it has at least one inaccurate attribute value. For example, if an
employee belongs to the department, and the employee's ID number, name, and salary are accurately stored in the tuple denoted by emp_tuple, then emp_tuple is said to be accurate, or Aemp_tuple = 1. On the other hand, Aemp_tuple = 0 if the employee does not belong to the department or if any of the ID number, name, or salary is inaccurate. Suppose that in the relation DEPT1_EMP, the tuple for Employee 2 contains an inaccurate salary, the tuple for Employee 8 contains an inaccurate name, and Employee 6 no longer works for DEPT1; in DEPT2_EMP, the tuples for Employees 2 and 5 contain inaccurate salaries, while the tuple for Employee 15 contains an inaccurate name. These tuples are marked with an asterisk in Tables 1 and 2. By Definition D1, the tuple for Employee 6 is a mismember tuple, and the other tuples contain inaccurate attribute values.
Definition D2 - Relation Accuracy: Let AR denote the accuracy of relation R, N the number of accurate tuples in R, and |R| the cardinality of relation R. Then
AR = N / |R|
Let IMR denote the inaccuracy due to mismember tuples, and IAR denote the inaccuracy due to attribute value inaccuracy for the relation R. Let M denote the number of mismember tuples in which every attribute value is accurate. (If a mismember tuple in the relation R has an inaccurate attribute value, then it will be counted under IAR but not under IMR.) If P denotes the number of tuples with at least one inaccurate attribute value, then it follows that
IMR = M / |R|
IAR = P / |R|
AR + IMR + IAR = 1
As mentioned earlier, an accuracy tag could be associated with each tuple in a base relation. For a large database, however, measurement and storage of these accuracy tags is expensive. As such, in this chapter we adopt a statistical approach to provide accuracy estimates for base relations. In this approach, we measure four parameters for each base relation: (i) the number of tuples in the relation, (ii) the accuracy of the relation, (iii) the inaccuracy of the relation due to mismembership, and (iv) the inaccuracy of the relation due to inaccurate attribute values. The collection of these four parameters for each relation in a component database constitutes the quality profile of the database. Let T denote the relation that contains a random sample of tuples of a relation R. In T, let NT denote the number of accurate tuples, MT the number of mismember tuples, and PT the number of tuples with at least one inaccurate attribute value. Using Definition D2, we get
AR = NT / |T|
IMR = MT / |T|
IAR = PT / |T|
In Table 1, AR = 0.7, IMR = 0.1, and IAR = 0.2. In Table 2, AR = 0.7, IMR = 0, and IAR = 0.3. Suppose that DEPT1_EMP and DEPT2_EMP (Tables 1-2) are two relations in a database. The quality profile of the database would appear as follows:
Relation Name   Size   AR    IMR   IAR
DEPT1_EMP       10     0.7   0.1   0.2
DEPT2_EMP       10     0.7   0.0   0.3
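As a worked illustration of Definition D2 and the sampling-based quality profile, the following Python sketch computes AR, IMR, and IAR from per-tuple tags; the tag names and the quality_profile function are assumptions made for this example, not notation from the chapter.

from collections import Counter

def quality_profile(tags, relation_size):
    """Estimate (AR, IMR, IAR) for a relation from a random sample of tagged tuples.

    tags          : one tag per sampled tuple: 'accurate', 'mismember'
                    (all attribute values accurate), or 'attr_error'
                    (at least one inaccurate attribute value)
    relation_size : |R|, the cardinality of the full relation
    """
    counts = Counter(tags)
    sample_size = len(tags)                      # |T|
    ar = counts["accurate"] / sample_size        # AR  = NT / |T|
    imr = counts["mismember"] / sample_size      # IMR = MT / |T|
    iar = counts["attr_error"] / sample_size     # IAR = PT / |T|
    assert abs(ar + imr + iar - 1.0) < 1e-9      # AR + IMR + IAR = 1
    return {"Size": relation_size, "AR": ar, "IMR": imr, "IAR": iar}

# Tags for DEPT1_EMP, treating the whole relation as the sample
# (Employees 2 and 8 have attribute errors; Employee 6 is a mismember).
dept1_tags = ["accurate", "attr_error", "accurate", "accurate", "accurate",
              "mismember", "accurate", "attr_error", "accurate", "accurate"]
print(quality_profile(dept1_tags, relation_size=10))
# {'Size': 10, 'AR': 0.7, 'IMR': 0.1, 'IAR': 0.2}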
Definition D3 - Null Relation Accuracy: In D2, if the cardinality |R| is zero, then we define AR = 1.
In order to proceed with the analysis, we make the following additional assumption and use the definitions listed below:
Assumption A5: At the relational level, the error is uniformly distributed across tuples and attributes.
Definition D4 - Probabilistic Tuple Accuracy: At the relational level, all the tuples in a relation have the same probability of being accurate regardless of their deterministic tuple accuracy. Let AR denote the accuracy of relation R; then the probabilistic accuracy of each tuple in R is AR.
In Table 1, the deterministic tuple accuracy of Tuple 4 is 1 and that of Tuple 8 is 0. Further, AR = 0.7, and the probabilistic tuple accuracy for both Tuples 4 and 8 (in fact, for every tuple in DEPT1_EMP) is 0.7.
Definition D5 - Probabilistic Attribute Accuracy: All attributes in a base relation have the same probability of being accurate regardless of their respective deterministic accuracies.
Based on the above definition, one can compute the probabilistic attribute accuracy as follows. Let AR denote the accuracy of relation R, D the degree of the relation R, and X an attribute in relation R. Then, the probability that attribute X is accurate is given by (AR)^(1/D).
In Table 1, the accuracy of relation DEPT1_EMP is 0.7 and the probabilistic attribute accuracy of the attribute EMP_NAME is (0.7)^(1/3) ≈ 0.888. Assumption A5 is relatively strong and will not hold in general. From the theoretical viewpoint, however, it provides an analytic approach for estimating the accuracy of data derived from base relations, especially because the detailed accuracy profile of the derived data is rarely available on an a priori basis.
A DATA QUALITY ALGEBRA
In this section, we compute accuracy measures for reports based on the accuracy estimates of input relations for the five orthogonal algebraic operations: Selection, Projection, Cartesian product, Union, and Difference. Other traditional operators can be defined in terms of these operators [10]. We make Assumption A6 below:
Assumption A6: The cardinality of the resulting relation |R| is available.
In practice, |R| can be either counted after an operation is performed or estimated before an operation is performed. Algorithms for estimating |R| are given in [3]. Let AS, AS1, and AS2 denote the accuracy of relations S, S1, and S2, respectively. Let R denote the resulting relation of an algebraic operation on S (if a unary operation) or on S1 and S2 (if a binary operation). Let AR denote the accuracy of R.
Accuracy Estimation of Data Derived by Selection
Selection is a unary operation denoted as R = σc S, where 'σc' represents the selection condition. Furthermore, AS and AR must satisfy the following boundary conditions:
(i)  AS = 1 ⇒ AR = 1    (S1)
(ii) AS = 0 ⇒ AR = 0    (S2)
Using Definition D4, the probabilistic accuracy of each tuple in the relation S is given by AS. Since the selection operation selects a subset of tuples from S, it follows that the estimated number of accurate tuples in R is given by |R| * AS. By Definition D2,
AR = (|R| * AS) / |R| = AS    (1)
Equation (1) satisfies both boundary conditions (S1) and (S2). By the same token, IMR, the percentage of mismember tuples in R, and IAR, the percentage of tuples in R with at least one inaccurate attribute value, can be estimated as follows:
IMR = IMS
IAR = IAS
where IMS denotes the percentage of mismember tuples in S, and IAS the percentage of tuples in S with at least one inaccurate attribute value. Equation (1) is derived using Assumption A5, which assumes errors to be uniformly distributed. When the error distribution is not uniform, the worst and best cases for the selection operation can be analyzed as shown in the following subsections.
Worst case when error distribution is non-uniform
From Definition D2, the worst case occurs when the selection operation selects the maximum number of inaccurate tuples into the resulting relation R. Let Si denote the number of inaccurate tuples in S. Two cases need to be considered. In the first case, we have |Si| ≥ |R|. Under this scenario, the worst case occurs when all the tuples selected into R are inaccurate and thus AR = 0. In the second case, we have |Si| < |R|. Under this scenario, the worst case occurs when all the inaccurate tuples in S are included in R. As such, the number of accurate tuples in R is given by |R| - (|S| * (1 - AS)). Using Definition D2, we get:
AR = (|R| - |S| * (1 - AS)) / |R| = 1 - (|S| / |R|) * (1 - AS)    (2)
Equation (2) satisfies both boundary conditions (S1) and (S2), as shown below. In boundary condition (S1), AS = 1. By substituting AS with 1 in Equation (2), we get AR = 1. Therefore, boundary condition (S1) holds for Equation (2). In boundary condition (S2), AS = 0. It follows from Equation (2) that
AR = 1 - |S| / |R|    (2a)
Since |S| cannot be less than |R| by the definition of the selection operation, it follows from Equation (2a) that AR ≤ 0. Since, by definition, AR ≥ 0, it follows that AR = 0. Thus, Equation (2) satisfies boundary condition (S2).
Best case when error distribution is non-uniform
From Definition D2, the best case occurs when the maximum number of accurate tuples is selected into the resulting relation R. Let Sa denote the number of accurate tuples in S. Two cases need to be considered. In the first case, we have |Sa| ≥ |R|. Under this scenario, the best case occurs when all the tuples selected into R are accurate. Therefore, AR = 1. In the second case, we have |Sa| < |R|. Under this scenario, the best case occurs when all the accurate tuples in S are included in R. The number of accurate tuples in relation R is then given by AS * |S|. Therefore, by Definition D2, we have
AR = AS * (|S| / |R|)    (3)
Equation (3) satisfies both boundary conditions (S1) and (S2).
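Equations (1)-(3) can be collected into one small Python sketch; the function name and the returned triple (expected, worst, best) are illustrative conventions, not part of the chapter's notation.

def selection_accuracy(a_s, size_s, size_r):
    """Accuracy estimates for R = selection(S) given AS, |S|, and |R|.

    Returns (expected, worst, best) estimates of AR.
    """
    expected = a_s                                   # Equation (1): AR = AS
    inaccurate_s = size_s * (1 - a_s)                # |Si|
    accurate_s = size_s * a_s                        # |Sa|
    # Worst case: as many inaccurate tuples of S as possible end up in R.
    worst = 0.0 if inaccurate_s >= size_r else 1 - (size_s / size_r) * (1 - a_s)
    # Best case: as many accurate tuples of S as possible end up in R.
    best = 1.0 if accurate_s >= size_r else a_s * (size_s / size_r)
    return expected, worst, best

# DEPT1_EMP (AS = 0.7, |S| = 10) with a selection returning 5 tuples:
print(selection_accuracy(0.7, 10, 5))   # (0.7, 0.4, 1.0)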
Accuracy Estimation of Data Derived by Projection
Projection selects a subset of attributes from a relation and is denoted as R(A) = ΠA S(B), where A and B are the sets of attributes in R and S respectively, and A ⊆ B. By definition, we have |R| ≤ |S|. In addition, the accuracy of R must satisfy the following two boundary conditions:
AS = 0 ⇒ AR = 0    (P1)
AS = 1 ⇒ AR = 1    (P2)
We consider projection as a two-step operation. In Step 1, the set of attributes that is not present in R is eliminated from S to generate a relation Q, in which |Q| = |S| and Q may have duplicate tuples. In Step 2, all duplicate tuples in Q are eliminated, resulting in R. Let 〈v1, v2, ..., vm〉 be a tuple in the relation S. Let p be the probabilistic attribute accuracy for attribute vi. Using D5, one can write p = (AS)^(1/m), where AS is the accuracy of relation S. If n (n ≤ m) attributes are selected from S into Q, then the probability of a tuple in Q being accurate before elimination of duplicates is given by (AS)^(n/m). As such, the estimated total number of accurate tuples in Q before the elimination of duplicates is given by (AS)^(n/m) * |S|. Before elimination of duplicates from Q, we assume that Q and S represent the same class of objects as discussed earlier, except that Q provides fewer attributes than S. Therefore, the number of mismember tuples in Q is the same as that
of S. As such, the inaccuracy of Q due to mismember tuples is the same as the inaccuracy of S due to mismember tuples. That is, IMQ = IMS. By definition, the inaccuracy of Q due to inaccurate attribute values is given by
IAQ = 1 - AQ - IMQ = 1 - (AS)^(n/m) - IMS    (4)
where AQ = (AS)^(n/m) denotes the accuracy of Q before duplicate elimination. Let |R| be the size of the resulting relation after elimination of duplicates from Q, which implies that |Q| - |R| tuples are eliminated from Q. The set of tuples eliminated may consist of both accurate and inaccurate tuples. By Definition D4, the number of accurate tuples, mismember tuples, and tuples having at least one inaccurate attribute value among those eliminated can be estimated as (|Q| - |R|) * AQ, (|Q| - |R|) * IMQ, and (|Q| - |R|) * IAQ, respectively.
The total number of accurate tuples retained in the resulting relation R is given by
AQ * |Q| - (|Q| - |R|) * AQ = AQ * |R|    (5)
so that AR = AQ = (AS)^(n/m). Similarly, we can derive
IMR = IMS    (6)
By definition, IAR can be determined as
IAR = 1 - (AR + IMR)    (7)
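Under Assumption A5, the projection estimates in Equations (4)-(7) can be sketched as follows; again, the function and argument names are assumptions chosen for this illustration.

def projection_accuracy(a_s, im_s, m, n, size_s, size_r):
    """Estimate (AR, IMR, IAR) for R = projection(S) onto n of S's m attributes.

    a_s    : AS, accuracy of S          im_s   : IMS, mismember ratio of S
    m      : degree of S                n      : number of projected attributes
    size_s : |S| (= |Q| before duplicate elimination)
    size_r : |R| after duplicate elimination
    """
    a_q = a_s ** (n / m)                 # accuracy of Q before duplicate elimination
    ia_q = 1 - a_q - im_s                # Equation (4), shown for completeness
    retained_accurate = a_q * size_s - (size_s - size_r) * a_q   # Equation (5)
    ar = retained_accurate / size_r      # equals a_q under uniform elimination
    imr = im_s                           # Equation (6)
    iar = 1 - (ar + imr)                 # Equation (7)
    return ar, imr, iar

# DEPT1_EMP projected onto 2 of its 3 attributes, assuming no duplicates removed:
print(projection_accuracy(a_s=0.7, im_s=0.1, m=3, n=2, size_s=10, size_r=10))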
Worst case when error distribution is non-uniform
In the worst case, we consider two sub-cases. In the first sub-case, all tuples in Q are inaccurate, which implies that the accuracy of R is zero. In the second sub-case, some tuples in Q are accurate. In this sub-case, the worst situation occurs when all accurate tuples in Q collapse to a single tuple in R. In this situation, the accuracy is given by
AR = 1 / |R|
The above accuracy estimate satisfies boundary conditions both P1 and P2 with an assumption that 1