TEXTS AND MONOGRAPHS IN COMPUTER SCIENCE
RELATIONAL DATABASE TECHNOLOGY Suad Alagic
*
Springer-Verlag
Texts and Mon...
284 downloads
1600 Views
10MB Size
Report
This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Report copyright / DMCA form
TEXTS AND MONOGRAPHS IN COMPUTER SCIENCE
RELATIONAL DATABASE TECHNOLOGY Suad Alagic
*
Springer-Verlag
Texts and Monographs in Computer Science
Editor
David Gries Advisory Board F. L. Bauer J. J. Horning R. Reddy D. C. Tsichritzis W. M. Waite
Texts and Monographs in Computer Science
Suad Alagid Relational Database Technology Suad Alagic and Michael A. Arbib The Design of Well-Structured and Correct Programs S. Thomas Alexander Adaptive Signal Processing: Theory and Applications Michael A. Arbib, A. J. Kfoury, and Robert N. Moll A Basis for Theoretical Computer Science Michael A. Arbib and Ernest G. Manes Algebraic Approaches to Program Semantics F. L. Bauer and H. W6ssner Algorithmic Language and Program Development Kaare Christian The Guide to Modula-2 Edsger W. Dijkstra Selected Writings on Computing: A Personal Perspective Nissim Francez Fairness Peter W. Frey, Ed. Chess Skill in Man and Machine, 2nd Edition R. T, Gregory and E. V. Krishnamurthy Methods and Applications of Error-Free Computation David Gries. Ed. Programming Methodology: A'Collection of Articles by Members of IFIP WG2.3 David Gries The Science of Programming A. J. Kfoury, Robert N. Moll, and Michael A. Arbib A Programming Approach to Computability E. V. Krishnamurthy Error-Free Polynomial Matrix Computations Franco P. Preparata and Michael Ian Shamos Computational Geometry: An Introduction Brian Randell, Ed. The Origins of Digital Computers: Selected Papers Arto Salomaa and Matti Soittola Automata-Theoretic Aspects of Formal Power Series Jeffrey R. Sampson Adaptive Information Processing: An Introductory Survey William M. Waite and Gerhard Goos Compiler Construction Niklaus Wirth Programming in Modula-2
Relational Database Technology Suad Alagic
With 114 Illustrations
Springer-Verlag New York Berlin Heidelberg Tokyo
Suad Alagi& Department of Informatics Faculty of Electrical Engineering University of Sarajevo 71000 Sarajevo, Lukavica Yugoslavia
Series Editor David Gries Department of Computer Science Cornell University Upson Hall Ithaca, NY 14853 U.S.A.
Library of Congress Cataloging in Publication Data Alagi&, Suad, 1946Relational database technology. (Texts and monographs in computer science) Bibliography: p. Includes index. 1. Data base management. I. Title. II. Series. QA76.9.D3A48 1986 005.75'6 85-32108
© 1986 by Springer-Verlag New York Inc. All rights reserved. No part of this book may be translated or reproduced in any form without written permission from Springer-Verlag, 175 Fifth Avenue, New York, New York 10010, U.S.A. Typeset by Asco Trade Typesetting Ltd., Hong Kong. Printed and bound by R.R. Donnelley & Sons, Harrisonburg, Virginia. Printed in the United States of America. 987654321 ISBN 0-387-96276-X Springer-Verlag New York Berlin Heidelberg Tokyo ISBN 3-540-96276-X Springer-Verlag Berlin Heidelberg New York Tokyo
To Mara, Irena, and Gorjan
Preface
This book presents a unified collection of concepts, tools, and techniques that constitute the most important technology available today for the design and implementation of information systems. The framework adopted for this integration goal is the one offered by the relational model of data, its applications, and implementations in multiuser and distributed environments. The topics presented in the book include conceptual modeling of application environments using the relational model, formal properties of that model, and tools such as relational languages which go with it, techniques for the logical and physical design of relational database systems and their implementations. The book attempts to develop an integrated methodology for addressing all these issues on the basis of the relational approach and various research and practical developments related to that approach. This book is the only one available today that presents such an integration. The diversity of approaches to data models, to logical and physical database design, to database application programming, and to use and implementation of database systems calls for a common framework for all of them. It has become difficult to study modern database technology without such a unified approach to a diversity of results developed during the vigorous growth of the database area in recent years, let alone to teach a course on the subject. It is quite clear that applications of the relational database technology are much more important to the majority of readers interested in that technology. However, knowledge about the underlying structures and algorithms, concurrency control protocols, and distributed processing techniques are becoming a common professional knowledge and are of interest not only to the implementors of such systems but also to all those readers-database
viii
Preface
application programmers in particular-who would like to know how relational database systems actually work. On the other hand, implementors of such systems will benefit from the exposition of conceptual and relational modeling and end-user requirements and tools. Indeed, the most important implementation techniques for relational languages, for multiuser and distributed systems are presented within a framework that is an extension of the formal approach offered by various results developed around the relational model of data. Although the book is based mainly on a collection of research papers, it is written as a textbook. This refers to the style of exposition, numerous examples which are used in the book, and an extensive set of exercises. Very little is assumed in terms of the required background. Some familiarity with the very basic concepts of modern mathematics, computer systems, and their common applications is, of course, necessary. The presentation of the basic theory of the relational model of data, its underlying relational structures and algorithms, transaction processing models, and distributed systems is sometimes formal and very technical, but it is always illustrated by very pragmatic examples. In addition to the described integration goal the book is different from existing ones in that it applies recent object-action oriented techniques for conceptual modeling within the relational framework, and it provides a thorough exposition of the most important issues-particularly those related to data integrity-involved in the design and implementation of multiuser and distributed database systems. The author acknowledges research support from SIZ nauke SR Bosne i Hercegovine, and valuable observations made by Mario Gaon. Tsukuba Science City, Japan March 1985
SUAD ALAGIC
Contents
Introduction 1 2 3 4
Modeling Application Environments Relational Model Relational Database System Relational Technology
CHAPTER 1 Data Model 1 Relational Model of Data 1.1 Relational Representation of Entities 1.2 Relational Algebra 1.3 Relational Query Languages 2 Logical Dependencies 2.1 Functional Dependencies 2.2 Multivalued Dependencies 2.3 Join Dependencies 3 Hierarchical and Network Models 3.1 Unnormalized Relational Schemas 3.2 Network Model Exercises: GROUP BY Clause; SET Function; Entity Relationship Diagrams; Generalized Joins; M: N Relationships; Lossless Joins Bibliographical Notes
1 5 11 17
20 20 20 24 29 36 36 41 47 48 48 51
56 63
CHAPTER 2 Logical Design
64
1 1.1
64 65
Normal Forms Second Normal Form (2NF)
Contents
X
1.2 Third Normal Form (3NF) 1.3 Boyce-Codd Normal Form (BCNF) 1.4 Fourth Normal Form (4NF) 1.5 Projection/Join Normal Form (PJNF) 2 Abstractions 2.1 Unnormalized Relational Model 2.2 Aggregation 2.3 Generalization 3 Design Methodology 3.1 Extended Relational Model 3.2 Relational Database Programming Environment 3.3 Conceptual Modeling Exercises: Views; Types of Database Entities; Generalization; Associations; Aggregation; Characterization; Cover Aggregation; Taxonomic Design Methodology; Exception Modeling; Exception Handling Bibliographical Notes
118 130
CHAPTER 3 Structural Design
131
1 Relational Images 2 Decomposition of Unary Queries 3 Decomposition of Binary Queries 4 Optimization of Binary Queries 5 Decomposition of Queries with Set Operators 6 Relational Representation of Relations and Their Images 7 Decomposition of Data Manipulation Statements 8 Structure of Images Exercises: Links; Network Structures; Decomposition of n-ary Queries; Optimization of Query Expressions; Properties of the Relational Operators; B*-Trees Bibliographical Notes CHAPTER 4 Data Integrity
68 72 75 79 82 82 84 88 92 92 95 107
131 135 141 146 149 152 153 156
162 172
173
1 Transactions and Integrity of Data 2 Concurrent Executions of Transactions 3 Locking Protocols 4 Logical Locks 5 Restoring a Consistent Database State Exercises: Assertions; Transactions; Triggers; Tree Protocol; Hierarchical Locking Protocol Bibliographical Notes
205 210
CHAPTER 5 Distributed Technology
212
1 2
212 216
Architecture of Database Systems Distributed Executions and Integrity
173 175 179 188 197
Contents
xi
3 Distributed Query Processing 4 Distributed Updating Exercises: Fragmentation; Transaction Structure; Integrity Constraints and Data Distribution; Generalization and Fragmentation; Multidatabase Systems; Catalog Management; Object Naming Bibliographical Notes
222 231
References
245
Index
249
237 244
Introduction
1 Modeling Application Environments Concepts, techniques, and tools of the technology that is the topic of this book are meant to be applied in a diversity of environments. In such an application environment, a diversity of users and their information needs are the reason for introducing the relational technology in the environment to satisfy those needs in the best possible way. In order to satisfy users' needs the relational technology requires the design of a representation of an environment of its application that captures those static and dynamic properties of that environment which are relevant for meeting the information requirements of its users. In addition to capturing the requirements specified at the design time, the representation should be adaptable to future requirements. An application environment is viewed as a collection of interrelated objects and activities (or events) that involve those objects. It is thus represented as follows: (i) Relevant objects and their properties are specified as well as the relevant relationships among objects. (ii) Actions on objects and their relationships are specified in such a way that they correspond to the activities or events which exist in the application environment. The actions capture (reflect) the changes which occur in that environment and affect the properties of the represented objects and their relationships. (iii) Static and dynamic properties of the application environment which are not conveniently expressed in the representation of objects and actions
2
Introduction
upon them are specified as constraints or integrity rules over objects. A constraint often relates static and dynamic properties of an application environment, so it is a representation of a property of that environment. The information needs of the users in an application environment are satisfied by stating (triggering) queries which produce the required outputs. Specification of the set of queries which are required to satisfy the information needs of the users is thus one way of stating precisely what those needs are. Such a specification within the approach offered by the relational technology requires specification of objects and their attributes quoted in those queries. To illustrate the concepts, tools, and techniques presented in the book, we have chosen a typical university application environment which will be used throughout this introduction. We will assume that an analysis of the activities and the information needs in the chosen application environment reveals the following relevant objects, their attributes, and the activities in which they are involved: Objects. STUDENT, COURSE, LECTURER Properties.
STUDENT
I
NAME
I
I
MAJOR
GRADES
It
ADDRESS
YEAR
COURSE
t
I
I
TITLE
CREDIT
II LEVEL
DEPARTMENT
LECTURER
NAME
RANK
DEPARTMENT
Activities Involving ParticularObjects. Enroll student. Drop student. Change year. Change major. Hire lecturer. Fire lecturer. Change rank. Change department.
1 Modeling Application Environments
3
Offer course. Drop course. Change credit. Change level. Change department. Examples of queries which refer to particular objects are: Print Print Print Print
all the grades of a given student. the titles of all graduate courses in computer science. the names of all sophomore students in computer science. the names of all full professors.
Integrity constraints associated with the above objects are: Two different courses cannot have the same title. It is impossible to fire a full professor. When the rank of a lecturer is changed, the new rank must be higher than the previous one. When the year of study of a student is changed the new year must be higher than the previous one. When studying the information needs of the users of our sample application environment we discover easily that many queries refer to more than one object. Examples of such queries are: Print Print Print Print
all all all all
the courses in which a given student is enrolled. the students taking a given course. the courses taught by a given lecturer. the lecturers teaching a given course.
The same observation holds for the activities in the chosen application environment. Some of them involve more than one object. In other words, they involve the relationships among the objects in the application environment. Examples of such activities are: Enroll a student in a course. Drop a student from a course. Assign a lecturer to a course. Change a lecturer of a course. Relationships among objects which are referred to by users' queries or their activities in an application environment may themselves be viewed as abstract objects. The above queries and activities in fact refer to the following objects which represent relationships among the already defined objects:
4
Introduction
Relationships.
LECTURE
I COURSE
I ROOM
II DAY
TIME
LECTURER
ENROLLMENT
I I
COURSE
STATUS
t I GRADE
STUDENT
Integrity constraints related to the above relationships are: A student below the senior level cannot take a graduate course. A student cannot take a course which he or she has already taken. If a lecture is scheduled, then its course, room, day, and time must be specified, whereas the lecturer may be determined later. A lecturer below the assistant professor level cannot teach a graduate course.
Two lectures cannot be given in the same room and at the same time. An application environment of any complexity will have to satisfy a diversity of information needs which are conveniently partitioned into various views which particular groups of users have of that application environment. A view is specified as an abstraction of the application environment suitable to the class of users who share that same view. Generally speaking, the views are characterized in terms of abstractions of objects, actions, and constraints specified for the overall application environment. Data model is a collection of concepts that allow representation of an application environment and various views of that environment according to the above conventions. Specification of objects in the application environment and their relationships is called a schema. Specification of objects and their relationships as they are seen within a particular view is called a subschema. Static constraints are also included in a schema and that completes the static part of the data model specification. The active component of the representation of an application environment consists of a collection of simple and composite actions which are performed on the represented objects under the specified constraints to reflect the changes (events) in the application environment and/or produce the effects (outputs, reports) that satisfy the specified users' information needs. Although a set of predefined actions on objects represented in the schema should be defined when the schema is designed, the data model must allow
5
2 Relational Model
specification of unanticipated actions and queries that would meet the users' requirements unanticipated at design time.
2 Relational Model Abstraction is used out of necessity to develop a representation of an application environment which would correspond to a common view of all users of the application. The basic form of abstraction required by the relational model of data is classification of objects in an application environment into various object types. A collection of objects are considered the same type if they have the same set of properties which are relevant for the users' information needs while they may differ in those properties which are not relevant for those needs. An attribute specifies a single, static property of an object and does not have an existence independent of an object. An object type is characterized by a set of attributes which are common to all objects of that type. The relational model of data represents relationships among objects in the same way it represents objects themselves. The grounds for this semantic relativism is that relationships have their properties, the classification abstraction may be applied to the relationships as well, and so indeed relationships are in fact abstract objects. This semantic relativism is expressed in this book by using the same term for objects and their relationships, both of which are called entities. In the relational model of data an entity is represented by a tuple of values of its attributes. For example, is a tuple which represents a particular lecturer, an instance of the entity type (object class) LECTURER. A set of entities of the same type is represented by a set of tuples, where different entities are represented by different tuples. Such a set of tuples is in the formal sense a relation. For example, LECTURER denotes a set of objects all of which have the attributes NAME, RANK, and DEPARTMENT. The entity type LECTURER is represented as a set of tuples (triples in this particular case) of values of the above three attributes. Informally, a relation may be represented as a table whose columns correspond to the attributes and rows to tuples, elements of the relation. For example, the relation LECTURER may at some point in time correspond to the following table: LECTURER
NAME
RANK
DEPARTMENT
Smith Brown Jones Taylor Sellers
Assistant Associate Associate Full Lecturer
Physics Mathematics Physics Statistics English
6
Introduction
The notation LECTURER(NAME, RANK, DEPARTMENT) is used in the relational schema of our application to describe the structure of the relation (table) LECTURER. Actions and queries that the relational model of data provides for its representation of entities may be interpreted as operations on tables. More importantly, queries and actions in the relational model refer to sets of tuples which satisfy a specified logical condition expressed in terms of values of attributes of those tuples. Apart from other advantages this permits specification of the third part of representation-the integrity constraints-in a way very similar to how queries and update actions are specified within the relational framework. An example of a set-oriented (relational) query on the entity type LECTURER is: Print the names of all Associate Professors in the Physics department in alphabetical order. The result of the above complex operation is a sorted table (relation) of names. In other words, the only attribute of that table is NAME. The query refers to the lecturers with the property which is expressed by the following logical condition: DEPARTMENT = Physics AND RANK = Associate So the set of values of the attribute NAME of the set of tuples of the relation LECTURER which satisfy the above condition is the result of the above query. An example of a set-oriented action is: Promote all Associate Professors. The above action is performed on all the tuples of the relation LECTURER which satisfy the condition: RANK = Associate The value of the attribute RANK of all those tuples is changed to Full which means that we obtained a different set of tuples which corresponds to the entity type LECTURER. An example of an integrity constraint expressed in the same style is: The rank of all lecturers in the Mathematics department must be higher than Assistant Professor. A relational schema together with the associated subschemas, constraints, and actions and queries constructed according to the information requirements of the users in an application environment will be called the relational model of that environment. A relational database is a set of relations (tables of considerable size) whose structure is specified in the schema. The actual sets of tuples change over time due to various actions which are activated by
7
2 Relational Model
the users in the application environment to reflect the changes that happen over time in that environment. However, only those changes (updates) of the relational database are accepted which are performed in accordance with the integrity constraints. Users satisfy their information needs either by triggering prespecified transactions which represent composite queries and produce appropriate reports, or by stating ad hoc queries using user-friendly relational query languages. The global relational model of data has the fundamental advantage of conceptual simplicity (a collection of tables) which is suitable for a diversity of users in the application environment (casual users, programmers, etc.) who can communicate among themselves on the basis of such a unified framework which is well understood by all of them. The relational model of data is also equipped with the relational data definition, query and data manipulation (update) languages which are also suitable for a diversity of users whose technical knowledge varies from the understanding of just a few basic concepts (tables and operations on tables) to thorough knowledge of programming in modern programming languages. This conceptual simplicity of the relational model of data is sometimes not sufficient by itself to develop a rich, highly application-environmentoriented data model that captures a lot of semantics of that environment. Because of that a number of abstraction techniques are used both in designing a relational schema and in specifying various views on top of it. The form of abstraction called aggregation was already used in our example in two important cases. First of all, each of the objects STUDENT, COURSE, and LECTURER-may be viewed as an aggregation of its attributes. Later on, abstract objects LECTURE and ENROLLMENT-were defined as aggregates of other objects that were either simple (attributes) or composite. Indeed, LECTURE was defined as an aggregation of the composite objects COURSE and LECTURER and simple objects (attributes) ROOM, DAY, and TIME. This means that using aggregation we define a relationship among (simple or composite) objects as a higher-level abstract object. The following notation has been proposed to express that situation: LECTURE
I
I
COURSE LECTURER
tI DAY
ROOM
TIME
If aggregation includes a composite object then in the relational framework the identifier of that object is included as an attribute of the aggregate object. In the above example this means that in addition to ROOM, DAY, and TIME the attributes of the entity type LECTURE are also
8
Introduction
course identifier COURSEID and lecturer identifier LECTURERID. So we obtain the following relational representation of the abstract object LECTURE: LECTURE(LECTURE ID, COURSEID, LECTURERID, DAY, ROOM, TIME) An identifier is an attribute (or a set of attributes) of an entity type whose value determines a unique object in the application environment which is classified within that type. Such an attribute is called a (primary) key. In a table (relation) which represents the set of objects of the same type, there is at most one tuple with a given key value. The requirement that each database relation should have at least one key is needed for obvious reasons. Among the integrity constraints which may be specified for the aggregate object LECTURE, one is particularly interesting. It requires that a lecture always refers to an existing course and an existing lecturer. This is called referential integrity. Observe that in order to maintain this integrity constraint actions on the entities which participate in the association must be defined appropriately. If a course is dropped or a lecturer is fired the relevant lecture's attributes must be updated appropriately. Also, when a lecture is scheduled (inserted in the relation LECTURE) its attributes COURSE ID and LECTURERID must be chosen appropriately. In any real situation the above integrity constraint may be too strong. For example, it may often be acceptable to schedule a lecture for which the course, day, room, and time are known, but the lecturer is unknown and will be determined later on. Using a somewhat different form of aggregation we can introduce the entity type DEPARTMENT. In addition to its simple components (attributes such as NAME and LOCATION), an entity of the type DEPARTMENT is an aggregation of a set of courses offered by that department and a set of lecturers belonging to that department. The following graphical notation has been proposed to represent this particular abstraction.
DEPARTMENT
COURSE
LECTURER
We will assume that the chosen application environment has the following properties:
2 Relational Model
9
Each course belongs to one and only one department. Each lecturer belongs to one and only one department. If the above assumptions are satisfied then in the relational representation of the above relationships we choose the department identifier DEPARTMENTID as the attribute of both COURSE and LECTURE entity types. In other words, the department is in fact an attribute of each course and each lecturer. So we obtain the following relational schema: DEPARTMENT(DEPARTMENTID, NAME, LOCATION) COURSE(COURSEID, TITLE, CREDIT, LEVEL, DEPARTMENTID) LECTURER(LECTURERID, NAME, RANK, DEPARTMENT ID) The referential integrity constraints which may be appropriate are: A course must refer to an existing department. A lecturer must belong to an existing department. If the above integrity constraints are adopted then the actions specified on the entity types DEPARTMENT, COURSE, and LECTURER must be defined in such a way that they maintain these integrity constraints. When a new course is established or a new lecturer hired the value of their DEPARTMENT ID attribute must be chosen appropriately (they must be assigned to an existing department). Likewise, when a department is dissolved all of its courses and lecturers must be either deleted or assigned to another existing department. It is important to observe that these actions reflect the assumed properties of the application environment. Of course, different assumptions may be appropriate which would, for instance, permit each course and each lecturer to exist independently of a department to which they belong. Views or data submodels are on the higher level of abstraction from the global data model. They are often obtained using some standard abstraction techniques such as aggregation, generalization, or cover aggregation. In the relational approach views are virtual relations derived from the database relations and as such do not exist in the physical sense. However, they behave like base relations in almost all other ways, which in particular means that features of relational languages are applicable to views as well. Introducing views as higher-level abstract objects is accompanied as a rule with specification of appropriate integrity constraints. Also, actions on views are permitted as long as they do not violate integrity constraints specified for the relations in the database. We present an example of a relational view obtained using the form of data abstraction called generalization. Generalization is a form of abstraction which has been used in structuring knowledge long before the modern database technology was developed. When this particular form of abstraction is used more general concepts are defined extracting common properties from a number of similar concepts. In
10
Introduction
our particular application example this would mean that by extracting common properties of FRESHMAN, JUNIOR, SOPHOMORE, and SENIOR students we obtain their generic entity type UNDERGRADUATE student. Similarly, by extracting common properties of M.Sc. and Ph.D. students we obtain their generic type GRADUATE student. Furthermore, all UNDERGRADUATE and GRADUATE students have some common properties which characterize their generic entity type STUDENT. The following graphical notation has been proposed for the representation of the generalization hierarchy described above: STUDENT
UNDERGRADUATE
FRESHMAN
JUNIOR
SOPHOMORE
GRADUATE
SENIOR
M.SC.
PH.D.
Quite often (but not necessarily) the full property inheritance is assumed to hold. When that rule is applied to the two upper levels in the above hierarchy it means that undergraduate and graduate students have attributes that characterize these two types of student but they inherit all the properties common to all students. One possible way of approaching the above situation using the relational framework is to define two base relations which correspond to the above two subtypes of students: UNDERGRADUATE(STUDENTID, NAME, ADDRESS, YEAR, MAJOR, FACULTY) GRADUATE(STUDENTID, NAME, ADDRESS, DEPARTMENT, ADVISOR)
Having done that, we define a generic view of the above two student subtypes as a virtual relation whose attributes are: STUDENT(STUDENTID, NAME, ADDRESS) The above relation is virtual in the sense that it does not exist in the database. However, at any moment of time the set of tuples of the STUDENT relation may be derived from the base relations UNDERGRADUATE and GRADUATE by extracting the values of the attributes STUDENT ID, NAME, and ADDRESS from the tuples in these two relations and constructing a set out of them.
3 Relational Database System
11
This rule guarantees that all the changes performed in the base relations UNDERGRADUATE and GRADUATE will be reflected (i.e., will be visible) in their view STUDENT. In particular, for every UNDERGRADUATE and GRADUATE tuple there will be a tuple in the relation STUDENT with the same values of the STUDENTID, NAME, and ADDRESS attributes. Views behave like database relations in many ways. In particular, queries may be stated on views. Examples of such queries are: Print the name and the address of a student with a given student id. Print the alphabetical listing of names of all students and their addresses. Print the names of all students who have the same address.
3 Relational Database System A relational model of an application environment is implemented using a general purpose tool available on modern computers which is called a relational database management system. It supports a number of models, each of which has a collection of conceptual objects, actions (operations) on those objects, and constraints under which those actions are performed. Those models coexist in a coherent relational database system which will be viewed as a hierarchy of abstractions. The role of a relational database management system is to implement all those models in a coherent manner, transforming models on higher levels of abstraction into models on lower levels of abstraction all the way down to the machine level. In this book a relational database system is viewed through the following four models: (i) (ii) (iii) (iv) (v)
Global data model of all users. Data submodels or views of particular classes of users. Structural model of data. Multiuser transaction processing model. Data distribution model.
In the relational model of data, relations are simply unordered sets of tuples. In the structural model, which is one level below the data model in the hierarchy of abstractions, relations have much more complex structure. Within this model relations are equipped with logical orderings of tuples called images. Actions on images are such that they permit decomposition of high-level queries and actions of relational languages into appropriate relational algorithms. While queries and other actions on the level of the data model are nonprocedural and refer to sets of tuples, their representation within the structural model is procedural and contains operations that refer to particular tuples. Although the level of detail and procedurality is much lower from the level of the relational model, the relational framework is still
12
Introduction
maintained in many ways. Images of a relation are relations themselves and in many ways behave like particular types of views of a relation. To explain the need for the existence of images over database relations we consider the entity type LECTURE: LECTURE(COURSE, LECTURER, DAY, ROOM, TIME) Assume that the analysis of the activities and the information needs of the users in the chosen application environment show that the following two types of query are frequent and that the response to these two types of query should be fast: Find all the courses taught by a given lecturer. Find all the lecturers of a given course. It should be obvious that the need for the above two types of query occurs in the activity in which the lecturers are assigned to courses while the schedule of lectures is being constructed. When the assignment is performed each lecturer is informed about the courses assigned to him or her and the final schedule is announced. So the following two queries are also required in the considered activity: Print the lecturers in the alphabetical order and for each one of them the list of his lectures. Print the alphabetical listing of courses and for each one of them the list of its lectures. The above queries suggest the following two conceptual views of the relation LECTURE:
COURSE
LECTURER
LECTURE
In other words, all lectures given by one lecturer are assigned to that lecturer and all lectures of one course are assigned to that course. Observe that these are two different structurings of the same relation LECTURE. They may be described using the following notation: LECTURE(LECTURER, LECTURE(COURSE, DAY, ROOM, TIME)) LECTURE(COURSE, LECTURE(LECTURER, DAY, ROOM, TIME))
13
3 Relational Database System
The meaning of the above notation is illustrated by the following diagram: Lecturer A
Lecturer B
Lectures given by A
Course X
Lectures given by B
Course Y
Lectures 0'
X Lectures of y
The first two of the above queries make it obvious that the set of lectures should be the property of the lecturer giving those lectures and also that the set of lectures of a course should be an attribute of that course. The next two of the above queries suggest that in addition to the above requirement the structured tuples of the relation LECTURE should be ordered in two ways: according to the lecturers and according to the courses. Of course, one relation cannot be physically structured according to the above two requirements and the purpose of the relational images is to offer such logical orderings of relations together with the associated primitive operations on them. One way of looking at the relational images within the relational framework is illustrated by the following schema: LECTURE(LECTUREID, COURSEID, LECTURERID, DAY, ROOM, TIME) LECTURERS (LECTURER ID, LECTUREID) COURSES(COURSEID, LECTURE ID) The above representation of two images LECTURERS and COURSES of the relation LECTURE is proposed on the assumption that all three relations are structured in such a way that selection of the tuple with a given tuple identifier is an efficient action (operation). The above representation makes it obvious that changes in the base relations LECTURE must be appropriately reflected in its relational views. That is the price which must be paid to speed-up search (selection) operations on the relation LECTURE. Observe that the views LECTURERS and COURSES are much smaller relations than the base relation LECTURE since their tuples are smaller. Whenever a tuple of the LECTURE relation is inserted, appropriate tuples which refer to it must be inserted in the relational images LECTURERS and COURSES. Whenever a tuple of the LECTURE relation is deleted, the appropriate tuples of the relational images LECTURERS and COURSES which refer to it must be deleted from these images.
14
Introduction
Whenever the values of the LECTUREID or COURSEID attributes of a LECTURE tuple are changed, the relational images LECTURERS and COURSES must be updated appropriately. The purpose of the transaction processing model is to support the multiuser requirements of a relational database system under a given set of integrity constraints. Objects in this model are sequences of actions on base relations which are called transactions. The transaction processing model implements transactions as units of data integrity which means that they are required to comply with the specified integrity constraints associated with a particular relational schema. However, even if all the transactions which are performed concurrently on the same database have that property, it is still possible that their interactions cause violation of the specified integrity constraints. So the problem of maintaining database integrity is a crucial problem for all multiuser environments and the transaction processing model provides concepts, tools, and techniques for solving that problem. We could say that this model offers the users of a relational database system an abstraction in which, at least as far as the integrity constraints are concerned, each one of them has the impression of working in a single user environment. We will assume that our application environment is such that each department is responsible for scheduling the lectures of those courses offered by that department (and taught by lecturers from that department). However, we will further assume that lecture rooms are shared by all departments. Under these assumptions of particular interest becomes the maintenance of the integrity constraint Two lectures cannot be given in the same room and at the same time. associated with the relation LECTURE(LECTUREID, COURSEID, LECTURERID, DAY, ROOM, TIME) As we pointed out before an action associated with an entity type must be defined in such a way that it preserves all the integrity constraints associated with that type. That is why the action Shedule a lecture is defined as follows: Schedule a lecture: 1. Request lecture attributes from the user. 2. Check the integrity constraints. 3. If satisfied, insert lecture tuple and update the images on the LECTURE relation or else issue error message. Checking the integrity constraint in the above specification of the action Schedule a lecture involves the following (and often much more): LECTUREID is valid and nonexistent in the relation LECTURE. COURSEID is valid and exists in the relation COURSE. LECTURERID is valid and exists in the relation LECTURER.
3 Relational Database System
15
The values of the attributes DAY, ROOM, and TIME are valid (i.e., within the ranges specified in the schema). No other lecture is scheduled at the same time and in the same room. Although the action Schedule a lecture is atomic from the viewpoint of the user, it is in fact a composition of simpler actions. Such a composition of actions is called a transaction. Although a transaction as a whole maintains the integrity constraints, they may in fact be violated during its execution. For instance, between the points in time at which a tuple is inserted into the LECTURE relation and all of the images of this relation are updated the referential integrity constraints are not maintained properly. If another transaction attempts to perform a query or an action on the same relation and exploits the images which are not updated yet, some obvious problems will arise. A much simpler example which illustrates the sort of problems that the transaction processing model has to solve is the situation in which two secretaries from different departments perform the same action Schedule a lecture with the following lecture attributes: LECTURE ID COURSE ID LECTURER ID DAY ROOM TIME Assume the following sequence of execution of the primitive actions out of which the action Schedule a lecture is performed: (i) Both transactions request the lecture attributes from the users and conclude that no integrity constraints are violated. These checks (and conclusions) are made before any insertion is performed. (ii) Both transactions perform their insertions in an arbitrary order producing a database state which is not consistent with the specified integrity constraints. Of course, if the two transactions are executed one after the other (in serial order) this sort of problem would not occur. However, that would mean allocating the whole relation LECTURE to one user at a time. In some other cases this strategy would require allocation of all the relations interrelated by (e.g., referential) integrity constraints to one user at a time. An example is deletion of a department which requires appropriate actions on all of its courses and on all of its lecturers. This shows that although serial executions of transactions are attractive from the viewpoint of integrity constraints, they may be quite unacceptable from the viewpoint of availability of the data and the response time of users' requests for queries and actions on that data. The task of the transaction processing model is to specify the protocols which are to be used in multiuser environments to satisfy the requirements of the users in such environments both in terms of the quality (i.e., integrity) of the shared data and in terms of the availability of that data. These protocols
16
Introduction
support executions of transactions which are not serial, but are equivalent to the serial executions in terms of the integrity constraints. The task of the data distribution model is to support another abstraction of the relational database system which hides from its users the fact that the database is in fact a collection of databases distributed over a network of computers. This fact is hidden in the sense that users of a distributed system are presented a unified, relational model of data with the associated views and integrity constraints irrespective of the actual situation in which the physical distribution exists. To support this abstraction offered to the users of a relational data base system a collection of techniques are developed such as techniques for processing relational queries on a distributed database or maintaining integrity of the distributed database when it is accessed concurrently by a collection of transactions which are activated from various nodes in the network and access portions of the database located at different nodes. To illustrate some of the issues involved in the design and use of distributed database systems we will assume that the departments of our university application example are distributed over a number of locations. As far as the location of the common database is concerned there is one obvious solution: one central site is chosen and the database is located at that site together with the computing facilities which can manage that centralized database. There are many advantages of such an approach, particularly in our chosen example in which all the university departments will usually be located on the same campus. However, if we assume that the university departments are scattered in a larger area and that each one of them has its own (at least small scale) computing facilities, then we have the assumptions which may justify the design of a distributed database system, particularly if we have in mind the usual assumption that the cost of intersite communication is more expensive then the cost of local processing by two orders of magnitude. Of course, the issue of the data distribution is also approached from the viewpoint of the activities and the information requirements of the users in the chosen application environment. The assumption that we will make is that most activities within one department (and thus most queries issued by the users from that department) involve courses and lecturers from that department and lectures offered by that department. We will assume that the responsibility of maintaining students' records and the records of their enrollments in courses is not assigned to departments, but rather to schools (faculties) to which they belong (Arts, Sciences, Engineering, Medicine, Graduate). The schools are also assumed to be distributed over a number of locations. The activities of the users in each school (and thus their actions and queries on the common database) are such that they require access to data on all departments, courses, lecturers, students, and enrollments which belong to that school. However, all departments of a school are not necessarily located at the site where that school is and in most cases they are in fact located elsewhere.
4 Relational Technology
17
Occasionally, a user from one department will require access to data associated with another department. An example would be scheduling of lectures which is allowed to be performed by individual departments on condition that no integrity constraints are violated (no conflicting assignments of rooms to lectures). Also, occasionally there will be requests for data abort students enrolled in a course offered by a department or actions on that data (for instance, upon completion of course work which includes grading of students' performance). The above discussion shows that database relations may be partitioned into subsets and these subsets allocated to different locations. A particular and frequently occurring case is when complete relations are the units of distribution. For example, the whole relation LECTURE may be assigned to each location to make access to its tuples faster. In either case, there is an obvious need for data replication. For instance, the data on courses, lecturers, and lectures are assigned to each department, but all the data from all the departments which belong to one school are assigned to the location of that school. Data replication within a distributed database system requires protocols which guarantee that whenever updating of the replicated data is performed, all the copies of that data are updated in the same way or none of them are. For example, if the schedule of lectures is replicated at the location of each department then that has the advantage of quick local checking of the whole schedule without any need for intersite communication. However, whenever a change in the schedule is performed (a lecture is rescheduled) all the copies of the replicated schedule must be updated consistently.
4 Relational Technology This book presents a unified collection of concepts, tools, and techniques that constitute the most important technology available today for implementing information systems in a diversity of environments. (i) It deals with the concepts, tools, and techniques for developing data models of various application environments as well as particular abstractions (views) built on top of such models. (ii) Using a large number of examples it teaches the reader how to use a powerful tool the most widely known user interface to relational database systems the relational language SQL, which is a de facto standard in the area. (iii) The goal of this book is to cover the mathematical concepts and tools of the relational technology which are particularly relevant for using and implementing relational systems. In doing that all formal concepts and results are always supported by examples which are intended to explain the formal approach and show their pragmatic relevance.
18
Introduction
(iv) It provides a relational framework in which techniques for implementing the relational model of data and relational languages are studied. Within that framework a collection of relational algorithms are presented which are required to decompose high-level nonprocedural relational language constructs into procedural form and choose the most efficient of all possible decompositions. (v) A particularly important collection of concepts and techniques covered in this book relates to the problems of implementing relational database systems in multiuser environments. In the relational framework in which these techniques are studied the most important problem is integrity of the database which is concurrently accessed by a collection of transactions. This book presents a collection of techniques suitable for relational database systems. (vi) Another recently developed collection of concepts, tools, and techniques covered in this book applies to the design and implementation of distributed database systems. Examples are techniques for logical and physical design (data distribution) of distributed databases, query processing in distributed systems, and techniques for maintaining data integrity and reliability in such systems. The orthogonal structure of the chapters and the concepts, tools, and techniques which they cover may be presented in the following way:
CONCEPTS DATA MODEL
LOGICAL DESIGN
STRUCTURAL DESIGN
TOOLS
Entity Relationship Relation Relational operation Logical dependency Normal form Conceptual model Aggregation Generalization Covering
Relational data definition query and manipulation language
Image
Relational algorithms Dynamic data structures
TECHNIQUES Classification
Normalization Abstraction Conceptual design Refinement Incremental design Localization Decomposition Optimization
19
4 Relational Technology
TECHNIQUES
CONCEPTS
TOOLS
DATA INTEGRITY
Integrity constraint Transaction Concurrent
Locking protocols Commit protocol
Scheduling Locking Recovery
DISTRIBUTED SYSTEMS
Distributed database Distributed execution Serializability
Reduction Commit protocol
Distributed processing Distributed updating Recovery
execution
CHAPTER 1
Data Model
1 Relational Model of Data 1.1 Relational Representation of Entities (1) Model A model is a representation of a set of objects and their relationships. Such a representation is always the result of a process of abstraction. During that process relevant objects, relevant relationships among them, and relevant attributes of both objects and relationships are chosen. The relevance of an object, a relationship, or an attribute is determined by the purpose of the model. Models studied in this book are used for stating questions (queries) about the represented objects and their relationships. Given a set of queries, one can determine to what objects, relationships, and attributes queries from that set refer. Only those associated with the given set of queries will be relevant for the representation within the model. A model will serve its purpose if it is a clear, correct, up-to-date representation. Attributes of objects and their relationships have specific values which belong to sets called domains. A value of a given attribute may change over time, but it always belongs to the domain of that attribute. Such changes must be reflected in the sets of representations of objects and their relationships. (2) Entities, Attributes, and Domains Objects and their relationships are called entities. A set E of entities of the same type is characterized by its set of attributes A,, A2 , ... , A. and denoted as E(A1 ,A 2, .... A.) where Ai: E--* Di is a function whose codomain Di is
21
1 Relational Model of Data
called the domain of the attribute AL. Given e in E, Ai(e) in Di is called the value of the attribute Ai of the entity e. (3) Example of an Entity Type According to Section (2) an entity type PERSON whose relevant attributes are social security number, first and last names, and age will be denoted as PERSON(SOC #, FNAME, LNAME, AGE) where the domains of the attributes are: SOC # FNAME LNAME AGE
the the the the
set set set set
S of social security numbers A of finite sequences of letters A of finite sequences of letters N of nonnegative integers which are less than 150
(4) Representing Entities by Tuples Given an entity type E(AI,A ..... A.), the set of functions A: E--*D, A 2 : E --+ D 2 , . , A, : E -* D. (attributes of that entity type) determine a unique function: (A1,3A2.,
A,): E --+ D, x
D2 X
... x OD
(i
where D1 x D2 X "" x D. denotes the Cartesian product of the sets D1, D2 . ., D. O [i.e., the set of all n-tuples (d, d 2 ,.. ., d.) where di is in Di for i = 1, 2 ... , n]. The function (A 1, A 2 .... A.) is defined as (A 1 , A 2. .... A,)(e) = (Al(e), A 2 (e),...
A,(e))
(ii)
where we require that for any two different entities e, and e2 of the type E the following holds: (A 1 , A2 1 _ .
A.)(el)
(A 1, A 2 ...
, A.)(e 2)
(iii)
Condition (iii) states that A,, A 2 .. A. is an injective function E--* D, x D2 x ... x D.. It will be satisfied if there existsj(j = 1,2... , n) such that Aj(el) =AAj(e2) (iv) In words, different entities are represented by different tuples. Tuples will be different if they have different values of at least one attribute. (5) Example of Representation of an Entity Type by a Set of Tuples We now interpret the notation PERSON(SOC #, FNAME, LNAME, AGE) as a function PERSON -+ S x A x A x N, where S is the set of social security numbers, A is the set of finite sequences of letters, and N is the set of nonnegative numbers. This function determines, for each person p, a 4-tuple (social security number, first name, last name, age) which represents that person. Different persons p1 and p2 will determine different tuples. Indeed, although two different persons may have the same first and last names and age, their social security numbers will certainly be different.
22
1 Data Model
Sections (4) and (5) show that entities of the type E(A1 , A 2 ,..., Aj) may be represented by n-tuples of values of their attributes. The set of entities of the type E(A1,A 2 .... A,,) may be represented by a table whose columns correspond to the attributes A 1 , A2 , ... , A,, and whose rows are particular tuples representing specific entities. In view of condition (4) (iii), no two rows may be identical which means that such a table is a set having rows of the table as its elements. This set according to definition (6) is precisely a relation. (i)
Entity
Domains
Attributes
SOC #
(ii) For a given moment of time a set of persons may be represented by the following table: SOC#
FNAME
LNAME
AGE
Peter Harry Joan Marc John Alice
Brown Smith Rose Quincy Foz Hess
31 24 64 45 52 24
941 385 102 243 860 543
A set of entites of the type PERSON represented by a table
(6) Relations D,, if and only if
R is a relation over the sets D1 , D 2 .... R _D
1
x
D
2
X ...
x 1D.
23
1 Relational Model of Data
The structure of the relation R is described by the notation R(A 1 , A 2 ,. . where Ai: R -+ Di is an attribute of R and Di is its domain, for i = 1, 2 ...
A) , n.
(7) Relational Database and Relational Schema A relational database is a collection of time-varying relations. A relational schema is a description of the structure of relations in a relational database. (8) Example of a Relational Schema (i)
DEPARTMENT(DEPT #, NAME, LOC) EMPLOYEE(EMP #, NAME, JOB, DEPT #) PROJECT(PRO #, TITLE, FUNDS) ASSIGNMENT(EM P #, PRO #, ROLE)
The above relational schema describes four types of entity: DEPARTMENT, EMPLOYEE, PROJECT, and ASSIGNMENT of employees to projects. (ii) Attributes of entites of the DEPARTMENT type are: DEPT # NAME LOC
department number department name location of the department
Attributes of entities of the EMPLOYEE type are: EMP# NAME JOB DEPT#
employee social security number name of the employee job of the employee number of the employee's department
Attributes of entities of the PROJECT type are: PRO # TITLE FUNDS
project code project title funds assigned to the project
Attributes of entities of the ASSIGNMENT type are: EMP# PRO # ROLE
employee number project code role which the employee performs on the project
(iii) The above schema is supplemented by the following two logical conditions which we assume hold: An employee belongs to a unique department. An employee may be assigned to several projects and a project has a number of employees assigned to it. (iv) We now give a sample database that corresponds to the relational schema defined in (i), (ii), and (iii) at some specified moment of time.
24
1 Data Model
DEPARTMENT
EMPLOYEE
DEPT#
NAME
LOC
EMP#
NAME
JOB
d,
N,
Cl
el
nI1
Jl
d2
N2 N3
c,
e2
n
2
12
C2
e3
n3
13
e4
n4
Jn
e5
n5 n6 n7
j2
d,
e6 e7
d,
12
Ji
d2 d,
d2 d3 d3 di
ASSIGNMENT
PROJECT PRO #
TITLE
P1 P2
t' t' t3
P3
DEPT#
FUNDS f, f2 f3
EMP #
PRO #
ROLE
el
Pi
r,
e2
P3
r,
e2
P2
r2
e3
P2
r2
e3 e4
P3 Pi
r, r,
e5 e6
P3 Pi
r2 r3
e6
P2
r3
e6
P3
r3
e7
P1
r,
1.2 Relational Algebra A model is useless unless we have a suitable language for stating queries about properties of entities represented by the model. There exists a number of such languages defined for the relational model of data. Some of them have been efficiently implemented and widely accepted. From the conceptual point of view all of them are based on a simple, formal language called the relational algebra. The relational algebra consists of a collection of operations over relations. The operations are chosen in such a way that all well-known types of queries may be expressed by their composition in a rather straightforward way. First, the relational algebra contains the usual set operations: Cartesian product, union, intersection, and difference. Second, this algebra also includes the operations of projection, restriction, join, and division. The latter are in fact characteristic for the relational algebra and essential for its expressive power for stating queries. Assuming that the reader is familiar with the usual set operations, we will define only the remaining operations of the relational algebra.
25
1 Relational Model of Data
(1) Projection Let R(A 1, A 2 ..... An) be a relation, X c {A 1, A 2 . , An} and Y {A, A 2 ... An}/X. Permuting the attributes of R we can represent R(A,, A 2,..., A.) as R(X, Y). The result of the operation of projection of the relation R over the attributes X is denoted as R[X] and defined as: (i)
R[X] = {x: there exists y such that (x,y) is in R(X, Y)}
If we represent the relation R as a table, then the operation of its projection over the set of attributes X is interpreted as the selection of those columns of R which correspond to the attributes X and elimination of duplicate rows in a table obtained by such selection. Duplicate rows will appear whenever the relation R(X, Y) contains tuples (x,y,), (x,y 2), .... (x, ym) differing only in the values of the attributes from Y. Unless we eliminate duplicate rows, we cannot claim that the table obtained by the selection of columns represents a set, let alone a relation. (ii) Example of a Projection. Given a relation EMPLOYEE(EMP #, NAME, JOB, DEPT #) the projection EMPLOYEE[DEPT #,JOB] represents a survey of jobs in particular departments. DEPT #
JOB
d,
i3
dl,
d2 d,
i2
d3
12
(2) Restriction Let R(A,, A 2 ... , AJ)be a relation and P a logical condition defined on the set D1 x X... x D, where Di is the domain of the attribute Ai (for i = 1, 2,...,n). The result of restriction of the relation R with respect to the condition P is denoted as R[P] and defined as follows: (i)
R[P] = {x: x in R and P(x) is true}
If we represent the relation R as a table, then the operation of restriction may be interpreted as elimination of those rows from the table R which do not satisfy the condition P. As opposed to the generality of definition (i), condition P is in practice specified by a rather simple expression.
26
1 Data Model
(ii) Example of a Restriction. Given a relation EMPLOYEE(EMP #, NAME, JOB, DEPT #) its restriction EMPLOYEE[JOB = 'PROGRAMMER']
represents a table of those employees who are programmers:
EMP#
NAME
JOB
DEPT#
e2
n2
i2
d2
e5
n5
12
d3
e6
n6
i2
d3
(3) Join Let A(A 1 ,A 2 ,..,A ) and B(B1,B 2 ,....B,) be relations and X___ {A 1 ,A 2 ,...,An} and Yc_ {B1,B 2 ,...,Bm} sets of attributes. Assume
that X and Y contain the same number of attributes and that the corresponding attributes have the same domain. If we denote Z = {A 1, A 2 ,..., A,}IX and W = {B 1, B 2 ,..., Bm}/Y then by permuting the order of the attributes the relations A and B may be represented as A(Z, X) and B(Y, W). The natural join of the relations A and B over the attributes X and Y, respectively, is denoted as A[X = Y]B and defined as follows: (i)
A[X = Y]B = {(z,x, w): (z,x) in A, (y, w) in B and x = y}
Observe first of all that A[X = Y]B is a relation whose set of attributes is the union of the sets Z, X (or Y), and W. The table A[X = Y]B is obtained by concatenating tuples from A with tuples from B whenever the values of X attributes are equal to the values of Y attributes. In addition, duplicate attributes (X or Y) are eliminated from the concatenated tuples. (ii) Example of a NaturalJoin. Given relations EMPLOYEE(EMP #, NAME, JOB, DEPT #) DEPARTMENT(DEPT #, NAME, LOC) the expression EMPLOYEE[DEPT # = DEPT #] DEPARTMENT [NAME, LOC] denotes the composition of two operations: the natural join of the relations EMPLOYEE and DEPARTMENT over the attributes DEPT#, and the projection of the result of the join operation on the attributes NAME and LOC. The result of the composite operation is a table of employees together
27
1 Relational Model of Data
with the locations of their departments:
NAME
LOC
n,
Cl
n,
C1
n3
Cl
n4
C1
n5
C2
n6
C2
n
C1
7
(4) Division
Let A(X, Y) and B(Z) be relations, where X, Y, Z are sets of attributes. Assume that Y and Z contain an equal number of attributes and that the domains of the corresponding attributes in these sets are equal. The result of division of the relation A(X, Y) with the relation B(Z) is denoted as A[Y: Z]B and defined as the maximal subset of the projection A[X] such that its Cartesian product with B(Z) is contained in A(X, Y). In other words, we have (i)
A(X, Y) = A[Y: Z]B x B(Z) U O(X, Y)
where A[Y: Z]B is the quotient and O(X, Y) is the remainder of the division operation. (ii)
Given C g; A[X] such that C x B(Z) S_A(X, Y) it must be that C s_ A[Y: Z]B.
(iii) Example of Division. Consider the relations PROJECT(PRO #, TITLE, FUNDS)
ASSIGNMENT(EMP #, PRO #, ROLE) The expression ASSIGNMENT[PRO # : PRO # ]PROJECT[PRO #]
denotes the division of the relation ASSIGNMENT with the projection PROJECT[PRO # ]. The quotient is a table of employees who are assigned to all projects and their roles within projects:
EMP#
ROLE
e6
r3
28
1 Data Model
The remainder is:
EMP #
PRO #
ROLE
el
Pi
e2 e2
P3 P2
r2
e3
P2
r2 r1
e3 e4
P3
e. e7
r,
P1 P3
Pi
r2 r,
(5) Ordinary Set Operations The operations of the Cartesian product, union, intersection, and difference are defined in the relational algebra in the usual way. There is, however, one restriction as far as the operands of the operations of union, intersection, and difference are concerned. In the relational algebra they are applicable only to the relations which satisfy the following compatibility condition: The number of attributes and the domains of the corresponding attributes of operands of the set operations must be the same. (6) Graphical Representation of the Operations of the Relational Algebra (i) Projection of the relation R over the attributes A:
(ii) Restriction of the relation R with respect to the condition P:
29
1 Relational Model of Data
R
(iii) Natural join of the relations R 1 and R, over the attributes A and B, respectively:
Ri
2
1.3 Relational Query Languages Although the relational algebra is a simple formal language, it is not suitable for casual users of the database, especially those who are not educated in mathematics and programming. As such, it is not a suitable practical query language. Indeed, expressing queries which are more complex from the examples given in the previous section gets rather complicated if the relational algebra is used. A number of relational query languages have been designed and implemented to serve as practical tools for casual users. Queries in such languages have clear structure and meaning and are expressed in a way which is much closer to the way one would ask such queries in ordinary English. In addition, asking complex queries is much simpler than in the relational algebra. Throughout this book the language SQL will be used to illustrate properties of the relational query languages. SQL (formerly SEQUEL) is an abbreviation for the Structured Query Language.
30
1 Data Model
(1) Query Block The most fundamental concept of SQL is called the query block. Its basic form is SELECT FROM WHERE
<list of attributes> (list of relations>
The result of execution of a query block is a relation whose structure and contents are determined by that block. Attributes of that relation are specified in the list of attributes. The attributes listed are selected from the relations in the list of relations. The first two clauses (SELECT and FROM) in the query block define the operation of projection. The qualification expression in the WHERE clause is a logical expression. It contains attributes of the relations listed in the FROM clause and determines what tuples of those relations qualify for the operation of projection. This means that only the attributes of those tuples for which the qualification expression is true will appear in the result of the query block. The WHERE clause thus contains specification of the restriction and the join operations. To summarize, the query block as a whole represents a composition of the operations of projection, restriction, and join of the relational algebra. In fact, it also permits specification of the division operation, although not in a straightforward manner. But what is important to observe is that the query block does not specify the order in which these operations are performed. Because of that it is a more general, less procedural construct than any particular operation or a composition of operations of the relational algebra. (2) Example of a Relational Schema To illustrate the use of the query block and other constructs of SQL we will give a number of sample queries for the following relational schema: DEPARTMENT(DEPT #, NAME, CITY) EMPLOYEE(EMP #, NAME, JOB, SALARY, DEPT #) PROJECT(PRO #, TITLE, FUNDS) ASSIGNMENT(EMP #, PRO #, ROLE)
(3) Projection The following query block constructs a table of department numbers and jobs which the employees perform in their departments: SELECT FROM
DEPT#, JOB EMPLOYEE
(i)
Since the WHERE clause is missing, it appears that the query block (i) represents the operation of projection. That is not quite true since elimination of equal pairs which appear after selection of the attributes DEPT #
31
1 Relational Model of Data
and JOB will not be performed. The elimination process, which is sometimes very expensive, must be explicitly specified, as in query (ii). SELECT
UNIQUE DEPT#, JOB
FROM
EMPLOYEE
Query block (ii) defines precisely the operation of projection of the relation EMPLOYEE over its attributes DEPT# and JOB. It is possible to order the table which represents the result of a query block. For example, if we want to obtain an alphabetical listing of employees, each one with his department number, we type query (iii): SELECT NAME, DEPT# FROM EMPLOYEE ORDER BY NAME
(iii)
(4) Restriction Selection of those tuples from the table EMPLOYEE which represent programmers is specified by the following query: SELECT
*
FROM WHERE
EMPLOYEE JOB ='PROGRAMMER'
(i)
This is precisely the operation of restriction of the relation EMPLOYEE with respect to the condition JOB = 'PROGRAMMER'. * denotes selection of all attributes (columns) of the table employee. The composition of operations of selection, restriction, and ordering is specified by query block (ii) which constructs the table of names of employees who are programmers. SELECT FROM
NAME EMPLOYEE
WHERE
JOB ='PROGRAMMER'
(ii)
ORDER BY NAME So far, the qualification expression has the simplest form possible: it is a comparison of the value of an attribute with a constant. SQL permits much more complex forms of the qualification expressions. The most frequently used form is probably a conjunction of several simple comparisons, as in query (iii), which constructs a table of programmers having salaries greater than $50,000. SELECT FROM
NAME EMPLOYEE
WHERE
JOB = 'PROGRAMMER'
AND
SALARY > 50000
(iii)
32
1 Data Model
In example (iv), the qualification expression has the form of a disjunction of simple comparisons. It constructs a table of employees who work in department number 7 or department number 9. SELECT FROM
NAME EMPLOYEE
WHERE
DEPT# = 7
OR
DEPT# = 9
(iv)
SQL permits a more elegant way (v) of expressing query (iv). It is based on the relationship of membership (IN) of an element in a set: SELECT FROM WHERE
NAME EMPLOYEE DEPT# IN (7,9)
(v)
All the examples given so far represent queries on single relations. Our final example (vi) of one-relation (unary) query shows how embedding a query on one relation into a query on another permits implicit specification of the natural join operation. Query block (vi) constructs a table of names of those employees whose departments are located in Boston: SELECT FROM WHERE
NAME EMPLOYEE DEPT# IN
SELECT FROM WHERE
DEPT# DEPARTMENT CITY ='BOSTON'
(vi)
(5) Join Query (i) constructs a table of employees' names and locations (cities) of their departments. It is our first example of a query on two relations: EMPLOYEE and DEPARTMENT. Query block (i) is a composition of two operations of the relational algebra: projection and join. The attributes over which the join of the relations EMPLOYEE and DEPARTMENT is performed are DEPT# in the first relation and DEPT# in the second. SELECT FROM WHERE
EMPLOYEE.NAME, DEPARTMENT.CITY EMPLOYEE, DEPARTMENT EMPLOYEE.DEPT# = DEPARTMENT.DEPT#
(i)
A more complex query block (ii) constructs a table of names of programmers and their departments if they are located in Boston. In addition to the operations of projection and join, query block (ii) also contains the
33
1 Relational Model of Data
operation of restriction of the relations EMPLOYEE and DEPARTMENT. As such, it is not only the most general type of query we have had so far, but it is very typical for queries on two relations. SELECT FROM WHERE AND AND
EMPLOYEE.NAME, DEPARTMENT.NAME EMPLOYEE, DEPARTMENT EMPLOYEE.JOB = 'PROGRAMMER' EMPLOYEE.DEPT# = DEPARTMENT.DEPT# DEPARTMENT.CITY = 'BOSTON'
(ii)
(6) Union, Intersection, and Difference The result of execution of a query block is a set of tuples (a relation). Because of that, query blocks may appear as operands of the set operations: union, intersection, and difference. That way, we obtain SQL query expressions, query blocks being their particular and most frequently occurring form. As stated before, operands of the operations of union, intersection, and difference must be compatible, which means that they must have the same number of attributes and that the domains of the corresponding attributes must be the same. An example of a query expression is (i). Its result is the table of numbers of those departments which have no employees: SELECT FROM MINUS SELECT FROM
DEPT # DEPARTMENT (i) DEPT# EMPLOYEE
The results of query expression (ii) are employee numbers of those programmers who work on a project: SELECT FROM
EMP # EMPLOYEE
WHERE
JOB
-
'PROGRAMMER'
(ii)
INTERSECT SELECT EMP# FROM ASSIGNMENT (7) Division SQL does not permit the user to express the operation of division in a particularly attractive form. Query block (i) is chosen to demonstrate this property. It constructs a table of employee numbers of those employees who are assigned to all projects with funds greater than $1 million. The table also contains the roles of the selected employees within their projects.
34
1 Data Model
SELECT FROM WHERE
EMP#, ROLE ASSIGNMENT X (SELECT PRO# FROM ASSIGNMENT WHERE EMP#
CONTAINS
= X.EMP#)
(i)
(SELECT PRO# FROM PROJECT WHERE FUNDS > 1000000)
Query block (i) contains a bound variable X whose range is the relation ASSIGNMENT. CONTAINS denotes the set relation which is true if the right-hand operand is a subset of the left-land operand. Observe that query (i) defines the division of the relation. ASSIGNMENT(EMP #,PRO#, ROLE) with the result of query block (ii): SELECT PRO # FROM PROJECT WHERE FUNDS > 1000000
(ii)
(8) Standard Functions SQL contains the following standard functions: AVG (the average value), SUM, COUNT (the number of elements in a set), MAX (the maximum of a set of values), and MIN (the minimum of a set of values). These functions make SQL a much more complete query language. The result of query (i) is the average salary of programmers: SELECT FROM WHERE
AVG(SALARY) EMPLOYEE JOB ='PROGRAMMER'
(i)
The result of query (ii) is the number of different roles which a particular employee performs on projects: SELECT FROM WHERE
COUNT(UNIQUE ROLE) ASSIGNMENT EMP# = 128
(ii)
(9) Insertion SQL is not just a query language. It permits specification of other actions on the database. The most important types of action are insertion, deletion, and modification (update) of tuples. These actions may refer to a set tuples or to a particular tuple. An example that illustrates the action of insertion of a particular tuple is (i): INSERT INTO EMPLOYEE(EMP #, NAME, JOB): K843, 'JONES', 'PROGRAMMER'>)
(i)
35
1 Relational Model of Data
In the inserted tuple the attributes not specified in (i) obtain the undefined (unknown) value. If the values of all attributes are specified, then it is not necessary to specify their names. Suppose now that, in addition to the relations specified in (2), the database also contains (at least temporarily) the relation CANDIDATES(EMP #, NAME, JOB, SALARY, DEPT #)
(ii)
Then statement (iii) inserts into the relation EMPLOYEE selected tuples from the relation CANDIDATES: INSERT SELECT FROM WHERE
INTO EMPLOYEE EMP#, NAME, JOB, SALARY*1.1, DEPT# CANDIDATES JOB IN ('PROGRAMMER', 'ANALYST')
(iii)
(10) Deletion Deletion of a particular tuple from a relation is illustrated by example (i). DELETE WHERE
EMPLOYEE EMP# = 814
(i)
Statement (ii) deletes all employees whose departments are located in Boston: DELETE WHERE
EMPLOYEE DEPT# IN SELECT DEPT# FROM DEPARTMENT WHERE CITY = 'BOSTON'
(ii)
(11) Update Statement (i) increases the salary of a particular employee: UPDATE EMPLOYEE SET SALARY = SALARY* 1.1 WHERE EMP# = 341
(i)
Statement (ii) increases the salary of all employees assigned to a particular project. UPDATE SET WHERE
EMPLOYEE SALARY = SALARY * 1.1 EMP# IN SELECT EMP# FROM ASSIGNMENT WHERE PRO# = 36
(ii)
36
1 Data Model
(12) Creation of a Relation Specification (2) of our sample relational schema was not given in SQL. We now present such a specification for the relation DEPARTMENT. Statement (i) creates this table and specifies its attributes and their domains. It also states which attributes may never assume the undefined value. CREATE TABLE DEPARTMENT (DEPT # (INTEGER, NONULL), NAME(CHAR(10)VAR), CITY(CHAR(20) VAR)) The domain of the attribute DEPT # is the set of integers (INTEGER). Undefined values of this attribute are not permitted (NONULL). The domain of the attribute NAME is the set of sequences of characters (CHAR). The length of those sequences is variable (VAR) but it does not exceed 10. The domain of the attribute CITY is likewise specified. Relations which the user creates in this way may be permanent or temporary. The existence of temporary relations in the database is limited to the period of the interaction of the user of the database who created them.
2 Logical Dependencies 2.1 Functional Dependencies Consider the relation DIRECTORY(NAME, ADDRESS, CITY, ZIP, PHONE) and assume that all the names listed in the directory are different. Furthermore, suppose that each person in the directory may have only one phone. We may then describe dependencies among the attributes of the relation DIRECTORY in the following way: (i) Given a specific name from the directory, it determines a unique tuple of the relation DIRECTORY, which means that the values of the attributes ADDRESS, CITY, ZIP, and PHONE are determined as well. (ii) Given values of the attributes ADDRESS and CITY, all tuples of the relation DIRECTORY with the given values of the attributes ADDRESS and CITY (if they exist) have the same value of the attribute ZIP. (iii) The values of the attributes CITY of all tuples of the relation DIRECTORY with a given zip code are equal. The above logical relationships among the attributes of the relation DIRECTORY are called functional dependencies. The notion of a functional dependency may be formally defined as follows:
37
2 Logical Dependencies
(1) Functional Dependency Given a relation R(A 1 , A 2..... A) and two sets X and Y of its attributes, that is, X ; {A 1 , A 2 ,..., A.} and Y c {A 1 , A 2 ,..., An}, denote with R[XY] the projection of the relation R over the attributes XU Y. We say that a functional dependency X--+ Yexists in the relation R(A 1 ,A 2 , ... ,A,) if and only if R[XY] is in fact a function R[X] -+ R[Y] for every moment of time. The above definition means that whenever xy and Ny' are tuples of the relation R[XY], then in fact y = y' holds. This is precisely the condition which says that R[XY] is a function R[X] -- R[Y]. If the functional dependence X -* Y does not exist, we write X+-+ Y. In that case, R[XY] contains at least one pair xy and xy' of tuples where y 0 y'. The property X -* Y of the relation R is time independent. Although the set of tuples R changes over time, and so does R[XY], R[XY] remains a function for any specific moment of time. The fact that R[XY] viewed as a set of pairs (x,y) changes over time means that R[X] --+ R[Y] is in general a different function for two different moments of time. (2) Example of Functional Dependencies The assumptions which we introduced for the relation DIRECTORY in fact mean that the following functional dependencies hold: NAME -* ADDRESS NAME -- CITY NAME-- ZIP NAME -- PHONE ADDRESS, CITY --* ZIP ZIP --* CITY Definition (1) of the functional dependency allows us to define precisely the notion of a key of a relation. (3) Key If R(A 1,,A,...,A,) is a relation and X a set of iti attributes (X9 {A 1 ,A 2 ..... An}), then Xis a key if and only if: (i) X functionally determines all attributes of the relation R, that is, X --* A1 for i = 1,2 ... , n. (ii) No proper subset of X has that property, that is, if X' c-X then X' +-+Ai for somej(j = 1,2 .. , n). It follows from the above definition that NAME is the only key of the relation DIRECTORY. If a set X of attributes has property (i), but not property (ii), it is called a superkey. For example, any set of attributes which contains the attribute NAME is a superkey. The definition of a key implies immediately the following property:
38
1 Data Model
(4) Every Relation Has a Key A,-* ... ,An) we obviously have A 1, A2 .... Aifori= 1, 2,..., n. If thereis no X, X(c {A 1 ,A 2, .. ,A,} such thatX--*Ai for all i = 1, 2, ... , n, then A 1 , A2 . A,, is a key. Otherwise, X contains a key. So property (4) is established. Given a relation R(A 1 ,A 2,
(5) Primary Key Among all keys of the relation R one is usually chosen to serve as an identifier of entities represented by the tuples of R. This key is called the primary key. Its purpose requires that none of its attributes may ever obtain an undefined (unknown) value, since otherwise we would not know what entity a tuple with an undefined value of the primary key represents. Functional dependencies are assumed to have the following basic properties: (6) Basic Properties of Functional Dependencies (i) Uniqueness. Iff: X--* Y and g: X--* Y are functional dependencies, then f = g. This means that there exists at most one functional dependency with a given domain and a given codomain. (ii) Projectivity. If X is a subset of Y, where X and Y are sets of attributes, then there exists a functional dependency Y -* X. This means that a set of attributes functionally determines all its subsets. Functional dependencies of this type are called trivial since they hold in every relation. (iii) Transitivity. Iff: X -+ Y and g: Y -- Z are functional dependencies of a
relation R, then there exists a functional dependency X -- Z. This functional dependency corresponds to the composition of functions R[X] -+ R[Y] and R[Y] --+R[Z]. (iv) Additivity. If X -+ Y and X
-
Z are functional dependencies, then there
exists a functional dependency X -* YZ. This functional dependency corresponds to the function R[X] -* R[YZ] which sends a tuple x to the pair (yz), where y and z are determined by the given functional dependencies X - Yand X-* Z. To illustrate the basic properties of functional dependencies consider again the relation DIRECTORY(NAME, ADDRESS, CITY, ZIP, PHONE). We have ADDRESS, CITY--*ADDRESS and ADDRESS, CITY -) CITY, which illustrates projectivity. From the given functional dependencies ADDRESS, CITY
--
ZIP and ZIP
-+
CITY we derive the
existence of the functional dependency ADDRESS, CITY -+ CITY using transitivity. This derived functional dependency is precisely the trivial functional dependency ADDRESS, CITY -+ CITY by uniqueness. From NAME
-+
ADDRESS and NAME
-+
PHONE the existence of the func-
2
39
Logical Dependencies
tional dependency NAME--* ADDRESS, PHONE follows by additivity. Using the same property we derive NAME -* CITY, ZIP from NAME
-*
CITY and NAME -* ZIP. Applying additivity once again, this time to the functional dependencies NAME
-+
ADDRESS, PHONE and NAME-*
CITY, ZIP, we obtain the functional dependency NAME -* ADDRESS, CITY, PHONE, ZIP, which says that NAME is a key of the relation DIRECTORY. Consider now more carefully what the uniqueness property means in the given example. Suppose now that the composite functional dependency ADDRESS, CITY
-+
ZIP
-*
CITY is not equal to the trivial functional
dependency ADDRESS, CITY --+ CITY. In that case, we would have two cities determined by a single pair (address, city) and two different functional dependencies. Semantic ambiguities of this type are explicitly forbidden by the uniqueness property. Basic properties of functional dependencies may be regarded as axioms characterizing functional dependencies. On the basis of these axioms one can derive formally other important properties of functional dependencies. The collection of derived properties which follows contains those derived functional dependencies which are particularly important in the logical design of relational databases. (7) Derived Properties of Functional Dependencies (i) Distributivity. If the functional dependency X -- YZ holds, then the functional dependencies X -+ Y and X -* Z hold as well. (ii) Pseudotransitivity. If the functional dependencies X
-*
Y and YW -+ Z
hold, then the functional dependency XW -* Z holds as well. (iii) Augmentation. If the functional dependency X-* Y and W-* Z hold, then so does the functional dependency XW -* YZ. We now prove that properties (7) indeed follow from properties (6). (8) Proof of the Derived Properties of Functional Dependencies (i) Distributivity. Functional dependencies X -* Y and X -* Z are con-
structed as the composites from the given dependency and trivial functional dependencies YZ -* Y and YZ -* Z. So we used the basic properties of pro-
jectivity and transitivity. (ii) Augmentation. From the trivial dependency XW-* W and the given one W-- Z we obtain XW-* Z by transitivity. Likewise, XW-* X and X-* Y produce XW
-*
Y. Applying now additivity to XW -* Z and XW
-*
Y we
obtain XW-- YZ. (iii) Pseudotransitivity. From the given functional dependency X - Y and the trivial functional dependency W
-*
W we derive XW
-+
YW using aug-
40
1 Data Model
mentation. Applying now transitivity to XW XW--+ Z as required.
-*
YW and YW -- Z we obtain
For example, applying pseudotransitivity to the functional dependencies NAME -+ ADDRESS and ADDRESS, CITY -- ZIP of the relation DIRECTORY we obtain NAME, CITY -+ ZIP. The existence of some functional dependencies determines the behavior of the operations of the relational algebra in a very important way. Suppose that we substitute a relation with a pair of its projections. The question is whether we lost anything in the process, that is, whether it is possible to restore the initial relation applying the natural join operation to its projections. It turns out that the answer is affirmative only if some functional dependencies hold. (9) Example of a Loss of Information Consider a relation R(NAME, PHONE, CITY) and suppose that the attributes PHONE and CITY are not functionally dependent on the attribute NAME (NAME +-*PHONE and NAME +-*CITY). This assumption means that a person may have more than one phone in one or more cities. This situation is illustrated by the following table which represents the relation R(NAME, PHONE, CITY) for some moment of time: (i)
NAME
PHONE
CITY
a
t1
C1
a
t2 t'
C2
b
R(NAME, PHONE, CITY)
c,
The projections R[NAME, PHONE] and R[NAME, CITY] of the above relation would, for the same moment of time, be represented by the following tables: (ii)
NAME
PHONE
a
NAME
CITY
a
Cl
a
t2
a
C2
b
t3
b
R[NAME, PHONE]
R[NAME, CITY]
The result of the natural join of the projections R[NAME, PHONE] and R[NAME, CITY] is a relation which for the same moment of time is represented by the following table:
41
2 Logical Dependencies
(iii) NAME
PHONE
CITY
a
t1
c,
a
cC 2
a
t,
c2
a
t2
b
t3
C2
RENAME, PHONE]
*NAME
R[NAME, CITY]
C1
The problem is now clear. From table (iii) we derive the conclusion that there exists a person a who has phone number t 2 in city c, and phone number t, in city c 2 . Both conclusions are wrong. It is easy to see that if either NAME --* PHONE or NAME --+ CITY held, the problem would not have occurred. The relation R[NAME, PHONE] *NAME R[NAME, CITY] would have been equal to the relation R(NAME, PHONE, CITY) which is precisely what one would expect. This observation may be generalized as follows: (10) Restoring a Relation by a Natural Join of Its Projections If X-* Y or X--* Z holds for a relation R(XYZ), where X, Y, Z are sets of attributes, then R(XYZ) = R[XY] *x R[XZ]. To prove property (10) we first observe that R(XYZ) is always a subset of R[XY] *x R[XZ]. Indeed, if xyz is in R(XYZ) then xy is in R[XY] and xz is in R[XZ]. Joining the tuples xy and xz we obtain the tuple xyz, which means that xyz is in R[XY] *x R[XZ]. So every element of the relation R(XYZ) is also an element of the relation R[XY] *x R[XZ]. We still have to prove that given X--* Yor X-- Z the converse is also true; that is, R[XY] *xR[XZ] is a subset of R(XYZ). Suppose that X--* Y holds. Then xyz in R(XYZ) and xy'z' in R(XYZ) implies y = y'. An element of R[XY] *x R[XZ] must have the form xyz. If xyz is in R[XY] *x R[XZ] then there exist y' and z' such that xyz' is in R(XYZ) and xy'z is in R(XYZ). But then X--* Y implies y = y', that is, xyz is in R(XYZ). So we proved that every element of the relation R[XY] *x R[XZ] iz at the same time an element of the relation R(XYZ). R(XYZ) _ R[XY],x R[XZ] and R[XY]*x R[XZ] • R(XYZ) imply R(XYZ) = R[XY] *x R[XZ].
2.2 Multivalued Dependencies Functional dependencies are a particular case of a more general type of logical dependency among the attributes of a relation. These generalized dependencies are called multivalued dependencies. We illustrate them first by an example.
42
1 Data Model
(1) Example of a Multivalued Dependency (i) Consider a relation INVENTORY(ITEM, DEPARTMENT, COLOR) where a triple (a, b, c) belongs to it if and only if the item a has c as one of its colors and is used in department d but possibly in some others as well. So an item may have more than one color and may be used in more than one department. On the other hand, a department may use more than one item in various colors. A specific color determines neither the item nor the department. All this means that no functional dependencies, other than the trivial ones, hold in the relation INVENTORY. (ii) A table which represents the relation INVENTORY for a specific moment of time is given below.
ITEM
DEPARTMENT
COLOR
2 2 2 2 2 3 3 3
brown black brown green brown red yellow blue red yellow blue
C
t
d S
b b b b b b
In table (ii) a nonfunctional dependency exists. It may be described by the following statement: If an item has more than one color and is used in some department, then that department uses all colors of that item. In other words, the set of colors of an item is the same for every department which uses that item. So the set of colors of an item depends only on that item and not on the department which uses it. Likewise, table (ii) shows that the set of departments which use an item depends only on that item and not on its color. We say that there exists a multivalued dependency of the attribute COLOR on the attribute ITEM and a multivalued dependency of the attribute DEPARTMENT on the attribute ITEM. The notion of multivalued dependency is formally defined below.
43
2 Logical Dependencies
(2) Introducing Multivalued Dependencies Let R(XYZ)
be a relation with m + n + r attributes, where X=
{Xi,X 2 ,... , X.}, Y= {Y 1 , Y2 .... IY.}, and Z ={Z 1 ,Z 2. ... Zr} are pairwise disjoint sets of attributes. An m-tuple (Xl, x 2 ,... X.) X-, of values of attri-
butes X 1, X2 ... (Y1, Y2 .... (z 1, z2 .....
X,Xwill be denoted as x. Likewise, y will denote an n-tuple
Y. and z an r-tuple y.) of values of the attributes Y1 , Y2 .... Z,) of values of the attributes Z,, Z 2 , ... , Z,. Yxý will denote the
set of tuples defined as: (i)
Yx= {y: (x,y,z) in R}
The set Yxz is constructed by the operations of restriction and projection of the relation R. Only those m + n + r tuples of R are taken into account whose values of X attributes are equal to x and whose values of Z attributes are equal to z. These tuples are then projected over the attributes Y. For example, given table (1) (ii) we compute the following sets: (ii)
COLORC, = {brown}
COLOR,,
{black}
COLORs1 = {brown} COLORa2 = {green} COLORS2 = {brown}
COLOR, 2 = {red, blue, yellow} COLOR,,3 = {red, blue, yellow} We are now ready for the formal definition of a multivalued dependency. (3) Multivalued Dependency A multivalued dependency X --* Y holds for the relation R(XYZ) if and only
if Yxý = Y.,. for every x, z, and z' for which Yxz and Yxz' are nonempty sets of attributes. For example, we showed that the following holds for the relation INVENTORY: COLOR,, = COLOR, 2
=
COLORb2 = COLORb3
=
{brown} {red, blue, yellow}
According to definition (3) the above equalities mean that the multivalued dependency ITEM --> COLOR holds for the relation INVENTORY.
It is clear that the condition Yxz = Yxz, in definition (3) holds if and only if the following is true: whenever (x, y, z) and (x, y', z') are tuples of the relation R(XYZ), then so are (x, y', z) and (x, y, z'). This is in fact a necessary and sufficient condition for the existence of the multivalued dependency X --* Y, equivalent to condition (3).
44
1 Data Model
(4) Multivalued Dependency A multivalued dependency X--* Y holds for the relation R(XYZ) if and only if the following holds: whenever (x, y, z) and (x, y', z') are tuples of the relation R then so are (x, y', z) and (x, y, z').
For example, (b, 2, red) and (b, 3, blue) are tuples of relation (2) but so are (b, 2, blue) and (b, 3, red). Definitions (3) and (4) of a multivalued dependency may be modified in a straightforward way to obtain the definitions of a functional dependency. If, in definition (3), we require that Yxz, apart from being dependent only on x. contains at most one element, we obtain the condition for the existence of the functional dependency X-. Y. Similarly, in definition (4) we require that whenever (x, y, z) and (x, y', z') are tuples of the relation R, then y = y'. We conclude that functional dependencies are indeed a particular case of multivalued dependencies. (5) Functional and Multivalued Dependencies IfXand Yare disjoint sets of attributes of the relation R(X, Y, Z) such that a functional dependency X--+ Y holds, then so does the multivalued dependency X--o Y. In definition (4) sets of attributes Y and Z have the same relationship with the set X, and because of that we conclude the following: (6) Complementary Multivalued Dependencies If a multivalued dependency X--* Y holds for the relation R(XYZ) then so does the multivalued dependency X--o Z. The above fact is sometimes expressed by the statement which says that the sets of attributes Y and Z are mutually orthogonal. In the relation INVENTORY given in (2) mutually orthogonal attributes are COLOR and DEPARTMENT. This means that ITEM -- COLOR ITEM --* DEPARTMENT and vice versa. Indeed we have
implies
DEPARTMENTbred = DEPARTMENTb, yelow = DEPARTMENTb,blu, = {2,3}
(7) Separating Multivalued Dependencies If we compute the projections INVENTORY [DEPARTMENT, ITEM] and INVENTORY[ITEM, COLOR] of the relation INVENTORY (ITEM, DEPARTMENT, COLOR)
45
2 Logical Dependencies
represented by table (1) (ii) we obtain the following tables: DEPARTMENT
ITEM
ITEM
COLOR
1 I
c
c
brown
t
t
black
1
s
s
brown
2 2 2 3
d s b b
d b b b
green red blue yellow
INVENTORY[DEPARTM ENT, ITEM]
INVENTORY [ITEM, COLOR]
It is easy to verify that we lost nothing in the above decomposition. Indeed, if we perform the operation of natural join over the attribute ITEM of the above two projections we obtain the original relation INVENTORY. The example was chosen to satisfy this property. But it is sufficient to delete the triple (b, 2, red) from the relation INVENTORY(ITEM, DEPARTMENT, COLOR) and the property is lost. After deletion of the triple (b, 2, red) the dependencies ITEM --* COLOR and ITEM --- DEPARTMENT do not hold
anymore. Projections (7) do not change, but their join contains the triple (b, 2, red). This means that it is possible to derive an erroneous conclusion from the projections of the relation INVENTORY (which is that department 2 uses a red item b). It turns out that a necessary and sufficient condition for a relation to be equal to the join of its projections (see 2.1.10) is stated in terms of multivalued dependencies. Remember that we showed that R(XYZ) is always a subset of R[XY] *x R[XZ]. On the contrary, we just showed using the relation INVENTORY as an example that the converse is not always true. R[XY] *x R[XZ] = R(XYZ) holds if and only if the multivalued dependency X -- Y holds.
(8) Restoring a Relation by a Natural Join of Its Projections The relation R(XYZ), where X, Y, and Z are sets of attributes, is equal to the natural join R[XY] *, R[XZ] of its projections R[XY] and R[XZ] if and only if there exists a multivalued dependency X --- Y.
Observe that if a functional dependency exists, then so does the multivalued dependency X --* Y. The existence of the functional dependency X
-*
Y implies by 2.1(10) that R[XY] *,R[XZ] = R(XYZ). However, if X- Y holds, then we cannot apply 2.1(10) since the existence of a multivalued dependency does not imply the existence of a functional dependency. To prove (8) we proceed as follows. Suppose that xyz is in R[XY]*x R[XZ]. Then there exist y' and z' such that xy'z and xyz' are tuples of the relation R(XYZ). But then on the basis of X - Y [see (4)] we conclude that
46
1 Data Model
xyz and xy'z' are also tuples of the relation R(XYZ). Therefore, xyz in R(XYZ), i.e., R[XY] *x R[XZ] si R(XYZ). Since the converse R(XYZ) s_ R[XY]*xR[XZ]
is always true, we in fact proved that X ---* Y implies
R(XYZ) = R[XY] *x R[XZ]. Suppose now that R(XYZ) = R[XY] *x R[XZ] and let xyz and xy'z' be
elements of the relation R(XYZ). If so, then xy and xy' must belong to the relation R[XY] and xz and xz' must belong to the relation R[XZ]. But then by joining xy' and xz and xy and xz' we obtain xyz' and xy'z as elements of the relation R[XY]*,R[XZ]. Because of the assumed equality R[XY]*x
R[XZ] = R(XYZ), xyz' and xy'z must also belong to the relation R(XYZ) which according to (4) means that the multivalued dependency X -o Y holds for R(XYZ). (9) Trivial Multivalued Dependencies Multivalued dependencies X--* 0 and X -- Y hold for any relation R(XY), where X and Y are disjoint sets of attributes. The above assertion is easy to prove. If we take Z = 0 in definition (3) of a multivalued dependency, then the sets Y1z and Y.,z are empty and thus trivially equal. (10) Properties of Multivalued Dependencies Multivalued dependencies behave in a much more complex way then functional dependencies. For example, from X--• Y and Y--+ Z we cannot conclude that X--Z holds. Indeed, XN Y-= 0 and YNfZ = 0 do not imply XN Z = 0. But if YN Z = 0 is not guaranteed, according to definitions (3) and (4) we cannot even talk about the existence of a multivalued dependency X-oZ. On the other hand, if X, Y, and Z are pairwise disjoint sets of attributes, then X--" Y and Y--o Z imply X--" Z. The transitivity of functional dependencies is replaced by a more restrictive property which holds for multivalued dependencies. It is called restrictive transitivity and we present it without proof. (i) Restrictive Transitivity. If X -- Yand Y-
Z hold, then so does X - Z -
Y.
The properties of additivity and distributivity of functional dependencies taken together mean that X -*
Y and X -* Z hold if and only if X -* YZ
holds. For multivalued dependencies only the additivity holds. In fact, we have the following properties of multivalued dependencies which we present without proof. (ii) Union, Intersection, and Difference. If X -+> Y and Xthe following multivalued dependencies:
Z hold, then so do
47
2 Logical Dependencies
X-
YUZ
x-**ynz X-,Y-Z
X-+Z-
Y
2.3 Join Dependencies (1) Join Dependency Suppose that R(A 1 , A 2 . A.) is a relation and {I 1, X2 .. , Xm} is a collection of subsets of {A 1 ,A 2 ... . An} such that X, U X 2 U*. UXm = {A 1,A 2 , .... A,}. We say that R obeys the join dependency *{X 1 ,X 2 ,....Xm} if and only if R = R[X] • R[X 2] ,... Observe that the expression R[X]
*
R[X 2 ]
*
R[X.]
*...*
R[X,,] makes sense since the
natural join operation is commutative and associative. Join dependency is trivial if and only if it always holds. It is easy to see that the only trivial join dependencies
Xi
{A 1 ,A 2 ...
X2 ...
*•{1,
A,} for some i = 1, 2 ...
, X.} are those for which
, m.
(2) Join Dependencies and Multivalued Dependencies The relation R(XYZ), where X, Y, and Z are nonempty disjoint sets of attributes obeys the join dependency *{XY, XZ} if and only if the multivalued dependency X -+ Y holds in R.
Proof of the above statement is immediate since we already proved that R(XYZ) = R[XY] *, R[XZ] if and only if X
--
Y holds in R.
(3) Join Dependencies and Key Dependencies We say that the join dependency * {X1, X 2 ,. ... , Xm} is the result of key dependencies of R if and only if every join in the expression R[X 1] * R[X 2] * .". * R[Xm] is performed on a superkey irrespective of the order in which the joins are performed. (4) Example of a Join Dependency Suppose that we are given a relation R(ABCD) together with the key dependencies A -- BCD B- ACD
The join dependency * {AB, AD, BC} is the result of key dependencies since
48
1 Data Model
the join R[AB] * R[AD] * R[BC] may be computed in one of the following ways: (i)
(R[AB]
(ii)
(R[AB] *B R[BC]) *AR[AD]
*A
R[AD])
*,
R[BC]
In view of the fact that A and B are the keys of the relation R we have the following equalities: (iii)
(R[AB] *A R[AD]) *B R[BC] = R[ABD] *, R[BC] = R(ABCD)
(iv)
(R[AB] *B R[BC]) *A R[AD] = R[ABC] *A R[AD] = R(ABCD)
So the join dependency *{AB, AD, BC} is indeed the result of the key dependencies A -- BCD and B
--
ACD.
3 Hierarchical and Network Models 3.1 Unnormalized Relational Schemas (1) Example of a Normalized Schema Consider a relational schema in the first normal form: (i)
DEPARTMENT(DEPT #, NAME, CITY) EMPLOYEE(EMP #, NAME, JOB, DEPT #) JOB(EMP #, PERIOD, SALARY)
Primary keys are distinguished with the symbol #. In addition to that, we assumed that the following functional dependencies hold: (ii)
EMPLOYEE -. DEPARTMENT JOB -+ EMPLOYEE
This is the first time that we stated logical relationships of the functional type among sets of attributes which represent different entities. The logical dependencies are represented by values of suitable attributes in the schema. More specifically, the functional dependency EMPLOYEE -- DEPARTMENT is represented by the attributes DEPT# in the relations EMPLOYEE and DEPARTMENT. An employee tuple contains as its component the number of the department to which that employee belongs. Since DEPT# is the primary key of the relation DEPARTMENT, a unique department is determined. Likewise, a triple of the relation JOB contains the number of the employee to whose job history it belongs. But since EMP# is the primary key of the relation EMPLOYEE, a unique employee tuple is determined. In opposition to schema (i) we will now consider a relational schema that is not in the first normal form.
49
3 Hierarchical and Network Models
(2) Example of an Unnormalized Relational Schema (i)
DEPARTMENT(DEPT #, NAME, CITY, EMPLOYEE(EMP #, NAME, JOB(PERIOD, SALARY)))
Observe that the domain of the attribute EMPLOYEE of the relation DEPARTMENT is not simple; it is itself a relation. This means that each tuple of the relation DEPARTMENT will contain a set of employees' tuples working in that department. Likewise, the domain of the attribute JOB of the relation EMPLOYEE is not simple; it is itself a relation. The set of pairs which represents the value of that attribute for a particular employee tuple represents the job history of that employee. The hierarchical model permits representation of entities and functional dependencies among them. It may be characterized in the following way: (3) Hierarchical Schema A hierarchical schema consists of two sets: (i) A set of entity types E. Particular entity types will be denoted as E 1 , E2, . ...
E..
(ii) A set of logical relationships L among entity types. The relationship between entity types Ei and Ej will be denoted as Lij. The logical relationships requirements:
among entity types satisfy the following
(iii) There exists at most one relationship Lii between different entity types Ei and Ej. As a consequence, there is no need to name the relationships. (iv) There exists no relationship between an entity type E and itself, that is, Li is not a member of the set L. Entities and their relationships may be represented by a graph in which entities are represented by the nodes and the relationships by the edges of the graph: (v) The existence of the relationships Lij, Lj+ .... sented by:
iEm
Li, is graphically repre-
50
1 Data Model
(4) Example of a Hierarchical Structure (i) DEPARTMENT(DEPT#, NAME, EMPLOYEE(EMP #, NAME, SALARY), PART(PART #, NAME, WEIGHT, SUPPLIER (SUPP #, NAME, QUANTITY, CITY))) Using conventions (3)(v), the above hierarchical relation is represented by the graph (ii).
(ii)
(iii) We thus assumed the existence of the following functional dependencies: EMPLOYEE -+ DEPARTMENT PART -ý DEPARTMENT SUPPLIER - PART
The meaning of these assumptions is that an employee belongs to a unique department, that a part is used in only one department, and that a supplier supplies only one part. The restrictiveness of these assumptions is obvious. (5) Anomalies of the Hierarchical Model (i) Insertion Anomaly. This anomaly is illustrated by the fact that an employee cannot be inserted into the database if he or she does not belong to a particular department. Likewise, a part cannot be inserted into the database if there is no department which uses it. In the same manner a supplier cannot be inserted into the database if it does not supply at least one part. These conditions reflect precisely the assumption about the existence of functional
relationships EMPLOYEE
--
DEPARTMENT,
PART
-*
DEPARTMENT, and SUPPLIER -* PART and as such are not necessarily anomalies. But the problem is that these assumptions are often too restrictive and so do not reflect the actual semantics of data. Under such circumstances the above described behavior will in fact be undesirable.
51
3 Hierarchical and Network Models
(ii) Deletion Anomaly. This anomaly is illustrated by the fact that deletion of a department causes deletion of its employees and all parts which it uses. Likewise, deletion of a part causes deletion of all suppliers who supply that part. Although all these effects reflect precisely the assumption about the existence of the functional dependencies EMPLOYEE -- DEPARTMENT, PART -+ DEPARTMENT, and SUPPLIER -* PART, in many situations they may be quite undesirable and are thus called anomalies. (iii) Update Anomaly. Suppose that an employee may work in more than one department. This means that EMPLOYEE +-+DEPARTMENT contrary to the introduced assumption. Furthermore, it is much more realistic to assume that a part may be used in more than one department, which means PART +-*DEPARTMENT. Similarly, a supplier will in general supply more than one part and so SUPPLIER +-+PART. Under these new assumptions, representations (5)(i) and (5)(ii) are not adequate anymore. If we are forced to keep them, then it is necessary to duplicate an employee tuple for each department it belongs to, a part tuple for each department that uses that part, and a supplier tuple for each part it supplies. Under such circumstances we cannot update a single tuple of an employee, a part, or a supplier, since we may cause inconsistencies in the database. Whenever we make such an update we must guarantee that all duplicate tuples are updated. This situation is called the update anomaly.
3.2 Network Model The most important network model (CODASYL) is essentially based on functional relationships among entities. (1) Sets If E1 and E2 are entity types such that there exists a functional relationship E2 E, then the ordered named pair E(E 1, E2 ) is called a set. E 1 is called Fthe owner of the set E and E 2 the member of the set E. (2) Example of a Set If we assume that EMPLOYEE -- DEPARTMENT, then the pair ASSIGNMENT(DEPARTMENT, EMPLOYEE) is a set, which is graphically represented as follows:
(i)
DEPARTMENT
ASSIGNMENT EMPLOYEE
52
1 Data Model
(ii) The functional interpretation of the above set is illustrated by the following graph: EMPLOYEE
ASSIGNMENT
DEPARTMENT
r2 r3 r4
r5 r6 r7r8 r9
rio
-
(3) Set Occurrences An occurrence of a set consists of an occurrence of the owner entity type E and zero or more occurrences of the member entity type E. In terms of functional relationships this means that a set occurrence consists of a set of member entities and the owner entity to which they are mapped according to the functional relationships. So, for example, set occurrences of the set ASSIGNMENT defined above are {ad,
{r
{d 2,{
1
,r
3
,r
7
,rj 1
0
}}
jd3, {r2, r,, r5,r6, r,,} },
Among the above set occurrences there is one that contains the empty set of members. A set occurrence that contains only the owner entity is called the empty set occurrence. (4) Uniqueness of Owner The fact that a set is in fact a representation of a functional relationship may be expressed by the condition that an entity may participate as a member in only one occurrence of a given set. Because of that, a member entity has a unique owner entity in every set in which it participates as a member. This condition is illustrated in (2)(ii). An employee who participates in two occurrences of the set ASSIGNMENT means that EMPLOYEE +-* DEPARTMENT, contrary to the introduced assumption. The difference between hierarchical and network structures may now be phrased in terms of the owner-member relationships as follows:
53
3 Hierarchical and Network Models
(5) Hierarchical Structure (i) An entity type may be the owner in one or more sets, but as a member it can participate in only one set. In terms of functional relationships this condition has the following form: (ii) An entity type may be the domain of only one functional dependency among entity types. It cannot be the domain and the codomain of a simple or composite functional dependency among entity types, other than the identity one. (6) Network Structure (i) An entity type may be the owner of one or more sets and the member in one or more other sets, possibly at the same time. An entity type cannot be the owner and the member of the same set, directly or indirectly. In terms of the functional relationships the above condition means the following: (ii) An entity type may be the domain of one or more functional dependencies among entities as well as the codomain of some other functional dependencies. It cannot be both the domain and the codomain of the same functional dependency (other than identity) even if it is a composite one. The need for network structures shows up when we want to represent relationships which are not of the functional type. (7) M: N Relationships Among Entities An example of an M: N relationship is the relationship between the entities SUPPLIER and PART. A supplier will in general supply a number of parts, and a given part will in general be supplied by a number of suppliers. This means that PART +-+SUPPLIER and SUPPLIER +-+PART. Because of that, this relationship cannot be represented by two sets as in (i) since sets are functional relationships
(i) SUPPLIER
SUPPLIES PART SUPPLIED
Representation (i) does not obey the condition of the uniqueness of the owner. Indeed, a part which is supplied by more than one supplier will participate as a member in more than one occurrence of the set SUPPLIES. Symmetrically, a supplier who supplies more than one part will participate as
54
1 Data Model
a member in more than one occurrence of the set SUPPLIED. This situation is illustrated in (ii).
(ii)
Two occurrences of the set SUPPLIES with the common member
Two occurrences of the set SUPPLIED with a common member
One further observation which shows that representation (i) is not adequate is that an M: N relationship between entities may have its attributes. An example of such an attribute for the relationship between suppliers and parts is the quantity supplied. It is neither an attribute of the supplier nor is it an attribute of the part. All this shows that an M: N relationship is frequently an (abstract) entity by itself, which has its attributes as any other entity does. That is precisely how M: N relationships are represented with the network (CODASYL) model. For the given example such an associative entity may be the order, whose attributes are quantity, total value, and so on. This associative entity becomes the member entity in two sets: SUPPLIES and SUPPLIED, according to figure (iii).
(iii)
55
3 Hierarchical and Network Models
The above relationships would be represented in the relational model by three relations, corresponding to the above three entities. However, while in the network model two sets are required as well, in the relational model nothing else is necessary. The functional relationships for which sets are used are in the relational model represented by values of suitable attributes. (iv) The relational model which corresponds to (iii) would have the following form: SUPPLIER(SUPPLIER #, NAME, CITY) PART(PART #, NAME, WEIGHT) ORDER(SUPPLIER #, PART #, QUANTITY, VALUE)
(8) Assembly of Parts There exist situations in which an entity type is both the owner and the member of the same set. A typical example is an assembly of parts. A part is assembled using a number of subparts; each one of them consists of its own subparts and so on. The network model (CODASYL) forbids direct representation of this situation as:
(i) I
CONSISTS -OF
PART a
A permitted representation is:
(ii)
PART
SUBSTRUCTURE
SUPERSTRUCTURE
ASSEMBLY
Separating the upper diagram in two we see that this is in fact a particular case of an M: N relationship:
56
1 Data Model
(iii) SUBST
The relational model would consist of two relations, one of them repre-
senting parts and the other their assembly. PART(PART #, NAME, SUBPARTS) ASSEMBLY(SUPERPART #, SUBPART #, NUMBER) The value of the attribute SUBPARTS represents the number of subparts which a given part has. For a given tuple of the relation ASSEMBLY, SUPERPART contains the key of a part, SUBPART the key of one of its subparts, and NUMBER contains the number of occurrences of that subpart required in the assembly of the part.
Exercises (1) Bound Variables (i) SQL permits variables bound to particular relations to appear in query blocks. The need for such variables and the advantages of their use are demonstrated by the following query: SELECT FROM WHERE AND AND
X.NAME, X.SALARY, Y.NAME EMPLOYEE X, DEPARTMENT Y X.JOB ='PROGRAMMER' DEPARTMENT # = Y.DEPARTMENT # Y.LOCATION = 'SAN JOSE'
where the relations EMPLOYEE and DEPARTMENT are specified as follows: EMPLOYEE(EMPLOYEE #, NAME, JOB, SALARY, MANAGER #, DEPARTMENT#) DEPARTMENT(DEPARTMENT #, NAME, LOCATION) (ii) Using two bound variables for the same relation specify the following query: For each employee whose salary exceeds the manager's salary, list the employee's salary and the manager's name.
57
3 Hierarchical and Network Models
Observe that this query contains two joins, one natural and the other one which is based on the relation < (2) Standard Function COUNT For the relational schema EMPLOYEE(EMPLOYEE #, NAME, JOB, DEPARTMENT #) DEPARTMENT(DEPARTMENT #, NAME, LOCATION) specify the following queries: (i) How may different jobs are held by employees in the department 'iro-8'? (ii) Delete all departments having no employees. Note that COUNT(*) denotes the number of tuples which satisfy the WHERE clause of the query block. (3) Nested Queries For a given relational schema with entities specified as follows: PROJECT(PROJECT #, NAME, BUDGET, LOCATION) PART(PART #, NAME, PRICE, COLOR) SUPPLIER(SUPPLIER#, NAME, LOCATION) and the three-way association SHIPMENT(PROJECT #, PART #, SUPPLIER #, QUANTITY) specify the following queries using nesting of query blocks: (i) Find the project numbers of projects which are located in New York. (ii) Find the part numbers of parts whose unit price is greater than 100. (iii) Describe in English the following SQL query: SELECT FROM WHERE
SHIPMENT.PART# SHIPMENT, SUPPLIER SHIPMENT.SUPPLIER# SUPPLIER.SUPPLIER #
-
(iv) Find the supplier numbers of suppliers which supply parts whose unit price is greater than 100. (v) Find the names of suppliers which supply all red parts (division). (vi) Find the supplier numbers of suppliers which supply projects located in New York. (vii) Consider the following two nested blocks intended to express correctly the following query:
58
1 Data Model
Find the names of suppliers that do not supply parts whose part number is P#. SELECT FROM WHERE
NAME SUPPLIER SUPPLIER#
SELECT FROM WHERE
NAME SUPPLIER SUPPLIER# IS IN SELECT SUPPLIER # FROM SHIPMENT WHERE PART# O'P #'
IS NOT SELECT FROM WHERE
IN SUPPLIER # SHIPMENT PART# = 'P #'
Are the above two SQL queries equivalent in the sense that they always produce the same relation as their result? (4) Set Operators (i) It is important to distinguish between the effects of AND and OR logical operators and the effects of UNION and INTERSECT set operators. In the first case logical expressions connected with logical operators are applied to the same tuple whereas in the second case sets of tuples (relations) are computed and the set operator is applied to those sets as operands. (ii) Suppose that the EMPLOYEE table contains the following tuples:
NAME
DEPARTMENT
MANAGER
John Fred Fred Bill
Shoe Shoe Toy Toy
Bob Frank Bob Frank
Check whether the results of the following two SQL queries are the same: SELECT FROM WHERE AND
NAME EMPLOYEE DEPARTMENT = 'SHOE' MANAGER = 'BOB'
59
3 Hierarchical and Network Models
SELECT NAME FROM EMPLOYEE WHERE DEPARTMENT = 'SHOE' INTERSECT SELECT NAME FROM EMPLOYEE WHERE MANAGER = 'BOB' (5) GROUP BY Clause (i) A relation may be partitioned into groups according to the values of some attributes and then a standard function applied to each group. That is the purpose of the GROUP BY clause which is always used together with a standard function. When a GROUP BY clause is used each item in the SELECT clause must be a unique property of a group rather than of a particular tuple. Describe the result of the following query: SELECT FROM GROUP BY
DEPARTMENT #, AVG(SALARY) EMPLOYEE DEPARTMENT#
where the EMPLOYEE relation is specified as EMPLOYEE(EMPLOYEE #, NAME, JOB, SALARY, DEPARTMENT #) (ii) When a relation is partitioned into groups then a predicate may be applied to choose only those groups for which that predicate is true. These group-qualifying predicates are always based on a standard function and placed in a special HAVING clause. Describe the result of the following SQL query: SELECT FROM GROUP BY HAVING
DEPARTMENT # EMPLOYEE DEPARTMENT# AVG(SALARY) < 20000
(iii) Describe the result of the following SQL query: SELECT DEPARTMENT # FROM EMPLOYEE WHERE JOB = 'CLECK' GROUP BY DEPARTMENT# HAVING COUNT(*) > 10
60
1 Data Model
(6) SET Function (i) SQL has a special function SET which may be used in a logical (qualification) expression. The result of this function is a set of values of particular attributes which are present in a given group. Describe the meaning of the following query: SELECT FROM GROUP BY HAVING
DEPARTMENT # EMPLOYEE DEPARTMENT# SET(JOB) = SELECT JOB FROM EMPLOYEE
where the EMPLOYEE table is specified as EMPLOYEE(EMPLOYEE #, NAME, JOB, SALARY, DEPARTMENT #) (ii) Suppose that we are given the following two tables: SUPPLY(SUPPLIER, PART) USAGE(DEPARTMENT, PART) and the following query containing the division operation: List the suppliers that supply all the parts used by the toy department. SELECT FROM WHERE
SUPPLIER SUPPLY X (SELECT PART FROM SUPPLY WHERE SUPPLIER = X.SUPPLIER) CONTAINS (SELECT PART FROM USAGE WHERE DEPARTMENT = 'TOY')
The above query searches for those tuples X (bound variable) of the SUPPLY table such that the set of parts supplied by the supplier in tuple X (computed by the first nested query) contains the set of parts used by the TOY department (computed by the second nested query). Specify an equivalent restatement of the above SQL query using a GROUP BY clause and the special function SET eliminating the bound variable X altogether. (7) Entity Relationship Diagrams (i) Entity relationship diagrams are used to represent entities and their relationships according to the following simple conventions:
3 Hierarchical and Network Models
61
a. Rectangles represent entity types. b. Diamonds represent relationships. They are linked to the entity types participating in the relationship by undirected edges. These edges are labeled in such a way that the type of relationship (M: N or I : N) is indicated. (ii) Consider the following entity relationship diagram:
Select the appropriate attributes of the above entities and their relationships and specify a relational schema that corresponds to the above diagram. (8) Generalized Joins (i) Let 0 denote a relation from the set { }. The join of a relation R on the set of attributes A with a relation S on the set of attributes B is defined as R[AOB]S = {(rs) : r in R and s in S and (r[A]Os[B])]} In the above definition we assume that the domains of A and B are the same and that the relation 0 is defined on that common domain. (ii) An example of a generalized join in SQL is given in the query: SELECT FROM WHERE AND
X.NAME, Y.NAME EMPLOYEE X, EMPLOYEE Y X.MGR# = Y.EMP# X.SALARY > Y.SALARY
where the relation EMPLOYEE is specified as follows: EMPLOYEE(EMP #, NAME, SALARY, MGR #) Describe in English the result of the above query.
62
1 Data Model
(iii) Consider the following relational schema which contains representation of two types of object (SHIP and OFFICER) and two types of event (ASSIGNMENT of an officer to a ship and a ship INCIDENT). SHIP(SHIP #, NAME, TYPE, COUNTRY, PORT, CAPTAIN) OFFICER(OFFICER #, NAME, SENIORITY, COMMANDER #) ASSIGNMENT(OFFICER #, SHIP #, START, END, ROLE) INCIDENT(SHIP #, DATE, DESCRIPTION) Specify the following SQL query: Find the names of those captains which have been involved in a ship accident. (9) M: N Relationships (i) Consider the entity type OFFICER and the relationships SUBORDINATE and SUPERIOR of that entity type with itself as represented by the following diagram and the associated relational schema:
OFFICER (OFFICER #, NAME, RANK) HIERARCHY(SUBORDINATE #, SUPERIOR #) (ii) Using SQL find the names and the ranks of all immediate subordinate officers of a particular officer. (iii) Symmetrically, find the names and the ranks of all immediate superior officers of a particular officer. (10) Lossless Joins (i) Suppose that we are given a relation R(EMP #, DEPT #, MGR #, PROJECT #) where the above specified attributes have the following meaning: EMP # DEPT # MGR # PROJECT #
employee identifier department identifier employee identifier of department manager project identifier
We assume the following logical dependencies:
Bibliographical Notes
63
An employee is assigned to only one department but a department has many employees. A department has only one manager and a manager manages only one department. A department performs work on a single project but several departments are involved in work on a particular project. (ii) Specify explicitly the functional dependencies implied by the above assumptions. (iii) Specify all possible decompositions of the relation R into its projections whose join restores the relation R.
Bibliographical Notes The relational model of data was introduced by Codd (1970). The first two formal relational languages, relational algebra and relational calculus, are defined by Codd (1971b). Some of his more recent views are expressed in his ACM Turing award lecture (Codd, 1982) and another paper (Codd, 1979). A unified data definition, manipulation, and query language SQL (formerly SEQUEL) is thoroughly explained in the paper by Chamberlin et al. (1974). Functional dependencies were first considered by Codd (1971a). Among many other papers on that topic we mention Beeri and Bernstein (1979). Multivalued dependencies were introduced by Fagin (1972) and Zaniolo and Melkanoff (1981). Lossless joins are discussed in detail in the paper by Aho et al. (1979) and Fagin (1979). A good general reference on relational database theory is the book by Maier (1983). The literature on the hierarchical and network models is very extensive. We mention just two general references, the books by Tsichritzis and Lochovsky (1982) and Date (1983). A nice overview of various approaches to date models and data modeling is given in Brodie (1984).
CHAPTER 2
Logical Design
1 Normal Forms The first design principle which we will study states: Every entity should be represented by a separate relation. This principle applies equally to all entities including those which are in fact relationships. It requires little justification. In fact, we already showed that the hierarchical (unnormalized) model exhibits the insertion, deletion, and update anomalies. The reason is precisely the fact that more than one entity was represented by a single relation. To apply the above design principle we have to have criteria which determine what a single entity is. Informal, semantic rules are very important but often imprecise. Normal forms are formal rules for determining whether or not a relation represents a single entity. In addition to the first normal form which we already defined, we will study the second, third, fourth, and fifth normal forms. The second and third normal forms require knowledge about functional dependencies, whereas the fourth and fifth normal forms are based on multivalued and join dependencies, respectively. Higher normal forms represent stronger rules from the first normal form which proves to be much too weak. In fact, even if a relation is in the first normal form it still exhibits the insertion, deletion, and update anomalies. The hierarchy of normal forms is such that if a relation satisfies the requirements of some normal form, then it also satisfies the requirements of all lower-order normal forms. The converse is not true, which means that the hierarchy of normal forms is proper, that is, every normal form is a stronger form than the preceding one in the hierarchy.
65
1 Normal Forms
1.1 Second Normal Form (2NF) Consider the relation R(SUPPLIER, ITEM, CITY, QUANTITY). Assume that a tuple (x, y, z, q) belongs to the relation R if and only if a supplier x supplies quantity q of item y from the city z. Furthermore, suppose that a supplier may supply more than one item and that an item is supplied by more than one supplier. On the other hand, we will assume that a supplier is located in only one city. These assumptions are expressed formally in (1)(i). (1) Example with Anomalies (i) The model is defined as follows: R(SUPPLIER, ITEM, CITY, QUANTITY) SUPPLIER +-*ITEM ITEM +-*SUPPLIER SUPPLIER -* CITY (ii) To study the undesirable properties of the above model suppose that the relation R is for some specific moment of time represented by the following table: SUPPLIER
ITEM
CITY
QUANTITY
a
1
p
a a b b
2 3
p p q q
q, q2
I 3
q3 q4 q5
Although the relation R is in the first normal form, it exhibits all three anomalies: insertion, deletion, and update. The relation R has the insertion anomaly since it is impossible to insert into it the fact that supplier x is located in city z unless x supplies at least one part. Even if we allow undefined (unknown) values in relations that does not help us in this case. Indeed, ITEM is part of the only, and thus necessarily primary, key SUPPLIER, ITEM. Because of that, the attribute ITEM may not assume the undefined value. The situation which is symmetric to the one described shows that the relation R has the deletion anomaly. If we suppose that at some later moment of time supplier b stops supplying items 1 and 3, we encounter a problem. Deletion of the tuple (b, 1, q, q4 ) does not cause any undesirable effects, but when we delete the tuple (b, 3, q, q,), we lose the information that supplier b is located in city q. Leaving the tuple (b, -, q, q,) in the relation R, where - denotes an unknown value, is not permitted, since SUPPLIER, ITEM is the primary key of the relation R. The relation R has a further undesirable property which is a consequence
66
2 Logical Design
of the fact that the same value of the attribute CITY is repeated in every tuple of a particular supplier. Suppose that a supplier moves to another city. To update the relation R we have to update all tuples of that supplier in R. In fact, we cannot permit independent updates of the attribute CITY of tuples belonging to a particular supplier, since that may contradict the condition SUPPLIER -- CITY. What we just described is in fact the update anomaly. All the anomalies of the relation R are a consequence of the functional dependency SUPPLIER -+ CITY which is embedded in R. This functional dependency in fact shows that the relation R represents two entities. One of them is the supplier entity, whose attribute is CITY, and the other is the relationship between suppliers and parts, whose attribute is the quantity supplied. If we substitute the relation R with two projections R[SUPPLIER, CITY] and R[SUPPLIER, ITEM, QUANTITY], we obtain separate representations of the two entities and all the anomalies disappear. This can be easily verified in (iii): (iii)
SUPPLIER
CITY
SUPPLIER
ITEM
QUANTITY
a
p
a
q
a
1 2
q,
b
a
3
b
1
q3 q4
b
3
q5
q2
The existence of the functional dependency SUPPLIER -* CITY guarantees that the natural join of the projections R[SUPPLIER, CITY] and R[SUPPLIER, ITEM, QUANTITY] produces the original relation R(SUPPLIER, ITEM, CITY, QUANTITY). This follows from (1.2.1 (10)) and can be easily verified in (ii) and (iii). The formal treatment of the above informal presentation is based on the observation that the attribute CITY functionally depends on a subset of the set of attributes which constitute the key of the relation R. This type of functional dependency is called partial. (2) Partial Functional Dependency (i) Let X and Y be sets of attributes of a relation R. We say that a functional dependency X -+ Y is full if and only if for every proper subset X' of X we have X' +-*Y. If for some proper subset X' of X we have A' --* Y, then the functional dependency X -- Y is called partial. (ii) An example of a partial functional dependency is SUPPLIER, ITEM -+ CITY. It is partial since SUPPLIER -* CITY. Partial functional dependency may be defined in the following alternative way. (iii) A functional dependency X -* Y is partial if and only if it can be repre-
67
1 Normal Forms
sented as a composite X -+*A' -+ Y where X --+ X' is trivial (but not identical) and X' -+ Y is a nontrivial functional dependency. A partial functional dependency indicates that the relation represents two entities. The key of one of them is X and the key of the other is its proper subset A' such that X' -+ Y. Because of that, the second normal form forbids partial functional dependencies within relations. (3) Second Normal Form (i) Let X be the set of all attributes of the relation R(A 1 , A 2 , .... A,) which do not participate in any key of R. We say that the relation R is in the second normal form if and only if every attribute from X is fully functionally dependent on every key of R. If we call the attributes that participate in at least one key of the relation prime attributes, and the remaining ones nonprime, we obtain the following version of definition (i). (ii) A relation R is in the second normal form if and only if every nonprime attribute of R fully functionally depends on every key of R. In (i) and (ii) we assumed that the relation R is already in the first normal form. The relation R(SUPPLIER, ITEM, CITY, QUANTITY) is not in 2NF because of the existence of the partial functional dependency SUPPLIER, ITEM -+ CITY. The projections R[SUPPLIER, ITEM, QUANTITY] and R[SUPPLIER, CITY] are both in 2NF. The only key of the relation R[SUPPLIER, CITY] is SUPPLIER. Since it consists of a single attribute the existence of partial functional dependencies is excluded. In the relation R[SUPPLIER, ITEM, QUANTITY] the only key is SUPPLIER, ITEM and the only nonprime attribute QUANTITY is fully functionally dependent on it. (4) Special Cases of Relations in 2NF If all attributes of a relation R are prime or all keys of R consist of only one attribute, then R is in second normal form. Observe that definition (3) of second normal form does not forbid the existence of partial functional dependencies of prime attributes on prime attributes. For example, suppose that we are given a relation R(A, B, C, D, E, F) with the following set of functional dependencies. (5) Example of Partial Dependencies Among Prime Attributes R(A, B, C, D, E, F)
ABC -* DE DE - ABC DE -F AB--D E-+ C
68 The keys of R are ABC and DE. C is an example of a prime partially dependent on the key DE. Indeed, DE -+ ABC and we also have E -* C. Likewise, D is a key attribute, but it is dent on the key ABC since ABC -+ DE, and we also have AB not forbid these partial functional dependencies.
2 Logical Design
attribute that is so DE -* C, but partially depen--* D. 2 NF does
(6) 2NF Normalization If a relation R is not in 2 NF then there exists a decomposition of R into a set of its projections which are all in 2NF. Furthermore, this decomposition is such that the original relation R may be restored applying the natural join operation to the projections of R. We already showed that this holds for the relation R(SUPPLIER, PART, CITY, QUANTITY) We now prove that it holds in general. Suppose that R(A 1 , A 2 ..... A,) is not in 2NF. This assumption implies that there exist disjoint subsets X and Y of {A 1, A 2 ,..., A,} such that Y are not key attributes, X is a key, and there exists a partial functional dependency X -+ Y. Denote with Z the set of all attributes of R which are neither in X nor in Y. Then the relation R may be represented as R(X, Y, Z). Since X -+ Y is a partial functional dependency, X may be represented as X'X", where X' - Y is a full functional dependency. In the first step of the decomposition process we substitute R(XYZ) with two projections R[XZ] and R[X'Y]. The natural join of these projections over the attributes X' equals R(XYZ) since we have R[XZ] *x"R[X' Y] = R[X'X"Z] *x" R[X'Y]. But now according to 1.2.1(10) X'-+ Y implies R[X'X"Z] *X. R[X' Y] = R(X'X"ZY) = R(XYZ). We say that the decomposition R[XZ], R[X' Y] of R(XYZ) is lossless. R[X' Y] is in 2NF since X' - Y is full and Y contains all non-prime attributes of R[X'Y]. If R[XZ] is not in 2NF we apply the same procedure to this projection and so on. This decomposition procedure is finite since every step produces projections with a strictly smaller set of attributes. In the worst case we end up with binary relations which are in 2 NF according to property (4).
1.2 Third Normal Form (3NF) It turns out that the discipline imposed by 2 NF is not sufficiently rigorous so that there exist relations which are in 2 NF and still exhibit the insertion, deletion, and update anomalies. We will illustrate such a situation by an example. Suppose that we are given a relation E(EMPLOYEE #, DEPARTMENT #, CITY) and that a triple (x, y, z) belongs to R if and only if employee x works in department y which is located in city z. Furthermore, we will assume that an employee may work in only one department and that a department is
69
1 Normal Forms
located in only one city. On the other hand, each department has a finite number of employees and a number of departments may be located in one city. These assumptions are summarized in (1). (1) Example with Anomalies E(EMPLOYEE #, DEPARTMENT #, CITY) EMPLOYEE# -* DEPARTMENT# DEPARTMENT # +-*EMPLOYEE # DEPARTMENT# -- CITY CITY -+-* DEPA RTM ENT # (i) The above functional dependencies in fact show that E contains attributes which belong to two different entities. One of these entities is an employee, whose attributes are EMPLOYEE# and DEPARTMENT#. The other entity is a department with attributes DEPARTMENT# and CITY. The fact that two different entities are represented by a single relation will cause the insertion, deletion, and update anomalies of this relation. To illustrate these anomalies assume that the relation E is for some specific moment of time represented by table (ii).
(ii)
EMPLOYEE#
DEPARTMENT#
CITY
a b c d e
A A A B B
X X X Y Y
f
C
X
9
C
X
The relation E(EMPLOYEE#,DEPARTMENT#,CITY) has the insertion anomaly since it is impossible to insert into it the information about the location of a department if it still has no employees. We cannot insert a triple (-,y,z) in E, where - denotes an unknown value, since EMPLOYEE # is the primary key of E and as such it may never assume an undefined value. The relation E also has the deletion anomaly. Indeed, deleting the last employee of a specific department, we lose the information about the location of that department. This may be quite an undesirable side effect. A triple with an undefined value of the attribute EMPLOYEE # and specific values of the attributes DEPARTMENT # and CITY cannot be kept in E since EMPLOYEE # is the primary key of E. The relation E has the update anomaly as well. It is caused by the fact that the same value of the attribute CITY is repeated for every employee tuple of
70
2 Logical Design
a specific department. If the location of a department changes, we have to update all employee tuples of that department. Independent updates of the attribute CITY of the relation E are not permitted since they may lead to DEPARTMENT # +-+CITY contrary to assumptions (i).
We thus showed that E has the insertion, deletion, and update anomalies. It is easy to see that E is in fact in 2NF. The only key of E is EMPLOYEE # and thus according to 1.1 (4) the existence of partial functional dependencies is excluded. If we replace the relation E(EMPLOYEE #, DEPARTMENT #, CITY) with its projections E[EMPLOYEE #, DEPARTMENT #] and E[DEPARTMENT#, CITY] we obtain separate representations of two different entities-an employee and a department-and all the anomalies disappear. The existence of functional dependencies DEPARTMENT # -_+ CITY guarantees that application of the natural join operation to E[EMPLOYEE #, DEPARTMENT #] and E[DEPARTMENT #, CITY] produces the original relation E which can be verified for specific representations (ii) and (iii).
(iii)
EMPLOYEE#
DEPARTMENT#
a b c d e
A A A B B
f
C
9
C
E[EMPLOYEE #, DEPARTMENT #]
DEPARTMENT #
CITY
A B C
X Y X
E[DEPARTMENT #, CITY]
Formally, the fact that the relation E represents two different entities is characterized by the statement that the attribute CITY is transitively dependent on the key EMPLOYEE#. The department is an attribute of an employee and thus EMPLOYEE# -+ DEPARTMENT#. The city is an attribute of a department and thus DEPARTMENT # -+ CITY. Therefore, CITY is not directly, but transitively, dependent on EMPLOYEE #. The notion of transitive dependency may be formally characterized as follows:
71
1 Normal Forms
(2) Transitive Dependency (i) Let X, Y, and Z be distinct sets of attributes of the relation R. We say that Z is transitively dependent on X if and only if the following conditions hold: X -y
Y+-*X
Y-.+Z
where the last functional dependency is not trivial. Observe that from (i) we conclude that X --+ Z and Z-- X, while Z -+ Y is
neither required nor forbidden. The graphical representation of the functional dependencies implied by assumptions (i) is given in (ii).
(ii)
/ - -
-Y_
It now becomes obvious that a discipline stronger than 2NF is required which forbids transitive dependencies of nonprime attributes on keys of a relation. (3) Third Normal Form A relation R is in third normal form if and only if none of its non-prime attributes is transitively dependent on any key of R. (4) The Relationship Between 2NF and 3NF (i) If a relation is in third normal form then it is necessarily in second normal form. (ii) There exist relations which are in second normal form but not in third. The above assertions state that 3NF is strictly stronger than 2NF. Assertion (4)(ii) was already proved since the relation E(EMPLOYEE#, DEPARTMENT#, CITY) was shown to be in second normal form but not in third. To prove (4)(i) suppose that a relation R is in 3NF, but there exists its attribute Ai which is partially dependent on the set of attributes X, where X is a key of R. If so, then the partial dependency X--* Ai may be represented as a composite of a trivial and a nontrivial functional depen-
72
2 Logical Design
dency X--* X' --* Ai for some X' c Xsuch that X' --* Ai is not partial. But this means that Ai is transitively dependent on the key X, which contradicts the assumption that R is in 3NF. This proves that a partial dependency X -* Ai cannot exist in R. We showed in a particular example-the relation E(EMPLOYEE#, DEPARTMENT#, CITY)-that a relation which is not in 3NF may be substituted by its projections-E[EMPLOYEE#, DEPARTMENT#] and E[DEPARTMENT #, CITY]-which are in 3NF. Such a procedure is possible in general. (5) 3NF Normalization If a relation R is not in 3NF then there exists a decomposition of R into a set of its projections which are all in 3NF. Furthermore, this decomposition is such that the original relation may be restored applying the natural join operation to the projections of R. The proof of this property is similar to the proof 1.1(6). If X-- Y, Y+-* X, and Y -4 Z hold for R, then R is first decomposed into its projections R[XYW] and R[YZ], where W contains all the attributes which are neither in X nor in Y nor in Z. If the above projections are not both in 3NF we continue the procedure which ends in a finite number of steps since we obtain binary relations in the worst case, and those are certainly in 3NF. 1.3 Boyce-Codd Normal Form (BCNF) The third normal form forbids transitive dependencies of nonprime attributes on keys of a relation. Because of that, the insertion, deletion, and update anomalies exist in those relations for which such dependencies hold for key attributes. (1) Example with Anomalies Suppose that we are given a relation R(CITY, ADDRESS, ZIP) with the functional dependencies as specified below: (i)
R(CITY, ADDRESS, ZIP) CITY, ADDRESS -* ZIP ZIP -* CITY
The only nonprime attribute is ZIP and it certainly does not transitively depend on the only key CITY, ADDRESS. This means that the relation R is in 3NF. In spite of that, it has all the anomalies discussed so far. (ii) Suppose that the set R is, for some moment of time, represented by the following table:
73
1 Normal Forms
CITY
ADDRESS
ZIP
Ca
a,
z1
c'
a2
z2
Cl
a.
Z2
C2
a
Z3
4
R(CITY, ADDRESS, ZIP)
The fact that a zip code belongs to a particular city cannot be entered into R unless we specify some address in that city. The attribute ADDRESS may not assume an undefined value since it belongs to the only key CITY, ADDRESS of R. This shows that the relation R(CITY, ADDRESS, ZIP) has the insertion anomaly. Deleting the last address with a given zip code we lose the information which says to what city that zip code belongs. This is the deletion aomaly of R. (iii) The relation R(CITY, ADDRESS, ZIP) has the update anomaly as well. Suppose that a given zip code (e.g., z 2 ) should be assigned to a different city. Then it is necessary to update all tuples of R in which that zip code appears. Independent updates of the attribute ZIP of particular tuples of R are not permitted, since they may violate the assumption ZIP -+ CITY. If we substitute the relation R(CITY, ADDRESS, ZIP) with its projections R[CITY, ZIP] and R[ADDRESS, ZIP], all the anomalies disappear. The existence of the functional dependency ZIP -+ CITY guarantees that the natural join of R[CITY, ZIP] and R[ADDRESS, ZIP] over the attribute ZIP produces the original relation R(CITY, ADDRESS, ZIP). This can be easily verified for (ii) and (iii). CITY
ZIP
ADDRESS
ZIP
el
Z1
a,
Zl
C1
Z2
a
2
Z2
C2
Z3
a3
Z2
R[CITY, ZIP]
a4
z
R[ADDRESS, ZIP]
The relation R(CITY, ADDRESS, ZIP) satisfies the conditions required by the third normal form, but it does not satisfy the requirements of a stronger normal form which is called the Boyce-Codd normal form. (2) Boyce-Codd Normal Form (BCNF) The relation R(A1 , A 2 .. ,A,,) is in BCNF if and only if the existence of a nontrivial functional dependency X - Y, where X and Y are sets of attributes of R, implies the existence of a functional dependency X -+ A, for all i = 1,2, ... , n.
74
2 Logical Design
Boyce-Codd normal form states that whenever a nontrivial functional dependency X -+ Y exists among sets of attributes of a relation R, the domain of that dependency contains a key of R. The existence of an embedded, nontrivial functional dependency X -* Y indicates that the relation R contains a representation of an embedded entity. This is precisely the reason why the anomalies appear. As a result of the normalization process we obtain a separate representation of this embedded entity. (3) The Relationship Between 3NF and BCNF (i) If a relation is in BCNF then it is necessarily in 3NF. (ii) There exist relations which are in 3NF but not in BCNF. To prove (i) suppose that a relation R(A 1, A 2 ..... A,) is in BCNF but it still has a transitive dependency X -- Y, Y+-). X, Y -+A, for some i (i = 1, 2,..., n), where Xis a key and Ai is not in X since Ai is not a key attribute. Ai is not in Y since otherwise the dependency Y -- Ai would have been trivial, contrary to the definition of transitive dependency. The existence of a nontrivial functional dependency Y -- Ai and the assumption that R is in BCNF imply that Y - Ai for all j = 1, 2, .... , n. But then Y - X contrary to the assumption about the existence of the transitive dependency X -- Y, Y+-+ X, and Y -- Ai. This proves that a transitive dependency cannot exist in a relation which is in BCNF and thus such a relation is also in 3NF. To prove (ii) we simply take as an example the relation R(ABC) for which the functional dependencies AB -- C and C -* B hold. R is in 3NF but not in BCNF. We showed that the relation R(CITY, ADDRESS, ZIP) which is not in 3NF may be decomposed into its projections R[CITY, ZIP] and R[ADDRESS, ZIP] such that their natural join restores R(CITY, ADDRESS, ZIP). This holds in general. (4) BCNF Normalization If a relation R is not in BCNF then there exists a decomposition of R into a set of its projections which are all in BCNF. Furthermore, this decomposition is such that the original relation R may be restored from these projections using the natural join operation. To prove this general property of relations, assume that we have a relation R which is not in BCNF. Then represent R as R(XYZ), where X, Y, Z are sets of attributes, X -* Y, and X+-* Z, and where X -* Y is a nontrivial functional dependency. We first substitute the relation R(XYZ) with its projections R[XY] and R[XZ] and if necessary continue the process, which is finite since we end up with binary relations in the worst case and those are certainly in BCNF.
75
1 Normal Forms
1.4 Fourth Normal Form (4NF) (1) Example with Anomalies Consider again the relation INVENTORY(ITEM, DEPT, COLOR) with the previously adopted assumptions about the existence of multivalued dependencies: (i)
INVENTORY(ITEM, DEPT, COLOR) ITEM -- DEPT ITEM- COLOR
and its representation for a specific moment of time given in (ii): (ii)
ITEM
DEPT
c
1
brown
t s d s b b b b b b
I 1 2 2 2 2 2 3 3 3
black brown green brown red yellow blue red yellow blue
COLOR
Apart from the trivial functional dependencies no other functional dependencies hold for the relation INVENTORY. As a consequence, this relation is in Boyce-Codd normal form. In spite of that, it exhibits the insertion, deletion, and update anomalies. The fact that an item a has a particular color c may be recorded in the relation INVENTORY if and only if there exists a department which uses color c of the item a. Indeed, a triple (a, -, c) may not be inserted in the relation INVENTORY since ITEM, DEPT, COLOR is the only key of this relation and thus necessarily the primary key. Even if there exists a department dwhich uses item a, we have to insert a new triple (a, d, c) for every such department d in the relation INVENTORY. Otherwise, we violate the assumptions ITEM - DEPT and ITEM -*) COLOR. For example, adding a new color green to item b requires insertion of two triples in the relation INVENTORY, (b, 2, green) and (b, 3, green). All this shows that the relation INVENTORY has the insertion anomaly. Deleting a department d from the relation INVENTORY requires deletion of all triples of this relation whose value of the attribute DEPT equals d. Otherwise, the assumptions ITEM --- DEPT and ITEM -- COLOR will be *
76
2 Logical Design
violated. But deletion of this tuple causes further undesirable effects. For example, if we delete department 2 from the relation INVENTORY we have to delete the tuple (d, 2, brown) among others. That way, we lose the information that item d comes in brown color. We cannot leave the triple (d, -, brown) in the relation INVENTORY since ITEM, DEPT, COLOR is the primary key of this relation. So the relation INVENTORY indeed has the deletion anomaly. Independent updates of triples of the relation INVENTORY are not permitted because of the constraints ITEM--- DEPT and ITEM--- COLOR and thus this relation has the update anomaly as well. Indeed, If we want to change one of the colors of an item, we have to do that for all departments in which that item is used. For example, if the color blue of item b is changed to green, we have to update two triples: (b, 2, blue) and (b, 3, blue). The reason for the existence of these anomalies in the relation INVENTORY is that it represents two entities, both of which happen to be relationships among other entites. One entity represented by the relation INVENTORY is the M: N relationship between entities ITEM and DEPARTMENT and the other is the M: N relationship between the entities ITEM and COLOR. These two relationships will in general have their attributes which are omitted from the relation INVENTORY to simplify the presentation. For example, a typical attribute of the first relationship is the quantity used by a department and of the second the quantity of a particular color of an item in stock. If we represent these two relationships by separate relations all the anomalies disappear. This may be easily verified from the projections INVENTORY[ITEM, DEPT] and INVENTORY[ITEM, COLOR].
ITEM
COLOR
ITEM
DEPT
c
brown
c
I
t
black
t
1
s d b b b
brown green red blue yellow
s d s b b
I 2 2 2 3
INVENTORY [ITEM, COLOR]
INVENTORY [ITEM, DEPT]
The conclusion that the relation INVENTORY(ITEM, DEPT, COLOR) should be substituted by its projections INVENTORY[ITEM, COLOR] and INVENTORY[ITEM, DEPT] to obtain a model without anomalies could not be reached considering the functional dependencies among the attributes of this relation. We thus have to consider the multivalued dependencies as
77
1 Normal Forms
well. All the normal forms introduced so far are based on the functional dependencies. The fourth normal form which we are about to introduce is based on multivalued dependencies. (2) Fourth Normal Form (4NF) A relation R(X, Y, Z), where X, Y, Z are pairwise disjoint sets of attributes, is in the fourth normal form (4NF) if and only if the existence of a nontrivial multivalued dependency X -+> Y implies the existence of a functional dependency X -- A for every attribute A of the relation R. In other words, whenever a set X of attributes of a relation R is the domain of a nontrivial multivalued dependency, it contains a key of the relation R. The only difference between this definition and the definition of BCNF is that in the latter sets of attributes, which are domains of nontrivial functional dependencies, were considered. Since functional dependencies are a particular case of multivalued dependencies it is intuitively clear why 4NF is stronger than BCNF. This may be easily established formally. (3) The Relationship Between BCNF and 4NF (i) If a relation is in 4NF then it is necessarily in BCNF. (ii) There exist relations which are in BCNF but not in 4NF. To prove (i) suppose that a relation R is in 4NF but not in BCNF. If R is not in BCNF then there exists a nontrivial functional dependency X+ Y, where X and Y are sets of attributes of the relation R, and an attribute A of R such that X+-> A. Denote with Y1 the set Y/X of the attributes of R which are in Y but not in X. X - Y implies X -* Y,. X - Y, and X n Y1 = 0 implies that we have a multivalued dependency X --* Y1. This multivalued dependency is nontrivial since Y1 # 0 and XU Y1 is not the set of all attributes of the relation R. Indeed, Y, = 0 would mean that X - Yis a trivial functional dependency, contrary to the assumption. XU Y1 is not the set of all attributes of the relation R since A is neither in X nor in Y1. Indeed, A in X implies X-* A contrary to the assumption that X+-* A. A in Y11 together with X-+ Y and Y, -_ Y imply again X -+A. From the assumption that R is in 4NF but not in BCNF we proved that R contains a nontrivial multivalued dependency X -- Y1 and an attribute A such that X+-+A. This is a contradiction which proves statement (i). To prove (ii) it is sufficient to give an example of a relation which is in BCNF but not in 4NF. But the relation INVENTORY(ITEM, DEPT, COLOR) serves that purpose and thus (ii) is proved as well. The relation INVENTORY(ITEM, DEPT, COLOR) is not in 4NF since it contains a nontrivial multivalued dependency ITEM--* COLOR and ITEM is not a key of this relation. If we replace the relation INVENTORY with its projections INVENTORY [ITEM, DEPT] and INVENTORY[ITEM, COLOR] we obtain two relations which are both in =
78
2 Logical Design
4NF. Indeed, the only multivalued dependencies which can exist in these relations are trivial. In addition, according to 1.2.2(8) we can restore the original relation INVENTORY as the natural join INVENTORY[ITEM, COLOR] *ITEM INVENTORY[ITEM, DEPT]. The question is whether this holds in general. As it turns out, it does. (4) 4NF Normalization If a relation is not in 4NF then there exists a decomposition of R into a family of its projections all of which are in 4NF. Furthermore, the relation R may be restored as the natural join of these projections. Proof of the above statement proceeds as follows. If R is not in 4NF then there exists a nontrivial multivalued dependency X -- Y, where X and Y are disjoint, nonempty sets of attributes of R, such that X does not contain a key of R. Denote with Z the set of all attributes of R which are neither in X nor in Y. Y : 0 and Z :A 0 since the multivalued dependency X -* Y is nontrivial. We decompose the relation R(X, Y, Z) into its projections R[XY] and R[XZ]. The join R[XY] *x R[XZ] equals R(XYZ) because of X--* Y (see 1.2.2(8)). If both R[XY] and R[XZ] are in 4NF the normalization process is over. If not, we apply the same procedure to the projection which is not in 4NF. This process is finite since it properly reduces the number of attributes in relations in every step. Because of that we end up with binary relations in the worst case in which the only multivalued dependencies which can exist are trivial. It is important to observe that it is usually neither necessary nor desirable to decompose a relation as far as possible to obtain a 4NF family of relations. Suppose that the only key of the relation R(ABCD) is A and that all other functional dependencies are the consequence of the key dependency A -* BCD. Although it is possible to decompose R(ABCD) into projections R[AB], R[AC], and R[AD] whose natural join equals R(ABCD), this decomposition is not necessary since R(ABCD) is already in 4NF. (5) Example of 4NF Normalization Suppose that we are given a relation SCHEDULE(COURSE, SECTION, LECTURER, STUDENT, ROOM, DAY) which specifies the assignment of students, lecturers, and rooms to sections of courses of an academic department. Suppose that each section of a particular course has a unique lecturer, meets on the same day of the week, and always in the same room and that a student may be assigned to a single section of a particular course. These assumptions introduce the following nontrivial functional dependencies within the relation SCHEDULE: COURSE, SECTION -+ LECTURER STUDENT, COURSE - SECTION COURSE, SECTION -- DAY, ROOM
79
1 Normal Forms
In addition to these functional dependencies our assumptions imply the existence of the following nontrivial multivalued dependencies: COURSE, SECTION -- LECTURER COURSE, SECTION -- DAY, ROOM COURSE, SECTION -- STUDENT The relation SCHEDULE is not in 4NF since COURSE, SECTION does not contain a key of this relation. In fact, the only key is COURSE, SECTION, STUDENT. The semantic interpretation of the fact that SCHEDULE is not in 4NF is that it represents three relationships which correspond to the assignments of lecturers to sections, rooms to sections, and students to sections. If we apply the normalization procedure described in (4) we obtain separate representation of these assignments as follows: SCHEDULEOFSTUDENTS(COURSE, SECTION, STUDENT) SCHEDULEOFLECTURERS (COURSE, SECTION, LECTURER) SCHEDULEOFROOMS(COURSE, SECTION, DAY, ROOM) The above relations are the projections of the relation SCHEDULE. They are all in 4NF and their natural join produces the relation schedule according to 1.2.2(8).
1.5 Projection/Join Normal Form (PJNF) The projection/join normal form, or the fifth normal form as it is sometimes called, is even stronger than the fourth normal form. While the fourth normal form is based on multivalued dependencies, the projection/join normal form is based on join dependencies. Its definition follows. (1) Projection/Join Normal Form (PJNF) A relation R is in the projection/join normal form if and only if every nontrivial join dependency which holds in R is the result of keys. (2) Example The relation R(AGENT, COMPANY, PRODUCT) given below obeys the join dependency *I{(AGENT, COMPANY), (AGENT, PRODUCT), (COMPANY, PRODUCT) }.
AGENT
COMPANY
PRODUCT
Smith Smith Smith Smith Jones
Ford Ford GM GM Ford
car truck car truck car
80
2 Logical Design
Indeed, the relation R may be restored from its projections R[AGENT, COMPANY], R[AGENT, PRODUCT], and R[COMPANY, PRODUCT] using the natural join operation.
AGENT
COMPANY
AGENT
PRODUCT
COMPANY
PRODUCT
Smith Smith Jones
Ford GM Ford
Smith Smith Jones
car truck car
Ford Ford GM GM
car truck car truck
However, this join dependency is not the result of key dependencies since the only key of R is AGENT, COMPANY, PRODUCT. So the relation R(AGENT, PRODUCT, COMPANY) is not in PJNF, whereas its projections, being binary relations, are. Observe that the relation R(AGENT, COMPANY, PRODUCT) obeys the join dependency *{(AGENT, COMPANY), (AGENT, PRODUCT)}. But this join dependency is the consequence of the multivalued dependencies AGENT --* COMPANY and AGENT -- PRODUCT which exist in R. Decomposition of R into its projections, R[AGENT, COMPANY] and R[AGENT, PRODUCT], produces two relations which are in PJNF and which, when joined, produce again R. So the third relation seems superfluous. However, in the next example we enlarge the relation R in such a way that no join of two out of three of its projections produces the original relation.
(3) Example
AGENT
COMPANY
PRODUCT
Smith Smith Smith Smith Jones Jones Brown Brown Brown Brown
Ford Ford GM GM Ford Ford Ford GM Toyota Toyota
car truck car truck car truck car car car bus
81
1 Normal Forms
AGENT
COMPANY
COMPANY
PRODUCT
AGENT
PRODUCT
Smith Smith Jones Brown Brown Brown
Ford GM Ford Ford GM Toyota
Ford Ford GM GM Toyota Toyota
car truck car truck car bus
Smith Smith Jones Jones Brown Brown
car truck car truck car bus
Observe that the initial relation is not in the PJNF since it may be restored by the natural join of the above three projections, but that property is not the consequence of the only key AGENT, COMPANY, PRODUCT. Observe that the join R[AGENT, COMPANY] AGENTT R[AGENT, PRODUCT] contains the triple (Brown, Ford, bus) which is not in the original relation. Similarly, the join R[AGENT, PRODUCT] *PRODUCT R[COMPANY, PRODUCT] contains the triple (Jones, GM, car) which is not in the original relation. Finally, the join R[AGENT, COMPANY] *COMPANY R[COMPANY, PRODUCT] contains the triple (Brown, GM, truck) which is not in the original relation R. (4) The Relationship Between PJNF and 4NF (i) A relation which is in the PJNF is also in the 4NF. (ii) There exist relations which are in the 4NF but not in the PJNF. The relation R(AGENT, COMPANY, PRODUCT) is in the 4NF since it contains only trivial multivalued dependencies. However, we showed that it is not in the PJNF since the join dependency which it obeys is not the result of the only key of R. So property (ii) is established. To prove property (i) suppose that there exists a relation R which is in the PJNF but which is not in the 4NF. If a relation is not in the 4NF then it contains a multivalued dependency whose domain is not a superkey. This multivalued dependency implies that R contains a join dependency which is not the result of keys of R, contrary to the assumption that R is in the PJNF. (5) PJNF Normalization If a relation R is not in the PJNF then there exists a decomposition of R into a set of its projections which are all in the PJNF and their natural join restores the original relation R. To prove the above statement suppose that R obeys a join dependency *X, X2,. ... Xm} which is not the result of keys of R. Then in the first step of the normalization procedure we decompose R into the set R[X1 ], R[X 2 ] .. R[Xm] of its projections. The natural join of these projections equals R because of the assumed existence of the join dependency *{X,X2....Xm}. If the projection R[Xi] for some i = 1, 2, ... , mis not in
82
2 Logical Design
the PJNF we apply the same procedure to R[X]. This process is finite since we end up with binary projections in the worst case in a finite number of steps. Binary relations are obviously in the PJNF.
2 Abstractions 2.1 Unnormalized Relational Model (1) Normalized Versus Unnormalized Model Suppose that we want to find suitable relational representation for three types of entity-DEPARTMENT, EMPLOYEE, and CHILD-on condition that the relationships between the entities DEPARTMENT and EMPLOYEE and between the entities EMPLOYEE and CHILD are 1: N. Our knowledge of normal forms would most likely lead to the following normalized model. (i)
DEPARTMENT(DEPT #, NAME, CITY) EMPLOYEE(EMP #, NAME, SALARY, DEPT #) CHILD(NAME, AGE, PARENT #)
We will also consider a model which corresponds to (i) and which is not even in the first normal form: (ii)
DEPARTMENT(DEPT #, NAME, CITY, EMPLOYEE(EMP #, NAME, SALARY, CHILD (NAME, AGE)))
Model (ii), as one would expect, exhibits the insertion, deletion, and update anomalies. An important observation is that the meaning of these anomalies is quite different for the entities EMPLOYEE and CHILD. The insertion anomaly for the entity EMPLOYEE is exhibited by the fact that a tuple of a particular employee cannot be entered into the database unless that employee belongs to some department. Deleting a department from the database causes deletion of all its employees, which is the deletion anomaly. Both effects may be quite undesirable. Consider now the insertion and deletion anomalies for the lower level of abstraction of the hierarchy in model (ii). A child tuple cannot be entered into the database unless the tuple of its parent is already there. Deleting an employee tuple causes deletion of tuples of all the employee's children. Although these two effects would in general be characterized as anomalies, the semantics of the actual data may easily be such that they would not be anomalies at all. Data about a particular child would usually be of interest (if at all) to a company if and only if its parent is an employee of that company. If this assumption is valid, then the described behavior of the subrelation EMPLOYEE reflects precisely that assumption.
2 Abstractions
83
The above analysis shows that model (ii) contains two types of entity characterized as follows: (2) Kernel and Characteristic Entities (i) An entity type is called kernel (or independent) if its existence in the database is independent of the existence of any other entity in that database. (ii) An entity type is called characteristic if it describes some other entity and corresponds to a multivalued property of that entity. A characteristic entity exists in the database only if the entity which it describes exists in the database. In example (1) the independent entity types are DEPARTMENT and EMPLOYEE, whereas CHILD is the characteristic entity type of the entity type EMPLOYEE. If we now represent independent entity types by separate relations, we obtain the following model: (iii)
DEPARTMENT(DEPT #, NAME, CITY) EMPLOYEE(EMP #, NAME, SALARY, DEPT #, CHILD(NAME, AGE))
The hierarchichal structure remained in the relation EMPLOYEE since the semantics of the actual data justifies that. Thus the entity CHILD is just an attribute of the entity EMPLOYEE and since the domain of that attribute is not simple, the relation EMPLOYEE is not in the first normal form. In the section on normal forms we discussed the disadvantages of the unnormalized models in detail. The above example shows that unnormalized models may have some advantages as well. (3) The Relationship Between Normalized and Unnormalized Models (i) The basic advantage of a model which consists of a collection of hierarchically structured relations is that it often reflects the semantics of the actual relationships among the entities represented by the model. When those relationships are hierarchical by their very nature, then it is also natural to represent them by unnormalized relations. An example is the relationship between an employee and its child viewed in the context of a company in which the employee works. (ii) Performing normalization without regard to the semantics of the actual relationship among entities may lead to a large number of relations from which those relationships may not be obvious for a less sophisticated user. (iii) If hierarchical relations are representations of true hierarchical relationships which exist among kernel entities and their subordinate characteristic entities, then it is simpler for the users to specify queries on such relations than it would have been if those relations were normalized. For example, asking queries which refer to employees and their children requires the join
84
2 Logical Design
operation if model (1) (i) is used. The join operation is not necessary in model (2) (iii) since all relevant data for employees are contained in one relation of that model. (4) Representing M: N Relationships Among Entities What has been said above for the hierarchical relationships among entities does not carry over to the M: N relationships. If we assume that an employee may belong to more than one department, then the association of employees and their departments should be represented by a separate relation whose attributes are primary keys of the entities EMPLOYEE and DEPARTMENT. This approach, which avoids the update anomalies, leads to the following model. DEPARTMENT(DEPT#, NAME, CITY) EMPLOYEE(EMP #, NAME, SALARY, CHILD (NAME, AGE)) ASSIGNMENT(DEPT #, EMP #) (5) Associative Entities The association of employees and their departments may itself be an entity. Such entities interrelate entities of other types and are called associative. Associative entities as a rule have attributes which do not belong to the entities which participate in the association. For example, the percentage of working hours which an employee spends in a department and the job held in that department are examples of such attributes. The above observations produce our final model which contains representations of two kernel entities, one associative and one characteristic entity. DEPARTMENT(DEPT #, NAME, CITY) EMPLOYEE(EMP #, NAME, SALARY, CHILD(NAME, AGE)) ASSIGNMENT(DEPT #, EMP #, PERCENTAGE, JOB) 2.2 Aggregation Hierarchical relations discussed in the previous section show how larger meaningful units may be formed from the smaller (simpler) ones. Associative entities are also examples of larger meaningful units obtained from the simpler ones by an abstraction which is called aggregation. (1) Definition of Aggregation Aggregation is an abstraction in which a relationship between entities is regarded as a higher level, associative entity.
85
2 Abstractions
(2) Example of Aggregation Suppose that the entities LECTURER and COURSE are specified, where the attributes of the former are NAME and RANK and of the latter TITLE and CREDIT. The relationship between these two entities is the entity LECTURE of the associative type. Its attributes are LECTURER, COURSE, DAY, TIME, and ROOM. The two levels of abstraction are represented graphically below:
LECTURE
LECTURER
COURSE
It is obvious that all three entities are in fact the result of aggregation.. However, entities LECTURER and COURSE are obtained by aggregation of elementary entities (attributes) whereas the entity LECTURE is obtained by aggregation of nonelementary entities. Because of that, only the entity LECTURE is regarded as an associative entity in the proper sense. (3) Hierarchical Representation of Aggregation If we attempt to map levels of abstraction obtained by aggregation into the hierarchical model we obtain a model with some undesirable properties. Example (2) of aggregation shows that clearly. Its unnormalized or hierarchical representation is (ii). LECTURE(LECTURER(NAME, RANK), COURSE(TITLE, CREDIT), DAY, TIME, ROOM)
There are two basic disadvantages of the above representation. The first
86
2 Logical Design
one is the insertion, deletion, and update anomalies. Indeed, data about a course can not be inserted into the database if there are no lectures from that course. Likewise, data about a lecturer can not be inserted into the database unless that lecturer is giving a lecture. Deleting the last lecture of a particular lecturer causes deletion of that lecturer as well. Deleting the last lecture of a particular course causes deletion of that course. If we assume that a lecturer teaches more than one lecture, or that there are many lectures of a given course, we obtain the update anomaly, caused by the fact that the data about a particular lecturer will have to be repeated for each lecture which that lecturer teaches and the data about a particular course will have to be repeated for each lecture from that course. The second disadvantage of model (ii) is that it does not permit uniform access to the three entities which it represents. Indeed, to access data of a lecturer, one has to know at least one lecture which that lecturer teaches. Similarly, to access data about a course, one has to specify a lecture of that course. (4) Network Representation of Associative Entities Within the network model based on the CODASYL sets the above situation is represented by the two sets ASSIGNMENT(LECTURER, LECTURE) and SCHEDULE(COURSE, LECTURE) with a common member LECTURE
(5) Relational Representation of the Associative Entity If we assume that the primary key of the entity LECTURER is NAME and of the entity COURSE is TITLE, then the normalized relational model which corresponds to the network model (4) would have the following form: LECTURER(NAME, RANK) COURSE(TITLE, CREDIT) LECTURE(LECTURER, COURSE, DAY, ROOM) In the above model the attributes NAME and LECTURER have the same domain (a set of names) and so do the attributes TITLE and COURSE (a set of titles). The primary key of the entity LECTURE is LECTURER, COURSE.
2 Abstractions
87
(6) Foreign Key The attributes LECTURER and COURSE of the relation LECTURE are said to be foreign keys, which means that their values are in fact primary keys of tuples in other relations. In general, if an entity of type E is the result of aggregation of entity types E1 , E2 ... , E. whose respective (primary) keys are K 1 , K 2 , .. ., K,, then K 1 , K 2 , ... , K, will in fact be attributes of the entity E, possibly with their names changed. Each of Ki (i = 1, 2 ... , n) is then a foreign key in E and together they represent a key of the associative entity E. The case in which that key is in fact the primary key of E, as in example (5), requires careful consideration. In fact, in that case the following general integrity constraint related to aggregation has to be obeyed. (7) Integrity Constraint for Aggregation An occurrence of an associative entity may exist in the database only if the occurrences of the entities participating in the association also exist in the database. Formally, the above condition follows from the definition of primary key whose attributes are not allowed to have undefined values. (8) Relational Scheme for Aggregation with Integrity Constraints Example (5) together with the integrity constraints defined according to (7) has the following form: (i) Relations LECTURER(NAME, RANK) COURSE(TITLE, CREDIT) LECTURE(LECTURER, COURSE, DAY, ROOM) (ii) Integrity constraints LECTURE[LECTURER] s LECTURER[NAME] LECTURE[COURSE] __COURSE[TITLE] Integrity constraint (7) is often too strict and it does not permit adequate representation of some rather natural situations. For example, there may be scheduled lectures for some courses for which the lecturers are not yet determined. It would not be possible to represent such a situation within model (5). A lecture tuple with the undefined value of the attribute LECTURER may not exist in the relation LECTURE since that attribute is a part of the primary key. It is interesting to observe that this problem has been given due care within the CODASYL network model and in the Extended Relational Model. Assume that a lecture cannot exist in the database unless the course is specified, but that it can exist in the database without its lecturer. The
88
2 Logical Design
situation is modeled by choosing mandatory-type membership for the entity LECTURE in the set SCHEDULE and optional-type membership in the set ASSIGNMENT. This means that an occurrence of the entity LECTURE may exist in the database without its owner in the set ASSIGNMENT but not without its owner in the set SCHEDULE. (9) Imposed Keys Within the relational model the problem is overcome by introducing for each entity type a new attribute which assumes the role of the primary key. Following that discipline from (8) we obtain model (9) which obviously does not require the integrity constraint (7). LECTURER(LECTURER #, NAME, RANK) COURSE(COURSE #, TITLE, CREDIT) LECTURE(LECTURE #, COURSE #, LECTURER #, DAY, ROOM) (10) Specification of the Integrity Constraints in SQL If we want to specify the condition that a lecture must have its course specified in the database than we do that for model (9) as follows: ASSERT CHECKOWNER: SELECT COURSE # FROM COURSE CONTAINS SELECT COURSE # FROM LECTURE 2.3 Generalization Another important way of forming larger meaningful units is by an abstraction which is called generalization. (1) Definition of Generalization Generalization is an abstraction in which a set of similar entity types is viewed as a single entity type. If the entity type E is the result of generalization of the entity types E1 , E2, .... , E, then E is called the generic entity type of E,, E 2, ... , E, and the latter are called its subtypes. To define E from E1 , E2, .... , E, we ignore the differences between Ej (i = 1,2, .. ., n) so that their common attributes become the attributes of the generic entity E. (2) Example of Generalization We present a decomposition of the generic entity type MOTOR VEHICLE into its subtypes and apply the same procedure to the subtypes so that we obtain the following generalization hierarchy:
89
2 Abstractions
MOTOR VEHICLE
AIR VEHICLE
AIRCRAFT
HELICOPTER
LAND VEHICLE
ROAD VEHICLE
BUS
TRUCK
WATER VEHICLE
TRACK VEHICLE
CAR
TRAIN
MOTOR BOAT
SAIL BOAT
TRAM
Definition (1) of generalization leads to the following general integrity constraint. (3) Integrity Constraint for Generalization Suppose that E is the generic entity type and E, (i = 1, 2, .... , n) are its subtypes. For every entity of the type Ei (i = 1, 2 ... ., n) in the database there must exist its generic entity of type E. (4) Unapplicable Property So far we have observed the need for an attribute to have an undefined value if that value is still unknown. However, there exist situations of a quite different semantic nature in which the attribute assumes an undefined value. Such situations will be illustrated by modeling three subtype entitiesENGINEER, SECRETARY, and DRIVER-of the generic entity type EMPLOYEE. The attributes of these entity types are given as follows. EMPLOYEE(EMP #,NAME, SALARY, JOB) ENGINEER(EMP #, DEGREE) SECRETARY(EMP #, TYPINGSPEED) DRIVER(EMP #, LICENSETYPE)
EMPLO'
ENGINEER
SECRET"ARY
DRIVER
If two levels of the generalization hierarchy had not been distinguished, the model would have contained a single relation specified as:
90
2 Logical Design
EMPLOYEE(EMP #, NAME, SALARY, JOB, DEGREE, TYPINGSPEED, LICENSETYPE) Each tuple of the above relation would have values of some attributes undefined. In particular, tuples representing engineers must have undefined values of the attributes TYPINGSPEED and LICENSETYPE, tuples representing secretaries must have undefined values of the attributes DEGREE and LICENSETYPE, and finally tuples representing DRIVERS must have undefined values of the attributes DEGREE and TYPINGSPEED. In all these cases an appearance of a specific value of appropriate type is an error. It is obvious that this type of situation is quite different from the situation in which a specific value of an attribute is applicable, but not known. In that case an undefined value may be substituted by any specific value of the appropriate type. Undefined values considered here correspond to unapplicable properties of entities. (5) Views
Views are virtual relations and often represent an appropriate tool for modeling generic objects. Although views do not exist in the sense in which the base relations do, they behave like base relations in (almost) all other ways. Views may be created, deleted, used for stating queries, and under certain circumstances they may even be updated. To see how and why views may be introduced consider again example (4). A natural way of modeling the entities ENGINEER, SECRETARY, and DRIVER would be to represent them by three separate relations where each relation contains only the actual attributes (applicable properties) of the type of an employee which it represents. This approach leads to the following model: ENGINEER(EMP #, NAME, SALARY, DEGREE) SECRETARY(EMP #, NAME, SALARY, TYPINGSPEED) DRIVER(EMP #, NAME, SALARY, LICENSETYPE) We assume that each employee has a unique EMP #. A virtual relation (view) is then defined in such a way that it represents a generic entity type EMPLOYEE. EMPLOYEE(EMP#, NAME, SALARY) Tuples of the relation EMPLOYEE in fact do not exist. These virtual tuples are at any moment of time derived from the relations ENGINEER, SECRETARY, and DRIVER in such a way that the set of virtual tuples which corresponds to the relation EMPLOYEE is determined as: ENGINEER[EMP #, NAME, SALARY] U SECRETARY[EMP #, NAME, SALARY] U DRIVER[EMP #, NAME, SALARY] Because of that, all changes in the sets ENGINEER, SECRETARY, and
91
2 Abstractions
DRIVER are reflected in the set EMPLOYEE of virtual tuples. In other words, the integrity constraint (3) for generalization is enforced. In this particular example even a stronger condition holds which says that to each virtual tuple there corresponds exactly one actual tuple in the database. This condition makes updates to the view EMPLOYEE acceptable. (6) View Definition in SQL The set of tuples which corresponds to a view is specified by a query in SQL. View definition also includes the name of the view and the names of its attributes. Their domains are implicitly specified by the query which the view definition contains. All this is best illustrated by the SQL definition of the view EMPLOYEE: DEFINE VIEW EMPLOYEE(EMP #, NAME, SALARY) AS SELECT EMP#, NAME, SALARY FROM ENGINEER UNION SELECT EMP#, NAME, SALARY FROM SECRETARY UNION SELECT EMP#, NAME, SALARY FROM DRIVER In SQL views may be used to define other views on them so that it is indeed possible to construct a generalization hierarchy using virtual relations on all levels except on the lowest one. (7) Alternative Generalization A more precise name of the type of abstraction considered in this section is unconditional generalization. It is often the case that an entity type may be generalized in two or more alternative ways. For example, in a database of suppliers (of a company) the entity type SUPPLIER may have alternative generic entity types COMPANY, PARTNERSHIP, and INDIVIDUAL. On the other hand, the generic entity type for all of these is LEGAL UNIT. These relationships are illustrated by the following figure: LEGAL UNIT
COMPANY
PARTNERSHIP
SUPPLIER
INDIVIDUAL
92
2 Logical Design
3 Design Methodology 3.1 Extended Relational Model (1) Extended Relational Model In the Extended Relational Model (RM/T) entities and their types can be classified depending on whether they: (i) Perform a subordinate role in describing entities of some other type, in which case they are called characteristic. (ii) Perform a superordinate role in interrelating entities of other types, in which case they are called associative. (iii) Perform neither of the above roles, in which case they are called kernel. Objects that interrelate entities but do not themselves have the status of entities are called nonentity associations. In RM/T there is a unary relation (called E-relation) for each entity type. The main purpose of an E-relation is to list the identifiers of entities that have that type and are currently recorded in the database. The immediate (single-valued) properties of an entity type are represented as distinctly named attributes of one or more property-defining relations which are called P-relations. Each P-relation has as its primary key an Eattribute (entity identifier) whose main function is to tie the properties of each entity type to the assertion of its existence in the E-relation. To indicate which P-relations represent properties associated with each E-relation the property graph (PG) relation is introduced. The property graph relation is binary and it has two attributes: SUB and SUP whose domain is RELNAME (set of names of relations). These attributes denote an entity and its multivalued property. Entity types are so defined that each multivalued property of an entity E is represented by a characteristic entity C together with immediate properties for C. A characteristic entity may itself have one or more characteristic entities subordinate to it. The characteristic entity types that provide description of a given kernel entity type form a strict hierarchy, which is called the characteristic tree. To represent the collection of characteristic trees, the characteristic graph relation (CG-relation) is introduced. CG is a binary relation whose attributes are defined on the RELNAME domain and denote an entity and its immediate characteristic. The representation of associative entities in RM/T is the same as that of kernel entities. In particular, there is an E-relation for each associative entity type and zero or more P-relations. If a given associative entity type has subordinate characteristic entity types, there will be corresponding tuples in the CG-relation to define the tree of these types and there will be characteristic relations to support each of the characteristic entity types involved.
93
3 Design Methodology
An associative entity type interrelates entities of other types (kernel or associative or both). These other entities are referred to as immediate participants in the given associative entity type. To support the specification of which entity types participate in which associative entity types, we introduce the association graph relation (AG-relation), a binary relation with attributes SUP and SUB whose domain is RELNAME and whose attributes identify associative entity type and an immediate participant in the association. A nonentity association type has no E-relation so that it cannot participate in another association. Such an entity type is represented by a single nary relation whose attributes include the E-attributes identifying the entity types participating in the association together with the immediate properties of the association. The above notions are illustrated by the following example of a library database whose schema is specified according to the extended relational model: (2) Library Database in RM/T E-relations: BOOK(BOOK#) BORROWER (BORROWER #) AUTHOR(AUTHOR#) LOAN(LOAN#) P-relations: TITLE(BOOK #, TITLE) PUBLISHER(BOOK #, NAME, CITY) NAME(AUTHOR #, FIRST, MIDDLE, LAST) ADDRESS(BORROWER #, NUMBER, STREET, TOWN) BOOKLOAN(LOAN #, BOOK #, BORROWER #) DATE(LOAN #, DAY, MONTH, YEAR) PG
AG
SUB
SUP
TITLE PUBLISHER NAME ADDRESS DATE BOOKLOAN
BOOK BOOK AUTHOR BORROWER LOAN LOAN
SUB
SUP
BOOK BORROWER
LOAN LOAN
94
2 Logical Design
(3) RM/T Integrity Constraints In RM/T the following integrity constraints are adopted (all but one are listed here): (i) Entity Integrity. E-relations accept insertions and deletions but not updates. E-relations do not accept null values. (ii) Property Integrity. A tuple may not appear in a P-relation unless its identifier appears in the corresponding E-relation. (iii) CharacteristicIntegrity. A characteristic entity can not exist in the database unless the entity of which it is an immediate multivalued property also exists in the database. (iv) Association Integrity. An associative entity can exist in the database even though one or more entities participating in that association are unknown. A nonentity association may not exist in the database unless the entities it interrelates also exist in the database. It is important to observe that the first part of rule (iv) is a general rule. A stronger integrity constraint is in fact allowed if required by the actual semantics of an application environment. (4) Generalization in RM/T (i) In RM/T entities and their types may be related by other criteria other than description and association. An entity type E1 is said to be a subtype of an entity type E2 if all entities of the type E1 are necessarily entities of the type E2 . Any entity type (characteristic, kernel, or associative) may have one or more subtypes, which in turn may have subtypes. An entity type E together with its immediate subtypes, their subtypes, and so on constitute the generalization hierarchy of E (a category). To handle generalization the unconditional generalization inclusion relation (UGI-relation) is introduced. It has three attributes-SUB, SUP, and PER where the first two denote names of a subtype (SUB) and its associated generic type (SUP) and the third denotes the name of the generalization category. (ii) Integrity Constraintfor Generalization. a. Property inheritance rule: Given any subtype E, all the properties of its parent types are applicable to E. b. Whenever an entity identifier E# belongs to the E-relation for an entity type E, E# must also belong to the E-relation for each entity type of which E is a subtype. c. A subtype of a characteristic entity type is also characteristic. A subtype
3 Design Methodology
95
of a kernel entity type is also kernel. A subtype of an associative entity type is also associative. Taking full advantage of the property inheritance rule would imply representation of common properties and characteristics of entity types in some generalization hierarchy as high up as possible. But this design discipline is in fact not required by RM/T. 3.2 Relational Database Programming Environment (1) Structure and Action Specification Data modeling of an application environment is usually performed in such a way that structural properties (entities, their relationships, and integrity constraints) are emphasized. A number of concepts, tools, and techniques have been developed which support such a view of data modeling. However, the major reason for developing such a model is to allow users to perform simple and composite actions and queries on the sets of entities described in the model. The relational data model provides a unified highlevel nonprocedural language for the specification of such actions (insert, delete, update) in a manner which is very similar to the way queries are defined. In fact, queries may be regarded as identity actions on the objects to which they refer. It was believed that such a general-purpose tool was all that was required especially if it was appropriately coupled with a standard programming language. We could say that the approach was to extract all the structural properties from all the programs and queries that access the database and design a schema which represents those properties appropriately. The reason, of course, is that structural properties were considered stable and could be defined abstractly in the schema which can be used by any program or query. An important observation is that it is also possible to define a collection of simple and composite actions on the entities represented in a schema which is also stable in the sense that all the users who want to use that schema would need them and in fact will be required to use them the same way they are required to comply with the structural requirements of the schema. From the viewpoint of the application environment that means that some general behavioral properties of that environment are represented by such a collection of actions. It follows that specification of actions on the entities represented in the schema of an application environment is an essential part of a data model of that application environment. Although a relational schema appears to be flat it is in fact a highly structured representation. This structure is properly maintained if various types of referential integrity constraints are maintained. Specifying simple and composite actions together with the schema in fact completes the definition
96
2 Logical Design
of the structural relationships since those actions had better be defined in such a way that the integrity constraints are maintained. Since such constraints always accompany data abstractions like aggregation or generalization, a natural question is whether there are appropriate abstractions for action structuring which would be used for specifying actions on entities obtained by the above forms of abstraction. Abstractions for actions have been studied extensively in the area of programming languages. It is a pleasant discovery that those abstractions are in fact appropriate for specifying composite actions on entity types which are obtained by such data abstraction techniques as aggregation, generalization, cover aggregation, and so on. When all of those tools are available within a unified framework we obtain a database programming environment which in its most general form would also include other concepts, tools, and techniques which will be developed later on in this book. In this section we describe an integration which is possible on the basis of the concepts, tools, and techniques which have been covered so far in the book. (2) Classification Recall that a collection of entities is considered an entity type (class) if all the entities in that collection share the same set of properties (attributes). Classification of entities into entity types thus requires precise specification of all the properties of those types. Conversely, an entity is an instance of the entity type to which it belongs. So instantiation is used to obtain entities which belong to an entity type. The instance-of relationship is represented in the diagram below:
Entity type instance-of Entity
Programming languages offer the concept of a type, a set of objects whose most essential common property is the set of actions which may be performed on objects which belong to that type. Pascal-like languages are particularly careful about types of object. A number of simple types are given and all other types are defined on the basis of the already defined types using a predefined set of data structuring and action structuring rules. Classification of entities into entity types in recent Pascal-like languages requires specification of actions which are associated with each such type. If the approach from the recently developed programming languages is adopted then whenever an entity type is introduced the essential part of such a
97
3 Design Methodology
definition is specification of all actions which are applicable to instances of that type. This observation means that a data model should contain not only a specification of the structure of entities but also a specification of behavioral properties of entities represented by the model. In fact, examples which follow demonstrate that a model of an application environment which does not include the behavioral aspects is an incomplete specification of the semantics of that application environment. Primitive actions which are associated with classification are: (i) Inclusion of an instance of an entity type into the set of entities of that type. This is the insertion action and in fact represents a special case of the set union operation. (ii) Deletion of an instance of an entity type from the set of entities of that type. The deletion action is a special case of the set difference operation. (iii) Update of values of attributes of an instance of an entity type. Although this action may be represented as a composition of deletion and insertion, it is much more natural to regard update as a primitive action. (iv) Selection (and display) of values of the attributes of an instance of an entity type. This is not an action on the entity whose attributes are selected and displayed since it does not affect that entity in any way. However, we may regard its effect on that entity to be the same as that of the identity action. (v) It turns out that if we want to be careful we in fact need the identity action. (3) Cartesian Aggregation Recall that aggregation is an abstraction in which a relationship among component entity types is considered a higher-level entity type. The relationship between the component entity and its aggregate entity is a part-of (component-of) relationship. Conversely, decomposition is applied to an aggregate (composite) entity to obtain its constituent parts (component entities). These relationships are represented by the following diagram:
Aggregate entity type l
part -of
Component entity type
Given a fixed number of types, Pascal-like languages allow specification of a new composite type which has the given types as its components. This
98
2 Logical Design
new type is called a record type. Given an instance of a record type object, it is possible to perform decomposition, that is, select its components. Since an aggregate entity is a composite made out of a fixed number of component entities, an action on an aggregate entity type consists of a fixed number of actions, one action per each component entity type. So the action structuring rule which corresponds to aggregation is a collateral (parallel) composition of a fixed number of actions. Some recent Pascal-like languages would offer such a composition rule. A less attractive but acceptable and commonly available rule is sequential composition of a fixed number of actions. Quite often some of the component actions will be the identity actions. If E is an aggregate of the entity types E, E2 , ... , E., that is, E(EI, E2, ..... E) with the graphical representation
E
E2
En
then an action A on the entity type E is a collateral composition of n actions AI, A2 , ... , A,, where Ai is an action on the entity type Ei. This composite action may be represented using the following graphical notation:
E AEl I
AEn
AE2
El
En
E2
The usual notation developed in Pascal-like languages which may be used to describe the entity type E is type E = record
components : E1 ; component2: E2 ; componentn: En
end
99
3 Design Methodology
Pascal-like notation for an action on an instance of the entity type E has the following general form: begin
action on E.component 1, action on E.component2, action on E.componentn
end If we apply both aggregation and classification we have to be able to define sets of records. Indeed, a typical entity could be viewed as an aggregation of its attributes. An entity type is then a set of entities which have those properties. So a typical entity type would be defined as a set of records. Pascal-like languages support set types, but the cardinality of such sets is extremely small and their elements must be of simple type so that records in particular are not permitted. Once we introduce this extended set type we also have to equip that type with the membership relation and then preferably with the standard set operations such as union, intersection, and difference. (4) Generalization Recall that generalization is an abstraction in which a more general (generic) entity type is defined by extracting common properties of one or more entity types while suppressing the differences among them. These entity types become subtypes of the generic entity type. Conversely, specialization introduces a new entity type adding specific properties of that type which are different from the general properties of its generic type. So generalization introduces the is-a relationship between a subtype entity and its generic entity. This relationship is represented by the following diagram:
Generic entity type
is-a Entity subtype
Pascal-like languages provide explicit support only for generic entity types whose subtypes are disjoint. The generic entity type is defined as a record whose components are attributes shared by all subtypes. Furthermore, additional attributes which characterize particular subtypes are also given in the record definition. This type of record is called a record with variants. When an action on a generic entity is performed, a choice of the appropriate subtype entity must be performed and an action on that subtype entity
1 00
2 Logical Design
performed as well. So in addition to the sequential composition, generalization also requires the composition rule for actions which permits selection of one among a finite number of actions. This is the well-known case composition in Pascal-like languages. If subtype entities are not disjoint then nesting of conditional compositions (if-then-else) of actions is an appropriate type of composition. It can also be used in all cases when the case composition is not available. If an entity type E is a generic entity type of the entity types E1 , E2, ... , Eý then we may represent such a situation by the following diagram:
E
E• i
.. .
E2
En
The usual notation in Pascal-like programming languages (assuming that the subtypes are disjoint) is type E = record
attributes of E; case subtype E of subtype I: (attributes of E,); subtype2: (attributes of E 2 );
subtypen: (attributes of E.) end An action A on the entity type E is graphically represented as
A
;AE E
AI
JAE 4 E2
ýAn
En
101
3 Design Methodology
The usual notation in Pascal-like languages is begin
action on E, case subtype E of subtypes: action on E,; subtype2: action on E2 ; subtypen: action on E. end
end (5) Example of Structure Modeling Suppose that we deal with an application environment in which an inventory of various types of object is maintained. The common properties of all the objects in the inventory are extracted (classification and generalization) and the entity type ITEM is defined as a set of attributes (aggregation) in the following Pascal-like notation: type ITEM = record
ITEM#: DESCRIPTION: PRICE:
INTEGER; STRING; REAL
end;
ITEM
ITEM#
DESCRIPTION
PRICE
The first view of the inventory is very simple. It is represerted as a variable of the following type: type ITEMSET = relation of ITEM In the next step we classify the items in the inventory into two classes: products and raw materials. These two classes have a number of common properties of their generic entity type ITEM. However, they also have some special properties of their own so that by specialization we obtain two subtypes of the entity type ITEM which are PRODUCT and MATERIAL. This refinement obtained by specialization is defined in the following Pascallike notation:
102
2 Logical Design
ITEM
PRODUCT
type ITEM = record
MATERIAL
ITEM#: INTEGER;
DESCRIPTION: STRING; PRICE: REAL; case KIND: (PRODUCT, MATERIAL) of PRODUCT: (HEIGHT, WIDTH, DEPTH: REAL); MATERIAL: (UNIT: STRING, MEASURE: REAL) end; (6) Cover Aggregation Cover aggregation is an abstraction in which a relationship between member entity types is defined as a higher-level cover entity type where each entity of the cover type is an aggregation of sets of instances of its member entity types. These sets are not determined at the model definition time. The relationship between a member entity type and its associated cover entity type is a member-of relationship and is represented by the following diagram:
Cover entity type member-of Member entity type
Cover aggregation also requires specification of sets of records since sets of entities of one or more member types (which are typically records) are associated with each entity of the cover type. A very common approach to the representation of cover aggregation is an extension of each member type with a new attribute whose value identifies the cover entity. So to associate a particular cover entity with a set of its members the natural join operation is required. Of course, this approach works only if covering is in fact partitioning of cover members into disjoint subsets where each such subset is associated with one cover entity. If covering is not partitioning then we have to apply the techniques for representing M: N relationships by separate relations. In other words, within the framework of the basic relational model it is possible to represent cover aggregation using Cartesian aggregation.
103
3 Design Methodology
An action on an instance of a cover entity type consists of an action on that entity followed by an appropriate action on all its associated member entities. So an additional type of action composition rule which is required by this type of data abstraction is the foreach composition rule. It permits specification of an identical action (simple or composite) on a set of entities which are selected at the time of the execution of the composite foreach action. We already observed that Pascal-like languages do not offer any suitable support for cover aggregation. However, they do support aggregation via records. If cover aggregation is implemented by supplementing each member entity type with the attributes which identify an instance of the cover entity type then selection of the appropriate member entities is based on a particular type of predicate (logical expression). If the foreach composition rule is extended by a predicate which selects the entities on which the action is to be performed then we obtain the usual query facilities like those offered by SQL. Indeed, one of the actions associated with an entity type is the selection (and display) of the values of its attributes. If E is a cover type of the entity types E, E2, ... , E, then we may represent such a situation using the following graphical notation:
E
El
E2
....
En
In comparison with aggregation the only new relationship is that of a cover type C and one of its associated member types M according to the following graph:
C
M
An action on C may be graphically represented as
104
2 Logical Design
/AC C AMý$ M
Pascal-like notation for this action is foreach m in M do action on m An extension of the foreach action structuring rule permits selection of the set of (member) entities to be performed on the basis of a qualification expression like in SQL (restriction). So we obtain the following: foreach where do
r in R restriction on R action on r
The action on R may be simple (like printing the values of the selected attributes of r) or structured according to the composition rules discussed in this section. Further generalization of the above action composition rule is in fact well suited for the representation of cover aggregation within the basic relational model so that it requires a join of two relations. The form of such an action is foreach where do
rinR, sinS restriction on R x S action on (r, s)
(7) Example of Structure Modeling In the further development of our data model we observe that each product may have a complex structure which consists of a set of other products and a set of raw materials which are necessary to produce it using a set of working operations. This means that applying cover aggregation we define the entity type PRODUCT as a cover type of the entity types ITEM and OPERATION. In other words, each PRODUCT is viewed as an association of a set of items (parts and materials) and a set of working operations.
105
3 Design Methodology
PRODUCT
ITEM
OPERATION
There is no proper support in Pascal-like languages for immediate and explicit representation of the above structure. If we assume that one and the same type of working operation is required in a number of products and that one and the same item is a component of many products we obtain the following relational representation of our conceptual model in Pascal-like notation: type ITEM = record ITEM #: INTEGER;
DESCRIPTION: STRING; PRICE: REAL; case KIND: (PRODUCT, MATERIAL) of PRODUCT: (HEIGHT, WIDTH, DEPTH: REAL); MATERIAL: (UNIT: STRING, MEASURE: REAL) end; OPERATION = record OPERATION #: INTEGER;
DESCRIPTION: STRING; PRICE: REAL end;
ASSEMBLY = record SUPITEM #: INTEGER; SUBITEM #: INTEGER; QUANTITY: INTEGER end; LABOR = record ITEM # :INTEGER; OPERATION #: INTEGER; QUANTITY: REAL end;
ITEMSET = relation of ITEM; OPERATIONSET = relation of OPERATION; ASSEMBLYSET = relation of ASSEMBLY; LABORSET = relation of LABOR; database INVENTORY of ITEMS: ITEMSET; OPERATIONS: OPERATIONSET; STRUCTURE: ASSEMBLYSET; WORK: LABORSET end; (8) Recursive Abstraction If we apply repeatedly data abstraction techniques we obtain a semantic hierarchy of entities on the one hand and a hierarchical structure of actions
106
2 Logical Design
on those entities on the other hand. However, there are many application environments in which such a semantic hierarchy contains an entity type which already occurred among its own predecessors. This means that in the definition of an entity type we referred to that entity type itself. In other words, we applied recursive data abstraction. This is precisely the situation which occurs in our example. Indeed, putting together the three diagrams which we developed so far we obtain the following diagram:
ITEM
PRODUCT
MATERIAL
OPERATION
The appropriate abstraction for structuring actions on entities which are recursively defined is recursion. This is yet another reason for the integration of data and programming languages. So an action on an entity of the type ITEM would have the following abstract (conceptual) structure: procedure Action on item (e) begin Action on the generic entity e; case type e of material: Action on the subtype entity material; product: begin Action on the subtype entity product p; foreach m in OPERATION where m working operation of p do Action on m; foreach n in ITEM where n subproduct of p do Action on item (n) end end; Observe that in the definition of the above action (procedure) we referred to that very action so that we obtained a recursively defined action (procedure) which mirrors the recursive structure of the data model. The above conceptual action has the following Pascal-like representation within the framework of the basic relational model:
3 Design Methodology
107
procedure PRODUCTANALYSIS(P: INTEGER); var X: ITEM; Y: ASSEMBLY; begin foreach Yin STRUCTURE where Y.SUPITEM # = P do foreach X in ITEMS where Y.SUBITEM # = X.ITEM # do case X.KIND of MATERIAL: PRINTMATERIAL(X.ITEM #, X.DESCRIPTION, X.UNIT, X.MEASURE, Y.QUANTITY); PRODUCT: begin LABORANALYSIS(Y.SUPITEM #); PRODUCTANALYSIS (Y.SUBITEM #) end end The procedure LABORANALYSIS whose call occurs in the procedure PRODUCTANALYSIS is specified as follows: procedure LABORANALYSIS(P: INTEGER); var X: LABOR; Y: OPERATION; begin foreach X in WORK where X.ITEM # = P do foreach Yin OPERATIONS where Y.OPERATION # = X.OPERATION # do PRINTLABOR(Y.OPERATION #, DESCRIPTION, X.QUANTITY) end; 3.3 Conceptual Modeling (1) Abstraction, Localization, Refinement, and Incremental Design The term conceptual modeling in this book refers to the specification of structural properties of entities and their relationships and the associated actions on them which preserve all those properties, integrity constraints in particular. Structure modeling at the conceptual level involves first of all classification of entities into entity types and specification of the associated integrity constraints. Furthermore, it requires application of data abstraction techniques such as aggregation, generalization, and cover aggregation to compose/decompose entity types. Each such application within the re-
108
2 Logical Design
lational framework requires specification of the associated referential integrity constraints. Conceptual action modeling consists of specification of abstract actions on each entity type which preserve all the properties of such a type, integrity constraints in particular. Such an action is specified using action composition rules (sequential, conditional, selective, and foreach composition) starting from the primitive actions associated with the classification (insert, delete, and update instance of an entity type). The design methodology presented in this section relies on a number of techniques all of which are in one way or another based on some form of abstraction. Abstraction is a technique in which some properties of an application environment are suppressed at a particular design phase to emphasize other properties which are considered to be more relevant for that particular phase. This in particular means that after the final design phase the developed model contains only those properties abstracted from the application environment which are considered to be relevant at all. This technique is used both for structure and action modeling. Classification, aggregation, generalization, and cover aggregation are data abstractions used in structure modeling. They have corresponding abstractions which are used for action modeling (parallel or sequential, selective and foreach composition). Localization is one of the techniques which is based on abstraction, but in conceptual modeling it has a specific form. It refers to a technique in which each entity type is modeled independently of all other entity types other than the ones which are immediately related to it by aggregation, generalization, and cover aggregation or some other form of data abstraction. When localization is applied to action design then each such abstract action associated with an entity type is modeled in such a way that it preserves all the properties and integrity constraints specified for that entity type and for the relationships of that entity type with the entities which are immediately related to it by the above data abstraction techniques. So localization permits incremental design which reduces the complexity of each design phase. Furthermore, since every entity type can simultaneously have part-of, is-a, and member-of relationships with other entity types if the incremental design methodology is used, it is possible to introduce those relationships gradually, one at a time, redefining each time the appropriate abstract actions. Refinement is a technique in which global design is performed first to produce a global model which consists of global data and action specification. In the subsequent phases this global model is gradually decomposed into a hierarchy of more detailed specifications of both structure and actions. This technique of top-down design in which the amount of detail is carefully controlled in each step of the refinement process is a well-known programming methodology. However, a complementary bottom-up approach is also required for both structure and action modeling. Using the top-down ap-
3 Design Methodology
109
proach entities are decomposed into their subcomponents (aggregation), subtypes (generalization), and sets of member entities (cover aggregation). This decomposition of structure is accompanied with the decomposition of abstract actions using sequential, selection, and foreach composition rules. Using the bottom-up approach higher-level entities are derived from the already defined ones using the above data abstraction techniques. These higher-level entities are accompanied with higher-level actions which are composed using the abstract actions on the previously defined entities. (2) Example of Conceptual Modeling Having introduced briefly the conceptual modeling techniques we will illustrate most of them on a particular example. We will assume that the project management activity is abstracted from an application environment and that our task is data modeling of that particular application. Since that application can be very complex we will adopt some further simplifications along the way using the abstraction technique again. Because of that only two types of abstraction (Cartesian and cover aggregation) will be used for data modeling. Consequently, sequential and foreach composition will be used for the specification of abstract actions. The top-down design methodology will be applied in such a way that the most fundamental entity type will be introduced first (which is obviously PROJECT) together with the associated abstract actions. Further development of the model will proceed incrementally, introducing at each step one relationship of a previously defined entity with another entity which is either introduced at that step or already defined in one of the previous steps. Whenever such a relationship is introduced, all the relevant actions are either specified or redefined in such a way that the referential integrity constraints are satisfied. Localization is heavily used particularly in the way abstract actions are specified for each entity type. Each such action is defined for only one entity type. In addition to that entity type, it refers only to those entity types immediately related to that entity via cover aggregation. Consequently, whenever a new relationship is introduced only actions on the related entities must be reconsidered for the refinement which would maintain the referential integrity constraints. A further interesting observation about this design example is that data abstraction which drives the whole design is cover aggregation. This means that whenever a new entity type is introduced it is related by cover aggregation to a previously defined entity type. So cover aggregation is used as a vehicle for the incremental development of the data model. This does not mean that it is the only data abstraction which appears de facto in the final model. Two types of cover aggregation occur in this example (disjoint and overlapping member types) and both are represented within the basic relational framework.
110
2r Logical Design
A particularly simple way in which abstraction is used occurs in the presented design example. Initially, out of all possibly relevant attributes only the identifiers of entities are introduced (keys and foreign keys). Specification of other attributes of the introduced entities is suppressed until the design phase in which all entities and their relationships are specified as well as all the actions on them which maintain those relationships.
(3) Project Management Example (i) Introducing Projects. In the first design phase we introduce the most fundamental entity type PROJECT and its identifier PRO # as the only attribute, thus postponing specification of its other attributes for some later design phase. We also specify two actions associated with the entity type PROJECT: Insert and Delete, using sequential composition. The primitive action Request is used to obtain the values of the attributes quoted in that action. Those values may be requested interactively from the user or, in case of entity identifiers, they may be generated in some system-determined manner as was proposed for the extended relational model. Data abstraction techniques which we used so far are classification and aggregation. PROJECT(PRO #) PROJECT: Insert: Request project identifier Request project attributes Insert project Delete: Request project identifier Delete project (ii) Subdividing Projects into Tasks. To manage a project we subdivide it into a number of tasks. So a set of tasks is associated with each project using cover aggregation which partitions the set of entities of the newly introduced entity type TASK into disjoint subsets. Because of that two attributes of the entity type TASK are introduced: its entity identifier TASK# and the identifier PRO # of the project to which it belongs. This representation of cover aggregation within the basic relational framework requires specification of the appropriate referential integrity constraint which says that a task must refer to an existing project.
PROJECT
TASK
3 Design Methodology
ill
PROJECT(PRO #) TASK(TASK #, PRO #) TASK[PRO#] -_ PROJECT[PRO#] PROJECT: Insert: Request project identifier Request project attributes Insert project For each task of the project insert task Delete: Request project identifier Delete project For each task of the project delete task TASK: Insert: Request task identifier Request project identifier Request task attributes Insert task Delete: Request task identifier Delete task Two actions-Insert and Delete-are specified for the entity type TASK at this phase in the same way that such actions are specified for the entity type PROJECT in the previous design phase. However, since we in fact redefined PROJECT as the cover type of the newly introduced entity type TASK, we have to redefine the Insert and Delete action associated with that entity type. Insertion of a project is refined in such a way that it requires insertion of all its tasks. Deletion of a project requires deletion of all its tasks in accordance with the requirement that a task can exist in the database only if the project to which it belongs also exists in the database. Observe that this action in fact specifies an important property of the chosen application environment. The refinement of Insert and Delete actions is performed by applying the localization technique and considering only the relationships of the entity type PROJECT with the entity type TASK which is immediately related to it by cover aggregation. Consequently, in the refinement of the above actions the foreach composition rule is used. (iii) Assignment of Tasks to Departments. In the further development of our conceptual model we will assume that the application environment has the property that each task is assigned to an organizational unit such as a department. Because of that we introduce a new entity type DEPARTMENT and its identifying attribute DEPT#. The entity types DEPARTMENT and TASK are immediately related by cover aggregation of the partitioning type. In other words, a task belongs to one and only one department. Because of that we refine its specification introducing a new attribute DEPT# which identifies the department of a task. As in the design
112
2 Logical Design
phase (ii) we introduce the appropriate referential integrity constraint which states that a task must refer to an existing department.
DEPARTMENT
TASK
TASK(TASK #, PRO #, DEPT #) DEPARTMENT(DEPT #) TASK[DEPT# ] g DEPARTMENT[DEPT#] DEPARTMENT: Insert: Request department identifier Request department attributes Insert department Delete: Request department identifier Delete department For each task of the department change its department identifier TASK: Insert: Request task identifier Request project identifier Request task attributes Select department Insert task Change department: Request task identifier Request new department Select task and change its department identifier Specification of the insertion action of the entity type DEPARTMENT brings nothing new. But specification of the delete action requires update of the department identifier of each task assigned to the deleted department. Because of that a new update action Change department is associated with the entity type TASK. Insertion of a task now requires specification of an existing department and so that action is refined appropriately. It is important to observe once again that specification of actions captures some important properties of the application environment without which our conceptual model would have been incomplete.
113
3 Design Methodology
(iv) Assignment of Employees to Tasks. To perform a task a set of employees is assigned to the task. So a new entity type EMPLOYEE is introduced and related immediately to the TASK entity type via cover aggregation. However, this time we will assume that this cover aggregation is not partitioning. In other words, an employee may be assigned to a number of tasks. Observe that this is a property of the application environment. To represent this property within the relational framework we introduce a nonentity association ASSIGNMENT of the entity types TASK and EMPLOYEE whose only attributes are for the time being identifiers of the task TASK # and the employee EMP# participating in an instance of that association. An instance of the entity ASSIGNMENT is permitted to exist in the database as long as the instances of the entity types EMPLOYEE and TASK which participate in the association exist in the database. This referential integrity constraint is introduced together with the entity type ASSIGNMENT and the actions which are specified in this design phase are defined in such a way that this newly introduced integrity constraint is maintained.
EMPLOYEE
TASK
TASK(TASK #, PRO #, DEPT #) EMPLOYEE(EMP #) ASSIGNMENT(EMP #, TASK #) ASSIGNMENT[TASK #] g TASK[TASK #] ASSIGNMENT[EMP #] _ EMPLOYEE[EMP # ] EMPLOYEE: Insert: Request employee identifier Request employee attributes Insert employee Delete: Request employee identifier Delete employee For each assignment of the employee change the employee identifier TASK: Insert: Request project identifier Request task identifier Request task attributes
114
2 Logical Design
Select department Select employees for the task Insert task For each selected employee insert its assignment to the task Delete: Request task identifier Delete task For each assignment of the task delete assignment ASSIGNMENT: Insert: Request task identifier Request employee identifier Request assignment attributes Insert assignment Delete: Request task identifier Request employee identifier Delete assignment Change employee: Request task identifier Request employee identifier Request new employee identifier Select assignment and change its employee identifier Specification of actions associated with the newly introduced entity type is performed in such a way that whenever an employee is deleted, each of its assignments is updated (its employee identifier is changed). An alternative would be to delete all of them. This alternative is adopted when the delete action of that entity type is also further refined in such a way that it refers to its relationship with the newly introduced and immediately related entity type EMPLOYEE. In view of how this immediate conceptual relationship is represented within the relational framework, insertion of a task calls for selection of the employees to be assigned to the task and insertion of an assignment of each selected employee to the task. Specification of the insert, delete, and update actions of the entity type ASSIGNMENT is straightforward. (v) Assignment of Employees to Departments. Further development of the conceptual model of the project management application brings nothing new. Further relationships of the entity types introduced so far are modeled using the same design methodology. The assignment of employees to departments is represented and then afterward the assignment of projects and task leaders and department managers. Whenever a new relationship is introduced further refinement of the already specified actions or specification of new ones is performed, but always locally.
115
3 Design Methodology
DEPARTMENT
EMPLOYEE
DEPARTMENT(DEPT #) EMPLOYEE(EMP #, DEPT #) EMPLOYEE[EMP #]
; DEPARTMENT[DEPT #]
EMPLOYEE: Insert: Request employee identifier Request employee attributes Select department Insert employee DEPARTMENT: Delete: Request department identifier Delete department For each employee of the department change its department EMPLOYEE: Change department: Request employee identifier Request new department identifier Select employee and change its department identifier (vi) Assignment of Project Leaders. EMPLOYEE
PROJECT PROJECT(PRO #, LEADER #) EMPLOYEE(EMP #, DEPT #) PROJECT [LEADER#] ý__EMPLOYEE[EMP#]
116
2 Logical Design
PROJECT: Insert: Request project identifier Request project attributes Select project leader Insert project For each task of the project insert task Change leader: Request project identifier Request/Select new leader Select project and change its leader identifier EMPLOYEE: Delete: Request employee identifier Delete employee For each assignment of the employee change its employee identifier For each project where the employee is the leader, change the leader identifier of the project (vii) Assignment of Task Leaders.
TASK
EMPLOYEE
EMPLOYEE(EMP #, DEPT #) TASK(TASK #, PRO #, DEPT #, LEADER #) TASK[LEADER#] #
EMPLOYEE[EMP #]
TASK: Insert: Request/Select project identifier Request task identifier Request task attributes Select department Select employees Select leader Insert task For each selected employee insert its assignment to the task Change leader: Request task identifier Select new leader Change the leader of the task
117
3 Design Methodology
EMPLOYEE: Delete: Request employee identifier Delete employee For each assignment of the employee change its employee identifier For each project where the employee is the leader, change the leader identifier For each task where the employee is the leader, change its leader identifier (viii) Assignment of Department Managers.
DEPARTMENT
EMPLOYEE
EMPLOYEE(EMP #, DEPT #) DEPARTMENT(DEPT #, MGR #) DEPARTMENT[MGR#]
_•EMPLOYEE[EMP#]
DEPARTMENT: Insert: Request department identifier Request department attributes Select manager Insert department Change manager: Request department identifier Select new manager Change department manager EMPLOYEE: Delete: Request employee identifier Delete employee For each assignment of the employee change its employee identifier For each project where the employee is the leader, change the leader identifier For each task where the employee is the leader change its leader identifier For each department where the employee is the manager, change the department manager
118
2 Logical Design
(ix) Relational Schema. Having developed our conceptual model of the project management application to the level of detail at which all the entity types are introduced, their relationships established, and actions on them which properly maintain those relationships also defined, we consider other attributes of those entities and relationships, as well as other update actions which these new attributes may require. Further development of the conceptual model along those lines is left to the reader. We just present a relational schema which is a possible outcome of such a development. EMPLOYEE(EMP #, NAME, JOB, EXPERIENCE, SALARY, DEPT #) DEPARTMENT(DEPT #, NAME, MGR #) PROJECT(PRO #, NAME, LEADER #, CUSTOMER, FUNDS, BALANCE, STARTDATE, DEADLINE) TASK(TASK #, PRO #, DEPT #, LEADER #, NAME, FUNDS, BALANCE) ASSIGNMENT(TASK #, EMP #, ROLE, PERCENTAGE)
Exercises (1) Views (i) One important application of views is to permit a user to access only a subset of a relation. For example, if a user is entitled to read only the names and salaries of employees in a particular department (to which he or she belongs) we may wish to specify the following view: Define a view containing the employee name and salary of those employees in department 13. DEFINE VIEW DEPARTMENT13 AS SELECT NAME, SALARY FROM EMPLOYEE WHERE DEPARTMENT# = 13 (ii) Views are also useful for providing statistical summaries of data. For example, the following view provides the average salary of each department without disclosing any individual salary: Define a view containing all the department identifiers and the average salary of each. DEFINE VIEW SELECT FROM GROUP BY
AVERAGE AS DEPARTMENT#, AVG(SALARY) EMPLOYEE DEPARTMENT#
(iii) If we require that the view contains the names rather than the identifiers
Exercises
119
of departments, then such a view will be constructed using the natural join of the relations EMPLOYEE and DEPARTMENT. a. Define a view containing all the department names and the average salary of each. b. Define a view called PROGRAMMERS consisting of the names and salaries of all programmers and locations of their departments. (iv) Once defined a view behaves like a base table in many ways. In particular, queries may be stated on a view. Using view (iii) b find the average salary of programmers in Boston. (2) Generalization In SQL views may be defined on top of other views. Such a hierarchy of views may be constructed to implement a hierarchy of generalizations as in the following exercise: (i) Choose appropriate attributes of the entities BUS, TRUCK, TRAIN, TRAM, AIRCRAFT, and HELICOPTER. (ii) Define a view ROAD VEHICLE representing a generic entity type of the entity types BUS, TRUCK, and CAR. (iii) Define a view TRACK VEHICLE as a generalization of the entities TRAIN and TRAM. (iv) Define a view LAND VEHICLE on top of the views ROAD VEHICLE and TRACK VEHICLE. The entity type LAND VEHICLE is meant to be a generalization of the entity types ROAD VEHICLE and TRACK VEHICLE. (v) Define a view AIR VEHICLE representing a generic entity type of the entities AIRCRAFT and HELICOPTER. (vi) Define a view MOTOR VEHICLE on top of the views AIR VEHICLE and LAND VEHICLE in such a way that it contains the generic properties of these two entity types. (3) Types of Database Entities (i) Consider the following relational schema PERSON(PERSON #, NAME, AGE)
MARRIAGE(LICENSE #, HUSBAND #, WIFE #, DATE #) CHILD(CHILD #, NAME, FATHER #, MOTHER #, MARRIAGE #) (ii) Classify the entities and their associations as: a. Kernel b. Associative c. Characteristic (iii) Is the integrity constraint which says that an association exists if both
120
2 Logical Design
entities which participate in the association exist justified in this particular example? (iv) Consider the following possibilities, their semantics, and their relational representation: a. CHILD is a characteristic entity of MARRIAGE. b. CHILD is a characteristic entity of PERSON. c. CHILD is an associative entity according to the following entity-relationship diagram:
(4) Soccer Team Design a soccer relational schema according to the following requirements: (i) There are three kernel entities: PLAYER, TEAM, and POSITION. (ii) There is an associative entity SEASON. The entities participating in the association are PLAYER and TEAM. (iii) There exists an M: N nonentity association PLAYS among the entities PLAYER and POSITION. (iv) There exists an N: 1 association between the entities POSITION and TEAM. The above relationships are represented by the following entity-relationship diagram:
Exercises
121
(5) Associations (i) Specify the relational schema which contains one kernel entity COURSE and an M: N association PREREQUISITE of that entity with itself. (ii) Does it make sense to require the integrity constraint which permits a tuple in the PREREQUISITE relation to exist only if both the successor and the predecessor courses are specified in that tuple?
(6) Students and Advisors Specify a relational schema according to the following requirements: (i) There are three kernel entities: STUDENT, PROFESSOR, COURSE. (ii) For each of the above entities the schema contains one relation whose only attribute is the entity identifier. (iii) Compound properties of the entity STUDENT are NAME and ADDRESS, each represented by a separate relation. (iv) Compound properties of the entity PROFESSOR are NAME and ADDRESS, each represented by a separate relation. (v) Properties of the entity COURSE which are CREDITS and DESCRIPTION are also represented by separate relations.
122
2 Logical Design
The following relationships are assumed to be relevant: (vi) A student is advised by one professor only but a professor can advise several students (ADVISING). (vii) A student can attend several courses and a course can be followed by several students (ENROLLMENT). (viii) A professor teaches a number of courses and a course may be taught by a number of professors (TEACHING). (ix) If an association is regarded as an associative entity then its representation in the relational schema includes a relation whose only attribute is the entity identifier. (x) A property of the association ENROLLMENT is GRADE and a property of the association TEACHING is ROOM, both represented by separate relations. (7) Library Database (i) Express the property integrity constraint for the relation NAME in the library database using SQL. (ii) Redesign the extended relational model of the library database so that BOOKLOAN is a nonentity association and not an associative entity. (iii) Using the BOOKLOAN association state the following SQL query: Print the titles of all books which are on loan to a particular borrower. (8) Aggregation, Generalization, and Characterization (i) Consider the following graphical representation of a hierarchy of a collection of entity types:
CONTRACT
a
a
COMPANY
gn/
gen
geil PRO
MATERIAL
char
char
MEA
CHEMICAL PROPERTIES NOMENCLATURE
char
MECHANICAL PROPERTIES
SUPPLIER
CUSTOMER char
ADDRESS
123
Exercises
(ii) Observe that the above example conforms to the following general structure of RM/T hierarchy of types: ASSOCIATIVE TYPES Aggregation INNER KERNEL TYPES Generalization KERNEL SUBTYPES Characterization CHARACTERISTIC TYPES (iii) Choose appropriate attributes of the above entity types and specify all the relations required by the extended relational model of data. (iv) Make explicit specification of all the appropriate integrity constraints. (9) Aggregation and Generalization Consider the following submodels: (i) Departure management data submodel consists of the following entities: a. Kernel: DEPARTURE, EMPLOYEE, PLANE b. Nonentity association of the above three entities: ASSIGNMENT
124
2 Logical Design
(ii) Select appropriate attributes of the above entities and the association and specify the relational submodel. (iii) Booking data submodel consists of the following entities: a. Kernel: FLIGHT b. Characteristic: DEPARTURE (of FLIGHT) PASSENGER (of DEPARTURE)
(iv) Select appropriate attributes of the above entities and specify the relational submodel. (v) Consider the following alternative: BOOKING is an associative entity type and PASSENGER exists as long as at least one of its BOOKINGs exists. (vi) Support data submodel consists of the following: a. Inner kernel: EMPLOYEE (generic), PLANE b. Kernel subtypes: PILOT, TECHNICIAN of the generic entity type EMPLOYEE. c. Nonentity association CAN-FLY of the entity types PILOT and PLANE.
125
Exercises
(vii) Select appropriate attributes of the above entities and specify the relational submodels.
(viii) Specify the overall relational model integrating the three submodels. (ix) Using SQL specify three views of the integrated data model: ASSIGNMENT, BOOKING, and SUPPORT as appropriate associative entities. (10) Cover Aggregation (i) Cover aggregation is an abstraction in which a relationship between
member objects is considered as a higher-level set object. For example, the set TRADE-UNION is a cover aggregation of EMPLOYEE members and the set MANAGEMENT is a cover aggregation of EMPLOYEE members.
TRADE UNION
Covering
MANAGEMENT
COVER TYPE
z EMPLOYEE
MEMBER TYPE
(ii) Cover aggregation is not necessarily partitioning of the member types into disjoint sets as can be seen from the following example of cover aggregation:
126
2 Logical Design
IEEE
ACM
BSC
COMPUTER PROFESSIONAL
(iii) On the other hand, a typical cover member may or may not be homogeneous in type. For example, a task force may consist of ships, tanks, and personnel. TASK FORCE
SHIP
PLANE
TANK
PERSONNEL
(iv) Each cover aggregation type is treated in RM/T as an entity type, having the usual E-relation plus possible P-relations and possible subordinate characteristic relations. In addition, the cover membership relation (KG-relation) is introduced which specifies for every cover aggregation type what are allowable types that may become its members. (11) Cover Aggregation, Generalization, and Characterization (i) Consider the following hierarchy of entity types from the tanker monitoring application environment: CONVOY
COVER TYPE
cover
SHIP char
INCIDENT
gen OIL SPILL
gen
OIL TANKER
INNER KERNEL gen
MERCHANT SHIP
KERNEL SUBTYPE
char INSPECTION
CHARACTERISTIC
127
Exercises
(ii) Choose appropriate attributes of the above entity types and specify the relational schema. (iii) Make explicit specification of the appropriate integrity constraints. (12) Taxonomic Design Methodology. (i) In the section on conceptual modeling, generalization was not used to simplify the exposition. However, generalization and the converse relationship specialization are very important tools for the top-down development of a conceptual model. The key idea of such a methodology is that most general concepts (entity types) and their associated actions from the chosen application environment should be modeled first. The refinement process is then carried out in such a way that more specialized concepts (subtypes of the previously introduced generic entity types and the associated actions) are considered as subcases. In doing that, property inheritance is used so that a subtype entity inherits all the properties of its generic entity type. Specialization is used not only to develop subtypes but also to specify actions associated with the subtype entities. This approach permits gradual introduction of specific details and is thus particularly suitable for those application environments which are characterized by large amounts of simple details. (ii) Consider the student enrollment application. It is easy to identify immediately the kernel entity types STUDENT and COURSE and their aggregate entity type ENROLLMENT
ENROLLMENT
STUDENT
COURSE
Specify the relational schema which corresponds to the above graph choosing appropriate attributes for the above three entity types. Observe that a particularly important multivalued property of a student is the set of courses which he or she has taken. Having specified the above entity types define their associated actions. (iii) Refine the conceptual model defined in the previous design phase using specialization in such a way that two subtypes are introduced for the entity types STUDENT and COURSE which correspond to two levels of study: undergraduate and graduate.
128
2 Logical Design
STUDENT
UNDERGRADUATE STUDENT
COURSE
GRADUATE STUDENT
UNDERGRADUTE COURSE
GRADUATE COURSE
All the advantages of property inheritance should be used. (iv) Refine the abstract actions specified in the design phase (ii) in such a way that they reflect the specialization performed in the next step (iii). (13) Exception Modeling (i) During the initial design phases one particularly important case in which the abstraction technique is used is suppression of all the details related to the exceptional situations which may occur in action modeling. Since actions correspond to activities or events in an application environment that means that during the initial phases of modeling of such activities we disregard all irregular situations and describe only the normal flow of events. Eventually, we have to consider the problem of representation and management of exceptions. Generally speaking, exceptional situations are related to violation of integrity constraints. An abstract action is always designed assuming that some conditions are satisfied before the action is executed. Since an abstract action is a composite action there are usually many situations during its execution in which some integrity constraint may be violated. Finally, there is always a condition which corresponds to a successful completion of an action and describes its desired effect. Action modeling should thus include specification of those conditions as well as specification of appropriate actions which should be performed when such exceptional situations occur. (ii) Consider again the student enrollment application from the previous exercise. Specify all the preconditions which are required for the abstract actions of that particular conceptual model. Examples of such preconditions are: A student cannot take the same course more than once. Graduate students cannot take first, second, or third year undergraduate courses. A student cannot enroll in a course whose enrollment limit has been reached. The implication is that if any of the preconditions of an action are not satisfied that action is aborted and all its possible effects on the database undone. Of course, an appropriate message is sent to the user.
129
Exercises
(iii) Refine further the conceptual model of the student enrollment application using recursive data abstraction in such a way that the entity COURSE is related to itself by cover aggregation which corresponds to the prerequisite relationship among courses. The refinement of the previously defined actions and the newly introduced ones should be performed in such a way that new preconditions are introduced such as: A student cannot take a course unless he or she has already taken all of its prerequisites. (14) Exception Handling (i) Rather than making frequent detours to describe what happens in rare exceptional cases, it is more attractive to specify only the natural flow of events for each abstract action adopting the convention that whenever a constraint is violated an instance of an exception type is inserted into the appropriate relation. It is then possible to specify in a separate design phase how exceptions are to be handled. Furthermore, exceptions are entities themselves and data abstraction techniques may be used for their representation. Exception handlers are actions which may be modeled using the same model composition rules which were developed for actions in general. (ii) Using the approach described in (i) specify exception handling for the conceptual model of the student enrollment system proceeding methodically down the hierarchical structure of the model. (iii) Apply the same methodology to the conceptual model of the project management application. (iv) Consider the possibility of organizing exceptions into a generalization hierarchy. The two top-most levels of such a hierarchy for the student enrollment application might look like this:
ENROLL EXCEPTION
COURSE CHOICE EXCEPTION
TIME VIOLATION EXCEPTION
COURSE AVAILABILITY EXCEPTION
Develop fully this approach for the student enrollment application and specify the required exception handlers.
130
2 Logical Design
Bibliographical Notes The first three normal forms were introduced by Codd (1971a). The Boyce-Codd normal form was presented in this chapter along the lines of Beeri and Bernstein (1979). The fourth normal form was introduced by Fagin (1977). He is also responsible for the projection/join (fifth) normal form (Fagin, 1979). Among the papers on advantages and disadvantages of normalized and unnormalized models we mention Su and Emam (1978). Aggregation and generalization were introduced by Smith and Smith (1977a, 1977b). Those and other data abstraction techniques for expressing semantics of application environments are systematically presented in Codd (1979). The papers by Smith and Smith (1977b) and Codd (1979) also present views on various types of referential integrity constraint which are related to the data abstraction techniques when they are used in the relational framework. The conceptual design methodology presented in this chapter is mainly based on the results of Brodie (1984). Another reference used for this chapter is Borgida et al. (1984). The views on the relational programming environment are based in part on the papers by Schmidt (1977), Brodie (1984), Codd (1979), and Borgida et al. (1984). The exercises in this chapter are adapted from the examples in Ullman (1980), Vetter (1983), Brodie (1984), Codd (1979), and Borgida et al. (1984).
CHAPTER 3
Structural Design
I Relational Images A relation is represented by a number of other sets. One of these sets is the relation itself; others are orderings of tuples of that relation called images. This situation is pictured by the following diagram:
R
I R
image relation
(1) Relational Images Given a relation R(A 1 , A 2..... A.) an image I over R is defined for a set of attributes A, where A _ {A 1,A 2 , ... ,A,,}. The domain of the set A is as-
132
3 Structural Design
sued to be linearly ordered. In general, an image comes equipped with two functions: (i) A successor function nextA: R[A] --* R[A] (ii) An access function accessA : domain(A) --* P(R) where P(R) is the power set of R. The successor function follows from the assumption that the domain of A is linearly ordered. So R[A] is also equipped with two tuples denoted as: (iii)
firstA in R[A], lastA in R[A]
where firstA is a successor of no tuples in R[A] and nextA is undefined for lastA (lastA does not have a successor in R[A]). The access function is defined as follows: accessA(a) = {r: r[A] = a}.
(iv)
In other words, accessA(a) is the set of tuples of the relation R whose A attributes equal a (restriction of the relation R with respect to the condition R[A] = a). On the basis of the above assumptions we may introduce the following two sets and an additional function: (v)
FirstA c R, LastA _ R NextA : P(R) --* P(R)
defined as follows: (vi)
FirstA {r: r in R & r[A] = firstA} LastA = {r: r in R & r[A] = lastA} NextA(S) = {r r in R & there exists s in S such that r[A] = nextA(s[A])} where S ___ R.
The relationship between the above two successor functions and the access function is described by the following diagram:
(vii)
R[A]I
accessA
P(R) NextA
nextA R[A]
accessA
1
P(R)
133
1 Relational Images
The above diagram is meant to be commutative, that is, the following two compositions of functions are equal:
accessA
R[AI
RIA]
NextA -
nextA
P(R)
-b R[AI
1 P(R) accessA
• P(R)
Observe that the definition of the access function implies that (viii)
a not in R[A] implies access(a)
0
(2) Types of Image (i) Index. An image of the most general type, which means equipped with the successor and the access functions defined above, is called an index. (ii) Unique Image. A particularly important case occurs when A in (1) is a (super) key of the relation R. Such an image is called unique. For a unique image the result of the accessA function is at most a singleton set. (iii) Clustering Image. Among the images of a relation R one may have the clustering property. That property refers to the efficiency of retrieving tuples of R using the access and successor functions. As a rule such actions are much more efficient for the clustering image than for other images. (iv) Sort Order. A particular type of a clustering image is a sort order. A sort order does not have an explicit access function, but it is still equipped with the sets First, Last, and the function Next for which the efficiency property remains valid. (v) List. A particular case of a sort order is a list. In a list the set of sort attributes is empty, which simply means that tuples of a relation are ordered in some way irrespective of the values of their attributes. But the function Next is still applicable as are the remarks about its efficiency. (vi) Hash Image. Finally, an image to which only the access function is applicable is called a hash image. In that case the set A in (1) may be linerly ordered, but the function nextA is not available. Every relation has at least one image. If it has exactly one, then that image is its clustering image. The default type of a clustering image is a list.
134
3 Structural Design
(3) Images in SQL SQL allows the user to create and delete images, but no statements which operate on them are available. Selection of an optimal image for processing a query or updating of images as the relations over which they are created are updated is left to the system. Suppose that we have created a relation whose structure is EMPLOYEE(EMP #, NAME, JOB, SALARY, DEPT #) The statement CREATE IMAGE NAMES ON EMPLOYEE (NAME) creates an index of names of employees. The index is neither unique nor clustering. The statement CREATE UNIQUE IMAGE EMPNO ON EMPLOYEE (EMP #) creates an index on the primary key of the relation EMPLOYEE. The statement specifices that the attribute EMP # is a key. Null values of the attribute EMP# are forbidden in the definition of the relation EMPLOYEE (see CREATE statement). The statement CREATE CLUSTERING IMAGE DEPTS ON EMPLOYEE (DEPT#) creates a clustering image on the relation EMPLOYEE. That means that tuples of the relation will in fact be ordered (grouped) according to their department numbers. Accessing all employees of a specific department will be a more efficient operation than it would have been if the image was not clustering. An image may be both unique and clustering. Suppose that we want to forbid duplicate names and that we want to give preference (in terms of efficiency) to those queries in which the relation EMPLOYEE is processed in alphabetical order. Then we may create the following image: CREATE UNIQUE CLUSTERING IMAGE NAMES ON EMPLOYEE (NAME) The efficiency property now refers to the accessNAME and nextNAME functions.
135
2 Decomposition of Unary Queries
2 Decomposition of Unary Queries Assuming that database relations are represented in terms of images which are equipped with the operations described in the previous section, we consider the problem of decomposition of a relational query into a procedural form. The procedural form is expressed in terms of the operations on images and it produces the relation which is the result of the query (whose decomposition it represents). We consider first unary queries, that is, queries on a single relation. Such a query is a composition of restriction and projection operations. In addition, the relation which represents the result of the query may have to be sorted (to have a sort order on some attributes). Rather than giving a general approach first, we analyze three typical queries whose decomposition will require the development of three unary algorithms to be used for procedural decomposition of any unary query. Furthermore, the unary algorithms will appear as modules used for decomposition of binary and more complex queries. (1) Exhaustive Restriction and Projection (ER) (i) Suppose that we are given a query, SELECT FROM WHERE AND
NAME EMPLOYEE JOB = 'PROGRAMMER' SALARY > 30000
on the relation EMPLOYEE(EMP #, NAME, JOB, SALARY) for which we assume that it has no indices over the attributes JOB and SALARY. (ii) Under the chosen assumptions the above query is processed using the following strategy: Successive tuples of the relation EMPLOYEE are processed according to the clustering image of that relation. The restriction and the projection operations are applied to each tuple accessed that way. (iii) In the graphical representation of the decomposition of query (i) in terms of the operations of the relational algebra, those operations are placed in one global rectangle indicating that all of them are performed in the specified order on the current tuple of the relation EMPLOYEE.
136
3 Structural Design
EMPLOYEE
(iv) The abstract algorithm ER for the exhaustive restriction and projection may be defined as follows: foreach r in R do if Q(r) then output(r[P])
R Q P
relation restriction condition output attributes
The above algorithm may be further decomposed into the following form: S j implies ai 5 lock. The steps 1, 2, ... , j - 1 constitute the growing phase of the transaction T and the steps j, j + 1, ... , n its shrinking phase. The terms growing and shrinking refer to the set of objects which the transaction has locked. During the growing phase a two-phase transaction locks objects. The beginning of the shrinking phase is determined by the first unlock action which a twophase transaction executes. After that step such a transaction never locks an object. (7) Examples of Legal Executions of Well-Formed and Two-Phase Transactions (i) Suppose that we are given the following two well-formed and two-phase transactions.
183
3 Locking Protocols
TI: 1. 2. 3. 4. 5. 6.
LOCK UPDATE LOCK UNLOCK UPDATE UNLOCK
A A B A B B
T2 : 1. 2. 3. 4. 5. 6.
LOCK UPDATE LOCK UPDATE UPDATE UNLOCK "7.UPDATE 8. UNLOCK
A A B B A A B B
(ii) The following is a legal execution I, of the transactions T1 and T2: I, :
T 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.
LOCK UPDATE LOCK UNLOCK
T2 A A B A LOCK A UPDATE A
UPDATE B UNLOCK B LOCK UPDATE UPDATE UNLOCK UPDATE UNLOCK
B B A A B B
Observe that the dependency relation of the above execution is D(I) = {(T 1 ,A, T 2 ),(T,,B, T 2 )} Consider also another example of a legal execution tions T1 and T2:
12:
T,1
1. 2. 3. 4. 5. 6. 7. LOCK A 8. UPDATE A 9. 10.
12
of the given transac-
T2 LOCK UPDATE LOCK UPDATE UPDATE UNLOCK
A A B B A A
UPDATE B UNLOCK B
184
4 Data Integrity
I. 12. 13. 14.
LOCK UNLOCK UPDATE UNLOCK
B A B B
The dependency relation of the above execution is D(12) = {(T 2 ,A, T1 ),(T 2 ,B, TI)} In the above examples the execution I, was equivalent to the serial execution T1 ; T2 and the execution 12 to the serial execution T2; T 1 . We show that this in fact holds in general; that is every legal execution of well-formed and two-phase transactions T1 and T2 is equivalent either to the serial execution T 1 ; T2 or to the serial execution T2; T1 . (8) Legal Executions of Two Well-Formed and Two-Phase Transactions Which Are Serializable (i) If transactions TI and T2 are both well-formed and two-phase then a legal execution of T1 and T 2 is equivalent either to the serial execution T 1; T2 or to the serial execution T2 ; T 1. The above assertion is an immediate consequence of the following lemma: (ii) Suppose that I is a legal execution of well-formed and two-phase transactions T1 and T2 . If (T1, e, T 2) is in D(I) for some object e then the same holds for all objects on which transactions T, and T2 perform their actions. To prove the above lemma we will assume that there exist e and e' such that (T 1 ,e, T2) is in D(I) and (T 2,e', T 1) is in D(I). Then we will show that this assumption leads to a contradiction, so it is impossible. We proceed as follows: (T 1 , e, T2 ) in D(I) and (T2, e', T,) in D(I) together with the assumption that T, and 7'2 are well formed and two phase imply two cases: a. There exist k and m such that k<m I(k) = (T1, e, unlock) I(m) = (T 2, e, lock) k <j < m implies I(j) 0 (Tl,e, aj) I(j) 0 (T2, e, aj) b. There exist p and q such that p