Advanced Topics in Database Research
Volume 4

Keng Siau
University of Nebraska-Lincoln, USA
IDEA GROUP PUBLISHING Hershey • London • Melbourne • Singapore
Acquisitions Editor: Mehdi Khosrow-Pour
Senior Managing Editor: Jan Travers
Managing Editor: Amanda Appicello
Development Editor: Michele Rossi
Copy Editor: April Schmidt
Typesetter: Cindy Consonery
Cover Design: Integrated Book Technology
Printed at: Integrated Book Technology
Published in the United States of America by
Idea Group Publishing (an imprint of Idea Group Inc.)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.idea-group.com

and in the United Kingdom by
Idea Group Publishing (an imprint of Idea Group Inc.)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 3313
Web site: http://www.eurospan.co.uk

Copyright © 2005 by Idea Group Inc. All rights reserved. No part of this book may be reproduced in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.

Advanced Topics in Database Research, Volume 4 is part of the Idea Group Publishing series named Advanced Topics in Database Research (Series ISSN 1537-9299).

ISBN 1-59140-471-1
Paperback ISBN 1-59140-472-X
eISBN 1-59140-473-8

British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.
Advanced Topics in Database Research Volume 4
Table of Contents

Preface .......................................................... vi

Chapter I
Dynamic Workflow Restructuring Framework for Long-Running Business Processes ............ 1
    Ling Liu, Georgia Institute of Technology, USA
    Calton Pu, Georgia Institute of Technology, USA
    Duncan Dubugras Ruiz, Pontifical Catholic University of RS, Brazil

Chapter II
Design and Representation of Multidimensional Models with UML and XML Technologies ............ 50
    Juan Trujillo, Universidad de Alicante, Spain
    Sergio Luján-Mora, Universidad de Alicante, Spain
    Il-Yeol Song, Drexel University, USA

Chapter III
Does Protecting Databases Using Perturbation Techniques Impact Knowledge Discovery? ............ 96
    Rick L. Wilson, Oklahoma State University, USA
    Peter A. Rosen, University of Evansville, USA

Chapter IV
Simultaneous Database Backup Using TCP/IP and a Specialized Network Interface Card ............ 108
    Scott J. Lloyd, University of Rhode Island, USA
    Joan Peckham, University of Rhode Island, USA
    Jian Li, Cornell University, USA
    Qing (Ken) Yang, University of Rhode Island, USA
Chapter V
Towards User-Oriented Enterprise Modeling for Interoperability ............ 130
    Kai Mertins, Fraunhofer Institute IPK, Berlin
    Thomas Knothe, Fraunhofer Institute IPK, Berlin
    Martin Zelm, CIMOSA Association, Germany

Chapter VI
Using a Model Quality Framework for Requirements Specification of an Enterprise Modeling Language ............ 142
    John Krogstie, SINTEF ICT and IDI, NTNU, Norway
    Vibeke Dalberg, DNV, Norway
    Siri Moe Jensen, DNV, Norway

Chapter VII
Population of a Method for Developing the Semantic Web Using Ontologies ............ 159
    Adolfo Lozano-Tello, Universidad de Extremadura, Spain
    Asunción Gómez-Pérez, Universidad Politécnica de Madrid, Spain

Chapter VIII
An Evaluation of UML and OWL Using a Semiotic Quality Framework ............ 178
    Yun Lin, Norwegian University of Science and Technology, Norway
    Jennifer Sampson, Norwegian University of Science and Technology, Norway
    Sari Hakkarainen, Norwegian University of Science and Technology, Norway
    Hao Ding, Norwegian University of Science and Technology, Norway

Chapter IX
Information Modeling Based on Semantic and Pragmatic Meaning ............ 201
    Owen Eriksson, Dalarna University, Sweden
    Pär J. Ågerfalk, University of Limerick, Ireland, and Örebro University, Sweden

Chapter X
Higher-Order Types and Information Modeling ............ 218
    Terry Halpin, Northface University, USA
Chapter XI
Criteria for Comparing Information Modeling Methods: Informational and Computational Equivalence ............ 238
    Keng Siau, University of Nebraska-Lincoln, USA

Chapter XII
COGEVAL: Applying Cognitive Theories to Evaluate Conceptual Models ............ 255
    Stephen Rockwell, University of Tulsa, USA
    Akhilesh Bajaj, University of Tulsa, USA

Chapter XIII
Quality of Analysis Specifications: A Comparison of FOOM and OPM Methodologies ............ 283
    Judith Kabeli, Ben-Gurion University, Israel
    Peretz Shoval, Ben-Gurion University, Israel

Chapter XIV
Interoperability of B2B Applications: Methods and Tools ............ 297
    Christophe Nicolle, Université de Bourgogne, France
    Kokou Yétongnon, Université de Bourgogne, France
    Jean-Claude Simon, Université de Bourgogne, France

Chapter XV
Possibility Theory in Protecting National Information Infrastructure ............ 325
    Richard Baskerville, Georgia State University, USA
    Victor Portougal, University of Auckland, New Zealand

Chapter XVI
Enabling Information Sharing Across Government Agencies ............ 341
    Akhilesh Bajaj, University of Tulsa, USA
    Sudha Ram, University of Arizona, USA

About the Authors ............ 367

Index ............ 377
Preface
The Advanced Topics in Database Research book series has been regarded as an excellent academic book series in the fields of databases, software engineering, and systems analysis and design. The goal of the book series is to provide researchers and practitioners with the latest ideas and best work in these fields. This is the fourth volume of the book series. We are again fortunate to have authors who are committed to submitting their best work for inclusion as chapters in this book. In the following, I will briefly introduce the 16 excellent chapters in this book:

Chapter I, "Dynamic Workflow Restructuring Framework for Long-Running Business Processes", combines the ActivityFlow specification language with a set of workflow restructuring operators and a dynamic workflow management engine to develop a framework for long-running business processes. The chapter explains how the ActivityFlow language supports a collection of specification mechanisms that increase the flexibility of workflow processes, and how it offers an open architecture that supports user interaction and collaboration among workflow systems of different organizations.

Chapter II, "Design and Representation of Multidimensional Models with UML and XML Technologies", presents the use of the Unified Modeling Language (UML) and the eXtensible Markup Language (XML) schema to abstract the representation of Multidimensional (MD) properties at the conceptual level. The chapter also provides different presentations of the MD models by means of eXtensible Stylesheet Language Transformations (XSLT).

Chapter III, "Does Protecting Databases Using Perturbation Techniques Impact Knowledge Discovery?", examines the effectiveness of Generalized Additive Data Perturbation (GADP) methods in protecting the confidentiality of data. Data perturbation is a data security technique that adds noise in the form of random numbers to numerical database attributes. The chapter discusses whether perturbation techniques add a so-called Data Mining Bias to
the database and explores the competing objectives of protecting confidential data versus disclosing it for data mining applications.

Chapter IV, "Simultaneous Database Backup Using TCP/IP and a Specialized Network Interface Card", introduces a prototype device driver, Realtime Online Remote Information Backup (RORIB), in response to problems with current backup and recovery techniques used in e-business applications. The chapter presents a true real-time system that is hardware and software independent and accommodates any type of system, as an alternative to extremely expensive Private Backup Networks (PBNs) and Storage Area Networks (SANs).

Chapter V, "Towards User-Oriented Enterprise Modeling for Interoperability", introduces user-oriented Enterprise Modeling as a means to support new approaches for the development of networked organizations. The chapter discusses the structuring of user requirements and describes the initial design of the Unified Enterprise Modeling Language (UEML), developed in a research project sponsored by the European Union.

Chapter VI, "Using a Model Quality Framework for Requirements Specification of an Enterprise Modeling Language", introduces a Model Quality Framework that tackles the selection and refinement of a modeling language for a process harmonization project in an international organization. The harmonization project uses process models to prioritize what is to be implemented in the specialized language and to develop a support environment for the new harmonized process.

Chapter VII, "Population of a Method for Developing the Semantic Web Using Ontologies", introduces the ONTOMETRIC method, which allows existing ontologies to be evaluated and better selections among them to be made.

Chapter VIII, "An Evaluation of UML and OWL Using a Semiotic Quality Framework", systematically evaluates the Unified Modeling Language (UML) and Web Ontology Language (OWL) models using a semiotic quality framework. The chapter highlights the strengths and weaknesses of the two modeling languages from a semiotic perspective. This evaluation assists researchers in the selection and justification of modeling languages in different scenarios.

Chapter IX, "Information Modeling Based on Semantic and Pragmatic Meaning", introduces an information modeling approach based on speech act theory to support meaningful communication between different actors within a social action context. The chapter discusses how taking both semantic and pragmatic meaning into consideration provides a theoretical basis for addressing problems central to information modeling: the identifier problem, the ontological problem, and the predicate problem.

Chapter X, "Higher-Order Types and Information Modeling", examines the advisability and appropriateness of using higher-order types in information models. The chapter discusses the key issues involved in implementing the model,
suggests techniques for retaining a first-order formalization, and offers suggestions for adopting a higher-order semantics.

Chapter XI, "Criteria for Comparing Information Modeling Methods: Informational and Computational Equivalence", introduces an evaluation approach based on the human information processing paradigm and the theory of equivalence of representations. This evaluation approach proposes informational and computational equivalence as the criteria for evaluation and comparison.

Chapter XII, "COGEVAL: Applying Cognitive Theories to Evaluate Conceptual Models", proposes a propositional framework called COGEVAL that is based on cognitive theories to evaluate conceptual models. The chapter isolates the effect of a model-independent variable on readability and illustrates the dimensions of modeling complexity. This evaluation is particularly useful for creators of new models and for practitioners who use currently available models to create schemas.

Chapter XIII, "Quality of Analysis Specifications: A Comparison of FOOM and OPM Methodologies", shows that the Functional and Object Oriented Methodology (FOOM) produces better-quality analysis models than the Object-Process Methodology (OPM). The comparison is based on a controlled experiment that compares the quality of equivalent analysis models of the two methodologies, using a unified diagrammatic notation.

Chapter XIV, "Interoperability of B2B Applications: Methods and Tools", introduces a Web-based data integration methodology and tool framework called X-TIME that supports the development of Business-to-Business (B2B) design environments and applications. The chapter presents X-TIME as a tool for creating adaptable, semantic-oriented meta models that support interoperable information systems and help build a cooperative environment for B2B platforms.

Chapter XV, "Possibility Theory in Protecting National Information Infrastructure", introduces a quantitative approach based on possibility theory as an alternative for information security evaluation. This research responds to national concern about the security of both military and civilian information resources in the face of information warfare and the need to defend national information infrastructures. The approach is suitable for information resources that are vulnerable to intensive professional attacks.

Chapter XVI, "Enabling Information Sharing Across Government Agencies", addresses the increased interest in information sharing among government agencies with respect to improving security, reducing costs, and offering better quality service to users of government services. The chapter proposes a comprehensive methodology called Interagency Information Sharing (IAIS) that uses the eXtensible Markup Language (XML) to facilitate the definition of information that needs to be shared. The potential conflicts and the comparison of IAIS with two other alternatives are also explored.
These 16 chapters provide an excellent sample of the state-of-the-art research in the field of databases. I hope this book will be a useful reference and a valuable collection for both researchers and practitioners.

Keng Siau
University of Nebraska-Lincoln, USA
October 2004
Chapter I
Dynamic Workflow Restructuring Framework for Long-Running Business Processes

Ling Liu, Georgia Institute of Technology, USA
Calton Pu, Georgia Institute of Technology, USA
Duncan Dubugras Ruiz, Pontifical Catholic University of RS, Brazil
ABSTRACT
This chapter presents a framework for dynamic restructuring of long-running business processes. The framework is composed of the ActivityFlow specification language, a set of workflow restructuring operators, and a dynamic workflow management engine. The ActivityFlow specification language enables the flexible specification, composition, and coordination of workflow activities. There are three unique features of our framework design. First, it supports a collection of specification mechanisms, allowing workflow designers to use a uniform workflow specification interface to describe different types of workflows involved in their organizational processes. A main objective of this characteristic is to help increase the flexibility of workflow processes in accommodating changes. The ActivityFlow language also provides a set of activity modeling facilities, enabling workflow designers to describe the flow of work declaratively and incrementally, and allowing reasoning about the correctness and security of
complex workflow activities independently from their underlying implementation mechanisms. Finally, it offers an open architecture that supports user interaction as well as collaboration of workflow systems of different organizations. Furthermore, our business process restructuring approach enables the dynamic restructuring of workflows while preserving the correctness of ActivityFlow models and related instances. We report a set of simulation-based experiments to show the benefits and cost of our workflow restructuring approach.
INTRODUCTION
The focus of office computing today has shifted from automating individual work activities to supporting the automation of organizational business processes. Examples of such business processes include handling bank loan applications, processing insurance claims, and providing telephone services. This shift in requirements, pushed by technology trends, has promoted a computing infrastructure based on workflow management systems (WFMSs), which provides not only a model of business processes but also a foundation on which to build solutions supporting the coordination, execution, and management of business processes (Aalst & Hee, 2002; Leymann & Roller, 2000). One of the main challenges for today's WFMSs is to provide tools that help organizations coordinate and automate the flow of work activities between people and groups within an organization and streamline and manage business processes that depend on both information systems and human resources.

Workflow systems have gone through three stages over the last decade. First, homegrown workflow systems were monolithic in the sense that all control flows and data flows were hard-coded into applications, making them difficult to maintain and evolve. The second generation of workflow systems was driven by imaging/document management systems or desktop object management. The workflow components of these products tend to be tightly coupled with the production systems. Typical examples are smart form systems (e.g., expense report handling) and case folder systems (e.g., insurance claims handling). Third-generation workflow systems have an open infrastructure, a generic workflow engine, a database or repository for sharing information, and use middleware technology for distributed object management. Several research projects are contributing toward building third-generation workflow systems (Mohan, 1994; Sheth, 1995; Sheth et al., 1996). For a survey of some of the workflow automation software products and prototypes, see Georgakopoulos, Hornick, and Sheth (1995) and Aalst and Hee (2002).

Recently, workflow automation has been approached in the light of Web services and related technology. According to Alonso, Casati, Kuno, and
Machiraju (2004), the goal of Web services is to achieve interoperability between applications by using Web application standards, exemplified by SOAP (an XML messaging protocol), WSDL (Web Services Description Language), and UDDI (Universal Description, Discovery and Integration), to publish and discover services. According to the W3C (2004) definition of Web services, a Web service is "a software system identified by a URI, whose public interfaces and bindings are defined and described using XML. Its definition can be discovered by other software systems. These systems may then interact with the Web service in a manner prescribed by its definition, using XML based messages conveyed by Internet protocols." Computing based on Web services constitutes a new middleware technology for third-generation workflow systems that permits an easier description of the interactions among Internet-oriented software applications. In this sense, workflow automation plays a strategic role in the coordination and management of the flow of activities implemented as Web services.

Although workflow research and development have attracted more and more attention, it is widely recognized that there are still technical problems, ranging from inflexible and rigid process specification and execution mechanisms and insufficient possibilities to handle exceptions, to the need for uniform interface support for various types of workflows, that is, ad hoc, administrative, collaborative, or production workflows. In addition, the dynamic restructuring of business processes, process status monitoring, automatic enforcement of consistency and concurrency control, recovery from failure, and interoperability between different workflow servers should be improved. As pointed out by Sheth et al. (1996), many existing workflow management systems use Petri-net-based tools for process specification. The available design tools typically support the definition of control flows and data flows between activities by connecting the activity icons with specialized arrows, specifying the activity precedence order and their data dependencies. In addition to graphical specification languages, many workflow systems provide rule-based specification languages (Dayal, Hsu & Ladin, 1990; Georgakopoulos et al., 1995). Although these existing workflow specification languages are powerful in expressiveness, a common problem (even with those based on graphical node-and-arc programming models) is that they are not well structured. Concretely, when used for modeling complex workflow processes without discipline, these languages may result in schemas with intertwined precedence relationships. This makes debugging, modifying, and reasoning about complex workflow processes difficult (Liu & Meersman, 1996).

In this chapter, we concentrate our discussion on the problem of flexibility and extensibility of process specification and execution mechanisms as well as the dynamic restructuring of business processes. We introduce the ActivityFlow specification language for structured specification and flexible coordination of workflow activities and a set of workflow activity restructuring operators to
tackle the workflow restructuring problem. Our restructuring approach enables the optimization of business processes without necessarily reengineering an enterprise. The most interesting features of the ActivityFlow specification language include:
• A collection of specification mechanisms, which allows the workflow designer to use a uniform workflow specification interface to describe different types of workflows involved in their organizational processes and helps to increase the flexibility of workflow processes in accommodating changes;
• A set of activity modeling facilities, which enables the workflow designer to describe the flow of work declaratively and incrementally, allowing reasoning about correctness and security of complex workflow activities independently from their underlying implementation mechanisms; and
• An open architecture, which supports user interactions as well as collaboration of workflow systems of different organizations.
The rest of this chapter proceeds as follows. In the Basic Concepts of ActivityFlow section, we describe the basic concepts of ActivityFlow and highlight some of the important features. In the ActivityFlow Process Definition Language section, we present our ActivityFlow specification language and illustrate the main features of the language using the telephone service provisioning workflow application as the running example. In the Dynamic Workflow Restructuring of ActivityFlow Models section, we present a set of workflow activity restructuring operators for the dynamic change of ActivityFlow models, together with simulation experiments that demonstrate the effectiveness of such operators. The Implementation Considerations section discusses the implementation architecture of ActivityFlow and the related implementation issues. We conclude the chapter with a discussion of related work and a summary in the Related Work and Conclusion section.
BASIC CONCEPTS OF ACTIVITYFLOW
Business Process vs. Workflow Process
Business processes are collections of activities that support critical organizational and business functions. The activities within a business process have a common business or organizational objective and are often tied together by a set of precedence dependency relationships. One of the important problems in managing business processes (whether by an organization or by a human) is how to effectively capture the dependencies among activities and utilize those dependencies for scheduling, distributing, and coordinating work activities among human and information system resources efficiently.
A workflow process is an abstraction of a business process, and it consists of activities, which correspond to individual process steps, and actors, which execute these activities. An actor may be a human (e.g., a customer representative), an information system, or any of the combinations. A notable difference between business process and workflow process is that a workflow process is an automated business process; namely, the coordination, control, and communication of activities is automated, although the activities themselves can be either automated or performed by people (Sheth et al., 1996). A workflow management system (WFMS) is a software system which offers a set of workflow enactment services to carry out a workflow process through automated coordination, control, and communication of work activities performed by both humans and computers. An execution of a workflow process is called a workflow case (Hollingsworth & WfMC, 1995; WfMC, 2003). Users communicate with workflow enactment services by means of workflow clients, programs that provide an integrated user interface to all processes and tools supported by the system.
Reference Architecture
Figure 1 shows the WFMS reference architecture provided by the Workflow Management Coalition (WfMC) (Hollingsworth & WfMC, 1995). A WFMS consists of an engine, a process definition tool, workflow application clients, invoked applications, and administration and monitoring tools. The process definition tool is a visual editor used to define the specification of a workflow process, and we call it workflow process schema in ActivityFlow. The same schema can be used later for creating multiple instances of the same business process (i.e., each execution of the schema produces an instance of the same business process). The workflow engine and the surrounding tools communicate with the workflow database to store, access, and update workflow process control data (used by the WFMS only) and workflow process-specific data (used by both application and WFMS). Examples of such data are workflow activity schemas, statistical information, and control information required to execute and monitor the active process instances. Existing WFMSs maintain audit logs that keep track of information about the status of the various system components, changes to the status of workflow processes, and various statistics about past process executions. This information can be used to provide real-time status reports about the state of the system and the state of the active workflow process instances, as well as various statistical measurements, such as the average execution time of an activity belonging to a particular process schema and the timing characteristics of the active workflow process instances. ActivityFlow, discussed in this chapter, can be seen as a concrete instance of the WfMC reference architecture in the sense that in ActivityFlow concrete solutions are introduced for process definitions, workflow activity enactment
Figure 1. Reference architecture of Workflow Management Coalition
services, and interoperability with external workflow management systems. Our focus is on the ActivityFlow process definition facilities, including the ActivityFlow meta model (see ActivityFlow Meta Model section), the ActivityFlow workflow specification language (see ActivityFlow Process Definition Language section), and graphical notation for ActivityFlow process definition based on UML Activity diagrams.
ActivityFlow Meta Model
The ActivityFlow meta model describes the basic elements that are used to define a workflow process schema, which describes the pattern of a workflow process and its coordination agreements. In ActivityFlow, a workflow process schema specifies the activities that constitute the workflow process and the dependencies between these constituent activities. Activities represent steps required to complete a business process. A step is a unit of processing and can be simple (primitive) or complex (nested). Activity dependencies determine the execution order of activities and the data flow between these activities. Activities can be executed sequentially or in parallel. Parallel executions may be unconditional (all activities are executed) or conditional (only the activities that satisfy a given condition are executed). In addition, activities may be executed repeatedly, and the number of iterations may be determined at run-time. A workflow process schema can be executed many times. Each execution is called a workflow process instance (or a workflow process for short), which is a partial order of activities and connectors. The set of activity-precedence-dependency relationships defines a partial order over the given set of activities. The connectors represent the points where the control flow changes. For instance, the point where control splits into multiple parallel activities is referred to as a split point and is specified using a split connector.
The point where control merges into one activity is referred to as a join point and is specified using a join connector. A join point is called an AND-join if the activity immediately following this point starts execution only when all the activities preceding the join point finish execution. A join point is called an OR-join when the activity immediately following this point starts execution as soon as one of the activities preceding the join point finishes execution. A split point that can be statically determined (before execution) in which all branches are taken is called an AND-split. A split point that can be statically determined in which exactly one of the branches will be taken is called an OR-split. Figure 2 lists the typical graphical representations of AND-split, OR-split, AND-join, and OR-join using UML activity diagram constructs (Fowler & Scott, 2000; Rumbaugh, Jacobson & Booch, 1999).

Figure 2. UML graphical representation of AND-split, OR-split, AND-join, and OR-join

The workflow process schema also specifies which actors can execute each workflow activity. Such specification is normally done by associating roles with activities. A role serves as a description or a placeholder for a person, a group, an information system, or any combination of these required for the enactment of an activity. Formally, a role is a set of actors. Each activity has an associated role that determines which actors can execute this activity. Each actor has an activity queue associated with it. Activities submitted for execution are inserted into the activity queue when the actor is busy. The actor follows its own local policy for selecting the next activity to execute from its queue. The most common scheduling policies are priority-based and FIFO. The notion of a role facilitates load balancing among actors and can flexibly accommodate changes in the workforce and in the computing infrastructure of an organization by changing the set of actors associated with roles.

Figure 3 shows a sketch of the ActivityFlow meta model using UML class diagram constructs (Fowler & Scott, 2000; Rumbaugh et al., 1999). The following concepts are the basics of the activity-based process model:
• A workflow process: consists of a set of activities and roles, and a collection of information objects to be accessed from different information resources.
• An activity: is either an elementary activity or a composite activity. The execution of an activity consists of a sequence of interactions (called events) between the performer and the workflow management system, and a sequence of actions that change the state of the system.
• An elementary activity: represents a unit of work that an individual, a machine, or a group can perform in an uninterrupted span of time. In other words, it is not decomposed any further in the given domain context.
• A composite activity: consists of several other activities, either elementary or composite. The nesting of activities provides higher levels of abstraction that help to capture the various structures of organizational units involved in a workflow process.
• A role: is a placeholder or description for a set of actors, who are the authorized performers that can execute the activity. The concept of associating roles with activities not only allows us to establish the rules for associating activities or processes with organizational responsibilities but also provides a flexible and elegant way to grant the privilege of executing an activity to individuals or systems that are authorized to assume the associated role.
• An actor: can be a person, a group of people, or an information system that is granted membership into roles and that interacts with other actors while performing activities in a particular workflow process instance.
• Information objects: are the data resources accessed by a workflow process. These objects can be structured (e.g., relational databases), semi-structured (e.g., HTML forms), or unstructured (e.g., text documents). Structured or semi-structured data can be accessed and interpreted automatically by the system, while unstructured data cannot and thus often requires human involvement through manual activities.

Figure 3. Activity flow meta model
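To make these concepts more concrete, the following is a minimal Python sketch of the core meta-model vocabulary. It is an illustration only: the class and attribute names are our own, not part of ActivityFlow (the chapter expresses workflows in its own specification language, not in Python), and only the AND-join/OR-join firing rule from Figure 2 is encoded as behavior.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Connector(Enum):
    AND_SPLIT = "AND-split"
    OR_SPLIT = "OR-split"
    AND_JOIN = "AND-join"
    OR_JOIN = "OR-join"


@dataclass
class Actor:
    """A person, group, or information system granted membership in roles."""
    actor_id: str
    queue: List[str] = field(default_factory=list)  # pending work items (FIFO in this sketch)


@dataclass
class Role:
    """A placeholder for the set of actors authorized to perform an activity."""
    name: str
    actors: List[Actor] = field(default_factory=list)


@dataclass
class Activity:
    """An elementary or composite step of a workflow process."""
    name: str
    role: Optional[Role] = None
    subactivities: List["Activity"] = field(default_factory=list)

    @property
    def is_composite(self) -> bool:
        return bool(self.subactivities)


@dataclass
class WorkflowProcess:
    """A workflow process schema: activities, roles, and information objects."""
    name: str
    activities: List[Activity] = field(default_factory=list)
    roles: List[Role] = field(default_factory=list)
    information_objects: List[str] = field(default_factory=list)


def join_fires(join: Connector, branches_finished: List[bool]) -> bool:
    """AND-join waits for all preceding branches; OR-join fires on the first one."""
    if join is Connector.AND_JOIN:
        return all(branches_finished)
    if join is Connector.OR_JOIN:
        return any(branches_finished)
    raise ValueError("not a join connector")
```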
It is important to note that activities in ActivityFlow can be (1) manual activities, performed by users without further support from the system; (2) automatic activities, carried out by the system without human intervention; or (3) semi-automatic activities, performed using specific interactive programs.
The Running Example
To illustrate the ActivityFlow meta model, we use a telephone service provisioning process in a telecommunication company. This example was originally presented by Ansari, Ness, Rusinkiewicz, and Sheth (1992) and Georgakopoulos et al. (1995). Figure 12 (in the Restructuring Possibilities on TeleConnect WorkFlow section) presents an environment where a set of computer systems offer services on the Web. A synopsis of the example is given below.

Consider a business process TeleConnect that performs the telephone service provisioning task by installing and billing telephone connections between the telecomm company and its clients. Suppose the workflow process A:TELECONNECT consists of five activities: A1:CLIENTREGISTER, A2:CREDITCHECK, A3:CHECKRESOURCE, A11:INSTALLNEWCIRCUIT, and B:ALLOCATECIRCUIT (see Figure 4(A)). A:TELECONNECT is executed when an enterprise's client requests telephone service installation. Activity A1:CLIENTREGISTER registers the client information, and activity A2:CREDITCHECK evaluates the credit history of the client by accessing financial data repositories. Activity A3:CHECKRESOURCE consults the facility database to determine whether existing facilities can be used, and B:ALLOCATECIRCUIT attempts to provide a connection by allocating existing resources, such as allocating lines (C:ALLOCATELINES), allocating slots in switches (A8:ALLOCATESWITCH, A9:ALLOCATESWITCH), and preparing a bill to establish the connection (A10:PREPAREBILL) (see Figure 4(B)). The activity of allocating lines (C:ALLOCATELINES) in turn has a number of subtasks, such as selecting the nearest central offices (A4:SELECTCENTRALOFFICES), allocating existing lines (A5:ALLOCATELINE, A6:ALLOCATELINE), and allocating a span (trunk connection) between the two allocated lines (A7:ALLOCATESPAN) (see Figure 4(C)). If A3:CHECKRESOURCE succeeds, the costs of connection are minimal. The activity A11:INSTALLNEWCIRCUIT is designed to perform an alternative task that involves the physical installation of new facilities in the event of failure of activity A3:CHECKRESOURCE.

The roles involved with these activities are the CreditCheck-GW, the Telecommunication Company, and the Telecomm Contractor. In addition, the Telecommunication Company is detailed into three roles: Telecomm-HQ, T-central 1, and T-central 2. We use the swimlane feature of UML activity diagrams to depict the different roles of the actors involved in performing activity instances.

Figure 4. Telephone service provisioning workflow
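For readers who prefer a compact view of this nesting, the following small Python sketch records the activity hierarchy of Figure 4; it is our own rendering of the example, not part of the ActivityFlow notation.

```python
# Activity hierarchy of the TELECONNECT workflow (cf. Figure 4).
# Keys are composite activities; values are their immediate subactivities.
TELECONNECT_HIERARCHY = {
    "A:TELECONNECT": [
        "A1:CLIENTREGISTER",
        "A2:CREDITCHECK",
        "A3:CHECKRESOURCE",
        "A11:INSTALLNEWCIRCUIT",
        "B:ALLOCATECIRCUIT",
    ],
    "B:ALLOCATECIRCUIT": [
        "C:ALLOCATELINES",
        "A8:ALLOCATESWITCH",
        "A9:ALLOCATESWITCH",
        "A10:PREPAREBILL",
    ],
    "C:ALLOCATELINES": [
        "A4:SELECTCENTRALOFFICES",
        "A5:ALLOCATELINE",
        "A6:ALLOCATELINE",
        "A7:ALLOCATESPAN",
    ],
}


def leaves(name: str) -> list:
    """Return the elementary activities reachable from a given activity."""
    children = TELECONNECT_HIERARCHY.get(name)
    if not children:
        return [name]
    result = []
    for child in children:
        result.extend(leaves(child))
    return result


if __name__ == "__main__":
    print(leaves("A:TELECONNECT"))  # all elementary activities of the process
```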
Advanced Concepts
ActivityFlow provides a number of facilities to support advanced concepts, such as a variety of possibilities for handling errors and exceptions. For example, at the activity specification stage, we allow the workflow designers to specify valid processes and the compensation activities. At run-time, additional possibilities are offered to support recovery from errors or crashes by triggering alternative executions defined in terms of user-defined compensation activities or system-supplied recovery routines.

The time dimension is very important for the deadline control of workflow processes. In ActivityFlow, we provide a construct to allow the workflow designer to specify the maximum allowable execution durations for both the activities (i.e., subactivities or component activities) and the process (i.e., top activity). This time information can be used to compute deadlines for all activities in order to meet an overall deadline for the whole workflow process. When an activity misses its deadline, special actions may be triggered. Furthermore, this time information plays an essential role in decisions about priorities and in monitoring deadlines and generating time errors in case deadlines are missed. It also provides the possibility to delay some activities for a certain amount of time or to a specific date.

The third additional feature is the concept of the workflow administrator (WFA). Modern business organizations build the whole enterprise around their key business processes. It is very important for the success of process-centered organizations that each process has a WFA who is responsible for monitoring the workflow process according to deadlines, handling exceptions and failures,
which cannot be resolved automatically. More specifically, the WFA is able to analyze the current status of a workflow process, make decisions about priorities, stop and resume a workflow process, abort a workflow process, dynamically restructure a workflow process, or change a workflow specification, and so forth. A special workflow client interface is needed which offers functionality to enable such a workflow process administrator to achieve all these goals.
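As a rough illustration of how the maximum allowable durations discussed above could be turned into per-activity deadlines, consider the following hedged Python sketch. The proportional-split policy and the per-activity maxima used in the example are our own assumptions; the chapter does not prescribe a particular deadline-derivation algorithm.

```python
from datetime import datetime, timedelta


def subactivity_deadlines(start, overall_deadline, max_durations):
    """Split an overall deadline across sequential subactivities in
    proportion to their declared maximum allowable durations (illustrative only)."""
    total = sum(max_durations.values())
    budget = overall_deadline - start
    deadlines, t = {}, start
    for name, d in max_durations.items():
        t = t + budget * (d / total)   # cumulative share of the overall budget
        deadlines[name] = t
    return deadlines


if __name__ == "__main__":
    start = datetime(2005, 1, 3, 9, 0)
    overall = start + timedelta(weeks=2)   # TELECONNECT: max allowable time of 2 weeks
    # Hypothetical per-activity maxima in hours; not taken from the chapter.
    maxima = {"A1": 2, "A2": 8, "A3": 24, "B": 72}
    for name, deadline in subactivity_deadlines(start, overall, maxima).items():
        print(name, deadline.isoformat())
```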
ACTIVITYFLOW PROCESS DEFINITION LANGUAGE
Design Principles
Most workflow management systems provide graphical specification of workflow processes. The available design tools typically support iconic representation of activities. The definition of control flows and data flows between activities is accomplished by connecting the activity icons with specialized arrows specifying the activity precedence order and their data dependencies. In addition to graphical specification languages, many WFMSs provide rule-based specification languages (Dayal et al., 1990).

One of the problems with existing workflow specification languages (even those based on graphical node-and-arc programming models) is that they are not well structured: when used without discipline, these languages may result in schemas with a spaghetti of intertwined precedence relationships, which makes debugging, modifying, and reasoning about complex workflow processes difficult (Liu & Meersman, 1996). As recognized by Sheth et al. (1996), there is a need for a more structured way of defining the wide spectrum of activity dependencies. Thus, the first and most important design principle in ActivityFlow is to develop a well-structured approach to the specification of workflow processes by providing a small set of constructs and a collection of mechanisms that allow workflow designers to specify the nested process structure and the variety of activity dependencies declaratively and incrementally.

The second design principle is to support the specification of basic requirements that are not only critical in most workflow applications (Sheth et al., 1996) but also essential for correct coordination among activities in accomplishing a workflow process. These basic requirements include:
• activity structure (control flow) and information exchange between actors (data flows) in a workflow process;
• exception handling, specifying what actions are necessary if an activity fails or a workflow cannot be completed; and
• activity duration, specifying the estimated or designated maximum allowable execution time for both the workflow process (top activity) and its constituent activities. This time information is critical for monitoring deadlines of activities and for providing priority attributes, specifying priorities for activity scheduling.
Main Components of a Workflow Specification
In ActivityFlow, a workflow process is described in terms of a set of activities and the dependencies between them. For presentation convenience in the rest of the chapter, we refer to a workflow process as the top activity and workflow component activities as subactivities. We use activities to refer to both the process and its component activities when no distinction needs to be made. Activities are specified by activity templates or so-called parameterized activity patterns. An activity pattern describes concrete activities occurring in a particular organization, which have similar communication behavior. An execution of the activity pattern is called an instantiation (or an activity instance) of the activity pattern. Informally, an activity pattern consists of objects, messages, message exchange constraints, preconditions, postconditions, and triggering conditions (Liu & Meersman, 1996). Activities can be composed of other activities. The tree organization of an activity pattern α is called the activity hierarchy of α. The set of activity dependencies specified in the pattern α can be seen as the cooperation agreements among activities that collaborate in accomplishing a complex task. The activity at the root of the tree is called the root activity or workflow process; the others are subactivities. An activity's predecessor in the tree is called a parent; a subactivity at the next lower level is called a child. Activities at leaf nodes are elementary activities in the context of the workflow application domain. Nonleaf node activities are composite activities. In ActivityFlow, we allow arbitrary nesting of activities since it is generally not possible to determine a priori the maximum nesting an application task may need. A typical workflow specification consists of the following five units:
• Header: The header of an activity specification describes the signature of the activity, which consists of a name, a set of input and output parameters, and the access type (i.e., Read or Write). Parameters can be objects of any kind, including forms. We use keyword In to describe parameters that are inputs to the activity and Out to describe parameters that are outputs of the activity. Parameters that are used for both input and output are specified using keyword InOut.
• Activity Declaration: The activity declaration unit captures the general information about the activity, such as the synopsis (description) of the task, the maximum allowable execution time, the administrator of the activity (i.e., the user identifier (UID) of the responsible person), and the set of compensation activities that are used for handling errors and exceptions and their triggering conditions.
• Role Association: This unit specifies the set of roles associated with the activity. Each role is defined by a role name, a role type, and a set of actors that are granted membership into the role based on their responsibility in the business process or in the organization. Each actor is described by actor ID and role name. We distinguish two types of roles in the first prototype implementation of ActivityFlow: user and system, denoted as USER and SYS respectively.
• Data Declaration: The data declaration unit consists of the declaration of the classes to which the parameters of the activity belong and the set of messages (or methods) needed to manipulate the actual arguments. Constraints between these messages are also specified in this unit (Liu & Meersman, 1996).
• Procedure: The procedure unit is defined within a begin and end bracket. It describes the composition of the activity, the control flow and data flow of the activity, and the pre- and postcondition of the activity. The main component of the control flow includes activity-execution-dependency specification, describing the execution precedence dependencies between children activities of the specified activity and the interleaving dependencies between a child activity and children of its siblings or between children activities of two different sibling activities. The main component of the data flow specification is defined through the activity state-transition dependencies.
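The following Python sketch shows one hypothetical way to represent these five units as plain data structures, for example inside a design tool. The class and field names are our own illustration; ActivityFlow itself expresses these units in its textual specification language (see Figures 5-8).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Header:
    """Signature: name, In/Out/InOut parameters, and access type (Read or Write)."""
    name: str
    in_params: Dict[str, str] = field(default_factory=dict)
    out_params: Dict[str, str] = field(default_factory=dict)
    inout_params: Dict[str, str] = field(default_factory=dict)
    access_type: str = "Read"


@dataclass
class ActivityDeclaration:
    synopsis: str = ""
    max_allowable_time: str = ""
    administrator_uid: str = ""
    compensation_activities: List[str] = field(default_factory=list)


@dataclass
class RoleAssociation:
    role_name: str = ""
    role_type: str = "USER"  # USER or SYS in the first ActivityFlow prototype
    actors: List[str] = field(default_factory=list)


@dataclass
class DataDeclaration:
    imported_classes: List[str] = field(default_factory=list)


@dataclass
class Procedure:
    execution_dependencies: List[str] = field(default_factory=list)
    interleaving_dependencies: List[str] = field(default_factory=list)
    state_transition_dependencies: List[str] = field(default_factory=list)


@dataclass
class ActivitySpecification:
    """A workflow specification bundles the five units described above."""
    header: Header
    declaration: ActivityDeclaration
    roles: List[RoleAssociation]
    data: DataDeclaration
    procedure: Procedure
```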
Dynamic Assignments of Actors
The assignment of actors (humans or information systems) to activities according to the role specification is a fundamental concept in WFMSs. At runtime, flexible and dynamic assignment resolution techniques are necessary to react adequately to the resource allocation needs and organizational changes. ActivityFlow uses the following techniques to fulfill this requirement:
• When the set of actors is empty, the assignment of actors can be any users or systems that belong to the roles associated with the specified activity.
• When the set of actors is not empty, only those actors listed in the associated actor set can have the privilege to execute the activity.
• The assignment of actors can also be done dynamically at run-time. The activity-enactment service engine will grant the assignment if the run-time assignment meets the role specification.
• The assignment of actors can be the administrator of the workflow process to which the activity belongs, as the workflow administrator is a default role for all its constituent activities.
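A minimal Python sketch of how these resolution rules might be applied is shown below; the function and parameter names are hypothetical, and the chapter does not define such an API.

```python
def eligible_actors(listed_actors, role_members, process_admin, runtime_request=None):
    """Illustrative resolution of who may execute an activity.

    listed_actors   : actors explicitly listed in the activity's role association (may be empty)
    role_members    : all actors belonging to the roles associated with the activity
    process_admin   : administrator of the enclosing workflow process (a default role)
    runtime_request : an actor proposed dynamically at run-time, if any
    """
    allowed = set(listed_actors) if listed_actors else set(role_members)
    allowed.add(process_admin)  # the workflow administrator may always execute the activity
    if runtime_request is not None:
        # A run-time assignment is granted only if it meets the role specification.
        return {runtime_request} if runtime_request in allowed else set()
    return allowed


if __name__ == "__main__":
    members = {"engineer-1", "engineer-2"}
    print(eligible_actors(set(), members, "admin-0"))                # any role member, plus the administrator
    print(eligible_actors({"engineer-2"}, members, "admin-0"))       # only the listed actor, plus the administrator
    print(eligible_actors(set(), members, "admin-0", "outsider-9"))  # empty set: run-time request rejected
```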
The role-based assignment of actors provides great flexibility and breadth of application. By statically and dynamically establishing and defining roles and assigning actors to activities in terms of roles, workflow administrators can control access at a level of abstraction that is natural to the way that enterprises typically conduct business.
Control Flow Specification: Activity Dependencies
In ActivityFlow, a number of facilities are provided to promote the use of a declarative and incremental approach to the specification of activities and their dependencies. For example, to make the specification of activity execution dependencies easier and more user-friendly for the activity model designers, we classify activity dependencies into three categories: activity execution dependencies, activity interleaving dependencies, and activity state transition dependencies. We also regulate the specification scope of the set of activity dependencies associated with each activity pattern to encourage incremental specifications of hierarchically complex activities. For instance, to define an activity pattern T, we require the workflow designer to specify only the activity execution dependencies between activities that are children of a T activity, and restrict the activity interleaving dependencies specified in T to be only the interaction dependencies between (immediate) subactivities of different child activities of T or between a T's child activity and (immediate) subactivities of its siblings. As a result, the workflow designers may specify the workflow process and the activities declaratively and incrementally, allowing reasoning about correctness and security of complex workflow activities independently from their underlying implementation mechanisms.

In addition, we provide four constructs to model various dependencies between activities. They are precede, enable, disable, and compatible. The semantics of each construct are formally described in Table 1. The construct precede is designed to capture the temporal precedence dependencies and the existence dependencies between two activities. For example, "A precede B" specifies a begin-on-commit execution dependency between the two activities: "B cannot begin before A commits". The constructs enable and disable are utilized to specify the enabling and disabling dependencies between activities. One of the critical differences between the construct enable or disable and the construct precede is that enable or disable specifies a triggering condition and an action being triggered, whereas precede only specifies an execution precedence dependency as a precondition that needs to be verified before an action can be activated; it is not an enabling condition that, once satisfied, triggers the action. The construct compatible declares the compatibility of activities A1 and A2. It is provided solely for specification convenience since two activities are compatible when there is no execution precedence dependency between them.
Table 1. Constructs for activity dependency specification

Construct: precede
  Usage: A1 precede A2; condition(A1) precede A2; condition(A1) precede condition(A2)
  Synopsis: A2 can begin if A1 commits. A2 can begin if condition(A1) = 'true' holds. If condition(A1) = 'true' then condition(A2) can be 'true'.

Construct: enable
  Usage: condition(A1) enable A2; condition(A1) enable condition(A2)
  Synopsis: condition(A1) = 'true' → begin(A2). If condition(A1) = 'true' then condition(A2) will be 'true'.

Construct: disable
  Usage: condition(A1) disable A2; condition(A1) disable condition(A2)
  Synopsis: condition(A1) = 'true' → abort(A2). If condition(A1) = 'true' then condition(A2) cannot be 'true'.

Construct: compatible
  Usage: compatible(A1, A2)
  Synopsis: 'true' if A1 and A2 can be executed in parallel, 'false' if the order of A1 and A2 is important.
Recall the telephone service provisioning workflow example given in The Running Example section. After the service request has been entered in the client and service order databases, the activity A3:CHECKRESOURCE tries to determine which facilities can be used to establish the service. If A3:CHECKRESOURCE commits, it means that the client's request can be met. If the service cannot be allocated using existing lines and spans, but the installation of new circuit elements is viable, a human field engineer is selected to execute the activity A11:INSTALLNEWCIRCUIT, which may involve manual changes to some switch and the installation of a new telephone line. We have adopted the approach of Eder and Liebhart (1995) and model only expected exceptions in ActivityFlow diagrams. The cooperation dependencies among A3:CHECKRESOURCE, B:ALLOCATECIRCUIT, and A11:INSTALLNEWCIRCUIT can be specified as follows:
1. A3 ∧ ¬circuitAllocated precede A11. ("circuitAllocated = false" is a precondition for a human engineer to execute A11 after A3 commits.)
2. (A3 ∧ circuitAllocated) ∨ A11 enable B. (If A3 commits and returns the value true in its circuitAllocated output parameter, or a human field engineer succeeds in installing the needed line/span, then B is triggered.)
The first dependency states that the commit of A3:CHECKRESOURCE and the false value of the circuitAllocated output parameter are preconditions for A11:INSTALLNEWCIRCUIT. The second dependency amounts to saying that if A3:CHECKRESOURCE succeeds in finding existing facilities that satisfy the request (circuitAllocated = true), or A11:INSTALLNEWCIRCUIT has installed the needed new facility, then B:ALLOCATECIRCUIT is triggered. The reason that we use the construct precede, rather than enable, for specifying the first dependency is that A11:INSTALLNEWCIRCUIT involves some manual work and thus must be executed by a human field engineer. ActivityFlow also allows the users to specify conditional execution dependencies to support activities triggered by external events (e.g., Occurs(E1) enable A1).
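To illustrate the difference between precede (a precondition that must be verified) and enable (a trigger) on this example, here is a small, hypothetical Python sketch that evaluates the two dependencies. It is an illustration only, not the ActivityFlow engine's evaluation mechanism.

```python
def teleconnect_dependencies(committed, circuit_allocated):
    """Evaluate the two example dependencies of the TELECONNECT process.

    committed         : set of activities that have committed, e.g. {"A3", "A11"}
    circuit_allocated : value of A3's circuitAllocated output parameter
    Returns which rules currently hold.
    """
    # 1. A3 ∧ ¬circuitAllocated precede A11:
    #    A11 may begin (its precondition is satisfied) once A3 commits without
    #    having allocated a circuit; nothing is triggered automatically.
    a11_may_begin = "A3" in committed and not circuit_allocated

    # 2. (A3 ∧ circuitAllocated) ∨ A11 enable B:
    #    B is triggered as soon as either branch succeeds.
    b_triggered = ("A3" in committed and circuit_allocated) or "A11" in committed

    return {"A11 may begin": a11_may_begin, "B triggered": b_triggered}


if __name__ == "__main__":
    print(teleconnect_dependencies({"A3"}, circuit_allocated=False))
    print(teleconnect_dependencies({"A3"}, circuit_allocated=True))
    print(teleconnect_dependencies({"A3", "A11"}, circuit_allocated=False))
```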
TEAM LinG
16 Liu, Pu and Ruiz
Activity Specification: An Example
To illustrate the use of the ActivityFlow workflow specification language in describing activities of a nested structure, we recast the telephone service provisioning workflow given in The Running Example section. Figure 4 shows the hierarchical organization of the workflow process TELECONNECT. The top activity A:TELECONNECT (see Figure 4(A)) is defined as a composite activity, consisting of the following five activities: A1: CLIENTREGISTER, A2: CREDITCHECK, A3: CHECKRESOURCE, B: ALLOCATECIRCUIT, and A11: INSTALLNEWCIRCUIT. The activity B: ALLOCATECIRCUIT (see Figure 4(B)) is again a composite activity, composed of four subactivities: C: ALLOCATELINES, A8: ALLOCATESWITCH, A9: ALLOCATESWITCH, and A10: PREPAREBILL. The activity C: ALLOCATELINES (see Figure 4(C)) is also a composite activity with four subactivities: A4: SELECTCENTRALOFFICES, A5: ALLOCATELINE, A6: ALLOCATELINE, and A7: ALLOCATESPAN. Based on the structure of a workflow process definition discussed in the Main Components of a Workflow Specification section, we provide an example specification for the telephone service provisioning workflow (top activity) in Figure 5, the composite activities B: ALLOCATECIRCUIT in Figure 6 and C: ALLOCATELINES in Figure 7, and the elementary activity A11: INSTALLNEWCIRCUIT in Figure 8. The corresponding activity hierarchy of TELECONNECT is presented in Figure 9 as a tree.
A Formal Model for Flow Procedure Definition
In this section, we provide a graph-based model to formally describe the procedure unit of a workflow specification in ActivityFlow. This graph-based flow procedure model provides a formal foundation for the ActivityFlow graphical user interface, which allows end users to model office procedures in a workflow process using iconic representations. In ActivityFlow, we describe an activity procedure in terms of (1) a set of nodes, representing individual activities or connectors between these activities (e.g., split and join connectors described in the ActivityFlow Meta Model section) and (2) a set of edges, representing signals among the nodes. Each node in the activity flow procedure is annotated with a trigger. A trigger defines the condition required to fire the node on receiving signals from other nodes. The trigger condition is defined using the four constructs described in the Control Flow Specification: Activity Dependencies section. Each flow procedure has exactly one begin node and one end node. When the begin node is fired, an activity flow instance is created. When the end node is triggered, the activity flow instance terminates.
Definition 1 (activity flow graph) An activity flow graph is described by a binary tuple <N, E>, where

• N is a finite set of activity nodes and connector nodes, N = AN ∪ CN ∪ {bn, en}, where AN = {nd1, nd2, ..., ndn} is a set of activity nodes, CN = {cn1, cn2, ..., cnn} is a set of connector nodes, bn denotes the begin node, and en denotes the end node. Each node ni ∈ N (i = 1, ..., n) is described by a quadruple (NN, TC, NS, NT), where NN denotes the node name, TC is the trigger condition of the node, NS is one of the two states of the node (fired or not fired), and NT is the node type:
  • If ni ∈ AN, then NT ∈ {simple, compound, iteration}.
  • If ni ∈ CN, then NT ∈ {AND-split, OR-split, AND-join, OR-join}.
• E = {e1, e2, ..., em} is a set of edges. Each edge is of the form ndi → ndj. An edge eij: ndi → ndj is described by a quadruple (EN, DPnd, AVnd, ES), where EN is the edge name,
Figure 5. Example specification of the top activity TELECONNECT

Activity TELECONNECT(In: ClientId:CLIENT, Start:POINT, End:POINT, Out: CircuitId:CIRCUIT)
  Access Type: Write
  Synopsis: Telephone service provisioning
  Max Allowable Time: 2 weeks
  Administrator: UID: 0.0.0.337123545
  Exception Handler: none
  Role Association:
    Role name: Telecommunication Company
    Role type: System
  Data Declaration: import class CLIENT, import class POINT, import class CIRCUIT;
begin
  Behavioral Aggregation of component Activities:
    A1: CLIENTREGISTER (In: ClientId:CLIENT, Start:POINT, End:POINT)
    A2: CREDITCHECK (In: ClientId:CLIENT, Start:POINT, End:POINT, Out: creditStatus:Boolean)
    A3: CHECKRESOURCE (In: ClientId:CLIENT, Start:POINT, End:POINT, Out: circuitAllocated:Boolean)
    A11: INSTALLNEWCIRCUIT (In: ClientId:CLIENT, Start:POINT, End:POINT, Out: CircuitId:CIRCUIT)
    B: ALLOCATECIRCUIT (In: ClientId:CLIENT, Start:POINT, End:POINT, Out: CircuitId:CIRCUIT)
  Execution Dependencies:
    ExeR1: A1 precede {A2, A3}
    ExeR2: A3 ∧ ¬circuitAllocated precede A11
    ExeR3: (A3 ∧ circuitAllocated) ∨ A11 enable B
  Interleaving Dependencies:
    ILR1: A2 ∧ creditStatus precede A10
  State Transition Dependencies:
    STR1: abort(B) enable abort(self)
end Activity
Figure 6. Example specification of the composite activity ALLOCATECIRCUIT

Activity ALLOCATECIRCUIT(In: ClientId:CLIENT, Start:POINT, End:POINT, Out: CircuitId:CIRCUIT)
  Access Type: Write
  Synopsis: Circuit allocation
  Max Allowable Time: 3 days
  Administrator: UID: 0.0.0.337123545
  Exception Handler: none
  Role Association:
    Role name: Telecommunication Company
    Role type: System
  Data Declaration: import class CLIENT, import class POINT, import class CIRCUIT, import class LINE, import class SPAN;
begin
  Behavioral Aggregation of component Activities:
    C: ALLOCATELINES (In: Start:POINT, End:POINT, Out: CircuitId:CIRCUIT)
    A8: ALLOCATESWITCH (In: Line1:LINE, Out: Span:SPAN)
    A9: ALLOCATESWITCH (In: Line2:LINE, Out: Span:SPAN)
    A10: PREPAREBILL (In: ClientId:CLIENT, Line1:LINE, Line2:LINE, Span:SPAN, Out: CircuitId:CIRCUIT)
  Execution Dependencies:
    ExeR4: C ∧ A8 ∧ A9 precede A10
  Interleaving Dependencies:
    ILR2: A5 ∧ A7 precede A8
    ILR3: A6 ∧ A7 precede A9
  State Transition Dependencies:
    STR2: abort(C) ∨ abort(A8) ∨ abort(A9) enable abort(self)
end Activity
Figure 7. Example specification of the composite activity ALLOCATELINES

Activity ALLOCATELINES(In: Start:POINT, End:POINT, Out: CircuitId:CIRCUIT)
  Access Type: Write
  Synopsis: Line allocation
  Max Allowable Time: 1 day
  Administrator: UID: 0.0.0.337123545
  Exception Handler: none
  Role Association:
    Role name: Telecommunication Company
    Role type: System
  Data Declaration: import class POINT, import class LINE, import class SPAN, import class CentralOff;
begin
  Behavioral Aggregation of component Activities:
    A4: SELECTCENTRALOFFICES (In: Start:POINT, End:POINT, Out: Off1:CentralOff, Off2:CentralOff)
    A5: ALLOCATELINE (In: Start:POINT, Off1:CentralOff, Out: Line1:LINE)
    A6: ALLOCATELINE (In: End:POINT, Off2:CentralOff, Out: Line2:LINE)
    A7: ALLOCATESPAN (In: Off1:CentralOff, Off2:CentralOff, Out: Span:SPAN)
  Execution Dependencies:
    ExeR5: A4 precede {A5, A6, A7}
  State Transition Dependencies:
    STR3: abort(A4) ∨ abort(A5) ∨ abort(A6) ∨ abort(A7) enable abort(self)
end Activity
Figure 8. Example specification of the elementary activity INSTALLNEWCIRCUIT

Activity INSTALLNEWCIRCUIT(In: ClientId:CLIENT, Start:POINT, End:POINT, Out: CircuitId:CIRCUIT)
  Access Type: Write
  Synopsis: New line/span installation
  Max Allowable Time: 1 week
  Administrator: UID: 0.0.0.337123545
  Exception Handler: none
  Role Association:
    Role name: Telecomm Contractor
    Role type: User
end Activity
Figure 9. Activity hierarchy of A:TELECONNECT
DPnd is the departure node, AVnd is the arrival node, and ES is one of the two states of the edge: signaled or not signaled. We call eij an outgoing edge of node ndi and an incoming edge of node ndj. For each node ndi, there is a path from the begin node bn to ndi. We say that a node ndj is reachable from another node ndi if there is a path from ndi to ndj.

Definition 2 (reachability) Let G = <N, E> be an activity flow graph. For any two nodes ndi, ndj ∈ N, ndj is reachable from ndi, denoted by ndi *→ ndj, if and only if one of the following conditions is verified: (1) ndi = ndj; (2) ndi → ndj ∈ E; or (3) ∃ ndk ∈ N, ndk ≠ ndi and ndk ≠ ndj, such that ndi *→ ndk and ndk *→ ndj.

A node ndj is said to be directly reachable from a node ndi if condition (2) of Definition 2 is satisfied. To guarantee that the graph G = <N, E> is acyclic, the following restrictions are placed: (1) ∀ ndi, ndj ∈ N, if ndi → ndj ∈ E then ndj → ndi ∉ E; (2) ∀ ndi, ndj ∈ N with ndi ≠ ndj, if ndi *→ ndj then ndj *→ ndi does not hold.
To illustrate the definition, let us recast the telephone-service-provisioning workflow procedure depicted in Figure 4(A) and described in Figure 5 in terms of the above definition as follows:

• N = {(Begin, NeedService, not fired, simple), (A1, NeedService, not fired, simple), (A2, commit(A1), not fired, simple), (OS1, commit(A2), not fired, OR-split), (A3, creditStatus = true, not fired, simple), (OS2, commit(A3), not fired, OR-split), (A11, circuitAllocated = false, not fired, simple), (OJ2, circuitAllocated = true ∨ commit(A11), not fired, OR-join), (B, terminate(OJ2), not fired, compound), (OJ1, creditStatus = false ∨ commit(B), not fired, OR-join), (End, terminate(OJ1), not fired, simple)}
• E = {Begin → A1, A1 → A2, A2 → OS1, OS1 → A3, OS1 → OJ1, A3 → OS2, OS2 → OJ2, OS2 → A11, A11 → OJ2, OJ2 → B, B → OJ1, OJ1 → End}
Note that NeedService is a Boolean variable from the ActivityFlow runtime environment. When a new telephone service request arrives, NeedService is true. Figure 4(A) shows the use of the UML-based ActivityFlow graphical notations to specify this activity flow procedure. When a node is clicked, the node information is displayed as a quadruple, including the node type, name, trigger, and current state. When an edge is clicked, the edge information, such as the edge name, its departure and arrival nodes, and its current state, is displayed. From Figure 4(A), it is obvious that activity node B is reachable from nodes A1, A2, A3, and A11.

An activity flow graph G is instantiated by an instantiation request issued by an actor. The instantiation request provides the initial values of the data items (actual arguments) required by the parameter list of the flow. An activity flow instantiation is valid if the actor who issued the request satisfies the defined role specification.

Definition 3 (valid flow instantiation request) Let G = <N, E> be an activity flow graph and u = (actor_oid, role_name) be an actor requesting the activity flow instantiation T of G. The flow instantiation T is valid if and only if ∃ ρ ∈ Role(G) such that role_name(u) = ρ.

When the actor who initiates a flow instantiation request is not authorized, the instantiation request is rejected, and the flow instantiation is not created. When a flow instantiation request is valid, a flow instantiation, say T, is created by firing the begin node of T.
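As an illustration of Definitions 1 and 2, the following Python sketch encodes the edge set E listed above for TELECONNECT and computes the reachability relation *→ with a simple graph search. Triggers and node/edge states are omitted, and the data structures are our own simplification rather than part of ActivityFlow.

# Minimal sketch of an activity flow graph (Definitions 1 and 2) for the
# TELECONNECT example. Only the edge relation is kept so that
# reachability (*->) can be computed.

from collections import defaultdict

EDGES = [("Begin", "A1"), ("A1", "A2"), ("A2", "OS1"), ("OS1", "A3"),
         ("OS1", "OJ1"), ("A3", "OS2"), ("OS2", "OJ2"), ("OS2", "A11"),
         ("A11", "OJ2"), ("OJ2", "B"), ("B", "OJ1"), ("OJ1", "End")]

succ = defaultdict(set)
for src, dst in EDGES:
    succ[src].add(dst)

def reachable(start: str, target: str) -> bool:
    """True if target is reachable from start (start *-> target)."""
    if start == target:                 # condition (1) of Definition 2
        return True
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in succ[node]:          # conditions (2) and (3): follow edges
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

if __name__ == "__main__":
    # B is reachable from A1, A2, A3, and A11, as noted in the text.
    print(all(reachable(n, "B") for n in ["A1", "A2", "A3", "A11"]))  # True
    print(reachable("B", "A1"))  # False: the graph is acyclic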
Definition 4 (activity flow instantiation) Let G = <N, E> be an activity flow graph and T denote a valid flow instantiation of G. T is created by assigning a flow instance identifier and carrying out the following steps to fire the begin node bn(T):

• set the state of node bn(T) to fired;
• set all the outgoing edges of bn(T) to signaled;
• perform a node instantiation for each node that is directly reachable from the begin node bn(T).

A node can be instantiated, or triggered, when all the incoming edges of the node are signaled and its trigger condition is evaluated to be true. When a node is triggered, a unique activity instance identifier is assigned, and the node state is set to fired. In ActivityFlow, all the nodes are initialized to not fired, and all the edges are initialized to not signaled.

Definition 5 (node instantiation) Let G = <N, E> be an activity flow graph and T denote a valid flow instantiation of G. A node ndk ∈ N can be instantiated if, ∀ ndi ∈ N such that ndi ≠ ndk and ndk is directly reachable from ndi, we have

• ndi is in the state fired,
• the instance identifier of T is identified, and
• the trigger of ndk can be evaluated.

A node ndk is instantiated by performing the following steps:

• updates to data items are applied in all the nodes ndi from which ndk is directly reachable;
• all the incoming edges of ndk are set to signaled;
• ndk is fired if (1) its trigger condition is evaluated to be true and (2) it is currently not fired, or it is an iteration activity node and its iteration condition is evaluated to be true.
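The following sketch illustrates one possible reading of Definitions 4 and 5: firing the begin node signals its outgoing edges, and a node is fired once all of its incoming edges are signaled and its trigger evaluates to true. It is a simplification of ours (trigger conditions become Python callables, and OR-join nodes, which should fire on a single signaled edge, are not treated specially), not the ActivityFlow engine itself.

# Illustrative node-instantiation sketch following Definitions 4 and 5.
# OR-join nodes would relax the "all incoming edges signaled" condition;
# that refinement is omitted here for brevity.

class FlowInstance:
    def __init__(self, edges, triggers, context):
        self.edges = edges             # list of (src, dst) pairs
        self.triggers = triggers       # node -> callable(context) -> bool
        self.context = context         # shared data items
        self.signaled = set()          # edges in the state "signaled"
        self.fired = set()             # nodes in the state "fired"

    def incoming(self, node):
        return [e for e in self.edges if e[1] == node]

    def outgoing(self, node):
        return [e for e in self.edges if e[0] == node]

    def start(self, begin="Begin"):
        self._fire(begin)              # Definition 4: fire the begin node

    def _fire(self, node):
        self.fired.add(node)
        for e in self.outgoing(node):
            self.signaled.add(e)
            self._try_instantiate(e[1])   # nodes directly reachable from node

    def _try_instantiate(self, node):
        # Definition 5: all incoming edges signaled and trigger true.
        if node in self.fired:
            return
        if all(e in self.signaled for e in self.incoming(node)):
            if self.triggers.get(node, lambda ctx: True)(self.context):
                self._fire(node)

if __name__ == "__main__":
    edges = [("Begin", "A1"), ("A1", "A2")]
    fi = FlowInstance(edges, {"A1": lambda ctx: ctx["NeedService"]}, {"NeedService": True})
    fi.start()
    print(fi.fired)   # {'Begin', 'A1', 'A2'}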
In ActivityFlow, we use the term conditional rollback to refer to situations that require revisiting nodes that were previously terminated or not fired. Conditional rollbacks are a desirable functionality and are encountered frequently in some business processes. We provide the UML activity iterator symbol (a “*” inside a compound activity-node construct) for the realization of conditional rollbacks. The use of iterating activities has a number of interesting features. First, by defining an activity (a composite activity) with the iterator symbol, we identify the nodes that can be, or are allowed to be, revisited by the subsequent
activities in the same subflow instance. Second, when using iteration rather than explicit backward edges, the conditional rollback may be considered a continuation of the workflow instance execution. We believe that the use of iteration provides a much cleaner graphical notation to model cyclic activity workflows. To reduce the complexity and facilitate the management of conditional rollbacks, the only restriction we place on the conditional rollback is the following: a call to roll back to an activity node ndk can only be accepted if it comes from subactivity nodes or sibling activity nodes of ndk.

Figure 10 shows an example which recasts the composite activity C:ALLOCATELINES, discussed in The Running Example section, by allowing a conditional rollback of some line allocation activities (A5:ALLOCATELINE, A6:ALLOCATELINE, and A7:ALLOCATESPAN). It permits the execution of a set of C:ALLOCATELINES instances and evaluates which instance is more profitable; the others are rolled back. We model this requirement using iteration (see Figure 10). By clicking the iteration-type activity node, the information about its subflow is displayed. The rollback condition is also displayed. In this case, it says that if C:ALLOCATELINES is successful, then the AND-split type connection node following C:ALLOCATELINES is fired. Then, activities A8:ALLOCATESWITCH and A9:ALLOCATESWITCH are fired. Otherwise, a new C:ALLOCATELINES instance is fired until the required profit level is reached.

Figure 10. An example using iterator connectors

Definition 6 (termination property) An activity flow instance terminates if its end node is triggered. A flow instance is said to satisfy the termination property if the end node will eventually be fired. The termination property guarantees that the flow procedure instantiation will not “hang”.

Definition 7 (precedence preserving property) Let G = <N, E> be an activity flow graph. An activity flow instance of G is said to satisfy the precedence preserving property if the node firing sequence is compatible with the partial order defined by the activity precedence dependencies in G.
In ActivityFlow, these latter two properties are considered correctness properties, among others, for concurrent executions of activities. For a detailed discussion on the preservation of the correctness properties of workflow activities, see Liu and Pu (1998b).
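As a small illustration, the precedence preserving property of Definition 7 can be checked against a recorded node firing sequence as sketched below; the dependency pairs and example sequences are taken from the TELECONNECT specification in Figure 5, while the function itself is our own illustrative code, not part of ActivityFlow.

# Sketch: checking the precedence preserving property (Definition 7) for
# a recorded firing sequence against "precede" dependency pairs.

def preserves_precedence(firing_sequence, precede_pairs):
    """True if, for every (a, b) with 'a precede b', a does not fire after b."""
    position = {act: i for i, act in enumerate(firing_sequence)}
    for before, after in precede_pairs:
        if before in position and after in position:
            if position[before] > position[after]:
                return False
    return True

if __name__ == "__main__":
    # ExeR1 and ExeR2 from Figure 5, expressed as (before, after) pairs.
    deps = [("A1", "A2"), ("A1", "A3"), ("A3", "A11")]
    print(preserves_precedence(["A1", "A2", "A3", "A11", "B"], deps))  # True
    print(preserves_precedence(["A2", "A1", "A3"], deps))              # False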
DYNAMIC WORKFLOW RESTRUCTURING OF ACTIVITYFLOW MODELS
To maintain competitiveness in a business-oriented world, enterprises must offer high-quality products and services. A key factor in successful quality control is to ensure the quality of all corporate business processes, which includes clearly defined routing among activities, association among business functions (e.g., programs) and automated activities, and execution dependency constraints and deadline control, at both the activity level and the whole workflow process level. Beyond these workflow characteristics, most workflow applications are expected to have 100% uptime (24 hours per day, 7 days per week). Production workflow (Leymann & Roller, 2000) is a class of workflow that presents such characteristics, and its workflow processes have a high business value for the organizations that run them. The enterprise commitment to a deadline for each of its workflow process executions thus becomes one of the design and operation objectives for workflow management systems. However, deadline control of workflow instances has led to a growing problem that conventional workflow management systems do not address, namely, how to reorganize existing workflow activities in order to meet deadlines in the presence of unexpected delays. In addition, because business-process instances are long-lived, workflow designs must deal with schema evolution and with the proper handling of ongoing instances. These problems are known collectively as the workflow-restructuring problem.

This section describes the notation and issues of workflow restructuring and discusses how a set of workflow activity restructuring operators can be employed to tackle the workflow-restructuring problem in ActivityFlow modeling. We restrict our discussion to the context of handling unexpected delays. A deeper study of this context can be found in Ruiz, Liu, and Pu (2002).
Basic Notions
Activity restructuring operators are used to reorganize the hierarchical structure of activity patterns while keeping their activity dependencies valid. Two types of activity restructuring operators are proposed by Liu and Pu (1998a): Activity-Split and Activity-Join. Activity-Split operators allow releasing committed resources that were updated earlier, enabling adaptive recovery and added concurrency (Liu & Pu, 1998b). Activity-Join operators, the inverse of activity-split, combine results from subactivities together and release them
atomically. The restructuring operators can be applied to both simple and composite activity patterns and can be combined in any formation. Zhou, Pu, and Liu (1998) present a practical method to implement these restructuring operators in the context of the Transaction Activity Model, or TAM (Liu & Pu, 1998b).

In TAM, activities are specified in terms of activity patterns. An activity pattern describes the communication protocol of a group of cooperating objects in accomplishing a task (Liu & Meersman, 1996). We distinguish two types of activities: simple activity patterns and composite activity patterns. A simple activity pattern is a program that issues a stream of messages to access the underlying database (Liu & Pu, 1998b). A composite activity pattern consists of a tree of composite or simple activity patterns and a set of user-defined activity dependencies: (a) activity execution and interleaving dependencies and (b) activity state-transition dependencies. The activity at the root of the tree is called the root activity; the others are called subactivities. An activity's predecessor in the tree is called its parent; a subactivity at the next lower level is called a child. The activity hierarchy is the hierarchical organization of activities (see Figure 4 in The Running Example section for an example).

A TAM activity has a set of observable states S and a set of possible state transitions ϕ: S → S, where S = {begin, commit, abort, done, compensate} (Liu & Pu, 1998a) (see Figure 11). When an activity A is activated, it enters the state begin and becomes active. The state of A changes from begin to commit if A commits and to abort if A or its parent aborts. If A's root activity commits, then its state becomes done. When A is a composite activity, A enters the commit state if all its component activities legally terminate, that is, commit or abort. If an activity aborts, then all its children that are in the begin state are aborted; its committed children, however, are compensated for. We call this property the termination-sensitive dependency (Liu & Pu, 1998b) between an activity AC and its parent activity AP, denoted by AP ~> AC. This termination-sensitive dependency, inherent in an activity hierarchy, prohibits a child activity instance from having more than one parent, ensuring the hierarchically nested structure of active activities. When the abort of all active subactivities of an activity is completed, the compensation for committed subactivities is performed
by executing the corresponding compensations in an order that is the reverse of the original order.

Figure 11. TAM activity state transition graph

Definition 8 (TAM activity) Let α denote an activity pattern and Σ denote a set of activity patterns. Let AD(α) denote the set of activity dependencies specified in α, children(α) denote the set of child activity patterns of α, and Pattern(A) denote the activity pattern of activity A. An activity A is said to be a TAM activity if and only if it satisfies the following conditions:

• ∃ α ∈ Σ such that Pattern(A) = α;
• ∀ P ∈ AD(α), P(A) = true;
• ∀ C ∈ children(A), A ~> C holds and C is a TAM activity.
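The sketch below encodes the TAM activity states and the transitions described in the text as a small guard-checked state machine. The transition set follows the prose above; where Figure 11 may contain further arcs, the set is an assumption, and the class is illustrative rather than a reproduction of TAM.

# Sketch of the TAM activity states (begin, commit, abort, done,
# compensate). The allowed-transition set follows the textual
# description; arcs not mentioned in the text are omitted.

ALLOWED = {
    ("begin", "commit"),       # the activity commits
    ("begin", "abort"),        # the activity or its parent aborts
    ("commit", "done"),        # the root activity commits
    ("commit", "compensate"),  # a committed child is compensated after its parent aborts
}

class TamActivity:
    def __init__(self, name):
        self.name = name
        self.state = "begin"    # an activated activity enters the state begin

    def transition(self, new_state):
        if (self.state, new_state) not in ALLOWED:
            raise ValueError(f"{self.name}: illegal transition {self.state} -> {new_state}")
        self.state = new_state

if __name__ == "__main__":
    a = TamActivity("A2")
    a.transition("commit")
    a.transition("done")
    print(a.name, a.state)      # A2 done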
Another property of an activity hierarchy is the visibility of objects between activities. The visibility of an activity refers to its ability to see the results of other activities while it is executing. A child activity AC has access to all objects that its parent activity AP can access; that is, it can read objects that AP has modified (Liu & Pu, 1998b). TAM uses multiple object version schemes (Nodine & Zdonik, 1990) to support the notion of visibility in the presence of concurrent execution of activities. The root activity at the top of the activity hierarchy contains the most stable version of each object and guarantees that its copies of objects can be recovered in the event of a system failure.
Workflow Restructuring Operators
There are three types of activity-split operators: serial activity-split (s-Split), parallel activity-split (p-Split), and unnesting activity-split (u-Split).
• The s-Split operator splits an activity into two or more activities that can be performed and committed sequentially. It establishes a linear execution dependency among the resulting activities, which is captured by using the precede construct.
• The p-Split operator splits an activity into two or more activities that can be submitted and committed independently of each other. The only dependency established between them is the compatibility among all split activities, which can be represented by the compatible construct.
• The u-Split operator splits an activity C by unnesting the activity hierarchy anchored at C. u-Split operators are effective only on composite activity patterns.
A series of specializations are introduced for activity-split: for s-Split (serial activity-split), the sa-Split (serial-alternative activity-split); and for p-Split (parallel activity-split), the pa-Split (parallel-alternative activity-split), cc-Split (commit-on-commit activity-split), and ca-Split (commit-on-abort activity-split). These specializations tackle situations where it is necessary to synchronize concurrent split activities and where certain activities can be performed only if another aborts. An activity-split operation is said to be valid if and only if the resulting activities (1) satisfy the implicit dependencies implied in the activity composition hierarchy, such as the termination-sensitive dependency (that is, they are TAM activities); (2) semantically preserve all existing activity dependencies after the split; and (3) do not introduce any conflicting activity dependencies (Liu & Pu, 1998a). Similarly, Activity-Join has two specialized versions: join-by-group (g-Join) and join-by-merge (m-Join).
• The g-Join operator groups two or more activities by creating a new activity as their parent activity, while preserving the activity composition hierarchy of each input activity. A g-Join is considered legal if the input activities are all sibling activities or independently ongoing activities, that is, activities that do not have a common parent activity.
• The m-Join operator physically merges two or more activities into a single activity. An m-Join is considered legal if, for each pair of input activities (C1, C2), C1 and C2 are sibling activities, one is a parent activity of the other, or they are independently ongoing activities.
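For illustration, the following sketch shows the effect of the u-Split operator on an activity hierarchy represented as nested dictionaries (our simplification; TAM activity patterns also carry dependencies, which a real u-Split must re-derive and validate). Applying it to B:ALLOCATECIRCUIT reproduces the unnested hierarchy of Figure 13(a).

# Illustrative u-Split on an activity hierarchy: the target composite
# activity is removed and its children are promoted to its parent.
# The target is assumed to be a non-root composite activity.

def u_split(hierarchy: dict, target: str) -> dict:
    """Return a copy of the hierarchy in which `target` is unnested."""
    def rebuild(name, children):
        new_children = {}
        for child, grandchildren in children.items():
            if child == target:
                # promote the target's children to the current level
                new_children.update(rebuild_all(grandchildren))
            else:
                new_children[child] = rebuild(child, grandchildren)[child]
        return {name: new_children}

    def rebuild_all(children):
        out = {}
        for child, grandchildren in children.items():
            out.update(rebuild(child, grandchildren))
        return out

    root, children = next(iter(hierarchy.items()))
    return rebuild(root, children)

if __name__ == "__main__":
    teleconnect = {"A": {"A1": {}, "A2": {}, "A3": {},
                         "B": {"C": {"A4": {}, "A5": {}, "A6": {}, "A7": {}},
                               "A8": {}, "A9": {}, "A10": {}},
                         "A11": {}}}
    # Unnest B: its subactivities C, A8, A9, and A10 become children of A.
    print(u_split(teleconnect, "B"))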
Restructuring Possibilities on TELECONNECT Workflow
Most workflow designs take into account the organizational structure, the computational infrastructure, the collection of applications provided by the corporate enterprise, and the cost involved. Such designs are based on the assumptions that the organizational structure is an efficient way to organize business processes (workflows) and that the computational infrastructure has the optimal configuration within the enterprise. However, such assumptions may not hold when unexpected delays happen and when such delays severely hinder the progress of ongoing workflow executions. Typical delays in the execution of business workflows are due to various types of failures or disturbances in the computational infrastructure, including instabilities in network bandwidth and the replacement of computing infrastructure by lower-powered machines when coping with server failures. Such disturbances can be transient or perennial, unexpected or intentional, and can affect a significant number of processes.

Figure 12 shows the typical implementation architecture of the Telecomm computational infrastructure, which is used in our experimental study. Each telecommunications central, T-central, has a computer server to support its activities and to manage its controlled lines and switches. At the Telecomm headquarters, Telecomm-HQ, a computer server supports all the management
activities and controls the information with respect to communication among its branches (spans), centralizes the billing, and so forth. The credit-check gateway, CreditCheck-GW, handles the dialog between Telecomm and credit operators and banks to check the current financial situation of clients. Figure 12 also describes the typical computational capacity of the computing systems as well as the network connection speeds assumed in the experiments reported in the Experimental Results section. We have adopted TPC-W (TPC-W subcommittee, 2001) to characterize the power of the computing systems because we assume all Telecomm information systems are Web-based e-commerce applications.

Figure 12. A typical computing environment for the Telecomm Company

Recall the telephone service provisioning workflow introduced in The Running Example section, and assume that this workflow process was designed to match the organizational structure with respect to its administrative structure and corresponding responsibilities. From the activity hierarchy shown in Figure 4, the activity A:TELECONNECT includes two composite activities: B:ALLOCATECIRCUIT and C:ALLOCATELINES. The execution dependencies of these compound activities (A, B, and C) are given in Figure 5, Figure 6, and Figure 7. We can conclude that A2:CREDITCHECK must be completed before B:ALLOCATECIRCUIT because A10:PREPAREBILL depends on A2:CREDITCHECK, and A10:PREPAREBILL is a subactivity of B:ALLOCATECIRCUIT. By combining the hierarchical structure of those composite activities and their corresponding execution dependencies, we present the workflow design, without compound activities, in Figure 14.

In the presence of delays, restructuring operators can be applied to rearrange the activity hierarchy anchored by A:TELECONNECT. The goal is to add concurrency during the execution of its instances. Such added concurrency means the earlier release of committed resources to allow access by other concurrent activities (Liu & Pu, 1998a). The TAM operators that can increase concurrency among TAM activities are p-Split and u-Split (see the Workflow Restructuring Operators section). For simplicity, we discuss only the use of the u-Split operator because it does not demand prior knowledge of the internal structure and behavior of the target activity.
Figure 13. Activity hierarchy of A:TELECONNECT with unnesting of composite activities: (a) unnesting B:ALLOCATECIRCUIT; (b) unnesting C:ALLOCATELINES

Figure 14. Plane graphical representation of the flow procedure of activity TELECONNECT
By applying u-Split on A:TELECONNECT (recall the initial activity hierarchy of A:TELECONNECT presented in Figure 9), it is possible to unnest its compound activities B:ALLOCATECIRCUIT (see Figure 13(a)) or C:ALLOCATELINES (see Figure 13(b)). Then, two different restructured workflows with added concurrency are obtained: unnesting C:ALLOCATELINES (Figure 15) and unnesting B:ALLOCATECIRCUIT (Figure 16). When compared with the initial workflow design shown in Figure 4, unnesting C:ALLOCATELINES permits the start of activity A8:ALLOCATESWITCH or A9:ALLOCATESWITCH in case of a delay in the execution of A6:ALLOCATELINE or A5:ALLOCATELINE, respectively. Similarly, unnesting B:ALLOCATECIRCUIT allows the start of the composite activity C:ALLOCATELINES before the credit check activity A2:CREDITCHECK commits. We have chosen to control instances of activity A2:CREDITCHECK to decide if B:ALLOCATECIRCUIT needs restructuring.
Figure 15. TELECONNECT workflow design after u-Split of C
Figure 16. TELECONNECT workflow design after u-Split of B
In addition, A6:ALLOCATELINE is the activity chosen to be controlled when examining the need for C:ALLOCATELINES restructuring, because both A5 and A6 show the same behavior in the workflow model. The results of restructuring C by controlling A6 are similar to those obtained by controlling A5.
Simulation Environment
To study the effectiveness of activity restructuring operators, we built a simulator using CSIM18 (Mesquite Software, 1994) that executes workflow models. These models consist of simple and composite workflow activities, as in TAM (Zhou et al., 1998). The typical computing environment depicted in Figure 12 is used to quantify the disturbance effects and to tune the simulator. In the Experimental Observations and Discussion section, we discuss further experiments with a range of parameter settings that expand and support the results outlined here. Here we briefly describe the simulator, focusing on the aspects that are pertinent to our experiments.
To simulate the TELECONNECT workflow activity, we assume 60 seconds as the upper limit on the average elapsed time of one instance execution. In other words, a TELECONNECT workflow instance must complete within the 60-second upper limit when the installation of new facilities (execution of activity A11:INSTALLNEWCIRCUIT) is not required. We represent the elapsed time of activity instances using the uniform statistical distribution, since these activities involve a combination of computer and human activities of unpredictable duration within a known range of reasonable values. Table 2 shows the type of computer system on which each activity executes in the simulation and the minimum and maximum elapsed-time values used.

Table 2. Parameter values for the uniform statistical function

Activity(ies)           Computer System    Min.       Max.
A1, A3, A4, A7, A10     Telecomm-HQ        3.2 sec.   5.2 sec.
A5, A6, A8, A9          T-central          6.4 sec.   10.4 sec.
A2                      CreditCheck-GW     2.0 sec.   22.0 sec.

For the sake of simplicity, we assume only three different time intervals for the elapsed time of activity instances. Each time interval corresponds to one computing system type. Activity instances executing at Telecomm-HQ (A1, A3, A4, A7, and A10) show elapsed times between 3.2 seconds and 5.2 seconds; 6.4 seconds to 10.4 seconds is the elapsed-time interval for activity instances executed on any T-central system (A5, A6, A8, and A9); and A2 instances present elapsed times between 2.0 seconds and 22.0 seconds when executing on the CreditCheck-GW system. We adopt these time intervals because (1) Telecomm-HQ is the most powerful system in the computing environment and hosts the workflow management system (WfMS); (2) relative to Telecomm-HQ, the elapsed time of each activity instance executed at a T-central or at CreditCheck-GW also includes the time to transfer data and commands over the network connections; and (3) CreditCheck-GW represents a computing system beyond the responsibility of the Telecomm Company technical staff and with a quite variable response time.

We adopted the 90th-percentile principle from TPC-C (TPC-C subcommittee, 2001) to define the 60-second limit. TPC-C defines the 90th percentile as the upper limit for response time when benchmarking complex OLTP application environments. Thus, 90% of the activity instances executed at Telecomm-HQ must show an elapsed time not greater than 5.0 seconds. Analogously, 10.0 seconds and 20.0 seconds correspond to T-central and CreditCheck-GW activity instances, respectively. The exponential statistical distribution, describing jobs that arrive independently, has been used to define the time interval between the starts of workflow instances. With these values, the simulation environment has been calibrated to execute 165 workflow instances
in parallel (see the detailed justification in the Experimental Results section). Such fine-tuning has been obtained by using 0.2 second as the input for the exponential statistical function.

We assess the effectiveness of workflow restructuring by comparing the execution of the TELECONNECT workflow with and without restructuring of its composite activities B:ALLOCATECIRCUIT and C:ALLOCATELINES. As defined in the Restructuring Possibilities on TELECONNECT Workflow section, the subactivities A2:CREDITCHECK and A6:ALLOCATELINE are the activities chosen to be controlled; namely, the latency of these activities will be increased in the presence of a disturbance. The population of the set of workflow instances is then executed for each variation of the TELECONNECT workflow. To simulate a controlled activity instance facing a disturbance, the elapsed time obtained from the uniform statistical function is increased by one of the following conveniently chosen values. For the activity A2:CREDITCHECK, the delay values are 20%, 36%, 50%, 80%, and 100%. For the activity A6:ALLOCATELINE, the delay values are 10%, 20%, 40%, 50%, and 100%.

These values were chosen considering two types of disturbances: (1) those caused by computing systems with lower computational power and (2) those caused by delays on network connections. The first type corresponds to the effect of replacing a computer by a lower-powered spare system. For this type, we adopted three different elapsed-time increases, 20%, 50%, and 100%, for both controlled activities. We consider that a typical delay of 20% on the average elapsed time represents a computing system with a performance similar to the original one; the last two represent significantly slower computing systems. We adopt these three computing-system disturbances as being typical of real situations; different elapsed-time increases could be assumed to perform the simulation experiments. Such assumptions are reasonable because the typical elapsed time of an activity execution is the sum of its CPU time, I/O time, and network-transfer time. Hence, a loss in performance is expected when replacing a computer system by one of lower power. However, such a loss hardly matches the same degree of reduction in computational power; at the very least, the network connection keeps the same transfer speed.

The second type corresponds to disturbances caused by network delays. In this case, we adopt different elapsed-time increases for each controlled activity because the network connection speeds are rather different (10 Mbits/sec between Telecomm-HQ and the T-centrals and 1 Mbit/sec between Telecomm-HQ and CreditCheck-GW). However, the network speeds adopted depict just typical transfer rates found in the real world; different network speeds could be assumed to perform the simulation experiments. For the controlled activity A6, we assume a slight delay of 10% caused by the network speed falling to 1 Mbit/sec and an average delay of 40% caused by a 256 kbits/sec network speed. In the same way, for the controlled activity A2, we assume an average delay of 36% caused by a
256 kbits/sec network connection and a high delay of 80% due to a network connection of 56 kbits/sec. As a result, we have a wide range of delays that permits experimenting with workflow restructuring in different situations.

In this section, the simulator is used primarily to demonstrate the properties of the restructuring operation rather than to carry out a detailed analysis of the algorithm for executing restructuring operators. As such, the specific settings used in the simulator are less important than the way in which a delay is either hidden or not hidden by the restructuring process. In the experiments, the various delays were generated by simply applying a uniform probabilistic function. Consider the values in Table 2 as an example. The typical elapsed time for each activity instance is the average of the Min and Max values. It is more realistic to represent the elapsed time of activity instances as random values within a time interval because two instances of the same activity can perform a different number of I/O operations, demand a different amount of data from or to the network connection, and execute different sets of CPU operations.

Taking into account all typical elapsed times for activity instances, the expected elapsed time for a TELECONNECT workflow instance is 45.6 seconds (without A11:INSTALLNEWCIRCUIT execution). By applying u-Split of B:ALLOCATECIRCUIT, this elapsed time becomes 33.6 seconds. When an activity instance of A2:CREDITCHECK suffers the effects of a disturbance, the elapsed time of a TELECONNECT workflow instance grows linearly, while the elapsed time of the restructured version remains the same up to a 110% delay. Only when the delay exceeds 110% does the elapsed time of the restructured TELECONNECT workflow also start to grow linearly. However, these results do not take into consideration the effects of the disturbances on the other computational components. As demonstrated in the Experimental Results section, such disturbances overload the environment and deteriorate the performance of its components, and the average elapsed time of TELECONNECT workflow instances therefore behaves differently.
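The expected elapsed times quoted above (45.6 seconds and 33.6 seconds) can be reproduced with a back-of-the-envelope longest-path computation over the activity precedence relations, using the mid-range values of Table 2 as typical durations. The sketch below is ours, not the CSIM18 simulator, and the two precedence tables encode our reading of the original and u-Split(B) designs (A11 is excluded, as in the text).

# Longest-path estimate of the TELECONNECT elapsed time using the
# typical (mid-range) durations of Table 2.

TYPICAL = {"A1": 4.2, "A2": 12.0, "A3": 4.2, "A4": 4.2, "A5": 8.4,
           "A6": 8.4, "A7": 4.2, "A8": 8.4, "A9": 8.4, "A10": 4.2}

def longest_path(predecessors):
    finish = {}
    def f(act):
        if act not in finish:
            finish[act] = TYPICAL[act] + max((f(p) for p in predecessors[act]), default=0.0)
        return finish[act]
    return max(f(a) for a in predecessors)

# Original (nested) design: A1 -> A2 -> A3 sequentially, then C's
# subactivities, A8/A9 wait for all of C, then A10.
nested = {"A1": [], "A2": ["A1"], "A3": ["A2"],
          "A4": ["A3"], "A5": ["A4"], "A6": ["A4"], "A7": ["A4"],
          "A8": ["A5", "A6", "A7"], "A9": ["A5", "A6", "A7"],
          "A10": ["A8", "A9"]}

# After u-Split of B (our reading of Figure 16): A2 runs in parallel with
# the allocation branch and is required only before A10 (ILR1).
usplit_b = {"A1": [], "A2": ["A1"], "A3": ["A1"],
            "A4": ["A3"], "A5": ["A4"], "A6": ["A4"], "A7": ["A4"],
            "A8": ["A5", "A6", "A7"], "A9": ["A5", "A6", "A7"],
            "A10": ["A2", "A8", "A9"]}

if __name__ == "__main__":
    print(round(longest_path(nested), 1))    # 45.6
    print(round(longest_path(usplit_b), 1))  # 33.6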
EXPERIMENTAL RESULTS
The goal of our experimental study is to show the benefits and costs of dynamic activity restructuring. Concretely, the experiments are set to maximize parallel execution of ongoing workflow instances (WI) by reorganizing the hierarchical structure of the root activity. Our experiments examine and compare the workflow execution with and without restructuring in the following two situations: (1) a temporarily nonoptimal run-time environment (Experiment 1) and (2) an unexpected malfunction in some infrastructure component (Experiment 2). The types of disturbances considered are those discussed in the Simulation Environment section.
To properly evaluate the effectiveness of restructuring workflow instances, a simulation of TeleConnect workflows without restructuring is performed in an environment without disturbances. The goal of this simulation is to determine the population of ongoing WI executed in parallel that presents the highest average elapsed time still satisfying the 60-second company goal. The resulting population becomes the reference for understanding the effectiveness of workflow restructuring in a run-time environment with disturbances. To validate this population, sets of TeleConnect WI with different populations (from 1 to 300 cases in parallel) are executed considering the environment specified in the Simulation Environment section. Figure 17 plots the simulation results for each set of WI.

In Figure 17, the x-axis shows the population of each set of WI; in other words, it shows how many WI are executed in parallel and concurrently use the limited resources of the computing environment. The y-axis presents the corresponding average elapsed time of a set of WI. The line shows the results for the TeleConnect workflow. As expected, a higher number of WI executed in parallel raises their average elapsed time. The special point marked in Figure 17, (165, 59.8), shows the desired population: 165 is the number of workflow instances, executing in parallel, that presents the highest average elapsed time while satisfying the 60-second upper limit (see the Simulation Environment section).

Figure 17. Results on simulating TeleConnect WI without delays and disturbances

We adopted only one type of graph to present the experimental results in Experiments 1 and 2. All graphs plot the average elapsed time of 165 WI executed in an environment where instances of a controlled activity (A2 or A6) face delays. The dashed line depicts results for WI without restructuring, and the continuous line plots results considering a restructuring criterion. The x-axis shows the percent delays the controlled activity faces, and the y-axis shows the average elapsed time of the WI. In the graphs related to controlled activity
A2 (Figure 18 and Figure 20), the asterisks in the continuous line correspond to the average WI elapsed time at 20%, 36%, 50%, 80%, and 100% delays. Similarly, the asterisks in the graphs related to controlled activity A6 (Figure 19 and Figure 21) correspond to 10%, 20%, 40%, 50%, and 100% delays. The uniformity of these graphs permits direct comparison of the simulation results for the different restructuring criteria (at start and at the 25% checkpoint) and for the control of different activities (A2 and A6).

Experiment 1: Temporarily nonoptimal environment

The goal of this experiment is to comprehend the advantages and limitations of workflow restructuring when the run-time environment presents one of the disturbances. The restructuring of activities takes place before the start of each WI. We compare cases of running TeleConnect workflows with and without restructuring for each disturbance listed. Each case has 165 WI in the set. The question that must be answered by this experiment is: which disturbances in the computing environment can be properly managed if workflow restructuring takes place at the start of the controlled activities? To answer this question, it is necessary to check the average WI elapsed time of the WI set for each percent value of delay in executing controlled activity instances. A particular disturbance can be properly managed by workflow restructuring if the resulting average WI elapsed time is less than or equal to 60 seconds. Figure 18 shows results for the controlled activity A2, and Figure 19 shows results for A6, the other controlled activity.

The dashed line in Figure 18 plots the average WI elapsed time of the TeleConnect workflow without restructuring for the different delays that instances of A2 face. The average WI elapsed time grows linearly as A2 delays increase. For example, a 50% average A2 delay corresponds to 74.9 seconds for the average WI elapsed time; similarly, 100% corresponds to 91.6 seconds. Since 50% and 100% average A2 delays mean about 18 seconds and 24 seconds, respectively, for the average elapsed time of A2 instances, an increase of 6 seconds in the A2 average elapsed time implies an increase of 16.7 seconds in the average WI elapsed time. In other words, for each additional second of delay for A2 instances, 2.78 extra seconds result for the average WI elapsed time, a growth factor of 2.78. This result shows the overload caused by disturbances on the computing environment. This dashed line is also present in Figure 20, with exactly the same results and meaning; it is the reference for comparing results from different restructuring criteria.

The continuous line in Figure 18 plots the average WI elapsed times with B restructuring at start. In this simulation, all 165 WI are restructured before the start of their A2 instances. The average WI elapsed time grows as A2 delays increase, but its growth factor also increases. For example, in the segment 0% to 20% (average A2 elapsed times of 12 seconds and 14.4 seconds, respectively),
the average WI elapsed time grows from 49.5 seconds to 49.6 seconds. Hence, the growth factor is 0.04 (the average WI elapsed time grows 0.04 seconds for each second of delay in the average elapsed time of A2 instances). But in the segment 50% to 100% (18 seconds and 24 seconds, respectively), the average WI elapsed time grows from 53.5 seconds to 67.4 seconds, and the growth factor is 2.3. The transition between the two segments presents a growth factor of 1.08. Growth factors of less than 1.0 mean that the computing environment still has capacity to execute more WI; growth factors greater than 1.0, on the other hand, indicate an overloaded environment. Moreover, the point where the line reaches 60 seconds for the average WI elapsed time is 73%. Hence, disturbances that cause delays on A2 instances of up to 73% are properly managed if B workflow restructuring takes place at start.

Similarly to Figure 18, the dashed line in Figure 19 plots the average WI elapsed time of TeleConnect WI without restructuring for delays in A6 instances. The average WI elapsed time grows as A6 delays increase. For delays over 20%, the growth factor is virtually constant. In fact, the delays 20%, 40%, 50%, and 100% (average A6 elapsed times of 10.1 seconds, 11.8 seconds, 12.6 seconds, and 16.8 seconds, respectively) correspond to 64.5 seconds, 71.1 seconds, 74.4 seconds, and 91.3 seconds, and the growth factor increases from 3.9 to 4.0. For the first two segments, (0%, 59.8 seconds) to (10%, 61.4 seconds) and (10%, 61.4 seconds) to (20%, 64.5 seconds), the growth factors are 1.9 and 3.7, respectively. These results confirm the assumption that delays on activity instances overload the computing environment, as observed in the dashed line of Figure 18. This dashed line is also used as the reference for comparing results from different restructuring criteria in Figure 21, with exactly the same results and meaning.

The continuous line in Figure 19 plots the average WI elapsed times with C restructuring at start. All 165 WI are restructured before the start of their A6 instances.

Figure 18. B restructuring at start
Figure 19. C restructuring at start
This line shows virtually the same shape depicted by the dashed line, with y-values about 0.7 seconds lower. For example, 63.9 seconds corresponds to a 20% delay, and 90.5 seconds corresponds to a 100% delay. This means a very narrow gain from C restructuring, and only delays of no more than 4% are properly managed with C restructuring at start.

Experiment 2: Unexpected malfunction of an infrastructure component

The goal of this experiment is to examine the pros and cons of dynamic workflow restructuring when the run-time environment presents some disturbance and the restructuring of activities takes place during the execution of WI. The disturbances are detected at a 25% checkpoint. We compare cases of running TeleConnect workflows with and without restructuring for each disturbance defined in the Simulation Environment section. Each case has 165 WI in the set. The choice of 25% as the checkpoint to verify whether a particular WI should be restructured represents the ability of the WfMS to monitor the computational environment and to react early when it detects disturbances.

Considering the Min and Max values defined in Table 2 for each activity, the expected elapsed time (E-ET) for an A2 instance is 12 seconds and, for an A6 instance, 8.4 seconds. When an ongoing instance of A2 (or A6) has been running for 25% of its E-ET, the simulator estimates what its real elapsed time will be. If the estimated elapsed time exceeds the E-ET, then workflow restructuring takes place. For a simulation controlling A2 instances, the simulator estimates the elapsed time of an ongoing A2 instance 3 seconds after it starts. If the estimated value is greater than 12 seconds, then the corresponding WI is restructured by u-Split of B. Similarly, if A6 instances are being controlled, the checkpoint occurs 2.1 seconds into an ongoing A6 instance execution, and the restructuring takes place if the estimated elapsed-time value exceeds 8.4 seconds. Consequently, the simulator restructures an ongoing WI only if it estimates that the elapsed time of the controlled activity instance is greater than its E-ET. Moreover, each set of simulated WI is likely to contain both restructured and non-restructured instances.

Hence, the question that must be answered by this experiment is: which disturbances in the computing environment can be properly managed if dynamic workflow restructuring is checked at the 25% checkpoint of the controlled activities? To answer this question, it is necessary to check the average WI elapsed time of the WI set, with and without restructuring, for each delay value on the controlled activity instances. A particular disturbance can be properly managed if the resulting average WI elapsed time is less than or equal to 60 seconds. Figure 20 shows results for the controlled activity A2, and Figure 21 shows results for A6. As stated in Experiment 1, the dashed lines in Figures 20 and 21 are exactly the same as those depicted in Figures 18 and 19, respectively.

The continuous line in Figure 20 plots the average WI elapsed time where B restructuring affects only WI with delayed A2 instances. It shows a different
behavior when compared with the previous restructuring criterion, restructuring at start, depicted in Figure 18. In fact, the growth factor of this line starts near 0.0 and begins to grow after a 20% average delay for A2 instances. In the graph, the asterisks plot the following points: (0%, 56.6 seconds), (20%, 56.7 seconds), (36%, 57.5 seconds), (50%, 59.0 seconds), (80%, 64.8 seconds), and (100%, 69.8 seconds). The percent values for the average delays of A2 instances correspond, respectively, to 12.0 seconds, 14.4 seconds, 16.3 seconds, 18.0 seconds, 21.6 seconds, and 24.0 seconds. Then, the growth factors for each segment are 0.04, 0.4, 0.9, 1.6, and 2.1. The growth factor close to 0.0 in the first segment (between 0% and 20%) means that workflow restructuring can properly manage delays on A2 instances of up to 20% without increasing the load on the computing environment and, consequently, without perturbing other running applications. On the other hand, only disturbances that cause delays of up to 54% are properly managed by B workflow restructuring with the 25% checkpoint. Compared with the results of Experiment 1, B workflow restructuring at the 25% checkpoint supports lower delays on A2 instances with respect to the 60-second company goal.

Figure 21 plots the average WI elapsed time for different delays affecting A6 instances. The continuous line shows results where C restructuring takes place on WI with delayed A6 instances. For delays over 20%, the behavior of this line is the same as that presented by the continuous line in Figure 19. The slight difference is at the start of the line: at 0%, the average WI elapsed time is 59.7 seconds, while the same point on the dashed line is 59.8 seconds; at 10%, these times are, respectively, 61.1 seconds and 61.4 seconds. This means a lower gain from C restructuring than that depicted in Figure 19, and only delays under 3% are properly managed with C restructuring at the 25% checkpoint.
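The checkpoint rule used in this experiment can be summarized by the small sketch below (our reading of the text, not the simulator's code): the checkpoint occurs at 25% of the expected elapsed time (E-ET), and the workflow instance is restructured only if the elapsed time estimated at that moment exceeds the E-ET.

# Sketch of the 25%-checkpoint restructuring decision.

EXPECTED = {"A2": 12.0, "A6": 8.4}   # E-ET values: mid-range of Table 2

def checkpoint_time(activity: str) -> float:
    """Wall-clock offset of the checkpoint: 25% of the expected elapsed time."""
    return 0.25 * EXPECTED[activity]

def needs_restructuring(activity: str, estimated_elapsed_time: float) -> bool:
    """True if the estimate made at the checkpoint exceeds the E-ET."""
    return estimated_elapsed_time > EXPECTED[activity]

if __name__ == "__main__":
    print(checkpoint_time("A2"))            # 3.0 seconds after A2 starts
    print(checkpoint_time("A6"))            # 2.1 seconds after A6 starts
    print(needs_restructuring("A2", 14.4))  # True: a 20% delay triggers u-Split of B
    print(needs_restructuring("A2", 11.0))  # False: the WI is left unchanged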
Experimental Observations and Discussion
Figure 20. B restructuring at 25%

Figure 21. C restructuring at 25%

The experiments presented in the Experimental Results section show the
effectiveness of the u-Split operator in restructuring workflow instances facing disturbances in the operational environment, under certain conditions. Experiments 1 and 2 demonstrate that B restructuring of workflow instances (WI), performed before the start or, at the latest, at 25% of the expected elapsed time of the controlled activity, permits achieving the 60-second upper limit for disturbances that cause delays of up to 73% or 54%, respectively. However, restructuring workflow instances on activity A6 achieves the 60-second upper limit only for disturbances of up to 3% to 4% elapsed-time delay. Hence, only one of the two restructuring possibilities presented in the Restructuring Possibilities on TELECONNECT Workflow section is effective in satisfying the company goal when delays happen, and only part of the disturbances are properly managed.

To better evaluate the effectiveness of workflow restructuring in the experiments presented in the Experimental Results section, the experimental results are consolidated in Figure 22 and Figure 23 for the controlled activities A2 and A6, respectively. The idea is to put the results of the workflow restructuring experiments in one graph. All figures consider 165 as the population of simultaneous WI under execution. Figure 22 shows the gains from restructuring A2 instances when occasionally facing delays. The x-axis depicts the percent values of the average delays for A2 instances when facing disturbances. The y-axis depicts the gain from restructuring A2 as the difference between the average elapsed time of the set of workflow instances without restructuring and the average elapsed time of the same set with restructuring. The continuous line plots results for B restructuring executed before the start of A2 instances (see Experiment 1). The dotted-dashed line plots results for B restructuring at the 25% checkpoint (see Experiment 2). In a similar way, Figure 23 shows the gains from restructuring A6 instances: the x-axis depicts the percent values of the average delays for A6 instances when facing disturbances, and the y-axis depicts the same difference as the Figure 22 y-axis.

Figure 22. Gains on B restructuring

Figure 23. Gains on C restructuring

Figure 22 shows that the gains resulting from restructuring B increase
independently of the moment the restructuring takes place. Moreover, restructuring is more effective if it takes place earlier. The same result can be observed in Figure 23, but the values are too small to matter. Figure 22 also shows that the difference between the average WI elapsed times grows faster for dynamic B restructuring at the 25% checkpoint when considering lower delays (up to 40%). Table 3 shows the values used to plot the lines in Figure 22 and permits a better view of the behavior of its curves. Considering the Max and Min values in Table 2, the percent values defined in the Simulation Environment section correspond to 14.4 seconds (20%), 16.3 seconds (36%), 18.0 seconds (50%), 21.6 seconds (80%), and 24.0 seconds (100%); for 0%, the related average elapsed time is 12.0 seconds. For example, at a 20% delay, the y-value on the dotted-dashed line (restructuring at the 25% checkpoint) is 11.2 seconds. On the same curve, 4.4 seconds corresponds to a 0% A2 delay. Hence, for 2.4 seconds of delay increase, B restructuring at the 25% checkpoint grows 6.8 seconds in y-value. In other words, for each second of A2 delay, B restructuring at the 25% checkpoint increases the difference between the average WI elapsed time without restructuring and with restructuring by 2.8 seconds; 2.8 is thus the growth factor in this segment. In fact, B restructuring at the 25% checkpoint presents the highest growth factor in the interval 0% to 50%. Consequently, this restructuring criterion is the most effective for dynamically managing delays during the execution of WI, because it permits achieving the company goal for significant delays on the controlled activity A2 (up to 54%) and does not imply a significant increase of the load on the computing environment (only affected WI are restructured).

Table 3. Differences (in seconds) between average WI elapsed times for WI with and without B restructuring

Average delay of A2 (%)   Restructuring at start   Restructuring at 25% checkpoint
0%                        10.3                     3.2
20%                       15.5                     8.4
36%                       19.4                     12.9
50%                       21.4                     15.9
80%                       23.5                     20.1
100%                      24.2                     21.7

It is not possible to predict the exact amount of elapsed time saved after restructuring a workflow instance. The delay caused by a disturbance depends on the configuration and current workload of the computing infrastructure, and these factors change constantly. However, it is possible to characterize scenarios where the chances of saving elapsed time are high. Each scenario corresponds to a possible workflow model based on the corresponding hierarchy of activity patterns. For the Telecomm Company, the TELECONNECT workflow is the base scenario (Figure 4), and the workflow model variants
obtained by restructuring composite activities are the others (Figures 15 and 16). The analysis of the repercussions caused by a disturbance on the base and variant scenarios permits the evaluation of the possible benefits of restructuring the base workflow model. According to the simulation experiments, the restructuring of AllocateCircuit can be regarded as highly beneficial, whereas the restructuring of AllocateLines cannot. In fact, the restructuring of AllocateLines would not result in an effective elapsed-time saving for activity instances. Consequently, the modeling of a business process by a hierarchy of activity patterns must present restructuring opportunities. Moreover, these restructuring opportunities must enable a considerable amount of elapsed time to be saved when activities are running in a disturbed environment. The restructuring process should take place only when the restructured workflow instance saves a considerable amount of time, because restructuring itself takes time to perform.

For a business process, such scenario analysis suggests that restructuring is beneficial for workflow instances facing some types of delays. Moreover, it also suggests that restructuring can be beneficial even in situations without delays. Although B workflow restructuring at start is not the most effective criterion for dynamically managing disturbances, this criterion presents the highest difference between average WI elapsed times: 10.3 seconds for A2 without delays (0% in Figure 22). Hence, workflow restructuring can be a way to improve the performance of workflow instances when their workflow models do not exploit all possible concurrency among activities. Because workflow models reflect, among other aspects, the organizational structure, the restructuring approach presented in this chapter enables the optimization of a company's business processes without necessarily reengineering the enterprise.
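As a compact summary of the growth factor discussed above, it is simply the ratio between the increase in restructuring gain and the increase in A2 delay over a segment of the curve; using the values quoted above for the 25% checkpoint curve between 0% and 20% of A2 delay:

\[
\text{growth factor} = \frac{\Delta \text{gain}}{\Delta \text{delay}}
= \frac{11.2\,\text{s} - 4.4\,\text{s}}{14.4\,\text{s} - 12.0\,\text{s}}
= \frac{6.8\,\text{s}}{2.4\,\text{s}} \approx 2.8
\]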
IMPLEMENTATION CONSIDERATIONS
We first briefly describe the implementation efforts concerning the ActivityFlow specification language. Then we give an overview of our first prototype system, CWf-Flex, which utilizes the ActivityFlow implementation architecture to provide an environment for specifying and executing workflows.
ActivityFlow Implementation Architecture
We use HTML (HyperText Markup Language) to represent information objects required for workflow processes and to integrate different media types into a document. We access information from multiple remote information sources available on the Internet by using the uniform addressing of information via URLs (Uniform Resource Locators) and the transmission of data via HTTP
(HyperText Transfer Protocol). HTML fill-in forms are the main interaction medium between users and the server. Figure 24 shows the implementation architecture of ActivityFlow. The HTTP server translates user requests submitted through HTML forms into calls to the corresponding procedures of the ActivityFlow prototype system using a CGI or Java interface. The prototype implementation consists of three main components:
1. The workflow actor interface toolkit: It includes a Web-based workflow process definition tool, an administration and monitoring tool, and a workflow-application client-interface program.
• Workflow process definition tool
We provide two interfaces for process definition: one is a script-based language (recall the Activity Specification: An Example section), and the other is a graphical interface that uses graph-based concepts, such as nodes and edges between nodes (recall the A Formal Model for Flow Procedure section). When a script language is used, the process definition tool compiles and loads the workflow schema into the workflow database. We also provide a facility to map the script-based specification to an iconic representation that can be displayed using a Web browser. When a graphical interface is used to define the workflow procedure unit, a form is also provided to capture the information required in the units, such as the header, activity declaration, role association, and data declaration. A script-based specification can also be generated on request.
• Administration and monitoring tool
This module contains functions for creating, updating, and assigning users, roles, and actors, and for the inspection and modification of running processes according to deadlines and priorities, including terminating undesired flow instances and restructuring ongoing activities. The interactions with the users are supported primarily by creating and receiving HTML pages and forms.
• Workflow client interface
This module provides a number of convenient services for workflow clients, such as viewing process state information, viewing the worklist of an ongoing process, and linking to other relevant information (i.e., process description, process history, process deadlines, and so forth). It interacts with the users via HTML pages and forms. We are currently exploring the possibility of using or adapting production software, such as Caprera from Tactica (see http://www.tactica.com/), for managing and maintaining the activity dependency specifications.
Figure 24. Implementation architecture of ActivityFlow
2. The workflow activity engine: It provides basic workflow enactment services, such as creating, accessing, and updating workflow activity descriptions, providing the correctness guarantee for concurrent execution of activities and application-specific coordination control, and using the deadlines and priorities for scheduling and rescheduling activity flows. We have done some initial study on the correctness properties of concurrent activity executions, such as compatibility and mergeability, using user-defined activity dependencies (Liu & Pu, 1998b). We are exploring possibilities to build value-added adapters on top of existing online transaction processing (OLTP) monitors, for example, using some recent results in open implementation (Barga & Pu, 1995) of extended transaction models (Elmagarmid, 1992) and the micro-protocols (Zhou, Pu & Liu, 1996) built on top of Transarc Encina.
3. The distributed object manager: It provides consistent access to information objects from multiple and possibly remote information sources. The main services include the resource manager, the transaction router, and the run-time supervisor. This component is built on top of the DIOM prototype system, an adaptive query mediation system for querying heterogeneous information sources (Lee, 1996).
To adapt our implementation architecture to the open system environment, we allow a flexible configuration of the actor interface, the workflow engine
base, and the distributed object manager. For example, one scenario is to let the actor interface, the engine base, and the distributed object manager exist on different servers. Another scenario is to let the actor interface toolkit exist on one server, and the activity engine base and the distributed object manager exist on the same server, different from the actor interface server. We may also take the scenario in which the actor interface toolkit and the activity engine base exist on the same server, and the distributed object manager exists on a different server.
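As a simple illustration of the HTML-form-based interaction described in this subsection, the fragment below sketches a fill-in form that a workflow actor might submit to the HTTP server. It is purely illustrative: the CGI endpoint, field names, and the restructuring command are hypothetical and are not taken from the actual prototype.

<!-- Hypothetical fill-in form; the action URL and field names are illustrative only -->
<form action="/cgi-bin/activityflow/restructure" method="post">
  <p>Workflow instance: <input type="text" name="instance_id" /></p>
  <p>Controlled activity: <input type="text" name="activity" value="A2" /></p>
  <p>Restructuring operator:
    <select name="operator">
      <option value="u-split">u-Split</option>
    </select>
  </p>
  <p><input type="submit" value="Restructure" /></p>
</form>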
The Prototype System CWf-Flex
We have developed the first prototype system at PUCRS (Ruiz, Copstein & Oliveira, 2003). This prototype, called CWf-Flex, aims to implement an ActivityFlow environment to describe and execute workflow processes, based primarily on the use of open-source tools. The prototype uses a simplified ActivityFlow model, CWf (Bastos & Ruiz, 2001), to represent the structural and functional enterprise elements involved in business processes, such as enterprise activities, human resources, machines, and so forth. The prototype development uses the Java 2 Platform, Enterprise Edition (J2EE), plus JavaServer Pages (JSP) and appropriate Java-based development frameworks. A typical implementation uses Apache Tomcat as the Web server, PostgreSQL for the database management functionality, and Linux as the underlying operating system. Workflow monitoring and management functions are implemented using Java Management eXtensions (JMX) technology. Besides the use of open-source tools, the advantage of adopting this implementation architecture is that the resulting environment is being built as a component-based multitier enterprise
application. This design promotes the extensibility of the system and permits the further modular incorporation of new features as the need arises. CWf models are described in XML using XPDL, proposed by the WfMC (2002), plus XML-Schema-based extensions. There are other XML proposals to describe workflow processes. In particular, Thatte (2003) presents BPEL4WS, a notation for specifying business process behavior based on Web services through the description of business collaborations. In the context of the CWf-Flex prototype, XPDL has been adopted as the base to design ActivityFlow/CWf models because XPDL is extensible; that is, it offers elements to express additional characteristics and represents a minimal set of constructs for describing workflow models. These languages typically represent the flow of the processes by node and arc specifications (e.g., XPDL transition elements and the corresponding BPEL4WS constructs).

In ActivityFlow models, activity dependencies (see the Control Flow Specification: Activity Dependencies section) can usefully be specified in the XML-Schema-based models. Figure 25 presents the XSD definition for activity dependencies, which is used as an extended attribute of ActivityFlow workflow processes in XPDL.

Figure 25. XSD description for activity dependencies

Figure 26 presents examples of how to use such dependencies for the composite activities ALLOCATECIRCUIT and ALLOCATELINES (see Figures 6 and 7, Activity Specification: An Example section). In the CWf-Flex prototype, the restructuring of workflows is a functionality implemented by the workflow-schema evolution module.

Figure 26. Examples of using dependencies on activities ALLOCATECIRCUIT and ALLOCATELINES (the examples use the namespace xmlns:tam="http://www.inf.pucrs.br/~duncan/ddrTAMSchema.xsd")
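To give a flavor of how such a dependency can travel with an XPDL process definition, the fragment below is a minimal, purely illustrative sketch (standard XPDL namespace declarations are omitted). Only the tam namespace URI is taken from the figure above; the element and attribute names used inside that namespace (tam:Dependency, from, to, type) and the name of the extended attribute are our own assumptions and do not reproduce the actual XSD of Figure 25 or the examples of Figure 26.

<WorkflowProcess Id="TeleConnect"
    xmlns:tam="http://www.inf.pucrs.br/~duncan/ddrTAMSchema.xsd">
  <Activities>
    <Activity Id="AllocateCircuit"/>
    <Activity Id="AllocateLines"/>
  </Activities>
  <ExtendedAttributes>
    <!-- Activity dependencies carried as an extended attribute of the process -->
    <ExtendedAttribute Name="ActivityDependencies">
      <tam:Dependency from="AllocateLines" to="AllocateCircuit" type="begin-on-commit"/>
    </ExtendedAttribute>
  </ExtendedAttributes>
</WorkflowProcess>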
RELATED WORK AND CONCLUSION
In this chapter, we have described the ActivityFlow approach to workflow process definition. Interesting features of ActivityFlow are the following. First, we use a small set of constructs and a collection of mechanisms to allow workflow designers to specify the nested process structure and the variety of activity dependencies declaratively and incrementally. The ActivityFlow framework is intuitive and flexible. Additional business rules can be added into the system simply through plug-in actors. The associated graphical notations bring workflow design and automation closer to users. In addition, the restructuring operators can change an ActivityFlow diagram while preserving its business process dependencies. Second, ActivityFlow supports a uniform workflow specification interface to describe different types of workflows (i.e., ad hoc, administrative, collaborative, or production) involved in organizational processes and to increase the flexibility of workflow processes in accommodating changes.

Similar to TAM, Eder and Liebhart (1995) present the WAMO formalism to describe workflow activities as a composition hierarchy and a set of execution dependencies. In particular, expected exceptions are specified with activity hierarchies. Although both TAM and WAMO have their origins in extended transaction models and organize activity descriptions as trees of activities, only TAM offers a set of restructuring operators that is able to restructure such trees of activities. Kumar and Zhao (1999) present an approach similar to TAM (and WAMO) to describe the dependencies among activities. Through sequence constraints and workflow management rules, the properties of a business process are specified. However, the absence of a diagrammatic representation of a modeled business process makes communication among designers and administrators difficult and makes the management of production workflow instances very hard.

Research and development for ActivityFlow continue along several dimensions. On the theoretical side, we are investigating workflow correctness properties and correctness assurance in the concurrent execution of activities. In addition, we are studying a goal-oriented approach to restructuring workflows, driven, for example, by the maximum workflow elapsed time or by the reliability level or throughput of the computational infrastructure. On the practical side, we are examining the implementation architecture presented in Figure 24 using open-source tools (Ruiz et al., 2003) and value-added adapters to support extended transaction models, ActivityFlow specifications, and workflow restructuring operators. In addition, we are exploring the enhancement of process design tools to interoperate with various application development environments.

In summary, we have proposed a framework composed of the ActivityFlow specification language and a set of restructuring mechanisms for workflow process specification and implementation, not a new workflow model. This framework is targeted toward advanced collaborative application domains, such
as computer-aided design, office automation, and CASE tools, all of which require support for complex activities that have sophisticated activity interactions in addition to the hierarchically nested composition structure. Furthermore, like most modeling concepts and specification languages, the proposed framework is based on pragmatic grounds, and hence no rigorous proofs of its completeness can be given. Rather, its usefulness is demonstrated by concrete examples of situations that could not be handled adequately within other existing formalisms for organizing workflow activities.
ACKNOWLEDGMENTS
This project was partially supported by an NSF CCR grant, a DARPA ITO grant, a DOE SciDAC grant, an HP equipment grant, an IBM SUR grant, an IBM Faculty Award, and grants from the Brazilian National Research Council (CNPq). Our thanks are also due to Jesal Pandya and Iffath Zofishan for their implementation effort on the ActivityFlow specification language and its graphical user interface, to Bernardo Copstein and Flavio Oliveira for helping with the elicitation of the project requirements, to Cristiano Meneguzzi and Angelina Oliveira for their specification of the execution environment, to Rafaela Carvalho and Gustavo Forgiarini for the implementation of that environment, and to Lizandro Martins for the XML-Schema-based extensions to support CWf properties on XPDL-WfMC.
REFERENCES
Aalst, W., & Hee, K. (2002). Workflow management: Models, methods, and systems. Cambridge, MA: MIT Press.
Alonso, G., Casati, F., Kuno, H., & Machiraju, V. (2004). Web services – Concepts, architectures and applications. Heidelberg, Germany: Springer-Verlag.
Ansari, M., Ness, L., Rusinkiewicz, M., & Sheth, A. P. (1992). Using flexible transactions to support multi-system telecommunication applications. Proceedings of the 18th International Conference on Very Large Data Bases (pp. 65-76). Oxford, UK: Morgan Kaufmann.
Barga, R. S., & Pu, C. (1995). A practical and modular implementation of extended transaction models. Proceedings of 21st International Conference on Very Large Data Bases (pp. 206-217). Oxford, UK: Morgan Kaufmann.
Bastos, R. M., & Ruiz, D. D. (2001). Towards an approach to model business processes using workflow modeling techniques in production systems. Proceedings of the 34th Hawaii International Conference on System
Sciences, 9, 9036-9045. Los Alamitos, CA: IEEE Computer Society Press.
Dayal, U., Hsu, M., & Ladin, R. (1990). Organizing long-running activities with triggers and transactions. Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data (pp. 204-214). New York: ACM Press.
Eder, J., & Liebhart, W. (1995, May). The workflow activity model WAMO. Proceedings of the Conference on Cooperative Information Systems, 3rd International Conference, CoopIS 1995 (pp. 87-98). Vienna, Austria: CoopIS Press.
Elmagarmid, A. K. (Ed.). (1992). Database transaction models for advanced applications. Oxford, UK: Morgan Kaufmann.
Fowler, M., & Scott, K. (2000). UML distilled: A brief guide to the standard object modeling language. Boston, MA: Addison-Wesley.
Georgakopoulos, D., Hornick, M. F., & Sheth, A. P. (1995). An overview of workflow management: From process modeling to workflow automation infrastructure. Distributed and Parallel Databases, 3(2), 119-153.
Hollingsworth, D., & WfMC. (1995). The workflow reference model. Retrieved November 13, 2004, from http://www.wfmc.org/standards/docs/tc003v11.pdf
Kumar, A., & Zhao, J. L. (1999). Dynamic routing and operational controls in workflow management systems. Management Science, 45(2), 253-272.
Lee, Y. (1996). Rainbow: Prototyping the DIOM interoperable system (Technical Report TR96-32). University of Alberta, Department of Computer Science, Canada.
Leymann, F., & Roller, D. (2000). Production workflow: Concepts and techniques. Upper Saddle River, NJ: Prentice Hall.
Liu, L., & Meersman, R. (1996). The building blocks for specifying communication behavior of complex objects: An activity-driven approach. ACM Transactions on Database Systems, 21(2), 157-207.
Liu, L., & Pu, C. (1998a). A transactional activity model for organizing open-ended cooperative activities. Proceedings of the Hawaii International Conference on System Sciences, VII (pp. 733-742). Los Alamitos, CA: IEEE Computer Society Press.
Liu, L., & Pu, C. (1998b). Methodical restructuring of complex workflow activities. Proceedings of the 14th International Conference on Data Engineering (pp. 342-350). Los Alamitos, CA: IEEE Computer Society Press.
Mesquite Software, Inc. (1994). CSIM18 Simulation Engine (C++ version) User's Guide. Austin, TX: Mesquite Software, Inc.
Mohan, C. (1994). A survey and critique of advanced transaction models. Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis (p. 521). New York: ACM Press.
Nodine, M. H., & Zdonik, S. B. (1990, August 13-16). Cooperative transaction hierarchies: A transaction model to support design applications. Proceedings of the 16th International Conference on Very Large Data Bases, Brisbane, Queensland, Australia.
Ruiz, D. D., Copstein, B., & Oliveira, F. M. (2003). CWf-Flex – an environment for flexible management of software-testing processes, based on workflow technologies and the use of open-source tools, with resource allocation. Porto Alegre, Brazil: CNPq – Brazilian National Research Council.
Ruiz, D. D., Liu, L., & Pu, C. (2002). Distributed workflow restructuring: An experimental study (Technical Report GIT-CC-02-36). Georgia Tech, College of Computing, Atlanta, GA.
Rumbaugh, J., Jacobson, I., & Booch, G. (1999). The unified modeling language reference manual. Boston, MA: Addison-Wesley.
Sheth, A. P. (1995). Workflow automation: Applications, technology and research. Proceedings of the ACM SIGMOD (p. 469). New York: ACM Press.
Sheth, A. P., Georgakopoulos, D., Joosten, S., Rusinkiewicz, M., Scacchi, W., Wileden, J. C., & Wolf, A. L. (1996). Report from the NSF workshop on workflow and process automation in information systems. SIGMOD Record, 25(4), 55-67.
Thatte, S. (Ed.). (2003). Business process execution language for Web services, Version 1.1. Retrieved November 13, 2004, from http://www-106.ibm.com/developerworks/webservices/library/ws-bpel/
TPC-C subcommittee, Transaction Processing Performance Council. (2001). TPC Benchmark C: Revision 5.0. Retrieved November 13, 2004, from http://www.tpc.org/tpcc/
TPC-W subcommittee, Transaction Processing Performance Council. (2001). TPC Benchmark W: Revision 1.7. Retrieved November 13, 2004, from http://www.tpc.org/tpcw/
W3C. (2004). Web services architecture requirements. Retrieved September 9, 2004, from http://www.w3.org/TR/wsa-reqs/
WfMC. (2002). Workflow process definition interface: XML process definition language. Retrieved November 13, 2004, from http://www.wfmc.org/standards/docs/TC-1025_10_xpdl_102502.pdf
WfMC. (2003). Workflow management coalition homepage. Retrieved November 13, 2004, from http://www.wfmc.org
Zhou, T., Pu, C., & Liu, L. (1996, December 18-20). Adaptable, efficient, and modular coordination of distributed extended transactions. Proceedings of the 4th International Conference on Parallel and Distributed Information Systems (pp. 262-273). Los Alamitos, CA: IEEE Computer Society Press.
Zhou, T., Pu, C., & Liu, L. (1998, November 3-7). Dynamic restructuring of transactional workflow activities: A practical implementation method. Proceedings of the 1998 ACM CIKM International Conference on Information and Knowledge Management (pp. 378-385). New York, NY: ACM Press.
Chapter II
Design and Representation of Multidimensional Models with UML and XML Technologies
Juan Trujillo, Universidad de Alicante, Spain
Sergio Luján-Mora, Universidad de Alicante, Spain
Il-Yeol Song, Drexel University, USA
Data warehouses (DW), multidimensional databases (MDB), and OnLine Analytical Processing (OLAP) applications are based on Multidimensional (MD) modeling. Most of these applications provide their own MD models to represent the main MD properties, thereby making the design totally dependent on the target commercial application. In this chapter, we present how the Unified Modeling Language (UML) can be successfully used to abstract the representation of MD properties at the conceptual level. Then, from this conceptual model, we generate its corresponding implementation in any commercial OLAP tool. In our approach, the structure of the system is specified by means of a UML class diagram that considers the main properties of MD modeling. If the system to be modeled is too complex, we describe how to use the package grouping mechanism provided by the UML to simplify the final model. To facilitate the interchange of conceptual MD models, we provide an eXtensible Markup
Language (XML) Schema which allows us to represent the same MD modeling properties that can be considered by using our approach. From this XML Schema, we can directly generate valid XML documents that represent MD models at the conceptual level. Finally, we provide different presentations of the MD models by means of eXtensible Stylesheet Language Transformations (XSLT).
INTRODUCTION
Multidimensional (MD) modeling is the foundation for Data warehouses (DW), multidimensional databases (MDB), and OnLine Analytical Processing (OLAP) applications. The benefit of using MD modeling is twofold. On one hand, the MD model is close to data analyzers' ways of thinking; therefore, it helps users understand data. On the other hand, the MD model supports performance improvement as its simple structure allows us to predict final users' intentions. Some approaches have been proposed lately (presented in the Related Work section) to accomplish the conceptual design of these systems. Unfortunately, none of them have been accepted as a standard for DW conceptual modeling. These proposals try to represent main MD properties at the conceptual level with special emphasis on MD data structures. A conceptual modeling approach for DW, however, should also address other relevant aspects, such as users' initial requirements, the behavior of the system (e.g., main operations to be accomplished on MD data structures), available data sources, specific issues for automatic generation of the database schema, and so on. We claim that object orientation with the UML provides an adequate notation for modeling every aspect of a DW system (MD data structures, the behavior of the system, etc.) from user requirements to implementation.

We have previously proposed an object-oriented (OO) approach to accomplish the conceptual modeling of DW, MDB, and OLAP applications that introduces a set of minimal constraints and extensions of the UML (Booch, Rumbaugh & Jacobson, 1998; OMG, 2001) needed for an adequate representation of MD modeling properties (Trujillo, 2001; Trujillo et al., 2001). These extensions are based on the standard mechanisms provided by the UML to adapt it to a specific method or model (e.g., constraints, tagged values). We have also presented how to group classes into packages to simplify the final model in case the model becomes too complex due to the high number of classes (Luján-Mora, Trujillo & Song, 2002). Furthermore, we have provided a UML-compliant class notation to represent OLAP users' initial requirements (called cube class). We have also discussed issues, such as identifying attributes and descriptor attributes, which set the basis for an adequate semi-automatic generation of a database schema and user requirements in a target commercial OLAP tool.
The UML can also be used with powerful mechanisms, such as the Object Constraint Language (OCL) (Warmer & Kleppe, 1998; OMG, 2001) and the Object Query Language (OQL) (Cattell et al., 2000), to embed DW constraints (e.g., additivity and derived attributes) and users' initial requirements in the conceptual model. In this way, when we model a DW system, we can obtain simple yet powerful extended UML class diagrams that represent main MD properties at a conceptual level.

On the other hand, a salient issue these days in the scientific community and in the business world is the interchange of information. The eXtensible Markup Language (XML) (W3C, 2000) is rapidly being adopted as the standard syntax for the interchange of unstructured, semi-structured, and structured data. XML is an open, platform-neutral, and vendor-independent meta-language, which allows us to reduce the cost, complexity, and effort required in integrating data within and between enterprises. XML documents can be associated with a Document Type Definition (DTD) (W3C, 2000) or an XML Schema (W3C, 2001), both of which allow us to describe and constrain the structure of XML documents. Moreover, thanks to the use of eXtensible Stylesheet Language Transformations (XSLT) (W3C, 1999), users can express their intentions about how XML documents should be presented, so that they can be automatically transformed into other formats, for example, HTML documents. An immediate consequence is that we can define different XSLT stylesheets to provide different presentations of the same XML document. In a previous work (Trujillo, Luján-Mora & Song, 2004), we have presented a DTD for the representation of MD models, and this DTD is then used to automatically validate XML documents.

Based on these considerations, in this chapter we present the following contributions. We believe that our innovative approach provides a theoretical foundation for the possible use of Object-Oriented Databases (OODB) and Object-Relational Databases (ORDB) for DW and OLAP applications. For this reason, we provide the representation of our approach in the standard for OODB proposed by the Object Database Management Group (ODMG) (Cattell et al., 2000). We also believe that a relevant feature of a conceptual model should be its capability to share information in an easy and standard form. Therefore, we also present how to represent MD models, accomplished by using our approach based on the UML, by means of XML. In order to do this, we provide an XML Schema that defines the correct structure and content of an XML document representing main MD properties. Moreover, we also address the presentation of MD models on the Web by means of eXtensible Stylesheet Language Transformations (XSLT): we provide XSLT stylesheets that allow us to automatically generate HTML pages from XML documents that represent MD models, thereby supporting different presentations of the same MD model easily. Finally, to show the benefit of our approach, we include a set of case studies to show the elegant way in which our proposal represents both structural and dynamic properties of MD modeling.
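To anticipate the flavor of such XML documents, the fragment below is a purely illustrative sketch of how an MD model for a product sales example might be serialized. The element and attribute names are our own assumptions for illustration; they are not the XML Schema presented later in this chapter.

<MDModel name="ProductSales">
  <Fact name="Sales">
    <Measure name="quantity"/>
    <Measure name="price"/>
    <Measure name="total_price"/>
  </Fact>
  <Dimension name="Product">
    <Level name="Product"/>
    <Level name="Brand"/>
    <Level name="Group"/>
  </Dimension>
  <Dimension name="Time">
    <Level name="Time"/>
    <Level name="Month"/>
    <Level name="Year"/>
  </Dimension>
</MDModel>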
The remainder of this chapter is organized as follows: The Multidimensional Modeling section details the major features of MD modeling that should be taken into account for a proper MD conceptual design. The Related Work section summarizes the most relevant conceptual approaches proposed so far by the research community. In the Multidimensional Modeling with UML section, we present how we use the UML to consider main structural and dynamic MD properties at the conceptual level. We also present how to facilitate the interchange of MD models by generating the corresponding standard provided by the ODMG and the XML Schema from UML. In the Case Studies section, we present a set of case studies taken from Kimball (2002) to show the benefit of our approach. Finally, the Conclusions section draws conclusions and sketches out new research that is currently being investigated.
MULTIDIMENSIONAL MODELING
In MD modeling, information is structured into facts and dimensions. A fact is an item of interest for an enterprise and is described through a set of attributes called measures or fact attributes (atomic or derived), which are contained in cells or points in the data cube. This set of measures is based on a set of dimensions that determine the granularity adopted for representing facts (i.e., the context in which facts are to be analyzed). Moreover, dimensions are also characterized by attributes, which are usually called dimension attributes. They are used for grouping, browsing, and constraining measures.

Let us consider an example in which the fact is the product sales in a large store chain, and the dimensions are as follows: product, store, customer, and time. On the left hand side of Figure 1, we can observe a data cube typically used for representing an MD model. In this particular case, we have defined a cube for analyzing measures along the product, store, and time dimensions. We note that a fact usually represents a many-to-many relationship between any two dimensions. For example, a product is sold in many stores, and a store sells many products. We also assume that there is a many-to-one relationship between a fact and each particular dimension. For example, for each store there are many sale tickets, but each sale ticket belongs to only one store. Nevertheless, there are some cases in which a fact may be associated with a particular dimension as a many-to-many relationship. For example, the fact product_sales is considered as a particular many-to-many relationship to the product dimension, as one ticket may consist of more than one product even though every ticket is still purchased in only one store, by one customer, and at one time.

With reference to measures, the concept of additivity or summarizability (Blaschka et al., 1998; Golfarelli, Maio & Rizzi, 1998; Kimball, 2002; Trujillo et al., 2001; Tryfona, Busborg & Christiansen, 1999) on measures along dimensions
is crucial for MD data modeling. A measure is additive along a dimension if the SUM operator can be used to aggregate attribute values along all hierarchies defined on that dimension. The aggregation of some fact attributes (roll-up, in OLAP terminology), however, might not be semantically meaningful for all measures along all dimensions. A measure is semi-additive if the SUM operator can be applied to some dimensions but not all the dimensions. A measure is nonadditive if the SUM operator cannot be applied to any dimension. In our example, number of clients (estimated by counting the number of purchased receipts for a given product, day, and store) is not additive along the product dimension. Since the same ticket may include other products, adding up the number of clients along two or more products would lead to inconsistent results. However, other aggregation operators (e.g., SUM, AVG, and MIN) could still be used along other dimensions, such as time. Thus, number of clients is semi-additive. Finally, examples of non-additive measures would be those measures that record a static level, such as inventory or financial account balances, or measures of intensity, such as room temperatures (Kimball, 2002).

Figure 1. A data cube and classification hierarchies defined on dimensions

Regarding dimensions, the classification hierarchies defined on certain dimension attributes are crucial because the subsequent data analysis will be addressed by these classification hierarchies. A dimension attribute may also be aggregated (related) to more than one hierarchy. Therefore, multiple classification hierarchies and alternative path hierarchies are also relevant. For this reason, a common way of representing and considering dimensions with their classification hierarchies is by means of Directed Acyclic Graphs (DAG). On the right hand side of Figure 1, we can observe different classification hierarchies defined on the product, store, and time dimensions. On the product dimension, we have considered a multiple classification hierarchy to be able to aggregate data values along two different hierarchy paths: (1) product, type, family, group and (2) product, brand. On the other hand, we can also find attributes that are not used for aggregation purposes; instead, they provide features for other dimension attributes (e.g., product name). On the store dimension, we have defined an alternative classification hierarchy with two different paths that converge into the same hierarchy level: (1) store, city, province, state and (2) store, sales_area, state. Finally, we have defined
another multiple classification hierarchy with the following paths on the time dimension: (1) time, month, semester, year and (2) time, season.

Nevertheless, classification hierarchies are not so simple in most cases. The concepts of strictness and completeness are quite important, not only for conceptual purposes, but also for further design steps of MD modeling (Tryfona et al., 1999). Strictness means that an object of a lower level in a hierarchy belongs to only one object of a higher level; for example, a province is related to only one state. Completeness means that all members belong to one higher-class object, which consists of those members only. For example, suppose that the classification hierarchy between the state and province levels is complete. In this case, a state is formed by all the provinces recorded, and all the provinces that form the state are recorded.

OLAP scenarios sometimes become very large as the number of dimensions increases significantly, which may then lead to extremely sparse dimensions and data cubes. In this way, some attributes are normally valid for all elements within a dimension while others are only valid for a subset of elements; this is also known as the categorization of dimensions (Lehner, 1998; Tryfona et al., 1999). For example, the attributes alcohol percentage and volume would only be valid for drink products and would be null for food products. Thus, a proper MD data model should be able to consider attributes only when necessary, depending on the categorization of dimensions. Furthermore, let us suppose that, apart from a high number of dimensions (e.g., 20) with their corresponding hierarchies, we have a considerable number of facts (e.g., 8) sharing dimensions and classification hierarchies. This will lead us to a very complex design, thereby increasing the difficulty of reading the modeled system. To avert a convoluted design, an MD conceptual model should also provide techniques to avoid flat diagrams, allowing us to group dimensions and facts to simplify the final model.

Once the structure of the MD model has been defined, OLAP users usually define a set of initial requirements as a starting point for the subsequent data analysis phase. From these initial requirements, users can apply a set of operations (usually called OLAP operations) (Chaudhuri & Dayal, 1997) to the MD view of data for further data analysis. These OLAP operations are usually as follows: roll-up (increasing the level of aggregation) and drill-down (decreasing the level of aggregation) along one or more classification hierarchies, slice-dice (selection and projection), and pivoting (reorienting the MD view of data, which also allows us to exchange dimensions for facts, that is, symmetric treatment of facts and dimensions).
Star Schema
In this subsection, we will summarize the star schema popularized by Kimball (2002), as it is the most well-known schema representing MD properties in relational databases.
Figure 2. Sales Dimension Model
Kimball claims that the star schema and its variants, the fact constellation schema and the snowflake schema, are logical choices for MD modeling to be implemented in relational systems. We will briefly introduce this well-known approach using the Sales Dimensional Model. Figure 2 shows an example of Kimball's Sales Dimensional Model. In this model, the fact is the name of the middle box (Sales fact table). Measures are the non-foreign keys in the fact table (dollars_sold, units_sold, and dollars_cost). Dimensions are the boxes connected to the fact table in a one-to-many relationship (Time, Store, Product, Customer, and Promotion). Each dimension contains relevant attributes: day_of_week, week_number, and month in Time; store_name, address, district, and floor_type in Store; and so on.

From Figure 2, we can easily see that there are many MD features that are not reflected in the Dimensional Model: What are the classification hierarchies defined on dimensions? Can we use all aggregation operators on all measures along all dimensions? What are these classification hierarchies like (nonstrict, strict, and complete)? And many more properties. Therefore, we argue that for a proper DW and OLAP design, a conceptual MD model should be provided to better reflect user requirements. This conceptual model could then be translated into a logical model for a later implementation. In this way, we can be sure that we are analyzing the real world as users perceive it.
RELATED WORK
Lately, several MD data models have been published. Some of them fall into the logical level (such as the well-known star schema by Kimball, 2002). Others may be considered as formal models, as they provide a formalism to consider
main MD properties. A review of the most relevant logical and formal models can be found in Blaschka et al. (1998) and Abello, Samos, and Saltor (2001). In this section, we will only briefly make reference to the most relevant models that we consider "pure" conceptual MD models. These models provide a high level of abstraction for the main MD modeling properties presented in the Multidimensional Modeling section and are totally independent of implementation issues. They are as follows: the Dimensional-Fact (DF) model by Golfarelli et al. (1998), the Multidimensional/ER (M/ER) model by Sapia, Blaschka, Höfling, and Dinter (1998) and Sapia (1999), and the starER model by Tryfona et al. (1999).

In Table 1, we provide the coverage degree of each above-mentioned conceptual model regarding the main MD properties described in the previous section. To start with, to the best of our knowledge, no proposal provides a grouping mechanism to avoid flat diagrams and to simplify the conceptual design when a system becomes complex due to a high number of dimensions and facts sharing dimensions and their corresponding hierarchies. Regarding facts, only the starER model considers many-to-many relationships between facts and particular dimensions by indicating the exact cardinality (multiplicity) between them. None of them consider derived measures or their derivation rules as part of the conceptual schema. The DF and the starER models consider the additivity of measures by explicitly representing the set of aggregation operators that can be applied on nonadditive measures. With reference to dimensions, all of the models consider multiple and alternative path classification hierarchies by means of Directed Acyclic Graphs (DAG) defined on certain dimension attributes. However, only the starER model considers nonstrict and complete classification hierarchies by specifying the exact cardinality between classification hierarchy levels. As both the M/ER and the starER models are extensions of the Entity Relationship (ER) model, they can easily consider the categorization of dimensions by means of Is-a relationships.

With reference to the dynamic level of MD modeling, the starER model is the only one that does not provide an explicit mechanism to represent users' initial requirements. On the other hand, only the M/ER model provides a set of basic OLAP operations to be applied from these users' initial requirements, and it models the behavior of the system by means of state diagrams. We note that all the models provide a graphical notation that facilitates the conceptual modeling task for the designer. On the other hand, only the M/ER model provides a framework for an automatic generation of the database schema into a target commercial OLAP tool (particularly into Informix Metacube and Cognos Powerplay).

Finally, none of the proposals in Table 1 provide a mechanism to facilitate the interchange of the models following standard representations. Regarding MD modeling and XML (W3C, 2000), some proposals have been presented.
Table 1. Comparison of conceptual multidimensional models

Multidimensional modeling properties                        DF    M/ER   starER
Structural level
  Grouping mechanism                                         No    No     No
Facts
  Many-to-many relationships with particular dimensions      No    No     Yes
  Atomic measures                                            Yes   Yes    Yes
  Derived measures                                           No    No     No
  Additivity                                                 Yes   No     Yes
Dimensions
  Multiple and alternative path classification hierarchies   Yes   Yes    Yes
  Nonstrict classification hierarchies                       No    No     Yes
  Complete classification hierarchies                        No    No     Yes
  Categorization of dimensions                               No    Yes    Yes
Dynamic level
  Specifying users' initial requirements                     Yes   Yes    No
  OLAP operations                                            No    Yes    No
  Modeling system behavior                                   No    Yes    No
Graphical notation                                           Yes   Yes    Yes
Automatic generation into a target OLAP commercial tool      No    Yes    No
All of these proposals make use of XML as the base language for describing data. In Pokorný (2001), an innovative data structure called an XML-star schema is presented with explicit dimension hierarchies using DTDs that describe the structure of the objects permitted in XML data. The approach presented in Golfarelli, Rizzi, and Vrdoljak (2001) proposes a semi-automatic approach for building the conceptual schema for a data mart starting from the XML sources. However, these approaches focus on the presentation of the multidimensional XML data rather than on the presentation of the structure of the multidimensional conceptual model itself.

From Table 1, one may conclude that none of the current conceptual modeling approaches consider all MD properties at both the structural and dynamic levels. Therefore, we claim that a standard conceptual model is needed to consider all MD modeling properties at both the structural and dynamic levels. We argue that an OO approach with the UML is the right way of linking structural- and dynamic-level properties in an elegant way at the conceptual level.
MULTIDIMENSIONAL MODELING WITH UML
In this section, we summarize how our OO MD model, based on a subset of the UML, can represent main structural and dynamic properties of MD modeling. In the Structural Properties by Using UML Class Diagrams section, we will present how to represent main structural properties by means of a UML class diagram. The Dynamic Properties section summarizes how users' initial requirements are easily considered by what we call cube classes. The Standard Representation By Using The ODMG Proposal section sketches how we
automatically transform an MD model accomplished by following our approach into the Object Database Standard defined by the Object Database Management Group (ODMG) (Cattell et al., 2000). Then, the Cube Classes Represented By Using OQL section presents the corresponding representation of our approach in XML (W3C, 2000) to allow an easy interchange of MD information. Finally, the XML to Interchange Multidimensional Properties section describes how to use XSLT stylesheets to automatically generate HTML pages from XML documents, thereby allowing us to manage different presentations of MD models on the Web.
Structural Properties By Using UML Class Diagrams
The main structural features considered by UML class diagrams are the many-to-many relationships between facts and dimensions, degenerate dimensions, multiple and alternative path classification hierarchies, and nonstrict and complete hierarchies. It is important to remark that if we are modeling complex and large DW systems, we are not restricted to using flat UML class diagrams. Instead, we can make use of the grouping mechanism provided by the UML, called package, to group classes together into higher-level units and to create different levels of abstraction, therefore simplifying the final model (Luján-Mora et al., 2002). In this way, a UML class diagram improves and simplifies the system specification accomplished by classic semantic data models, such as the ER model. Furthermore, necessary operations and constraints (e.g., additivity rules) can be embedded in the class diagram by means of OCL expressions (OMG, 2001; Warmer et al., 1998).

In this approach, the main structural properties of MD models are specified by means of a UML class diagram in which the information is clearly separated into facts and dimensions. Dimensions and facts are represented by dimension classes and fact classes, respectively. Then, fact classes are specified as composite classes in shared aggregation relationships of n dimension classes. The flexibility of shared aggregation in the UML allows us to represent many-to-many relationships between facts and particular dimensions by indicating the 1..* cardinality on the dimension class role. In our example in Figure 3(a), we can see how the fact class Sales has a many-to-one relationship with both dimension classes. By default, all measures in the fact class are considered additive. For nonadditive measures, additivity rules are defined as constraints and are included in the fact class. Furthermore, derived measures can also be explicitly considered (indicated by /), and their derivation rules are placed between braces near the fact class, as shown in Figure 3(a).

This OO approach also allows us to define identifying attributes in the fact class by placing the constraint {OID} next to an attribute name. In this way, we
can represent degenerate dimensions (Giovinazzo, 2000; Kimball, 2002), thereby representing other fact features in addition to the measures for analysis. For example, we could store the ticket number (ticket_num) and the line number (line_num) as degenerate dimensions, as reflected in Figure 3(a).

With respect to dimensions, every classification hierarchy level is specified by a class (called a base class). An association of classes specifies the relationships between two levels of a classification hierarchy. The only prerequisite is that these classes must define a Directed Acyclic Graph (DAG) rooted in the dimension class (constraint {dag} placed next to every dimension class). The DAG structure can represent both alternative path and multiple classification hierarchies. Every classification hierarchy level must have an identifying attribute (constraint {OID}) and a descriptor attribute1 (constraint {D}). These attributes are necessary for an automatic generation process into commercial OLAP tools, as these tools store the information in their metadata. The multiplicities 1 and 1..*, defined in the target associated class role, address the concepts of strictness and nonstrictness, respectively. Strictness means that an object at a hierarchy's lower level belongs to only one higher-level object (e.g., as one month can be related to more than one season, the relationship between them is nonstrict). Moreover, defining the {completeness} constraint in the target associated class role addresses the completeness of a classification hierarchy (see an example in Figure 3(b)). By completeness, we mean that all members belong to one higher-class object, and that object consists of those members only. For example, all the recorded seasons form a year, and all the seasons that form the year have been recorded. Our approach assumes all classification hierarchies are noncomplete by default.

Finally, the categorization of dimensions, used to model additional features for a class's subtypes, is represented by means of generalization-specialization relationships. However, only the dimension class can belong to both a classification and a specialization hierarchy at the same time. An example of categorization for the Product dimension is shown in Figure 3(c).
Figure 3. Multidimensional modeling using UML
Dynamic Properties
Regarding dynamic properties, this approach allows us to specify users' initial requirements by means of a UML-compliant class notation called the cube class. After requirements are specified, behavioral properties are usually then related to these cube classes that represent users' initial requirements. Cube classes follow the query-by-example (QBE) method: the requirements are defined by means of a template with blank fields. Once requirements are defined, the user can then enter conditions for each field, which are included in the query. We provide a graphical representation to specify users' initial requirements because QBE systems are considered easier to learn than formal query languages. The structure of a cube class is shown in Figure 4:
• Cube class name.
• Measures area, which contains the measures from the fact to be analyzed.
• Slice area, which contains the constraints to be satisfied in the dimensions.
• Dice area, which contains the dimensions and their grouping conditions to address the analysis.
• Order area, which specifies the order of the result set.
• Cube operations, which cover the OLAP operations for a further data-analysis phase.
We should point out that this graphical notation of the cube class aims at facilitating the definition of users' initial requirements for nonexpert UML or database users. More formally, every one of these cube classes has an underlying OQL specification. Moreover, an expert user can directly define cube classes by specifying the OQL sentences (see the next section for more detail on the representation of cube classes by using OQL).
Figure 4. Cube class structure
Standard Representation By Using The ODMG Proposal
Our approach generates the corresponding representation of an MD model in most relational database management systems, such as Oracle, Informix, Microsoft SQL Server, IBM DB2, and so on (Trujillo et al., 2001). Furthermore, we also provide the corresponding representation in object-oriented databases. However, this representation is totally dependent on the object database management system (ODBMS) used for the corresponding implementation. For this reason, in this section we present the corresponding representation of an MD model accomplished by our approach following the standard for ODBMS2 proposed by the Object Database Management Group (ODMG) (Cattell et al., 2000). The adoption of this standard ensures the portability of our MD model across platforms and products, thereby facilitating the use of our approach. However, we also point out some properties that cannot be directly represented by using this standard and that should be taken into account when transforming this representation into a particular object-oriented model of the target ODBMS.

The major components of the ODMG standard are the Object Model, the Object Definition Language (ODL), the Object Query Language (OQL), and the bindings of the ODMG implementations to different programming languages (C++, Smalltalk, and Java). In this chapter, we will start by providing the corresponding representation of structural properties in the ODL, a specification language used to define the specifications of object types. Then, we will sketch how to represent cube classes in the OQL, a query language that supports the ODMG data model. The great benefit of OQL is that it is very close to SQL and is, therefore, a very simple-to-use query language.
ODL Definition of an MD Model
Our three-level MD model cannot be directly represented in an ODBMS because the ODL uses a flat representation for the class diagram, without providing any package mechanism. Therefore, we start the transformation of the MD models from the third level in the fact package because it contains the complete MD model definition: fact classes, dimension classes, base classes, classification hierarchy properties, and so forth. In the following, we use an actual example to clarify our approach. We have selected a simplification of the grocery example taken from Kimball's (2002) book. In this example, the corresponding MD model contains the following elements:
• One fact (Sales) with three measures (quantity, price, and total_price) and two degenerate dimensions (ticket_num and line_num).
• Two dimensions: Product with three hierarchy levels (Brand, Subgroup, and Group) and Time with two hierarchy levels (Month and Year).
The first level of the MD model is represented in Figure 5 and only contains one star schema package, as the example only contains one fact. The second level contains one fact package (Sales product) and two dimension packages (Product and Time), as can be seen in Figure 6. Finally, Figure 7 represents the content of the Product dimension package and Figure 8 the content of the Time dimension package. In Figure 9, we can see the content (level 3) of the Sales product fact package, where the complete definition of the MD model is available. The transformation process starts from this view of the MD model. For the sake of simplicity, we show the ODL representation of only three classes: Sales, Product, and Time (the representation of the other classes is very similar). The transformation process starts from the fact class (Sales). Since OID attributes cannot be represented in ODL, we have decided to use the unsigned long type to represent them. Aggregation relationships cannot be directly represented, but we transform them into association relationships. Moreover, the maximum cardinality of relationships can be expressed, but the minimum cardinality is lost in the transformation process. In ODL, the definition of a relationship includes the designation of the target type, the cardinality on the target side, and information about the inverse relationship found on the target side. The ODL definition for the Sales fact class is as follows:
Figure 5. First level
Figure 6. Second level
Figure 7. Product dimension (third level)
Figure 8. Time dimension (third level)
Figure 9. Third level of the MD model
class Sales {
    attribute unsigned long ticket_num;
    attribute unsigned long line_num;
    attribute long quantity;
    attribute double price;
    attribute double total_price;
    relationship Product sales_product inverse Product::product_sales;
    relationship Time sales_time inverse Time::time_sales;
};

For expressing the cardinality ?-to-many, we use the ODL constructor set. For example, the Product class has three relationships: with the Sales class (?-to-many), with the Brand class (?-to-one), and with the Subgroup class (?-to-many). In order to know the cardinality of the relationships on this side, we have to consult the inverse relationship on the target side. For example, the relationship between Product and Sales is one-to-many, since the type of the relationship is set (many) on this side, but in the inverse relationship (Sales::sales_product) it is Product (one). The Product and Time dimension classes are specified in ODL as:
class Product {
    attribute unsigned long upc;
    attribute string name;
    attribute float weight;
    relationship set<Sales> product_sales inverse Sales::sales_product;
    relationship Brand product_brand inverse Brand::brand_product;
    relationship set<Subgroup> product_subgroup inverse Subgroup::subgroup_product;
};

class Time {
    attribute unsigned long code;
    attribute date day;
    attribute boolean holiday;
    relationship set<Sales> time_sales inverse Sales::sales_time;
    relationship Month time_month inverse Month::month_time;
};
Loss of Expressiveness
As previously commented, some MD properties that are captured in our approach cannot be directly represented by using ODL. This problem arises because the ODL is a general definition language that is not oriented to representing the MD properties used in a conceptual design. Specifically, we ignore or transform the following properties:
• Identifying attribute (OID) and descriptor attribute (D) are ignored because they are considered to be an implementation issue that will be automatically generated by the ODBMS.
• Initial values are ignored. This is not a key issue in conceptual MD modeling.
• Derived attributes and their corresponding derivation rules are ignored. These derivation rules will have to be specified when defining user requirements by using the OQL.
• Additivity rules are ignored because the ODL specification cannot represent any information related to the aggregation operators that can be applied on measures.
• Minimum cardinality cannot be specified either.
• Completeness of a classification hierarchy is also ignored.
Up to now, these ignored properties have to be documented as notes accompanying the ODMG specification. For an unambiguous specification of MD models using the ODMG standard, a formal constraint language should be used. Unfortunately, a constraint language is completely missing from the ODMG standard specification.
Cube Classes Represented By Using OQL
The OQL is not easy to use for defining users' initial requirements because the user needs to know the underlying ODL representation corresponding to the MD model. For this reason, we also provide cube classes, which allow the user to define initial requirements in a graphical way. These cube classes can automatically be transformed into OQL sentences and can therefore be used to query an ODBMS that stores an MD model. For example, let us suppose the following initial requirement: the quantity sold of the products belonging to the "Grocery" Group during "January", grouped according to the product Subgroup and the Year and ordered by the Brand of the product.
In Figure 10, we can see the cube class corresponding to the previous requirement. It is easy to see how the cube class is formed:
• Measures contains the goal of the analysis: SUM(quantity).
• Slice contains the restrictions defined on the Time and Product dimensions.
• Dice contains the grouping conditions required along the Product and Time dimensions.
• Order defines the order of the result set.
The cube class can be automatically translated into OQL. The algorithm uses the corresponding ODL definition of the MD model to obtain the paths from the fact class (the core of the analysis) to the rest of the classes (dimension and base classes). For example, the path from the Sales fact class to the Year base class along the Time dimension traverses the relationships sales_time in the Sales fact class, time_month in the Time dimension class, and month_year in the Month base class. Moreover, when attributes' names are omitted in the cube class, the algorithm automatically selects the descriptor attribute defined in the MD model.

Figure 10. An example of a user's initial requirement
For example, the expression Time.Month="January" of the cube class in Figure 10 involves the use of the descriptor attribute from the Month base class because no further attribute is specified. In the same way, the order expression Product.Brand involves the use of the descriptor attribute from Brand. The OQL for the cube class in Figure 10 is as follows:

SELECT SUM(s.quantity)
FROM Sales s, s.sales_time st, s.sales_product sp
WHERE st.time_month.name = "January"
  AND sp.product_subgroup.subgroup_group.name = "Grocery"
GROUP BY sp.product_subgroup.name,
  st.time_month.month_year.number
ORDER BY sp.product_brand.name
XML to Interchange Multidimensional Properties
One key aspect in the success of an MD model should be its capability to interchange information in an easy and standard format. XML (W3C, 2000) is rapidly being adopted as the standard for the exchange of unstructured, semistructured, and structured data. Furthermore, XML is an open, platform-neutral, and vendor-independent meta-language, which allows users to reduce the cost, complexity, and effort required in integrating data within and between enterprises. In the future, all applications may exchange their data in XML, and conversion utilities will then no longer be necessary. We have adopted XML to represent our MD models due to its advantages, such as standardization, usability, versatility, and so on. We have defined an XML Schema (W3C, 2001) that determines the correct structure and content of XML documents that represent MD models. Moreover, this XML Schema can be used to automatically validate the XML documents. In Appendix 1, we include the whole XML Schema that we have defined to represent MD models in XML. This XML Schema allows us to represent both structural and dynamic properties of MD models. In Figure 11, Figure 12, and Figure 13, we have graphically represented the main rules of our XML Schema, which contains the definition of 25 elements (tags). We have defined additional elements (in plural form) in order to group common elements together so that they can be exploited to provide optimum and correct comprehension of the model, for example, elements in plural like PKSCHEMAS or DEPENDENCIES. The XML Schema follows the three-level structure of our MD approach:
• An MDMODEL contains PKSCHEMAS (star schema packages) at level 1 (Figure 11).
• A PKSCHEMA contains, at most, one PKFACT (fact package) and many PKDIM (dimension packages) grouped by a PKDIMS element at level 2 (Figure 12).
• A PKFACT contains, at most, one FACTCLASS (Figure 12), and a PKDIM contains, at most, one DIMCLASS and many BASECLASSES (Figure 13) at level 3 (a sketch of a document following this structure is given below).
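As a concrete illustration of this three-level nesting, the following is a minimal, hand-written sketch of an XML document for the grocery example of the previous section. The element names (MDMODEL, PKSCHEMAS, PKSCHEMA, PKFACT, PKDIMS, PKDIM, FACTCLASS, DIMCLASS) are taken from the text, whereas the id and name attributes and the overall layout are illustrative assumptions rather than the exact content mandated by the schema in Appendix 1.

<MDMODEL id="m1" name="Grocery">
  <PKSCHEMAS>
    <PKSCHEMA id="s1" name="Sales star">           <!-- level 1: one star schema package -->
      <PKFACT id="pf1" name="Sales product">       <!-- level 2: at most one fact package -->
        <FACTCLASS id="f1" name="Sales"/>          <!-- level 3: the fact class -->
      </PKFACT>
      <PKDIMS>                                     <!-- level 2: group of dimension packages -->
        <PKDIM id="pd1" name="Product">
          <DIMCLASS id="d1" name="Product"/>       <!-- level 3: dimension class -->
        </PKDIM>
        <PKDIM id="pd2" name="Time">
          <DIMCLASS id="d2" name="Time"/>
        </PKDIM>
      </PKDIMS>
    </PKSCHEMA>
  </PKSCHEMAS>
</MDMODEL>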
Within our XML Schema, fact classes (labeled FACTCLASS) may have no fact attributes at all, in order to support factless fact tables; this can be observed in the content model of the element FACTATTS, which groups the attributes of a fact class and admits zero or more FACTATT elements.
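A minimal sketch of how such a content model can be declared in XML Schema is given below. The names FACTATTS and FACTATT come from the text; the use of a named type FACTATT for the nested element is an assumption, and the authoritative declaration is the one in the schema of Appendix 1.

<xsd:element name="FACTATTS">
  <xsd:annotation>
    <xsd:documentation>Group of attributes of a fact class</xsd:documentation>
  </xsd:annotation>
  <xsd:complexType>
    <xsd:sequence>
      <!-- zero or more fact attributes: an empty sequence models a factless fact table -->
      <xsd:element name="FACTATT" type="FACTATT" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>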
Finally, a fact package (Figure 12) contains, at most, one fact class, and each fact class (Figure 14) can contain fact attributes (FACTATTS), methods (METHODS), and shared aggregations with the dimension classes (SHAREDAGGS). Notice that many-to-many relationships between facts and dimensions can also be expressed by assigning the same value "M" to both attributes roleA and roleB in the XML Schema element SHAREDAGG.
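To illustrate the last point, a document fragment expressing a many-to-many shared aggregation could look as follows; the element names SHAREDAGGS/SHAREDAGG and the attributes roleA and roleB are taken from the text, while the remaining attributes (naming the related fact and dimension classes) are assumptions made only for this example.

<SHAREDAGGS>
  <!-- "M" on both roles expresses a many-to-many relationship
       between the fact class and the dimension class -->
  <SHAREDAGG factclass="Sales" dimclass="Product" roleA="M" roleB="M"/>
</SHAREDAGGS>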
Different Presentations of MD Models
Another relevant issue of our approach is to provide different presentations of the MD models on the Web. To solve this problem, we use XSL Transformations (XSLT) (W3C, 1999), a technology that allows us to define the presentation of XML documents. XSLT stylesheets describe a set of patterns (templates) that match both elements and attributes defined in an XML Schema, in order to apply specific transformations for each considered match. Thanks to XSLT, the source document can be filtered and reordered when constructing the resulting output. Figure 15 illustrates the overall transformation process for an MD model. The MD model is stored in an XML document, and an XSLT stylesheet is provided to generate different presentations of the MD model, for example, as a Portable Document Format (PDF) file or as an HTML document. Due to space constraints, it is not possible to include the complete definition of the XSLT stylesheet here. Therefore, we only exhibit some fragments of the XSLT. The first example shows the instructions that generate the HTML code to display information about fact attributes (FACTATT):
(This fragment generates an HTML table with the column headings Name, Type, Initial, Derivation Rule, and DD, and fills one row with the values of each FACTATT; the DD column defaults to false.)
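The following is a small sketch of what such an XSLT template could look like, written only from the description above; the attribute names read by the xsl:value-of instructions (name, type, init, rule, DD) are assumptions and may differ from those used in the actual stylesheet, and the surrounding stylesheet (including the xsl namespace declaration) is omitted.

<xsl:template match="FACTATT">
  <!-- one HTML table row per fact attribute -->
  <tr>
    <td><xsl:value-of select="@name"/></td>
    <td><xsl:value-of select="@type"/></td>
    <td><xsl:value-of select="@init"/></td>
    <td><xsl:value-of select="@rule"/></td>
    <td><xsl:value-of select="@DD"/></td>
  </tr>
</xsl:template>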
Figure 15. Generating different presentations from the same MD model
Notice that XSLT instructions (highlighted in bold typeface) and HTML tags are intermingled. The XSLT processor copies the HTML tags to the transformed document and interprets any XSLT instruction encountered. Applied to this example, the values of the attributes of the element FACTATT are inserted into the resulting document in an HTML table.
CASE STUDIES
The aim of this section is to exemplify the usage of our conceptual modeling approach in modeling MD databases. We have selected three different examples taken from Kimball's (2002) book, each of which introduces a particular modeling feature: a warehouse, a large bank, and a college course. Due to the lack of space, we will only apply our complete modeling approach to the first example: we will apply all of the diagrams we use for modeling a DW (package diagrams, class diagrams, interaction diagrams, etc.). For the rest of the examples, we will only focus on representing the structural properties of MD modeling by specifying the corresponding UML class diagram. This class diagram is the key one in our approach, since the rest of the diagrams can be easily obtained from it.
The Warehouse
This example explores three inventory models of a warehouse. The first one is the inventory snapshot, where the inventory levels are measured every day and are placed in separate records in the database. The second model is the delivery status model, which contains one record for each delivery to the warehouse; the disposition of all the items is registered until they have left the warehouse. Finally, the third inventory model is the transaction model, which records every change of the status of delivery products as they arrive at the warehouse, are processed into the warehouse, and so forth. This example introduces two important concepts: semi-additivity and the multistar model (also known as fact constellations). The former has already been introduced in the Multidimensional Modeling section and refers to the fact that a measure cannot be summarized by using the sum function along a dimension. In this example, the inventory level (stock) of the warehouse is semi-additive because it cannot be summed along the time dimension, but it can be averaged along the same dimension. The multistar (fact constellations) concept refers to the fact that the same MD model has multiple facts. In our approach, we model multistar models by means of package diagrams. In this way, at the first level, we create a package diagram for each one of the facts considered in the model. At this level, connecting package diagrams means that a model will use elements (e.g., dimensions, hierarchies) defined in the other package. Figure 16 shows the first level of the
model formed by three packages that represent the different star schemas in the case study. Then, we explore each package diagram at the second level to define packages for each one of the facts and dimensions defined in the corresponding package diagram. Figure 17 shows the content of the package Inventory Snapshot Star at level 2. The fact package Inventory Snapshot Fact is represented in the middle of Figure 17, and the dimension packages (Product Dimension, Time Dimension, and Warehouse Dimension) are placed around the fact package. As can be seen, a dependency is drawn from the fact package to each one of the dimension packages because the fact package comprises the whole definition of the star schema. At level 2, it is possible to create a dependency from a fact package to a dimension package or between dimension packages (when they share some hierarchy levels) but not from a dimension package to a fact package. Figure 18 shows the content of the package Inventory Transaction Star at level 2. As in the Inventory Snapshot Star, the fact package is placed in the middle of the figure, and the dimension packages are placed around the fact package in a star fashion. Three dimension packages (Product Dimension, Time Dimension, and Warehouse Dimension) have been previously defined in the Inventory Snapshot Star (Figure 17), and they are imported in this package. Therefore, the name of the package where they have been previously defined appears below the package name (from Inventory Snapshot Star). The content of the dimension and fact packages is represented at level 3. The diagrams at this level are only comprised of classes and their associations. For example, Figure 19 shows the content of the package Warehouse Dimension at level 3. In a dimension package, a class is drawn for the dimension class (Warehouse) and a class for each classification hierarchy level (ZIP, City, County, State, SubRegion, and Region). For the sake of simplicity, the methods of each class have not been depicted in the figure. As can be seen in Figure 19, Warehouse presents alternative path classification hierarchies: (1) ZIP, City, County, State and (2) SubRegion, Region, State. Finally, Figure 20 shows the content of the package Inventory Snapshot Fact. In this package, the whole star schema is displayed: the fact class (Inventory Snapshot) is defined, and the dimensions with their corresponding hierarchy levels are imported from the dimension packages. To avoid unnecessary details, we have hidden the attributes and methods of dimensions and hierarchy levels, but the measures of the fact are shown as attributes of the fact

Figure 16. Level 1
class: four atomic measures (quantity_on_hand, quantity_shipped, value_at_cost, and value_at_LSP) and three derived measures (number_of_turns, gross_profit, and gross_margin). The definition of the derived measures is included in the model by means of derivation rules. Regarding the additivity of the measures, only quantity_on_hand is semi-additive; because of this, an additivity rule has been added to the model. Finally, Warehouse presents alternative path classification hierarchies, and Time and Product present multiple classification hierarchies, as can be seen in Figure 20.
Regarding the dynamic part of the model, let us suppose the following user's initial requirement on the MD model specified by the UML class diagram of Figure 20: We wish to analyze the quantity_on_hand of products where the group of products is "Grocery" and the warehouse state is "Valencia", grouped according to the product subgroup and the warehouse region and subregion, and ordered by the warehouse subregion and region.
On the left hand side of Figure 21, we can observe the graphical notation of the cube class that corresponds to this requirement. The measure to be analyzed (quantity_on_hand) is specified in the measure area. Constraints defined on dimension classification hierarchy levels (group and state) are included in the slice area, while classification hierarchy levels along which we are interested in analyzing measures (subgroup, region, and subregion) are included in the dice area.

Figure 17. Level 2 of Inventory Snapshot Star
Figure 18. Level 2 of Inventory Transaction Star
Figure 19. Level 3 of warehouse dimension
Figure 20. Level 3 of Inventory Snapshot Fact
Finally, the available OLAP operations are specified in the CO (Cube Operations) section (in this example, the CO are omitted to avoid unnecessary detail). On the right hand side of Figure 21, the OQL sentence corresponding to the cube class is shown. We can notice how the descriptor attributes from the MD model are used when the attributes of the hierarchy levels are omitted in the analysis. For example, the expression Warehouse.State="Valencia" of the cube class involves the use of the descriptor attribute from the State base class (Figure 19). From the MD model stored in an XML document, we can provide different presentations thanks to the use of XSLT stylesheets.

Figure 21. An example of a user's initial requirement

SELECT quantity_on_hand
FROM Inventory_Snapshot i, i.is_warehouse iw, i.is_product ip
WHERE iw.warehouse_subregion.subregion_region.region_state.State_name = "Valencia"
  AND ip.product_subgroup.subgroup_group.Name = "Grocery"
GROUP BY iw.warehouse_subregion.subregion_region.Region_name,
  iw.warehouse_subregion.Subregion_name,
  ip.product_subgroup.Name
ORDER BY iw.warehouse_subregion.Subregion_name,
  iw.warehouse_subregion.subregion_region.Region_name
Figure 22. Multidimensional model on the Web
For example, we use XSLT stylesheets and XML documents in a transformation process to automatically generate HTML pages that can represent different presentations of the same MD model. As an example of the applicability of our proposal, these HTML pages can be used to document the MD models on the Web, with the advantages that this implies (standardization, access from any computer with a browser, ease of use, etc.). Moreover, the automatic generation of documentation from conceptual models avoids the problem of out-of-date documentation (incoherencies, features not reflected in the documentation, etc.). For example, in Figure 22, we show the definition of Inventory Snapshot Star on a Web page. This page contains the general description of a star: name, description, and the names of the fact classes and dimension classes, which are active links that allow us to navigate through the different presentations of the model on a Web browser. All the information about the MD properties of the model is represented in the HTML pages.
A Large Bank
In this example, a DW for a large bank is presented. The bank offers a significant portfolio of financial services: checking accounts, savings accounts, mortgage loans, safe deposit boxes, and so on. This example introduces the following concepts:
• Heterogeneous dimension: a dimension that describes a large number of heterogeneous items with different attributes. Kimball's (1996) recommended technique is "to create a core fact table and a core dimension table in order to allow queries to cross the disparate types and to create a custom fact table and a custom dimension table for querying each individual type in depth" (p. 113). However, our conceptual MD approach can provide an elegant and simple solution to this problem, thanks to the categorization of dimensions.
• Categorization of dimensions: it allows us to model additional features for a dimension's subtypes.
• Shared classification hierarchies between dimensions: our approach allows two or more dimensions to share some levels of their classification hierarchies.
Figure 23 represents level 1, which comprises five star packages: Saving Accounts Star, Personal Loans Star, Investment Loans Star, Safe Deposit Boxes Star, and Mortgage Loans Star. For now, we will only center on the Mortgage Loans Star. The corresponding level 2 of this star package is depicted in Figure 24. Level 3 of Mortgage Loans Fact is shown in Figure 25. To avoid unnecessarily complicating the figure, three of the dimensions (Account, Time, and Status) with their corresponding hierarchies are not represented. Moreover, the attributes of the represented hierarchy levels have been omitted. The fact class (Mortgage Loans) contains four attributes that represent the measures:

Figure 23. Level 1
Figure 24. Level 2 of Mortgage Loans Star
Figure 25. Level 3 of Mortgage Loans Fact
Figure 26. Multidimensional model on the Web
Figure 27. Multidimensional model on the Web
total, balance, and payment_number are atomic, whereas debt is derived (the corresponding derivation rule is placed next to the fact class). None of the measures is additive. Consequently, the additivity rules are also placed next to the fact class. In this example, the dimensions present two special characteristics. On one hand, Branch and Customer share some hierarchy levels: ZIP, City, County, and State. On the other hand, the Product dimension has a generalization-specialization hierarchy. This kind of hierarchy allows us to easily deal with heterogeneous dimensions: the different items can be grouped together in different categorization levels depending on their properties.
Finally, this MD model can be accessed through the Web thanks to the use of XSLT stylesheets. In Figure 26, we show the definition of the Mortgage Loans Star schema. On the left hand side of this figure, we notice a hierarchy index that shows the five star schemas that the MD model comprises. From this Web page, if the Mortgage Loans link is selected, the page shown in Figure 27 is loaded. In Figure 27, the definition of the Mortgage Loans fact class is shown: the name of the fact class, the measures, the methods, and the shared aggregations. In this example, Mortgage Loans contains three measures: total, balance, and payment_number. Moreover, this fact class holds six aggregation relationships with the dimensions Account, Status, Time, Branch dim, Customer dim, and Product dim, which are active links that allow us to continue exploring the model on the Web.
The College Course
This example introduces the concept of the factless fact table (FFT): fact tables for which there are no measured facts. Kimball (2002) distinguishes two major variations of FFT: event tracking tables and coverage tables. In this example, we will focus on the first type.

Figure 28. Level 1
Figure 29. Level 2 of College Course Star
Figure 30. Level 3 of College Course Star
Event tracking tables are used when a large number of events need to be recorded as a number of dimensional entities come together simultaneously. In this example, we will model daily class attendance at a college. In Figures 28 and 29, levels 1 and 2 of this model are depicted, respectively. In this case, level 1 only contains one star package. Figure 30 shows level 3 of College Course Fact. For the sake of simplicity, the attributes and methods of every class have not been depicted in the figure. As shown, the fact class College Course contains no measures because it is an FFT. In FFTs, the majority of the questions that users pose imply counting the number of records that satisfy a constraint, such as: Which facilities were used most heavily? Which courses were the least attended? Regarding the dimensions, Course and Time present multiple classification hierarchies, Professor and Student share some hierarchy levels, and Facility presents a categorization hierarchy.
CONCLUSION
In this chapter, we have presented an OO conceptual modeling approach, based on the UML, to design DWs, MD databases, and OLAP applications. Structural aspects of MD modeling are easily specified by means of a UML class diagram in which classes are related through association and shared aggregation relationships. In this context, thanks to the flexibility and the power of the UML, all the semantics required for proper MD conceptual modeling are considered, such as many-to-many relationships between facts and particular dimensions, multiple path hierarchies of dimensions, the strictness and completeness of classification hierarchies, and categorization of dimension attributes. Regarding dynamic aspects, we provide a UML-compliant class graphical notation (called cube classes) to specify users’ initial requirements at the conceptual level. Moreover, we have sketched out how to represent a conceptual MD model accomplished by our approach in the ODMG standard as a previous step for a further implementation of MD models into OODB and ORDB. Furthermore, to facilitate the interchange of MD models, we provide an XML Schema from which we can obtain valid XML documents. Moreover, we apply XSLT stylesheets in order to provide different presentations of the MD models. Finally, we have selected three case studies from Kimball’s (2002) book and modeled them following our approach. This shows that our approach is a very easy-to-use yet powerful conceptual model that represents main structural and dynamic properties of MD modeling in an easy and elegant way. Currently, we are working on several issues. On one hand, we are extending our approach to key issues in MD modeling, including temporal and slowly changing dimensions. On the other hand, we are also working on the definition of a formal constraint language for ODMG that allows us to represent the MD
modeling properties necessarily ignored in the generation process from our UML-based approach.
REFERENCES
Abello, A., Samos, J., & Saltor, F. (2001). A framework for the classification and description of multidimensional data models. Proceedings of the 12th Database and Expert Systems Applications, Lecture Notes in Computer Science, 2113 (pp. 668-677). Springer-Verlag.
Blaschka, M., Sapia, C., Höfling, G., & Dinter, B. (1998). Finding your way through multidimensional models. Proceedings of the 9th International Workshop on Database and Expert Systems Applications, August 24-28 (pp. 198-203). IEEE Computer Society.
Booch, G., Rumbaugh, J., & Jacobson, I. (1998). The unified modeling language. Boston: Addison-Wesley.
Cattell, R. G. G., Barry, D. K., Berler, M., Eastman, J., Jordan, D., Russell, C., Schadow, O., Stanienda, T., & Velez, F. (2000). The object data standard: ODMG 3.0. San Francisco: Morgan Kaufmann.
Chaudhuri, S., & Dayal, U. (1997). An overview of data warehousing and OLAP technology. ACM Sigmod Record, 26(1), 65-74.
Giovinazzo, W. (2000). Object-oriented data warehouse design: Building a star schema. Upper Saddle River, NJ: Prentice Hall.
Golfarelli, M., Maio, D., & Rizzi, S. (1998). The dimensional fact model: A conceptual model for data warehouse. International Journal of Cooperative Information Systems, 7(2/3), 215-247.
Golfarelli, M., Rizzi, S., & Vrdoljak, B. (2001). Data warehouse design from XML sources. Proceedings of the ACM 4th International Workshop on Data Warehousing and OLAP (pp. 40-47), November 9. Springer-Verlag.
Kimball, R. (1996). The data warehouse toolkit (2nd ed.). New York: John Wiley & Sons.
Lehner, W. (1998). Modeling large scale OLAP scenarios. Proceedings of the 6th International Conference on Extending Database Technology, Lecture Notes in Computer Science, 1377 (pp. 153-167).
Luján-Mora, S., Trujillo, J., & Song, I.-Y. (2002). Multidimensional modeling with UML package diagrams. Proceedings of the 21st International Conference on Conceptual Modeling, Lecture Notes in Computer Science, 2503 (pp. 199-213). Springer-Verlag.
OMG, Object Management Group. (2001). Unified modeling language specification 1.4. Retrieved November 13, 2004, from http://www.omg.org
Pokorný, J. (2001). Modelling stars using XML. Proceedings of the ACM 4th International Workshop on Data Warehousing and OLAP (pp. 24-31), November 9.
Sapia, C. (1999). On modeling and predicting query behavior in OLAP systems. Proceedings of the 1st International Workshop on Design and Management of Data Warehouse, June 14-15. Heidelberg, Germany.
Sapia, C., Blaschka, M., Höfling, G., & Dinter, B. (1998). Extending the E/R model for the multidimensional paradigm. Proceedings of the International Workshop on Data Warehousing and Data Mining, 1552 (pp. 105-116). Springer-Verlag.
Trujillo, J. (2001). The GOLD model: An object oriented conceptual model for the design of OLAP applications. Unpublished doctoral dissertation, Universidad de Alicante, Department of Software and Computing Systems, Spain.
Trujillo, J., Luján-Mora, S., & Song, I.-Y. (2004). Applying UML and XML for designing and interchanging information for data warehouses and OLAP applications. Journal of Database Management, 15(1), 41-72.
Trujillo, J., Palomar, M., Gomez, J., & Song, I.-Y. (2001). Designing data warehouses with OO conceptual models. IEEE Computer, Special Issue on Data Warehouses, 34(12), 66-75.
Tryfona, N., Busborg, F., & Christiansen, J. G. B. (1999). starER: A conceptual model for data warehouse design. Proceedings of the ACM 2nd International Workshop on Data Warehousing and OLAP, November 6, Kansas City, KS.
W3C, World Wide Web Consortium. (1999). XSL transformations (XSLT) version 1.0. Retrieved November 13, 2004, from http://www.w3.org/TR/1999/REC-xslt-19991116/
W3C, World Wide Web Consortium. (2000). Extensible markup language (XML) 1.0 (2nd ed.). Retrieved November 13, 2004, from http://www.w3.org/TR/2000/REC-xml-20001006/
W3C, World Wide Web Consortium. (2001). XML schema. Retrieved November 13, 2004, from http://www.w3.org/TR/2001/REC-xmlschema-0-20010502/
Warmer, J., & Kleppe, A. (1998). The object constraint language: Precise modeling with UML. Reading, MA: Addison-Wesley.
ENDNOTES
1. A descriptor attribute will be used as the default label in the data analysis.
2. The ODMG defines an ODBMS as "[…] a DBMS that integrates database capabilities with object-oriented programming language capabilities" (Cattell, Barry, Berler, Eastman, Jordan, Russell, Schadow, Stanienda, & Velez, 2000, p.3).
3. In these figures, we use the following notation: the box with three linked dots represents a sequence of elements; the range in which an element can occur is shown with numbers (the default minimum and maximum number
of occurrences is 1), and graphically (a box with a dashed line indicates that the minimum number of occurrences is 0).
APPENDIX 1
In this section, we include the whole XML Schema that we have defined to represent MD models in XML. This XML Schema allows us to represent both structural and dynamic properties of MD models and initial requirements (cube classes). The annotated definitions of the schema cover, among others, the following elements (a sketch of the root element declaration is given after this list):
• Common attributes to different elements (id and name)
• Common attributes to dimension and fact classes
• Root element of the model
• Group of dependencies and dependency between two packages
• Group of star schema packages
• Fact package (second level)
• Fact class (third level)
• Group of attributes of a fact class
• Method of a fact or a base class
• Group of aggregations of a fact
• Group of dimension packages of a star schema
• Base class (third level)
• Attribute of a base class
• Group of relationships between classes
• Categorization relationship between two classes
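As a complement to this list, the following is a rough, hand-made sketch of how the root element and its first level of nesting could be declared; it is written only from the element names and multiplicities described in the chapter (MDMODEL, DEPENDENCIES, PKSCHEMAS), and the type names and attribute declarations shown are assumptions, not the published definitions.

<xsd:element name="MDMODEL">
  <xsd:annotation>
    <xsd:documentation>Root element of the model</xsd:documentation>
  </xsd:annotation>
  <xsd:complexType>
    <xsd:sequence>
      <!-- dependencies between star schema packages (level 1) -->
      <xsd:element name="DEPENDENCIES" type="DEPENDENCIES" minOccurs="0"/>
      <!-- the star schema packages themselves (level 1) -->
      <xsd:element name="PKSCHEMAS" type="PKSCHEMAS"/>
    </xsd:sequence>
    <xsd:attribute name="id" type="xsd:ID"/>
    <xsd:attribute name="name" type="xsd:string"/>
  </xsd:complexType>
</xsd:element>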
Chapter III
Does Protecting Databases Using Perturbation Techniques Impact Knowledge Discovery?
Rick L. Wilson, Oklahoma State University, USA
Peter A. Rosen, University of Evansville, USA
ABSTRACT
Data perturbation is a data security technique that adds noise in the form of random numbers to numerical database attributes with the goal of maintaining individual record confidentiality. Generalized Additive Data Perturbation (GADP) methods are a family of techniques that preserve key summary information about the data while satisfying security requirements of the database administrator. However, effectiveness of these techniques has only been studied using simple aggregate measures (averages, etc.) found in the database. To compete in today’s business environment, it is critical that organizations utilize data mining approaches to discover information about themselves potentially hidden in their databases. Thus, database administrators are faced with competing objectives: protection of confidential data versus disclosure for data mining applications. This chapter empirically explores whether data protection provided by perturbation techniques adds a so-called Data Mining Bias to the database. While the results of the original study found limited support for this idea,
stronger support for the existence of this bias was found in a follow-up study on a larger, more realistic-sized database.
INTRODUCTION
Today, massive amounts of data are collected by organizations about customers, competitors, supply chain partners, and internal processes. Organizations struggle to take full advantage of this data, and discovering unknown bits of knowledge in their massive data stores remains a highly sought after goal. Database and data security administrators face a problematic balancing act regarding access to organizational data. Sophisticated organizations benefit greatly by taking advantage of their large databases of individual records, discovering previously unknown relationships through the use of data mining tools, and knowledge discovery algorithms (e.g., inductive learning algorithms, neural networks, etc.). However, the need to protect confidential data elements from improper disclosure is another important issue faced by the database administrator. This protection concerns not only traditional data access issues (i.e., hackers and illegal entry) but also the more problematic task of protecting confidential record attributes from unauthorized internal users. Techniques that seek to accomplish masking of individual confidential data elements while maintaining underlying aggregate relationships of the database are called data perturbation techniques. These techniques modify actual data values to hide specific confidential individual record information. Recent research has analyzed increasingly sophisticated data perturbation techniques on two dimensions: the ability to protect confidential data and, at the same time, the ability to preserve simple statistical relationships in a database (means, variances, etc.). However, value-adding knowledge discovery and data mining techniques find relationships that are much more complex than simple averages (such as creating a decision tree for classifying customers, etc.). To our knowledge, only one previous study (Wilson & Rosen, 2003) has explored the impact of data perturbation techniques on the performance of knowledge discovery techniques. The present study expands on this initial study, better quantifying possible knowledge losses or the so-called Data Mining bias.
REVIEW OF RELEVANT LITERATURE
Data Protection through Perturbation Techniques
Organizations store large amounts of data, and most may be considered confidential. Thus, security and protection of the data is a concern. This concern
applies not just to those who are trying to access the data illegally but to those who should have legitimate access to the data. Our interest in this area relates to restricting access of confidential database attributes to legitimate organizational users (i.e., data protection). Data perturbation techniques are statistically based methods that seek to protect confidential numerical data by adding random noise to the original data elements. Note that these techniques are not encryption techniques, where the data is first modified, then (typically) transmitted, and, on receipt, reverted back to the original form. The intent of data perturbation techniques is to allow legitimate users the ability to access important aggregate statistics (such as mean, correlations, etc.) from the entire database while protecting the identity of each individual record. For instance, in a perturbed database on sales figures, a legitimate system user may not be able to access original data on individual purchase behavior, but the same user could determine the average of all individual purchasers.
Data perturbation methods can be analyzed using various bias measures (see Muralidhar, Parsa & Sarathy, 1999). A data perturbation method exhibits bias when the results of a database query on perturbed (i.e., protected) data produces a significantly different result than the same query executed on the original data. Four types of biases have previously been identified, termed Type A, Type B, Type C, and Type D. Type A bias occurs when the perturbation of a given attribute causes summary measures (i.e., mean value) of that individual attribute to change due to a change in variance. Type B bias occurs when perturbation changes the relationships (e.g., correlations) between confidential attributes. Type C bias occurs when perturbation changes the relationship (again, e.g., correlations) between confidential and nonconfidential attributes. Type D bias deals with the underlying distribution of the data in a database, specifically, whether or not the data has a multivariate normal distribution.
It was recently shown that all previously proposed methods suffered from one or more of the four aforementioned biases and thus have undesirable characteristics. Muralidhar et al. (1999) proposed the General Additive Data Perturbation (GADP) method that was free of all four biases. Therefore, these techniques remain the gold standard in data protection via perturbation (Sarathy & Muralidhar, 2002). The premise of these techniques is described below. In a database, we identify the confidential attributes that we would like protected (even from authorized users). All other attributes would be considered nonconfidential. GADP does not alter nonconfidential attributes. For all confidential attributes, the attribute value for each instance in the database will be perturbed (modified) from its original value in such a manner that the original statistical relationships of the database will be aggregately preserved. These relationships include the mean values for confidential attributes, the measures of covariance between the confidential and nonconfidential attributes, and the
canonical correlation between the two sets of attributes (confidential and nonconfidential). Appendix A provides a more extensive mathematical description of the GADP process for the interested reader.
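In compact form, the properties just listed can be summarized as follows (this is only a paraphrase of the stated requirements, not the formal GADP derivation, which is given in Appendix A). Writing X for the confidential attributes, S for the nonconfidential attributes, and Y for the perturbed values released in place of X, GADP generates Y such that, in aggregate,

    E[Y] = E[X],    Cov(Y) = Cov(X),    Cov(Y, S) = Cov(X, S),

and the canonical correlations between (Y, S) equal those between (X, S), while each individual released value y_i differs from the corresponding original x_i, so that no single record can be recovered.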
Proposing the Existence of Type DM Bias: The Initial Study
While the GADP method has been shown to protect confidential attributes appropriately and be theoretically bias-free, these biases represent a very limited view of the value-added capability of a database. Knowledge discovery or data mining techniques can identify underlying patterns in a database, often in decision tree or rule-based form, providing an organization deeper insight (knowledge) about that database. The four biases previously discussed focus only on simple parametric aggregate measures and relationships (mean, variance, covariance, etc.). An initial study hypothesized a deeper knowledge-related bias (referred to as a Data Mining bias) that may be incurred through these perturbation-based data protection techniques (Wilson & Rosen, 2003). To explore this possible bias, the study utilized a representative decision tree based knowledge discovery/data mining tool, QUEST (see Lim, Low & Shih, 2000) and examined its performance using two oft-used categorical data sets, IRIS and BUPA Liver, from the UCI Machine Learning Repository (Merz & Murphy, 1996). IRIS was chosen for the almost perfect linear separation of its class groups. This data set consists of 150 observations, four numerical independent variables (sepal length, sepal width, petal length, petal width), and the dependent (class) variable of iris plant type (Iris setosa, Iris versicolour, and Iris virginica). The data set is equally distributed among the three classes. The second data set utilized in the study was the BUPA Liver Disorders Database (Merz & Murphy, 1996). This data set consists of 345 observations, six numerical independent variables and a dependent class variable indicating presence or absence of a liver disorder. Two hundred of the observations were classified as group 1 (58%), and 145 of the observations were classified as group 2 (42%). Given the high error rate in this database that other researchers found in previous classification studies (e.g., Lim et al., 2000), it served as a complementary example to the easy-to-classify IRIS data set. Both data sets were perturbed using GADP and an older more naïve perturbation technique called Simple Additive Data Perturbation (SADP) (Kim, 1986). SADP was included as a lower bound in this study, as this technique suffers from all four biases mentioned above. This approach simply adds noise to confidential attributes in a database one at a time without consideration of the relationships between variables. These two perturbed data sets and the original two data sets were then analyzed using the QUEST inductive learning method Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
(Lim et al., 2000) to explore if the perturbation process impacts the performance of a data mining/knowledge discovery tool. The measure used to assess the impact was the predictive accuracy of class membership of the data sets using perturbed data (three categories for IRIS, two for Liver). The n-1 independent variables that had the highest correlation to the dependent class variable were selected as confidential attributes with the remaining attribute and class variable kept as nonconfidential. Obviously, results could be impacted by this method of variable selection. Standard tenfold cross validation (with stratification) was used to determine a robust measure of classification accuracy for the QUEST tool. A record was labeled correctly classified if the decision tree outcome on the perturbed data set matched the actual class value of the original database instance. The correct number of classifications was assessed both for the training (development) and testing partitions. Tables 1 and 2 illustrate the training and testing classification accuracy (with standard deviation) of the data mining tool on the IRIS and LIVER data sets, respectively, for the SADP, GADP, and original data sets. Also included is the by-class accuracy for each data set. SADP performed the worst in all cases, as expected, given the simple way in which it protects confidential data. The results for the IRIS database (considering testing only, the ability to generalize) show that the perturbed data was classified more accurately than the original data but not in a statistically significant manner. This unexpected finding was attributed to the linear separability of the data and to the idea that GADP possibly smoothed outlier observations. These results might also just be a spurious effect of a small database. The Liver data results were opposite the IRIS – the classification results on the original data were better than the GADP perturbed data but not statistically significant. Overall, at best, the study found marginal support for the Data Mining bias, primarily through the comparison of SADP to both the original and GADP perturbed data. From this study, it appeared that the underlying knowledge relationship that existed in the database could influence how the GADP method impacted the performance of a data mining tool. For example, the IRIS data set has linearly separable classes, but the initial study used a decision tree tool to assess knowledge discovery. Perhaps this mismatch of underlying knowledge relationship and discovery tool impacted classification results, in addition to the effects of the perturbation. Unlike IRIS, BUPA Liver has been historically difficult for any knowledge discovery tools to classify at high levels of accuracy. Thus, IRIS has very clean class separation while the Liver data set does not. Therefore, it was posited that tool effectiveness could be impacted by the database’s underlying knowledge structure (linear form vs. tree form vs. other forms, etc.) and the degree to
Table 1. IRIS results (correct classification accuracy)
                      TRAINING RESULTS                                TESTING RESULTS
              Setosa   Virginica  Versicolor  Overall       Setosa   Virginica  Versicolor  Overall
Original     100.00%    96.00%     96.00%     97.33%       100.00%    96.00%     96.00%     97.33%
  (std dev)    0.00%     0.94%      0.94%      0.62%         0.00%     8.43%      8.43%      5.62%
SADP          74.44%    88.44%     57.33%     73.41%        74.00%    80.00%     50.00%     68.00%
  (std dev)   14.37%     5.22%     25.66%      4.33%        26.75%    16.33%     28.67%     13.63%
GADP         100.00%   100.00%     99.56%     99.85%       100.00%   100.00%    100.00%    100.00%
  (std dev)    0.00%     0.00%      1.41%      0.47%         0.00%     0.00%      0.00%      0.00%
Significant Differences (Tukey HSD test) - GADP and Original >> SADP (p
Individual concepts: A distinction is made between generic and individual concepts, corresponding roughly to the difference between classes and their instances. To simplify description, individual concepts can be viewed as real-world objects.
Subsumption mechanism: A subsumption mechanism is used to classify the defined concepts. A concept named A subsumes a concept named B if the definition of A is more general than the definition of B. Using this basic mechanism and applying tests on all defined concepts, an inheritance hierarchy of concepts can be generated. This mechanism is a key point of the Strategic Hierarchy Builder described in this chapter.
We use the XML Schema (XML-Schema, 2001) formalism to present the various meta type definitions and examples. A kernel of the concepts used to define the meta types into XML Schema is given in Table 1. This kernel is called the generic meta type definition. It is an XML translation of the concepts described in Description Logic.
Basic Meta Types
Basic meta types represent the core of the meta model. They are used to define the description concepts of major database models. There are five basic meta types: a generic meta type "Meta"; two object-oriented meta types, "MComplex-Object" and "MSimple-Object"; and two link or association meta types, "MNary-Link" and "MBinary-Link". The topmost meta type in the generalization hierarchy is META. It has an empty structure and is defined by META = ([ ], CMeta, [ ]). CMeta comprises a set of data modeling constraints and operations shared by all meta types. For example, CMeta defines an ID function for uniquely identifying meta type instances i such as ID(i) = {Kj}, where {Kj} is a set of labels. The unicity property of the identifier ID is defined by the following rule: for each metatype M and for each label Kj ∈ ID(M), we define M.Kj, a function such that ∀ M ∈ META, ∀ i1, i2 ∈ M with i1 ≠ i2, ∃ Kj ∈ ID(M) | M.Kj(i1) ≠ M.Kj(i2). Table 2 presents the definition of META in XML Schema.

Table 2. META definition
Table 3. MComplex-Object definition

<xsd:complexType name="MComplex_Object">
  <xsd:sequence>
    <xsd:element name="Description" type="Description" minOccurs="0" maxOccurs="1"/>
    <xsd:element name="Attribute" type="Attribute" minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="Complex_Attribute_Type" type="Complex_Attribute_Type" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
  <xsd:attribute name="ComponentName" type="ComponentName"/>
</xsd:complexType>
Table 4. MSimple-Object definition
The meta type MComplex-Object represents objects or entities with a complex structure, such as the class type of the object-oriented model, the entity type of the entity-relationship model, and the record type of the Codasyl model. MComplex-Object is specified by CO = (ACO, CCO, PCO), where ACO is defined using the usual tuple and set constructions on meta-object types. The components CCO and PCO are not redefined at this level but are inherited from the super meta type META. The XML Schema description of the meta type MComplex-Object is given in Table 3. The meta type MSimple-Object is a specialization of MComplex-Object. It can be used to represent flat-structure modeling constructs, such as the tuple type of the relational model. It is defined by SO = (ASO, CSO, PSO), where the attribute structure ASO inherits the tuple structure ACO of meta type MComplex-Object but restricts the type of its components to primitive domains. The components CSO and PSO are inherited from the meta type MComplex-Object. The XML Schema definition of the meta type MSimple-Object is given in Table 4. To facilitate the reading of the meta type definitions generated by the X-Time tools, we give a graphical representation of these definitions (Figures 10, 11, 12, and 20). The attributes of XML elements do not appear in the graphical representation. For example, the attribute "ComponentName" of "MSimple_Object" in Table 4 does not appear in Figure 10.
Figure 10. MSimple-Object definition
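The body of Table 4 is missing in this copy, and only the caption of Figure 10 survives. A plausible sketch, modeled directly on the MComplex_Object definition of Table 3 but without the complex attribute types (since MSimple-Object restricts its components to primitive domains), might be:
<xsd:complexType name="MSimple_Object">
  <xsd:sequence>
    <xsd:element name="Description" type="Description" minOccurs="0" maxOccurs="1"/>
    <xsd:element name="Attribute" type="Attribute" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
  <xsd:attribute name="ComponentName" type="ComponentName"/>
</xsd:complexType>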
The meta type MNary-Link is used to model types that represent connections between real-world entities. It is defined by NL = (ANL, CNL, PNL), where ANL are the link attributes. The components CNL and PNL are not redefined at this level but are inherited from the super meta type META. The definition of MNary-Link is given in Table 5. The meta type MBinary-Link categorizes binary connections involving pairs of object types. It is a special case of MNary-Link and is defined by BL = (ABL, CBL, PBL). The XML Schema definition of MBinary-Link is given in Table 6.
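Tables 5 and 6, which appear below, have likewise lost their bodies in this copy. As an illustrative sketch only, a definition of MNary_Link in the same style could declare the link attributes plus references to the connected object types; the "Linked_Component" element name is an assumption, not the actual X-TIME vocabulary. MBinary_Link would then restrict Linked_Component to exactly two occurrences.
<xsd:complexType name="MNary_Link">
  <xsd:sequence>
    <xsd:element name="Description" type="Description" minOccurs="0" maxOccurs="1"/>
    <xsd:element name="Attribute" type="Attribute" minOccurs="0" maxOccurs="unbounded"/>
    <!-- references to the object types joined by the link (illustrative name) -->
    <xsd:element name="Linked_Component" type="Linked_Component" minOccurs="2" maxOccurs="unbounded"/>
  </xsd:sequence>
  <xsd:attribute name="ComponentName" type="ComponentName"/>
</xsd:complexType>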
Specific Meta Types
The specific meta types capture the semantics of data models. The initial basic meta type hierarchy is extended by adding specific data model meta types. For example, the table or relation concept of the relational data model is represented by the meta type MRelation. It is defined by REL = (AREL, CREL, PREL). MRelation specializes the meta type MSimple-Object. Thus, the structure AREL is the same as the structure ASO. MRelation refines the ID function defined in META to specify the relational primary key property; namely, a subsequence of attribute labels that uniquely identifies the tuples of a relation. Moreover, since a relation may in some cases represent a relationship between two or more real-world entities, MRelation also specializes the meta type MBinary-Link. Therefore, an edge is added between meta types MRelation and MBinary-Link, and the component CREL of MRelation must include constraints inherited from the link meta types. The constraints of CREL are defined by three axioms.
Axiom 1: The non-null property of a relational primary key is defined by:
∀R1 ∈ REL, ∀Ai ∈ R1, ∀Fk ∈ ID(R1), R1.Fk(Ai) ≠ NULL.
Axiom 2: An identifier is made of attributes, which determine in a single way the nonkey attributes of the relation:
∀R1 ∈ REL, ∀A1, A2 ∈ R1, ∀Fk ∈ ID(R1), R1.Fk(A1) ∧ R1(A1) ⇒ R1(A2).
Axiom 3: The notion of external identifier, where the key attributes of an instance R1 determine, in a single way, a subsequence of attributes of an instance R2, is defined by:
∀R1, R2 ∈ REL, ∀A1, A2 ∈ R1, A2 ∈ R2, ∃Fk1 ∈ ID(R1) with R1.Fk1(A1), ∃Fk2 ∈ ID(R2) where R2.Fk2(A2), R1(A2) ⇒ R2(A2).
The complete definition of the meta type MRelation is presented in Table 7. Another example of meta model extension is given by the inclusion of object-oriented model concepts in the meta type hierarchy.
Table 5. MNary-Link definition
Table 6. MBinary-Link definition
Table 7. MRelation definition
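The body of Table 7 is missing from this copy as well. Since MRelation specializes both MSimple-Object and MBinary-Link and refines the ID function into a primary-key property, a sketch of its XML Schema definition might read as follows; the "Primary_Key" and "Foreign_Key" element names are illustrative assumptions rather than the actual X-TIME vocabulary.
<xsd:complexType name="MRelation">
  <xsd:sequence>
    <xsd:element name="Description" type="Description" minOccurs="0" maxOccurs="1"/>
    <xsd:element name="Attribute" type="Attribute" minOccurs="1" maxOccurs="unbounded"/>
    <!-- refinement of the ID function: the key attributes of the relation (Axioms 1 and 2) -->
    <xsd:element name="Primary_Key" type="Primary_Key" minOccurs="1" maxOccurs="1"/>
    <!-- constraints inherited from the link meta types (Axiom 3) -->
    <xsd:element name="Foreign_Key" type="Foreign_Key" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
  <xsd:attribute name="ComponentName" type="ComponentName"/>
</xsd:complexType>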
Figure 11. MRelation definition
The concepts of the object-oriented model are represented by the meta types MClass and MInheritance-Link. MClass models the static (attributes and links) and behavioral (methods) aspects of an object. It is defined by MCLA = (ACLA, CCLA, PCLA). The structure ACLA combines the structures of ACO and ABL. The inheritance from ABL allows the representation of reference attributes. MClass refines the component PCO to introduce the behavioral aspect of objects. A detailed and formal description of methods is beyond the scope of this chapter; thus, the component PCLA is not presented. CCLA is empty. The definition of MClass is given in Table 8. The MInheritance-Link is a binary link. In addition to the constraints inherited from the meta type MBinary-Link, it maintains a subset constraint between the populations of the specialized and generalized meta types, which generalizes the concept of inheritance of the object-oriented data model. Moreover, MInheritance-Link specializes the structure of MBinary-Link to represent connections without attributes. The meta type MInheritance-Link is defined by MIHL = (AIHL, CIHL, [ ]). Table 9 gives the definition of MInheritance-Link.
Table 8. MClass definition
Figure 12. MInheritance-Link definition
Table 9. MInheritance-Link definition
Figure 13. Example of hierarchy construction without subsumption mechanism [figure: two different concept specification orders, (M1, M2, M3, M4, M5, M6) and (M1, M6, M3, M4, M5, M2), produce two different resulting hierarchies]
Strategic Hierarchy Builder (SHB) Tool
The cooperation of various information systems requires an extensible model. A solution proposed by Barsalou and Gangopadhyay (1992) uses a meta model in which meta concepts (or meta types) are organized in an inheritance lattice. The addition of new concepts is achieved by specialization of existing meta concepts without reorganization of the meta model structure. Thus, as shown in Figure 13, the meta model structure depends on the order in which information systems are integrated in the cooperation. An alternative approach, adopted in the X-TIME project, is to carry out a classification step before adding concepts to the meta model. This classification, which can be performed automatically, reorganizes the structure of the meta model every time a new set of modeling concepts is added to the cooperation. The resulting description logic specifications are classified into a generalization hierarchy to define an appropriate interoperability meta model. Automatic classification of meta types guarantees that the meta model is updated and reorganized when new data models are added to the interoperability environment. This can be done in three steps. First, the concepts of data models are mapped to description logic structures. Then, a subsumption mechanism is applied to the description logic structures to analyze and place them in the existing generalization hierarchy. Finally, the inheritance relationships are used to simplify the initial meta type definitions.
Figure 14. The meta model hierarchy including relational and object-oriented concepts
Semiautomatic subsumption and simplification steps are carried out when new information systems and the corresponding data models are added to an interoperable architecture to extend the meta model by inserting new modeling concepts based on their semantic similarities. This guarantees that a specific meta model is constructed for the set of interoperating models. From this generalization hierarchy, the Strategic Hierarchy Builder generates the corresponding XML schema by creating an XML definition for each meta type (XMLSchema, 2001). The semantic description of meta types is used to aid the generation of XML Schema structures for each meta type. A meta model for relational and object-oriented models is presented in Figure 14. A detailed presentation of the meta types and the subsumption mechanism can be found in (Nicolle, Cullot & Yétongnon, 1999).
Transformation Rule Builder (TRB) Tool
The Transformation Rule Builder is used to create schema and data translators for interoperable environments. To carry out schema translation, transformation rules are coupled with the generalization links of the meta type hierarchy for meta type instance mapping (Figure 15).
Figure 15. Rules and hierarchy of meta types [figure: meta types Mt-b, Mt-g, and Mt-d connected by the transformation rules Rbg, Rgb, Rdb, and Rbd; Rxy denotes a transformation rule and the arrows give the sense of the transformation]
These rules allow instance transformations between two meta types. They are formally expressed as first-order logic predicates and have the general form R(I1, M1, I2, M2), where I1 is the source meta schema, I2 is the target meta schema, and M1 and M2 are meta types. The rule produces the target meta schema I2 from the source meta schema I1 by converting all instances of meta type M1 in I1 to one or more instances of meta type M2. For example, R(IMSs, CO, IMSt, SO) is a basic object-oriented rule which translates all the MComplex-Object instances (CO) of an intermediate meta schema (IMSs) to instances of the meta type MSimple-Object (SO) in a target intermediate meta schema (IMSt). This rule is called basic because it handles basic meta types. A rule is called extended when it handles at least one nonbasic meta type. The knowledge of both the hierarchy and the deduced descriptions of meta types reduces the cost of the rule definition process. The number and type of transformation rules created by the Transformation Rule Builder depend on the resulting hierarchy. Let New-M be the new meta type that is added to the hierarchy. Several cases can be considered:
• If the new meta type (New-M) is a leaf that is connected to one or more parent nodes, then two-way transformation rules are created between New-M and each parent node.
• If the new meta type (New-M) is an internal node (connected to a child node), then a transformation rule is defined for each child node. In this case, pre-existing rules can be decomposed and reused.
In our example, the resulting hierarchy is presented in Figure 14, where MRelation inherits from the MSimple-Object and MBinary-Link meta types. Thus, three transformation rules must be defined: one to transform instances of the meta type MRelation into instances of MSimple-Object and MBinary-Link, another rule to transform instances of MSimple-Object to instances of MRelation, and a final rule to transform MBinary-Link instances into MRelation instances. This number of rules is independent of the cooperation size and the number of heterogeneous data models cooperating. Rules are not limited to converting a structure (or constraints) to other structures (or constraints). In some specific cases, rules can convert a structure into programs or constraints and inversely. Thus, the translation of a relational schema with key constraints into an object-oriented model gives an object schema with encapsulated methods. Figure 15 shows an example of transformation rules. The rule noted Rdb ensures the transformation of instances from concept MT-d to concept MT-b. The physical files resulting from the TRB process are XSL stylesheets that convert XML documents from the XML Schema document of a source meta type to the XML Schema document of a target meta type.
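To give a concrete, if simplified, impression of such a stylesheet, the sketch below implements the basic rule R(IMSs, CO, IMSt, SO) discussed earlier by rewriting every MComplex_Object element of a source intermediate meta schema as an MSimple_Object element. It is an assumption-laden illustration rather than an actual TRB output: the element names follow Tables 3 and 4, and the treatment of complex attribute types and of constraints is deliberately omitted.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- identity template: copy everything that is not rewritten by the rule below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- rule R(IMSs, CO, IMSt, SO): rewrite each MComplex_Object element as an
       MSimple_Object element, keeping its description and flat attributes -->
  <xsl:template match="MComplex_Object">
    <MSimple_Object ComponentName="{@ComponentName}">
      <xsl:apply-templates select="Description|Attribute"/>
    </MSimple_Object>
  </xsl:template>
</xsl:stylesheet>
A generated stylesheet of this kind could be applied by any standard XSLT 1.0 processor to produce the target intermediate meta schema.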
THE TRANSLATOR COMPILER
The construction of translators is done in three steps. The first step determines transformation paths. A transformation path joins two distant meta types in the meta model. For example, Figure 16 depicts a combination of transformation rules from MT-h to MT-g and MT-f. Transformation paths and the resulting meta schema transformations are stored in a schema translation library. Many studies have shown that using a combination of translation rules to map data models can reduce the cost or complexity of translation. Moreover, a translation path between two concepts can be reused partially or totally. The construction of transformation paths is achieved using Dijkstra's algorithm (Dahl, Dijkstra & Hoare, 1973). The second step consists of regrouping transformation paths to create a translation path. This step is achieved by successive amalgamations of transformation paths. First, transformation paths that possess the same source meta type and the same target meta type are regrouped. For example, as depicted in Figure 17, a first meta-schema transformation path from MT-h to MT-g is partially reused for a second meta-schema transformation from MT-h to MT-i. Next, paths between meta types generalizing a source model and meta types generalizing a target model are regrouped. In Figure 18, the translation path between a relational model (MRelation) and an object model (MClass, MInheritance-Link) is [MRelation, [MSimple-Object, MBinary-Link], [MComplex-Object, MBinary-Link], [MClass, MBinary-Link], [MClass, MInheritance-Link]]. The last step is the compilation of transformation rules according to the resulting translation path. This compilation generates a specific translator that allows the translation of a schema from a source to a target data model and inversely. The code of every rule is assembled to obtain this translator, and the resulting code is then optimized.
Figure 16. Transformation paths [figure: a fragment of the meta type hierarchy (Mt-a through Mt-h); legend: ---> : transformation rules from Mt-h to Mt-f and Mt-g]
Figure 17. Translation path reusing [figure: the same hierarchy extended with Mt-i; legend: ---> : transformation rules from Mt-b to Mt-i and Mt-g; —> : transformation rules reused for both translations]
Figure 18. Example of translation path
The Translator Compiler provides an XSLT stylesheet that allows the translation of a source XML document into a target one, together with the Java source code of the translator that combines the XML document and the XSLT stylesheet. Therefore, it is possible to build the corresponding executable file on the various platforms composing the cooperation. Using the meta type hierarchy and transformation rules for database interoperation makes the definition of new transformation rules easier because the definitions of two directly linked meta types are close. This method of transformation is independent of the number of heterogeneous models in the federation. Transformations are defined between concepts and not between models. To complete the definition of the resulting translators, we associate a specific XSLT stylesheet with each cooperating system. These stylesheets are used to render the final translated XML document in the target database syntax.
XML-GRAMMARS GENERATION
To complete the description of our solution, we detail in this section the automatic generation of XML grammars from the specification of legacy database concepts. Figure 19 represents the different levels of abstraction in a database and the corresponding grammars and documents used to represent them in an XML format. There are four levels of abstraction (meta model, model, schema, and data) that are structured on two layers (local and cooperative). The local layer represents the schema and data that are contained in the database.
Figure 19. XML-grammar construction process
The cooperative layer represents all levels of abstraction. At this layer, XSLT stylesheets (noted SS in a circle in Figure 19) are used to pass from one abstraction level to another. These XSLT stylesheets transform a source XML document into a target XML document (for the change of abstraction level). Similarly, XSLT stylesheets are used to convert from the cooperative layer to the local layer by taking into account the specific syntax of local data models to format the XML document. This conversion is made using a set of CASE tools and a meta model. Below we present the different levels of the cooperation layer: meta model, generic model, and XML grammar. To illustrate this process, we use the running example of Tables 10 to 13, whose starting point is the Oracle creation script of Table 10. The script describes machines, which make products. The specific relation XML grammar of Table 12 can be used to transform the example schema of Table 10 into a valid XML document. The resulting XML document is presented in Table 13. In a cooperative environment, this document is used to exchange the view of the local databases between the cooperating information systems. From the XML document presented in Table 13, we can produce the XML grammar used to exchange data through the network in XML documents. Table 14 presents the result of the transformation of the XML document of a relational schema into an XML grammar of relational data. The X-Time meta model consists of five XML documents, which are XML Schemas: "Generic_Metatype_Definition", "Links_Definition", "Simple_Type_Definition", "Specific_Metatypes", and "X-TIME_Metamodel_Definition".
Table 10. Oracle creation script
DROP TABLE PRODUCT CASCADE CONSTRAINTS;
DROP TABLE MACHINE CASCADE CONSTRAINTS;
DROP TABLE MANUFACTURE CASCADE CONSTRAINTS;
CREATE TABLE PRODUCT (
  PRODUCT_NUM NUMBER(4) NOT NULL,
  PRODUCT_NAME VARCHAR(32),
  PRODUCT_COST NUMBER(13,2),
  CONSTRAINT PK_PRODUCT PRIMARY KEY (PRODUCT_NUM));
CREATE TABLE MACHINE (
  MACHINE_NUM NUMBER(4) NOT NULL,
  MACHINE_NAME VARCHAR(32),
  CONSTRAINT PK_MACHINE PRIMARY KEY (MACHINE_NUM));
CREATE TABLE MANUFACTURE (
  MACHINE_NUM NUMBER(4) NOT NULL,
  PRODUCT_NUM NUMBER(4) NOT NULL,
  CONSTRAINT PK_MANUFACTURE PRIMARY KEY (MACHINE_NUM, PRODUCT_NUM));
ALTER TABLE MANUFACTURE ADD (
  CONSTRAINT FK_MACHINE FOREIGN KEY (MACHINE_NUM) REFERENCES MACHINE (MACHINE_NUM));
ALTER TABLE MANUFACTURE ADD (
  CONSTRAINT FK_PRODUCT FOREIGN KEY (PRODUCT_NUM) REFERENCES PRODUCT (PRODUCT_NUM));
Table 11. XML grammar of a generic relational model
The "Simple_Type_Definition" file represents the lowest level of the XML grammars. It defines simple types, which are restrictions of the string type defined in the W3C XML Schema recommendation. These types are used by the other schemas through the import facility of W3C XML Schema. The other schemas consist of complex structures constituting the complex types of XML Schema. The last XML schema, "X-TIME_Metamodel_Definition", is an aggregation of the other XML schemas and their imports. It imports all the XML schemas directly, except the schema constituting the simple types, and it constitutes the highest level of the XML grammars. This way of proceeding has several advantages: greater flexibility in schema generation, the fact that modifying a schema does not modify the schemas that import it, and the division of complexity over several schemas instead of only one. The grammar presented in Table 14 is used to generate a document that contains data resulting from queries. An example of such an XML document is given in Table 16. This document is the result of the following query (Table 15) sent to the local database.
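The listing of "X-TIME_Metamodel_Definition" itself is not reproduced in this chapter. Assuming that the five documents share a single target namespace, the aggregating schema might amount to little more than a series of includes, along the following lines (the file names are illustrative; if the schemas used distinct namespaces, xsd:import with a namespace attribute would be used instead):
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:include schemaLocation="Generic_Metatype_Definition.xsd"/>
  <xsd:include schemaLocation="Links_Definition.xsd"/>
  <xsd:include schemaLocation="Specific_Metatypes.xsd"/>
  <!-- Simple_Type_Definition.xsd is not referenced here; as described above,
       it is imported by the other schemas themselves -->
</xsd:schema>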
Table 12. XML grammar of a specific relational schema
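The body of Table 12 is not reproduced in this copy. As a rough indication only of what such a schema-level grammar could look like (every element and attribute name here is an assumption rather than the actual X-TIME vocabulary), a minimal version might be:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Relational_Schema">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Relation" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="Attribute" maxOccurs="unbounded">
                <xsd:complexType>
                  <!-- attribute name, SQL type, key membership, and foreign-key reference -->
                  <xsd:attribute name="Name" type="xsd:string"/>
                  <xsd:attribute name="Type" type="xsd:string"/>
                  <xsd:attribute name="Key" type="xsd:boolean" use="optional"/>
                  <xsd:attribute name="References" type="xsd:string" use="optional"/>
                </xsd:complexType>
              </xsd:element>
            </xsd:sequence>
            <xsd:attribute name="Name" type="xsd:string"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>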
Table 13. XML document of a relational schema
[The XML markup of Table 13 is lost in extraction; the recovered element content is:]
MACHINE: MACHINE_NUM NUMBER(4); MACHINE_NAME VARCHAR(32)
MANUFACTURE: MACHINE_NUM NUMBER(4) (MACHINE_TYPE); PRODUCT_NUM NUMBER(4) (PRODUCT)
PRODUCT: PRODUCT_NUM NUMBER(4); PRODUCT_NAME VARCHAR(32); PRODUCT_COST NUMBER(13,2)
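Only the element content of Table 13 survives above. Under the illustrative grammar just sketched for Table 12 (again, the element and attribute names are assumptions, not the original X-TIME markup), the corresponding document would look roughly like this:
<Relational_Schema>
  <Relation Name="MACHINE">
    <Attribute Name="MACHINE_NUM" Type="NUMBER(4)" Key="true"/>
    <Attribute Name="MACHINE_NAME" Type="VARCHAR(32)"/>
  </Relation>
  <Relation Name="MANUFACTURE">
    <Attribute Name="MACHINE_NUM" Type="NUMBER(4)" Key="true" References="MACHINE"/>
    <Attribute Name="PRODUCT_NUM" Type="NUMBER(4)" Key="true" References="PRODUCT"/>
  </Relation>
  <Relation Name="PRODUCT">
    <Attribute Name="PRODUCT_NUM" Type="NUMBER(4)" Key="true"/>
    <Attribute Name="PRODUCT_NAME" Type="VARCHAR(32)"/>
    <Attribute Name="PRODUCT_COST" Type="NUMBER(13,2)"/>
  </Relation>
</Relational_Schema>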
Figure 20. XML grammar of relational data
Table 14. XML grammar of relational data
Table 15. Example of SQL query
SELECT *
FROM MACHINE, MANUFACTURE, PRODUCT
WHERE MANUFACTURE.MACHINE_NUM = MACHINE.MACHINE_NUM
  AND MANUFACTURE.PRODUCT_NUM = PRODUCT.PRODUCT_NUM
  AND PRODUCT.PRODUCT_COST >= 1999.99;
Table 16. Example of XML document containing query result [the XML markup is lost in extraction; the recovered values are: 15 Transvax-12000, 16 Transvax-12000, 44 Transvax-1500]
CONCLUSION
We have presented an XML-based integration approach and toolkit, called X-TIME, to support the development of B2B applications. It uses an adaptable semantic-oriented meta model that defines a set of meta types for representing meta-level semantic descriptors of data models found in Web sites. The meta types are organized in a generalization hierarchy to capture semantic similarities between modeling concepts. The solution allows the creation of cooperative environments for B2B platforms and provides the first layers for the development of a Web services integration platform. We propose a set of tools based on the meta type hierarchy to support the development of B2B applications in cooperative or interoperable architectures. The toolkit provides tools for automating the construction of the meta type generalization hierarchy, for defining and associating transformation rules with paths of that hierarchy, and for generating the data model translators required by an application to map one Web format into another representation model. Our future work will focus on using Web-enabled semantic tools for an integration of Web services based on the semantic descriptions of the services.
REFERENCES
Abiteboul, S., Buneman, P., & Suciu, D. (2000). Data on the Web, from relations to semi-structured data and XML. Morgan Kaufmann. Retrieved November 29, 2004, from http://mkp.com adXML. (2004). Advertising for XML. Retrieved November 29, 2004, from http://www.adxml.org/ Atzeni, P., & Torlone, R. (1997). MDM:A multiple-data-model tool for the management of heterogeneous database schemes. Proceedings of the SIGMOD International Conference (pp. 538-531). B2B. (2004). Definition. Retrieved November 29, 2004, from http:// searchebusiness.techtarget.com/sDefinition/0,,sid19_gci214411,00. html
Barsalou, T., & Gangopadhyay, D. (1992). An open framework for interoperation of multimodel multidatabase systems. Proceedings of the 8th International Conference of Entity-Relationship Approach (pp. 218-227), Anaheim, California. Batini, C., & Lenzerni, M. (1983). A methodology for data schema integration in the entity-relationship model. Proceedings of the 3rd International Conference on Entity-Relationship Approach (pp. 413-420), North Holland. Benslimane, D., Leclercq, E., Savonnet, M., Terrasse, M. N., & Yétongnon, K. (2000). On the definition of generic multi-layered ontologies for urban applications. International Journal of Computers, Environment and Urban Systems, 24, 191-214. BRML. (2004). Business rules markup language. retrieved July 25, 2004 [Online]. Available: http://www.oasis-open.org/cove Conrad, A. (2001). A survey of Microsoft SQL Server 2000 XML features. Microsoft Corporation. Retrieved November 29, 2004, from http:// msdn.microsoft.com/library/default.asp?url=/library/en-us/dnexxml/ html/xml07162001.asp CORBA (2004), Common Object Request Broker Architecture, Retrieved November 29, 2004, from http://www.corba.org/ Dahl, O. J., Dijkstra, E. W., & Hoare, C. A. R. (1973). Structured programming. London/New York: Academic Press. Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J., & Widow, J. (1995). Integrating and accessing heterogeneous information sources in TSIMMIS. Proceedings of the AAAI Symposium on Information Gathering (pp. 61-64), Stanford. GPSML. (2004). Global positioning system markup language. Retrieved November 29, 2004, from http://www.chaeron.com/gps Hoppe, T., Kindermann, C., Quantz, J. J., Schmiedel, A., & Fischer, M. (1993, March). Back V5, tutorial and manual (Projekt KIT-BACK), Technische Universitat Berlin, Germany. IBM. (2004). DB2 universal database for Linux, UNIX and Windows. Retrieved November 29, 2004, from http://www-306.ibm.com/software/data/db2/ udb/ Keim, D. A., Kriegel, H. P., & Miethsam, A. (1994). Query translation supporting the migration of legacy database into cooperative information systems. Proceedings of the 2nd International Conference on Cooperative Information Systems (pp. 203-214). Lane, L., Passi, K., Madria, S., & Mohania, M. (2002). XML scheme integration to facilitate e-commerce. In A. Dahanayake & W. Gerhardt (Eds.), Webenabled system integration: Practices and challenges (pp. 000-000). Hershey, PA: Idea Group.
Nicolle, C., Cullot, N., & Yétongnon, K. (1999). SHB: A strategic hierarchy builder for managing heterogeneous databases. Proceedings of the International Database Engineering & Applications Symposium (pp. 8286), Montreal, Canada. Nicolle, C., Jouanot, F., & Cullot, N. (1998). Translation tools to build and manage cooperative heterogeneous information systems. Proceedings of the 18th International Conference on Database, Brno, Czechoslovakia. Oracle. (2004). Information architecture, XML-based integration solutions for 11i. Retrieved November 29, 2004, from http://www.oracle.com/appsnet/ technology/ia/content.html Pardi, W. J. (1999). XML, in action. Microsoft Press. Retrieved November 29, 2004, from http://www.microsoft.com/france/mspress Sheth, A. P., & Larson, J. A. (1990). Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys, 22, 000-000. SOAP. (2003). SOAP version 1.2 part 0: Primer. W3C Recommendation. Retrieved November 29, 2004, from http://www.w3.org/TR/soap12-part0/ UDDI. (2003). Version 3.0.1, UDDI spec technical committee specification. Retrieved November 29, 2004, from http://uddi.org/pubs/uddi_v3.htm W3C. (2001). W3C workshop on Web services: Position papers. Retrieved November 29, 2004, from http://www.w3.org/2001/03/WSWS-popa/ Woods, W. A., & Schmolze, J. G. (1992). The KL-ONE family. Computer Mathematic Application, 23, 133-177. WSDL. (2001). Web services description language. Retrieved November 29, 2004, from http://www.w3.org/TR/wsdl XCBF. (2004). XML common biometric format. Retrieved November 29, 2004, from http://oasis-open.org/committees/xcbf/ XML. (2004). Extensible markup language (XML) (3rd ed.). W3C Recommendation. Retrieved November 29, 2004, from http://www.w3.org/TR/2004/ REC-xml-20040204/ XML Acronym Search Results. (2004). Retrieved November 29, 2004, from http://eccnet.eccnet.com/acronyms/acronyms.php XML-Schema. (2001). XML schema part 0: Primer. W3C Recommendation. Retrieved November 29, 2004, from http://www.w3.org/TR/xmlschema-0/ XSLT. (1999). XSL transformations (XSLT), version 1.0. W3C Recommendation. Retrieved November 29, 2004, from http://www.w3.org/TR/xslt Yan, L. L., & Ling, T. W. (1992). Translating relational schema with constraints into OODB schema. Proceedings of the IFIP WG2.6 Database Semantic Conference on Interoperable Database Systems (pp. 69-85), Lorne, Victoria, Australia.
Chapter XV
Possibility Theory in Protecting National Information Infrastructure
Richard Baskerville, Georgia State University, USA
Victor Portougal, University of Auckland, New Zealand
ABSTRACT
Possibility theory is an alternative to probability theory as a basis for security management in settings where information resources are elements of national information infrastructure. Probability theory is founded on the assumption that one cannot totally rule out an intrusion. Possibility theory operates under a contrasting assumption. While persistent, well-supported, and highly professional intrusion attacks will have a higher probability of success, operating instead against the possibility of intrusion places defenders in a theoretical framework more suitable for high-stakes protection. The purpose of this chapter is to introduce this alternative quantitative approach to information security evaluation. It is suitable for information resources that are potential targets of intensive professional attacks. This approach operates from the recognition that information resource security is only an opinion of officials responsible for security.
INTRODUCTION
Probability theory is the most commonly used theoretical base in the process of risk analysis for information systems security. When dealing with multiple intrusion events that justify using probability theory models, certain assumptions are made about distributions of the intruder population and of the intrusion events. For example, commerce settings assume that the intruder population is distributed normally, while the intrusion events have a Poisson distribution (Baskerville & Portougal, 2000). These assumptions are not valid in security settings where the intrusions are rare events; intruders are very highly skilled and intrusion events persistent, that is, attacks that draw on infinite intrusion resources. Such security settings include those where national information infrastructures become enveloped in a battle space subject to information warfare attacks. Information warfare and the defense of national information infrastructures have become a concern in national defense and law enforcement policies. Information warfare consists of both digital information operations and perception management. Perception management regards information operations that aim to affect the perception of others in order to influence their emotions, reasoning, decisions, and ultimately actions (Denning, 1999). Perception management is closely related to psychological operations (PSYOPS) that influence behavior through fear, desire, logic, and other mental factors. It also is closely associated with propaganda, the spreading of ideas, information, or rumor for the purpose of helping an institution, a cause, or a person or to damage an opposing cause. The scope of digital information warfare includes the protection and compromise of computer-based technology that bears information. For the purposes of this chapter, we will exclude perception management using a mass communications media and the underlying perception management basis for shaping information content in digital media with lies, distortions, denouncements, harassment, advertising, and censorship. Our focus is mainly on disruptive offensive information operations intended to interfere with proper functioning of computers and networks (Baskerville, 2004). We are more concerned with how electronic commerce slips into the broader (real) battle space. This slip can occur in at least three ways. (1) Most prevalent are the possible attacks on national critical infrastructures. A battlefield opponent gains advantage when an enemy’s energy supplies, finances, communications, and transportation systems (including both shipment of goods and passengers) are disrupted in such a way as to interfere with their ability to sustain combat operations. (2) The ability to share critical attack information among businesses engaged in such infrastructures forms a second target arena in which information warfare attacks may be levied against electronic commerce. (3) Further, businesses that are developing militarily useful systems, such
as information weaponry, defense systems, and software. When electronic commerce engages mixed civilian and military operations, it becomes a legitimate target for military information warfare (Delibasis, 2002). The United States national commission on critical infrastructure protection delivered its report in 1998. The report detailed national vulnerabilities to a wide range of threats and a lack of awareness on the part of the general public. It recommended new national organizations, new laws, research and development, and a broad program of awareness and education. It also suggested infrastructure protection through industry cooperation and information sharing (Neumann, 1998). In the wake of this report, the U.S. created the National Infrastructure Protection Agency, jointly operated by the Defense Department, the Federal Bureau of Investigation, and other national establishments. In early 2000, the U.S. also funded an Institute for Information Infrastructure Protection, housed at the National Institute of Standards and Technology (Bacheldor, 2000). This new institute establishes a system for sharing information about intrusions, creates Internet security education initiatives, and develops systems for protecting the federal government's computers. The potential for offensive information operations against critical national infrastructures makes this form of warfare a direct threat against such essential services as utilities, banks, transportation businesses, and so forth (Berkowitz, 2001). Electronic business, whether a critical infrastructure or not, centralizes network-based digital communication. Such systems include electronic data interchange, supply chain management systems, online retailing, and so forth. A firm understanding of the vulnerability of such business systems to information operations arises when we briefly consider some of the major areas of critical infrastructures. These include information and communications, physical distribution, energy, banking and finance, and vital human services. All of these areas are dominated in many countries by private enterprises and, at least in concept, are vulnerable to attacks via their data networking connections (The President's Commission on Critical Infrastructure Protection, 1997). For example, information and communications infrastructures include the public telecommunications network, the Internet, computers, and software. This information-intensive industry sector is subject to attack through its own product, its networks, which include its own computerized control systems. Businesses involved in the physical distribution of goods and passengers include the rail, air, and merchant marine industries. Such businesses can be attacked through their networked control systems that manage tickets, routing, staffing, and tracking. Businesses in the energy sector manufacture and distribute electricity and petroleum products. These businesses are increasingly dependent on SCADA computer networks (supervisory control and data acquisition systems). These networks are frequently accessible, albeit indirectly, through the Internet and have proven vulnerable to electronic attacks (Verton, 2003).
The banking and finance sector includes banks, financial services, investment companies, and exchanges. These companies and their related government agencies operate payment interchange systems that provide potential electronic avenues for disruption. Varied groups have reason to undermine a society and destroy public confidence in social and political structures. Aside from hostile nation-states, international and domestic terrorism and organized crime also have the demonstrated potential to support persistent high-quality intrusions into a national information infrastructure (Lukasik, Greenberg & Goodman, 1998). Probability theory is not a good basis for security in the case of high-stakes information resources where these resources are subject to attacks on national infrastructure or to battle space illumination. The common assumption is that security and risk management requires balancing information technology vulnerabilities against an intruder’s ability to exploit these weaknesses. “Simply stated, an organization can’t protect everything—trying to protect everything is the same as protecting nothing” (Panda & Giordano, 1999, p. 31). Thus, security designs formulated under the assumptions of probability theory ultimately operate with a cost-benefit orientation. Protection becomes selective, aimed at the most probable attacks. For example, under probability theory, intruders who attempt to break into a black network are unlikely to have a direct monetary goal in mind. Their goal is the sheer enjoyment of vandalizing networks and network resources or, like graffiti artists, for the sheer fame. A probability framework would seek to provide reliable protection against intrusion by skilled but “amateur” intruders. Probability theory induces us to believe that one cannot totally rule out an intrusion. This resignation is based on probability thinking that is productive for multiple repeatable events. However, in the case of national infrastructure protection, a major harm may be done by a single intrusion or other unique event. Such framework is not covered by probability theory. Persistent, well-supported, and highly professional intrusion attacks will have a higher possibility of success. If, instead, we operate against the possibility of intrusion, we find ourselves in a theoretical framework more suitable for the protection of complex national information infrastructures. A shift to possibility theory opens a different range of assumptions for risk calculations. Figure 1 illustrates a comparison of several different calculation assumptions when considering five outcomes: A, B, C, D, and E. For example, certainty is shown as the absolute knowledge that outcome B will occur and none of the others, that is, B=1, A=0, C=0, D=0, E=0. Probability Theory is shown as the familiar mutually exclusive situation where A=0.25 (meaning there is a 25% chance of outcome A), B=0.5 (meaning there is a 50% chance of outcome B), and so forth. Under probability theory, the sum of the probabilities of all possible outcomes will always equal 1. Possibility theory is shown with four possible outcomes (A-D) and one impossible outcome (E=0). Outcomes B and C are not
mutually exclusive outcomes, and it is almost the case that they will happen for sure (B=1, C=1). Recognizing that, in both theory and life, sure-thing possibilities are rare, B and C are shown with possibilities of slightly below 1. The three outcomes B, C, and E demonstrate one essential difference between probability theory and possibility theory, namely, that the total possibility of all outcomes in possibility theory may be greater than one. Notice that outcomes A and D have values other than one or zero. Most proponents of possibility theory accept that the set of possibilities may be imprecise or fuzzy. Such fuzzy estimates may take on a functional character, such as the implication that outcome A is "rather" possible and expressing this opinion as A=0.75, or the implication that outcome B is "somewhat" possible and expressing this opinion as B=0.5. Alternatively, fuzzy estimates can take on a set character, shown in Figure 1 as Fuzzy, in which outcome A is "rather" possible and expressed as some set of values between 0.65 and 0.8. Possibility theory is useful when designing security against a special group of intruders: professionals with extensive training, extensive resources, and plenty of dedicated time. These professionals earn valuable economic or social-symbolic goods (such as national honors, medals, parades, or offices) as a goal of their intrusion activity. This group is relatively small and does not show statistical coherence. For this reason, a criterion such as the probability of break-in during a fixed interval cannot be used. Of course, ordinary information systems with highly sensitive information need protection from ordinary hackers. Intrusion detection methods have been developing over the past half-decade largely in response to corporate and government break-ins (Durst et al., 1999). Intrusion detection attempts to discover attacks, preferably discover them while they are in progress, and at least discover them before much damage has been done.
Figure 1. Comparisons of risk theories (adapted from Joslyn, 1994) [figure: four panels, Certainty, Probability Theory, Possibility Theory, and Fuzzy, each plotting values between 0 and 1 for the outcomes A through E]
Automation of intrusion detection is typically premised on automated definition of misuse instances. This automation requires pattern recognition techniques across large databases of historical data. Methods for data mining clearly have contributed to making such intrusion detection feasible (Bass, 2000; Zhu, Premkumar, Zhang & Chu, 2001). These approaches have been growing in sophistication and include expert systems, keystroke monitoring, state transition analysis, pattern matching, and protocol analysis (Biermann, Cloete & Venter, 2001; Graham, 2001). But intrusion detection approaches, thus far, remain a probabilistic enterprise with less than a 100% chance of detecting all types of intrusion. Indeed, the race between intruder technology and intrusion detection will likely remain a closely run contest (Schultz, 2002). Intrusion detection tools are necessary but not sufficient for the high-stakes information resources subject to attacks on national infrastructure or battle space illumination. In this chapter, we introduce an alternative quantitative approach to information security evaluation that is suitable for information resources that are potential targets of intensive professional attacks. This approach originates in the recognition that the safety of all information resources in an organization is only an opinion of officials responsible for information security. The set of security measures introduced to protect information resources depends on the experience of security personnel, their perceptions, and their attitude toward risk. A risk-avoiding person in this position would be inclined to exaggerate the dangers and to introduce much more security than necessary (Corr, 1983). In addition to the direct costs of the security safeguards, this unnecessary security can ruin the effectiveness of the organization, as the security functions consume the same computing resources as the direct information production functions (Baskerville, 1992). The existing quantitative approaches to security evaluation help to overcome this danger and introduce an optimum level of security. The new quantitative approach described below also allows an objective comparison of different security devices. Moreover, it shows the insufficiency of the current set of security devices. As a result, two critical security measures are developed: a protection interval and a night watch.
GOAL AND CRITERION
The fundamental goal of security safeguards is to increase security of an information resource. This goal is traditionally expressed as a criterion in terms of probability of compromising the information resource. This probability should be minimized. The probability criteria are sensible in such a case because the threat involves a large population of hackers with wide variations in both skill and dedication to their intrusion tasks. In contrast, attacks on national infrastructure and resources for battle space illumination imply a small population with intense
training, the highest skills, and the strongest dedication to the intrusion task. Such intruders would represent the tiny group of individuals found several standard deviations above the statistical mean of the hacker population. Because of the microscopic size of the statistical evidence available in these attacks, probabilistic expressions of risk and security make no sense. Any probability, even slight, is of serious concern. In this setting, it is more sensible to introduce a possibility measure (Dubois & Prade, 1988). The construction of such a measure is as follows. The event of interest, A in our case, is an intrusion during a time interval t. Definitely, for a given system of security measures, the likelihood of this event depends on the duration of this time interval. We can assume that there is an individual who can evaluate this likelihood and furnish it as a number g(A). The mere logic of events suggests:
1. g(A0) = 0, where A0 refers to the event A during a zero time interval.
2. g(A∞) = 1, which means that during an infinite interval of time, the event A∞ will happen for sure.
3. The coherence of this estimate means that the likelihood of event A during a greater interval is no less than for a smaller one. Let us consider two events: event A, an intrusion during a time interval t1, and event B, an intrusion during a time interval t2, with t1 ≤ t2. Then A ⊆ B ⇒ g(A) ≤ g(B).
4. ∀A, B, g(A∪B) ≥ max(g(A), g(B)).
In the case of equality, we obtain a measure function Π such that ∀A, B, Π(A∪B) = max(Π(A), Π(B)). Zadeh (1978) calls such functions possibility measures. In a nested sequence of events ω, A0 ⊆ A1 ⊆ … ⊆ An ⊆ …, (a) the maximum is always reached on the last event under consideration, and (b) Π = g, which means that the possibility function consists of the initial estimates.
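To make the max rule concrete, suppose (purely as illustrative figures) that an official judges the possibility of an intrusion within four hours of operation to be 0.1 and the possibility of an intrusion within eight hours to be 0.3. Because the first event is nested within the second, A4 ⊆ A8, their union is simply the eight-hour event, and Π(A4 ∪ A8) = max(0.1, 0.3) = 0.3. Under probability theory the two estimates would have to be reconciled into a single additive distribution; under possibility theory they may simply coexist, provided the larger interval never receives the smaller value.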
One important result in Zadeh’s (1978) possibility theory is a connection between the possibility function and fuzzy sets. A possibility function may be considered as a membership function of a fuzzy set ω, thus justifying the use of the rather developed theory of fuzzy sets. Empirically, the evaluation of the membership function of w is a matter of opinion (an issue of psychometry), and can be implemented, for example, by a questionnaire (Dubois & Prade, 1988). In practice, one can get a rough idea of the form of this function that will be adequate for applications. Well-known probability distributions will approximately model possibility distributions. For example, Figure 2 presents possibility functions developed by experts for a bastion host (a highly secured host node on a network) (Siyan & Hare, 1995). If we assume that an intruder needs to use some sophisticated techniques for compromising the host, then the form of this function most naturally follows normal distribution. The parameters of the distribution are left for expert evaluation (in Figure 2 for the normal distribution: mean = 19 hours, standard deviation = 5 hours). Suppose the intruder can obtain the password for this host node by some means. For example, these means could include electronic “password sniffing” or back channel creation through intelligence means. In such a case, the corresponding distribution is more likely to be exponential (Figure 2: the exponential curve). The difference between these two cases is as follows. In the first case, we assume that some work (which will take some time) is necessary prior to obtaining the passwords or creating the back channels. In the second case, we expect the break-in will happen any time with equal possibility. The possibility functions shown in Figure 2 (these have the same mean, although different standard deviations) reflect this real situation. In the first case, for the normal distribution, the possibility of break-in at an early stage is small because of the small possibility of breaking tough codes or passwords quickly. At this early stage, the possibility of finding a loophole in the security system is more significant. Later, when elapsed time suggests that the codes have been broken, the possibility of a completed break-in grows quickly (compared to the situation in the second case, the exponential curve, where it grows at the same rate). Following from the definition of possibility, if we expect both types of hackers to be working at the same time, then the possibility function at any time is equal to the maximum value of these two functions. The convenience in the use of known statistical distributions for possibility modeling is evident, provided they truly reflect the possibility pattern. However, it is necessary to thoroughly understand the conventions of this use. The possibility axiom, as stated above, tells us that the possibility of an event at time zero should be zero. A normal distribution, however, does not permit “zero probability.” For example, in Figure 2, the probability at point zero is actually 0.000159. Although it is very small, this deficiency in the model must be understood. Similarly, we do not have a possibility equal to 1. This is less
important because this part of possibility curves is less useful: we do not consider secure information resources to be functional when they have significantly high possibilities of compromise (e.g., with possibilities of being compromised more than 0.5). The conceptual contrast between probability and possibility is an important element in understanding possibility theory (Dubois & Prade, 1988). When our knowledge of the occurrence of events can be given in the form of observed frequencies, we can determine a confidence measure, P, as a probability measure. The probability measure is more precise, and as soon as we can obtain it, we can use it instead of the possibility measure. The alternative use of possibility, though, has important benefits as well. These come from a complementary events axiom: the probability of an event completely determines the probability of the contrary event. For example, if the probability of failure is 0.3, then the probability of success is 0.7. However, the possibility of an event and the contrary event may only be weakly linked. For example, failure may be “highly possible”, and success may also be “highly possible”. Unlike probabilities, the total of all believed possibilities may be greater than or lesser than 1.0. Possibility theory permits us to model our uncertain judgments and beliefs without unnaturally rigidifying the relationship between our estimates. In this situation, the notion of probability seems less flexible than that of possibility. We are rarely in a position in which we have a reasonable amount of statistical data about potential national conflicts. In all practicality, the “possibility of intrusion” is the only obtainable quantitative measure of security. For threats against national infrastructures and battle space illumination, we must minimize the possibility, not the probability.
Figure 2. Possibility of break-in for information resources for different assumptions
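As a worked illustration (the exact readings depend on how the curves of Figure 2 were drawn), take the normal-distribution case with mean 19 hours and standard deviation 5 hours and read the cumulative curve directly as the possibility function. Nine hours after the protection begins, z = (9 − 19)/5 = −2 and the possibility of a completed break-in is about 0.023; at 14 hours (z = −1) it is about 0.159; and at the 19-hour mean it reaches 0.5, a level at which the resource can no longer be considered functional in the sense discussed above.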
PROTECTION INTERVAL AND “NIGHT WATCH” AS SECURITY MEASURES
Let us define the problem precisely. Given that the possibility function is nondecreasing in nature, the only logical approach to its minimization is by restricting the time interval of its operation. With the essential geometric shape of the possibility function, any security measure operating during an unrestricted time interval will produce (sooner or later) a possibility of compromise that approaches absolute certainty. In other words, if it can be compromised, it will eventually be compromised. A practical countermeasure in this situation involves defining a fixed interval of time and the evaluation of all protection measures only during this interval. We shall call this fixed interval of time a protection interval (PI). In implementation, the subdivision of the working time of the particular information resource into a sequence of PIs could be organized through the design of a “night watch”. A night watch is a special procedure by which the integrity of the information resource is routinely verified in order to be sure that no compromise has occurred during the protection interval. Through this procedure, we can effectively restart the PI with a new time interval having a zero possibility of compromise. If the night watch is effective, the new PI will not inherit a significant accumulated possibility of compromise from the previous PIs. Possibility theory, therefore, suggests two security measures for information resources operating in settings of national infrastructure and battle space illumination. First, there must be an established, definite protection interval (PI). Second, a night watch procedure must delineate each PI in order to confine the possibility of intrusion within acceptable limits. Of course, there must be “protection” as a prerequisite for establishing a “protection interval”. The initiating means by which a PI is raised is through the use of a strong set of security safeguards. Therefore, we presume the use of existing information security technology and procedures, such as intrusion detection, network firewalls, communications encryption, password policy enforcement, trusted operating, database systems, and so forth (compare Pfleeger, 1997; Richards, 1999). Stronger security measures extend the protection interval further. For example, subdividing a black network into secure subnetworks will force intruders to compromise two independent firewall arrangements, thereby extending the protection intervals of resources on the subnetworks. In other words, the security measures define the length and shape of the protection interval curves illustrated in Figure 2. The night watch starts and restarts the protection interval curve from its origin. In any setting, an optimum value of PI must exist. This optimum is implied by three factors: (1) the existence of a limit for PI, (2) the proportional
relationship between PI and information resource service, and (3) the inversely proportional relationship between PI and security.

First, the limit of PI is the execution time of the night watch using the information resource. For example, if the resource is a computer, we can assume that all processing (service) on that computer must be interrupted while the night watch executes. If, for example, the night watch takes five milliseconds to execute and if the PI is set to five milliseconds (or less), then the computer will have no time available to process any other function except for the night watch. Therefore, as PI approaches its limit, the uninterrupted functionality of the information resource will approach zero. This trend defines the second factor; namely, as PI decreases, the information resource will become less functional, and as PI increases, the resource will become more functional. Therefore, PI has a proportional relationship with the information resource service. The third factor denotes the inversely proportional relationship between the length of the PI and the degree of security. As the PI (and service) rises, the possibility of compromise also rises, and security decreases.

We can summarize these relationships as follows. First, a longer PI will provide a longer uninterrupted utilization of the information resource. However, the longer PI will yield a higher possibility of compromise. Second, a shorter PI will provide a lower possibility of compromise in the information resource. However, the uninterrupted utilization of the information resource will also be shorter. Third, as PI approaches its limit, functional availability of the information resource will disappear. Therefore, an optimum PI exists, one that best balances information resource service availability against the possibility of compromise. The process of solving for this optimum is addressed below, first by using traditional probabilistic models described in the literature (Portougal & Trietsch, 2001), and second by interpreting the results in terms of possibility.
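The first two relationships can be sketched with a small formula. Writing T_nw for the execution time of the night watch and A for the fraction of a PI left for useful service (both symbols are shorthand introduced here for illustration, not the chapter's notation), the uninterrupted availability over one protection interval is roughly

A(PI) \;=\; \frac{PI - T_{nw}}{PI} \;=\; 1 - \frac{T_{nw}}{PI}, \qquad PI \ge T_{nw} .

With the numbers above, T_nw = 5 ms and PI = 5 ms give A = 0 (no service at all), whereas PI = 50 ms gives A = 0.9. Availability therefore rises toward 1 as PI lengthens, while the possibility of compromise accumulated over the uninterrupted interval also rises, which is exactly the trade-off that makes an optimum PI exist.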
A MODEL OF OPTIMUM PI
Let us assume that, without loss of generality, we can operate with standardized distributions by using the transformation Z=(t-µ)/σ. Let fs(z) (or simply f(z)) denote the density (or mass probability) function of Z, and let Fs(z) (or F(z)) denote its cumulative distribution function. We will operate a night watch at the end of each PI. Under this operating procedure at the end of a PI, we set D=µ+kσ, where k is our (standardized) decision variable and kσ is not necessarily positive but most probably negative. The kσ might be negative because the probability of information resource compromise at the mean of the distribution is usually considered by experts to be unacceptably high. Let us also assume that the information resource has been functionally created to produce some positive output. Examples include a banking information resource that is created to produce profit or a military
system that is created for battle space dominance. When the resource is compromised, then the positive output is reduced (or disappears). Let us denote the output per each unit of time by C, meaning then that the total output is C(µ+kσ). Let us further add to this model the expected penalty if a break-in happens in time t

a) DTD of Information Shared from Police to Treasurer. b) DTD of Information Shared from Treasurer to Police.
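As a purely illustrative sketch of what the police-to-treasurer DTD might contain, given the felon records searched later in the chapter, one could imagine something like the following; every element name here is our own assumption, not the authors' actual DTD:

<!-- Hypothetical sketch only; all element names are assumed for illustration. -->
<!ELEMENT felons (felon*)>
<!ELEMENT felon (lastname, firstname, ssn?, conviction*, description?)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT ssn (#PCDATA)>
<!ELEMENT conviction (#PCDATA)>
<!ELEMENT description (#PCDATA)> <!-- free text field -->

A second DTD of the same general shape would describe the information flowing back from the treasurer to the police.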
Information Storage
The information storage component is used to store the actual XML pages that conform to these DTDs. First, a one-time effort is required to create reports from the local databases or document repositories (of the police and the treasury) in the form of XML pages. Since most commercial DBMSs currently offer the ability to create scheduled reports in XML, this task is possible without adding to existing IS systems and personnel. Next, a repository is created to store these pages. This repository is a directory structure with an IP (Internet protocol) address that consists of interlinked HTML and/or XML pages. Finally, procedures are implemented to effect the periodic transfer of pages to this repository and the purging of these pages from the repository.
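For concreteness, one page in such a repository might resemble the sketch below, which conforms to the hypothetical DTD given earlier; the element names and record contents are illustrative assumptions of ours:

<?xml version="1.0"?>
<!-- Hypothetical felons.xml page, produced as a scheduled XML report from the police database. -->
<felons>
  <felon>
    <lastname>Smith</lastname>
    <firstname>John</firstname>
    <conviction>fraud</conviction>
    <description>Suspect record shared with the treasurer's office.</description>
  </felon>
</felons>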
Information Retrieval
The information retrieval component is a search agent called XSAR (XML search agent for information retrieval) that searches for information from the repository (Bajaj, Tanabe & Wang, 2002). XSAR can dynamically query large information repositories of XML documents. It is DTD independent, so the same agent can be used to query documents conforming to multiple DTDs. XSAR is a dynamic agent, in that it does not use an underlying database or directory scheme of information on pages. Rather, it dynamically queries the repository on behalf of a user. There is no requirement for the repository site to have any special software or hardware; the only assumption is that it is a site that contains a mixture of HTML and XML pages. XSAR does require the user to be aware of the underlying DTD, insofar as the user must know which fields to search. Essentially, XSAR reformulates the query as an XQL expression and then launches a spider that traverses the information repository and executes the XQL query against each XML page found in the repository. Figure 2 shows the operational flow of XSAR.

Figure 2. Operational flow of XSAR
(The flow: the user connects to the XSAR site using a WWW browser; the user formulates a query using the XSAR interface, specifies the information repository WWW address, and optionally specifies limits on how long XSAR can take for the search and how deep it should search; XSAR launches a spider that searches the XML repository; the XSAR site displays the search results to the user's WWW browser.)
Prototype Implementation Of IAIS
Our implementation of XSAR was built using the Java Server Pages (JSP) architecture, which is part of the J2EE (Java 2 Enterprise Edition) specification, accessible currently at http://java.sun.com. This architecture has the advantage of cross platform portability, in that the agent software is operating system independent, and the search results produced (JSP pages) are browser independent. Figure 3 illustrates the high-level architecture of XSAR as per the Model View Controller (MVC) architecture (Foley & McCulley, 1998).

Choice of XML Query Language

Currently, there is no standard query language for XML designated by the WWW Consortium (W3C). However, several query languages for XML documents have already been proposed, for example, XQL, XML-QL, and Quilt (all accessible via http://www.w3c.org). We selected XQL for the implementation because (a) its grammar is based on XPath, which has already been standardized by W3C, and (b) application programming interfaces (API) for XQL are available in Java. XSAR offers several improvements over existing facilities for searching XML files. First, XQL by itself can be used to query only individual XML pages and to return results from that page. It does not query multiple XML pages and return combined results from them. XSAR overcomes this limitation by crawling through the repository and executing the XQL query on each XML page. XSAR combines and displays the results of its repository search to the user. Second, while XSAR currently uses XQL, the source code underlying XSAR can be easily modified to use a different API to accommodate future changes in the W3C specifications. Third, XSAR allows the use of XQL to be transparent to the user by providing the option of an easy-to-use forms based interface (see Appendix 1, Figure A1-1) to formulate the query.

Functionality of XSAR

Based on the operational description of XSAR (see Figure 2), detailed screenshots of actual XSAR usage are shown in Appendix 1. The case in the appendix searches a repository that lists felon information, as shown in our example earlier. It shows a search for a specific suspected felon whose last name is “Smith” and first name is “John”. There are two modes in which users can interact with XSAR. In the expert mode, users formulate queries using XQL, while in the novice mode, they can use the menus for searching. In Figure A1-1, the novice mode is used for specifying search conditions. Single quotes are used to surround string parameters. If a parameter is a number, no quotation is required. Each comparison operator is of two types: case sensitive and case insensitive. Figure A1-1 in Appendix 1 shows the search conditions for this case. Note how users formulating the query need to be aware of the DTD to specify
the element or tag whose values they want searched; however, the interface itself can be used to search for information in pages conforming to any DTD. One of the operators users can select is the contains operator, which allows for free text search within an XML tag. This operator allows the searching of free text fields in the XML DTD. However, the user does not need to be XQL cognizant when using the novice mode. Figure A1-2 in Appendix 1 shows the search result of the query posed by the user. In this case, only one XML document was found. Figure A1-3 in Appendix 1 shows the result of selecting the “See Matches” option in Figure A1-2. Only the matched part of felons.xml is displayed. Figure A1-4 in Appendix 1 shows the result of choosing the “See whole XML” option in Figure A1-2. The entire felons.xml is displayed.

Detailed Architecture of XSAR

As illustrated in Figure 3, the underlying architecture of XSAR follows the Model View Controller architecture which differentiates between the data, the program logic, and the viewing interface (Foley & McCulley, 1998). XMLRetriever and ShowMatch are the Controller: they are Java Servlets which are responsible for receiving the requests from the client. The Agent component takes responsibility for retrieving the data from the Web site (Model). The search results are presented as JSP pages (View). Once XMLRetriever receives the request from the client, it generates the XML query string and hands it to Agent. Agent essentially crawls through the target repository, utilizing a breadth-first search algorithm. It parses the starting page (specified by the user) and identifies links. Each linked page is then parsed (if it is an HTML page) or parsed and searched (if it is an XML page).

Figure 3. Low level architecture of XSAR
(The diagram shows the main components and their responsibilities: the servlets XMLRetriever, which gets the request parameters, invokes Agent, and forwards the result, and ShowMatch, which parses an XML page and retrieves the matched part; Agent, which traverses the site, finds links, and matches results; Worker threads, which find links and put them on the queue; XMLWorker threads, which parse XML and invoke the XQL engine; and JSP pages such as Result.jsp and Error.jsp for presentation.)
ShowMatch gets the search results from the agent program and puts them into JSP files for presentation. XSAR utilizes the Xerces API for XML page parsing, the GMD-IPSI XQL engine for XQL querying, and is packaged to run on the Tomcat Web container1. Appendix 2 flowcharts the high level logic of each of the software components used in XSAR. Next, we describe alternate methods for information definition, storage, and access and compare IAIS with alternate potential methodologies that can be used to facilitate information sharing in our domain of interest.
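To make the preceding description concrete: for the Smith/John search discussed earlier, the query string handed to Agent would amount to a filter of roughly the following shape. XQL's grammar is based on XPath, so the expression is written here in XPath style; the element names come from our hypothetical DTD, and the exact surface syntax accepted by the GMD-IPSI engine may differ:

//felon[lastname = 'Smith' and firstname = 'John']

The agent evaluates such a filter against each XML page the spider fetches and merges the per-page matches into the combined result set shown to the user.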
COMPARING IAIS WITH OTHER POTENTIAL METHODOLOGIES
Alternative Strategies for Information Definition, Storage, and Access
Information Definition

Several choices are available for specifying how the information to be shared among government agencies can be defined. We list three commonly used standards here. The first and most obvious is free text documents, where the name of the document along with other fields, such as keywords, can be used to structure information about the document. This standard includes documents in pure text markup languages, such as HTML (hypertext markup language). Free text documents are appropriate when storing records of proceedings, hearings, rules, and regulations or memoranda. A significant portion of government information is in this format (Dizard, 2002; Minahan, 1995; Stampiglia, 1997). The second standard is using a relational database structure (Codd, 1970), and it involves storing information in the form of relations, where each table has attributes, and the tuples contain the actual data. Relations are linked to each other using common attributes that usually uniquely identify a tuple in one table. The relational model has well-accepted rules for eliminating data redundancy using the concept of data normalization. Most commercial database management systems (DBMSs) support this standard. This option is useful if the information to be shared consists of clearly defined attributes, such as the name, address, physical description, and attributes describing immigration entries and exits for individuals. The third standard for structuring information is the use of XML document type definitions (DTDs)2. XML DTDs allow information to be structured, using tags to differentiate elements and subelements. This is analogous to tables and columns in the relational model. Several researchers believe that XML will allow the Web to become more semantic in nature, such that software agents will be
able to access information and understand what it means (Berners-Lee, Hendler & Lassila, 2001). The main advantages of XML are that (a) it is a platform independent standard where the data does not need to be read by a proprietary product (as it would in a relational DBMS), and (b) it allows for variable structuring of information, where some parts can be free text and others can be more structured.

Information Storage and Access

Based on the three alternate standards of information definition described earlier, there are alternate methods of storing the structured information as well. If the information is structured in the form of free text or HTML files, then storage on a file server (usually a component of the operating system) is a first alternative. Another option is third party document management tools, such as Documentum (http://www.documentum.com), that facilitate the creation and management of these pages. Most of these systems use the existing file system to store the applications but store additional information that makes it easier to retrieve the files. Finally, it is possible to use object-relational features of current commercial DBMSs to store and retrieve these files.

For information stored in the form of free text, search engines, for example, Google and AltaVista, offer one method of accessing information. This is similar to a search on the WWW, except that instead of searching across numerous Web sites, the search in our case is on a repository of free text or HTML documents. A second method is to use directory-based search engines, such as Yahoo (http://www.yahoo.com). If a third party application is used for storage, the search capabilities are limited to the capabilities of the tools offered by that application. Finally, if object-relational DBMSs are used, the ease of access depends on extensions to the structured query language supported by the DBMS.

If the information is structured in a relational database, a commercial DBMS can be used to store this information. Commercial relational DBMSs are a tested technology that allow for concurrent access by multiple users. They provide good throughput, recovery, and reliability for most applications. The structured query language (SQL) standard allows easy access to information. Results can be served on the WWW using a number of available technologies, such as active server pages (ASP) or JSP.

If the information is structured in XML, it consists of text files, though the text is now structured. Storage alternatives for XML files are similar to free text, that is, file systems, third party applications, and object-relational DBMSs. However, there is an important difference between free text and XML formats when accessing the information. For XML files, the methodology of using conventional search engines with keyword searches either breaks down or, at the very least, negates the whole purpose of structuring the information as an XML document. Similarly, directory-based search engines do not allow the
search and retrieval of portions of XML pages. To overcome these problems, two standards have been proposed by the WWW consortium (http://www.w3c.org) to search XML files. These are the XPATH standard, and the more recent XQUERY standard, which is still evolving. As of now, these standards allow the querying of a single XML page, not a repository of XML pages. Figure 4 summarizes the different methods of information storage (in each column) and the alternate methods available to access them (along the rows).

Figure 4. Information formats and their possible methods of search
Search Mechanisms ↓ / Information Format → | Free text/HTML Pages | Relational Alphanumeric Data | XML Pages
Search engines using keywords | X | NA | NA
Directory-based search engines | X | NA | NA
Third party applications for storing and searching pages | X | NA | X
Structured Query Language | NA | X | NA
XPATH, XQuery for single XML pages | NA | NA | X

Based on the above alternatives, we compare IAIS in this work with two alternate methodologies. In the first methodology, called the free text methodology, the parties involved make a listing of the free text documents the government agencies should share. These documents are then stored in a free text/HTML format in a centralized repository and searched by end users using keyword-based search engines, such as Google. The second methodology, called the database schema methodology, requires that the various parties examine the relational schemas of the underlying databases of the different agencies whose information is to be shared. They decide what information needs to be made available from each of these databases. This information can then be searched via a single Web interface that queries the individual databases directly, though the data itself will be stored in the individual databases. Next, we compare these methodologies with IAIS along the following dimensions: (a) ease of information definition and storage, (b) ease of information access, and (c) ease of system maintenance.
Ease of Information Definition and Storage
In the free text methodology, unstructured information, such as agency documents, can be identified and shared. However, it is not possible to define specific structured information that can be shared. Since most agencies have a collection of documents, the information definition step will, in general, be the
easiest in this phase and consist of a relatively simple identification of documents that can be shared without any in-depth study of the actual data elements in the agencies. In the database schema methodology, the schema of each government agency will have to be prepared for review by the different parties. Different data items can then be identified for sharing. A similar process can be followed for IAIS, where the data elements will need to be defined and then a DTD created. Thus, both the database schema methodology and the IAIS methodology require an examination of the actual data elements used by the agencies.

From the storage standpoint, the free text methodology requires that the pages be stored on a server’s file system. A coordination mechanism needs to be in place so that changes to documents in the organizations’ internal repositories are reflected in the shared repository. In the database schema methodology, there is no need for additional storage of data, since information is obtained dynamically from the underlying databases within each organization. In IAIS, the XML pages are stored in a repository, similar to the free text methodology, with a coordination mechanism to ensure that the pages are refreshed periodically to reflect the structured information within the internal databases. Thus, the database schema methodology requires less implementation effort from a storage perspective than the free text and IAIS methodologies. From this discussion, in general, we see that the free text methodology offers the easiest data definition, while the database schema methodology offers the easiest information storage.
Ease of Information Access
Web search engines that search files on different Web sites have their origin in the information retrieval (IR) systems developed during the last 50 years. IR methods include Boolean search methods, vector space methods, probabilistic methods, and clustering methods (Belkin & Croft, 1987). All these methods are aimed at finding documents relevant for a given query. For evaluating such systems, recall (ratio of the number of relevant retrieved documents to the total number of available relevant documents) and precision (ratio of the number of relevant retrieved documents to the number of total retrieved documents) are the most commonly used measures (Billsus & Pazzani, 1998). Finally, response time (time taken for users to get the information they desire) has been found to be a useful metric (Anderson, Bajaj & Gorr, 2002). Furthermore, the performance of each search tool is directly influenced by the user’s ability to narrowly define the nature of the query, for example, in the form of precise keyword strings or correct specification of a search condition. For the sake of the discussion below, we assume that the search specificity is held constant.
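Stated as formulas, using only the definitions just given:

\text{recall} \;=\; \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents available}}, \qquad
\text{precision} \;=\; \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}} .

For example (with illustrative numbers of our own), a search that returns 8 documents, 6 of them relevant, from a repository holding 10 relevant documents has recall 6/10 = 0.6 and precision 6/8 = 0.75.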
Recall

For the free text methodology, recall is the ratio of the number of pages identified correctly as matching to the total number that should have been identified as matching. This ratio is clearly dependent on the algorithms used to perform the matching, with most current algorithms based on weighted indices. In general, the ratio will be less than one. Also, as the number of pages increases, all else being equal, the recall drops. For the database schema methodology, the recall of a search is determined by the accuracy of translation of the user’s query into the underlying database language (usually SQL) query. If the query translation is accurate and the underlying database is searched completely, the recall will be one. For agents like XSAR (the search mechanism in IAIS), the recall is dependent on the accuracy of translation of the user query into the underlying XML query language and the existence of a path from the specified starting point to all the XML pages that exist in the repository. For a well-connected repository where every page is reachable from a starting node, the recall will be one. From this discussion, we can see that, in general, the recall for the free text methodology would be lower than for the other two methodologies.

Precision

For the free text methodology, precision is determined by (a) the features offered to the user to specify a query to a desired level of precision (e.g., concatenated strings, wild card characters, boolean operators); (b) the extent to which precision can be sacrificed for response time; and (c) the extent of information captured in the underlying search engine database about the real pages. All else being equal, a search engine that offers more options for searching should have greater precision; likewise, one in which precision is secondary to response time will have lower precision. For (c), a search engine that captures only the title of each page will have less precision than one that captures the title and the META tags of each page. As the number of pages increases, all else being equal, the precision drops. When using the database schema methodology, the precision of an information repository that uses an underlying database depends on (a) the features provided by the user query interface and (b) the maximum precision allowed by the underlying query language of the database. If a relational database is used, then the underlying query language is SQL, which provides excellent precision. For XSAR, the query interface fully supports XQL. Thus, the precision of XSAR is identical to the precision of XQL, which supports XPath (http://www.w3c.org), the W3C standard for precision.
Response Time3

The response time in the free text methodology depends on (a) the algorithm used to perform the keyword match with the database that has information on each page in the repository and (b) the extent to which the designers are willing to sacrifice precision for response time. When using the database schema methodology, the response time depends on the complexity of the specified query and the extent to which the database is optimized for that query. Thus, for a relational database, a range query on nonindexed attributes with several joins is likely to take significantly more time than a simple SELECT query. Consequently, these repositories would need a database administrator to keep track of the most frequent queries and ensure proper tuning and access optimizations.

For agents like XSAR, response time is determined by the time taken by the agent to crawl through the target repository. This depends on exogenous factors, such as network speed, performance of the Web server of the target repository, and the size of the target repository. In designing XSAR, we used three endogenous strategies to minimize this time for given values of the exogenous factors: (a) a multi-threaded agent; (b) a proxy server for caching; and (c) letting the user specify a response time threshold. We next describe each of these strategies.

As the agent program fetches new pages from the target repository, it spawns threads to parse these fetched pages. HTML pages are only parsed for links, while XML pages are parsed and searched for the query. One design tradeoff here is that spawning a new thread is expensive if the pages being parsed and/or searched are small, while increasing the number of threads pays off as the size of each page in the repository increases. After experimenting with different real XML repositories (see Appendix 3), we selected five threads per search for the current implementation. However, XSAR is open source software, and this number can be modified under different installations.

For proxy caching in XSAR, we used the DeleGate proxy server, developed by Yutaka Sato, Japan (http://www.delegate.org/delegate/). A proxy server is useful in reducing network traffic, since it caches frequently accessed pages and, hence, reduces an agent’s (or browser’s) requests to a distant WWW server. We tested the performance of using the proxy server for a set of cases. The results of using the proxy server are shown in Appendix 3.

Finally, XSAR allows the user to set the threshold response time and the maximum depth to which the agent should search the target repository. This allows XSAR to be used by different classes of users; for example, one class of users may want to search large repositories comprehensively and leave the agent running, say, overnight, while another class of users may want quicker responses for searches where the maximum depth of the search is set to a finite number. We tested the performance of XSAR using three experiments described in Appendix 3. In our experiments, a multi-threaded program had the maximum
impact in reducing response time, with a proxy cache playing a secondary role. The results (see Appendix 3) indicate that XSAR is clearly slower than the other mechanisms available for searching, which is to be expected, since it searches a target repository dynamically. XSAR trades off faster response time for the ability to search a changing information repository in real time. Figure 5 summarizes the performance comparison among these methodologies.

Figure 5. Comparison of methods for information access
Methods ↓ / Metrics → | Recall | Precision | Response Time
Free Text Method | Low | Low | Fast
Database Schema Method | 100% | High | Fast
IAIS | 100% | High | Slow
Ease of System Maintenance
In the free text methodology, the major maintenance issues include the upkeep of the document repository, the search engine, and the coordination software required to maintain consistency between the internal pages within the organization and the pages in the information repository. In the database schema methodology, there is a need to track changes to the underlying database schema. Whenever these changes to the schema occur, appropriate changes are needed in the application code that directly queries the database. Furthermore, the organizations have to reveal their schemas to the team maintaining the application code in the system used for information sharing. In the IAIS methodology, there is an added layer of transparency, since each organization promises to deliver information in the form of XML pages conforming to the DTDs. This allows each organization to hide their internal database schemas and to make schema changes independent of the information sharing process, as long as each organization delivers its required pages. The maintenance issues for the IAIS information sharing system include refreshing pages from each organization and maintenance of the search agent (such as XSAR). From the above discussion, we can see that, in general, the database schema method requires the most maintenance, while the free text and IAIS methods require less maintenance. The results of our comparison of these three methodologies are shown in Figure 6. From Figure 6, it is clear that the main advantages of using IAIS over the free text methodology are (a) the ability to share structured information and (b) the ability to search a repository dynamically with greater precision and recall. The main advantages of IAIS over the database schema methodology are (a)
the former is, in general, easier to maintain and precludes the need for making database schemas available across organizations because of the increased layer of indirection provided by DTDs and (b) the ability to share unstructured as well as structured information in IAIS. The trade-off is that queries in IAIS require more response time than in the other methodologies.

Figure 6. Overall comparison results for the three methods
Methods ↓ / Metrics → | Information Definition | Information Storage | Information Access | Information Maintenance
Free Text Method | Easy, but structured information cannot be defined | Difficult | Low recall and precision, fast response | Easy
Database Schema Method | Difficult, but structured information can be defined | Easy | High recall and precision, fast response | Difficult. Subsets of data schemas need to be made visible across organizations.
IAIS | Difficult, but structured information can be defined | Difficult | High recall and precision, slow response | Easy
CONCLUSION AND FUTURE RESEARCH
In this work, we proposed IAIS: an XML-based methodology to facilitate sharing of information among government agencies. Information in this domain tends to be both structured and unstructured. IAIS complements previous work in the integration of structured information and leverages the XML standard as a means of integrating heterogeneous intergovernment data sources containing information of varying structure. We presented the steps needed for information definition, storage, access, and maintenance of a system in IAIS. We also described the operation of XSAR, the information access component of IAIS. Finally, we compared IAIS to two other methodologies used for information sharing and analyzed the pros and cons of each method. XSAR is free software, available under the GNU public license. It can be accessed at http://nfp.cba.utulsa.edu/bajaja/XSAR/Xsar/index.html, where users can use it to search XML repositories on the WWW. Since the time that XSAR was created, there has been significantly increased activity in deploying XML in the government. The portal Web site, accessible at http://xml.gov, indicates various XML registries that are in progress and where the IAIS methodology can be applied. The Justice XML Data Dictionary (JXDDS), accessible at http://www.IACPtechnology.org, is one example of interagency cooperation between the U.S. Department of Justice, U.S. Attorney, and the American Association of Motor Vehicle Administrators.
In this dictionary, common data definitions are used to support information sharing among prosecutors, law enforcement, courts, and related private and public bodies. As another example, the enactment of the Health Insurance Portability and Accountability Act (HIPAA) of 1996 has also led to a strong motivation to develop a health care XML standard. This growing awareness of the need for data sharing across agencies highlights the importance of the IAIS methodology and the relevance of this work. For future work, we plan on (a) prototyping a tool to facilitate the information definition component of IAIS by allowing different parties to collaborate on producing XML DTDs; (b) evaluating the usefulness of IAIS in real-world case studies of interagency information sharing, and (c) exploring ways to incorporate automated conflict resolution into the IAIS methodology to assist agencies in the critical phase of information definition.
ENDNOTES
1. Xerces is available at xml.apache.org; the GMD-IPSI XQL engine is available at xml.darmstadt.gmd.de/xql/; and Tomcat is available at jakarta.apache.org.
2. While alternatives to DTDs, such as XML schemas (http://www.xml.org), do exist, in this work we focus on XML DTDs since they are the predominant standard for structuring data on the World Wide Web.
3. We assume a given size of the universe of pages in this section.
REFERENCES
Anderson, B. B., Bajaj, A., & Gorr, W. (2002). An estimation of the relative effects of external software quality factors on senior IS managers’ evaluation of computing architectures. Journal of Systems and Software, 61(1), 59-75.
Bajaj, A., Tanabe, H., & Wang, C.-C. (2002). XSAR: XML based search agent for information retrieval. Proceedings of the Americas Conference on Information Systems, Dallas, Texas.
Batini, C., Lenzerini, M., & Navathe, S. B. (1986). A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4), 323-364.
Belkin, N. J., & Croft, W. B. (1987). Retrieval techniques. Annual Review of Information Science and Technology, 22, 109-145.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001, May). The Semantic Web. Scientific American, 28-37.
Billsus, D., & Pazzani, M. (1998). Learning collaborative information filters. Proceedings of the Machine Learning 15th International Conference.
Blaha, M. R., & Premerlani, W. J. (1995). Observed idiosyncrasies of relational database designs. Proceedings of the 2nd Working Conference on Reverse Engineering, Toronto, Canada.
Chiang, R. H. L., Lim, E.-P., & Storey, V. C. (2000). A framework for acquiring domain semantics and knowledge for database integration. The DATA BASE for Advances in Information Systems, 31(2), 46-62.
Codd, E. F. (1970). A relational model for large shared databanks. Communications of the ACM, 13, 377-387.
Dizard, W. P. (2002). White House promotes data sharing. Government Computer News, 21(25), 12.
Foley, M., & McCulley, M. (1998). JFC unleashed. SAMS Publishing.
Forsythe, J. (1990). New weapons in drug wars. Information Week, 278, 26-30.
Gelman, R. S. (1999). Confidentiality of social work records in the computer age. Social Work, 44(3), 243-252.
Glavinic, V. (2002). Introducing XML to integrate applications within a company. Proceedings of the 24th International Conference on Information Technology Interfaces, Zagreb, Croatia.
Goodman, M. A. (2001). CIA: The need for reform. Foreign Policy in Focus, Special Report 13 F 2001, 1-8.
Hayne, S., & Ram, S. (1990). Multi-user view integration system (MUVIS): An expert system for view integration. Proceedings of the IEEE International Conference on Data Engineering (ICDE).
Hearst, M. (1998, September-October). Trends and controversies: Information integration. IEEE Intelligent Systems Journal, 12-24.
Hinton, C. (2001). Seven keys to a successful enterprise GIS (Report IQ Service Report, 33(3)). Washington, DC: International City/County Management Association.
Hunter, C. D. (2002). Political privacy and online politics: How e-campaigning threatens voter privacy. First Monday, 2(4).
Khare, R., & Rifkin, A. (1997). XML: A door to automated Web applications. IEEE Internet Computing, 1(4), 78-87.
Larson, J. A., Navathe, S. B., & Elmasri, R. (1989). A theory of attribute equivalence in databases with application to schema integration. IEEE Transactions on Software Engineering, 15(4), 449-463.
Li, W.-S., & Clifton, C. (1994). Semantic integration in heterogeneous databases using neural networks. Proceedings of the International Conference on Very Large Databases (VLDB).
Minahan, T. (1995). Lack of interoperability delays import-export database. Government and Computer News, 14(10), 72.
Ram, S., & Park, J. (2004). Semantic conflict resolution ontology (SCROL): An ontology for detecting and resolving data and schema-level semantic conflicts. IEEE Transactions on Knowledge and Data Engineering, 16(2), 189-202.
Ram, S., Park, J., Kim, K., & Hwang, Y. (1999). A comprehensive framework for classifying data- and schema-level semantic conflicts in geographic and non-geographic databases. Proceedings of the 9th Workshop on Information Technologies and Systems (WITS), Dallas, Texas.
Ram, S., & Zhao, H. (2001). Detecting both schema-level and instance level correspondences for the integration of e-catalogs. Proceedings of the Workshop on Information Technology and Systems (WITS).
Raul, C. A. (2002). Privacy and the digital state: Balancing public information and personal privacy. Norwell, MA: Kluwer Academic.
Reddy, M. P., Prasad, B. E., Reddy, P. G., & Gupta, A. (1994). A methodology for integration of heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering, 6(6), 920-933.
Silberschatz, A., Stonebraker, M., & Ullman, J. (1996). Database research: Achievements and opportunities into the 21st century. SIGMOD Record, 25(1), 52-63.
Sneed, H. M. (2002). Using XML to integrate existing software systems into the Web. Proceedings of the 26th Annual International Computer Software and Applications Conference, Los Alamitos, California.
Stampiglia, T. (1997). Converging technologies and increased productivity. Health Management Technology, 18(7), 66.
United States. House. Com. on the Judiciary. Subcom. on Civil and Constitutional Rights (1984). Domestic security measures relating to terrorism: Hearings (Congressional Hearings 98th Cong., 2d sess.; Serial no. 51). Washington, DC: US Congress.
Vaduva, A., & Dittrich, K. R. (2001). Metadata management for data warehousing: Between vision and reality. Proceedings of the International Database Engineering and Applications Symposium (IDEAS), Grenoble, France.
Yan, G., Ng, W. K., & Lim, E.-P. (2002). Product schema integration for electronic commerce: A synonym comparison approach. IEEE Transactions on Knowledge and Data Engineering, 14(3), 583-598.
Zhao, J. L. (1997). Schema coordination in federated database management: A comparison with schema integration. Decision Support Systems, 20, 243-257.
APPENDIX 1
Illustration of Usage of XSAR
Figure A1-1. Creating query for local test Web site
Figure A1-2. Search result page for local test
Figure A1-3: Result of “See Matches” for local test Web site
Figure A1-4: Result of “See whole XML” for local test Web site
APPENDIX 2

Flowcharts of the High Level Logic for XSAR

Figure A2-1. Flowchart of the controller (XMLRetriever)
Figure A2-2. Flowchart of the model component (Agent)
Figure A2-3. Flowcharts of the threads that parse HTML and XML documents (Worker and XMLWorker)
APPENDIX 3
Performance Results of Test Cases
We present the results of three test cases that illustrate the performance of XSAR. The first case was a repository created inside the same network as XSAR; therefore, the target Web site and the agent server were connected by 10 Mbps Ethernet. The contents of the repository were small HTML files and XML files, less than 4 Kbytes each. The second case was a real XML repository (http://www.xml-cml.org/), the definitive resource for the Chemical Markup Language, which is used to exchange chemical information and data via the Internet. The third case was also a real-world case: the Dublin Core Metadata site (http://purl.org/dc/), which hosts a specification for describing library materials. Table A3-1 shows the search time for the cases mentioned above. The search time results clearly depend on exogenous variables like network conditions, PC hardware specifications, and load status. In order to eliminate at least some variation in these factors, we tested each case five times and present the mean values in Table A3-1.

Table A3-1: Comparison among experiment cases
Metric | Local experimental environment | CML | Dublin Core
Number / Average size of HTML | 16 / 458 bytes | 1 / 10490 bytes | 34 / 16966 bytes
Number / Average size of XML | 16 / 2756 bytes | 173 / 10314 bytes | 27 / 974 bytes
Direct access | 2545 ms | 19291 ms | 44637 ms
Via cache server | 1772 ms | 14577 ms | 40305 ms
Ratio of (w/ cache) / (w/o cache) | 0.696 | 0.756 | 0.903
About the Authors
Keng Siau, Professor of Management Information Systems (MIS) at the University of Nebraska-Lincoln (UNL), is currently serving as editor-in-chief of the Journal of Database Management and as the book series editor for Advanced Topics in Database Research. He received his PhD from the University of British Columbia (UBC), where he majored in MIS and minored in cognitive psychology. His master’s and bachelor’s degrees are in computer and information sciences from the National University of Singapore. Dr. Siau has more than 200 academic publications. He has published more than 70 refereed journal articles, and these articles have appeared (or are forthcoming) in journals such as Management Information Systems Quarterly, Communications of the ACM, IEEE Computer, Information Systems, ACM SIGMIS’s Database, IEEE Transactions on Systems, Man, and Cybernetics, IEEE Transactions on Professional Communication, IEEE Transactions on Information Technology in Biomedicine, IEICE Transactions on Information and Systems, Data and Knowledge Engineering, Journal of Information Technology, International Journal of Human-Computer Studies, International Journal of Human-Computer Interaction, Behaviour and Information Technology, Quarterly Journal of Electronic Commerce, and others. In addition, he has published more than 90 refereed conference papers (including nine ICIS papers), edited/coedited 10 scholarly and research-oriented books, edited/coedited nine proceedings, and written more than 15 scholarly book chapters. He served as the organizing and program chair of the International Workshop on Evaluation of Modeling Methods in Systems Analysis and Design (EMMSAD) (1996-2005). He is also serving on the organizing committees of AMCIS 2005 and AMCIS
2007. For more information, please visit his Web site at URL: http://www.ait.unl.edu/siau/.

* * *
Pär J. Ågerfalk, PhD, is postdoctoral research fellow with the University of Limerick, Ireland, and assistant professor (universitetslektor) in informatics at Örebro University, Sweden. Dr. Ågerfalk’s current research centers on human and social aspects of information systems development and use, particularly how language/action theory can inform the design process. Akhilesh Bajaj is associate professor of MIS, University of Tulsa. He received a BTech in chemical engineering from the Indian Institute of Technology, Bombay in 1989, an MBA from Cornell University in 1991, and a PhD in MIS (minor in computer science) from the University of Arizona in 1997. Dr. Bajaj’s research deals with the construction and testing of tools and methodologies that facilitate the construction of large organizational systems, as well as studying the decision models of the actual consumers of these information systems. He has published articles in several academic journals, such as Management Science, IEEE Transactions on Knowledge and Data Engineering, Information Systems, and the Journal of the Association of Information Systems. He is on the editorial board of several journals in the MIS area. His research has been funded by the Department of Defense (DOD). He teaches graduate courses on basic and advanced database systems, management of information systems, and enterprisewide systems. Richard Baskerville is chairman and professor in the Department of Computer Information Systems, J. Mack Robinson College of Business at Georgia State University. His research specializes in security of information systems and the interaction of information systems and the organization. Baskerville is the author of Designing Information Systems Security (J. Wiley) and numerous scholarly and professional journal articles. He is an editor of The European Journal of Information Systems and a member of the editorial boards of The Information Systems Journal, The Journal of Database Management, and The Information Resources Management Journal. Together with Professor Michael Myers, he is editing the MIS Quarterly special issue on action research. Baskerville’s practical and consulting experience includes advanced information system designs for the U.S. Defense and Energy Departments. He is a former chair of the IFIP Working Group 8.2, a chartered engineer under the British Engineering Council, a member of The British Computer Society, and certified computer professional by the Institute for Certification of Computer Professionals. Baskerville holds degrees from the University of Maryland (BS, summa cum laude, management), and the London School of Economics, University of
London (MSc, analysis, design, and management of information systems; PhD, systems analysis). Vibeke Dalberg has an MSc (2001) from the University of Oslo, Department of Informatics. She is currently working as a system developer and researcher within Det Norske Veritas (DNV). Owen Eriksson, PhD, is assistant professor (universitetslektor) in informatics, Dalarna University, Sweden. Dr. Eriksson’s main research fields are systems architectures (including standardization), conceptual modeling, and database design based on language action theory and the use of mobile technology in the transport and travel industry. Asunción Gómez-Pérez is the director of the Ontology Group at UPM. She has a BA in computer science (1990), an MSc in knowledge engineering (1991), and a PhD in computer science (1993) from Universidad Politécnica de Madrid (UPM). She also has an MBA (1994) from Universidad Pontificia de Comillas. She was visiting (1994-1995) the Knowledge Systems Laboratory at Stanford University. She is associate professor at the Computer Science School at UPM. She was the executive director (1995-1998) of the Artificial Intelligence Laboratory at the school, and she is currently a research advisor of the same lab. She has been the director of the Ontology Group at the lab since 1995. Her current research activities include, among others: interoperability between different kinds of ontology development tools, methodologies and tools for building and merging ontologies, ontological reengineering, ontology evaluation, and ontology evolution, as well as uses of ontologies in applications related to the Semantic Web, e-commerce, and knowledge management. She acts as a consultant to the European Commission on the FP6 program on the topic of knowledge technologies. She has published more than 70 papers on the above issues. She has led several national and international projects related to ontologies funded by various institutions and/or companies.
IFIP WG 8.1 (Information Systems) and is an editor or reviewer for several academic journals. Siri Moe Jensen has an MSc (1989) from the University of Oslo, Department of Informatics and a background in contract research and independent consulting. She is currently working as a senior research scientist in the research programme Organisations of the Future in DNV Research (Det Norske Veritas). Judith Kabeli earned her PhD at the Department of Information Systems Engineering of Ben-Gurion University (2004). Currently, she lectures at the Department of Industrial Engineering & Management of the Negev Academic College of Engineering and at the Open University of Israel. Her research interests include information systems analysis and design, and intelligent Webbased user interfaces. Thomas Knothe, Senior Researcher, earned a diploma in information technology for mechanical engineering at Technical University Berlin/Germany. Since 1998, he has been a senior researcher at IPK-Berlin, Systems Planning Division. He is involved in various research and consulting projects at the Fraunhofer Institute IPK in Berlin (e.g., UEML, MISSION). His consulting emphases are business process modeling, analysis and optimization, adaptation and use of business process modeling methods in different domains. He leads re-engineering projects in enterprises of various branches (software industry, mechanical engineering, IT services). John Krogstie has a PhD (1995) and an MSc (1991) in information systems, both from the Norwegian University of Science and Technology (NTNU). He is currently a senior research scientist at SINTEF. He is also a (part-time) professor at NTNU. He was employed as a manager in Accenture 1991-2000. John Krogstie is the Norwegian representative for IFIP TC8 and vice chair of IFIP WG 8.1 on information systems design and evaluation, where he is the initiator and leader of the task group for mobile information systems. He has published around 50 refereed papers in journals, books, and archival proceedings since 1991. Jian Li received a BS in automotive engineering from Tsinghua University, P.R. China, in 1996, and an MS in computer engineering from the University of Rhode Island, Kingston. He was an information system manager at AMP Inc. (China Office) during 1997-1998. He is currently working toward his PhD in the area of multi-threaded and multi-processor architecture at the Computer Systems Laboratory, Cornell University. He is a student member of the ACM, the IEEE Computer Society, and the DSI.
Yun Lin is a doctoral candidate in the Information System Group at the Department of Computer and Information Science at the Norwegian University of Science and Technology. She is also a teaching assistant for a course on modeling information systems. Before she began her PhD studies, she held a position as a researcher at the Wuhan Research Institute of Posts & Telecommunications in China. She is currently involved in a European project, INTEROP (Interoperability Research for Networked Enterprises Applications and Software). Her research interests include ontology, conceptual modeling, MDA, model reuse, and interoperability. Ling Liu is an associate professor in the College of Computing at Georgia Tech. Her research involves both experimental and theoretical study of distributed data intensive systems, including distributed middleware, advanced Internet systems, and Internet data management. Dr. Liu has published more than 100 articles in international journals and international conferences. She is currently a member of the ACM SIGMOD executive committee, the editor-in-chief of ACM SIGMOD Record, and on the editorial board of several international journals, including TKDE, VLDB Journal, International Journal of Web Services, and International Journal of Grid and Utility Computing. Her current research is partially funded by government grants from NSF, DARPA, and DoE, and industry grants from IBM and HP. Scott J. Lloyd received his BS and MS in management information systems from Virginia Commonwealth University. After that, he worked for the Department of Defense for two years as a database analyst and designer. He then worked for Cap Gemini consulting and as a private consultant for another two years. Scott received his PhD in information systems from Kent State University and then worked for Roadway Technologies as a senior technology consultant before accepting an assistant professor position at Eastern Illinois University. After three years, Scott moved to the University of Rhode Island as an assistant professor in MIS. Adolfo Lozano-Tello has been a teaching/research assistant professor of computer languages and systems at Extremadura University, Spain, since 1995. He received his BS in computer science from Granada University, Spain (1993). He has worked at PROMAINEX (1993-1995) as a computer consultant and programming team leader. He was director of computer science in the Minimally Invasive Surgery Centre, Cáceres, Spain (1995). He received a PhD (2002) in computer science from Extremadura University, Spain, with a special prize for an extraordinary thesis. His PhD research addresses the use of software components and ontologies in applications. His research interests include reusable software components rating, ontology design, and software component and
His research interests include reusable software component rating, ontology design, and the identification of software component and ontology features. He has published more than 20 papers on these topics in software engineering and knowledge engineering, and participates in several projects related to them. At present, he is the director of the Computer Science Service at Extremadura University.
Sergio Luján-Mora ([email protected]) is a lecturer at the Computer Science School at the University of Alicante, Spain. He received a BS in 1997 and an MS in computer science in 1998 from the University of Alicante. Currently, he is a doctoral student in the Department of Language and Information Systems, advised by Dr. Juan Trujillo. His research spans the fields of multidimensional databases, data warehouses, OLAP techniques, database conceptual modeling, object-oriented design and methodologies, Web engineering, and Web programming. He is the author of several national journal papers and of papers presented at international conferences and workshops, such as ER, UML, and EDBT.
Kai Mertins has been director for corporate management at the Fraunhofer Institute for Production Systems and Design Technology (IPK), Berlin, Germany, since 1988. He has more than 20 years of experience in the design, planning, simulation, and control of flexible manufacturing systems (FMS), manufacturing control systems (MCS), computer integrated manufacturing (CIM), business re-engineering, and enterprise modeling. He was general project manager in several international industrial projects and gives lectures and seminars at the Technical University Berlin and at several other universities. Kai Mertins is a member of the editorial boards of Production Planning and Control and the Business Process Management Journal.
Christophe Nicolle is an associate professor in the Computer Science Department at the University of Bourgogne. He received his PhD in computer science in 1996 and has since been a member of the LE2I laboratory (Electronique, Informatique et Image) at the University of Bourgogne. His research interests include the building of cooperative information systems on the Web using XML. He also collaborates with the Active3D-Lab company on adaptive and hypermedia Web-based systems.
Joan Peckham is a professor of computer science at the University of Rhode Island. In the 1990s, she pursued semantic data modeling research in the context of active databases. She is currently interested in the application of these techniques to software engineering tools. She is concurrently working on a project to develop learning communities for freshmen, with the goal of encouraging women and other underrepresented groups into the technical disciplines.
She is also working with colleagues in engineering and the biological sciences to apply computing techniques to the fields of transportation and bioinformatics, respectively.
Victor Portougal is an associate professor in the Department of Information Systems and Operations Management, Business School, The University of Auckland. His research interests are in quantitative methods in both management science and information systems. In information systems, his research specializes in security, information systems design and development, and ERP. Portougal's practical and consulting experience includes information and ERP systems design and implementation for companies in Russia and New Zealand. He is the author of many articles in scholarly journals, practitioner magazines, and books. Portougal holds degrees from the University of Gorki, Russia (MSc, computer science), the Academy of Sciences, Moscow (PhD, operations research), and the Ukrainian Academy of Sciences, Kiev (Doctor of Economics).
Calton Pu is a professor and John P. Imlay, Jr. Chair in Software at the College of Computing, Georgia Institute of Technology. He is building software tools to support information flow-driven applications, such as digital libraries and electronic commerce. He is interested in and has published on operating systems, transaction processing, and Internet data management. He has published more than 40 journal papers and book chapters and 130 conference and refereed workshop papers, and has served on more than 80 program committees. He served as PC co-chair for SRDS and ICDE, and as general chair for CIKM and ICDE. His research has been supported by NSF, DARPA, and ONR, and by industry grants from AT&T, IBM, HP, and Intel.
Sudha Ram is Eller Professor of Management Information Systems in the College of Business and Public Administration, University of Arizona. She received a BS in mathematics, physics, and chemistry from the University of Madras in 1979, a PGDM from the Indian Institute of Management, Calcutta, in 1981, and a PhD from the University of Illinois at Urbana-Champaign in 1985. Dr. Ram has published articles in such journals as Communications of the ACM, IEEE Expert, IEEE Transactions on Knowledge and Data Engineering, ISR, Information Systems, Information Science, and Management Science. Dr. Ram's research deals with modeling and analysis of database and knowledge-based systems for manufacturing, scientific, and business applications. Her research has been funded by IBM, NCR, the US Army, NIST, NSF, NASA, NIH, and ORD (CIA). Specifically, her research deals with interoperability among heterogeneous database systems, semantic modeling, data allocation, schema and view integration, intelligent agents for data management, and tools for database design.
Dr. Ram serves on various editorial boards, including Information Systems Research, the Journal of Database Management, Information Systems Frontiers, and the Journal of Systems and Software. She is also the director of the Advanced Database Research Group based at the University of Arizona.
Stephen Rockwell is an associate professor in the School of Accounting at the University of Tulsa. He received a BS in accounting from the University of Utah in 1986 and a PhD in business administration from Michigan State University (minor in MIS) in 1993. Dr. Rockwell's research deals with the conceptual design of databases, issues in computer-human interaction, and methodologies for the analysis and design of information systems. He has published articles in academic journals such as the International Journal of Intelligent Systems in Accounting, Finance, and Management, as well as a number of book chapters related to conceptual modeling and systems analysis and design. He is also a reviewer for several journals in the MIS area. His research has been funded by Arthur Andersen and Deloitte and Touche. He teaches graduate courses on information systems security and strategic technologies for corporate information systems, as well as several undergraduate courses integrating MIS and accounting.
Peter A. Rosen is a visiting assistant professor at Oklahoma State University, where he is pursuing a PhD in management science and information systems. His past degrees include a BA in psychology from the University of California, Santa Barbara, and an MBA from San Diego State University. His research interests include data mining techniques, neural network training algorithms, personal innovativeness, and technology acceptance.
Duncan Dubugras Ruiz is an associate professor of computer science in the Post-Graduate Program on Computer Science, Faculty of Informatics, Pontifical Catholic University of Rio Grande do Sul, Brazil. His research interests include information systems modeling and design, in particular workflow and business process modeling and automation, and applications of database technology, especially temporal and active databases and XML databases. He received a PhD in computer science from the Federal University of Rio Grande do Sul, Brazil, in 1995, and an MSc from the same university in 1987.
Jennifer Sampson is currently undertaking a PhD at the Department of Computer and Information Science at the Norwegian University of Science and Technology (NTNU). Her PhD involves the development of models, methods, and a platform for adaptive, intelligent, distributed mobile information services support. Prior to returning to academia, Jennifer worked in industry as a consultant for Andersen Consulting, Commercial Data Processing (CDP), and Telecom Mobile New Zealand.
Jennifer has a BS in business studies with honours in information systems and an MS in business studies in information systems from Massey University, New Zealand. Her research interests include ontology mapping, conceptual modeling, and evidence-based practice.
Peretz Shoval is a professor of information systems and head of the Department of Information Systems Engineering at Ben-Gurion University. He earned his PhD in information systems (1981) from the University of Pittsburgh, where he specialized in expert systems for information retrieval. In 1984, he joined Ben-Gurion University, where he started the Information Systems Program. Prior to moving to academia, he held professional and managerial positions in computer and software companies. Shoval's research interests include information systems analysis and design methods, data modeling and database design, and information retrieval and filtering. He has published numerous papers in journals and conference proceedings. Shoval has developed methodologies and tools for systems analysis and design, and for conceptual and logical database design.
Jean-Claude Simon is a PhD student in the Computer Science Department at the University of Bourgogne and a student in the LE2I laboratory (Electronique, Informatique et Image). His research interest is the composition of XML Web services.
Il-Yeol Song ([email protected]) is a professor at Drexel University. Song received a PhD in computer science from Louisiana State University, Baton Rouge, LA, in 1988. His research interests include database modeling, design and performance optimization of data warehouses, OLAP, database support for e-commerce systems, and object-oriented analysis and design with UML. He has published over 80 refereed papers in conferences and journals and has served as a program committee member for over 60 conferences and workshops. He served as a program co-chair for DOLAP 98, CIKM 99, DOLAP 99, and ER 2003. He has received three teaching awards from Drexel University, including the Exemplary Teaching Award (1992), the Teaching Excellence Award (2000), and the most prestigious Lindback Distinguished Teaching Award (2001).
Juan Trujillo ([email protected]) is a professor at the Computer Science School at the University of Alicante, Spain. Trujillo received a PhD in computer science from the University of Alicante (Spain) in 2001. His research interests include database modeling, conceptual design of data warehouses, multidimensional databases, OLAP, and object-oriented analysis and design with UML. He has published papers in international conferences and journals such as ER, UML, ADBIS, WAIM, and IEEE Computer. He is a program committee member of several workshops and conferences, such as ER, DOLAP, and SCI.
Rick L. Wilson is the W. Paul Miller Professor of Business Administration and professor of management science and information systems at Oklahoma State University. He received his PhD in management information systems from the University of Nebraska-Lincoln. Dr. Wilson has published numerous papers in the areas of applied artificial intelligence and data mining, applied management science, and organizational impacts of emerging technologies.
Qing (Ken) Yang received a BSc in computer science from Huazhong University of Science and Technology, Wuhan, China, in 1982, an MASc in electrical engineering from the University of Toronto, Canada, in 1985, and a PhD in computer engineering from the Center for Advanced Computer Studies, University of Louisiana, in 1988. Presently, he is a distinguished engineering professor in the Department of Electrical and Computer Engineering at the University of Rhode Island, where he has been a faculty member since 1988. His research interests include computer architectures, memory systems, disk I/O systems, data storage architectures, performance evaluation, and local area networks. Dr. Yang is a senior member of the IEEE Computer Society and a member of SIGARCH.
Kokou Yétongnon received a BS in mathematics from the University of Bénin, Togo, in 1974, and an MS and a PhD in computer science and engineering from the University of Connecticut in 1981 and 1985, respectively. He joined the University of Bourgogne in Dijon, France, as an assistant professor in 1985 and is currently a professor in the Computer Science Department. He was the director of the Laboratoire d'Ingéniérie Informatique from 1991 to 1995 and currently heads the database and information technology research group. His research interests are in distributed computing systems and databases, including object-oriented information systems, cooperative systems, image databases, spatial database systems, distributed information systems, semantic data models, information system analysis, and performance.
Martin Zelm is an independent consultant and a founding member of the CIMOSA Association, involved in enterprise engineering and integration and related standardization. Formerly working for IBM Germany as a development manager in European R&D projects, he specialized in architectures for computer integrated manufacturing (CIM) and contributed to the design of the CIM Open System Architecture (CIMOSA) specifications for enterprise modeling. Martin Zelm holds a PhD in physics from the University of Hamburg, Germany. He is the author of a number of technical papers, many on user-oriented enterprise modeling.
Index

A
ACT architecture 241
activities 5
activity declaration 12
activity hierarchy 12
activity patterns 12
ActivityFlow 21
actors 5
adaptive control of thought (ACT) 258
additivity 53
ae-message 209
alternative path hierarchies 54
analysis models 283
architecture 5
ARPA knowledge sharing effort 160

B
B2B 298
B2B platforms 297
back channel 332
bias measures 98
business processes 1
business-to-business (B2B) 297

C
case study 148
categorization 222
categorization of dimensions 55, 60
classification hierarchies 54
cognitive theories 255
cognitive work analysis (CWA) 258
completeness 55
composite activity 8
computational equivalence 238, 244
conceptual model 255, 179
confidence measure 333
confidential 97
context 208
critical infrastructure protection 327

D
data declaration 13
data mining 96
data perturbation 98
data perturbation techniques 97
data security 96, 97
data warehouses 51
database (DB) built-in backup capabilities 112
database management systems (DBMSs) 352
degenerate dimensions 60
DEMO method 204
deontic object 210
descriptive perspective 206
device driver 113
dice 61
dimension attributes 53
dimensions 53, 165
direction-of-fit 210
document type definitions (DTDs) 52, 346, 352
dynamic workflow 1

E
e-business 108
e-message 205
elementary activity 8
empirical quality 187
enterprise modeling (EM) 130, 131, 143
enterprise modeling languages 131
entity relationship model (ERM) 256
European Thematic Network Project UEML 132
experiment 283
eXtensible Markup Language 52
eXtensible Stylesheet Language Transformations 52

F
fact attributes 53
facts 53
federated systems 301
first-order logic 218
framework 1
functional 283
functional and object oriented methodology (FOOM) 283

G
generalization-specialization relationships 60
generalized additive data perturbation (GADP) 96
geographic information systems 342
grammar integration step 305

H
HBA (host bus adapter) 115
Henkin semantics 228
heterogeneous data sources 341
hierarchical OO-DFDs 285
higher-order logics 219
higher-order types 218
Hovy's framework 163
human information-processing model 241

I
identifier problem 202, 203
illocutionary force 208
inference relationships 189
infological approach 205
information modeling 152, 238
information models 218
information operations 327
information sharing 342
information warfare 326
informational 238
informational equivalence 243
initial class diagram 285
INSTANCE and CLASS 190
interoperability 130, 138, 298
inverse property 190

J
J2EE 350

K
knowledge discovery 96

L
LBA (Logic Block Address) 116

M
Mayer's model of the learning process 258
measures 53
mediation approach 302
meta level 232
meta model 6
meta-modeling 143
modeling efficiency 256
multi-base systems 301
multidimensional (MD) modeling 51
multidimensional databases 51
multilevel framework of characteristics 164
multilevel tree of characteristics 165
multiple classification hierarchies 54

N
national infrastructure protection 328
network attached storage (NAS) 111
night watch 330
non-XML-native 303
nonfirst normal form 233

O
Object Constraint Language 52
Object Database Management Group 52, 59
object database standard 59
object-oriented databases 52
object-process diagrams (OPDs) 287
object-processes methodology (OPM) 284
object-relational databases 52
object-role modeling 219
online analytical processing (OLAP) 51
ontological problem 202, 203
ontology 159, 179
ontology model 179
ONTOMETRIC 162
OntoMetric Tool 163
OWL 179, 182

P
packages 51
password sniffing 332
perceived semantic quality 188
possibility measure 331
possibility theory 328
possible world 209
powertypes 219, 224
pragmatic 201
pragmatic meaning 208
pragmatic quality 189
pragmatics 202
predicate problem 203
private backup network (PBN) 109, 112
probability theory 326
procedure 13
propositional content 208
protection interval 330
protection interval (PI) 334
prototype 108

Q
quality evaluation 180
quality framework of models and modeling languages 144
quality of analysis specifications 283
query by example 61

R
readability 267
readability effectiveness 256
readability efficiency 256
reference ontology 168
role 8
role association 13
RORIB (real-time online remote information backup) 114

S
semantic meaning 208
semantic quality 188
semantic Web 179
semantics 201, 202
semiotic 178
semiotic quality framework 180
semiotic theory 260
slice 61
SOAP (an XML messaging protocol) 3
social quality 189
software-engineering 283
specialized modeling approach 144
specification step 304
speech act theory 207
star schema 55
storage area networks (SAN) 109
strictness 55
structured modeling 239
subtype 220
symmetric treatment 55
syntactic quality 187
system integration 298

T
TCP/IP 109
theories of short-term memory (STM) 261
theory of equivalence of representations (TER) 243, 258
translation step 305

U
UDDI (Universal Description, Discovery and Integration) 3
UML 51, 179
Unified Enterprise Modeling Language (UEML) 130
Unified Modeling Language (UML) 181, 219, 239
uniform resource identifier (URI) 3

W
W3C 3
Web services 298
workflow automation 2
Workflow Management Coalition (WfMC) 5
workflow management systems (WFMSs) 2
workflow process 5
workflow systems 2
working memory 242
WSDL (Web Services Description Language) 3
WWW Consortium (W3C) 350

X
X-Editor 304
XML 3, 299, 341
XML schema 52
XML-based legacy 304
XML-native 303
XQL 350