Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5740
Abdelkader Hameurlain Josef Küng Roland Wagner (Eds.)
Transactions on Large-Scale Data- and Knowledge-Centered Systems I
Volume Editors

Abdelkader Hameurlain
Paul Sabatier University
Institut de Recherche en Informatique de Toulouse (IRIT)
118, route de Narbonne
31062 Toulouse Cedex, France
E-mail: [email protected]

Josef Küng
Roland Wagner
University of Linz, FAW
Altenbergerstraße 69
4040 Linz, Austria
E-mail: {jkueng,rrwagner}@faw.at
Library of Congress Control Number: 2009932361 CR Subject Classification (1998): H.2, H.2.4, H.2.7, C.2.4, I.2.4, I.2.6
ISSN 0302-9743
ISBN-10 3-642-03721-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-03721-4 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12738045 06/3180 543210
Preface
Data management, knowledge discovery, and knowledge processing are core and hot topics in computer science. They are widely accepted as enabling technologies for modern enterprises, enhancing their performance and their decision-making processes. Since the 1990s the Internet has been the outstanding driving force for application development in all domains. An increase in the demand for resource sharing (e.g., computing resources, services, metadata, data sources) across different sites connected through networks has led to an evolution of data- and knowledge-management systems from centralized systems to decentralized systems enabling large-scale distributed applications providing high scalability.

Current decentralized systems still focus on data and knowledge as their main resource, characterized by:

- heterogeneity of nodes, data, and knowledge
- autonomy of data and knowledge sources and services
- large-scale data volumes, high numbers of data sources, users, computing resources
- dynamicity of nodes

These characteristics reveal:

(i) limitations of methods and techniques developed for centralized systems
(ii) requirements to extend or design new approaches and methods enhancing efficiency, dynamicity, and scalability
(iii) development of large-scale, experimental platforms and relevant benchmarks to evaluate and validate scaling

Feasibility of these systems relies basically on P2P (peer-to-peer) techniques and agent systems supporting scaling and decentralized control. Synergy between Grids, P2P systems, and agent technologies is the key to data- and knowledge-centered systems in large-scale environments.

The objective of the international journal on Large-Scale Data- and Knowledge-Centered Systems is to provide an opportunity to disseminate original research contributions and to serve as a high-quality communication platform for researchers and practitioners. The journal contains sound peer-reviewed papers (research, state of the art, and technical) of high quality. Topics of interest include, but are not limited to:

- data storage and management
- data integration and metadata management
- data stream systems
- data/web semantics and ontologies
- knowledge engineering and processing
- sensor data and sensor networks
- dynamic data placement issues
- flexible and adaptive query processing
- query processing and optimization
- data warehousing
- cost models
- resource discovery
- resource management, reservation, and scheduling
- locating data sources/resources and scalability
- workload adaptability in heterogeneous environments
- transaction management
- replicated copy control and caching
- data privacy and security
- data mining and knowledge discovery
- mobile data management
- data grid systems
- P2P systems
- web services
- autonomic data management
- large-scale distributed applications and experiences
- performance evaluation and benchmarking

The first edition of this new journal consists of journal versions of talks invited to the DEXA 2009 conferences and further invited contributions by well-known scientists in the field. Therefore the content covers a wide range of different topics in the field.

The second edition of this journal will appear in spring 2010 under the title: Datawarehousing and Knowledge Discovery (Guest editors: Mukesh K. Mohania (IBM, India), Torben Bach Pedersen (Aalborg University, Denmark), A Min Tjoa (Technical University of Vienna, Austria)).

We are happy that Springer has given us the opportunity to publish this journal and are looking forward to supporting the community with new findings in the area of large-scale data- and knowledge-centered systems. In particular we would like to thank Alfred Hofmann and Ursula Barth from Springer for their valuable support. Last, but not least, we would like to thank Gabriela Wagner for her organizational work.
June 2009
Abdelkader Hameurlain Josef Küng Roland Wagner
Editorial Board
Hamideh Afsarmanesh, University of Amsterdam, The Netherlands
Francesco Buccafurri, Università Mediterranea di Reggio Calabria, Italy
Qiming Chen, HP-Lab, USA
Tommaso Di Noia, Politecnico di Bari, Italy
Georg Gottlob, Oxford University, UK
Anastasios Gounaris, Aristotle University of Thessaloniki, Greece
Theo Härder, Technical University of Kaiserslautern, Germany
Zoé Lacroix, Arizona State University, USA
Sanjay Kumar Madria, University of Missouri-Rolla, USA
Vladimir Marik, Technical University of Prague, Czech Republic
Dennis McLeod, University of Southern California, USA
Mukesh Mohania, IBM India, India
Tetsuya Murai, Hokkaido University, Japan
Gultekin Ozsoyoglu, Case Western Reserve University, USA
Oscar Pastor, Polytechnic University of Valencia, Spain
Torben Bach Pedersen, Aalborg University, Denmark
Günther Pernul, University of Regensburg, Germany
Colette Rolland, Université Paris1 Panthéon Sorbonne, CRI, France
Makoto Takizawa, Seikei University, Tokyo, Japan
David Taniar, Monash University, Australia
Yannis Vassiliou, National Technical University of Athens, Greece
Yu Zheng, Microsoft Research Asia, China
Table of Contents

Modeling and Management of Information Supporting Functional Dimension of Collaborative Networks
    Hamideh Afsarmanesh, Ekaterina Ermilova, Simon S. Msanjila, and Luis M. Camarinha-Matos

A Universal Metamodel and Its Dictionary
    Paolo Atzeni, Giorgio Gianforme, and Paolo Cappellari

Data Mining Using Graphics Processing Units
    Christian Böhm, Robert Noll, Claudia Plant, Bianca Wackersreuther, and Andrew Zherdin

Context-Aware Data and IT Services Collaboration in E-Business
    Khouloud Boukadi, Chirine Ghedira, Zakaria Maamar, Djamal Benslimane, and Lucien Vincent

Facilitating Controlled Tests of Website Design Changes Using Aspect-Oriented Software Development and Software Product Lines
    Javier Cámara and Alfred Kobsa

Frontiers of Structured Business Process Modeling
    Dirk Draheim

Information Systems for Federated Biobanks
    Johann Eder, Claus Dabringer, Michaela Schicho, and Konrad Stark

Exploring Trust, Security and Privacy in Digital Business
    Simone Fischer-Huebner, Steven Furnell, and Costas Lambrinoudakis

Evolution of Query Optimization Methods
    Abdelkader Hameurlain and Franck Morvan

Holonic Rationale and Bio-inspiration on Design of Complex Emergent and Evolvable Systems
    Paulo Leitao

Self-Adaptation for Robustness and Cooperation in Holonic Multi-Agent Systems
    Paulo Leitao, Paul Valckenaers, and Emmanuel Adam

Context Oriented Information Integration
    Mukesh Mohania, Manish Bhide, Prasan Roy, Venkatesan T. Chakaravarthy, and Himanshu Gupta

Data Sharing in DHT Based P2P Systems
    Claudia Roncancio, María del Pilar Villamil, Cyril Labbé, and Patricia Serrano-Alvarado

Reverse k Nearest Neighbor and Reverse Farthest Neighbor Search on Spatial Networks
    Quoc Thai Tran, David Taniar, and Maytham Safar

Author Index
Modeling and Management of Information Supporting Functional Dimension of Collaborative Networks

Hamideh Afsarmanesh (1), Ekaterina Ermilova (1), Simon S. Msanjila (1), and Luis M. Camarinha-Matos (2)

(1) Informatics Institute, University of Amsterdam, Science Park 107, 1098 XG, Amsterdam, The Netherlands
{h.afsarmanesh,e.ermilova,s.s.msanjila}@uva.nl
(2) Faculty of Sciences and Technology, New University of Lisbon, Quinta da Torre, 2829-516, Monte Caparica, Portugal
[email protected]

Abstract. Fluent creation of opportunity-based short-term Collaborative Networks (CNs) among organizations or individuals requires the availability of a variety of up-to-date information. A pre-established, properly administrated strategic-alliance Collaborative Network (CN) can act as the breeding environment for creation/operation of opportunity-based CNs, and effectively address the complexity, dynamism, and scalability of their actors and domains. Administration of these environments, however, requires an effective set of functionalities, founded on top of strong information management. The paper introduces the main challenges of CNs and their management of information, and focuses on the Virtual organizations Breeding Environment (VBE), which represents a specific form of strategic alliance. It then focuses on the functionalities needed for effective administration/management of VBEs, and exemplifies information management challenges for three of their subsystems, handling the ontology, the profiles and competencies, and the rational trust.

Keywords: Information management for Collaborative Networks (CNs), virtual organizations breeding environments (VBEs), Information management in VBEs, Ontology management, competency information management, rational trust information management.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 1–37, 2009. © Springer-Verlag Berlin Heidelberg 2009

1 Introduction

The emergence of collaborative networks, as collections of geographically dispersed autonomous actors that collaborate through computer networks, has enabled both organizations and individuals to effectively achieve common goals that go far beyond the ability of each single actor, and to provide cost-effective solutions and value-creating functionalities, services, and products. The paradigm of “Collaborative Networks (CN)”, defined during the last decade, represents a wide variety of networks of organizations as well as communities of individuals, each with distinctive characteristics and features. While the taxonomy of existing CNs, as presented later (in Section 2.1), indicates their categorical differences, some of their main characteristics are briefly introduced below.

Different collaborative networks manifest a wide diversity in structural forms, duration, behavioral patterns, and interaction forms. From the process-oriented chain structures observed in supply chains, to those centralized around dominant entities, and the project-oriented federated networks, there exists a wide range of collaborative network structures [1; 2; 3; 4]. Each structure influences differently the visibility level of each actor in the network, the intensity of its activities, its co-working, and its involvement in decision making. Other important elements that vary among networks are their life-cycle phases and durations. Goal-oriented networks are shorter-term and typically triggered by collaboration opportunities that arise in the market/society, as represented by the case of VOs (virtual organizations) established for a timely response to singular opportunities. Long-term networks, on the other hand, are strategic alliances/associations with the main purpose of enhancing the chances of their members to get involved in future opportunity-triggered collaboration networks, and increasing the visibility of their actors, thus serving as breeding environments for goal-oriented networks. Examples of long-term networks include industry clusters or industrial districts and sector-based alliances.

In terms of the types of interaction among the actors involved in collaborative networks, although there is no consensus among researchers, some working definitions are provided [5] for four main classes of interactions: networking, coordinated networking, cooperation, and collaboration. There is an intuitive notion of what collaboration represents, but this concept is often confused with cooperation. Some researchers even use the two terms interchangeably.
The ambiguities around these terms reach a higher level when other related terms, such as networking, communication, and coordination, are also considered [6; 7]. Therefore, it is relevant and important for CN research that the concepts behind these interaction terms are formalized, especially for the purpose of defining a reference model for collaborative networks, as later addressed in this paper (in Section 3). In an attempt to clarify these various concepts, based on [5], the following working definitions are proposed for these four classes of interactions, where in fact every concept defined below is itself a sub-class of the concept(s) defined above it:

Networking – involves communication and information exchange among involved parties for mutual benefit. It shall be noted that this term has a broad use in multiple contexts and often with different meanings. In the collaborative networks area of research, when referring to “enterprise network” or “enterprise networking”, the intended meaning is probably “collaborative network of enterprises”.

Coordinated Networking – in addition to the above, it involves complementarity of goals of different parties, and aligning/altering activities so that more efficient results can be achieved. Coordination, that is, the act of working together harmoniously, is one of the main components of collaboration.

Cooperation – involves not only information exchange and alignment of activities, but also sharing of some resources towards achieving compatible goals. Cooperation may be achieved by division of some labor (not extensive) among participants.

Collaboration – in addition to the above, it involves joint goals/responsibilities with specific process(es) in which parties share information, resources, and capabilities and jointly plan, implement, and evaluate activities to achieve their common goals. It in turn implies sharing risks, losses, and rewards. If desired by its involved parties, the collaboration can also give the image of a joint identity to the outside. In practice, collaboration typically involves mutual engagement of participants to solve a problem together, which also implies reaching mutual trust that takes time, effort, and dedication.

As addressed above, different forms of interaction are suitable for different structural forms of CNs. For example, the long-term strategic alliances are cooperative environments, since they primarily comprise actors with compatible and/or complementary goals towards which they align their activities. The shorter-term goal-oriented networks, however, require intense co-working among their actors to reach their jointly established common goals, which represent the reason for their existence, and are therefore collaborative environments. On the other hand, most of the current social networks show just a networking level of interaction.

From a different perspective, the different forms of interaction mentioned above can also be seen as different levels of “collaboration maturity”. Namely, the interaction among actors in the network may strengthen in time, from simple networking interaction to intense collaboration. This implies a gradual increase in the level of co-working as well as in the risk taking, commitment, invested resources, etc., of the involved participants. Therefore, operational CNs represent a variety of interactions and inter-relationships among their heterogeneous and autonomous actors, which in turn increases the complexity of this paradigm.
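The sub-class relation among the four interaction classes can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the names `Interaction`, `ADDED_FEATURES`, `TYPICAL_LEVEL`, and `features` are hypothetical and do not come from the paper or from [5], and the feature labels are paraphrases of the working definitions above. It encodes each class by what it adds to the class before it, so that a higher level accumulates everything below it.

```python
from enum import IntEnum


class Interaction(IntEnum):
    """The four interaction classes, ordered from weakest to strongest."""
    NETWORKING = 1
    COORDINATED_NETWORKING = 2
    COOPERATION = 3
    COLLABORATION = 4


# What each class adds on top of the previous one (labels are paraphrases
# of the working definitions, chosen here for illustration only).
ADDED_FEATURES = {
    Interaction.NETWORKING: {"communication", "information exchange"},
    Interaction.COORDINATED_NETWORKING: {"complementary goals", "aligned activities"},
    Interaction.COOPERATION: {"shared resources", "compatible goals"},
    Interaction.COLLABORATION: {"joint goals", "joint planning",
                                "shared risks and rewards"},
}

# Typical interaction level per kind of CN, as stated in the text.
TYPICAL_LEVEL = {
    "social network": Interaction.NETWORKING,
    "long-term strategic alliance": Interaction.COOPERATION,
    "goal-oriented network": Interaction.COLLABORATION,
}


def features(level: Interaction) -> set:
    """Return all features implied by a level, including those of lower levels."""
    implied = set()
    for lower in Interaction:  # iterates in definition order
        if lower <= level:
            implied |= ADDED_FEATURES[lower]
    return implied
```

Read as "collaboration maturity", a network that strengthens over time simply moves to a higher `Interaction` value, and `features` then reports the additional commitments this implies.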
It shall be noted that in this paper, and in most other literature in the field, the term “collaborative networks” or “CNs” is the generic name used when referring to all varieties of such networks.

1.1 Managing the Information in CNs

In managing the information in CNs, even if all information is semantically and syntactically homogeneous, a main generic challenge is related to assuring the availability of the strategic information within the network that is required for proper coordination and decision making. This can be handled through enforcement of a push/pull mechanism and establishment of proper mapping strategies and components between the information managed at the different sites belonging to actors in the network and all those systems (or sub-systems) that support different functionalities of the CN and its activities during its life cycle. Therefore, it is necessary that various types of distributed information are collected from the autonomous actors involved in the CNs. This information shall then be processed, organized, and made accessible within the network, both for navigation by different CN stakeholders and for processing by different software systems running at the CN. However, although the information about actors evolves in time, which is typical of dynamic systems such as CNs, and therefore needs to be kept up to date, there is no need for a continuous flow of all the information from each legacy system to the CN. Such a flow would generate a major overload on the information management systems at the CN. Rather, for effective operation and management of the CN, partial information needs to be pulled/pushed between the legacy systems and the CN only at some intervals.

The need for access to information also varies depending on the purpose for which it is requested. These variations in turn pose a second generic information management challenge, which is related to the classification, assessment, and provision of the required information based on the intended use cases in CNs. Both of the above generic information challenges in CNs are addressed in the paper, and exemplified for three key CN functionalities, namely common ontology engineering, competency management, and trust management.

A third generic information management challenge is related to modeling the variety and complexity of the information that needs to be processed by the different functionalities that support the management and operation of the CNs. While some of these functionalities deal with information that is known and stored within the sites of the network’s actors (e.g. data required for actors’ competency management), the information required for some other functionalities of CNs may be unknown, incomplete, or imprecise. For these, soft computing approaches, such as causal analysis and reasoning (addressed in Section 7.2) or other techniques introduced in computational intelligence, shall be applied to generate the needed information (e.g. data needed for trust management). The issue of modeling the information that needs to be handled in CNs is addressed in detail in the paper and also exemplified through the three example functionalities mentioned above.

There are, however, a number of other generic challenges related to the management of the CN information, and it can be expected that more challenges will be identified in time as the need for other functional components unfolds in the research on supporting the management and operation of CNs.
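The idea of pulling only partial information from legacy systems at intervals, rather than streaming everything continuously, can be sketched as follows. This is a minimal illustration under stated assumptions, not the mechanism used by any CN management system described in the paper: the classes `LegacySystem` and `CNRepository`, the per-field version counter, and the notion of a fixed set of shared fields are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LegacySystem:
    """Hypothetical stand-in for an actor's local information system."""
    actor: str
    records: dict  # field name -> (value, version number)

    def changed_since(self, version: int, fields: set) -> dict:
        """Return only the requested fields whose version is newer than `version`."""
        return {name: rec for name, rec in self.records.items()
                if name in fields and rec[1] > version}


@dataclass
class CNRepository:
    """Hypothetical CN-level store holding a partial copy of actor data."""
    shared_fields: set                              # fields the CN is entitled to see
    store: dict = field(default_factory=dict)       # (actor, field) -> value
    last_seen: dict = field(default_factory=dict)   # actor -> highest version pulled

    def pull(self, system: LegacySystem) -> list:
        """Pull the delta for one actor; meant to be called at intervals."""
        since = self.last_seen.get(system.actor, 0)
        delta = system.changed_since(since, self.shared_fields)
        for name, (value, version) in delta.items():
            self.store[(system.actor, name)] = value
            since = max(since, version)
        self.last_seen[system.actor] = since
        return sorted(delta)


# Example: only the shared "competency" field ever reaches the CN.
legacy = LegacySystem("org-a", {"competency": ("welding", 1),
                                "payroll": ("internal", 1)})
repo = CNRepository(shared_fields={"competency"})
first = repo.pull(legacy)       # transfers the competency record
second = repo.pull(legacy)      # nothing changed, so nothing is transferred
legacy.records["competency"] = ("welding, milling", 2)
third = repo.pull(legacy)       # transfers only the updated record
```

The design choice illustrated here is that each pull moves only what is both shared and changed, which keeps the load on the CN's information management system proportional to the updates rather than to the size of each legacy system.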
Among the other identified generic challenges, we can mention ensuring the consistency between the semantically and syntactically heterogeneous information managed locally at each organization’s legacy systems and the information managed by the management system of the CN, as well as ensuring its availability for access by authorized CN stakeholders (e.g. individuals or organizations) when necessary. At the heart of this challenge lies the establishment of the needed interoperation infrastructure, as well as a federated information management system supporting the inter-linking of autonomous information management systems. Furthermore, challenges related to update mechanisms among autonomous nodes are relevant. Nevertheless, these generic challenges are in fact common to many other application environments and are not specific to the CN information management area; they therefore fall outside the scope of this paper.

The remaining sections are organized as follows. Section 2 addresses collaborative networks by presenting a taxonomy for them and describing the main requirements for establishing CNs, while emphasizing their information management and modeling aspects. Section 3 addresses the ARCON reference model for collaborative networks, focusing only on its endogenous elements. Section 4 narrows down on the details of the functional dimension of the endogenous elements and exemplifies it for one specific kind of CN, i.e. the management system of the VBE strategic alliance. Specific examples of modeling and management of information are then provided in Sections 5, 6, and 7 for three subsystems of a VBE management system, addressing respectively VBE ontology engineering, management of profiles and competencies in VBEs, and assessment and management of trust in VBEs. Section 8 concludes the paper.
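To make the interval-based, partial exchange of information discussed in Section 1.1 more concrete, the following minimal Python sketch shows one possible pull step. It is only an illustration under stated assumptions: the classes `LegacySystem` and `CNRepository`, the per-field version counter, and the field names are all hypothetical and are not taken from any CN management system described in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class LegacySystem:
    """Hypothetical stand-in for the legacy system of one CN actor."""
    actor: str
    records: dict = field(default_factory=dict)  # field name -> (value, version)

    def update(self, name, value):
        """Store a new value locally and bump its version counter."""
        _, version = self.records.get(name, (None, 0))
        self.records[name] = (value, version + 1)


@dataclass
class CNRepository:
    """Hypothetical CN-side store that keeps only the fields the CN needs."""
    wanted: set
    data: dict = field(default_factory=dict)  # (actor, field name) -> (value, version)

    def pull(self, system):
        """Copy the wanted fields that changed since the previous pull."""
        changed = []
        for name in self.wanted.intersection(system.records):
            value, version = system.records[name]
            key = (system.actor, name)
            if self.data.get(key, (None, 0))[1] < version:
                self.data[key] = (value, version)
                changed.append(name)
        return sorted(changed)


legacy = LegacySystem("org-A")
legacy.update("capacity", 120)
legacy.update("internal_costs", 9000)  # local data the CN never asks for
cn = CNRepository(wanted={"capacity", "certifications"})

first = cn.pull(legacy)    # first interval: "capacity" is new to the CN
second = cn.pull(legacy)   # second interval: nothing has changed
legacy.update("capacity", 150)
third = cn.pull(legacy)    # third interval: only the changed field is copied
print(first, second, third)
```

In this sketch the CN never receives `internal_costs`, and an unchanged field is not transferred again, which mirrors the point that a continuous flow of all information from each legacy system is unnecessary.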
2 Establishing Collaborative Networks

Successful creation and management of inter-organizational and inter-personal collaborative networks is challenging. Cooperation and collaboration in CNs, although having the potential to bring considerable benefits, or even representing a survival mechanism for the involved participants, are difficult processes (as explained in Section 2.2), which quite often fail [8; 9]. Therefore, there are a number of requirements that need to be satisfied to increase their chances of success. Clearly, the severity of each requirement depends on the type and specificities of the CN. In other words, the nature, goal, and vision of each case determine its critical points and requirements. For instance, for the Virtual Laboratory type of CN, setting up the common collaboration infrastructure a priori and maintaining it afterwards pose some of the main challenges. However, for another type of CN that focuses on international decision making on environmental issues such as global warming, setting up and maintaining the collaboration infrastructure is not too critical; rather, the provision of mediation mechanisms and tools to support building trust among the CN actors and reaching agreements on the definition of common CN policies poses some of the main challenges. It is therefore important to briefly discuss the various types of CNs before addressing their requirements, which is done below through the taxonomy of CNs.

2.1 Taxonomy and Working Definitions for Several Types of CN

A first CN taxonomy is defined in [10], addressing the large diversity of manifestations of collaborative networks in different application domains. A set of working definitions for the terms addressed in Fig. 1 is also provided in [5].
A few of these definitions that are necessary for the later sections of this paper are quoted below from [5]; namely the definitions of collaborative networks (CN), virtual organizations breeding environments (VBE), virtual organizations (VO), etc.

“A collaborative network (CN) is a network consisting of a variety of actors (e.g. organizations and people) that are largely autonomous, geographically distributed, and heterogeneous in terms of their operating environment, culture, social capital and goals, but that collaborate to better achieve common or compatible goals, and whose interactions are supported by computer network.”

“Virtual Organization (VO) – represents an alliance comprising a set of (legally) independent organizations that share their resources and skills, to achieve their common mission / goal, but that is not limited to an alliance of profit enterprises. A virtual enterprise is therefore, a particular case of virtual organization.”

“Dynamic Virtual Organization – typically refers to a VO that is established in a short time in order to respond to a competitive market opportunity, and has a short life cycle, dissolving when the short-term purpose of the VO is accomplished.”

“Long-term strategic network or breeding environments – a strategic alliance established with the purpose of being prepared for participation in collaboration opportunities, and where in fact not collaboration but cooperation is practiced among their members. In other words, they are alliances aimed at offering the conditions and environment to support rapid and fluid configuration of collaboration networks, when opportunities arise.”
[Figure: taxonomy tree rooted at Collaborative Network (CN). Node labels: Collaborative Networked Organization (CNO), Ad-hoc Collaboration, Long-term strategic network, Goal-oriented network, VO Breeding Environment (VBE), Professional Virtual Community (PVC), Continuous activity driven net, Grasping opportunity driven net, Virtual Organization (VO), Dynamic VO, Virtual Enterprise (VE), Extended enterprise, Virtual Team (VT). Examples shown include industry cluster, industrial district, business ecosystem, collaborative virtual lab, disaster rescue net, supply chain, and virtual government.]
Fig. 1. Taxonomy of Collaborative Networks
“VO Breeding Environments (VBE) – represents “strategic” alliance of organizations (VBE members) and related supporting institutions (e.g. firms providing accounting, training, etc.), adhering to a base long-term cooperation agreement and adopting common operating principles and infrastructures, with the main goal of increasing both their chances and preparedness of collaboration in potential VOs”.

“Professional Virtual Communities (PVC) is an alliance of professional individuals, and provide an environment to facilitate the agile and fluid formation of Virtual Teams (VTs), similar to what a VBE aims to provide for the VOs.”

“Virtual Team (VT) is similar to a VO but formed by individuals, not organizations, as such a virtual team is a temporary group of professionals that work together towards a common goal such as realizing a consultancy job, a joint project, etc., and that use computer networks as their main interaction environment.”

2.2 Base Requirements for Establishing CNs

A generic set of requirements represents the base pre-conditions for setting up CNs: (1) definition of a common goal and vision, (2) performing a set of initiating actions, and (3) establishing a common collaboration space. Furthermore, after the CN is initiated, the environment needs to operate properly. Its coordination and management, as well as reaching the needed agreements among its actors for performing the needed tasks, represent another set of challenges: (1) performing coordination, support, and management of activities, and (2) achieving agreements and contracts. The following five sub-sections briefly address these main basic requirements as identified and addressed within the CN area of research, while emphasizing their information management challenges in italics.
2.2.1 Defining a Common Goal and Vision

Collaboration requires the pre-existence of a motivating common goal and vision to represent the joint/common purpose for establishment of the collaboration. In spite of
all difficulties involved in the process of cooperation/collaboration, the motivating factor for establishing CNs is the expectation of being able to reach results that could not be reached by the involved actors if working alone. Therefore, the common goal and vision of the CN represent its existential purpose and the motivation for attracting actors to the required cooperation/collaboration processes [5]. At present, the information related to the common goal and vision of CNs is typically stored in textual format and is made available to the public through proper interfaces. Establishing a well-conceived vision, however, needs the involvement of all actors in the network. To properly participate in formulating the vision, the actors need to be well informed, which in turn requires the availability of up-to-date information regarding many aspects of the network. Both the management of the information required for visioning and the assurance of its effective accessibility to all actors within the network are challenging, as later addressed through the development of an ontology for CNs.

2.2.2 Performing a Set of Initiating Actions

There are a number of initiating actions that need to be taken as a pre-condition to establishing CNs. These actions are typically taken by the founder(s) of the CN and may include [11; 12]: identifying interested parties and bringing them together; defining the scope of the collaboration and its desired outcomes; defining the structure of the collaboration in terms of leadership, roles, and responsibilities; setting the plan of actions in terms of access to resources, task scheduling and milestones, and decision-making plan; defining policies, e.g. for handling disagreements/conflicts, accountability, rewards and recognition, ownership of generated assets, and intellectual property rights; defining the evaluation/assessment measures, mechanisms, and process; and identifying the risks and planning contingency measures.
Typically, most information related to the initiating actions is strategic and considered proprietary, to be accessed only by the CN’s administration. The classification of information in CNs to ensure its confidentiality and privacy, while guaranteeing enough access at the level required by each CN stakeholder, is a challenging task for the information management system of the network’s administration.

2.2.3 Substantiating a Common Collaboration Space

Establishing CNs requires the pre-establishment of their common collaboration space. In this context, we define collaboration space as a generic term addressing all the needed elements, principles, infrastructure, etc. that together provide the environment needed for CN actors to be able to cooperate/collaborate with each other. Establishment of such spaces is needed to enable and facilitate the collaboration process. Typically it addresses the following challenges:
- Common concepts and terminology (e.g. common meta-data defined for databases or an ontology, etc., specifying the collaboration environment and purpose) [13].
- Common communication infrastructure and protocols for interaction and data/information sharing and exchange (e.g. the internet, GRID, open or commercial tools and protocols for communication and information exchange, document management systems for information sharing, etc.) [14].
- Common working and sharing principles, value system, and policies (e.g. procedures for cooperation/collaboration and sharing different resources, assessment of collaboration preparedness, measurement of the alignment between value systems, etc.) [15; 16; 17]. The CN-related principles and policies are typically modeled and stored by its administration and made available to all its stakeholders.
- Common set of base trustworthiness criteria (e.g. identification, modeling, and specification of periodically required measurements related to some common aspects of each actor, which shall fall above a certain threshold for all stakeholders, in order to ensure that all joining actors as well as the existing stakeholders possess a minimum acceptable trust level) [18]. It is necessary to model, specify, store, and manage the entities and concepts related to trust establishment and their measurements for the different actors.
- Harmonization/adaptation of heterogeneities among stakeholders due to external factors, such as those related to actors from different regions involved in virtual collaboration networks, e.g. differences in time, language, laws/regulations, and socio-cultural aspects [19]. Some of these heterogeneities affect the sharing and exchange of information among the actors in the network, for which proper mappings and/or adaptors shall be developed and applied.

There are certain other specific characteristics of CNs that need to be supported by their common collaboration space. For example, some CNs may require simultaneous or synchronous collaboration, while others depend on asynchronous collaboration. Although remote/virtual collaboration, which may involve both synchronous and asynchronous interactions, is the most relevant case in collaborative networks, some CNs may require the co-location of their actors [20].
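The base trustworthiness criteria listed above amount to checking each actor's periodic measurements against common minimum thresholds. A minimal Python sketch of such a check follows. The criterion names, the threshold values, and the function `meets_base_trust` are assumptions made purely for illustration; the actual criteria and their measurement are the subject of [18].

```python
def meets_base_trust(measurements, thresholds):
    """Check one actor's measurements against the common minimum thresholds.

    Returns (ok, failed), where failed lists every criterion that is
    missing or below its threshold.
    """
    failed = sorted(
        name
        for name, minimum in thresholds.items()
        if measurements.get(name, float("-inf")) < minimum
    )
    return (not failed, failed)


# Hypothetical criteria and threshold values, for illustration only.
thresholds = {"on_time_delivery": 0.8, "financial_stability": 0.6}

applicant = {"on_time_delivery": 0.9, "financial_stability": 0.5}
member = {"on_time_delivery": 0.85, "financial_stability": 0.7}

print(meets_base_trust(applicant, thresholds))
print(meets_base_trust(member, thresholds))
```

Treating a missing measurement as a failure is a design choice of this sketch: it keeps an actor that has not yet supplied its data from passing the base check by default.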
2.2.4 Substantiating Coordination, Supporting, and Management of Activities

A well-defined approach is needed for the coordination of CN activities, and consequently the establishment of mechanisms, tools, and systems is required for common coordination, support, and management of activities in the CN. A wide range of approaches can be considered for the coordination of CNs, among which one would be selected for each CN based on its common goal and vision. Furthermore, depending on the selected coordination approach for each CN, the management of its activities requires different supporting mechanisms and tools. For instance, on one end of the spectrum, for the voluntary involvement of biodiversity scientists in addressing a topic of public interest in a community, a self-organized strategic alliance may be established. For this CN, a federated structure/coordination approach can be employed, where all actors have equal rights in decision making as well as in suggesting ideas for the next steps/plans for the management of the CN, which will be voted on in this community. On the other end of the spectrum, however, for car manufacturing, a goal-oriented CN may be established for which a fully centralized coordination approach can be applied, using a star-like management approach where most activities of the CN actors are fully guided and measured by one entity in the network, with almost no involvement from others. In practice, all current goal-oriented CNs typically fall somewhere in between these two extreme cases. For the long-term strategic CNs, the current trend is towards
establishing different levels of roles and involvement for actors in leading and decision making at the CN level, with a centralized management approach that primarily aims to support CN actors in their activities and to guide them towards better performance. Short-term goal-oriented CNs, on the other hand, vary in their coordination approach and management. For instance, in the product/services industry, these CNs are typically centralized in their coordination to the extent possible, and are managed in the style of a single organization. CNs in research show a different picture. For example, in EC-funded research projects, the coordination of the consortium organized for the project is usually assumed by one or a few actors that represent the CN to the outside, but internally the management of activities is far more decentralized, and decision making is in many cases done in a federated manner and through voting.

Nevertheless, no matter which coordination approach is adopted, in order for CNs to operate successfully, their management requires a number of supporting tools and systems, which shall be determined and provided in advance of the establishment of the CNs. This subject is further addressed in Section 4 of this paper, where the functional dimension of CNs, and specifically the main functionality required for the management of long-term strategic alliances, is enumerated and exemplified. As one example, in almost all CNs the involved actors need to know about each other’s capabilities, capacities, resources, etc., which are referred to in [21] as the competency of the involved actors. In breeding environments, either VBEs or PVCs, for instance, such competency information constitutes the base for the partner search by the broker/planner, who needs to match partners’ competencies against the characterization of an emerged opportunity in order to select the best-fit partners.
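The matching of partners' competencies against an emerged opportunity can be illustrated with a small Python sketch. It is a deliberately simplified assumption, not the partner search method of [21]: competencies are reduced to numeric capacities per skill, and the hypothetical function `rank_partners` scores each actor by the fraction of the required capacity it can cover.

```python
def rank_partners(competencies, required):
    """Rank actors by the fraction of the required capacity they can cover.

    competencies: actor -> {skill: offered capacity}
    required:     skill -> needed capacity (assumed to have a positive total)
    """
    total_needed = sum(required.values())
    scores = {}
    for actor, offered in competencies.items():
        covered = sum(
            min(offered.get(skill, 0), need) for skill, need in required.items()
        )
        scores[actor] = covered / total_needed
    # Best coverage first; ties are broken alphabetically by actor name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical opportunity and competency profiles, for illustration only.
required = {"welding": 10, "design": 5}
competencies = {
    "A": {"welding": 10, "design": 0},
    "B": {"welding": 5, "design": 5},
    "C": {"welding": 20, "design": 5},
}

ranked = rank_partners(competencies, required)
for actor, score in ranked:
    print(actor, round(score, 2))
```

Capping each offer at the required amount keeps an actor with a large surplus in one skill from masking a gap in another.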
Similarly, as an antecedent to any collaboration, some level of trust must pre-exist among the involved actors in the CN and needs to be gradually strengthened depending on the purpose of the cooperation/collaboration. Therefore, as addressed in [18], as a part of the CN management system, rational measurement of the performance and achievements of CN actors can be applied to determine the trustworthiness of its members from different perspectives. Considering these and other functionalities needed for effective management of CNs, the classification, storage, and manipulation of their related information (e.g. actors’ competencies and their trust-related criteria) need to be effectively supported, which is challenging.

2.2.5 Achieving Agreements and Contracts among Actors

Successful operation of the CN requires reaching common agreements/contracts among its actors [12; 22]. At the point of joining the CN, actors must agree on its common goals and agree to follow its vision during the collaboration process, towards the achievement of the common goal. They must also agree with the established common collaboration space for the CN, including the common terminology, communication infrastructure, and its working and sharing principles. Additionally, through the common collaboration space, a shared understanding of the problem at hand, as well as of the nature/form of sharing and collaboration at the CN level, should be achieved. Further on, clear agreements should be reached among the actors on the distribution of tasks and responsibilities, the extent of commitments, the sharing of resources, and the distribution of both the rewards and the losses and liabilities. Some details in relation
to these challenges are addressed in [23; 24]. The ownership and sharing of resources shall be dealt with, whether they are resources brought in by CN actors or resources acquired by the coalition for the purpose of performing the tasks. Successful collaboration depends on the sharing of responsibilities among its actors. It is as important to have a clear assignment of responsibilities during the process of achieving the CN goals as it is afterwards in relation to liabilities for the achieved results. The level of commitment of actors shall also be clearly defined, e.g. whether or not all actors are collectively responsible for all results. Similarly, the division of gains and losses shall be agreed upon by the CN actors. Here, depending on the type of CN, its value system, and the area in which it operates, a benefit/loss model shall be defined and applied. Such a model shall address the perception of “exchanged value” in the CN and the expectations and commitment of its members. For instance, the creation of intellectual property at the CN is in most cases not linearly related to the proportion of resources invested by each actor. Therefore, a fair way of determining the individual contribution to the results of the CN shall be established and applied in the benefit/loss model for the CN. Due to their relevance and importance for the successful operation of CNs, detailed information about all agreements and contracts established with the actors is stored and preserved by the CN administration. Furthermore, some CNs with advanced management systems model and store these agreements and contracts within a system so that they can be semi-automatically enforced. For example, such a system can issue automatic warnings when a CN actor has not fulfilled, or has violated, some time-bound terms of its agreement/contract.
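The semi-automatic enforcement mentioned above can be sketched as a periodic scan over stored contract terms. The Python example below is a minimal illustration under stated assumptions: the `Obligation` record, its fields, and the warning text are hypothetical, and real agreements would carry far richer terms than a single due date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Obligation:
    """Hypothetical record of one time-bound term of an agreement/contract."""
    actor: str
    description: str
    due: date
    fulfilled: bool = False


def overdue_warnings(obligations, today):
    """Return one warning per unfulfilled obligation whose due date has passed."""
    return [
        f"WARNING: {o.actor} has not fulfilled '{o.description}' "
        f"(due {o.due.isoformat()})"
        for o in obligations
        if not o.fulfilled and o.due < today
    ]


# Illustrative data only.
obligations = [
    Obligation("org-A", "deliver design report", date(2009, 3, 1)),
    Obligation("org-B", "share test results", date(2009, 3, 1), fulfilled=True),
    Obligation("org-C", "provide capacity plan", date(2009, 6, 1)),
]

warnings = overdue_warnings(obligations, today=date(2009, 4, 15))
for line in warnings:
    print(line)
```

Passing `today` as a parameter rather than reading the system clock keeps the check reproducible, so the CN administration could rerun it for any reference date.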
Organizing, processing, and interfacing the variety of information to different stakeholders, as required to support both reaching agreements and enforcing them, is quite challenging.

2.3 Relation between the Strategic-Alliance CNs and the Goal-Oriented CNs

Scarcity of resources/capacities owned by actors is at the heart of the motivation for collaboration. For instance, large organizations typically hesitate to collaborate with others when they own sufficient resources and skills to fully respond to emerging opportunities. On the other hand, due to the lack of needed resources and skills, SMEs in different sectors increasingly tend towards collaboration and joining their efforts. Therefore, a main motivation for the establishment of CNs is to create a larger set of resources and skills, in order to compete with others and to survive in turbulent markets. Even in nature, we can easily find alliances among many different species (e.g. bees, ants, etc.), which form communities and collaborate to compete for increasing both their resources and their power, which are needed for their survival [25]. Therefore, in today’s market/society, we like to call the “scarcity of resources (e.g. capabilities/capacities) the mother of collaborative networks”. Nevertheless, even though a bigger pool of resources and skills is generated through collaboration among individuals or organizations, these pools face variable needs as market conditions evolve, are still limited, and should therefore be dealt with through careful and effective planning. But unlike the case of a single actor that is self-concerned in its decision making, e.g. whether or not to approach an opportunity, in the case of collaborative networks decision making on this issue is quite challenging and is usually addressed by a stakeholder acting as the broker/planner of goal-oriented
CNs. Further to this decision, there are a large number of other challenges involved in the creation phase of goal-oriented CNs, e.g. selecting the best-fit partners for an emerged opportunity, namely finding the best potential actors through effective matching of their limited resources and skills against the required characteristics of the emerged opportunity. Other challenges include the setting up of the common infrastructure, etc., as addressed in the previous section. Additionally, trust, which is a fundamental requirement for any collaboration, is built through a long-term process and cannot be established if potential participants have no prior knowledge of each other. Many of these challenges either become serious inhibitors to the mere establishment of goal-oriented CNs by their broker/planner, or constitute a serious cause of their failure in the later stages of the CN’s life cycle [11].

As one solution approach, both research and practice have shown that the creation/foundation of goal-oriented short-term CNs to respond to emerging opportunities can greatly benefit from the pre-existence of a strategic alliance/association of actors, and thereby become both cost and time effective. A line of research in the CN discipline is therefore focused on these long-term alliances. It starts with the investigation of the existing networks that act as such associations (the so-called 1st generation strategic alliances), but focuses specifically on expanding their roles and operations in the market/society, and thus on the modeling and development of the next generation of such associations (the so-called 2nd generation VBEs) [26]. Research on strategic alliances of organizations and individuals on the one hand focuses on providing the support environment and the functionalities, tools, and systems that are required to improve the qualification and positioning of this type of CN in the market/society in accordance with its own goal, vision, and value system.
Besides defining the common goal/vision, performing the needed initiating actions, and establishing the common collaboration space, the alliance plays the main role in coordinating the activities of the association and in achieving agreements among its actors, towards their successful establishment of goal-oriented CNs. Therefore, a part of the research in this area focuses on establishing a strong management system for this type of CN [27], introducing the fundamental functionality and information models needed for its effective operation and evolution. These functionalities, also addressed later in this paper, concern the management of information needed for effective day-to-day administration of activities in breeding environments, e.g. the engineering of the CN ontology, the management of actors’ competencies and profiles, and the specification of the criteria and management of the information related to measuring the level of trust in actors in the alliance. Furthermore, a number of specific subsystems are needed in this environment to support the creation of goal-oriented short-term CNs, including the search for opportunities in the market/society, the matching of opportunities against the competencies (resources, capacities, skills, etc.) available in the alliance, and negotiation and reaching agreement among the potential partners. On the other hand, this area of research focuses on measuring and improving the properties and fitness of the involved actors in the strategic alliance as a part of the goals of these breeding environments, aiming to further prepare and enable them for participation in future potential goal-oriented CNs. As addressed later in Section 4.1, the effective management of the strategic alliance type of CN heavily depends on building and maintaining strong information management systems to support its daily activities and the variety of functionalities that it provides to its stakeholders.
3 Collaborative Networks Reference Model

Recent advances in the definition of the CN taxonomy as well as the reference modeling of CNs are addressed in [5; 28] and fall outside the scope of this paper. However, this section aims to provide a brief introduction to the ARCON reference model defined for CNs. The reference model in turn provides the base for developing the CN ontology, as well as for modeling some of the base information that needs to be handled in CNs. This section further focuses on the endogenous perspective of CNs, while Section 4 focuses specifically on the functional dimension of CNs. Sections 5, 6, and 7 then narrow down on the management of information for several elements of the functional dimension.

3.1 ARCON Reference Model for Collaborative Networks

Reference modeling of CNs primarily aims at facilitating co-working and co-development among their different stakeholders from multiple disciplines. It supports the reusability and portability of the defined concepts, thus providing a model that can be instantiated to capture all potential CNs. Furthermore, it shall provide insight into the modeling tools/theories appropriate for different CN components, and the base for the design and building of the architectural specifications of CN components. Inspired by the modeling frameworks introduced earlier in the literature related to collaboration and networking [3; 4; 29; 30], and considering the complexity of CNs [11; 10; 31], the ARCON (A Reference model for Collaborative Networks) modeling framework has been developed, addressing their wide variety of aspects, features, and constituting elements. The reference modeling framework of ARCON aims at simplicity, comprehensiveness, and neutrality. With these aims, it first divides the CN’s complexity into a number of perspectives that comprehensively and systematically cover all relevant aspects of CNs.
At the highest level of abstraction, the three perspectives of environment characteristics, life cycle, and modeling intent are identified and defined for the ARCON framework, respectively constituting the X, Y, and Z axes of the diagrammatic representation of the ARCON reference model. First, the life cycle perspective captures the five main stages of the CN life cycle, namely the creation, operation, evolution, metamorphosis, and dissolution stages. Second, the environment characteristics perspective consists of two subspaces: the “Endogenous Elements subspace”, capturing the characteristics of the internal elements of CNs, and the “Exogenous Interactions subspace”, capturing the characteristics of the external interactions of the CN with its logical surroundings. Third, the modeling intent perspective captures the different intents for the modeling of CN features, specifically addressing three possible modeling stages: general representation, specific modeling, and implementation modeling. All three perspectives and their elements are addressed in detail in [1]. To enhance the understanding of the content of this paper, below we briefly address only the environment characteristics perspective, and then focus on the endogenous subspace. For more details on the life cycle and modeling intent perspectives, as well as
the description of the elements of the exogenous subspace, please refer to the above-mentioned publication.

3.1.1 Environment Characteristics Perspective – Endogenous Elements Subspace

To comprehensively represent its environment characteristics, the reference model for CNs shall include both its Endogenous Elements and its Exogenous Interactions [1]. Here we focus on the endogenous elements of the CN. For more details on any of these issues, the above reference to the ARCON reference model is suggested. Abstraction and classification of the CN’s endogenous elements is challenging due to the large number of their distinct and varied entities, concepts, functionality, rules and regulations, etc. For instance, every CN participant can play a number of roles and have different relationships with other CN participants. Furthermore, there are certain rules of behavior that either constitute the norms in the society/market or are set internally by the CN, and that shall be obeyed by the CN participants. Needless to say, in every CN there is a set of activities and functionalities needed for its operation and management that also need to be abstracted in its reference model. The Endogenous Elements subspace of ARCON aims at the abstraction of the internal characteristics of CNs. To better characterize this diverse set of internal aspects of CNs, four orthogonal dimensions are proposed and defined, namely the structural, componential, functional, and behavioral dimensions:
• E1 - Structural dimension. Addressing the composition of the CN’s constituting elements, namely the actors (primary or support), roles (administrator, advisor, broker, planner, etc.), relationships (trusting, cooperation, supervision, collaboration, etc.), and network topology (self and potentially sub-network), etc.
• E2 - Componential dimension. Addressing the individual tangible/intangible CN elements, namely domain-specific devices, ICT resources (hardware, software, networks), human resources, collected information, knowledge (profile and competency data, ontologies, bag of assets, etc.), and its accumulated assets (data, tools, etc.), etc.
• E3 - Functional dimension. Addressing the “base functions / operations” that run to support the network, time-sequenced flows of executable operations (e.g. processes for the management of the CN, processes to support the participation and activities of members in the CN), and methodologies and procedures running at the CN (network set-up procedure, applicant’s acceptance, CN dissolution and inheritance handling, etc.), etc.
• E4 - Behavioral dimension. Addressing the principles, policies, and governance rules that either drive or constrain the behavior of the CN and its members over time, namely principles of governance, collaboration and rules of conduct (prescriptive or obligatory), contracts and agreements, and constraints and conditions (confidentiality, conflict resolution policies, etc.), etc.

A diagrammatic representation of the cross between the life-cycle perspective and the Endogenous Elements, exemplifying some elements of each dimension, is illustrated in Fig. 2.
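The crossing of the life-cycle stages with the four endogenous dimensions can be pictured as a simple classification structure. The Python sketch below is only an illustrative data model: the stage and dimension names come from the text above, whereas the `ModelElement` class, the `elements_for` helper, and in particular the assignment of example elements to life-cycle stages are assumptions made here and are not part of ARCON as defined in [1].

```python
from dataclasses import dataclass
from enum import Enum


class LifeCycleStage(Enum):
    CREATION = "creation"
    OPERATION = "operation"
    EVOLUTION = "evolution"
    METAMORPHOSIS = "metamorphosis"
    DISSOLUTION = "dissolution"


class EndogenousDimension(Enum):
    STRUCTURAL = "E1"
    COMPONENTIAL = "E2"
    FUNCTIONAL = "E3"
    BEHAVIORAL = "E4"


@dataclass(frozen=True)
class ModelElement:
    """Hypothetical record placing one CN element in a dimension and its stages."""
    name: str
    dimension: EndogenousDimension
    stages: frozenset


def elements_for(elements, dimension, stage):
    """Names of the elements of one dimension that apply in one life-cycle stage."""
    return [
        e.name for e in elements if e.dimension is dimension and stage in e.stages
    ]


ALL_STAGES = frozenset(LifeCycleStage)

# The stage assignments below are illustrative assumptions only.
elements = [
    ModelElement("roles", EndogenousDimension.STRUCTURAL, ALL_STAGES),
    ModelElement("network set-up procedure", EndogenousDimension.FUNCTIONAL,
                 frozenset({LifeCycleStage.CREATION})),
    ModelElement("applicant acceptance procedure", EndogenousDimension.FUNCTIONAL,
                 frozenset({LifeCycleStage.OPERATION, LifeCycleStage.EVOLUTION})),
    ModelElement("conflict resolution policy", EndogenousDimension.BEHAVIORAL,
                 ALL_STAGES),
]

print(elements_for(elements, EndogenousDimension.FUNCTIONAL, LifeCycleStage.CREATION))
```

Keeping the dimension and the stages as separate attributes reflects the idea that the two perspectives are crossed rather than nested.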
H. Afsarmanesh et al.
[Figure: inside view of the CNO life-cycle stages (L1. Creation, L2. Operation, L3. Evolution, L4. Metamorphosis / Dissolution) crossed with the Endogenous Elements (Endo-E) dimensions, with a further axis for the modeling intent (general representation, specific modeling, implementation modeling). Example elements shown per dimension: E1. Structural: participants, roles, relationships, network topology. E2. Componential: hardware/software resources, human resources, information/knowledge resources, ontology resources. E3. Functional: processes, auxiliary processes, procedures, methodologies. E4. Behavioral: prescriptive behavior, obligatory behavior, constraints and conditions, contracts and agreements. © H. Afsarmanesh & L.M. Camarinha-Matos 2007]
Fig. 2. Crossing CN life cycle and the Endogenous Elements perspective [1]
The remainder of this paper focuses only on the functional dimension of the CN, and specifically addresses in more detail the functionality required for effective management of long-term strategic alliances. In order to exemplify the involved complexity, it then narrows its focus to the management of the information required to support three specific functionalities within the functional dimension of this type of CN: ontology engineering, profile and competency management, and trust management. Specifically, it addresses the collection, modeling, and processing of the information needed by these functionalities, which constitute three sub-systems of the management system for this type of CN.
4 Functional Dimension of Collaborative Networks

The detailed elements in the ARCON reference model that comprehensively represent the functional dimension of CNs are addressed in [1], where instantiations of the functional dimension for both the long-term strategic alliances and the shorter-term goal-oriented networks are also presented. This section specifically focuses on the long-term strategic alliances of organizations or individuals. It first addresses the set of functionalities that are necessary both for managing the daily operation of strategic alliances and for supporting their members in their participation and activities in this type of CN. Managing a variety of heterogeneous and distributed information is required within strategic alliances, such as VBEs and PVCs, to support their operation stage, as characterized in the functional dimension of these two types of CNs. For such networks to succeed, their administration needs to collect a wide variety of information, partially from the involved actors and partially from the network environment itself, classify and organize this information to fit the needs of the supporting sub-systems, and continuously keep it updated and as complete as possible [32].
Information Supporting Functional Dimension of Collaborative Networks
Current research indicates that while the emergence of CNs delivers many exciting promises to improve the chances of success for their actors in the current turbulent market/society, it poses many challenges related to supporting their functional dimension. This in turn results in challenges for capturing, modeling, and managing the information within these networks. Some of the main functions required to support both the management and the daily operation of strategic alliances include: (i) engineering of the network ontology, (ii) classification and management of the profiles and competencies of actors in the network, (iii) establishing and managing rational trust in the network, (iv) matching partners’ capabilities/capacities against the requirements of a collaboration opportunity, (v) reaching agreements (negotiation) for collaboration, and (vi) collection of the assets (data, software tools, lessons learned, best practices, etc.) gathered and generated in the network, and management of the components in such a bag of assets, among others [11; 27]. A brief description of a set of main functionalities is provided in the next sub-section.

4.1 Functionalities and Sub-systems Supporting Strategic Alliances

Research and development on digital networks, particularly the Internet, addresses challenges related to the online search of information and the sharing of expertise and knowledge between organizations and individuals, irrespective of their geographical locations. This in turn paves the way for collaborative problem solving and co-creation of services and products, which go far beyond the traditional inter-organizational or inter-personal co-working boundaries and geographical constraints, and raises challenging questions about how to manage information to support the cooperation of organizations and individuals in CNs. A set of functionalities is required to support the operation stage of strategic alliances.
In particular, supporting the daily management of CN activities and actors, and the agile formation of goal-oriented CNs to address emerging opportunities, is challenging. Furthermore, these functionalities handle a variety of information and thus need effective management of the gathered information, considering the geographical distribution and heterogeneous nature of the CN actors, e.g. in their applied technologies, organizational culture, etc. Due to the specificities of the functionalities required for the management of strategic alliances, one large management system for this purpose is difficult to realize and maintain. A distributed architecture is therefore typically considered for their development. Applying the service orientation approach, a number of interoperable independent sub-systems can be developed and applied to support the management of collaboration-related information in strategic alliances. As an example of such development and required functionality, as addressed in [27], a so-called VBE management system (VMS) has been designed and implemented, consisting of a number of interoperable sub-systems. These sub-systems either directly support the daily management of the VBE operation [33], or are developed to assist the opportunity broker and the VO planner with effective configuration and formation of VOs in the VBE environment [34]. In Fig. 3, eight specific functionalities address the sub-systems supporting the management of the daily operation of VBEs, while four specific functionalities, appearing inside the VO creation box, address the sub-systems supporting different aspects related to the creation of VOs.
Focusing on their information management aspects, each of these sub-systems provides a set of services for the explicit access, retrieval, and manipulation of information for different specific purposes. These sub-systems interoperate by exchanging data, and together provide an integrated management system, as shown in Fig. 3. For each sub-system illustrated in this figure, a brief description is provided below, while more details can be found in the above two references.
[Figure: the VMS sub-systems (ODMS, PCMS, MSMS, TrustMan, DSS, VIMS, SIMS, BAMS), the VO creation tools (CO-Finder, COC-Plan, PSS, WizAN), and VOMS, connected by numbered data transfers. The DSS boxes are labelled low performance, lack of competency, and low trust; the VIMS boxes are labelled VO registration and VO inheritance; the MSMS boxes are labelled member registration and rewarding.]

Main users/editors of data in the systems/tools: M = VBE Member, A = VBE Administrator, B = Broker, S = Support Institution Manager.

Data transfers:
1 Profile/competency classification
2 Profile/competency element classification
3 Member’s competency specification
4 Competency classes
5 Low base trust level of organizations
6 Members’ general data
7 Base trust level of membership applicants
8 Specific trustworthiness of VO partners
9 Organizations’ performance data from the VO
10 Collaborative opportunities’ definitions
11 VO model
12 VO model and candidate partners
13 VO model and VO partners
14 Processed VO inheritance
15 Support institutions’ general data
16 Asset contributors’ general data
17 VO inheritance
Fig. 3. VMS and its constituent subsystems
Membership Structure Management Systems (MSMS): Collection and analysis of the applicants’ information as a means to ascertain their suitability for the VBE has proved particularly difficult. This sub-system provides services that support the integration, accreditation, disintegration, rewarding, and categorization of members within the VBE.

Ontology Discovery and Management System (ODMS): In order to systematize all VBE-related concepts, a generic/unified VBE ontology needs to be developed and managed. The ODMS provides services for the manipulation of VBE ontologies, which is required for the successful operation of the VBE and its VMS, as further addressed in Section 5.

Profile and Competency Management Systems (PCMS): In VBEs, several functionalities need to access and process the information related to members’ profiles and competencies. PCMS provides services that support the creation, submission, and maintenance of profiles and detailed competency-related elements of the involved VBE organizations, as well as categorizing collective VBE competencies and organizing the competencies of VOs registered within the VBE, as further addressed in Section 6.

Trust Management system (TrustMan): Supporting the VBE stakeholders, including the VBE administration and members, with tasks related to the analysis and assessment of the rational trust level of other organizations is of great importance for the successful management and operation of VBEs, for example in the selection of best-fit VO partners, as further addressed in Section 7.

Decision Support Systems (DSS): The decision-making process in a VBE needs to involve a number of actors whose interests may even be contradictory. The DSS has three components that support the following decision-making operations within a VBE: warning of an organization’s lack of performance, warning related to the VBE’s competency gap, and warning of an organization’s low level of trust.

VO information management system (VIMS): It supports the VBE administrator and other stakeholders with the management of information related to the creation stage of the VOs within the VBE, the storage of summary records related to the measurement of performance during the VO’s operation stage, and the recording and management of information and knowledge gathered from dissolved VOs, which constitute the means to handle and access inheritance information.

Bag of assets management system (BAMS): It provides services for the management and provision of fundamental VBE information, such as the guidelines, bylaws, value systems guidelines, incentives information, rules and regulations, etc. It also supports the VBE members with publishing and sharing some of their “assets” of common interest with other VBE members, e.g. valuable data, software tools, lessons learned, etc.

Support institution management system (SIMS): The support institutions in VBEs are of two kinds. The first kind refers to those organizations that join the VBE to provide/market their services to VBE members. These services include advanced tools that enhance the VBE members’ readiness to collaborate in VOs. They can also provide services to assist the VBE members with their daily operation, e.g. accounting and tax, training, etc. The second kind refers to organizations that join the VBE to assist it with reaching its goals, e.g. ministries, sector associations, chambers of commerce, environmental organizations, etc. SIMS supports the management of the information related to the activities of support institutions inside the VBEs.

Collaboration Opportunity Identification and Characterization (coFinder): This tool assists the opportunity broker in identifying and characterizing a new Collaboration Opportunity (CO) in the market/society that will trigger the formation of a new VO within the VBE. A collaboration opportunity might be external, initiated by a customer and brokered by a VBE member that is acting as a broker. Some opportunities might also be generated internally, as part of the VBE’s development strategy.

CO characterization and VO’s rough planning (COC-plan): This tool supports the planner of the VO with developing a detailed characterization of the resources and capacities needed for the CO, as well as with forming a rough structure for the potential VO, thereby identifying the types of competencies and capacities required from the organizations that will form the VO.

Partners search and suggestion (PSS): This tool assists the VO planner with the search for and proposal of one or more suitable sets of partners for VO configurations. The tool also supports an analysis of different potential VO configurations in order to select the optimal formation.

Contract negotiation wizard (WizAN): This tool supports the VO coordinator in involving the selected VO partners in the negotiation process, in which they agree on and commit to their participation in the VO. The VO is launched once the needed agreements have been reached and the contracts have been established and electronically signed.

A summary of the main challenges related to managing the information in CNs is presented in section 1.1, where several requirements are addressed in relation to different aspects and components of CNs. The next three sections provide details on the information management aspects of three of the above functionalities and address their sub-systems, namely ontology engineering, the management of profiles and competencies, and the management of trust in this type of CN.
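To make the idea of partner search and suggestion more concrete, the following minimal sketch proposes a set of partners whose combined competencies cover the competencies required for a VO. It is only an illustration: the `Candidate` record, the `suggest_partners` function, and the greedy covering strategy are assumptions introduced here, not the method implemented in the PSS tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical, simplified record of a VBE member considered for a VO."""
    name: str
    competencies: frozenset


def suggest_partners(required, candidates):
    """Greedily choose partners until the required competencies are covered.

    Returns (names of chosen partners, competencies that remain uncovered).
    """
    uncovered = set(required)
    remaining = list(candidates)
    chosen = []
    while uncovered and remaining:
        # Take the candidate covering the most still-missing competencies.
        best = max(remaining, key=lambda c: len(uncovered & c.competencies))
        gain = uncovered & best.competencies
        if not gain:
            break  # no remaining candidate adds anything new
        chosen.append(best.name)
        uncovered -= gain
        remaining.remove(best)
    return chosen, uncovered


# Illustrative data only; the organizations and competencies are made up.
members = [
    Candidate("Org A", frozenset({"welding", "milling"})),
    Candidate("Org B", frozenset({"design"})),
    Candidate("Org C", frozenset({"milling"})),
]
partners, gaps = suggest_partners({"welding", "milling", "design"}, members)
```

A greedy choice does not guarantee the smallest possible partner set, and a real tool would also weigh criteria such as trust level and past performance when comparing VO configurations.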
5 VBE-Ontology Specification and Management

Ontologies are increasingly applied in different areas of research and development; for example, they are effectively used in artificial intelligence, the semantic web, software engineering, biomedical informatics, and library science, among many others, as the means for representing knowledge about their environments. Therefore, a wide variety of tasks related to processing information/knowledge is supported through the specification of ontologies. Examples of these tasks include natural language processing, knowledge management, geographic information retrieval, etc. [35]. This section introduces an ontology developed for VBEs that aims to address a number of challenging requirements for the modeling and management of VBE information. Section 5.1 first presents the challenges being addressed, and sections 5.2 and 5.3 then present two specific ontology-based solutions.

5.1 Challenges for VBE Information Modeling and Management

The second generation VBEs must handle a wide variety of types of information related to both their constituents and their required daily operations and activities. Therefore, these networks must handle and maintain a broad set of concepts and entities to support the processing of a large set of functionalities. Among others, complexity, dynamism, and scalability requirements can be identified as characteristics describing VBEs with respect to the following aspects: (i) autonomous, geographically distributed stakeholders, (ii) a wide range of running management functionalities and support systems, and (iii) diverse domains of activities and application environments. The analysis of several 1st generation VBEs in different domains has shown that the development of an ontology for VBEs can address the following main requirements:
Establishing common understanding in VBEs. Common understanding of the general as well as the domain-related VBE concepts is the base requirement for modelling and management of information/knowledge in different VBE functionalities. To facilitate interoperability and smooth collaboration, all VBE stakeholders must use the same definitions and have the same understanding of the different aspects and concepts applied in the VBE, including VBE policies, membership regulations, working/sharing principles, VBE competencies, performance measurement criteria, etc. There is still a lack of consensus on common and coherent definitions and terminology addressing the generic VBE structure and operations [36]. Therefore, the identification and specification of common generic VBE terminology, as well as the development of a common semantic subspace for VBE information/knowledge, are challenging.

VBE instantiation in different domains. New VBEs are being created and operated in a wide range of domains and application environments, ranging from, e.g., the provision of healthcare services and product design and manufacturing to the management of natural disasters, biodiversity, and scientific virtual laboratory experimentation in physics or biomedicine, among others. Clearly, each domain/application environment has its own features, culture, terminology, etc. that shall be considered and supported by the VBEs’ management systems. During the VBE’s creation stage, parameterization of its management system with both the generic VBE characteristics and the specific domain-related and application-related characteristics is required. Furthermore, at the creation stage of the VBE, several databases need to be created to support the storage and manipulation of the information/knowledge handled by different sub-systems. The design and development of these databases shall be achieved together with experts from the domain, since it requires knowledge about complex application domains. Therefore, the development of approaches for speeding up and facilitating the instantiation and adaptation of VBEs to different domains / areas of activity is challenging.

Supporting dynamism and scalability in VBEs. Frequent changes in the market and society, such as the emergence of new types of customer demands or new technological trends, drive VBEs to work in a very dynamic manner. Supporting the dynamic aspects of VBEs requires that the VBE management system provides functionalities that support human actors in handling the necessary changes in the environment. The VBE’s information is therefore required to be processed dynamically by semi-automated reusable software tools. As such, the variety of VBE information and knowledge must be categorized and formally specified. However, there is still a lack of such formal representations and categorizations. Therefore, the formal modelling and specification of VBE information, as well as the development of semi-automated approaches for speeding up VBE information processing, are challenging.

Responding to the above challenges through the provision of innovative approaches, models, mechanisms, and tools represents the main motivation for the research addressed in this section. The following conceptual and developmental approaches together address the above three challenges:

• Conceptual approach - unified ontology: The unified ontology for VBEs, further referred to as the VBE-ontology, is described as follows [13]:
VBE-ontology is a form of unified and formal conceptual specification of the heterogeneous knowledge in VBE environments to be easily accessed by and communicated between human and application systems, for the purpose of VBE knowledge modelling, collection, processing, analysis, and evolution.

Specifically, the development of the unified VBE-ontology responds to the challenge of common understanding as follows: (i) it supports the representation of the definitions of all VBE concepts and the relationships among them within a unified ontology that establishes the common semantic subspace for the VBE knowledge; (ii) it introduces linguistic annotations, such as synonyms and abbreviations, to address the problem of varied names for concepts; (iii) through sharing the VBE-ontology within and among VBEs, it supports the reuse of common concepts and terminology. The VBE-ontology addresses the challenge of VBE instantiation as follows: (1) the ontological representation of VBE knowledge is semi-automatically convertible/transferable to database schemas [37], supporting the semi-automated development of the needed VBE databases during the VBE creation stage; (2) pre-defined domain concepts within the VBE-ontology support the semi-automated parameterization of generic VBE management tools, e.g. PCMS, TrustMan, etc. The VBE-ontology responds to the challenge of VBE dynamism and scalability as follows: (a) the formal representation of the knowledge in the VBE-ontology facilitates semi-automated processing of this knowledge by software tools; (b) the ontology itself can be used to support semi-automated knowledge discovery from text-corpora [38].

• Developmental approach - ontology discovery and management system: In order to benefit from the VBE-ontology specification, a number of ontology engineering and management functionalities are developed on top of the VBE-ontology [39]. Namely, the ontology engineering functionalities support the discovery and evolution of the VBE-ontology itself, while the ontology management functionalities support VBE stakeholders in learning about VBE concepts, preserve the consistency of VBE databases and domain parameters with the VBE-ontology, and perform semi-automated information discovery. These functionalities are specified and developed within one system, called the Ontology Discovery and Management System (ODMS) [39].

The ODMS plays a special role in the functional dimension of the CN reference model (as addressed in section 4). Unlike the other information management sub-systems addressed in section 4.1, e.g. profile and competency management, trust management, etc., ODMS aims at managing not only the actual information/data of the VBE, but also the ontological representation of its conceptual aspects, namely the meta-data. More precisely, this sub-system aims to support the mapping of the information handled in other VMS sub-systems to its generic meta-models. This mapping supports consistency between the portions of information accumulated by different VMS sub-systems and their models. It also supports preserving the semantics of the information, which is the first step towards the development of semi-automated and intelligent approaches for information management. The remainder of this section describes the above two approaches in more detail.
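The semi-automated conversion of ontological knowledge into database schemas can be illustrated with a minimal sketch. The concept descriptions in `CONCEPTS`, the helper names `concept_to_ddl` and `build_schema`, and the choice of SQLite are assumptions made purely for illustration; they do not reproduce the conversion approaches of [37].

```python
import sqlite3

# Hypothetical core-level concept descriptions: concept -> {property: SQL type}.
CONCEPTS = {
    "vbe_member": {"name": "TEXT", "legal_status": "TEXT", "size": "INTEGER"},
    "vbe_competency": {"title": "TEXT", "member_id": "INTEGER"},
}


def concept_to_ddl(concept, properties):
    """Build a CREATE TABLE statement for one ontology concept."""
    columns = ["id INTEGER PRIMARY KEY"]
    columns += [f"{prop} {sql_type}" for prop, sql_type in properties.items()]
    return f"CREATE TABLE {concept} ({', '.join(columns)})"


def build_schema(connection, concepts):
    """Create one table per concept and return the concept-to-table map."""
    mapping = {}
    for concept, properties in concepts.items():
        connection.execute(concept_to_ddl(concept, properties))
        mapping[concept] = concept  # in this sketch the table is named after the concept
    return mapping


conn = sqlite3.connect(":memory:")
table_map = build_schema(conn, CONCEPTS)
```

The returned map between concepts and tables is the kind of artefact that can later be used to check whether the database still agrees with the ontology. The sketch assumes trusted concept and property names, since they are placed directly into the SQL text.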
5.2 VBE-Ontology

To define the scope of the VBE-ontology, the VBE information and knowledge are first characterised and categorised. The following two main characteristics of the VBE information/knowledge are used to categorise them:

• Reusable VBE information at different levels of abstraction: In order to respond to the challenge of common understanding, the VBE-ontology is primarily addressed in three levels of abstraction, called here “concept-reusability levels”, that refer to the reusability of the VBE information at the core, domain, and application levels (see Fig. 4). The core level consists of the concepts that are generic for all VBEs, for example concepts such as “VBE member”, “Virtual Organization”, “VBE competency”, etc. The domain level has a variety of “exemplars”, one for each specific domain or business area. Each domain level consists of the concepts that are common only to those VBEs that operate in that domain or sector. Domain level concepts represent the population of the core concepts in a specific VBE domain environment. For example, the core “VBE competency” concept can be populated with “Metalworking competency” or “Tourism competency”, depending on the domain. The application level includes a larger number of exemplars, one for each specific VBE application within each domain. Each application level consists of the concepts that are specific to that VBE and cannot be reused by other VBEs. Application level concepts mainly represent the population of the domain level concepts in one specific VBE application environment. The levels of reusability also include one very high level, called the meta level. This level represents a set of high-level meta-properties, such as “definition”, “synonym”, and “abbreviation”, used for the specification of all concepts from the other three levels.
• Reusable VBE information in different work areas: In order to respond to the challenge of VBE creation in different domains and the challenge of VBE dynamism and scalability, the concepts used by different VBE management functionalities, as addressed in the functional dimension of the CN reference model, should be addressed in the VBE-ontology. The VBE-ontology therefore supports both the development of the databases for the data related to VBE functionalities and the semi-automated processing of the information related to these functionalities. Additionally, addressing these concepts in the VBE-ontology responds to the challenge of common understanding for these functionalities. Following the approach addressed in [40] for the AIAI enterprise ontology, ten different “work areas” are identified for VBEs and their management (see Fig. 4). Each work area focuses on a set of interrelated concepts that are typically associated with a specific VBE document repository and/or a specific VBE management functionality, such as the Membership Management functionality, the management of the Bag of Assets repository, Profile and Competency management, and Trust management, as addressed in section 4.1. These work areas are complementary, and each of them shares some concepts with some other work areas. In addition, while extensive attention has been spent on the design of these ten work areas, it is clear that in the future more work areas can be defined and added to the VBE-ontology. Additionally, each of the ten work areas can be further split into smaller work areas, depending on the level of detail they need to capture. For example, within the Profile and Competency work area, the Competency work area can be separated from the Profile work area.

The introduced structure of the VBE-ontology represents (a) the embedding of the “horizontal” reusability levels and (b) their intersection with the “vertical” work areas. The horizontal reusability levels are embedded in each other hierarchically. Namely, the core level includes the meta level, as illustrated in Fig. 4. Furthermore, every domain level includes the core level. Finally, every application level may include a set of domain levels (i.e. those related to this VBE’s domains of activity). The work areas are presented vertically, and thus intersect with the core level, the domain levels, and the application levels, but not with the meta level, which consists of the meta-data applicable to all other levels. The cells resulting from the intersection of the reusability levels and the work areas are further called sub-ontologies. The structure of the VBE-ontology is also illustrated in Fig. 4. In particular, this figure shows how the intersection of the horizontal core level and the vertical VBE profile and competency work area results in the core level profile and competency sub-ontology.

The idea behind the sub-ontologies is to apply the divide and conquer principle to the VBE-ontology in order to simplify coping with its large size and wide variety of aspects. Furthermore, sub-ontologies represent the minimal physical units of the VBE-ontology, i.e. physical ontology files on a computer, while the VBE-ontology itself shall be compiled out of its physical sub-ontologies according to its logical structure. Sub-ontologies also help to cope with the evolution of different VBE information. Every time a new piece of information needs to be introduced into the VBE-ontology, it suffices to specify the relevant new sub-ontology for it within the VBE-ontology. Typically, when a new VBE is established, it does not need to adopt the entire VBE-ontology. Rather, it should build its own “application VBE-ontology” out of the related sub-ontologies, used as “construction bricks”. Thus, the design of the VBE-ontology also provides a solution to the technical question of how to handle the differences in the information accumulated by different VBE applications.

[Figure: the levels of abstraction (meta, core, domain, application) drawn horizontally and the ten work areas drawn vertically: VBE-self, VBE actor / participant, Virtual organization, VBE profile and competency, VBE history, VBE bag of assets, VBE trust, VBE governance, VBE value systems, VBE management system. Highlighted examples: the “Trust management” work area and the core level “profile and competency” sub-ontology.]
Fig. 4. Structure of the VBE-ontology consisting of sub-ontologies
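The way an “application VBE-ontology” can be assembled from sub-ontologies may be sketched as follows. The `SubOntology` record, the `compile_application_ontology` function, and the example registry are hypothetical constructs introduced only to illustrate the embedding of levels and their intersection with work areas; the concept “Trust level” is likewise an invented placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class SubOntology:
    """One cell of the structure: a reusability level crossed with a work area."""
    level: str                 # "meta", "core", "domain" or "application"
    work_area: str = ""        # empty for the meta level, which spans all work areas
    exemplar: str = ""         # domain or application name (hypothetical labels)
    concepts: set = field(default_factory=set)


def compile_application_ontology(registry, work_areas, domains, application):
    """Select the sub-ontologies one VBE needs and merge their concepts."""
    selected = []
    for sub in registry:
        if sub.level == "meta":
            selected.append(sub)  # meta-properties apply to every other level
        elif sub.work_area not in work_areas:
            continue
        elif sub.level == "core":
            selected.append(sub)
        elif sub.level == "domain" and sub.exemplar in domains:
            selected.append(sub)
        elif sub.level == "application" and sub.exemplar == application:
            selected.append(sub)
    concepts = set().union(*(sub.concepts for sub in selected))
    return selected, concepts


registry = [
    SubOntology("meta", concepts={"definition", "synonym", "abbreviation"}),
    SubOntology("core", "VBE profile and competency", concepts={"VBE competency"}),
    SubOntology("core", "VBE trust", concepts={"Trust level"}),
    SubOntology("domain", "VBE profile and competency", "metalworking",
                {"Metalworking competency"}),
    SubOntology("domain", "VBE profile and competency", "tourism",
                {"Tourism competency"}),
]
selected, concepts = compile_application_ontology(
    registry, {"VBE profile and competency"}, {"metalworking"}, "example-vbe")
```

In this example a metalworking VBE that only needs the profile and competency work area obtains the meta level, one core sub-ontology, and one domain sub-ontology, and leaves the tourism and trust sub-ontologies aside.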
A partial screenshot of the sub-ontology developed for the core level of the profile and competency information is shown below in Fig. 5; this sub-ontology is further addressed in section 6.
Fig. 5. Partial screenshot of the VBE profile and competency sub-ontology (at the core level)
5.3 ODMS Subsystem Functionalities

The ODMS (Ontology Discovery and Management System) functionalities aim to assist the main information management processes and operations that take place throughout the entire life-cycle of a VBE. They include both the ontology engineering functionalities that are needed for maintaining the VBE-ontology itself, and the ontology management functionalities that are needed to support VBE information management. The five functionalities specified for ODMS are:

• Sub-ontology registry: In order to maintain the sub-ontologies of the VBE-ontology, this functionality, rooted in [41; 42], aims at uploading, registering, organizing, and monitoring the collection of sub-ontologies within an application VBE-ontology. In particular, it aims at grouping and re-organizing sub-ontologies for further management, partitioning, integration, mapping, and versioning.

• Sub-ontology modification: This functionality aims at the manual construction and modification of sub-ontologies. In particular, it has an interface through which users can introduce new concepts and add definitions, synonyms, abbreviations, properties, associations, and inter-relationships for the existing concepts. The concepts in sub-ontologies are both represented in a textual format and visualized through graphs or diagrams.

• Sub-ontology navigation: This functionality aims at familiarising VBE members with the VBE terminology and concepts, and thus addresses the challenge of common understanding. In order to view the terminology, the VBE members first select a specific sub-ontology from the registry. Here too, the concepts in sub-ontologies are both represented in a textual format and visualized through graphs or diagrams.

• Repository evolution: This functionality supports the establishment and monitoring of consistency between VBE database schemas (as well as their content in some cases) and their related sub-ontologies, and thus addresses the challenge of VBE instantiation in different domains. In response to this challenge, the VBE databases shall be developed semi-automatically, guided by the VBE-ontology. Several approaches for the conversion of sub-ontologies into database schemas suggest the creation of a map between an ontology and a database schema [37]. This map later supports monitoring the consistency between the ontology and the database schema. Specifically, this functionality aims to indicate whether the database schemas need to be updated after changes to the VBE-ontology.

• Information discovery: This functionality, rooted in [38], aims at the semi-automated discovery of information from text-corpora, based on the VBE-ontology, which addresses the challenge of VBE dynamism and scalability. In particular, the information discovery functionality supports the discovery of relevant information about the VBE member organizations in order to augment the current VBE repositories. The text-corpora used by this functionality can include semi-structured sources (e.g. HTML pages) or unstructured sources (e.g. brochures). These are typically provided by the VBE member organizations.
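The consistency monitoring described for the repository evolution functionality can be sketched as a comparison between concept properties and table columns through an ontology-to-schema map. The function `schema_drift` and the dictionary-based representation of the ontology, the schema, and the map are assumptions chosen for brevity, not the mechanism used in ODMS.

```python
def schema_drift(ontology, schema, mapping):
    """Report where database tables no longer match their mapped concepts.

    ontology: {concept: set of property names}
    schema:   {table: set of column names}
    mapping:  {concept: table}
    Returns {concept: details} for every concept whose table is unmapped
    or whose columns differ from the concept's properties.
    """
    report = {}
    for concept, properties in ontology.items():
        table = mapping.get(concept)
        columns = schema.get(table, set())
        missing = sorted(properties - columns)   # in the ontology, not in the table
        extra = sorted(columns - properties)     # in the table, not in the ontology
        if table is None or missing or extra:
            report[concept] = {
                "table": table,
                "missing_columns": missing,
                "extra_columns": extra,
            }
    return report


# Illustrative data: the concept gained a "vision" property after the table was built.
ontology = {"vbe_member": {"name", "size", "vision"}}
schema = {"vbe_member": {"name", "size"}}
mapping = {"vbe_member": "vbe_member"}
drift = schema_drift(ontology, schema, mapping)
```

An empty report indicates that no schema update is needed; a non-empty one points the administrator to the tables to revise after a change to the VBE-ontology.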
6 Profile and Competency Modeling and Management

To support both proper cooperation among the VBE members and the fluid configuration and creation of VOs in 2nd generation VBEs, all VBE members must be characterised by uniformly formatted “profiles”. This requirement is especially pressing in medium- to large-size VBEs, where the VBE administration and coaches have less of a chance to get to know each VBE member organization directly. Profiles shall therefore contain the most important characteristics of VBE members (e.g. their legal status, size, area of activity, annual revenue, etc.) that are necessary for performing fundamental VBE activities, such as searching for and suggesting best-fit VO partners, VBE performance measurement, VBE trust management, etc.

The VBE “competencies” represent a specific part of the VBE member organizations’ profiles that is intended to be used directly for VO creation activities. Competency information about organizations is exactly what the VO broker and/or planner needs to retrieve in order to determine what an organization can offer to a new VO. This section first addresses the specific tasks in VBEs that require handling of profiles and competencies, and then presents the two complementary solution approaches developed for these tasks.

6.1 Tasks Requiring Profile and Competency Modeling and Management

In 2nd generation VBEs, characteristic information about all VBE members should be collected and managed in order to support the following four tasks [43].
Information Supporting Functional Dimension of Collaborative Networks
• Creation of awareness about potentials inside the VBE. To cooperate successfully in the VBE and then collaborate successfully in VOs, the VBE members need to familiarize themselves with each other. In small-size VBEs, e.g. with fewer than 30 members, VBE members typically have the chance to get to know each other directly. However, this becomes increasingly difficult, and even impossible, in geographically dispersed medium-size and large-size VBEs (e.g. with 100-200 members). Thus uniformly organizing the VBE members’ information, e.g. to represent the members’ contact data, industry sector, vision, role in the VBE, etc., is a critical instrument for supporting the VBE members’ awareness of each other.
• Configuration of new VOs. The information about the VBE member organizations needs to be accessed both by human individuals and by the software tools assisting the VO broker / VO planner, in order to suggest the configuration of VOs with best-fit partners. Therefore the information about members’ qualifications, resources, etc. that can be offered to a new VO needs to be structured and represented in a uniform format.
• Evaluation of members by the VBE administration. When evaluating member applicants, and also during the VBE members’ participation in the VBE, the VBE administration needs to evaluate the members’ suitability for the VBE. The members’ information is also needed for automated, tool-supported assessment of members’ collaboration readiness, trustworthiness, and performance.
• Introduction / advertising of the VBE in the market / society. Another reason for collecting and managing VBE members’ information is to introduce / advertise the VBE to the outside market / society. Summarized information about the registered VBE members can therefore be used to promote the VBE towards potential new customers and thus towards new collaborative opportunities.
Collecting the members’ information in a unified format especially supports harmonising/adapting heterogeneities among VBE members, which is one requirement for the substantiation of a common collaboration space for CNs, as addressed in section 2.2.3. The profile of a VBE member organization represents a separate, uniformly formatted information unit, and is defined as follows:

The VBE member organization’s profile consists of the set of determining characteristics (e.g. name, address, capabilities, etc.) about each organization, collected in order to facilitate the semi-automated involvement of each organization in some specific line of activities / operations in the VBE that are directly or indirectly aimed at VO creation.

An important part of the profile information represents the organization’s competency, which is defined as follows:

Organizations’ competencies in VBEs represent up-to-date information about their capabilities, capacities, costs, as well as conspicuities, illustrating the accuracy of their provided information, all aimed at qualifying organizations for VBE participation, and mostly oriented towards their VO involvement.

The remainder of this section presents two solution approaches, a conceptual one and a developmental one, that together address the above tasks.
6.2 Profile and Competency Models

The main principle used for defining the unified profile structure is the identification of the major groups of an organization’s information. The identified categories of profile information are:

1. VBE-independent information includes those characteristics of the organization that are independent of its involvement in any collaborative and cooperative consortia.
2. VBE-dependent information includes those characteristics of the organization that depend on its involvement in collaborative and cooperative consortia within the VBEs, VOs, or other types of CNs.
3. Evidence documents are required as an indication / proof of validity of the profile information that the organizations provide in the two previous categories. Evidence can be either an on-line document or some web-accessible information, e.g. the organization’s brochures, web-site, etc.

The four tasks mentioned above can then be addressed by the profile model as follows:

Creation of awareness about potentials inside the VBE: addressed through basic information about name, foundation date, location, size, area of activity, and a general textual description of the organization.
Configuration of new VOs: handled through name, size, contact information, competency information (addressed below), and financial information.
Evaluation of members by the VBE administration: addressed through records of the organizations’ past activities, including past collaboration/cooperation activities, as well as produced products/services and applied practices.
Introduction / advertising of the VBE in the market / society: addressed through aggregation of characteristics such as locations, competencies, and the past history of its achievements.

The resulting profile model is presented in Fig. 6.
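As a rough illustration of the three categories of profile information, the sketch below groups them in one record per member. It is not the profile model of Fig. 6: the class and field names are hypothetical, and the helper that lists fields without supporting evidence is an assumption about how evidence documents might be used, not a feature described in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EvidenceDocument:
    title: str
    location: str          # on-line document or web-accessible resource
    supports: List[str]    # names of the profile fields it substantiates


@dataclass
class MemberProfile:
    vbe_independent: Dict[str, str] = field(default_factory=dict)
    vbe_dependent: Dict[str, str] = field(default_factory=dict)
    evidence: List[EvidenceDocument] = field(default_factory=list)

    def unsupported_fields(self) -> List[str]:
        """Profile fields for which no evidence document was submitted."""
        covered = {name for doc in self.evidence for name in doc.supports}
        declared = list(self.vbe_independent) + list(self.vbe_dependent)
        return [name for name in declared if name not in covered]


# Illustrative data only; the organization and URL are made up.
profile = MemberProfile(
    vbe_independent={"name": "Org A", "size": "45 employees"},
    vbe_dependent={"role_in_vbe": "regular member"},
    evidence=[EvidenceDocument("Company brochure",
                               "https://example.org/brochure.pdf",
                               ["name", "size"])],
)
print(profile.unsupported_fields())
```

Keeping the two information categories in separate mappings mirrors the distinction drawn in the text, so that VBE-independent data could in principle be reused when an organization joins another network.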
The main objective of the competency model for VBE member organizations, called the “4C-model of competency”, is the “promotion of the VBE member organizations towards their participation in future VOs”. The main technical challenge for competency modelling is the unification of existing organizational competency models, e.g. as addressed by [44; 45]. Although these competency models were developed for purposes other than the 2nd generation VBE, some of their aspects can be applied to VBE members’ competencies [21]. The main principle for specifying the competency model, however, is to organize the different competency-related aspects. These are in turn needed to search for the VBE members that best fit the requirements of an emerging collaboration opportunity. The resulting 4C competency model is unified and has a compound structure. Its primary emphasis is on the following four components, which our experimental study identified as necessary and sufficient:

1. Capabilities represent the capabilities of organizations, e.g. their processes and activities. When collective business processes are modelled for a new VO, the VO planner has to search for specific processes or activities that can be performed by different potential organizations, in order to instantiate the model.
Fig. 6. Model of the VBE member’s profile
2. Capacities represent the free availability of the resources needed to perform each capability. Specific capacities of organizations are needed to fulfil the quantitative values of capabilities, e.g. the number of production units per day. If the capacity of members for a specific capability in the VBE is not sufficient to fulfil market opportunities, another member (or a group of members) with the same capability may be invited to the VBE.
3. Costs represent the costs of providing products/services in relation to each capability. They are needed to estimate whether inviting a specific group of members to a VO would exceed the planned VO budget.
4. Conspicuities represent means for validating the information provided by the VBE members about their capabilities, capacities and costs. The conspicuities in VBEs mainly include certified or witnessed documents, such as certifications, licenses, recommendation letters, etc.

An illustration of the generic 4C-model of competency, applicable to all varieties of VBEs, is presented in Fig. 7.

6.3 PCMS Subsystem Functionalities

Based on the objective and the identified requirements, the PCMS (Profile and Competency Management System) supports the following four main functionalities.

• Model customization: This functionality supports the management of profile and competency models within a specific VBE application. Its purpose is to support the customization of the VBE for a specific domain of activity or a specific application environment. Before profile and competency management can be performed at the VBE creation stage, the profile and competency models need to be specified and customized.
Fig. 7. Generic 4C-model of competency
• Data submission: This functionality supports uploading of profile and competency knowledge from each member organization. An approach for incremental submission of data has been developed for the PCMS, which especially supports uploading large amounts of data. To support the dynamism and scalability of the PCMS, the advanced ODMS mechanism for ontology-based information discovery is applied.
• Data navigation: This functionality needs to be extensive in the PCMS. It supports different ways of retrieving and viewing the profile and competency knowledge accumulated in the VBE. The navigation scope covers both the information of a single profile and the collective profile information of the entire VBE. The structuring of the knowledge in the PCMS user interface mimics the VBE profile and competency sub-ontology.
• Data analysis: The PCMS shall collect the competency data and analyze it in order to evolve the VBE’s collection of competencies for addressing more opportunities in the market and society. A number of analysis functions are specified for the PCMS, including data validation, retrieval and search, gap analysis, and development of new competencies.
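To make the roles of the four Cs and of the PCMS analysis functions more concrete, the following sketch stores one record per capability and implements two simple queries: a partner search for a VO and a capacity gap analysis for the VBE. This is only one possible reading of the 4C-model; the class `Competency`, the functions `find_candidates` and `capacity_gap`, the choice of "units per day" as the capacity measure, and all sample data are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Competency:
    capability: str                      # a process or activity
    capacity_per_day: float              # free capacity, in units per day
    cost_per_unit: float                 # cost of providing one unit
    conspicuities: List[str] = field(default_factory=list)  # e.g. certificates


def find_candidates(members: Dict[str, List[Competency]], capability: str,
                    units_per_day: float, budget_per_day: float) -> List[str]:
    """Members offering the capability with enough free capacity within budget."""
    hits = []
    for member, competencies in members.items():
        for c in competencies:
            if (c.capability == capability
                    and c.capacity_per_day >= units_per_day
                    and c.cost_per_unit * units_per_day <= budget_per_day):
                hits.append(member)
    return sorted(hits)


def capacity_gap(members: Dict[str, List[Competency]], capability: str,
                 units_per_day: float) -> float:
    """Shortfall of the whole VBE for a capability (0.0 if demand is covered)."""
    total = sum(c.capacity_per_day
                for competencies in members.values()
                for c in competencies if c.capability == capability)
    return max(0.0, units_per_day - total)


# Illustrative data only.
members = {
    "OrgA": [Competency("welding", 40.0, 12.0, ["quality certificate"])],
    "OrgB": [Competency("welding", 100.0, 15.0)],
}
print(find_candidates(members, "welding", 80.0, 1500.0))  # ['OrgB']
print(capacity_gap(members, "welding", 200.0))            # 60.0
```

A positive gap corresponds to the situation described for capacities, in which another member with the same capability may be invited to the VBE. Conspicuities are stored but not evaluated here, since validating them would require inspecting the submitted documents.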
7 Modeling and Management of Trust in VBEs

Traditionally, trust among organizations involved in collaborative networks was established both “bi-laterally” and “subjectively”, based on reputation. In large networks, however, particularly those with geographical dispersion such as many VBEs, trust issues are sensitive at the network level and need to be reasoned/justified, for example when applied to the selection of best-fit organizations among several competitors [18]. Thus in VBEs, analysis of inter-organizational trust is a functionality supported
by the VBE administration, which needs to apply fact-based data, such as the organizations’ current standing and performance data, for its assessment. Thus, a variety of strategic information related to trust aspects must be collected from VBE actors (applying pull/push mechanisms), then modeled, classified, and stored prior to assessing the level of trust in VBE organizations. Furthermore, in order to identify the common set of base trust-related criteria for organizations in the VBE, as briefly addressed in section 2.2.3, the relevant elements for each specific VBE must be determined. These trust criteria, together with their respective weights, constitute the threshold for assessing an organization’s level of trust in VBEs. In the past, the organizations’ trust-related information was manipulated and processed manually and in an ad hoc manner. This section addresses the development of the Trust Management (TrustMan) subsystem of the VBE and describes its services supporting the rational assessment of the level of trust in organizations.

7.1 Requirements for Managing Trust-Related Information

Objectives for establishing trust may change with time, which means that the information required to support the analysis of the trust level of organizations will also vary with time. As addressed in [18], a main aim of trust management in VBEs is to support the creation of trust among VBE member organizations. The introduced approach to inter-organizational trust management applies the information related to an organization’s standing as well as its past performance data in order to determine its rational trust level in the VBE. Thus the organizations’ activities within the VBE and their participation in configured VOs are relevant to the assessment.
Four main information management requirements have been identified that need to be addressed to support the management of trust-related information in VBEs:

Requirement 1 - Characterization of a wide variety of dynamic trust-related information: The information required to support the establishment of trust among organizations is dynamic: depending on the specific objective(s) for which trust must be established, the needed information may change with time, and these changes cannot be predicted. Therefore, characterizing the relevant trust-related information needed to support the creation of trust among organizations, for every trust objective, is challenging.

Requirement 2 - Classification of information related to different trust perspectives: As stated earlier, the analysis of trust in organizations within VBEs shall rely on fact-based data and needs to be performed rationally. For this purpose, measurable criteria need to be identified and classified. Identifying and classifying a comprehensive set of trust criteria for organizations is challenging, especially when different perspectives of trust are considered.

Requirement 3 - Processing of trust-related information to support trust measurement: In the introduced approach, the trust in organizations is measured rationally using fact-based data. For this purpose, formal mechanisms must be developed using a set of relevant trust criteria. In addition to measuring the trustworthiness of organizations, the applied mechanisms should support fact-based reasoning about the results
based on the standing and performance of the organizations. The development of such trust-related information processing mechanisms is challenging.

Requirement 4 - Provision of services for analysis and measurement of trust: Measuring trust in organizations involves computations over fact-based data using complex mechanisms that may need to be performed in distributed and heterogeneous environments. Developing services to manage and process trust-related information for facilitating the analysis of trust is challenging.

7.2 Approaches for Managing Trust-Related Information of Organizations

Below we propose approaches to address the four requirements presented above. We address the establishment of a pool of generic trust criteria, the identification and modeling of trust elements, the formulation of mechanisms for analyzing inter-organizational trust, and the design of the TrustMan system.

7.2.1 Approach for Establishment of a Pool of Generic Concepts and Elements

Solutions such as specialized models, tools or mechanisms developed to support the management of trust within “application specific VBEs” or within “domain specific VBEs” are difficult to replicate, adapt and reuse in different environments. Therefore, there is a need to develop a generic pool of concepts and elements that can be customized for every specific VBE.

Generic set of trust criteria for organizations: In the introduced approach, a large set of trust criteria for VBE organizations is identified and characterized by applying the HICI methodology and mechanisms (Hierarchical analysis, Impact analysis and Causal Influence analysis) [46]. The identified trust elements for organizations are classified in a generalization hierarchy, as shown in Fig. 8. The trust objectives and the five identified trust perspectives are generic, cover all possible VBEs, and do not change with time.
The set of trust requirements and trust criteria, in contrast, is identified dynamically at each VBE and changes with time. Nevertheless, a base generic set of trust requirements and trust criteria has been established that can be expanded/customized depending on the VBE. Fig. 8 presents an example set of trust criteria for the economical perspective.

7.2.2 Approach for Identification, Analysis and Modeling of Trust Elements

To properly organize and inter-relate all trust elements, an innovative approach was required. The HICI approach proposed in [46] comprises three stages, each focusing on a specific task related to the identification, classification and interrelation of trust criteria related to organizations. The first stage, called the Hierarchical analysis stage, focuses on identifying the types of trust elements and classifying them through a generalization hierarchy based on their level of measurability. This classification makes it possible to understand what values can be measured for the trust-related elements, which in turn supports the decision on which attributes need to be included in the database schema. A general set of trust criteria is presented in [18] and exemplified in Fig. 8.
[Figure 8: generalization hierarchy of trust elements. Trust objective: creating trust among organizations. Trust perspectives: structural, managerial, economical, social, technological. Trust requirements shown for the economical perspective: capital, financial stability, VO financial stability, financial standards. Trust criteria shown: cash capital, physical capital, material capital, cash in, cash out, net gains, operational cost, VO cash in, VO cash out, VO net gains, auditing standards, auditing frequency.]

Fig. 8. An example set of trust criteria for organizations
The second stage, called the Impact analysis stage, focuses on analyzing the impact of changes in the values of trust criteria on the trust level of organizations. This makes it possible to understand the nature and frequency of changes in the values of trust criteria, in order to support the decision on how frequently the trust-related information should be updated. The third stage, called the Causal Influence analysis stage, focuses on analyzing the causal relations between different trust criteria, as well as between the trust criteria and other VBE environment factors, such as the known factors within the VBE and the intermediate factors, which are defined to link the causal relations among all trust criteria and known factors. The results of the causal influence analysis are applied to the formulation of mechanisms for assessing the level of trust in each organization, as addressed below.

7.2.3 Approach for Intensive Modeling of Trust Elements’ Interrelationships to Formulate Mechanisms for Assessing Trust Level of Organizations

Given the need to assess the trust level of every organization in the VBE, a wide range of trust criteria may be considered for evaluating organizations’ trustworthiness. In the introduced approach, trust is characterized as a multi-objective, multi-perspective and multi-criteria subject. As such, trust is not a single concept that can be applied to all cases of trust-based decision-making [47], and its measurement for each case depends on the purpose of establishing a trust relationship, the preferences of the VBE actor who is the trustor in the case, and the availability of trust-related information from the VBE actor who is the trustee in the case [18]. In this
respect, the trust level of an organization can be measured rationally in terms of the quantitative values available for the related trust criteria, e.g. rooted in past performance. Therefore, formal mechanisms for the rational measurement of organizations’ trust level can be deduced from analytic modeling [18, 48]. These mechanisms are then formalized into mathematical equations resulting from the causal influence analysis and the interrelationships between the trust criteria, the known factors within the VBE, and the intermediate factors that are defined to link those causal relations. A causal model, as inspired by the discipline of systems engineering, supports the analysis of causal influence inter-relationships among measurable factors (trust criteria, known factors and intermediate factors), while it also supports modeling the nature of the influences qualitatively [49]. For example, as shown in Fig. 9, while the factors “cash capital” and “capital” are measured quantitatively, the influence of the cash capital on the capital is modeled qualitatively as positive.
Fig. 9. A causal model of trust criteria associated with the economical perspective
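One simple way to hold such a causal model in software is a signed directed graph, sketched below. Only the positive influence of cash capital on capital is stated in the text; the other two edges are assumptions, chosen because physical and material capital also appear as criteria of the capital requirement in Fig. 8. The names `INFLUENCES`, `direct_causes`, and `direction_of_change` are hypothetical.

```python
from typing import Dict, List, Tuple

# Signed influence edges: (cause, effect) -> "+" or "-".
INFLUENCES: Dict[Tuple[str, str], str] = {
    ("cash capital", "capital"): "+",      # stated in the text for Fig. 9
    ("physical capital", "capital"): "+",  # assumption
    ("material capital", "capital"): "+",  # assumption
}


def direct_causes(effect: str,
                  influences: Dict[Tuple[str, str], str] = INFLUENCES
                  ) -> List[Tuple[str, str]]:
    """Factors that directly influence `effect`, with the sign of each edge."""
    return sorted(
        (cause, sign)
        for (cause, target), sign in influences.items()
        if target == effect
    )


def direction_of_change(cause_increases: bool, sign: str) -> str:
    """Qualitative direction of the effect when the cause rises or falls."""
    same_direction = (sign == "+")
    return "increases" if cause_increases == same_direction else "decreases"


for cause, sign in direct_causes("capital"):
    print(cause, sign, "-> capital",
          direction_of_change(cause_increases=True, sign=sign))
```

The graph captures only the qualitative nature of each influence. The quantitative side is carried by the values of the factors themselves and by the equations derived from the model.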
Furthermore, the formulation of mathematical equations from causal models, applying techniques from systems engineering, is thoroughly addressed in [50]. To exemplify the formulation of equations based on the results of the analysis and modeling of causal influences, we present below the equations for two intermediate factors, capital (CA) and financial acceptance (FA) (see Fig. 9):

\[ CA = CC + PC + MC \qquad \text{and} \qquad FA = \frac{SC}{RS} \]

where CC represents cash capital, PC represents physical capital, MC represents material capital, SC represents standards complied, and RS represents required standards.

7.2.4 Approach for Development of the TrustMan Subsystem - Focused on Database Aspects

The TrustMan system is developed to support a number of different users in the VBE with dissimilar roles and rights, which means that different services and user interfaces are required for each user. As a part of the system analysis, all potential users of the TrustMan system were identified and classified into groups depending on their roles and rights in the VBE. Then, for each user group a set of functional requirements
were identified, to be supported by the TrustMan system. The classified user groups of the TrustMan system include: the VBE administrator, the VO planner, the VBE member, the VBE membership applicant, the trust expert and the VBE guest. The identified user requirements and their specified services for the TrustMan system are addressed in [51]. Moreover, to enhance interoperability with other sub-systems in the VBE, the design of the TrustMan system adopts the service-oriented architecture and, specifically, the web service standards. In particular, the design of the TrustMan system adopts the layering approach for classifying services. A well-designed architecture of the TrustMan system based on the concepts of service-oriented architecture is addressed in [36].

Focusing here only on the information management aspects of the TrustMan system, one important issue is the development of the schemas for implementing its required database. To enhance the interoperability and sharing of the data managed by the TrustMan system, both with the existing/legacy databases at different organizations and with other sub-systems of the VBE management system, the relational approach is adopted for the TrustMan database. More specifically, three schemas are developed to support the following: (1) general information related to trust elements, (2) general information about organizations, and (3) specific trust-related data of organizations. These are further defined below.

1. General information related to trust elements - This information constitutes a list and a set of descriptions of trust elements, namely of the different trust perspectives, trust requirements, and trust criteria.
2. General information about organizations - This refers to the information that is necessary to accurately describe each physical or virtual organization. For physical organizations, this information may constitute the name, legal registration details, address, and so on. For virtual organizations, this information may constitute, among others, the VO coordinator details, launching and dissolving dates, involved partners, and the customers.
3. Specific trust-related data for organizations - This information constitutes the values of trust criteria for each organization. It represents primarily the organization’s performance data, expressed in terms of different trust criteria, and is used as the main input data for the services that assess the level of trust in each organization.
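The three kinds of information listed above could be laid out relationally along the following lines. This is a minimal sketch using Python's built-in `sqlite3` module, not the actual TrustMan schemas: all table and column names are hypothetical, and each schema is reduced to a single table. The `capital` function shows how stored criterion values might feed the equation CA = CC + PC + MC of section 7.2.3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- (1) general information related to trust elements
    CREATE TABLE trust_criterion (
        id          INTEGER PRIMARY KEY,
        perspective TEXT NOT NULL,
        requirement TEXT NOT NULL,
        name        TEXT NOT NULL UNIQUE
    );
    -- (2) general information about organizations
    CREATE TABLE organization (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        kind TEXT NOT NULL CHECK (kind IN ('physical', 'virtual'))
    );
    -- (3) specific trust-related data of organizations
    CREATE TABLE criterion_value (
        organization_id INTEGER NOT NULL REFERENCES organization(id),
        criterion_id    INTEGER NOT NULL REFERENCES trust_criterion(id),
        value           REAL NOT NULL,
        PRIMARY KEY (organization_id, criterion_id)
    );
""")

# Illustrative data only.
conn.executemany(
    "INSERT INTO trust_criterion (id, perspective, requirement, name) "
    "VALUES (?, ?, ?, ?)",
    [(1, "economical", "capital", "cash capital"),
     (2, "economical", "capital", "physical capital"),
     (3, "economical", "capital", "material capital")],
)
conn.execute(
    "INSERT INTO organization (id, name, kind) VALUES (1, 'Org A', 'physical')"
)
conn.executemany(
    "INSERT INTO criterion_value VALUES (?, ?, ?)",
    [(1, 1, 100.0), (1, 2, 250.0), (1, 3, 50.0)],
)


def capital(conn: sqlite3.Connection, organization_id: int) -> float:
    """CA = CC + PC + MC, computed from the stored criterion values."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(v.value), 0.0)
        FROM criterion_value AS v
        JOIN trust_criterion AS c ON c.id = v.criterion_id
        WHERE v.organization_id = ?
          AND c.name IN ('cash capital', 'physical capital', 'material capital')
        """,
        (organization_id,),
    ).fetchone()
    return row[0]


print(capital(conn, 1))  # 400.0
```

Separating the criterion definitions from their per-organization values lets a VBE customize its set of trust criteria by adding rows rather than changing the schema. That fits the stated need to expand the generic criteria set per VBE, but it is a design assumption of this sketch.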
8 Conclusion

A key and challenging criterion for the success of collaborative networks is the effective management of the wide variety of information that needs to be handled inside the CNs to support their functional dimension. The paper argues that the efficient creation of dynamic opportunity-based collaborative networks, such as virtual organizations and virtual teams, requires complete and up-to-date information on a wide variety of aspects. Research and practice have indicated that the pre-establishment of supporting long-term strategic alliances can provide the environment needed for the creation of cost- and time-effective VOs and VTs. While some manifestations of such strategic alliances already exist, their 2nd generation needs
a much stronger management system, providing functionalities on top of enabling information management systems. This management system is shown to model, organize, and store partly the information gathered from the CN actors, and partly the information generated within the CN itself. The paper first addressed the main challenges of CNs, together with their requirements for the management of information. It then focused on strategic alliances, and specifically on the management of VBEs, in order to introduce the complexity of their needed functionality. Specific examples of information management challenges were then addressed through the specification of three subsystems of the VBE management system, namely the subsystems handling the engineering of the VBE ontology, profile and competency management in VBEs, and the assessment and management of rational trust in VBEs. As illustrated by these examples, collaborative networks raise quite complex challenges. They require modeling and management of large amounts of heterogeneous and incomplete information, which in turn requires a combination of approaches such as distributed/federated databases, ontology engineering, computational intelligence and qualitative modeling.
References

1. Afsarmanesh, H., Camarinha-Matos, L.M.: The ARCON modeling framework. In: Collaborative networks reference modeling, pp. 67–82. Springer, New York (2008)
2. Afsarmanesh, H., Camarinha-Matos, L.M.: Towards a semi-typology for virtual organization breeding environments. In: COA 2007 – 8th IFAC Symposium on Cost-Oriented Automation, Habana, Cuba, vol. 8, part 1, pp. 22(1–12) (2007)
3. Camarinha-Matos, L.M., Afsarmanesh, H.: A comprehensive modeling framework for collaborative networked organizations. The Journal of Intelligent Manufacturing 18(5), 527–615 (2007)
4. Katzy, B., Zang, C., Loh, H.: Reference models for virtual organizations. In: Virtual organizations – Systems and practices, pp. 45–58. Springer, Heidelberg (2005)
5. Camarinha-Matos, L.M., Afsarmanesh, H.: Collaboration forms. In: Collaborative networks reference modeling, pp. 51–66. Springer, New York (2008)
6. Himmelman, A.T.: On coalitions and the transformation of power relations: collaborative betterment and collaborative empowerment. American journal of community psychology 29(2), 277–284 (2001)
7. Pollard, D.: Will that be coordination, cooperation or collaboration? Blog (March 25, 2005), http://blogs.salon.com/0002007/2005/03/25.html#a1090
8. Bamford, J., Ernst, D., Fubini, D.G.: Launching a World-Class Joint Venture. Harvard Business Review 82(2), 90–100 (2004)
9. Blomqvist, K., Hurmelinna, P., Seppänen, R.: Playing the collaboration game right - balancing trust and contracting. Technovation 25(5), 497–504 (2005)
10. Camarinha-Matos, L.M., Afsarmanesh, H.: Collaborative networks: A new scientific discipline. J. Intelligent Manufacturing 16(4-5), 439–452 (2005)
11. Afsarmanesh, H., Camarinha-Matos, L.M.: On the classification and management of virtual organization breeding environments. The International Journal of Information Technology and Management – IJITM 8(3), 234–259 (2009)
12. Giesen, G.: Creating collaboration: A process that works! Greg Giesen & Associates (2002)
13. Ermilova, E., Afsarmanesh, H.: A unified ontology for VO Breeding Environments. In: Proceedings of DHMS 2008 - IEEE International Conference on Distributed Human-Machine Systems, Athens, Greece, pp. 176–181. Czech Technical University Publishing House (2008) ISBN: 978-80-01-04027-0
14. Rabelo, R.: Advanced collaborative business ICT infrastructure. In: Methods and Tools for collaborative networked organizations, pp. 337–370. Springer, New York (2008)
15. Abreu, A., Macedo, P., Camarinha-Matos, L.M.: Towards a methodology to measure the alignment of value systems in collaborative Networks. In: Azevedo, A. (ed.) Innovation in Manufacturing Networks, pp. 37–46. Springer, New York (2008)
16. Romero, D., Galeano, N., Molina, A.: VO breeding Environments Value Systems, Business Models and Governance Rules. In: Methods and Tools for collaborative networked organizations, pp. 69–90. Springer, New York (2008)
17. Rosas, J., Camarinha-Matos, L.M.: Modeling collaboration preparedness assessment. In: Collaborative networks reference modeling, pp. 227–252. Springer, New York (2008)
18. Msanjila, S.S., Afsarmanesh, H.: Trust Analysis and Assessment in Virtual Organizations Breeding Environments. The International Journal of Production Research 46(5), 1253–1295 (2008)
19. Romero, D., Galeano, N., Molina, A.: A conceptual model for Virtual Breeding Environments Value Systems. In: Accepted for publication in Proceedings of PRO-VE 2007 - 8th IFIP Working Conference on Virtual Enterprises. Springer, Heidelberg (2007)
20. Winkler, R.: Keywords and Definitions Around “Collaboration”. SAP Design Guild, 5th edn. (2002)
21. Ermilova, E., Afsarmanesh, H.: Competency modeling targeted on promotion of organizations towards VO involvement. In: The proceedings of PRO-VE 2008 – 9th IFIP Working Conference on Virtual Enterprises, Poznan, Poland, pp. 3–14. Springer, Boston (2008)
22. Brna, P.: Models of collaboration. In: Proceedings of BCS 1998 - XVIII Congresso Nacional da Sociedade Brasileira de Computação, Belo Horizonte, Brazil (1998)
23. Oliveira, A.I., Camarinha-Matos, L.M.: Agreement negotiation wizard. In: Methods and Tools for collaborative networked organizations, pp. 191–218. Springer, New York (2008)
24. Wolff, T.: Collaborative Solutions – True Collaboration as the Most Productive Form of Exchange. In: Collaborative Solutions Newsletter. Tom Wolff & Associates (2005)
25. Kangas, S.: Spectrum Five: Competition vs. Cooperation. The long FAQ on Liberalism (2005), http://www.huppi.com/kangaroo/LiberalFAQ.htm#Backspectrumfive
26. Afsarmanesh, H., Camarinha-Matos, L.M., Ermilova, E.: VBE reference framework. In: Methods and Tools for collaborative networked organizations, pp. 35–68. Springer, New York (2008)
27. Afsarmanesh, H., Msanjila, S.S., Ermilova, E., Wiesner, S., Woelfel, W., Seifert, M.: VBE management system. In: Methods and Tools for collaborative networked organizations, pp. 119–154. Springer, New York (2008)
28. Afsarmanesh, H., Camarinha-Matos, L.M.: Related work on reference modeling for collaborative networks. In: Collaborative networks reference modeling, pp. 15–28. Springer, New York (2008)
29. Tolle, M., Bernus, P., Vesterager, J.: Reference models for virtual enterprises. In: Camarinha-Matos, L.M. (ed.) Collaborative business ecosystems and virtual enterprises, Kluwer Academic Publishers, Boston (2002)
30. Zachman, J.A.: A Framework for Information Systems Architecture. IBM Systems Journal 26(3) (1987) 31. Camarinha-Matos, L.M., Afsarmanesh, H.: Emerging behavior in complex collaborative networks. In: Collaborative Networked Organizations - A research agenda for emerging business models, ch. 6.2. Kluwer Academic Publishers, Dordrecht (2004) 32. Shuman, J., Twombly, J.: Collaborative Network Management: An Emerging Role for Alliance Management. In: White Paper Series - Collaborative Business, vol. 6. The Rhythm of Business, Inc. (2008) 33. Afsarmanesh, H., Camarinha-Matos, L.M., Msanjila, S.S.: On Management of 2nd Generation Virtual Organizations Breeding Environments. The Journal of Annual Reviews in Control (in press, 2009) 34. Camarinha-Matos, L.M., Afsarmanesh, H.: A framework for Virtual Organization creation in a breeding environment. Int. Journal Annual Reviews in Control 31, 119–135 (2007) 35. Nieto, M.A.M.: An Overview of Ontologies, Technical report, Conacyt Projects No. 35804−A and G33009−A (2003) 36. Ollus, M.: Towards structuring the research on virtual organizations. In: Virtual Organizations: Systems and Practices. Springer Science, Berlin (2005) 37. Guevara-Masis, V., Afsarmanesh, H., Hetzberger, L.O.: Ontology-based automatic data structure generation for collaborative networks. In: Proceedings of 5th PRO-VE 2004 – Virtual Enterprises and Collaborative Networks, pp. 163–174. Kluwer Academic Publishers, Dordrecht (2004) 38. Anjewierden, A., Wielinga, B.J., Hoog, R., Kabel, S.: Task and domain ontologies for knowledge mapping in operational processes. Metis deliverable 2003/4.2. University of Amsterdam (2003) 39. Afsarmanesh, H., Ermilova, E.: Management of Ontology in VO Breeding Environments Domain. To appear in International Journal of Services and Operations Management – IJSOM, special issue on Modelling and Management of Knowledge in Collaborative Networks (2009) 40. Uschold, M., King, M., Moralee, S., Zorgios, Y.: The Enterprise Ontology. 
The Knowledge Engineering Review 13(1), 31–89 (1998) 41. Ding, Y., Fensel, D.: Ontology Library Systems: The key to successful Ontology Re-use. In: Proceedings of the First Semantic Web Working Symposium (2001) 42. Simoes, D., Ferreira, H., Soares, A.L.: Ontology Engineering in Virtual Breeding Environments. In: Proceedings of PRO-VE 2007 conference, pp. 137–146 (2007) 43. Ermilova, E., Afsarmanesh, H.: Modeling and management of Profiles and Competencies in VBEs. J. of Intelligent Manufacturing (2007) 44. Javidan, M.: Core Competence: What does it mean in practice? Long Range planning 31(1), 60–71 (1998) 45. Molina, A., Flores, M.: A Virtual Enterprise in Mexico: From Concepts to Practice. Journal of Intelligent and Robotics Systems 26, 289–302 (1999) 46. Msanjila, S.S., Afsarmanesh, H.: On Architectural Design of TrustMan System Applying HICI Analysis Results. The case of technological perspective in VBEs. The International Journal of Software 3(4), 17–30 (2008) 47. Castelfranchi, C., Falcone, R.: Trust Is Much More than Subjective Probability: Mental Components and Sources of Trust. In: Proceedings of the 33rd Hawaii International Conference on System Sciences (2000)
48. Msanjila, S.S., Afsarmanesh, H.: Modeling Trust Relationships in Collaborative Networked Organizations. The International Journal of Technology Transfer and Commercialisation; Special issue: Data protection, Trust and Technology 6(1), 40–55 (2007) 49. Pearl, J.: Graphs, causality, and structural equation models. The Journal of Sociological Methods and Research 27(2), 226–264 (1998) 50. Byne, B.M.: Structural equation modeling with EQS: Basic concepts, Applications, and Programming, 2nd edn. Routlege/Academic (2006) 51. Msanjila, S.S., Afsarmanesh, H.: On development of TrustMan system assisting configuration of temporary consortiums. The International Journal of Production Research; Special issue: Virtual Enterprises – Methods and Approaches for Coalition Formation 47(17) (2009)
A Universal Metamodel and Its Dictionary

Paolo Atzeni¹, Giorgio Gianforme², and Paolo Cappellari³

¹ Università Roma Tre, Italy
[email protected]
² Università Roma Tre, Italy
[email protected]
³ University of Alberta, Canada
[email protected]

Abstract. We discuss a universal metamodel aimed at the representation of schemas in a way that is at the same time model-independent (in the sense that it allows for a uniform representation of different data models) and model-aware (in the sense that it is possible to say whether a schema is allowed for a data model). This metamodel can be the basis for the definition of a complete model-management system. Here we illustrate the details of the metamodel and the structure of a dictionary for its representation. Examples of concrete use of the dictionary are provided by representing the main data models, such as relational, object-relational or XSD-based. Moreover, we demonstrate how set operators can be redefined with respect to our dictionary and easily applied to it. Finally, we show how such a dictionary can be exploited to automatically produce detailed descriptions of schemas and data models, in a textual (i.e. XML) or visual (i.e. UML class diagram) way.
1 Introduction
Metadata is descriptive information about data and applications. Metadata is used to specify how data is represented, stored, and transformed, or may describe interfaces and behavior of software components. The use of metadata for data processing was reported as early as fifty years ago [22]. Since then, metadata-related tasks and applications have become truly pervasive and metadata management plays a major role in today’s information systems. In fact, the majority of information system problems involve the design, integration, and maintenance of complex application artifacts, such as application programs, databases, web sites, workflow scripts, object diagrams, and user interfaces. These artifacts are represented by means of formal descriptions, called schemas or models, and, consequently, by metadata. Indeed, to solve these problems we have to deal with metadata, but it is well known that applications that manipulate metadata are complex and hard to build, because of heterogeneity and impedance mismatch. Heterogeneity arises because data sources are independently developed by different people and for different purposes and subsequently need to be integrated. The data sources may use different data models, different schemas, and different value encodings. Impedance mismatch arises because the logical schemas required by applications are different from the physical ones exposed by data sources.

A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 38–62, 2009. © Springer-Verlag Berlin Heidelberg 2009

The manipulation includes designing mappings (which describe how two schemas are related to each other) between the schemas, generating a schema from another schema along with a mapping between them, modifying a schema or mapping, interpreting a mapping, and generating code from a mapping. In the past, these difficulties have always been tackled in practical settings by means of ad-hoc solutions, for example by writing a program for each specific application. This is clearly very expensive, as such programs are laborious to write and hard to maintain.

In order to simplify such manipulation, Bernstein et al. [11,12,23] proposed the idea of a model management system. Its goal is to factor out the similarities of the metadata problems studied in the literature and develop a set of high-level operators that can be utilized in various scenarios. Within such a system, we can treat schemas and mappings as abstractions that can be manipulated by operators that are meant to be generic, in the sense that a single implementation of them is applicable to all of the data models. Incidentally, let us remark that in this paper we use the terms “schema” and “data model” as is common in the database literature, though some model-management literature follows a different terminology (and uses “model” instead of “schema” and “metamodel” instead of “data model”).

The availability of a uniform and generic description of data models is a prerequisite for designing a model management system. In this paper we discuss a “universal metamodel” (called the supermodel), defined by means of metadata and designed to properly represent “any” possible data model, together with the structure of a dictionary for storing such metadata.

There are many proposals for dictionary structure in the literature.
The use of dictionaries to handle metadata has been popular since the early database systems of the 1970’s, initially in systems that were external to those handling the database (see Allen et al. [1] for an early survey). With the advent of relational systems in the 1980’s, it became possible to have dictionaries be part of the database itself, within the same model. Today, all DBMSs have such a component. Extensive discussion was also carried out in even more general frameworks, with proposals for various kinds of dictionaries, describing various features of systems (see for example [9,19,21]) within the context of industrial CASE tools and research proposals. More recently, a number of metadata repositories have been developed [26]. They generally use relational databases for handling the information of interest. There are other significant recent efforts towards the description of multiple models, including the Model Driven Architecture (MDA) and, within it, the Common Warehouse Metamodel (CWM) [27], and Microsoft Repository [10]; in contrast to our approach, these do not distinguish metalevels, as the various models of interest are all specializations of a most general one, UML based. The description of models in terms of the (meta-)constructs of a metamodel was proposed by Atzeni and Torlone [8]. But it used a sophisticated graph language, which was hard to implement. The other papers that followed the same or similar approaches [14,15,16,28] also used specific structures.
We know of no literature that describes a dictionary that exposes schemas in both model-specific and model-independent ways, together with a description of models. Only portions of similar dictionaries have been proposed. None of them offer the rich interrelated structure we have here. The contributions of this paper and its organization are the following. In Section 2 we briefly recall the metamodel approach we follow (based on the initial idea by Atzeni and Torlone [8]). In Section 3 we illustrate the organization of the dictionary we use to store our schemas and models, refining the presentation given in a previous conference paper (Atzeni et al. [3]). In Section 4 we illustrate a specific supermodel used to generalize a large set of models, some of which are also commented upon. Then, in Section 5 we discuss how some interesting operations on schemas can be specified and implemented on the basis of our approach. Section 6 is devoted to the illustration of generic reporting and visualization tools built out of the principles and structure of our dictionary. Finally, in Section 7 we summarize our results.
2 Towards a Universal Metamodel
In this section we summarize the overall approach towards a model-independent and model-aware representation of data models, based on an initial idea by Atzeni and Torlone [8].

The first step toward a uniform solution is the adoption of a general model to properly represent many different data models (e.g. entity-relationship, object-oriented, relational, object-relational, XML). The proposed general model is based on the idea of construct: a construct represents a “structural” concept of a data model. We identify a construct for each “structural” concept of every data model considered; hence, a data model is completely represented by the set of its constructs.

Let us consider two popular data models, entity-relationship (ER) and object-oriented (OO). Indeed, each of them is not “a model,” but “a family of models,” as there are many different proposals for each of them: OO with or without keys, binary and n-ary ER models, OO and ER with or without inheritance, and so on. “Structural” concepts for these data models are, for example, entity, attribute of entity, and binary relationship for the ER and class, field, and reference for the OO. Moreover, constructs have a name, may have properties and are related to one another.

A UML class diagram of this construct-based representation of a simple ER model with entities, attributes of entities and binary relationships is depicted in Figure 1. Construct Entity has no attribute and no reference; construct AttributeOfEntity has a boolean property to specify whether an attribute is part of the identifier of the entity it belongs to and a property type to specify the data type of the attribute itself; construct BinaryRelationship has two references toward the entities involved in the relationship and several properties to specify role, minimum and maximum cardinalities of the involved entities, and whether the first entity is externally identified by the relationship itself.
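The construct-based view of the simple ER model can be sketched in Python as one class per construct. This is only an illustrative sketch, not the authors' implementation: the field names (in particular `is_optional1`, `is_functional1` and `is_identified`, which are assumed encodings of minimum cardinality, maximum cardinality and external identification) and the helper `property_names` are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class Entity:
    """ER construct: a name, no further properties, no references."""
    name: str


@dataclass
class AttributeOfEntity:
    """ER construct: an attribute owned by exactly one entity."""
    name: str
    entity: Entity        # reference to the owning entity
    type: str             # data type of the attribute
    is_identifier: bool   # True if the attribute is part of the entity identifier


@dataclass
class BinaryRelationship:
    """ER construct: a relationship between two entities."""
    name: str
    entity1: Entity               # first reference
    entity2: Entity               # second reference
    role1: str = ""
    role2: str = ""
    is_optional1: bool = False    # assumed encoding: minimum cardinality 0 for entity1
    is_optional2: bool = False
    is_functional1: bool = False  # assumed encoding: maximum cardinality 1 for entity1
    is_functional2: bool = False
    is_identified: bool = False   # first entity externally identified by the relationship


# Under this view, the data model is nothing more than the set of its constructs.
ER_MODEL = (Entity, AttributeOfEntity, BinaryRelationship)


def property_names(construct):
    """List the names, properties and references declared for a construct class."""
    return [f.name for f in fields(construct)]


print(property_names(AttributeOfEntity))
```

A variant of the model (for example, an ER model with generalizations) would add or remove classes from `ER_MODEL`, which is exactly why the number of constructs grows with the number of model variants.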
Fig. 1. A simple entity-relationship model
Fig. 2. A simple object-oriented model
With similar considerations about a simple OO model with classes, simple fields (i.e. with standard type) and reference fields (i.e. a reference from a class to another) we obtain the UML class diagram of Figure 2. Construct Class has no attribute and no reference; construct Field is similar to AttributeOfEntity but it does not have boolean attributes, assuming that we do not want to manage explicit identifiers of objects; construct ReferenceField has two references, toward the class that owns the reference and the class the reference points to.

In this way, we have uniform representations of models (in terms of constructs) but these representations are not general. This becomes unfeasible as the number of (variants of) models grows, because the number of constructs rises correspondingly. To overcome this limit, we exploit an observation of Hull and King [20], drawn on later by Atzeni and Torlone [7]: most known models have constructs that can be classified according to a rather small set of generic (i.e. model-independent) metaconstructs: lexical, abstract, aggregation, generalization, and function. Recalling our example, entities and classes play the same role (or, in other terms, “they have the same meaning”), and so we can define a generic metaconstruct, called Abstract, to represent both these concepts; the same happens for attributes of entities and of relationships and fields of classes, representable by means of a metaconstruct called Lexical. Conversely, relationships and references do not have the same meaning and hence one metaconstruct is not enough to properly represent both concepts (hence BinaryAggregationOfAbstracts and AbstractAttribute are both included).

Hence, each model is defined by its constructs and the metaconstructs they refer to. This representation is clearly at the same time model-independent (in the sense that it allows for a uniform representation of different data models) and model-aware (in the sense that it is possible to say whether a schema is allowed for a data model).

An even more important notion is that of supermodel (also called universal metamodel in the literature [13,24]): it is a model that has a construct for each metaconstruct, in the most general version. Therefore, each model can be seen as a specialization of the supermodel, except for renaming of constructs. A conceptual view of the essentials of this idea is shown in Figure 3: the supermodel portion is predefined, but can be extended (and we will present our recent extension later in this paper), whereas models are defined by specifying their respective constructs, each of which refers to a construct of the supermodel (SM-Construct) and so to a metaconstruct. It is important to observe that our approach is independent of the specific supermodel that is adopted, as new metaconstructs and so SM-Constructs can be added. This allows us to show simplified examples for the set of constructs, without losing the generality of the approach.

In this scenario, a schema for a certain model is a set of instances of constructs allowed in that model. Let us consider the simple ER schema depicted in Figure 4. Its construct-based representation would include two instances of Entity (i.e. Employee and Project), one instance of BinaryRelationship (i.e. Membership) and four instances of AttributeOfEntity (i.e. EN, Name, Code, and Name). The model-independent representation (i.e. based on metaconstructs) would include two instances of Abstract, one instance of BinaryAggregationOfAbstracts and four instances of Lexical. For each of these instances we have to specify values for its attributes and references, meaningful for the model. So for example, the instance of Lexical corresponding to EN would refer to the instance of Abstract corresponding to Employee through its abstractOID reference and would have a ‘true’ value for its isIdentifier attribute. This example is illustrated in Figure 5, where we omit irrelevant properties, represent references only by means of arrows, and represent links between constructs and their instances by means of dashed arrows. In the same way, we can state that a database for a certain schema is a set of instances of constructs of that schema.
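The two properties discussed above can be illustrated with a small Python sketch. It assumes a deliberately simplified encoding that is not taken from the paper: a model is a dictionary from construct names to metaconstruct names, and a schema is a list of (construct, instance name) pairs. The function names `to_supermodel` and `is_allowed` are hypothetical, and the membership test ignores properties and references, so it is only a coarse approximation of model-awareness.

```python
from collections import Counter

# Each model maps its own construct names to the metaconstructs they refer to.
ER = {
    "Entity": "Abstract",
    "AttributeOfEntity": "Lexical",
    "BinaryRelationship": "BinaryAggregationOfAbstracts",
}
OO = {
    "Class": "Abstract",
    "Field": "Lexical",
    "ReferenceField": "AbstractAttribute",
}

# Construct-based representation of the ER schema of Figure 4.
ER_SCHEMA = [
    ("Entity", "Employee"),
    ("Entity", "Project"),
    ("BinaryRelationship", "Membership"),
    ("AttributeOfEntity", "EN"),
    ("AttributeOfEntity", "Name"),
    ("AttributeOfEntity", "Code"),
    ("AttributeOfEntity", "Name"),
]


def to_supermodel(schema, model):
    """Model-independent view: rename each construct to its metaconstruct."""
    return [(model[construct], name) for construct, name in schema]


def is_allowed(sm_schema, model):
    """Model-aware check: every metaconstruct used must be offered by the model."""
    offered = set(model.values())
    return all(metaconstruct in offered for metaconstruct, _ in sm_schema)


sm_schema = to_supermodel(ER_SCHEMA, ER)
counts = Counter(metaconstruct for metaconstruct, _ in sm_schema)
print(counts["Abstract"], counts["BinaryAggregationOfAbstracts"], counts["Lexical"])
print(is_allowed(sm_schema, ER), is_allowed(sm_schema, OO))
```

The counts reproduce the two instances of Abstract, one of BinaryAggregationOfAbstracts and four of Lexical mentioned in the text; the schema is rejected for the OO model because that model offers no construct referring to BinaryAggregationOfAbstracts.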
Fig. 3. A simplified conceptual view of models and constructs
Fig. 4. A simple entity-relationship schema
Fig. 5. A construct based representation of the schema of Figure 4
As a second example, let us consider the simple OO schema depicted in Figure 6. Its construct-based representation would include two instances of Class (i.e. Employee and Department), one instance of ReferenceField (i.e. Membership) and five instances of Field (i.e. EmpNo, Name, Salary, DeptNo, and DeptName). Alternatively, the model-independent representation (i.e. based on metaconstructs) would include two instances of Abstract, one instance of AbstractAttribute and five instances of Lexical.

Fig. 6. A simple object-oriented schema

On the other hand, the same approach based on “concepts of interest” can be used to obtain a high-level description of the supermodel (i.e. of the whole set of metaconstructs). From this point of view the concepts of interest are three: construct, construct property and construct reference. In this way we have a full description of the supermodel, with constructs, properties and references, as follows. Each construct has a name and a boolean attribute (isLexical) that specifies whether its instances have actual, elementary values associated with them (for example, this property would be true for Lexical and false for Abstract). Each property belongs to a construct and has a name and a type. Each reference relates two constructs and has a name. A UML class diagram of this representation is presented in Figure 7.

Fig. 7. A description of the supermodel
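A minimal sketch of this three-concept description follows, written in Python under the assumption that each concept becomes a small record type. The class and function names are hypothetical; the sample rows follow the excerpt shown in Figure 9.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Construct:
    name: str
    is_lexical: bool      # do instances carry actual, elementary values?


@dataclass(frozen=True)
class Property:
    name: str
    construct: Construct  # the construct the property belongs to
    type: str


@dataclass(frozen=True)
class Reference:
    name: str
    construct: Construct     # construct the reference starts from
    construct_to: Construct  # construct the reference points to


abstract = Construct("Abstract", is_lexical=False)
lexical = Construct("Lexical", is_lexical=True)
aggregation = Construct("Aggregation", is_lexical=False)

properties = [
    Property("Name", abstract, "string"),
    Property("Name", lexical, "string"),
    Property("IsIdentifier", lexical, "bool"),
    Property("IsOptional", lexical, "bool"),
    Property("Type", lexical, "string"),
]
references = [
    Reference("Abstract", lexical, abstract),
    Reference("Aggregation", lexical, aggregation),
]


def describe(construct):
    """Return the property names and reference names declared for a construct."""
    props = [p.name for p in properties if p.construct == construct]
    refs = [r.name for r in references if r.construct == construct]
    return props, refs


print(describe(lexical))
```

Because the supermodel is itself described as plain data, adding a metaconstruct amounts to adding records rather than changing code.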
3 A Multilevel Dictionary
The conceptual approach to the description of models and schemas presented in Section 2, although useful for introducing the idea, is not suitable for actually storing data and metadata. Therefore, we have developed a relational implementation of the idea, leading to a multilevel dictionary organized in four parts, which can be characterized along two coordinates: the first corresponding to whether they describe models or schemas and the second depending on whether they refer to specific models or to the supermodel. This is represented in Figure 8.
                                      model specific     model generic
model descriptions (the “metalevel”)  metamodels (mM)    meta-supermodel (mSM)
schema descriptions                   models (M)         supermodel (SM)

Fig. 8. The four parts of the dictionary
The various portions of the dictionary correspond to the various UML class diagrams of Section 2. In the rest of this section, we comment on them in detail.

The meta-supermodel part of the dictionary describes the supermodel, that is, the set of constructs used for building schemas of various models. It is composed of three relations (whose names begin with MSM to recall that we are in the meta-supermodel portion), one for each “class” of the diagram of Figure 7. Every relation has one OID column and one column for each attribute and reference of the corresponding “class” of such a diagram. The relations of this part of the dictionary, with some of the data, are depicted in Figure 9. It is worth noting that these relations are rather small, because of the limited number of constructs in our supermodel.

MSM Construct
OID  Name                          IsLexical
mc1  Abstract                      false
mc2  Lexical                       true
mc3  Aggregation                   false
mc4  BinaryAggregationOfAbstracts  false
mc5  AbstractAttribute             false
...  ...                           ...

MSM Property
OID  Name          Construct  Type
mp1  Name          mc1        string
mp2  Name          mc2        string
mp3  IsIdentifier  mc2        bool
mp4  IsOptional    mc2        bool
mp5  Type          mc2        string
...  ...           ...        ...

MSM Reference
OID  Name         Construct  ConstructTo
mr1  Abstract     mc2        mc1
mr2  Aggregation  mc2        mc3
mr3  Abstract1    mc4        mc1
mr4  Abstract2    mc4        mc1
mr5  Abstract     mc5        mc1
mr6  AbstractTo   mc5        mc1
...  ...          ...        ...

Fig. 9. The mSM part of the dictionary

The metamodels part of the dictionary describes the individual models, that is, the set of specific constructs allowed in the various models, each one corresponding to a construct of the supermodel. It has the same structure as the meta-supermodel part with two differences: first, each relation has an extra column containing a reference towards the corresponding element of the supermodel (i.e. of the meta-supermodel part of the dictionary); second, there is an extra relation to store the names of the specific models and an extra column in the Construct relation referring to this extra relation. The relations of this part of the dictionary, with some of the data, are depicted in Figure 10.

MM Model
OID  Name
m1   ER
m2   OODB

MM Construct
OID  Name                Model  MSM-Constr.  IsLexical
co1  Entity              m1     mc1          false
co2  AttributeOfEntity   m1     mc2          true
co3  BinaryRelationship  m1     mc4          false
co4  Class               m2     mc1          false
co5  Field               m2     mc2          true
co6  ReferenceField      m2     mc5          false

MM Property
OID  Name     Constr.  Type    MSM-Prop.
pr1  Name     co1      string  mp1
pr2  Name     co2      string  mp2
pr3  IsKey    co2      bool    mp3
pr4  Name     co3      string  ...
pr5  IsOpt.1  co3      bool    ...
...  ...      ...      ...     ...
pr6  Name     co4      string  mp1
pr7  Name     co5      string  mp2
...  ...      ...      ...     ...

MM Reference
OID   Name     Constr.  Constr.To  MSM-Ref.
ref1  Entity   co2      co1        mr1
ref2  Entity   co3      co1        mr3
ref3  Entity   co3      co1        mr4
ref4  Class    co5      co4        mr1
ref5  Class    co6      co4        mr5
ref6  ClassTo  co6      co4        mr6

Fig. 10. The mM part of the dictionary

We refer to these first two parts as the “metalevel” of the dictionary, as it contains the description of the structure of the lower level, whose content describes schemas. The lower level is also composed of two parts, one referring to the supermodel constructs (therefore called the SM part) and the other to model-specific constructs (the M part). The structure of the schema level is, in our system, automatically generated out of the content of the metalevel: so, we can say that the dictionary is self-generating out of a small core.

In detail, in the model part there is one relation for each row of the MM Construct relation. Hence each of these relations corresponds to a construct and has, besides an OID column, one column for each property and reference specified for that construct in relations MM Property and MM Reference, respectively. Moreover, there is a relation Schema to store the names of the schemas stored in the dictionary, and each relation has an extra column referring to it. Hence, in practice, there is a set of relations for each specific model, with one relation for each construct allowed in the model. This portion of the dictionary is depicted in Figure 11, where we show the data for the schemas of Figures 4 and 6.

Analogously, in the supermodel part there is one relation for each row of the MSM Construct relation; hence each one of these relations corresponds to a metaconstruct (or a construct of the supermodel) and has, besides an OID column, one column for each property and reference specified for that metaconstruct in relations MSM Property and MSM Reference, respectively. Again, there is a relation Schema to store the names of the schemas stored in the dictionary, and each relation has an extra column referring to it. Moreover, the Schema relation has an extra column referring to the specific model each schema belongs to. This portion of the dictionary is depicted in Figure 12, where we show the data for the schemas of Figures 4 and 6, and hence the same data presented in Figure 11. It is worth noting that Abstract contains the same data as ER-Entity and OO-Class taken together. Similarly, Lexical contains the data in ER-AttributeOfEntity and OO-Field.
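The self-generating character of the dictionary can be sketched with Python's standard `sqlite3` module. This is a hedged illustration rather than the system's actual generator: the underscore table names (`MM_Model`, `MM_Construct`, ...), the simplified column names, the `SQL_TYPES` mapping and the function `generate_ddl` are all assumptions made here, and only two ER constructs from Figure 10 are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A reduced metalevel (mM part): models, constructs, properties, references.
conn.executescript("""
CREATE TABLE MM_Model     (OID TEXT PRIMARY KEY, Name TEXT);
CREATE TABLE MM_Construct (OID TEXT PRIMARY KEY, Name TEXT, Model TEXT);
CREATE TABLE MM_Property  (OID TEXT PRIMARY KEY, Name TEXT, Constr TEXT, Type TEXT);
CREATE TABLE MM_Reference (OID TEXT PRIMARY KEY, Name TEXT, Constr TEXT, ConstrTo TEXT);

INSERT INTO MM_Model VALUES ('m1', 'ER');
INSERT INTO MM_Construct VALUES ('co1', 'Entity', 'm1'),
                                ('co2', 'AttributeOfEntity', 'm1');
INSERT INTO MM_Property VALUES ('pr1', 'Name', 'co1', 'string'),
                               ('pr2', 'Name', 'co2', 'string'),
                               ('pr3', 'IsKey', 'co2', 'bool');
INSERT INTO MM_Reference VALUES ('ref1', 'Entity', 'co2', 'co1');
""")

# Assumed mapping from dictionary types to SQLite column types.
SQL_TYPES = {"string": "TEXT", "bool": "INTEGER", "int": "INTEGER"}


def generate_ddl(conn):
    """Build one CREATE TABLE statement per row of MM_Construct."""
    statements = []
    constructs = conn.execute(
        "SELECT c.OID, m.Name, c.Name FROM MM_Construct c "
        "JOIN MM_Model m ON c.Model = m.OID ORDER BY c.OID"
    ).fetchall()
    for oid, model, construct in constructs:
        cols = ['"OID" TEXT PRIMARY KEY']
        # One column per property of the construct.
        for name, typ in conn.execute(
            "SELECT Name, Type FROM MM_Property WHERE Constr = ? ORDER BY OID", (oid,)
        ).fetchall():
            cols.append(f'"{name}" {SQL_TYPES[typ]}')
        # One column per reference, holding the OID of the referred instance.
        for (name,) in conn.execute(
            "SELECT Name FROM MM_Reference WHERE Constr = ? ORDER BY OID", (oid,)
        ).fetchall():
            cols.append(f'"{name}" TEXT')
        # Extra column referring to the Schema relation.
        cols.append('"Schema" TEXT')
        column_list = ", ".join(cols)
        statements.append(f'CREATE TABLE "{model}_{construct}" ({column_list})')
    return statements


for statement in generate_ddl(conn):
    print(statement)
```

Running the same function over the meta-supermodel relations instead of the metamodel ones would, under the same assumptions, produce the relations of the SM part.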
Schema
OID  Name
s1   ER Schema
s2   OO Schema

ER-Entity
OID  Name      Schema
e1   Employee  s1
e2   Project   s1

ER-AttributeOfEntity
OID  Entity  Name  Type    isKey  Schema
a1   e1      EN    int     true   s1
a2   e1      Name  string  false  s1
a3   e2      Code  int     true   s1
a4   e2      Name  string  false  s1

ER-BinaryRelationship
OID  Name        IsOptional1  IsFunctional1  ...  Entity1  Entity2  Schema
r1   Membership  false        false          ...  e1       e2       s1

OO-Class
OID  Name        Schema
cl1  Employee    s2
cl2  Department  s2

OO-ReferenceField
OID   Name        Class  ClassTo  Schema
ref1  Membership  cl1    cl2      s2

OO-Field
OID  Class  Name      Type    Schema
f1   cl1    EmpNo     int     s2
f2   cl1    Name      string  s2
f3   cl1    Salary    int     s2
f4   cl2    DeptNo    int     s2
f5   cl2    DeptName  string  s2

Fig. 11. The dictionary for schemas of specific models

Schema
OID  Name       Model
s1   ER Schema  m1
s2   OO Schema  m2

Abstract
OID  Name        Schema
e1   Employee    s1
e2   Project     s1
cl1  Employee    s2
cl2  Department  s2

Lexical
OID  Abstract  Name      Type    IsIdentifier  Schema
a1   e1        EN        int     true          s1
a2   e1        Name      string  false         s1
a3   e2        Code      int     true          s1
a4   e2        Name      string  false         s1
f1   cl1       EmpNo     int     ?             s2
f2   cl1       Name      string  ?             s2
f3   cl1       Salary    int     ?             s2
f4   cl2       DeptNo    int     ?             s2
f5   cl2       DeptName  string  ?             s2

AbstractAttribute
OID   Name        Abstract  AbstractTo  Schema
ref1  Membership  cl1       cl2         s2

BinaryRelationship
OID  Name        IsOptional1  IsFunctional1  ...  Entity1  Entity2  Schema
r1   Membership  false        false          ...  e1       e2       s1

Fig. 12. A portion of the SM part of the dictionary
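The relationship between Figures 11 and 12 can be sketched as follows, assuming the model-specific relations are held as lists of tuples and that a hypothetical dictionary `METACONSTRUCT_OF` records which metaconstruct each relation refers to. Lexical is left out on purpose: OO-Field has no identifier column, which is why Figure 12 shows “?” for those rows.

```python
from collections import defaultdict

# Rows of some model-specific relations, copied from Figure 11.
MODEL_TABLES = {
    "ER-Entity": [("e1", "Employee", "s1"), ("e2", "Project", "s1")],
    "OO-Class": [("cl1", "Employee", "s2"), ("cl2", "Department", "s2")],
    "OO-ReferenceField": [("ref1", "Membership", "cl1", "cl2", "s2")],
}

# Hypothetical lookup: the metaconstruct each model-specific relation refers to.
METACONSTRUCT_OF = {
    "ER-Entity": "Abstract",
    "OO-Class": "Abstract",
    "OO-ReferenceField": "AbstractAttribute",
}


def build_sm_part(model_tables, metaconstruct_of):
    """Merge model-specific relations into one relation per metaconstruct."""
    sm = defaultdict(list)
    for table, rows in model_tables.items():
        sm[metaconstruct_of[table]].extend(rows)
    return dict(sm)


sm_part = build_sm_part(MODEL_TABLES, METACONSTRUCT_OF)
for row in sm_part["Abstract"]:
    print(row)
```

The resulting Abstract relation holds the rows of ER-Entity followed by those of OO-Class, matching the Abstract relation of Figure 12.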
48
4
P. Atzeni, G. Gianforme, and P. Cappellari
A Significant Supermodel with Models of Interest
As we said, our approach is fully extensible: it is possible to add new metaconstructs to represent new data models, as well as to refine and increase the precision of the current representations of models. The supermodel we have mainly experimented with so far is a supermodel for database models and covers a reasonable family of them. If models were more detailed (as is the case for a fully-fledged XSD model) then the supermodel would be more complex. Moreover, other supermodels can be used in different contexts: we have had preliminary experiences with Semantic Web models [5,6,18], with the management of annotations [25], and with adaptive systems [17].

In this section we discuss our current supermodel in detail. We describe all the metaconstructs of the supermodel, the concepts they represent, and how they can be used to properly represent several well-known data models. A complete description of all the metaconstructs follows:

Abstract - Any autonomous concept of the scenario.
Aggregation - A collection of elements with heterogeneous components. It makes no sense without its components.
StructOfAttributes - A structured element of an Aggregation, an Abstract, or another StructOfAttributes. It may be not always present (isOptional) and/or admit null values (isNullable). It may be multivalued or not (isSet).
AbstractAttribute - A reference towards an Abstract that may admit null values (isNullable). The reference may originate from an Abstract, an Aggregation, or a StructOfAttributes.
Generalization - A “structural” construct stating that an Abstract is the root of a hierarchy, possibly total (isTotal).
ChildOfGeneralization - Another “structural” construct, related to the previous one (it cannot be used without Generalization). It is used to specify that an Abstract is a leaf of a hierarchy.
Nest - A “structural” construct used to specify a nesting relationship between StructOfAttributes.
BinaryAggregationOfAbstracts - Any binary correspondence between (two) Abstracts. It is possible to specify optionality (isOptional1/2) and functionality (isFunctional1/2) of the involved Abstracts as well as their role (role1/2) or whether one of the Abstracts is identified in some way by such a correspondence (isIdentified).
AggregationOfAbstracts - Any n-ary correspondence between two or more Abstracts.
ComponentOfAggregationOfAbstracts - It states that an Abstract is one of those involved in an AggregationOfAbstracts (and hence cannot be used without AggregationOfAbstracts). It is possible to specify optionality (isOptional1/2) and functionality (isFunctional1/2) of the involved Abstract as well as whether the Abstract is identified in some way by such a correspondence (isIdentified).
Lexical - Any lexical value useful to specify features of Abstract, Aggregation, StructOfAttributes, AggregationOfAbstracts, or BinaryAggregationOfAbstracts. It is a typed attribute (type) that may admit null values, be optional, and be an identifier of the object it refers to (the latter is not applicable to Lexical of StructOfAttributes, BinaryAggregationOfAbstracts, and AggregationOfAbstracts).
ForeignKey - A “structural” construct stating the existence of some kind of referential integrity constraint between Abstract, Aggregation and/or StructOfAttributes, in every possible combination.
ComponentOfForeignKey - Another “structural” construct, related to the previous one (it cannot be used without ForeignKey). It is used to specify which Lexical attributes are involved (i.e. referring and referred) in a referential integrity constraint.

A UML class diagram of these (meta)constructs is presented in Figure 13.
Fig. 13. The Supermodel
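Three of the descriptions above state that a construct cannot be used without another one. A minimal Python sketch of how a tool might check this when a model is defined is given below; the dictionary `REQUIRES` and the function `missing_prerequisites` are hypothetical names, and only the dependencies stated explicitly in the text are encoded.

```python
# Dependent construct -> construct it cannot be used without (as stated in the list above).
REQUIRES = {
    "ChildOfGeneralization": "Generalization",
    "ComponentOfAggregationOfAbstracts": "AggregationOfAbstracts",
    "ComponentOfForeignKey": "ForeignKey",
}


def missing_prerequisites(constructs):
    """Return the dependent constructs whose required construct is absent."""
    present = set(constructs)
    return sorted(
        c for c in present if c in REQUIRES and REQUIRES[c] not in present
    )


# A set of constructs with foreign keys declared consistently.
consistent = {"Aggregation", "Lexical", "ForeignKey", "ComponentOfForeignKey"}
# A set that uses ChildOfGeneralization without Generalization.
inconsistent = {"Abstract", "Lexical", "ChildOfGeneralization"}

print(missing_prerequisites(consistent))
print(missing_prerequisites(inconsistent))
```

The check is purely name-based; constraints on properties (for example, that a Lexical of a StructOfAttributes cannot be an identifier) would need the instance-level data of the dictionary.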
We summarize constructs and (families of) models in Figure 14, where we show a matrix whose rows correspond to the constructs and whose columns correspond to the families we have experimented with. In the cells, we use the specific name used for the construct in the family (for example, Abstract is called Entity in the ER model). The various models within a family differ from one another (i) on the basis of the presence or absence of specific constructs and (ii) on the basis of details of (constraints on) them. To give an example for (i), let us recall that versions of the ER model could have generalizations, or not have them, and the OR model could have structured columns or just simple ones. For (ii) we can just mention again the various restrictions on relationships in the binary ER model (general vs. one-to-many), which can be specified by means of constraints on the properties. It is also worth mentioning that a given construct can be used in different ways (again, on the basis of conditions on the properties) in different families: for example, a structured attribute could be multivalued, or not, on the basis of the value of a property isSet. The remainder of this section is devoted to a detailed description of the various models.

Fig. 14. Constructs and models

4.1 Relational
We consider a relational model with tables composed of columns of a specified type; each column may allow null values or be part of the primary key of the table. Moreover, we can specify foreign keys between tables involving one or more columns. Figure 15 shows a UML class diagram of the constructs allowed in the relational model, with the following correspondences:
Table - Aggregation.
Column - Lexical. We can specify the data type of the column (type) and whether it is part of the primary key (isIdentifier) or allows null values (isNullable). It has a reference toward an Aggregation.
Foreign Key - ForeignKey and ComponentOfForeignKey. With the first construct (referencing two Aggregations) we specify the existence of a foreign key between two tables; with the second construct (referencing one ForeignKey and two Lexicals) we specify the columns involved in a foreign key.
Fig. 15. The Relational model
4.2 Binary ER
We consider a binary ER model with entities and relationships together with their attributes and generalizations (total or not). Each attribute can be optional or part of the identifier of an entity. For each relationship we specify minimum and maximum cardinality and whether an entity is externally identified by it. Figure 16 shows a UML class diagram of the constructs allowed in the model, with the following correspondences:
Entity - Abstract.
Attribute of Entity - Lexical. We can specify the data type of the attribute (type) and whether it is part of the identifier (isIdentifier) or is optional (isOptional). It refers to an Abstract.
Relationship - BinaryAggregationOfAbstracts. We can specify the minimum (0 or 1, with the property isOptional) and maximum (1 or N, with the property isFunctional) cardinality of the involved entities (referenced by the construct). Moreover, we can specify the role (role) of the involved entities and whether the first entity is externally identified by the relationship (IsIdentified).
Attribute of Relationship - Lexical. We can specify the data type of the attribute (type) and whether it is optional (isOptional). It refers to a BinaryAggregationOfAbstracts.
Generalization - Generalization and ChildOfGeneralization. With the first construct (referencing an Abstract) we specify the existence of a generalization rooted in the referenced Entity; with the second construct (referencing one Generalization and one Abstract) we specify the children of the generalization. We can specify whether the generalization is total or not (isTotal).
Fig. 16. The binary ER model
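The correspondences above can be illustrated with a minimal sketch of how a small binary ER schema might be stored as occurrences of supermodel constructs. This is only an illustration under stated assumptions: the `Occurrence` class, the helper functions, the reference names (`abstract`, `abstract1`, `abstract2`) and the properties `isOptional2` and `isFunctional2` are hypothetical and are not taken from the dictionary described in the paper. The Employee/Project/Membership names follow the ER example used later in the reporting section.

```python
from dataclasses import dataclass, field


@dataclass
class Occurrence:
    """One occurrence of a supermodel construct inside a schema (hypothetical layout)."""
    oid: str
    construct: str                              # supermodel name, e.g. "Abstract"
    name: str
    props: dict = field(default_factory=dict)   # property name -> value
    refs: dict = field(default_factory=dict)    # reference name -> OID of the target


def cardinality(is_optional, is_functional):
    """Translate isOptional / isFunctional into a (min, max) cardinality pair."""
    minimum = 0 if is_optional else 1
    maximum = 1 if is_functional else "N"
    return (minimum, maximum)


def er_name(occ, by_oid):
    """Name used in the binary ER family for a supermodel occurrence."""
    if occ.construct == "Abstract":
        return "Entity"
    if occ.construct == "BinaryAggregationOfAbstracts":
        return "Relationship"
    if occ.construct == "Lexical":
        # A Lexical is an attribute of whatever construct it refers to.
        owner = by_oid[next(iter(occ.refs.values()))]
        if owner.construct == "Abstract":
            return "Attribute of Entity"
        return "Attribute of Relationship"
    raise ValueError(f"{occ.construct} is not handled in this sketch")


# Assumed example data; property values for side 2 are invented for illustration.
schema = [
    Occurrence("e1", "Abstract", "Employee"),
    Occurrence("e2", "Abstract", "Project"),
    Occurrence("a1", "Lexical", "EN",
               {"type": "int", "isIdentifier": True, "isOptional": False},
               {"abstract": "e1"}),
    Occurrence("r1", "BinaryAggregationOfAbstracts", "Membership",
               {"isOptional1": False, "isFunctional1": False,
                "isOptional2": True, "isFunctional2": False},
               {"abstract1": "e1", "abstract2": "e2"}),
]
by_oid = {o.oid: o for o in schema}

r1 = by_oid["r1"]
print(er_name(by_oid["a1"], by_oid))
print(cardinality(r1.props["isOptional1"], r1.props["isFunctional1"]))
```

The same `Occurrence` records could describe a schema of any other family; only the name mapping in `er_name` is specific to binary ER.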
4.3 N-Ary ER
We consider an n-ary ER model with the same features as the aforementioned binary ER model. Figure 17 shows a UML class diagram of the constructs allowed in the model, with the following correspondences (we omit details already explained):
Entity - Abstract.
Attribute of Entity - Lexical.
Relationship - AggregationOfAbstracts and ComponentOfAggregationOfAbstracts. With the first construct we specify the existence of a relationship; with the second construct (referencing an AggregationOfAbstracts and an Abstract) we specify the entities involved in the relationship. We can specify the minimum (0 or 1, with the property isOptional) and maximum (1 or N, with the property isFunctional) cardinality of the involved entities. Moreover, we can specify whether an entity is externally identified by the relationship (IsIdentified).
Attribute of Relationship - Lexical. It refers to an AggregationOfAbstracts.
Generalization - Generalization and ChildOfGeneralization.
Fig. 17. The n-ary ER model
4.4 Object-Oriented
We consider an Object-Oriented model with classes and with simple and reference fields. We can also specify generalizations of classes. Figure 18 shows a UML class diagram of the constructs allowed in the model, with the following correspondences (we omit details already explained):
Class - Abstract.
Field - Lexical.
Reference Field - AbstractAttribute. It has two references, toward the referencing Abstract and the referenced one.
Generalization - Generalization and ChildOfGeneralization.
Fig. 18. The OO model
4.5 Object-Relational
We consider a simplified version of the Object-Relational model. We merge the constructs of our Relational and OO models, where we have typed tables rather than classes. Moreover, we consider structured columns of tables (typed or not) that can be nested. Reference columns must point toward a typed table but can be part of a table (typed or not) or of a structured column. Foreign keys can also involve typed tables and structured columns. Finally, we can specify generalizations that can involve only typed tables. Figure 19 shows a UML class diagram of the constructs allowed in the model, with the following correspondences (we omit details already explained):
Table - Aggregation.
Typed Table - Abstract.
Structured Column - StructOfAttributes and Nest. The structured column, represented by a StructOfAttributes, can allow null values or not (isNullable) and can be part of a simple table or of a typed table (this is specified by its references toward Abstract and Aggregation). We can specify nesting relationships between structured columns by means of Nest, which has two references, toward the top StructOfAttributes and the nested one.
Column - Lexical. It can be part of (i.e. refer to) a simple table, a typed table or a structured column.
Reference Column - AbstractAttribute. It may be part of a table (typed or not) or of a structured column (specified by a reference) and must refer to a typed table (i.e. it has a reference toward an Abstract).
Foreign Key - ForeignKey and ComponentOfForeignKey. With the first construct (referencing two tables, typed or not, and a structured column) we specify the existence of a foreign key between tables (typed or not) and structured columns; with the second construct (referencing one ForeignKey and two Lexicals) we specify the columns involved in a foreign key.
Generalization - Generalization and ChildOfGeneralization.
Fig. 19. The Object-Relational model
4.6 XSD as a Data Model
XSD is a very powerful technique for organizing documents and data, described by a very long specification. We consider a simplified version of the XSD language. We are only interested in documents that can be used to store large amounts of data; we therefore consider documents with at least one unbounded top element. We then deal with elements that can be simple or complex (i.e. structured). For these elements we can specify whether they are optional or whether they can be null (nillable, according to the syntax and terminology of XSD). Simple elements can be part of the key of the element they belong to and have an associated type. Moreover, we allow the definition of foreign keys (key and keyref, according to XSD terminology). Clearly, this representation is highly simplified but, as we said, it could be extended with other features if there were interest in them.
Figure 20 shows a UML class diagram of the constructs allowed in the model, with the following correspondences (we omit details already explained):
Root Element - Abstract.
Complex Element - StructOfAttributes and Nest. The first construct represents structured elements that can be unbounded or not (isSet), can allow null values or not (isNullable) and can be optional (isOptional). We can specify nesting relationships between complex elements by means of Nest, which has two references, toward the top StructOfAttributes and the nested one.
Simple Element - Lexical. It can be part of (i.e. refer to) a root element or a complex one.
Foreign Key - ForeignKey and ComponentOfForeignKey.
Fig. 20. The XSD language
5 Operators over Schema and Models
The model-independent and model-aware representation of data models and schemas can be the basis for many fruitful applications. Our first major application has been the development of a model-independent approach for schema and data translation [4] (a generic implementation of the modelgen operator, according to Bernstein’s model management [11]). We are currently working on additional applications, towards a more general model management system [2], the most interesting of which is related to set operators (i.e. union, difference, intersection). In this section we discuss the redefinition of these operators against our construct-based representation.
Let us concentrate on models first. The starting point is clearly the definition of an equality function between constructs. Two constructs belonging to two models are equal if and only if they correspond to the same metaconstruct, have the same properties with the same values, and, if they have references, they have the same references with the same values (i.e. the same number of references, towards constructs proved to be equal). Two main observations are needed. First, we can refer to supermodel constructs without loss of generality, as every construct of every specific model corresponds to a (meta)construct of the supermodel, as we said in Section 2. Second, the definition is recursive but well defined, since the graph of the supermodel (i.e. with constructs as nodes and references between constructs as edges) is acyclic; this implies that a partial order on the constructs can be found, and all the equality checks between constructs can be performed by traversing the graph according to such a partial order.
The union of two models is trivial, as we simply have to include in the result the constructs of both models. For difference and intersection, we need the aforementioned definition of equality between constructs. When one of these operators is applied, for each construct of the first model, we look for an equal construct in the second model. If the operator is the difference, the result is composed of all the constructs of the first model that do not have an equal construct in the second model; if the operator is the intersection, the result is composed only of the constructs of the first model that have an equal construct in the second model.
A very similar approach can be followed for set operators on schemas, which are usually called merge and diff [11], but which we can implement in terms of union and difference, provided they are supported by a suitable notion of equivalence. Some care is needed to consider details, but the basic idea is that the operators can be implemented by executing the set operations on the constructs of the various types, where the metalevel is used to determine which types are involved, namely those that are used in the model at hand.
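The recursive equality between constructs, and the difference and intersection built on it, can be sketched as follows. This is a minimal illustration, not the implementation discussed in the paper: the `Construct` class, the dictionary-based model layout, the reference names (`aggregation`, `from`, `to`) and the way properties are encoded are all assumptions made for the example. Union is omitted, since the text describes it as the plain inclusion of the constructs of both models.

```python
from dataclasses import dataclass, field


@dataclass
class Construct:
    """A construct of a model (hypothetical layout for illustration)."""
    meta: str                                   # corresponding supermodel metaconstruct
    props: dict = field(default_factory=dict)   # property name -> value
    refs: dict = field(default_factory=dict)    # reference name -> key of target construct


def equal(m1, k1, m2, k2):
    """Recursive equality; terminates because the reference graph is acyclic."""
    c1, c2 = m1[k1], m2[k2]
    if c1.meta != c2.meta or c1.props != c2.props:
        return False
    if c1.refs.keys() != c2.refs.keys():
        return False
    # Same references, towards constructs that are themselves equal.
    return all(equal(m1, c1.refs[r], m2, c2.refs[r]) for r in c1.refs)


def has_equal(m1, k1, m2):
    """True if some construct of m2 is equal to construct k1 of m1."""
    return any(equal(m1, k1, m2, k2) for k2 in m2)


def difference(m1, m2):
    """Constructs of m1 that have no equal construct in m2."""
    return {k: c for k, c in m1.items() if not has_equal(m1, k, m2)}


def intersection(m1, m2):
    """Constructs of m1 that have an equal construct in m2."""
    return {k: c for k, c in m1.items() if has_equal(m1, k, m2)}


# Assumed example: a relational model with and without foreign keys.
relational = {
    "Table": Construct("Aggregation"),
    "Column": Construct("Lexical",
                        {"isIdentifier": True, "isNullable": True},
                        {"aggregation": "Table"}),
    "ForeignKey": Construct("ForeignKey", {}, {"from": "Table", "to": "Table"}),
}
relational_no_fk = {
    "T": Construct("Aggregation"),
    "C": Construct("Lexical",
                   {"isIdentifier": True, "isNullable": True},
                   {"aggregation": "T"}),
}

print(sorted(difference(relational, relational_no_fk)))
print(sorted(intersection(relational, relational_no_fk)))
```

Note that equality here is decided by metaconstruct, properties and references only, so the different local names (`Table` vs. `T`) do not matter. A result of `difference` may still contain references to constructs that were dropped, which is one of the details the text says requires care.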
6 Reporting
In this section we focus on another interesting application of our approach, namely the possibility of producing reports for models and schemas, again in a manner that is both model-independent and model-aware. Reports can be rendered as detailed textual documentation of data organization, in a readable and machine-processable way, or as diagrams in a graphical user interface. Again, this is possible because of the supermodel: we visualize supermodel constructs together with their properties, and relate them to each other by means of their references. In this way, we could obtain a “flat” report of a model, which does not distinguish between types of references; so, for example, the references between a ForeignKey and the two Aggregations involved in it would be represented in the same way as a reference from a Lexical towards an Abstract. This is clearly not satisfactory. The core idea is to classify the references in two classes: strong and weak. Instances of constructs related by means of a strong reference (e.g. an Abstract with its Lexicals) are presented together, while those having a weak relationship (e.g. a ForeignKey with the Aggregations involved in it) are presented in different elements.
In rendering reports as text, we adopt the XML format. The main advantage of XML reports is that they are both self-documenting and machine processable if needed. Constructs and their instances can be presented according to a partial order on the constructs, which can be found since, as we already said in the previous section, the graph of the supermodel (i.e. with constructs as nodes and references between constructs as edges) is acyclic. As we said in Section 2, a schema (as well as a model) is completely represented by the set of its constructs. Hence, a report for a schema includes a set of construct elements.
In order to produce a report for a schema S, we consider the supermodel constructs following a total order, C1, C2, ..., Cn (obtained by serializing a partial order on them). For each construct Ci, we consider its occurrences in S, and for each of them not yet inserted in the report, we add a construct element named Ci with all its properties as XML attributes. Let us consider an occurrence oij of Ci. If oij is pointed to by any strong reference, we add a set of component elements nested in the corresponding construct element: the set has a component element for each occurrence of a construct with a strong reference toward oij. If oij has any weak reference towards another occurrence of a construct, we add a set of reference elements: each element of this set corresponds to a weak reference and has the OID and name properties of the pointed occurrence as XML attributes. As an example, the textual report of the ER schema of Figure 4 would be as follows:
<schema name="ERsimple" model="binaryER">
  <ER-Entity OID="e1" name="Employee">
    <ER-AttributeOfEntity OID="a1" name="EN" isKey="true" type="int"/>
    <ER-AttributeOfEntity OID="a2" name="Name" isKey="false" type="string"/>
  </ER-Entity>
  <ER-Entity OID="e2" name="Project">
    ...
  </ER-Entity>
  <ER-BinaryRelationship OID="r1" name="Membership" isOptional1="false" isFunctional1="false" ...>
    <entity1 OID="e1" name="Employee"/>
    <entity2 OID="e2" name="Project"/>
  </ER-BinaryRelationship>
</schema>
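The report-generation procedure described above can be sketched in a few lines. The following is an illustration under stated assumptions rather than the tool's actual code: the total order `ORDER`, the set `STRONG` of strong references, the occurrence layout and the reference names are hypothetical, and elements are named after supermodel constructs instead of the model-specific names (such as ER-Entity) used in the example report.

```python
import xml.etree.ElementTree as ET

# Assumed total order on supermodel constructs (a serialized partial order).
ORDER = ["Abstract", "Lexical", "BinaryAggregationOfAbstracts"]
# Assumed classification: (construct, reference name) pairs that are strong.
STRONG = {("Lexical", "abstract")}

# Assumed example occurrences, loosely following the ER example in the text.
OCCURRENCES = [
    {"oid": "e1", "construct": "Abstract", "name": "Employee", "props": {}, "refs": {}},
    {"oid": "e2", "construct": "Abstract", "name": "Project", "props": {}, "refs": {}},
    {"oid": "a1", "construct": "Lexical", "name": "EN",
     "props": {"isIdentifier": True, "type": "int"}, "refs": {"abstract": "e1"}},
    {"oid": "a2", "construct": "Lexical", "name": "Name",
     "props": {"isIdentifier": False, "type": "string"}, "refs": {"abstract": "e1"}},
    {"oid": "r1", "construct": "BinaryAggregationOfAbstracts", "name": "Membership",
     "props": {"isOptional1": False, "isFunctional1": False},
     "refs": {"abstract1": "e1", "abstract2": "e2"}},
]


def build_report(schema_name, model, occurrences):
    """Build an XML report: strong references nest, weak references point."""
    root = ET.Element("schema", {"name": schema_name, "model": model})
    by_oid = {o["oid"]: o for o in occurrences}
    inserted = set()

    def element_for(o):
        inserted.add(o["oid"])
        attrs = {"OID": o["oid"], "name": o["name"]}
        for key, value in o["props"].items():
            attrs[key] = str(value).lower() if isinstance(value, bool) else str(value)
        el = ET.Element(o["construct"], attrs)
        # Occurrences with a strong reference toward o become nested components.
        for other in occurrences:
            for ref, target in other["refs"].items():
                if target == o["oid"] and (other["construct"], ref) in STRONG:
                    el.append(element_for(other))
        # Weak references of o become reference elements with OID and name.
        for ref, target in o["refs"].items():
            if (o["construct"], ref) not in STRONG:
                pointed = by_oid[target]
                ET.SubElement(el, ref, {"OID": pointed["oid"], "name": pointed["name"]})
        return el

    for construct in ORDER:
        for o in occurrences:
            if o["construct"] == construct and o["oid"] not in inserted:
                root.append(element_for(o))
    return root


report = build_report("ERsimple", "binaryER", OCCURRENCES)
print(ET.tostring(report, encoding="unicode"))
```

Because Lexicals are nested when their Abstract is visited, they are already marked as inserted when the loop reaches the Lexical construct, which mirrors the "not yet inserted in the report" condition in the text.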
As we already said, a second option for report rendering is through a visual graph. A few examples, for different models, are shown in Figures 21, 22, and 23.
Fig. 21. An ER schema
Fig. 22. An OO schema
Fig. 23. An XML-Schema
The rationale is the same as for textual reports:
– visualization is model independent, as it is defined for all schemas of all models in the same way: strong references lead to embedding the “component” construct within the “containing” one, whereas weak references lead to separate graphical objects, connected by means of arrows;
– visualization is model aware, in two senses: first, as usual, the specific features of each model are taken into account; second, and more important, for each family of models it is possible to associate a specific shape with each construct, thus following the usual representation for the model (see for example the usual notation for relationships in the ER model in Figure 21).
An extra feature of the graphical visualization is the possibility to represent instances of schemas also by means of a “relational” representation that follows straightforwardly from our construct-based modeling.
7 Conclusions
We have shown how a metamodel approach can be the basis for a number of model-generic and model-aware techniques for the solution of interesting problems. We have shown the dictionary we use to store our schemas and models, and a specific supermodel (a data model that generalizes all models of interest modulo construct renaming). This is the basis for the specification and implementation of interesting high-level operations, such as schema translation as well as set-theoretic union and difference. Another interesting application is the development of generic visualization and reporting features.
Acknowledgement We would like to thank Phil Bernstein for many useful discussions during the preliminary development of this work.
References
1. Allen, F.W., Loomis, M.E.S., Mannino, M.V.: The integrated dictionary/directory system. ACM Comput. Surv. 14(2), 245–286 (1982)
2. Atzeni, P., Bellomarini, L., Bugiotti, F., Gianforme, G.: From schema and model translation to a model management system. In: Gray, A., Jeffery, K., Shao, J. (eds.) BNCOD 2008. LNCS, vol. 5071, pp. 227–240. Springer, Heidelberg (2008)
3. Atzeni, P., Cappellari, P., Bernstein, P.A.: A multilevel dictionary for model management. In: Delcambre, L.M.L., Kop, C., Mayr, H.C., Mylopoulos, J., Pastor, Ó. (eds.) ER 2005. LNCS, vol. 3716, pp. 160–175. Springer, Heidelberg (2005)
4. Atzeni, P., Cappellari, P., Torlone, R., Bernstein, P.A., Gianforme, G.: Model-independent schema translation. VLDB J. 17(6), 1347–1370 (2008)
5. Atzeni, P., Del Nostro, P.: Management of heterogeneity in the semantic web. In: ICDE Workshops, p. 60. IEEE Computer Society, Los Alamitos (2006)
6. Atzeni, P., Paolozzi, S., Nostro, P.D.: Ontologies and databases: Going back and forth. In: ODBIS (VLDB Workshop), pp. 9–16 (2008)
7. Atzeni, P., Torlone, R.: A metamodel approach for the management of multiple models and translation of schemes. Information Systems 18(6), 349–362 (1993)
8. Atzeni, P., Torlone, R.: Management of multiple models in an extensible database design tool. In: Apers, P.M.G., Bouzeghoub, M., Gardarin, G. (eds.) EDBT 1996. LNCS, vol. 1057, pp. 79–95. Springer, Heidelberg (1996)
9. Batini, C., Battista, G.D., Santucci, G.: Structuring primitives for a dictionary of entity relationship data schemas. IEEE Trans. Software Eng. 19(4), 344–365 (1993)
10. Bernstein, P., Bergstraesser, T., Carlson, J., Pal, S., Sanders, P., Shutt, D.: Microsoft repository version 2 and the open information model. Information Systems 22(4), 71–98 (1999)
11. Bernstein, P.A.: Applying model management to classical meta data problems. In: CIDR Conference, pp. 209–220 (2003)
12. Bernstein, P.A., Halevy, A.Y., Pottinger, R.: A vision of management of complex models. SIGMOD Record 29(4), 55–63 (2000)
13. Bernstein, P.A., Melnik, S.: Model management 2.0: manipulating richer mappings. In: SIGMOD Conference, pp. 1–12 (2007)
14. Bézivin, J., Breton, E., Dupé, G., Valduriez, P.: The ATL transformation-based model management framework. Research Report 03.08, IRIN, Université de Nantes (2003)
15. Claypool, K.T., Rundensteiner, E.A.: Sangam: A framework for modeling heterogeneous database transformations. In: ICEIS (1), pp. 219–224 (2003)
16. Claypool, K.T., Rundensteiner, E.A., Zhang, X., Su, H., Kuno, H.A., Lee, W.-C., Mitchell, G.: Sangam - a solution to support multiple data models, their mappings and maintenance. In: SIGMOD Conference, p. 606 (2001)
17. De Virgilio, R., Torlone, R.: Modeling heterogeneous context information in adaptive web based applications. In: ICWE Conference, pp. 56–63. ACM, New York (2006)
18. Gianforme, G., Virgilio, R.D., Paolozzi, S., Nostro, P.D., Avola, D.: A novel approach for practical semantic web data management. In: Lovrek, I., Howlett, R.J., Jain, L.C. (eds.) KES 2008, Part II. LNCS, vol. 5178, pp. 650–655. Springer, Heidelberg (2008)
19. Hsu, C., Bouziane, M., Rattner, L., Yee, L.: Information resources management in heterogeneous, distributed environments: A metadatabase approach. IEEE Trans. Software Eng. 17(6), 604–625 (1991)
20. Hull, R., King, R.: Semantic database modelling: Survey, applications and research issues. ACM Computing Surveys 19(3), 201–260 (1987)
21. Kahn, B.K., Lumsden, E.W.: A user-oriented framework for data dictionary systems. DATA BASE 15(1), 28–36 (1983)
22. McGee, W.C.: Generalization: Key to successful electronic data processing. J. ACM 6(1), 1–23 (1959)
23. Melnik, S.: Generic Model Management: Concepts and Algorithms. Springer, Heidelberg (2004)
24. Mork, P., Bernstein, P.A., Melnik, S.: Teaching a schema translator to produce O/R views. In: Parent, C., Schewe, K.-D., Storey, V.C., Thalheim, B. (eds.) ER 2007. LNCS, vol. 4801, pp. 102–119. Springer, Heidelberg (2007)
25. Paolozzi, S., Atzeni, P.: Interoperability for semantic annotations. In: DEXA Workshops, pp. 445–449. IEEE Computer Society, Los Alamitos (2007)
26. Rahm, E., Do, H.: On metadata interoperability in data warehouses. Technical report, University of Leipzig (2000)
27. Soley, R., The OMG Staff Strategy Group: Model driven architecture. White paper, draft 3.2, Object Management Group (November 2000)
28. Song, G., Zhang, K., Wong, R.: Model management though graph transformations. In: IEEE Symposium on Visual Languages and Human Centric Computing, pp. 75–82 (2004)
Data Mining Using Graphics Processing Units
Christian Böhm¹, Robert Noll¹, Claudia Plant², Bianca Wackersreuther¹, and Andrew Zherdin²
¹ University of Munich, Germany {boehm,noll,wackersreuther}@dbs.ifi.lmu.de
² Technische Universität München, Germany {plant,zherdin}@lrz.tum.de
Abstract. During the last few years, Graphics Processing Units (GPUs) have evolved from simple devices for display signal preparation into powerful coprocessors that not only support typical computer graphics tasks such as rendering of 3D scenarios but can also be used for general numeric and symbolic computation tasks such as simulation and optimization. As a major advantage, GPUs provide extremely high parallelism (with several hundred simple programmable processors) combined with a high bandwidth in memory transfer at low cost. In this paper, we propose several algorithms for computationally expensive data mining tasks like similarity search and clustering which are designed for the highly parallel environment of a GPU. We define a multidimensional index structure which is particularly suited to support similarity queries under the restricted programming model of a GPU, and define a similarity join method. Moreover, we define highly parallel algorithms for density-based and partitioning clustering. In an extensive experimental evaluation, we demonstrate the superiority of our algorithms running on GPU over their conventional counterparts on CPU.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 63–90, 2009. © Springer-Verlag Berlin Heidelberg 2009
1 Introduction
In recent years, Graphics Processing Units (GPUs) have evolved from simple devices for display signal preparation into powerful coprocessors supporting the CPU in various ways. Graphics applications such as realistic 3D games are computationally demanding and require a large number of complex algebraic operations for each update of the display image. Therefore, today’s graphics hardware contains a large number of programmable processors which are optimized to cope with this high workload of vector, matrix, and symbolic computations in a highly parallel way. In terms of peak performance, the graphics hardware has outperformed state-of-the-art multi-core CPUs by a large margin.
The amount of scientific data is approximately doubling every year [26]. To keep pace with the exponential data explosion, there is a great effort in many research communities such as life sciences [20,22], mechanical simulation [27], cryptographic computing [2], or machine learning [7] to use the computational capabilities of GPUs even for purposes which are not at all related to computer graphics. The corresponding research area is called General Processing-Graphics Processing Units (GP-GPU).
In this paper, we focus on exploiting the computational power of GPUs for data mining. Data Mining consists of ’applying data analysis algorithms, that, under acceptable efficiency limitations, produce a particular enumeration of patterns over the data’ [9]. The exponential increase in data does not necessarily come along with a correspondingly large gain in knowledge. The evolving research area of data mining proposes techniques to support transforming the raw data into useful knowledge. Data mining has a wide range of scientific and commercial applications, for example in neuroscience, astronomy, biology, marketing, and fraud detection. The basic data mining tasks include classification, regression, clustering, outlier identification, as well as frequent itemset and association rule mining. Classification and regression are called supervised data mining tasks, because the aim is to learn a model for predicting a predefined variable. The other techniques are called unsupervised, because the user does not previously identify any of the variables to be learned. Instead, the algorithms have to automatically identify any interesting regularities and patterns in the data.
Clustering is probably the most common unsupervised data mining task. The goal of clustering is to find a natural grouping of a data set such that data objects assigned to a common group, called a cluster, are as similar as possible and objects assigned to different clusters differ as much as possible. Consider for example the set of objects visualized in Figure 1. A natural grouping would assign the objects to two different clusters. Two outliers not fitting well to any of the clusters should be left unassigned. Like most data mining algorithms, the definition of clustering requires specifying some notion of similarity among objects.
In most cases, the similarity is expressed in a vector space, called the feature space. In Figure 1, we indicate the similarity among objects by representing each object by a vector in two-dimensional space. Characteristic numerical properties (from a continuous space) are extracted from the objects and combined into a vector x ∈ R^d, where d is the dimensionality of the space, i.e. the number of properties which have been extracted. For instance, Figure 2 shows a feature transformation where the object is a certain kind of orchid. The phenotype of orchids can be characterized using the lengths and widths of the two petals and the three sepals, the form (curvature) of the labellum, and the colors of the different compartments. In this example, 5 features are measured, and each object is thus transformed into a 5-dimensional vector space.
Fig. 1. Example for Clustering
Fig. 2. The Feature Transformation
To measure the similarity between two feature vectors, usually a distance function like the Euclidean metric is used. To search in a large database for objects which are similar to a given query object (for instance to search for a number k of nearest neighbors, or for those objects having a distance that does not exceed a threshold ε), usually multidimensional index structures are applied. By a hierarchical organization of the data set, the search is made efficient. The well-known indexing methods (e.g. the R-tree [13]) are designed and optimized for secondary storage (hard disks) or for main memory. For use on the GPU, specialized indexing methods are required because of the highly parallel but restricted programming environment. In this paper, we propose such an indexing method.
It has been shown that many data mining algorithms, including clustering, can be supported by a powerful database primitive: the similarity join [3]. This operator yields as result all pairs of objects in the database having a distance of less than some predefined threshold ε. To show that the more complex basic operations of similarity search and data mining can also be supported by novel parallel algorithms specially designed for the GPU, we propose two algorithms for the similarity join, one being a nested block loop join, and one being an indexed loop join utilizing the aforementioned indexing structure. Finally, to demonstrate that highly complex data mining tasks can be efficiently implemented using novel parallel algorithms, we propose parallel versions of two widespread clustering algorithms. We demonstrate how the density-based clustering algorithm DBSCAN [8] can be effectively supported by the parallel similarity join.
In addition, we introduce a parallel version of K-means clustering [21], which follows an algorithmic paradigm very different from density-based clustering. We demonstrate the superiority of our approaches over the corresponding sequential algorithms on CPU.
All algorithms for GPU have been implemented using NVIDIA’s Compute Unified Device Architecture (CUDA) technology [1]. Vendors of graphics hardware have recently anticipated the trend towards general purpose computing on GPU and developed libraries, pre-compilers and application programming interfaces to support GP-GPU applications. CUDA offers a programming interface for the C programming language in which both the host program and the kernel functions are assembled in a single program [1]. The host program is the main program, executed on the CPU. In contrast, the so-called kernel functions are executed in a massively parallel fashion on the (hundreds of) processors in the GPU. An analogous technique is also offered by ATI using the brand names Close-to-Metal, Stream SDK, and Brook-GP.
The remainder of this paper is organized as follows: Section 2 reviews the related work in GPU processing in general with particular focus on database management and data mining. Section 3 explains the graphics hardware and the CUDA programming model. Section 4 develops a multidimensional index structure for similarity queries on the GPU. Section 5 presents the non-indexed and indexed join on graphics hardware. Section 6 and Section 7 are dedicated to GPU-capable algorithms for density-based and partitioning clustering. Section 8 contains an extensive experimental evaluation of our techniques, and Section 9 summarizes the paper and provides directions for future research.
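To make the similarity join mentioned in this introduction concrete, the following minimal sketch computes it sequentially with a plain nested loop and the Euclidean metric. It is a CPU baseline written for illustration only, not one of the GPU algorithms proposed in this paper; the function names and the sample points are assumptions.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two feature vectors of equal dimensionality."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def similarity_join(points, eps):
    """Return all index pairs (i, j), i < j, whose distance is less than eps."""
    result = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if euclidean(points[i], points[j]) < eps:
                result.append((i, j))
    return result


# Assumed sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
pairs = similarity_join(points, eps=1.5)
print(pairs)
```

The nested loop performs a quadratic number of distance computations, all independent of one another, which is the property that makes the join attractive for massively parallel hardware.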
2 Related Work
In this section, we survey the related research in general purpose computations using GPUs with particular focus on database management and data mining.
General Processing-Graphics Processing Units. Theoretically, GPUs are capable of performing any computation that can be mapped to the GPU's model of parallelism and that fits its specific architecture. This model has been exploited in multiple research areas. Liu et al. [20] present a new approach to high performance molecular dynamics simulations on graphics processing units, using CUDA to design and implement a new parallel algorithm. Their results indicate a significant performance improvement on an NVIDIA GeForce 8800 GTX graphics card over sequential processing on CPU. Another paper on computations from the field of life sciences has been published by Manavski and Valle [22]. The authors propose an extremely fast solution of the Smith-Waterman algorithm, a procedure for searching for similarities in protein and DNA databases, running on GPU and implemented in the CUDA programming environment. Significant speedups are achieved on a workstation running two GeForce 8800 GTX cards. Another widespread application area that uses the processing power of the GPU is mechanical simulation. One example is the work by Tasora et al. [27], which presents a novel method for solving large cone complementarity problems by means of a fixed-point iteration algorithm, in the context of simulating the frictional contact dynamics of large systems of rigid bodies. Like the approaches from the life sciences reviewed above, the algorithm is implemented in CUDA for a GeForce 8800 GTX to simulate the dynamics of complex systems. To demonstrate the nearly boundless possibilities of performing computations on the GPU, we introduce one more example, namely cryptographic computing [2]. In that paper, the authors present a record-breaking performance for the elliptic curve method (ECM) of integer factorization. The speedup takes advantage of two NVIDIA GTX 295 graphics cards, using a new ECM implementation relying on new parallel addition formulas and functions that are made available by CUDA.
Database Management Using GPUs. Some papers propose techniques to speed up relational database operations on GPU. In [14] some algorithms for the relational join on an NVIDIA G80 GPU using CUDA are presented. Two recent papers [19,4] address the topic of the similarity join in feature space, which determines all pairs of objects from two different sets R and S fulfilling a certain join predicate. The most common join predicate is the ε-join, which determines all pairs of objects having a distance of less than a predefined threshold ε. The authors of [19] propose an algorithm based on the concept of space filling curves, e.g. the z-order, for pruning of the search space, running on an NVIDIA GeForce 8800 GTX using the CUDA toolkit. The z-order of a set of objects can be determined very efficiently on GPU by highly parallelized sorting. Their algorithm operates on a set of z-lists of different granularity for efficient pruning. However, since all dimensions are treated equally, performance degrades in higher dimensions. In addition, due to uniform space partitioning in all areas of the data space, space filling curves are not suitable for clustered data. An approach that overcomes this kind of problem is presented in [4]. Here the authors parallelize the baseline technique underlying any join operation with an arbitrary join predicate, namely the nested loop join (NLJ), a powerful database primitive that can be used to support many applications including data mining. All experiments are performed on NVIDIA 8500GT graphics processors using a CUDA-supported implementation. Govindaraju et al. [10,11] demonstrate that important building blocks for query processing in databases, e.g. sorting, conjunctive selections, aggregations, and semi-linear queries, can be significantly sped up by the use of GPUs.
Data Mining Using GPUs. Recent approaches concerning data mining using the GPU are two papers on clustering on GPU that do not use CUDA. In [6] a clustering approach on an NVIDIA GeForce 6800 GT graphics card is presented that extends the basic idea of K-means by calculating the distances from a single input centroid to all objects at one time, which can be done simultaneously on GPU. Thus the authors are able to exploit the high computational power and pipeline of GPUs, especially for core operations like distance computations and comparisons. An additional efficient method that is designed to execute clustering on data streams confirms a wide practical field of clustering on GPU. The paper [25] parallelizes the K-means algorithm for use on a GPU by using multi-pass rendering and multi shader program constants. The implementation on NVIDIA 5900 and NVIDIA 8500 graphics processors achieves significant performance gains for various data sizes and cluster sizes. However, unlike CUDA-based approaches, the algorithms of both papers are not portable to different GPU models.
C. Böhm et al.

3 Architecture of the GPU
Graphics Processing Units (GPUs) of the newest generation are powerful coprocessors, designed not only for games and other graphics-intensive applications but also for general-purpose computing (in this case, we call them GP-GPUs). From the hardware perspective, a GPU consists of a number of multiprocessors, each of which consists of a set of simple processors that operate in a SIMD fashion, i.e. all processors of one multiprocessor execute the same arithmetic or logic operation at the same time in a synchronized way, potentially operating on different data. For instance, the GPU of the newest generation, GT200 (e.g. on the graphics card GeForce GTX 280), has 30 multiprocessors, each consisting of 8 SIMD-processors, for a total of 240 processors inside one GPU. The computational power sums up to a peak performance of 933 GFLOP/s.

3.1 The Memory Model
Apart from some memory units with a special purpose in the context of graphics processing (e.g. texture memory), we have three important types of memory, as visualized in Figure 3. The shared memory (SM) is a memory unit with fast access (at the speed of register access, i.e. no delay). SM is shared among all processors of a multiprocessor. It can be used for local variables but also to exchange information between threads on different processors of the same multiprocessor. It cannot be used for information which is shared among threads on different multiprocessors. SM is fast but very limited in capacity (16 KBytes per multiprocessor). The second kind of memory is the so-called device memory (DM), which is the actual video RAM of the graphics card (also used for frame buffers etc.). DM is physically located on the graphics card (but not inside the GPU) and is significantly larger than SM (typically up to some hundreds of MBytes), but also significantly slower. In particular, memory accesses to DM cause a typical latency delay of 400-600 clock cycles (on the GT200 GPU, corresponding to 300-500 ns). The bandwidth for transferring data between DM and GPU (141.7 GB/s on GT200) is higher than that between CPU and main memory (about 10 GB/s on current CPUs). DM can be used to share information between threads
Fig. 3. Architecture of a GPU
on different multiprocessors. If some threads schedule memory accesses to contiguous addresses, these accesses can be coalesced, i.e. combined to improve the access speed. A typical cooperation pattern for DM and SM is to copy the required information from DM to SM simultaneously from different threads (if possible, using coalesced accesses), then to let each thread compute the result on SM, and finally to copy the result back to DM. The third kind of memory considered here is the main memory, which is not part of the graphics card. The GPU has no access to the address space of the CPU. The CPU can only write to or read from DM using specialized API functions. In this case, the data packets have to be transferred via the Front Side Bus and the PCI-Express Bus. The bandwidth of these bus systems is strictly limited, and therefore these special transfer operations are considerably more expensive than direct accesses of the GPU to DM or direct accesses of the CPU to main memory.

3.2 The Programming Model
Threads form the basis of the programming model of GPUs. Threads are lightweight processes which are easy to create and to synchronize. In contrast to CPU processes, the generation and termination of GPU threads as well as context switches between different threads do not cause any considerable overhead. In typical applications, thousands or even millions of threads are created, for instance one thread per pixel in gaming applications. It is recommended to create a number of threads which is much higher than the number of available SIMD-processors, because context switches are also used to hide the latency delay of memory accesses: in particular, an access to the DM may cause a latency delay of 400-600 clock cycles, and during that time, a multiprocessor may continue its work with other threads. The CUDA programming library [1] contains API functions to create a large number of threads on the GPU, each of which executes a function called a kernel function. The kernel functions (which are executed in parallel on the GPU) as well as the host program (which is executed sequentially on the CPU) are defined in an extended syntax of the C programming language. The kernel functions are restricted with respect to functionality (e.g. no recursion). On GPUs, the threads do not even have an individual instruction pointer; rather, an instruction pointer is shared by several threads. For this purpose, threads are grouped into so-called warps (typically 32 threads per warp). One warp is processed simultaneously on the 8 processors of a single multiprocessor (SIMD) using 4-fold pipelining (for a total of 32 threads executed fully synchronously). If not all threads in a warp follow the same execution path, the different execution paths are executed in a serialized way. The number (8) of SIMD-processors per multiprocessor as well as the concept of 4-fold pipelining is constant on all current CUDA-capable GPUs. Multiple warps are grouped into thread groups (TG). It is recommended [1] to use multiples of 64 threads per TG. The different warps in a TG (as well as different warps of different TGs) are executed independently. The threads in one thread group use the same shared memory and may thus communicate and
share data via the SM. The threads in one thread group can be synchronized (i.e. all threads wait until all warps of the same group have reached that point of execution). The latency delay of the DM can be hidden by scheduling other warps of the same or a different thread group whenever one warp waits for an access to DM. To allow switching between warps of different thread groups on a multiprocessor, it is recommended that each thread uses only a small fraction of the shared memory and registers of the multiprocessor [1].

3.3 Atomic Operations
In order to synchronize parallel processes and to ensure the correctness of parallel algorithms, CUDA offers atomic operations such as increment, decrement, or exchange (to name just those, out of the large number of atomic operations, which will be needed by our algorithms). Most of the atomic operations work on integer data types in device memory. However, the newest version of CUDA (Compute Capability 1.3 of the GPU GT200) even allows atomic operations in SM. If, for instance, some parallel processes share a list as a common resource with concurrent reading from and writing to the list, it may be necessary to (atomically) increment a counter for the number of list entries (which is in most cases also used as the pointer to the first free list element). Atomicity implies in this case the following two requirements: If two or more threads increment the list counter, then (1) the counter value after all concurrent increments must be equal to the value before plus the number of concurrent increment operations, and (2) each of the concurrent threads must obtain a separate result of the increment operation, which indicates the index of the empty list element to which the thread can write its information. Therefore, most atomic operations return a result after their execution. For instance, the operation atomicInc has two parameters: the address of the counter to be incremented, and an optional threshold value which must not be exceeded by the operation. The operation works as follows: The counter value at the address is read and incremented (provided that the threshold is not exceeded). Finally, the old value of the counter (before incrementing) is returned to the kernel method which invoked atomicInc. If two or more threads (of the same or different thread groups) call some atomic operations simultaneously, the result of these operations is that of an arbitrary sequentialization of the concurrent operations. The operation atomicDec works in an analogous way. The operation atomicCAS performs a Compare-and-Swap operation. It has three parameters: an address, a compare value and a swap value. If the value at the address equals the compare value, the value at the address is replaced by the swap value. In every case, the old value at the address (before swapping) is returned to the invoking kernel method.
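The semantics described above can be modelled on the CPU. The following is only a sketch under stated assumptions: `DeviceCounter` and `concurrent_append` are hypothetical names, a lock stands in for the hardware atomicity, and `atomic_inc` follows the description in the text (no increment once the threshold is reached). CUDA's documented `atomicInc` differs in one detail: it resets the counter to 0 once the old value has reached the bound.

```python
import threading


class DeviceCounter:
    """Hypothetical CPU stand-in for one integer cell in device memory."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # serializes each read-modify-write

    def atomic_inc(self, threshold):
        """Increment unless the threshold is reached; return the old value."""
        with self._lock:
            old = self._value
            if old < threshold:
                self._value = old + 1
            return old

    def atomic_cas(self, compare, swap):
        """Store `swap` if the value equals `compare`; return the old value."""
        with self._lock:
            old = self._value
            if old == compare:
                self._value = swap
            return old

    def load(self):
        with self._lock:
            return self._value


def concurrent_append(n_threads, items_per_thread, capacity):
    """Each thread claims list slots through atomic_inc and writes into them."""
    slots = [None] * capacity
    counter = DeviceCounter()

    def worker(thread_id):
        for k in range(items_per_thread):
            index = counter.atomic_inc(capacity)  # old value = first free slot
            if index < capacity:                  # otherwise the list is full
                slots[index] = (thread_id, k)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return slots, counter.load()
```

Because every call returns a distinct old value, no two threads write to the same slot, and the final counter equals the number of successful increments. These are the two requirements stated in the text.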
4 An Index Structure for Similarity Queries on GPU
Many data mining algorithms for problems like classification, regression, clustering, and outlier detection use similarity queries as a building block. In many
cases, these similarity queries even represent the largest part of the computational effort of the data mining tasks, and, therefore, efficiency is of high importance here. Similarity queries are defined as follows: Given is a database $D = \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^d$ of $n$ vectors from a $d$-dimensional space, and a query object $q \in \mathbb{R}^d$. We distinguish between two different kinds of similarity queries, the range queries and the nearest neighbor queries:

Definition 1 (Range Query). Let $\varepsilon \in \mathbb{R}_0^+$ be a threshold value. The result of the range query is the set of the following objects:

$$N_\varepsilon(q) = \{x \in D : \|x - q\| \le \varepsilon\},$$

where $\|x - q\|$ is an arbitrary distance function between two feature vectors $x$ and $q$, e.g. the Euclidean distance.

Definition 2 (Nearest Neighbor Query). The result of a nearest neighbor query is the set:

$$NN(q) = \{x \in D : \forall x' \in D : \|x - q\| \le \|x' - q\|\}.$$
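As a plain illustration of Definitions 1 and 2, the two query types can be written as brute-force scans over the database. This is a minimal sketch, assuming points are stored as tuples and the Euclidean distance is used; the function names are chosen here for illustration only.

```python
import math


def range_query(D, q, eps):
    """Definition 1: all database points within distance eps of q."""
    return [x for x in D if math.dist(x, q) <= eps]


def nearest_neighbor_query(D, q):
    """Definition 2: all points whose distance to q is minimal (ties are kept)."""
    best = min(math.dist(x, q) for x in D)
    return [x for x in D if math.dist(x, q) == best]


D = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
in_range = range_query(D, (0.0, 0.0), 1.0)        # the first two points
nearest = nearest_neighbor_query(D, (0.9, 0.0))   # the point (1.0, 0.0)
```

Both scans visit every point of the database, which is the cost an index structure is meant to reduce.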
Definition 2 can also be generalized to the case of the k-nearest neighbor query ($NN_k(q)$), where a number k of nearest neighbors of the query object q is retrieved. The performance of similarity queries can be greatly improved if a multidimensional index structure supporting the similarity search is available. Our index structure needs to be traversed in parallel for many search objects using the kernel function. Since kernel functions do not allow any recursion, and as they need to have a small storage overhead for local variables etc., the index structure must be kept very simple as well. To achieve a good compromise between simplicity and selectivity of the index, we propose a data partitioning method with a constant number of directory levels. The first level partitions the data set D according to the first dimension of the data space, the second level according to the second dimension, and so on. Therefore, before starting the actual data mining method, some transformation technique should be applied which guarantees a high selectivity in the first dimensions (e.g. Principal Component Analysis, Fast Fourier Transform, Discrete Wavelet Transform, etc.). Figure 4 shows a simple, 2-dimensional example of a 2-level directory (plus the root node, which is considered as level 0), similar to [16,18]. The fanout of each node is 8. In our experiments in Section 8, we used a 3-level directory with fanout 16. Before starting the actual data mining task, our simple index structure must be constructed in a bottom-up way by fractionated sorting of the data: First, the data set is sorted according to the first dimension and partitioned into the specified number of quantile partitions. Then, each of the partitions is sorted individually according to the second dimension, and so on. The boundaries are stored using simple arrays which can be easily accessed in the subsequent kernel functions.
Fig. 4. Index Structure for GPU

In principle, this index construction can already be done on the GPU, because efficient sorting methods for GPU have been proposed [10]. Since bottom-up index construction is typically not very costly compared to the data mining algorithm, our method performs this preprocessing step on the CPU. When transferring the data set from the main memory into the device memory in the initialization step of the data mining method, our new method additionally has to transfer the directory (i.e. the arrays in which the coordinates of the page boundaries are stored). Compared to the complete data set, the directory is always small. The most important change in the kernel functions of our data mining methods concerns the determination of the $\varepsilon$-neighborhood of some given seed object $q$, which is done by exploiting SIMD-parallelism inside a multiprocessor. In the non-indexed version, this is done by a set of threads (inside a thread group), each of which iterates over a different part of the (complete) data set. In the indexed version, one of the threads iterates in a set of nested loops (one loop for each level of the directory) over those nodes of the index structure which represent regions of the data space that are intersected by the neighborhood sphere of $N_\varepsilon(q)$. In the innermost loop, we have one set of points (corresponding to a data page of the index structure) which is processed by exploiting the SIMD-parallelism, as in the non-indexed version.
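The fractionated sorting and the loop-based traversal described in this section can be sketched on the CPU as follows. This is an illustrative sketch, not the authors' implementation: it assumes a 2-level directory over the first two dimensions, stores pages as Python lists rather than flat arrays, and uses hypothetical names (`split_quantiles`, `build_index`, `indexed_range_query`).

```python
import math


def split_quantiles(items, fanout):
    """Cut a sorted list into at most `fanout` parts of nearly equal size."""
    n = len(items)
    bounds = [i * n // fanout for i in range(fanout + 1)]
    return [items[bounds[i]:bounds[i + 1]]
            for i in range(fanout) if bounds[i] < bounds[i + 1]]


def build_index(points, fanout):
    """Level 1 splits on dimension 0; level 2 splits each part on dimension 1."""
    directory = []
    for part in split_quantiles(sorted(points, key=lambda p: p[0]), fanout):
        pages = []
        for page in split_quantiles(sorted(part, key=lambda p: p[1]), fanout):
            pages.append((page[0][1], page[-1][1], page))   # (low, high, points)
        directory.append((part[0][0], part[-1][0], pages))  # (low, high, pages)
    return directory


def indexed_range_query(directory, q, eps):
    """One nested loop per directory level, without recursion."""
    result = []
    for lo0, hi0, pages in directory:
        if lo0 - eps <= q[0] <= hi0 + eps:          # partition may intersect the sphere
            for lo1, hi1, page in pages:
                if lo1 - eps <= q[1] <= hi1 + eps:
                    result.extend(p for p in page if math.dist(p, q) <= eps)
    return result


def brute_force_range_query(points, q, eps):
    """Reference result used to check the index."""
    return [p for p in points if math.dist(p, q) <= eps]
```

A partition is skipped only if the query coordinate lies more than eps outside its interval in the corresponding dimension, so no point of the $\varepsilon$-neighborhood can be lost; the final distance test removes the remaining false candidates.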
5 The Similarity Join
The similarity join is a basic operation of a database system designed for similarity search and data mining on feature vectors. In such applications, we are given a database D of objects which are associated with a vector from a multidimensional space, the feature space. The similarity join determines pairs of objects which are similar to each other. The most widespread form is the $\varepsilon$-join, which determines those pairs from $D \times D$ which have a Euclidean distance of no more than a user-defined radius $\varepsilon$:

Definition 3 (Similarity Join). Let $D \subseteq \mathbb{R}^d$ be a set of feature vectors of a $d$-dimensional vector space and $\varepsilon \in \mathbb{R}_0^+$ be a threshold. Then the similarity join is the following set of pairs:

$$\mathrm{SimJoin}(D, \varepsilon) = \{(x, x') \in (D \times D) : \|x - x'\| \le \varepsilon\}.$$
If $x$ and $x'$ are elements of the same set, the join is a similarity self-join. Most algorithms, including the method proposed in this paper, can also be generalized to the more general case of non-self-joins in a straightforward way. Algorithms for a similarity join with nearest neighbor predicates have also been proposed. The similarity join is a powerful building block for similarity search and data mining. It has been shown that important data mining methods such as clustering and classification can be based on the similarity join. Using a similarity join instead of single similarity queries can accelerate data mining algorithms by a high factor [3].

5.1 Similarity Join without Index Support
The baseline technique to process any join operation with an arbitrary join predicate is the nested loop join (NLJ), which performs two nested loops, each enumerating all points of the data set. For each pair of points, the distance is calculated and compared to ε. The pseudocode of the sequential version of NLJ is given in Figure 5.

algorithm sequentialNLJ(data set D)
    for each q ∈ D do                  // outer loop
        for each x ∈ D do              // inner loop: search all points x which are similar to q
            if dist(x, q) ≤ ε then
                report (x, q) as a result pair or do some further processing on (x, q)
end
Fig. 5. Sequential Algorithm for the Nested Loop Join
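For reference, Figure 5 can be transcribed almost literally into Python. This is only a sketch: the function name `sequential_nlj`, the tuple representation of points and the use of the Euclidean distance via `math.dist` are assumptions made for illustration.

```python
import math


def sequential_nlj(D, eps):
    """Nested loop self-join: report every pair (x, q) with dist(x, q) <= eps."""
    result = []
    for q in D:                              # outer loop
        for x in D:                          # inner loop: all points similar to q
            if math.dist(x, q) <= eps:
                result.append((x, q))
    return result


D = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
pairs = sequential_nlj(D, 1.0)
```

As in the figure, the join reports each point paired with itself and each similar pair in both orders; the cost is n distance computations per query point.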
It is easily possible to parallelize the NLJ, e.g. by creating an individual thread for each iteration of the outer loop. The kernel function then contains the inner loop, the distance calculation and the comparison. During the complete run of the kernel function, the current point of the outer loop is constant, and we call this point the query point q of the thread, because the thread operates like a similarity query, in which all database points with a distance of no more than $\varepsilon$ from q are searched. The query point q is always held in a register of the processor. Our GPU allows a truly parallel execution of a number m of incarnations of the outer loop, where m is the total number of ALUs of all multiprocessors (i.e. the warp size 32 times the number of multiprocessors). Moreover, all the different warps are processed in a quasi-parallel fashion, which makes it possible to operate on one warp of threads (which is ready to run) while another warp is blocked due to the latency delay of a DM access of one of its threads. The threads are grouped into thread groups, which share the SM. In our case, the SM is used in particular to physically store, for each thread group, the current point x of the inner loop. Therefore, a kernel function first copies the current point x from the DM into the SM, and then determines the distance of x to the query point q. The threads of the same warp are running perfectly
simultaneously, i.e. if these threads are copying the same point from DM to SM, this needs to be done only once (but all threads of the warp have to wait until this relatively costly copy operation is performed). However, a thread group may (and should) consist of multiple warps. To ensure that the copy operation is only performed once per thread group, it is necessary to synchronize the threads of the thread group before and after the copy operation using the API function synchronize(). This API function blocks all threads in the same TG until all other threads (of other warps) have reached the same point of execution. The pseudocode for this algorithm is presented in Figure 6.

algorithm GPUsimpleNLJ(data set D)               // host program executed on CPU
    deviceMem float D'[][] := D[][];             // allocate memory in DM for the data set D
    #threads := n;                               // number of points in D
    #threadsPerGroup := 64;
    startThreads(simpleNLJKernel, #threads, #threadsPerGroup);   // one thread per point
    waitForThreadsToFinish();
end.

kernel simpleNLJKernel(int threadID)
    register float q[] := D'[threadID][];        // copy the point from DM into the register
                                                 // and use it as query point q
                                                 // index is determined by the threadID
    for i := 0 ... n − 1 do                      // this used to be the inner loop in Figure 5
        synchronizeThreadGroup();
        shared float x[] := D'[i][];             // copy the current point x from DM to SM
        synchronizeThreadGroup();                // now all threads of the thread group can work with x
        if dist(x, q) ≤ ε then
            report (x, q) as a result pair using synchronized writing
            or do some further processing on (x, q) directly in kernel
end.
Fig. 6. Parallel Algorithm for the Nested Loop Join on the GPU
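The effect of sharing the current point x within a thread group can be illustrated by a sequential emulation of Figure 6. The sketch below is an assumption-laden model, not GPU code: `gpu_style_nlj` is a hypothetical name, thread groups are processed one after another, and `dm_reads` simply counts how many points would be fetched from device memory if x is copied to SM once per thread group and loop iteration.

```python
import math


def gpu_style_nlj(D, eps, threads_per_group=64):
    """Emulate Figure 6: one thread per query point, x shared per thread group."""
    n = len(D)
    pairs, dm_reads = [], 0
    for start in range(0, n, threads_per_group):
        group = range(start, min(start + threads_per_group, n))
        registers = {t: D[t] for t in group}   # each thread loads its query point q
        dm_reads += len(registers)
        for i in range(n):
            x = D[i]                           # one DM-to-SM copy for the whole group
            dm_reads += 1
            for t in group:                    # on the GPU these threads run in parallel
                if math.dist(x, registers[t]) <= eps:
                    pairs.append((x, registers[t]))
    return pairs, dm_reads


points = [((i * 37) % 101, (i * 53) % 103) for i in range(130)]
pairs, dm_reads = gpu_style_nlj(points, 10)
```

For n = 130 points and 64 threads per group there are three groups, so the emulation counts 130 + 3 · 130 = 520 point fetches, compared with 130 + 130 · 130 if every thread read x from DM itself.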
If the data set does not fit into DM, a simple partitioning strategy can be applied. It must be ensured that the potential join partners of an object are within the same partition as the object itself. Therefore, partitions with an overlap of size 2 · ε can be created.

5.2 An Indexed Parallel Similarity Join Algorithm on GPU
The performance of the NLJ can be greatly improved if an index structure is available as proposed in Section 4. On sequential processing architectures, the indexed NLJ leaves the outer loop unchanged. The inner loop is replaced by an index-based search retrieving candidates that may be join partners of the current object of the outer loop. The effort of finding these candidates and refining them is often orders of magnitude smaller compared to the non-indexed NLJ. When parallelizing the indexed NLJ for the GPU, we follow the same paradigm as in the last section, to create an individual thread for each point of the outer loop. It is beneficial to the performance, if points having a small distance to each other are collected in the same warp and thread group, because for those points, similar paths in the index structure are relevant.
After index construction, we not only have a directory in which the points are organized in a way that facilitates search; the points are now also clustered in the array, i.e. points which have neighboring addresses are also likely to be close together in the data space (at least when projecting on the first few dimensions). Both effects are exploited by our join algorithm displayed in Figure 7.
algorithm GPUindexedJoin(data set D)
    deviceMem index idx := makeIndexAndSortData(D);      // changes ordering of data points
    int #threads := |D|, #threadsPerGroup := 64;
    for i = 1 ... (#threads/#threadsPerGroup) do
        deviceMem float blockbounds[i][] := calcBlockBounds(D, i);
    deviceMem float D'[][] := D[][];
    startThreads(indexedJoinKernel, #threads, #threadsPerGroup);   // one thread per data point
    waitForThreadsToFinish();
end.

algorithm indexedJoinKernel(int threadID, int blockID)
    register float q[] := D'[threadID][];                // copy the point from DM into the register
    shared float myblockbounds[] := blockbounds[blockID][];
    for xi := 0 ... indexsize.x do
        if IndexPageIntersectsBoundsDim1(idx, myblockbounds, xi) then
            for yi := 0 ... indexsize.y do
                if IndexPageIntersectsBoundsDim2(idx, myblockbounds, xi, yi) then
                    for zi := 0 ... indexsize.z do
                        if IndexPageIntersectsBoundsDim3(idx, myblockbounds, xi, yi, zi) then
                            for w := 0 ... IndexPageSize do
                                synchronizeThreadGroup();
                                shared float p[] := GetPointFromIndexPage(idx, D', xi, yi, zi, w);
                                synchronizeThreadGroup();
                                if dist(p, q) ≤ ε then
                                    report (p, q) as a result pair using synchronized writing
end.
Fig. 7. Algorithm for Similarity Join on GPU with Index Support
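The structure of Figure 7 can be emulated sequentially to see why pruning against the bounds of a whole thread group is safe. The following sketch rests on several assumptions: a 2-level directory instead of three levels, thread groups processed one after another, and hypothetical helper names (`make_index_and_sort`, `indexed_join`, `nested_loop_join`).

```python
import math


def split_quantiles(items, fanout):
    """Cut a sorted list into at most `fanout` parts of nearly equal size."""
    n = len(items)
    bounds = [i * n // fanout for i in range(fanout + 1)]
    return [items[bounds[i]:bounds[i + 1]]
            for i in range(fanout) if bounds[i] < bounds[i + 1]]


def make_index_and_sort(points, fanout):
    """Return the points in index order plus a 2-level directory of page ranges."""
    ordered, directory = [], []
    for part in split_quantiles(sorted(points, key=lambda p: p[0]), fanout):
        pages = []
        for page in split_quantiles(sorted(part, key=lambda p: p[1]), fanout):
            pages.append((page[0][1], page[-1][1], len(ordered), len(ordered) + len(page)))
            ordered.extend(page)
        directory.append((part[0][0], part[-1][0], pages))
    return ordered, directory


def indexed_join(points, eps, fanout=4, threads_per_group=8):
    D, directory = make_index_and_sort(points, fanout)
    pairs = []
    for start in range(0, len(D), threads_per_group):
        group = D[start:start + threads_per_group]      # query points of one thread group
        lo = [min(q[d] for q in group) - eps for d in (0, 1)]   # block bounds enlarged by eps
        hi = [max(q[d] for q in group) + eps for d in (0, 1)]
        for lo0, hi0, pages in directory:
            if lo0 > hi[0] or hi0 < lo[0]:
                continue                                # no query of the group can match here
            for lo1, hi1, first, last in pages:
                if lo1 > hi[1] or hi1 < lo[1]:
                    continue
                for p in D[first:last]:                 # innermost loop over one data page
                    pairs.extend((p, q) for q in group if math.dist(p, q) <= eps)
    return pairs


def nested_loop_join(points, eps):
    """Reference result without index support."""
    return [(x, q) for q in points for x in points if math.dist(x, q) <= eps]
```

A page is visited whenever it may contain a join partner of at least one query point of the group, which mirrors the remark in the text that a node is accessed if any query point of the warp is close enough to the partition. Since consecutive points in the sorted array are close together, the enlarged block bounds tend to stay small.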
Instead of performing an outer loop as in a sequential indexed NLJ, our algorithm now generates a large number of threads: one thread for each iteration of the outer loop (i.e. for each query point q). Since the points in the array are clustered, the corresponding query points are close to each other, and the join partners of all query points in a thread group are likely to reside in the same branches of the index as well. Our kernel method now iterates over three loops, one loop for each index level, and determines for each partition whether the point is inside the partition or at least no more distant from its boundary than $\varepsilon$. The corresponding subnode is accessed if the corresponding partition may contain join partners of the current point of the thread. When considering the warps, which operate in a fully synchronized way, a node is accessed whenever at least one of the query points of the warp is close enough to (or inside) the corresponding partition. For both methods, indexed and non-indexed nested loop join on GPU, we need to address the question of how the resulting pairs are processed. Often, for example to support density-based clustering (cf. Section 6), it is sufficient to return a counter with the number of join partners. If the application needs to
report the pairs themselves, this is easily done using a buffer in DM which can be copied to the CPU after the termination of all kernel threads. The result pairs must be written into this buffer in a synchronized way so that no two threads write simultaneously to the same buffer area. The CUDA API provides atomic operations (such as atomic increment of a buffer pointer) to guarantee this kind of synchronized writing. Buffer overflows are also handled by our similarity join methods. If the buffer is full, all threads terminate and the work is resumed after the buffer is emptied by the CPU.
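One possible way to organize the buffer handling is sketched below as a sequential model. The text does not specify at which granularity work is resumed, so the sketch assumes, purely for illustration, that a launch stops before the first query point whose result pairs no longer fit and that the next launch restarts at that query point. The name `join_with_bounded_buffer` is hypothetical, and the plain integer `write_ptr` stands in for the atomically incremented buffer pointer.

```python
import math


def join_with_bounded_buffer(D, eps, capacity):
    """Return (all result pairs, number of kernel launches)."""
    results, launches, next_query = [], 0, 0
    while next_query < len(D):
        buffer, write_ptr = [None] * capacity, 0    # result buffer in DM, empty at launch
        launches += 1
        while next_query < len(D):
            q = D[next_query]
            matches = [(x, q) for x in D if math.dist(x, q) <= eps]
            if len(matches) > capacity:
                raise ValueError("buffer too small for a single query point")
            if write_ptr + len(matches) > capacity:
                break                               # buffer full: stop this launch
            for pair in matches:
                slot, write_ptr = write_ptr, write_ptr + 1   # models the atomic increment
                buffer[slot] = pair
            next_query += 1
        results.extend(buffer[:write_ptr])          # host copies the buffer and empties it
    return results, launches


D = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
pairs, launches = join_with_bounded_buffer(D, 1.0, capacity=2)
```

With three points, five result pairs and a buffer of two entries, this model needs three launches; a larger buffer reduces the number of round trips between GPU and CPU.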
6 Similarity Join to Support Density-Based Clustering
As mentioned in Section 5, the similarity join is an important building block to support a wide range of data mining tasks, including classification [24], outlier detection [5], association rule mining [17] and clustering [8], [12]. In this section, we illustrate how to effectively support the density-based clustering algorithm DBSCAN [8] with the similarity join on GPU.

6.1 Basic Definitions and Sequential DBSCAN
The idea of density-based clustering is that clusters are areas of high point density, separated by areas of significantly lower point density. The point density can be formalized using two parameters, called $\varepsilon \in \mathbb{R}^+$ and $MinPts \in \mathbb{N}^+$. The central notion is the core object. A data object $x$ is called a core object of a cluster if at least $MinPts$ objects (including $x$ itself) are in its $\varepsilon$-neighborhood $N_\varepsilon(x)$, which corresponds to a sphere of radius $\varepsilon$. Formally:

Definition 4 (Core Object). Let $D$ be a set of $n$ objects from $\mathbb{R}^d$, $\varepsilon \in \mathbb{R}^+$ and $MinPts \in \mathbb{N}^+$. An object $x \in D$ is a core object if and only if $|N_\varepsilon(x)| \ge MinPts$, where $N_\varepsilon(x) = \{x' \in D : \|x' - x\| \le \varepsilon\}$.

Note that this definition of $N_\varepsilon(x)$ is equivalent to Definition 1. Two objects may be assigned to a common cluster. In density-based clustering this is formalized by the notions of direct density reachability and density connectedness.

Definition 5 (Direct Density Reachability). Let $x, x' \in D$. $x'$ is called directly density reachable from $x$ (in symbols: $x \lhd x'$) if and only if
1. $x$ is a core object in $D$, and
2. $x' \in N_\varepsilon(x)$.

If $x$ and $x'$ are both core objects, then $x \lhd x'$ is equivalent to $x' \lhd x$. The density connectedness is the transitive and symmetric closure of the direct density reachability:
Definition 6 (Density Connectedness). Two objects $x$ and $x'$ are called density connected (in symbols: $x \bowtie x'$) if and only if there is a sequence of core objects $(x_1, \ldots, x_m)$ of arbitrary length $m$ such that $x_1 \lhd x$, $x_i \lhd x_{i+1}$ for $1 \le i < m$, and $x_m \lhd x'$.

In density-based clustering, a cluster is defined as a maximal set of density connected objects:

Definition 7 (Density-based Cluster). A subset $C \subseteq D$ is called a cluster if and only if the following two conditions hold:
1. Density connectedness: $\forall x, x' \in C : x \bowtie x'$.
2. Maximality: $\forall x \in C, \forall x' \in D \setminus C : \neg (x \bowtie x')$.

The algorithm DBSCAN [8] implements the cluster notion of Definition 7 using a data structure called seed list $S$ containing a set of seed objects for cluster expansion. More precisely, the algorithm proceeds as follows:
1. Mark all objects as unprocessed.
2. Consider an arbitrary unprocessed object $x \in D$.
3. If $x$ is a core object, assign a new cluster ID $C$, and do step (4) for all elements $x' \in N_\varepsilon(x)$ which do not yet have a cluster ID:
4. (a) mark the element $x'$ with the cluster ID $C$ and (b) insert the object $x'$ into the seed list $S$.
5. While $S$ is not empty, repeat step 6 for all elements $s \in S$:
6. If $s$ is a core object, do step (7) for all elements $x' \in N_\varepsilon(s)$ which do not yet have any cluster ID:
7. (a) mark the element $x'$ with the cluster ID $C$ and (b) insert the object $x'$ into the seed list $S$.
8. If there are still unprocessed objects in the database, continue with step (2).

To illustrate the algorithmic paradigm, Figure 8 displays a snapshot of DBSCAN during cluster expansion. The light grey cluster on the left side has been processed already. The algorithm currently expands the dark grey cluster on the right side. The seed list $S$ currently contains one object, the object $x$. $x$ is a core object since there are more than $MinPts = 3$ objects in its $\varepsilon$-neighborhood ($|N_\varepsilon(x)| = 6$, including $x$ itself). Two of these objects, $x'$ and $x''$, have not been processed so far and are therefore inserted into $S$.
This way, the cluster is iteratively expanded until the seed list is empty. After that, the algorithm continues with an arbitrary unprocessed object until all objects have been processed. Since every object of the database is considered only once in Step 2 or 6 (exclusively), we have a complexity which is $n$ times the complexity of $N_\varepsilon(x)$ (which is linear in $n$ if there is no index structure, and sublinear or even $O(\log n)$ in the presence of a multidimensional index structure). The result of DBSCAN is determinate.
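Steps 1 to 8 can be sketched sequentially as follows. This is an illustrative sketch rather than the original DBSCAN code: the neighborhood is computed by a linear scan, objects are addressed by their list index, and objects that never receive a cluster ID keep the label `None` (a convention assumed here for noise).

```python
import math


def dbscan(D, eps, min_pts):
    """Return one cluster ID per object, or None for objects left unassigned."""

    def neighborhood(i):
        # eps-neighborhood by linear scan, including the object itself
        return [j for j in range(len(D)) if math.dist(D[i], D[j]) <= eps]

    cluster_id = [None] * len(D)
    processed = [False] * len(D)              # step 1
    next_id = 0
    for i in range(len(D)):                   # step 2
        if processed[i]:
            continue
        processed[i] = True
        neighbors = neighborhood(i)
        if len(neighbors) < min_pts:
            continue                          # not a core object
        c, next_id = next_id, next_id + 1     # step 3: new cluster ID
        seeds = []
        for j in neighbors:                   # step 4
            if cluster_id[j] is None:
                cluster_id[j] = c
                seeds.append(j)
        while seeds:                          # step 5
            s = seeds.pop()
            if processed[s]:
                continue                      # each object is expanded at most once
            processed[s] = True
            s_neighbors = neighborhood(s)
            if len(s_neighbors) >= min_pts:   # step 6
                for j in s_neighbors:         # step 7
                    if cluster_id[j] is None:
                        cluster_id[j] = c
                        seeds.append(j)
    return cluster_id                         # step 8 is the enclosing for loop


D = [(0, 0), (0, 1), (1, 0), (1, 1),
     (10, 10), (10, 11), (11, 10), (11, 11),
     (50, 50)]
labels = dbscan(D, eps=1.5, min_pts=3)
```

Each object triggers exactly one neighborhood computation, which corresponds to the complexity argument above: n neighborhood queries, each linear in n without index support.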
Fig. 8. Sequential Density-based Clustering
algorithm GPUdbscanNLJ(data set D)               // host program executed on CPU
    deviceMem float D'[][] := D[][];             // allocate memory in DM for the data set D
    deviceMem int counter[n];                    // allocate memory in DM for counter
    #threads := n;                               // number of points in D
    #threadsPerGroup := 64;
    startThreads(GPUdbscanKernel, #threads, #threadsPerGroup);   // one thread per point
    waitForThreadsToFinish();
    copy counter from DM to main memory;
end.

kernel GPUdbscanKernel(int threadID)
    register float q[] := D'[threadID][];        // copy the point from DM into the register
                                                 // and use it as query point q
                                                 // index is determined by the threadID
    for i := 0 ... threadID do                   // option 1 OR
    for i := 0 ... n − 1 do                      // option 2
        synchronizeThreadGroup();
        shared float x[] := D'[i][];             // copy the current point x from DM to SM
        synchronizeThreadGroup();                // now all threads of the thread group can work with x
        if dist(x, q) ≤ ε then
            atomicInc(counter[i]); atomicInc(counter[threadID]);   // option 1 OR
            inc counter[threadID];                                 // option 2
end.
Fig. 9. Parallel Algorithm for the Nested Loop Join to Support DBSCAN on GPU
6.2 GPU-Supported DBSCAN
To effectively support DBSCAN on GPU, we first identify the two major stages of the algorithm requiring most of the processing time: 1. Determination of the core object property. 2. Cluster expansion by computing the transitive closure of the direct density reachability relation. The first stage can be effectively supported by the similarity join. To check the core object property, we need to count the number of objects which are within the $\varepsilon$-neighborhood of each point. Basically, this can be implemented by a self-join. However, the algorithm for the self-join described in Section 5 needs to be modified to be suitable for this task. The classical self-join only counts the total number of pairs of data objects with a distance less than or equal to $\varepsilon$. For the core object property, we need a self-join with a counter associated with
each object. Each time when the algorithm detects a new pair fulfilling the join condition, the counter of both objects needs to be incremented. We propose two different variants to implement the self-join to support DBSCAN on GPU which are displayed in pseudocode in Figure 9. Modifications over the basic algorithm for nested loop join (cf. Figure 6) are displayed in darker color. As in the simple algorithm for nested loop join, for each point q of the outer loop a separate thread with a unique threadID is created. Both variants of the self-join for DBSCAN operate on a array counter which stores the number of neighbors for each object. We have two options how to increment the counters of the objects when a pair of objects (x, q) fulfills the join condition. Option 1 is first to add the counter of x and then the counter of q using the atomic operation atomicInc() (cf. Section 3). The operation atomicInc() involves synchronization of all threads. The atomic operations are required to assure the correctness of the result, since it is possible that different threads try to increment the counters of objects simultaneously. In clustering, we typically have many core objects which causes a large number of synchronized operations which limit parallelism. Therefore, we also implemented option 2 which guarantees correctness without synchronized operations. Whenever a pair of objects (x, q) fulfills the join condition, we only increment the counter of point q. Point q is that point of the outer loop for which the thread has been generated, which means q is exclusively associated with the threadID. Therefore, the cell counter[threadID] can be safely incremented with the ordinary, non-synchronized operation inc(). Since no other point is associated with the same threadID as q no collision can occur. However, note that in contrast to option 1, for each point of the outer loop, the inner loop needs to consider all other points. Otherwise results are missed. 
Recall that for the conventional sequential nested loop join (cf. Figure 5) it is sufficient to consider in the inner loop only those points which have not been processed so far. Already processed points can be excluded because if they are join partners of the current point, this has already been detected. The same holds for option 1. Because of parallelism, we cannot state which objects have already been processed. However, it is still sufficient when each object searches in the inner loop for join partners among those objects which would appear later in the sequential processing order. This is because all other objects are addressed by different threads. Option 2 requires checking all objects since only one counter is incremented. With sequential processing, option 2 would thus duplicate the workload. However, as our results in Section 8 demonstrate, option 2 can pay off under certain conditions since parallelism is not limited by synchronization.
After determination of the core object property, clusters can be expanded starting from the core objects. This second stage of DBSCAN can also be effectively supported on the GPU. For cluster expansion, it is required to compute the transitive closure of the direct density reachability relation. Recall that this is closely connected to the core object property as all objects within the range of a core object x are directly density reachable from x. To compute the transitive closure, standard algorithms are available. The most well-known among them is
C. Böhm et al.
the algorithm of Floyd-Warshall. A highly parallel variant of the Floyd-Warshall algorithm on GPU has been recently proposed [15], but this is beyond the scope of this paper.
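The two counting options can be illustrated with a small sequential simulation. This is only a sketch in Python, not the pseudocode of Figure 9: the function names, the use of the Euclidean distance via `math.dist`, and the `<= eps` join condition are assumptions made for illustration. Each iteration of the outer loop stands for the work of one GPU thread.

```python
import math

def neighbor_counts_option1(points, eps):
    """Option 1: every pair is examined once and both counters are updated.

    On the GPU the two increments would have to be atomic (atomicInc),
    because several threads may update the same counter cell.
    """
    n = len(points)
    counter = [0] * n
    for q in range(n):                # one GPU thread per outer point q
        for x in range(q + 1, n):     # only points later in sequential order
            if math.dist(points[q], points[x]) <= eps:
                counter[x] += 1       # would be atomic on the GPU
                counter[q] += 1       # would be atomic on the GPU
    return counter

def neighbor_counts_option2(points, eps):
    """Option 2: each thread only touches its own cell counter[q].

    No synchronization is needed, but the inner loop must visit all
    other points, which doubles the number of distance computations.
    """
    n = len(points)
    counter = [0] * n
    for q in range(n):                # one GPU thread per outer point q
        for x in range(n):            # all other points
            if x != q and math.dist(points[q], points[x]) <= eps:
                counter[q] += 1       # plain, non-synchronized increment
    return counter

# Hypothetical toy data: three mutually close points and one far away.
sample = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (0.5, 0.5)]
counts_sync = neighbor_counts_option1(sample, eps=1.0)
counts_nosync = neighbor_counts_option2(sample, eps=1.0)
```

Both functions return the same neighbor counts; they differ only in how many pairs are inspected and in whether counter cells are shared between threads, which is the trade-off discussed above.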
7 K-Means Clustering on GPU
7.1 The Algorithm K-Means
A well-established partitioning clustering method is the K-means clustering algorithm [21]. K-means requires a metric distance function in vector space. In addition, the user has to specify the number of desired clusters k as an input parameter. Usually K-means starts with an arbitrary partitioning of the objects into k clusters. After this initialization, the algorithm iteratively performs the following two steps until convergence: (1) Update centers: For each cluster, compute the mean vector of its assigned objects. (2) Re-assign objects: Assign each object to its closest center. The algorithm converges as soon as no object changes its cluster assignment between two subsequent iterations. Figure 10 illustrates an example run of K-means for k = 3 clusters. Figure 10(a) shows the situation after random initialization. In the next step, every data point is associated with the closest cluster center (cf. Figure 10(b)). The resulting partitions represent the Voronoi cells generated by the centers. In the following step of the algorithm, the center of each of the k clusters is updated, as shown in Figure 10(c). Finally, assignment and update steps are repeated until convergence. In most cases, fast convergence can be observed. The objective function of K-means is well defined: the algorithm minimizes the sum of squared distances of the objects to their cluster centers. However, K-means is only guaranteed to converge towards a local minimum of the objective function. The quality of the result strongly depends on the initialization. Finding the clustering with k clusters that minimizes the objective function is in fact an NP-hard problem; for details see e.g. [23]. In practice, it is therefore recommended to run the algorithm several times with different random initializations and keep the best result. For large data sets, however, often only a very limited number of trials is feasible. Parallelizing K-means on GPU allows for a more comprehensive exploration of
Fig. 10. Sequential Partitioning Clustering by the K-means Algorithm: (a) Initialization, (b) Assignment, (c) Recalculation, (d) Termination
the search space of all potential clusterings and thus provides the potential to obtain a good and reliable clustering even for very large data sets.
7.2 CUDA-K-Means
In K-means, most computing power is spent in step (2) of the algorithm, i.e. re-assignment, which involves distance computation and comparison. The number of distance computations and comparisons in K-means is O(k · i · n), where i denotes the number of iterations and n is the number of data points.
The CUDA-K-meansKernel. In K-means clustering, the cluster assignment of each data point is determined by comparing the distances between that point and each cluster center. This work is performed in parallel by the CUDA-K-meansKernel. The idea is that, instead of (sequentially) performing the cluster assignment of one single data point, we start many different cluster assignments at the same time for different data points. In detail, one single thread per data point is generated, all executing the CUDA-K-meansKernel. Every thread which is generated from the CUDA-K-meansKernel (cf. Figure 11) starts with the ID of a data point x which is going to be processed. Its main tasks are to determine the distance to the nearest center and the ID of the corresponding cluster.
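The per-point work described above can be sketched as a plain Python function that one thread would execute for its data point, together with the O(k · i · n) work estimate. The names `assign_point` and `distance_computations`, the Euclidean distance, and the 0-based cluster IDs are assumptions for illustration; the example figures (256 clusters, 50 iterations, 2m points) are taken from the settings in Section 8.3.

```python
import math

def assign_point(x, centroids):
    """Work of a single thread: find the nearest centroid of point x.

    Returns (min_dist, cluster), where cluster is a 0-based index
    (the pseudocode of Figure 11 counts clusters from 1).
    """
    min_dist = math.inf
    cluster = None
    for i, c in enumerate(centroids):   # process each cluster center
        dist = math.dist(x, c)          # assumed: Euclidean distance
        if dist < min_dist:
            min_dist = dist
            cluster = i
    return min_dist, cluster

def distance_computations(k, iterations, n):
    """Number of distance computations of K-means: k * i * n."""
    return k * iterations * n

nearest = assign_point((0.0, 0.0), [(3.0, 4.0), (1.0, 0.0)])
work = distance_computations(k=256, iterations=50, n=2_000_000)
```

For 2m points, 256 clusters, and 50 iterations, the estimate gives 2.56 · 10^10 distance computations, which indicates why the re-assignment step is the natural target for parallelization.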
algorithm CUDA-K-means(data set D, int k)             // host program executed on CPU
  deviceMem float D'[][] := D[][];                    // allocate memory in DM for the data set D
  #threads := |D|;                                    // number of points in D
  #threadsPerGroup := 64;
  deviceMem float Centroids[][] := initCentroids();   // allocate memory in DM for the initial centroids
  double actCost := ∞;                                // initial costs of the clustering
  repeat
    prevCost := actCost;
    startThreads(CUDA-K-meansKernel, #threads, #threadsPerGroup);   // one thread per point
    waitForThreadsToFinish();
    float minDist := minDistances[threadID];          // copy the distance to the nearest centroid from DM into MM
    float cluster := clusters[threadID];              // copy the assigned cluster from DM into MM
    actCost := calculateCosts();                      // update costs of the clustering
    deviceMem float Centroids[][] := calculateCentroids();   // copy updated centroids to DM
  until |actCost − prevCost| < threshold              // convergence
end.

kernel CUDA-K-meansKernel(int threadID)
  register float x[] := D'[threadID][];               // copy the point from DM into the register
  float minDist := ∞;                                 // distance of x to the nearest centroid
  int cluster := null;                                // ID of the nearest centroid (cluster)
  for i := 1 ... k do                                 // process each cluster
    register float c[] := Centroids[i][];             // copy the current centroid from DM into the register
    double dist := distance(x, c);
    if dist < minDist then
      minDist := dist;
      cluster := i;
  report(minDist, cluster);                           // report assigned cluster and distance using synchronized writing
end.
Fig. 11. Parallel Algorithm for K-means on the GPU
A thread starts by reading the coordinates of the data point x into the register. The distance of x to its closest center is initialized with ∞ and the assigned cluster is therefore set to null. Then a loop iterates over all centers c1, c2, ..., ck and considers them as potential clusters for x. This is done by all threads in the thread group, allowing a maximum degree of intra-group parallelism. Finally, the cluster whose center has the minimum distance to the data point x is reported together with the corresponding distance value using synchronized writing.
The Main Program for CPU. Apart from initialization and data transfer from main memory (MM) to DM, the main program consists of a loop starting the CUDA-K-meansKernel on the GPU until the clustering converges. After the parallel operations are completed by all threads of the group, the following steps are executed in each cycle of the loop:
1. Copy the distance of the processed point x to the nearest center from DM into MM.
2. Copy the cluster that x is assigned to from DM into MM.
3. Update centers.
4. Copy updated centers to DM.
Pseudocode for these procedures is given in Figure 11.
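The host loop can be sketched sequentially in Python as follows. This is a minimal illustration under stated assumptions, not the CUDA implementation: the cost is taken to be the sum of squared distances (the objective named in Section 7.1), an empty cluster keeps its previous center, and the parameters `threshold` and `max_iter` as well as the function name `kmeans` are hypothetical. The inner loop over points corresponds to the work that the kernel distributes over one thread per point.

```python
import math

def kmeans(points, centroids, threshold=1e-9, max_iter=100):
    """Sequential sketch of the host loop: assign, compute cost, update centers."""
    centroids = [tuple(c) for c in centroids]
    prev_cost = math.inf
    assignments, cost = [], math.inf
    for _ in range(max_iter):
        # "Kernel" part: on the GPU, one thread per point does this work.
        assignments, cost = [], 0.0
        for x in points:
            dists = [math.dist(x, c) for c in centroids]
            cluster = min(range(len(centroids)), key=dists.__getitem__)
            assignments.append(cluster)
            cost += dists[cluster] ** 2     # assumed cost: sum of squared distances
        # Host part: update each center to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [x for x, a in zip(points, assignments) if a == j]
            if members:                      # assumption: empty clusters keep their center
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
        # Convergence test on the change of the clustering cost.
        if abs(prev_cost - cost) < threshold:
            break
        prev_cost = cost
    return centroids, assignments, cost

# Hypothetical toy data with two obvious groups.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, labels, final_cost = kmeans(data, [(0.0, 0.0), (10.0, 0.0)])
```

On the GPU, only the assignment loop runs in the kernel; the cost and the center update are computed by the main program between kernel launches, as listed in the four steps above.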
8 Experimental Evaluation
To evaluate the performance of data mining on the GPU, we performed various experiments on synthetic data sets. The implementation for all variants is written in C and all experiments are performed on a workstation with an Intel Core 2 Duo CPU E4500 2.2 GHz and 2 GB RAM, which is equipped with a Gainward NVIDIA GeForce GTX280 GPU (240 SIMD-processors) with 1 GB GDDR3 SDRAM.
8.1 Evaluation of Similarity Join on the GPU
The performance of the similarity join on the GPU is validated by comparing four different variants for executing the similarity join:
1. Nested loop join (NLJ) on the CPU
2. NLJ on the CPU with index support (as described in Section 4)
3. NLJ on the GPU
4. NLJ on the GPU with index support (as described in Section 4)
For each version we determine the speedup factor as the ratio of CPU runtime to GPU runtime. For this purpose we generated three 8-dimensional synthetic data sets of various sizes (up to 10 million (m) points) with different data distributions, as summarized in Table 1. Data set DS1 contains uniformly distributed data. DS2 consists of five Gaussian clusters which are randomly distributed in feature space (see Figure 12(a)). Similar to DS2, DS3 is also composed of five Gaussian clusters, but the clusters are correlated. An illustration of data set
Table 1. Data Sets for the Evaluation of the Similarity Join on the GPU
Name   Size               Distribution
DS1    3m - 10m points    uniform distribution
DS2    250k - 1m points   normal distribution, gaussian clusters
DS3    250k - 1m points   normal distribution, gaussian clusters
Fig. 12. Illustration of the data sets DS2 and DS3: (a) Random Clusters, (b) Linear Clusters
DS3 is given in Figure 12(b). The threshold ε was selected to obtain a join result where each point was combined with one or two join partners on average.
Evaluation of the Size of the Data Sets. Figure 13 displays the runtime in seconds and the corresponding speedup factors of NLJ on the CPU with/without index support and NLJ on the GPU with/without index support in logarithmic scale for all three data sets DS1, DS2 and DS3. The time needed for data transfer from CPU to the GPU and back as well as the (negligible) index construction time has been included. The tests on data set DS1 were performed with a join selectivity of ε = 0.125, and those on DS2 and DS3 with ε = 0.588. NLJ on the GPU with index support performs best in all experiments, independent of the data distribution or size of the data set. Note that, due to massive parallelization, NLJ on the GPU without index support outperforms the CPU without index by a large factor (e.g. 120 on 1m points of normally distributed data with Gaussian clusters). The GPU algorithm with index support outperforms the corresponding CPU algorithm (with index) by a factor of 25 on data set DS2. Note that, for example, the overall improvement of the indexed GPU algorithm on data set DS2 over the non-indexed CPU version is more than 6,000. These results demonstrate the potential of boosting the performance of database operations by designing specialized index structures and algorithms for the GPU.
Evaluation of the Join Selectivity. In these experiments we test the impact of the parameter ε on the performance of NLJ on GPU with index support and use the indexed implementation of NLJ on the CPU as benchmark. All experiments are performed on data set DS2 with a fixed size of 500k data points. The parameter ε is evaluated in a range from 0.125 to 0.333. Figure 14(a) shows that the runtime of NLJ on GPU with index support increases for larger ε values. However, the GPU version outperforms the CPU implementation by a large factor (cf. Figure 14(b)) that is proportional to the value of ε. In this evaluation the speedup ranges from 20 for a join selectivity of 0.125 to almost 60 for ε = 0.333.
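The two quantities used throughout this evaluation, the average number of join partners per point for a given ε and the speedup factor, can be computed as in the following sketch. The function names, the Euclidean distance, and the `<= eps` join condition are assumptions for illustration; the brute-force pair enumeration is only practical for small samples, not for the data set sizes in Table 1.

```python
import math

def average_join_partners(points, eps):
    """Average number of join partners per point for threshold eps.

    Every qualifying pair contributes one partner to each of its two points.
    """
    n = len(points)
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= eps:
                pairs += 1
    return 2 * pairs / n

def speedup(cpu_seconds, gpu_seconds):
    """Speedup factor as the ratio of CPU runtime to GPU runtime."""
    return cpu_seconds / gpu_seconds

# Hypothetical toy sample: three mutually close points and one outlier.
toy = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (0.5, 0.5)]
partners = average_join_partners(toy, eps=1.0)
```

A threshold could be tuned on a sample by increasing ε until the average lies between one and two, which matches the selection criterion stated above; how the authors actually chose ε is not specified in the text.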
Fig. 13. Evaluation of the NLJ on CPU and GPU with and without Index Support w.r.t. the Size of Different Data Sets: (a) Runtime on Data Set DS1, (b) Speedup on Data Set DS1, (c) Runtime on Data Set DS2, (d) Speedup on Data Set DS2, (e) Runtime on Data Set DS3, (f) Speedup on Data Set DS3. Axes: Size (m or k points) against Time (sec, logarithmic scale) or Speedup Factor.
Evaluation of the Dimensionality. These experiments provide an evaluation with respect to the dimensionality of the data. As in the experiments for the evaluation of the join selectivity, we again use the indexed implementations both on CPU and GPU and perform all tests on data set DS2 with a fixed number of 500k data objects. The dimensionality is evaluated in a range from 8 to 32. We also performed these experiments with two different settings for the join selectivity, namely ε = 0.588 and ε = 1.429. Figure 15 illustrates that NLJ on GPU outperforms the benchmark method on CPU by factors of about 20 for ε = 0.588 to approximately 70 for ε = 1.429. This order of magnitude is relatively independent of the data dimensionality. As the dimensionality is already known at compile time in our implementation, optimization techniques of the compiler have an impact on the performance of
Fig. 14. Impact of the Join Selectivity on the NLJ on GPU with Index Support: (a) Runtime on Data Set DS2, (b) Speedup on Data Set DS2. Axes: epsilon against Time (sec, logarithmic scale) or Speedup Factor.
the CPU version, as can be seen especially in Figure 15(c). However, the dimensionality also affects the implementation on GPU, because higher dimensional data come along with a higher demand for shared memory. This overhead affects the number of threads that can be executed in parallel on the GPU.
8.2 Evaluation of GPU-Supported DBSCAN
As described in Section 6.2, we suggest two different variants to implement the self-join to support DBSCAN on GPU, whose characteristics are briefly reviewed in the following:
Fig. 15. Impact of the Dimensionality on the NLJ on GPU with Index Support: (a) Runtime on Data Set DS2 (ε = 0.588), (b) Speedup on Data Set DS2 (ε = 0.588), (c) Runtime on Data Set DS2 (ε = 1.429), (d) Speedup on Data Set DS2 (ε = 1.429). Axes: Dimensionality against Time (sec, logarithmic scale) or Speedup Factor.
Fig. 16. Evaluation of two versions for the self-join on GPU w.r.t. the join selectivity. Axes: epsilon against Time (sec, logarithmic scale); curves: Synchronization, no Synchronization.
1. The counters for a pair of objects (x, q) that fulfills the join condition are incremented using an atomic operation that involves synchronization of all threads.
2. The counters are incremented without synchronization, but at the cost of a duplicated workload.
We evaluate both options on a synthetic data set with 500k points generated as specified for DS1 in Table 1. Figure 16 displays the runtime of both options. For ε ≤ 0.6, the runtimes are of the same order of magnitude, with the synchronized variant 1 being slightly more efficient. Beyond this point, the non-synchronized variant 2 clearly outperforms variant 1 since parallelism is not limited by synchronization.
8.3 Evaluation of CUDA-K-Means
To analyze the efficiency of K-means clustering on the GPU, we present experiments with respect to different data set sizes, numbers of clusters and dimensionality of the data. As benchmark we apply a single-threaded implementation of K-means on the CPU to determine the speedup of the implementation of K-means on the GPU. As the number of iterations may vary in each run of the experiments, all results are normalized to 50 iterations for both the GPU and the CPU implementation of K-means. All experiments are performed on synthetic data sets as described in detail in each of the following settings.
Evaluation of the Size of the Data Set. For these experiments we created 8-dimensional synthetic data sets of different sizes, ranging from 32k to 2m data points. The data sets consist of different numbers of random clusters, generated as specified for DS1 in Table 1. Figure 17 displays the runtime in seconds in logarithmic scale of CUDA-K-means and the benchmark method on the CPU for different numbers of clusters. The time needed for data transfer from CPU to GPU and back has been included. The corresponding speedup factors are given in Figure 17(d). Once again, these experiments support the evidence that data mining approaches on GPU outperform classic
Fig. 17. Evaluation of CUDA-K-means w.r.t. the Size of the Data Set: (a) Runtime for 32 clusters, (b) Runtime for 64 clusters, (c) Runtime for 256 clusters, (d) Speedup for 32, 64 and 256 clusters. Axes: Size (k points) against Time (sec, logarithmic scale) or Speedup Factor.
CPU versions by significant factors. Whereas a speedup of approximately 10 to 100 can be achieved for a relatively small number of clusters, we obtain a speedup of about 1000 for 256 clusters, which even increases with the number of data objects.
Evaluation of the Impact of the Number of Clusters. We performed several experiments to validate CUDA-K-means with respect to the number of clusters K. Figure 18 shows the runtime in seconds of CUDA-K-means compared with the implementation of K-means on the CPU on 8-dimensional synthetic data sets that contain different numbers of clusters, ranging from 32 to 256, again together with the corresponding speedup factors in Figure 18(d). The experimental evaluation of K on a data set that consists of 32k points results in a maximum performance benefit of more than 800 compared to the benchmark implementation. For 2m points the speedup ranges from nearly 100 up to even more than 1,000 for a data set that comprises 256 clusters. In this case the calculation on the GPU takes approximately 5 seconds, compared to almost 3 hours on the CPU. Therefore, we conclude that due to massive parallelization, CUDA-K-means outperforms the CPU by large factors, which even grow with K and the number of data objects n.
Evaluation of the Dimensionality. These experiments provide an evaluation with respect to the dimensionality of the data. We perform all tests on synthetic
Fig. 18. Evaluation of CUDA-K-means w.r.t. the number of clusters K: (a) Runtime for 32k points, (b) Runtime for 500k points, (c) Runtime for 2m points, (d) Speedup for 32k, 500k and 2m points. Axes: k against Time (sec, logarithmic scale) or Speedup Factor.
data consisting of 16k data objects. The dimensionality of the test data sets varies in a range from 4 to 256. Figure 19(b) illustrates that CUDA-K-means outperforms the benchmark method K-means on the CPU by factors of 230 for 128-dimensional data to almost 500 for 8-dimensional data. On both the GPU and the CPU, the dimensionality affects possible compiler optimization techniques, such as loop unrolling, as already shown in the experiments for the evaluation of the similarity join on the GPU. In summary, the results of this section demonstrate the high potential of boosting the performance of complex data mining techniques by designing specialized index structures and algorithms for the GPU.
Fig. 19. Impact of the Dimensionality of the Data Set on CUDA-K-means: (a) Runtime, (b) Speedup. Axes: Dimensionality against Time (sec, logarithmic scale) or Speedup Factor.
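The normalization to 50 iterations used in Section 8.3 can be sketched as below. The text does not state the exact formula, so the linear scaling per iteration and the names `normalize_runtime` and `speedup` are assumptions; the sample figures (about 5 seconds on the GPU, almost 3 hours on the CPU) are the ones reported for 2m points and 256 clusters.

```python
def normalize_runtime(measured_seconds, iterations_run, reference_iterations=50):
    """Scale a measured runtime to a fixed number of iterations.

    Assumption: every iteration costs roughly the same, so the runtime
    scales linearly with the number of iterations.
    """
    return measured_seconds * reference_iterations / iterations_run

def speedup(cpu_seconds, gpu_seconds):
    """Speedup factor as the ratio of CPU runtime to GPU runtime."""
    return cpu_seconds / gpu_seconds

# A run that needed 30 iterations and 12 s corresponds to 20 s at 50 iterations.
normalized = normalize_runtime(12.0, iterations_run=30)

# Upper bound for the reported case: 3 hours on the CPU against 5 s on the GPU.
bound = speedup(3 * 3600, 5)
```

The bound of 2160 is consistent with the reported speedup of more than 1,000, given that the CPU runtime is stated only as almost 3 hours.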
9 Conclusions
In this paper, we demonstrated how Graphics Processing Units (GPUs) can effectively support highly complex data mining tasks. In particular, we focussed on clustering. With the aim of finding a natural grouping of an unknown data set, clustering certainly is among the most widespread data mining tasks with countless applications in various domains. We selected two well-known clustering algorithms, the density-based algorithm DBSCAN and the iterative algorithm K-means, and proposed algorithms illustrating how to effectively support clustering on GPU. Our proposed algorithms are adapted to the special environment of the GPU, which is most importantly characterized by extreme parallelism at low cost. A single GPU consists of a large number of processors. As building blocks for effective support of DBSCAN, we proposed a parallel version of the similarity join and an index structure for efficient similarity search. Going beyond the primary scope of this paper, these building blocks are applicable to support a wide range of data mining tasks, including outlier detection, association rule mining and classification. To illustrate that not only local density-based clustering can be efficiently performed on GPU, we additionally proposed a parallelized version of K-means clustering. Our extensive experimental evaluation emphasizes the potential of the GPU for high-performance data mining. In our ongoing work, we develop further algorithms to support more specialized data mining tasks on GPU, including for example subspace and correlation clustering and medical image processing.
References
1. NVIDIA CUDA Compute Unified Device Architecture - Programming Guide (2007)
2. Bernstein, D.J., Chen, T.-R., Cheng, C.-M., Lange, T., Yang, B.-Y.: Ecm on graphics cards. In: Joux, A. (ed.) EUROCRYPT 2009. LNCS, vol. 5479, pp. 483–501. Springer, Heidelberg (2009)
3. Böhm, C., Braunmüller, B., Breunig, M.M., Kriegel, H.-P.: High performance clustering based on the similarity join. In: CIKM, pp. 298–305 (2000)
4. Böhm, C., Noll, R., Plant, C., Zherdin, A.: Index-supported similarity join on graphics processors. In: BTW, pp. 57–66 (2009)
5. Breunig, M.M., Kriegel, H.-P., Ng, R.T., Sander, J.: Lof: Identifying density-based local outliers. In: SIGMOD Conference, pp. 93–104 (2000)
6. Cao, F., Tung, A.K.H., Zhou, A.: Scalable clustering using graphics processors. In: Yu, J.X., Kitsuregawa, M., Leong, H.-V. (eds.) WAIM 2006. LNCS, vol. 4016, pp. 372–384. Springer, Heidelberg (2006)
7. Catanzaro, B.C., Sundaram, N., Keutzer, K.: Fast support vector machine training and classification on graphics processors. In: ICML, pp. 104–111 (2008)
8. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)
9. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: Knowledge discovery and data mining: Towards a unifying framework. In: KDD, pp. 82–88 (1996)
10. Govindaraju, N.K., Gray, J., Kumar, R., Manocha, D.: Gputerasort: high performance graphics co-processor sorting for large database management. In: SIGMOD Conference, pp. 325–336 (2006)
11. Govindaraju, N.K., Lloyd, B., Wang, W., Lin, M.C., Manocha, D.: Fast computation of database operations using graphics processors. In: SIGMOD Conference, pp. 215–226 (2004)
12. Guha, S., Rastogi, R., Shim, K.: Cure: An efficient clustering algorithm for large databases. In: SIGMOD Conference, pp. 73–84 (1998)
13. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMOD Conference, pp. 47–57 (1984)
14. He, B., Yang, K., Fang, R., Lu, M., Govindaraju, N.K., Luo, Q., Sander, P.V.: Relational joins on graphics processors. In: SIGMOD, pp. 511–524 (2008)
15. Katz, G.J., Kider, J.T.: All-pairs shortest-paths for large graphs on the gpu. In: Graphics Hardware, pp. 47–55 (2008)
16. Kitsuregawa, M., Harada, L., Takagi, M.: Join strategies on kd-tree indexed relations. In: ICDE, pp. 85–93 (1989)
17. Koperski, K., Han, J.: Discovery of spatial association rules in geographic information databases. In: Egenhofer, M.J., Herring, J.R. (eds.) SSD 1995. LNCS, vol. 951, pp. 47–66. Springer, Heidelberg (1995)
18. Leutenegger, S.T., Edgington, J.M., Lopez, M.A.: Str: A simple and efficient algorithm for r-tree packing. In: ICDE, pp. 497–506 (1997)
19. Lieberman, M.D., Sankaranarayanan, J., Samet, H.: A fast similarity join algorithm using graphics processing units. In: ICDE, pp. 1111–1120 (2008)
20. Liu, W., Schmidt, B., Voss, G., Müller-Wittig, W.: Molecular dynamics simulations on commodity gpus with cuda. In: Aluru, S., Parashar, M., Badrinath, R., Prasanna, V.K. (eds.) HiPC 2007. LNCS, vol. 4873, pp. 185–196. Springer, Heidelberg (2007)
21. Macqueen, J.B.: Some methods of classification and analysis of multivariate observations. In: Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967)
22. Manavski, S., Valle, G.: Cuda compatible gpu cards as efficient hardware accelerators for smith-waterman sequence alignment. BMC Bioinformatics 9 (2008)
23. Meila, M.: The uniqueness of a good optimum for k-means. In: ICML, pp. 625–632 (2006)
24. Plant, C., Böhm, C., Tilg, B., Baumgartner, C.: Enhancing instance-based classification with local density: a new algorithm for classifying unbalanced biomedical data. Bioinformatics 22(8), 981–988 (2006)
25. Shalom, S.A.A., Dash, M., Tue, M.: Efficient k-means clustering using accelerated graphics processors. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 166–175. Springer, Heidelberg (2008)
26. Szalay, A., Gray, J.: 2020 computing: Science in an exponential world. Nature 440, 413–414 (2006)
27. Tasora, A., Negrut, D., Anitescu, M.: Large-scale parallel multi-body dynamics with frictional contact on the graphical processing unit. Proc. of Inst. Mech. Eng. Journal of Multi-body Dynamics 222(4), 315–326
Context-Aware Data and IT Services Collaboration in E-Business
Khouloud Boukadi1, Chirine Ghedira2, Zakaria Maamar3, Djamal Benslimane2, and Lucien Vincent1
1 Ecole des Mines, Saint Etienne, France
2 University Lyon 1, France
3 Zayed University, Dubai, U.A.E.
{boukadi,Vincent}@emse.fr, {cghedira,dbenslim}@liris.cnrs.fr, [email protected]
Abstract. This paper discusses the use of services in the design and development of adaptable business processes, which should let organizations quickly react to changes in regulations and needs. Two types of services are adopted, namely Data and Information Technology. A data service is primarily used to hide the complexity of accessing distributed and heterogeneous data sources, while an information technology service is primarily used to hide the complexity of running requests that cross organizational boundaries. The combination of both services takes place under the control of another service, which is denoted by service domain. A service domain orchestrates and manages data and information technology services in response to the events that arise and changes that occur. This happens because service domains are sensitive to context. Policies and aspect-oriented programming principles support the exercise of packaging data and information technology services into service domains as well as making service domains adapt to business changes.
Keywords: service, service adaptation, context, aspect-oriented programming, policy.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 91–115, 2009. © Springer-Verlag Berlin Heidelberg 2009
1 Introduction
With the latest development of technologies for knowledge management on the one hand, and techniques for project management on the other hand, both coupled with the widespread use of the Internet, today’s enterprises are under pressure to adjust their know-how and enforce their best practices. These enterprises have to be more focused on their core competencies and hence have to seek the support of other peers through partnership to carry out their non-core competencies. The success of this partnership depends on how business processes are designed, as these processes should be loosely coupled and capable of crossing organizational boundaries. Since the inception of the Service-Oriented Architecture (SOA) paradigm along with its multiple implementation technologies such as Jini services and Web services, the focus of the industry community has been on providing tools that would allow seamless and flexible application integration within and across organizational boundaries. Indeed,
SOA offers solutions to interoperability, adaptability, and scalability challenges that today’s enterprises have to tackle. The objective, here, is to let enterprises collaborate by putting their core services together, which leads to the creation of new applications that should be responsive to changes in business requirements and regulations. Nevertheless, looking at enterprise applications from a narrow perspective, which consists of services and processes only, has somehow overlooked the data that these applications use in terms of input and output. Data identification and integration are left to a later stage of the development cycle of these applications, which is not very convenient when these data are spread over different sources including relational databases, silos of data-centric homegrown or packaged applications, and XML files, to cite just a few [1]. As a result, data identification and integration turn out to be tedious for SOA application developers: bits and pieces of data need to be retrieved/updated from/in heterogeneous data sources with different interfaces and access methods. This situation has undermined SOA benefits and forced the SOA community to recognize the importance of adopting a data-oriented view of services. This has resulted in the emergence of the concept of data services.
In this paper, we look into ways of exposing IT applications and data sources as services in compliance with SOA principles. To this end, we treat a service as either an IT Service (ITS) or a Data Service (DS) and identify the necessary mechanisms that would let ITSs and DSs first, work hand-in-hand during the integration exercise of enterprise applications and second, engage in a controlled way in high-level functionalities to be referred to as Service Domains (SDs). Basically, a SD orchestrates ITSs and DSs in order to provide ready-to-use high-level functionalities to users. By ready-to-use we mean that service publication, selection, and combination of fine-grained ITSs and DSs are already complete. We define ITSs as active components that make changes in the environment using update operations that empower them, whereas DSs are passive components that return data only (consultation) and thus do not impact the environment. In this paper we populate this environment with specific elements, which we denote by Business Objects (BOs). The dynamic nature of BOs (e.g., new BOs are made available, some cease to exist without prior notice, etc.) and of the business processes that are built upon these BOs requires that SDs be flexible and sensitive to all changes in an enterprise’s requirements and regulations. We enrich the specification of SDs with contextual details such as the execution status of each participating service (DS or ITS), the type of failure along with the corrective actions, etc. This enrichment is illustrated with the de facto standard for service integration, namely the Business Process Execution Language (BPEL). BPEL specifies a business process behavior through automated process integration both within and between organizations. We complete the enrichment of BPEL with context in compliance with Aspect-Oriented Programming (AOP) principles in terms of aspect injection, activation, and execution and through a set of policies. While context and policies are used separately in different SOA initiatives [2, 3], we examine in this paper their role in designing and developing SDs. This role is depicted using a multi-level architecture that supports the orchestration of DSs and ITSs in response to changes detected in the context. Three levels are identified, namely executive, business, and resource. The role of policies and context in this architecture is as follows:
Context-Aware Data and IT Services Collaboration in E-Business
• The trend towards context-aware, adaptive, and on-demand computing requires that SDs respond to changes in the environment. This could happen by letting SDs sense the environment and take actions.
• Policies manage and control the participation of DSs and ITSs in SDs to guarantee conflict-free SDs. Conflicts could be related to non-sharable resources and semantic mismatches. Several types of policies will be required so that the particularities of DSs and ITSs are taken into account.
The rest of the paper is organized as follows. Section 2 defines some concepts and introduces a running example. Section 3 presents the multi-level architecture for service (ITSs and DSs) collaboration and outlines the specification of these services. Section 4 introduces the aspect-based adaptation of services as well as the role of policies first, in managing the participation of DSs and ITSs in SDs and second, in controlling the aspect injection within SDs. Prior to concluding in Section 6, related work is reported in Section 5.
2 Background
2.1 Definitions
IT Service. The term IT service is used very often nowadays, though not always with the same meaning. Existing definitions range from the very generic and all-inclusive to the very specific and restrictive. Some authors [4, 5] define an IT service as a software application accessible to other applications over the Web. Another definition is provided by [6], which says that an IT service is provided by an IT system or the IT department (respectively an external IT service provider) to support business processes. The characteristics of IT services can vary significantly. They can comprise single software components as well as bundles of software components, infrastructure elements, and additional services. These additional services are usually information services, consulting services, training services, problem solving services, or modification services. They are provided by operational processes (IT service processes) within the IT department or the external service provider [7]. In this paper, we consider that IT services should be responsible for representing and implementing business processes in compliance with SOA principles. An IT service corresponds to a functional representation of a real-life business activity that has a meaningful effect for end-users. Current practices suggest that IT services could be obtained by applying IT SOA methods such as SAP NetWeaver and IBM WebSphere. These methods bundle IT software and infrastructure and offer them as Web services with standardized and well-defined interfaces. We refer to the Web services that result from the application of such methods as enterprise IT Web services or simply IT services.
Data Service. According to a recent report from Forrester Research, “an information service (i.e., data service) provides a simplified, integrated view of real-time, high-quality information about a specific business entity, such as a customer or product.
It can be provided by middleware or packaged as an individual software component. The information that it provides comes from a diverse set of information resources, including
K. Boukadi et al.
operational systems, operational data stores, data warehouses, content repositories, collaboration stores, and even streaming sources in advanced cases” [8]. Another definition suggests that a data service is “a form of Web service, optimized for the real-time data integration demands of SOA. Data services virtualize data to decouple physical and logical locations and therefore avoid unnecessary data replication. Data services abstract complex data structures and syntax. Data services federate disparate data into useful composites. Data services also support data integration across both SOA and non-SOA applications” [9]. Data services can be seen as a new class of services that sits between service-based applications and enterprises’ data sources. By doing so, the complexity of accessing these sources is minimized, which lets application developers focus on the application logic of the solutions to develop. These data sources are the basis of different business objects such as customer, order, and invoice.
Business object is “the representation of a thing active in the business domain, including at least its business name and definition, attributes, behavior, relationships, and constraints” (OMG’s Business Object Management Special Interest Group). Another definition by the Business Object Management Architecture (jeffsutherland.com/oopsla97/marshall.html) suggests that a business object could be defined through the concepts of purpose, process, resource, and organization. A purpose is about the rationale of a business. A process illustrates how this purpose is reached through a series of dependent activities. A resource is a computing platform upon which the activities of a process are executed. Finally, an organization manages resources in terms of maintenance, access rights, etc.
Policies are primarily used to express the actions to take in response to the occurrence of some events.
According to [10], policies are “information which can be used to modify the behavior of a system”. Another definition suggests that policies are “external, dynamically modifiable rules and parameters that are input to a system so that this latter can then adjust to administrative decisions and changes in the execution environment” [5]. In the Web services field, policies are treated as rules and constraints that specify and control the behavior of a Web service upon invocation or participation in composition. For example, a policy determines when a Web service can be invoked, what constraints are put on the inputs a Web service expects, how a Web service can be substituted by another in case of failure, etc. According to [11], policies could operate at two levels. At the higher level, policies monitor the execution progress of a Web service. At the lower level, policies address issues like how Web services communicate and what information is needed to enable comprehensive data exchange.
Context “… is not simply the state of a predefined environment with a fixed set of interaction resources. It is part of process of interacting with an ever-changing environment composed of reconfigurable, migratory, distributed, and multi-scale resources” [12]. In the field of Web services, context facilitates the development and deployment of context-aware Web services. Standard Web services descriptions are then enriched with context details, and new frameworks to support this enrichment are developed [13].
Aspect Oriented Programming. AOP is a paradigm that captures and modularizes concerns that crosscut a software system into modules called Aspects. Aspects can be
integrated dynamically into a system using the dynamic weaving principle [14]. In AOP, a unit of modularity is introduced using aspects that contain different code fragments (known as advice) and location descriptions (known as pointcuts) that identify where to plug in the code fragments. The points that can be selected using pointcuts are called join points. The most popular aspect language is the Java-based AspectJ [15].
2.2 Running Example/Motivating Scenario
Our running example is about a manufacturer of plush toys that gets extremely busy with orders during Christmas time. When an order is received, the first step consists of requesting from suppliers the different components that contribute to the production of the plush toys as per an agreed time frame. When the necessary components are received, the assembly operations begin. Finally, the manufacturer selects a logistic company to deliver these products by the due date. In this scenario, the focus is on the delivery service only. Let us assume an inter-enterprise collaboration is established between the manufacturer (service consumer) and a logistic enterprise (service provider). The latter delivers parcels from the manufacturer’s warehouse to a specific location (Fig. 1 - step (i)). If there are no previous interactions between these two bodies, the logistic enterprise verifies the shipped merchandise. Upon verification approval, the putting merchandise in parcels service is immediately invoked. It uses a data service known as parcel service, which checks the number of parcels to deliver. The putting merchandise in parcels service is followed by the delivery price and computing delivery price data services. The delivery price data service retrieves the delivery price that corresponds to the manufacturer order based on the different enterprise business objects it has access to, such as toy (e.g., size of toy), customer (e.g., discount for regular customers), and parcel (e.g., size of parcel).
Finally, the merchandise is transported to the specified location by the delivery due date. The delivery service is considered as a SD that orchestrates four IT services and two data services: picking merchandise, verifying merchandise, putting merchandise in parcels, delivery price, computing delivery price, and delivering merchandise. Fig. 1 depicts a graph-based orchestration schema (for instance a BPEL process) of the delivery service.
Fig. 1. The delivery service internal process
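To make the orchestration of Fig. 1 more concrete, the following minimal Python sketch lists the six services of the delivery SD in the order given in the text and skips the verification step when the two enterprises have already interacted. The Step class, the run function, and the previous_interactions key are illustrative assumptions; they are not taken from the authors' BPEL process.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def always(context: Dict) -> bool:
    """Default guard: the step is always executed."""
    return True


def first_contact(context: Dict) -> bool:
    # Verification is only required when there are no previous interactions.
    return not context.get("previous_interactions", False)


@dataclass
class Step:
    name: str
    kind: str                                # "ITS" or "DS"
    guard: Callable[[Dict], bool] = always   # hypothetical precondition on the context


# Four ITSs and two DSs, ordered as in the running example.
DELIVERY_SD: List[Step] = [
    Step("picking merchandise", "ITS"),
    Step("verifying merchandise", "ITS", first_contact),
    Step("putting merchandise in parcels", "ITS"),
    Step("delivery price", "DS"),
    Step("computing delivery price", "DS"),
    Step("delivering merchandise", "ITS"),
]


def run(schema: List[Step], context: Dict) -> List[str]:
    """Return the names of the steps that would be invoked, in order."""
    return [step.name for step in schema if step.guard(context)]
```

Calling run(DELIVERY_SD, {"previous_interactions": True}) would return the schema without the verification step, whereas an empty context keeps all six steps.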
The inter-enterprise collaboration between the manufacturer and the logistic enterprise raises the importance of establishing dynamic (not pre-established) contacts, since different logistic enterprises exist. To this end, the orchestration schema of the SD should be aware of the contexts of both the manufacturer and the nature of the collaboration. Additional details can easily affect the progress of any type of collaboration. For example, if the manufacturer is located outside the country, some security controls should be added and price calculation should be reviewed. Thus, the SD orchestration schema should be enhanced with contextual information that triggers changes in the process components (ITSs and DSs) in a timely manner. In this case, environmental context such as weather conditions (e.g., snow storm, heavy rain) may affect the IT service "putting merchandise in parcels". Consequently, several actions should be anticipated to avoid the deterioration of the merchandise, such as using metal boxes instead of regular cardboard boxes. Besides, the participation of the different ITSs and DSs in the delivery SD should be managed and controlled in order to guarantee that the obtained SD is conflict-free. ITSs and DSs belong to different IT departments, each with its own characteristics, rules, and constraints. As a result, effective mechanisms that would ensure and regulate the interaction of ITSs and DSs are required. These mechanisms should also capture changes in context and guarantee the adaptation of service domains’ behaviors in order to accommodate the situation in which they are going to operate.
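The weather example can be related to the AOP vocabulary of Section 2.1 with a small sketch. The code below is a plain Python emulation, not AspectJ weaving and not the authors' BPEL extension: each activity name plays the role of a join point, the pointcut selects "putting merchandise in parcels", and the advice switches the packaging. The Aspect class, the execute function, and the dictionary keys are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Aspect:
    pointcut: Callable[[str], bool]     # selects join points (here: activity names)
    condition: Callable[[Dict], bool]   # context change that activates the aspect
    advice: Callable[[Dict], None]      # code executed at the selected join points


def use_metal_boxes(order: Dict) -> None:
    order["packaging"] = "metal"


bad_weather = Aspect(
    pointcut=lambda activity: activity == "putting merchandise in parcels",
    condition=lambda ctx: ctx.get("weather") in {"snow storm", "heavy rain"},
    advice=use_metal_boxes,
)


def execute(activities: List[str], order: Dict, context: Dict,
            aspects: List[Aspect]) -> Dict:
    """Run the activities in sequence, applying matching advice before each one."""
    for activity in activities:                  # every activity is a join point
        for aspect in aspects:
            if aspect.pointcut(activity) and aspect.condition(context):
                aspect.advice(order)
        order.setdefault("trace", []).append(activity)
    return order
```

Keeping the context condition separate from the pointcut mirrors the separation in the text between where an aspect applies and when a context change activates it.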
3 Multi-level Architecture for Service Collaboration
In this section, the multi-level architecture that supports the orchestration of DSs and ITSs during the exercise of developing SDs is presented in terms of concepts, duties per layer, and service specifications.
3.1 Service Domain Concept
The rationale of the service domain concept is to abstract at a higher level the integration of a large number of ITSs and DSs. A service domain builds upon and enhances the Web service concept; in fact, it uses existing standards such as WSDL, SOAP, and UDDI. It does not define new application programming interfaces or standards, but hides the complexity of this exercise by facilitating service deployment and self-management. Fig. 2 illustrates the idea of developing inter-enterprise business processes using several service domains. A SD involves ITSs and DSs in two types of orchestration schemas: vertical and horizontal. In the former, a SD controls a set of ITSs, which themselves control a set of DSs. In the latter, a SD controls both ITSs and DSs at the same time. These two schemas offer the possibility of applying different types of control over the services, whether data or IT. Businesses have to address different issues, so they should be given the opportunity to do so in different ways. The participation of DSs and ITSs in either type of orchestration is controlled through a set of policies, which guarantee that SDs are conflict-free. More details on policy use are given in subsections 4.2 and 4.3.2.
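The difference between the vertical and horizontal schemas can be illustrated with a small control tree. This is a minimal sketch under the assumption that a schema is horizontal as soon as the SD directly controls at least one DS; the Service class and the schema_type function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    name: str
    kind: str                                   # "SD", "ITS", or "DS"
    controls: List["Service"] = field(default_factory=list)


def schema_type(sd: Service) -> str:
    """'horizontal' if the SD directly controls at least one DS, else 'vertical'."""
    return "horizontal" if any(s.kind == "DS" for s in sd.controls) else "vertical"


# Vertical: SD -> ITS -> DS
vertical_sd = Service("delivery", "SD", [
    Service("putting merchandise in parcels", "ITS", [Service("parcel service", "DS")]),
])

# Horizontal: SD -> ITS and DS at the same time
horizontal_sd = Service("delivery", "SD", [
    Service("putting merchandise in parcels", "ITS"),
    Service("parcel service", "DS"),
])
```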
Fig. 2. Inter-enterprise collaboration based on service domains
In the running example, the delivery SD orchestrates four IT services and two data services: picking merchandise, verifying merchandise, putting merchandise in parcels, delivery price, computing delivery price, and delivering merchandise. Keeping these ITSs and DSs in one place facilitates manageability and avoids both extra composition work on the client side and the exposure of non-significant services such as "Verifying merchandise" on the enterprise side.
Fig. 3. Multi-layer architecture
The multi-level architecture in Fig. 3 operates in a top-down way. It starts from the executive layer and goes through the resource and business layers. These layers are described in detail in the following.
3.2 Roles and Duties per Layer
Executive layer. In this layer, a SD consists of an Entry Module (EM), a Context Manager Module (CMM), a Service Orchestration Module (SOM), and an Aspect Activator Module (AAM). In Fig. 3, CMM, SOM, and AAM provide external interfaces to the SD. The EM is SOAP-based and receives users’ requests and returns responses. In addition to these requests, the EM supports the administration of a SD. For example, an administrator can send a register command to add a new ITS to a given SD after signing up this ITS in the corresponding ITS registry. The register command can also be used to add a new orchestration schema to the orchestration schemas registry. When the EM receives a user’s request, it screens the orchestration schemas registry to select a suitable orchestration schema for this request and identify the best ITSs and DSs. The selection of this schema and of the services takes into account the customer context (detailed in Section 4.1). Afterwards, the selected orchestration schema is delivered to the SOM, which is basically an orchestration engine based on BPEL [16]. The SOM presents an external interface called Execution Control Interface (ECI) that lets users obtain information about the status of a SD that is under execution. This interface is very useful in case of external collaboration as it ensures the monitoring of the SD progress. This is a major difference between a SD and a regular Web service. In fact, with the ECI a SD is based on the glass-box principle, which is the opposite of the black-box principle. In the glass box, the SD way-of-doing is visible to the environment and mechanisms are provided to monitor the execution progress of the SD.
In contrast, a Web service is seen as a black-box piece of functionality: it is described by its message interface and has no internal process structure that is visible to its environment. Finally, the last external interface, known as the Context Detection Interface (CDI), is used by the CMM to detect and catch changes in context so that SD adaptability is guaranteed. This happens by selecting and injecting the right aspect with respect to the current context change. To this end, a SD uses the AAM to identify a suitable aspect for the current situation and to inject this aspect into the BPEL process.
Resource Layer. It consists of two sub-layers, namely sources and service instances.
• The sources sub-layer is populated with different registries, namely orchestration schemas, ITS, DS, context, and aspect. The orchestration schemas registry consists of a set of abstract processes based on ITSs and DSs, which are at a later stage implemented as executable processes. In addition, this sub-layer includes a set of data sources that DSs use for functioning. The processing of requests to these data sources is subject to access privileges that could be set by different bodies in the enterprise such as security administrators. The content of the data sources evolves over time following the addition, withdrawal, or modification of data sources. In the running example, customer and inventory databases are examples of data sources.
• The service instances sub-layer is populated with a set of service instances that originate from ITSs and DSs. On the one hand, DS instances collect data from BOs and not from data sources. This shields DSs from semantic issues and changes in data sources and prepares data from disparate BOs that are scattered across the enterprise. The data that a DS collects could have different recipients including SDs, ITSs, or other DSs. Basically, a DS crawls the business-objects level looking for the BOs it needs. A DS consults the states that the BOs are currently in to collect the data that are reported in these states. This collection depends on the access rights (either public or private) that are put on data; some data might not be available to DSs. In the running example, a DS could be developed to track the status of the parcels included in the delivery process (i.e., number of used parcels). This DS would have to access the order BO and have the parcel BO updated. If the order BO is now in the orderChecked state, the DS will know which parcels are confirmed for inclusion in the delivery and hence interact with the right ITS to update the parcel BO. This is not the case if this BO is still in the orderUpdated state; some items might not have been confirmed yet. On the other hand, ITS instances implement business processes that characterize enterprises’ day-to-day activities. ITSs need the help of DSs and BOs for the data they produce and host, respectively. Update here means making a BO take on a new state, which is afterwards reflected on some data sources by updating their respective data. This is not the case with DSs, which only consult BOs.
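The way a DS consults BO states can be sketched as follows. The sketch assumes that a BO exposes its current state, its data, and a per-datum access right, and that data without an explicit right are treated as private; the BusinessObject class, the consult method, and the confirmed_parcels function are hypothetical. Only the state names orderChecked and orderUpdated come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BusinessObject:
    name: str
    state: str
    data: Dict[str, object] = field(default_factory=dict)
    access: Dict[str, str] = field(default_factory=dict)   # datum -> "public" | "private"

    def consult(self, key: str):
        """Read-only access as performed by a DS; non-public data are withheld."""
        if self.access.get(key, "private") != "public":     # assumption: private by default
            raise PermissionError(f"{key} of {self.name} is not available to DSs")
        return self.data[key]


def confirmed_parcels(order: BusinessObject) -> List[str]:
    """DS logic: parcels count as confirmed only when the order is in orderChecked."""
    if order.state != "orderChecked":
        return []        # e.g. orderUpdated: some items might not be confirmed yet
    return list(order.consult("parcels"))
```

The DS never changes the BO; updating the parcel BO would remain the job of an ITS, in line with the consultation-only role of DSs.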
Business Layer. It (1) tracks the BOs that are developed and deployed according to the profile of the enterprise and (2) identifies the capabilities of each BO. In [17] we suggest that BOs should exhibit a goal-driven behavior instead of just responding to stimuli. The objective is to let BOs (i) screen the data sources that have the data they need, (ii) resolve data conflicts in case they arise, and (iii) inform other BOs about their capabilities of data source access and data mediation. As mentioned earlier, order, parcel, and customer are examples of BOs. These BOs are related to each other, e.g., an order is updated only upon inspection of customer record and order status. The data sources that these BOs access could be customer and inventory databases.
3.3 Layer Dependencies
Executive, resource, and business layers are connected to each other through a set of dependencies. We distinguish two types of dependencies: intra-layer dependencies and inter-layers dependencies.
1. Intra-layer dependencies: Within the resource layer, two types of intra-layer dependencies are identified:
• Type and instance dependency: we differentiate between the ITS types that are published in the IT services registry, which are groups of similar (in terms of functionality) IT services, and the actual IT service instances that are made available for invocation. To illustrate the complexity of the dependencies that arise, we give hereafter a simple illustration of the number of IT service instances that could be obtained out of ITS types. Let us consider $ITS^t = \{ITS^t_1, \ldots, ITS^t_\alpha\}$ a set of $\alpha$ IT service types and $ITS^i = \{ITS^i_1, \ldots, ITS^i_\beta\}$ a set of $\beta$ service instances that exist in the service instances sub-layer. The mapping of $ITS^t$ onto $ITS^i$ is surjective and one-to-many. Assuming each IT service type $ITS^t_j$ has $N$ instantiations, $\beta = N \times \alpha$. The same dependency exists between DS types in the sources sub-layer and data service instances in the service instances sub-layer.
• Composition dependency involves orchestration schemas in two sub-layers, namely sources and service instances. Composition illustrates today’s business processes that are generally developed using distributed and heterogeneous modules, for instance services. For an enterprise business process, it is critical to identify the services that are required, identify the data that these services require, make sure that these services collaborate, and develop strategies in case conflicts occur.
2. Inter-layers dependencies:
• Access dependency involves the service instances sub-layer and the business layer. Current practices expose services directly to data sources, which could hinder the reuse opportunities of these services. The opposite is adopted here by making services interact with BOs for their data needs. For a service, it is critical to identify the BOs it needs, comply with the access privileges of these BOs, and identify the next services that it will interact with after completing an orchestration process. Access dependency could be of different types, namely on-demand, periodic, or event-driven. In the on-demand case, services submit requests to BOs when needed. In the periodic case, services submit requests to BOs according to a certain agreed-upon plan. Finally, in the event-driven case, services submit requests to BOs in response to some event.
• Invocation dependency involves the business layer and the sources sub-layer. An invocation implements the technical mechanisms that allow a BO to access the available data sources in terms of consultation or update (see Fig. 4). A given BO includes a set of operations:
o A set of read methods, which provide various ways to retrieve and return one or more instances of the data included in a data source.
o A set of write methods, responsible for updating (inserting, modifying, deleting) one or more instances of the data included in the data sources.
o A set of navigation methods, responsible for traversing relationships from one data source to one or more data of a second data source. For example, the Customer BO can have two navigation methods, getDelivOrder and getElecOrder, to fetch for a given customer the delivery orders from a delivery database and the electronic orders from an electronic order database.
Fig. 4. Invocation dependency between the business objects and the data sources
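The three groups of BO operations can be sketched with a small class. The data sources are replaced here by in-memory dictionaries, which is an assumption made to keep the example self-contained; the class name CustomerBO and the read and write methods are hypothetical, and only the navigation method names getDelivOrder and getElecOrder come from the text.

```python
from typing import Dict, List


class CustomerBO:
    """Hypothetical Customer BO over in-memory stand-ins for three data sources."""

    def __init__(self, customers: Dict[str, dict],
                 delivery_db: Dict[str, List[str]],
                 electronic_db: Dict[str, List[str]]) -> None:
        self._customers = customers
        self._delivery_db = delivery_db
        self._electronic_db = electronic_db

    # Read method: returns a copy so that callers cannot modify the source.
    def read(self, customer_id: str) -> dict:
        return dict(self._customers[customer_id])

    # Write method: modifies the record held in the customer data source.
    def write(self, customer_id: str, **changes) -> None:
        self._customers[customer_id].update(changes)

    # Navigation methods, named as in the text.
    def getDelivOrder(self, customer_id: str) -> List[str]:
        return list(self._delivery_db.get(customer_id, []))

    def getElecOrder(self, customer_id: str) -> List[str]:
        return list(self._electronic_db.get(customer_id, []))
```

In this sketch a DS would be limited to read and the navigation methods, while write would be reserved for ITSs.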
3.4 Services Specifications
3.4.1 Data Service Specification
DSs come along with a good number of benefits that would smooth the development of enterprise SDs:
Access unification: data related to a BO might be scattered across independent data sources that could present three kinds of heterogeneities:
o Model heterogeneity: each data source has its own data model or data format (relational tables, WSDL with XML schema, XML documents, flat files, etc.).
o Interface heterogeneity: each type of data source has its own programming interface: JDBC/SQL for relational databases, REST/SOAP for Web services, file I/O calls, and custom APIs for packaged or homegrown data-centric applications (like BAPI for SAP).
The adoption of DSs relieves SOA application developers from having to directly cope with the first two forms of heterogeneity. That is, in the field of Web services all data sources are described using WSDL and invoked via REST or SOAP calls (which means having the same interface), and all data are in XML form and described using XML Schema (which means having the same data model).
Reuse and agility: the added value of SOA to application development is reuse and agility, but without flexibility at the data tier, this added value could quickly erode. Instead of relying on non-reusable proprietary code to access and manipulate data in monolithic application silos, DSs can be used and reused in multiple business processes. This simplifies the development and maintenance of service-oriented applications and introduces easy-to-use capabilities to use information in dynamic and real-time processes.
To define DSs, we took into account the interactions that should take place between the DSs and BOs. DSs are given access to BOs for consultation purposes. Each DS represents a specific data-driven request whose satisfaction requires the participation of several BOs. The following gives an example of an itemStatusOrder DS whose role is to confirm the status of the items to include in a customer's order.
Table 1. DS service structure
In the above structure, the following arguments are used:
1. Input argument identifies the elements that need to be submitted to a DS. These elements could be obtained from different parties such as users and other DSs.
2. Output argument identifies the elements that a DS returns after its processing is complete.
3. Method argument identifies the actions that a DS implements in response to the access requests it runs over the different BOs.
Because DSs could request sensitive data from BOs, we suggest in Section 4.2.1 that appropriate policies should be developed so that data misuse cases are avoided. We refer to these policies as privacy policies.
3.4.2 IT Service Specification
For the IT service specification, we follow the one proposed by Papazoglou and Heuvel in [18], who specify an IT service as (1) a structural specification that defines service types, messages, and port types, (2) a behavioral specification that defines service operations, effects, and side effects of service operations, and (3) a policy specification that defines the policy assertions and constraints on the service.
Table 2. IT service structure
Based on this specification, we propose in our work a set of policies as follows:
− Business policies correspond to policy specification.
− Behavior policy corresponds to structural specification.
− Privacy policies correspond to behavioral specification.
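The input, output, and method arguments of a DS described in Section 3.4.1 can be illustrated with a short sketch. It is not a transcription of Table 1: the DataService class, the invoke method, the item_status function, and the field names orderId, itemIds, and itemStatus are all assumptions chosen for illustration. Only the DS name itemStatusOrder and the three kinds of arguments come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DataService:
    name: str
    inputs: List[str]                          # Input argument
    outputs: List[str]                         # Output argument
    method: Callable[[Dict, Dict], Dict]       # Method argument: consults the BOs

    def invoke(self, request: Dict, business_objects: Dict) -> Dict:
        missing = [key for key in self.inputs if key not in request]
        if missing:
            raise ValueError(f"missing input(s): {missing}")
        result = self.method(request, business_objects)
        return {key: result[key] for key in self.outputs}   # declared outputs only


def item_status(request: Dict, business_objects: Dict) -> Dict:
    """Consultation only: look up the status of each requested item in the order BO."""
    items = business_objects["order"]["items"]              # item id -> status
    return {"itemStatus": {i: items[i] for i in request["itemIds"]}}


itemStatusOrder = DataService("itemStatusOrder", ["orderId", "itemIds"],
                              ["itemStatus"], item_status)
```

Declaring inputs and outputs separately from the method keeps the DS interface checkable without executing the consultation itself.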
4 Services Collaboration
4.1 Context-Aware Orchestration
The concept of context appears in many disciplines as meta-information that characterizes the specific situation of an entity, describes a group of conceptual entities, partitions a knowledge base into manageable sets, or serves as a logical construct to facilitate reasoning services [19]. The categorization of context is critical for the development of adaptable applications. Context includes implicit and explicit inputs. For example, user context can be deduced in an implicit way by the service provider, such as in a pervasive environment using physical or software sensors. Explicit context is determined precisely by the entities that the context involves. Nevertheless, despite the various attempts to suggest a context categorization, there is no proper categorization. Relevant information differs from one domain to another and depends on its effective use [20]. In this paper, we propose an OWL-based context categorization in Fig. 5. This categorization is dynamic as new sub-categories can be added at any time. Each context definition belongs to a certain category, which can be related to provider, customer, and collaboration.
Fig. 5. Ontology for categories of context
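As a rough illustration of a categorization that can grow at run time, the sketch below stores the three top-level categories in a dictionary. It is a plain Python stand-in and not the authors' OWL ontology; treating the performance attributes and the customer profile as sub-categories is an assumption, as are the function names.

```python
from typing import Dict, List

# Top-level categories with the elements named in the text for each of them.
CONTEXT_CATEGORIES: Dict[str, List[str]] = {
    "provider": ["time", "cost", "QoS", "reputation"],
    "customer": ["profile"],
    "collaboration": ["location", "time", "business domain"],
}


def add_subcategory(category: str, name: str) -> None:
    """Extend the categorization; unknown top-level categories are rejected."""
    if category not in CONTEXT_CATEGORIES:
        raise KeyError(category)
    if name not in CONTEXT_CATEGORIES[category]:
        CONTEXT_CATEGORIES[category].append(name)


def categories_of(name: str) -> List[str]:
    """Return every top-level category that contains the given sub-category."""
    return [c for c, subs in CONTEXT_CATEGORIES.items() if name in subs]
```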
In the following, we explain the different concepts that constitute our ontology-based model for context categorization:
− Provider-related context deals with the conditions under which providers can offer their SDs. Examples are performance attributes, including metrics that measure service quality: time, cost, QoS, and reputation. These attributes are used to model the competition between providers.
− Customer-related context represents the set of available information and metadata used by service providers to adapt their services. For example, a customer profile characterizes a user.
− Collaboration-related context represents the context of the business opportunity. We identify three sub-categories: location, time, and business domain. The location and time represent the geographical location and the period of time within which the business opportunity should be accomplished.
4.2 Policy Specification
As stated earlier, policies are primarily used to first, govern the collaboration between ITSs, DSs, and SDs and second, reinforce specific aspects of this collaboration such as when an ITS accepts to take part in a SD and when a DS rejects a data request from an ITS because of a risk of access right violation. Because of the variety of these aspects, we decompose policies into different types and dissociate policies from the business logic that services implement. Any change in a policy should only “slightly” affect a service’s business logic and vice versa.
4.2.1 Types of Policies
Policies might be imposed by different types of initiators such as the service itself, the service provider, and the user who plans to use the service [21].
− Service driven policy is defined by the individual organizations that offer services. This description is not enough.
− Service flow driven policy is defined by the organizations offering a composite Web service.
− Customer driven policy is meant for future consumers of services.
Generally, a user has various preferences in selecting a particular service, and these preferences have to be taken into account during composition or even during other steps such as selection and execution. For example, if two providers have two services with the same functionality, the user would like to consider the cheapest.
Policies are used in different application domains such as telecommunication and learning, to cite just a few, which supports the rationale of developing different types of policies. In this paper, we suggest the following types based on some of our previous works [11, 22]:
− Business policy defines the constraints that restrict the completion of a business process and determines how this process should be executed according to users’ requirements and organizations’ internal regulations. For example, a car loan
application needs to be treated within 48 hours and a bank account should maintain a minimum balance. − Behavior policy supports the decisions that a service (ITS and DS) has to make when it receives a request from a DS to be part of the orchestration schema that this associated with this DS. In [22], we defined three behaviors namely permission, dispensation, and restriction, which we continue to use in this paper. Additional details on these behaviors are given later. − Privacy policy safeguards against the cases of data misuse by different parties with focus here on DSs and ITSs that interact with BOs. For example, an ITS needs to have the necessary credentials to submit an update request to a BO. Credentials of an ITS could be based on the history of submitting similar request and reputation level. Fig. 5 illustrates how the three behaviors of a service (DS or ITS) are related to each other based on the execution outcome of behavior policies [22]. In this figure, dispensation (P) and dispensation(R) stand for dispensation related to permission and related to restriction, respectively. In addition, engagement (+) and engagement (-) stand for positive and negative engagement in a SD, respectively. • Permission: a service accepts to take part in a service domain upon validation of its current commitments in other service domains. • Restriction: a service does not wish to take part in a service domain for various reasons such as inappropriate rewards or lack of computing resources. • Dispensation means that a service breaks either a permission or a restriction of engagement in a service domain. In the former case, the service refuses to engage despite the positive permission that is granted. This could be due to the Permission no
Fig. 6. Behaviors associated with a service (flowchart: decision nodes Permission, Dispensation(P), Restriction, and Dispensation(R), with yes/no branches ending in Engagement(+) or Engagement(-))
unexpected breakdown of a resource upon which the service performance was scheduled. In the latter case, the service does engage despite the restrictions that are detected. The restrictions are overridden because of the priority level of the business scenario that the service domain implements, which requires immediate handling of this scenario.
In [11], Maamar et al. report that several types of policy specification languages exist. The selection of a policy specification language is guided by some requirements that need to be satisfied [23]: expressiveness to support the wide range of policy requirements arising in the system being managed, simplicity to ease the policy definition tasks for people with various levels of expertise, enforceability to ensure a mapping of policy specification into concrete policies for various platforms, scalability to guarantee adequate performance, and analyzability to allow reasoning about and over policies. In this paper, we adopt WSPL. WSPL syntax is based on the OASIS eXtensible Access Control Markup Language (XACML) standard (www.oasis-open.org/committees/download.php/2406/oasis-xacml-1.0.pdf). Listing 1 suggests a specification of a behavior policy with a focus on privacy in WSPL. It shows an example of an ITS that checks the minimum age and income of a person prior to approving a car loan application.
Listing 1. A behavior policy specification for an ITS
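Independently of the WSPL syntax, the way the outcomes of behavior policies combine in Fig. 6 can be sketched in a few lines of Python. This is only an illustrative reading of the definitions of permission, restriction, and dispensation given earlier, not the content of Listing 1: the function name, the boolean inputs, and the branch taken when neither a permission nor a restriction holds are assumptions.

```python
def engagement(permission: bool, dispensation_p: bool,
               restriction: bool, dispensation_r: bool) -> bool:
    """Return True for Engagement(+) and False for Engagement(-).

    Hypothetical helper: each argument stands for the yes/no outcome
    of the corresponding behavior policy.
    """
    if permission:
        # Dispensation(P): the service refuses despite the granted permission.
        return not dispensation_p
    if restriction:
        # Dispensation(R): the service engages despite the detected restriction.
        return dispensation_r
    # Assumption: with neither permission nor restriction, no engagement.
    return False
```

Under these assumptions, a granted permission without dispensation leads to a positive engagement, while a restriction leads to a positive engagement only when it is overridden by a dispensation.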
Listing 2 suggests a specification of a business policy in WSPL. It shows an example of a DS that checks the possibility of taking part in a service domain. In addition to the arguments that form WSPL-defined policies, we added further arguments for the purpose of tracking the execution of these policies:
− Purpose: describes the rationale for developing a policy P.
− Monitoring authority: identifies the party that checks the applicability of a policy P so that the outcomes of this policy are enforced. A service provider or a policy developer are examples of such parties.
Listing 2. A business policy specification for a DS
− Scope (local or global): identifies the parties that are involved in the execution of a policy P. “Local” means that the policy involves a specific service, whereas “global” means that the policy involves different services.
− Side-effect: describes the policies that could be triggered following the completion of policy P.
− Restriction: limits the applicability of a policy P according to different factors such as time (e.g., business hours) and location (e.g., departments affected by the performance of policy P).
4.3 Service Domain Adaptability Using Aspects
In the following, we describe how we define and implement a context-adaptive service domain using AOP.
4.3.1 Rationale of AOP
Our choice of AOP rests on two arguments. First, AOP enables the separation of crosscutting concerns, which is crucial to separate context information from the business logic. For example, in the Delivery Service Domain, an aspect related to the calculation of extra fees could be defined in case there is a change in the delivery date. Second, AOP promotes the dynamic weaving principle: aspects are activated and deactivated at runtime. Consequently, a BPEL process can be dynamically altered upon request. For the needs of SD adaptation, we suggest the following improvements to existing AOP techniques: runtime activation of aspects in the BPEL process to enable dynamic adaptation according to context changes, and aspect selection to enable customer-specific contextualization of the Service Domain.
4.3.2 Using Policies to Express Contexts
Modeling context is a crucial issue that needs to be addressed to assist context-aware applications. By context modeling we mean the language that will be used to define both service and enterprise collaboration contexts. Since there is a diversity of contextual information, we find several context modeling languages such as ConteXtML [24], contextual schemas [25], CxBR (context-based reasoning) [26], and CxG (contextual graphs) [27].
These languages provide the means for defining context in specific application domains such as pervasive and mobile computing. All these representations have
strengths and weaknesses. As stated in [28], lack of generality is the most frequent drawback: usually, each representation is suited to only a specific type of application and expresses a narrow vision of context. Consequently, they provide little or no support for defining context in Web service-based collaboration scenarios. In this paper, we model the different types of context based on policies. The relation between context and policies is captured in the definitions below.
Definition 1. A service context Ctxt is a pair (Ctxt-name, P) where Ctxt-name corresponds to the context name derived from the context ontology (Fig) and P is the policy related to the given context Ctxt. Let P-set = {P1, P2, …, Pn} denote the set of policies and SCx = {Cx1, Cx2, …, Cxn} the set of context properties related to a particular ITS or DS. We express the mapping between the contexts of an ITS or DS and policies with the mapping function MFs: SCx → P-set, which gives the policies related to a given ITS or DS.
Definition 2. A customer context Custxt is a pair (Custxt-name, P) where Custxt-name corresponds to the context name derived from the context ontology (Fig) and P is the policy related to the given context Custxt. Let P-set = {P1, P2, …, Pn} denote the set of policies and CCx = {Cx1, Cx2, …, Cxn} the set of context properties related to a particular customer. As in Definition 1, we define a mapping function that retrieves the set of policies related to a given customer context: MFc: CCx → P-set.
In these definitions, context is described with policies. Consequently, to express context we first need to express policies. We introduce the specification of context (customer, collaboration, and service contexts) in WSPL. Introducing the context concept in WSPL stems from the need to specify certain constraints that can depend on the environment in which the customer, the service, and the business collaboration operate. For instance, a customer context that depicts a security requirement can be specified as follows.
Listing 3. A customer context specified as a policy
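Definitions 1 and 2 can also be read as a simple data model. The Python sketch below is only an illustration under assumed names: `Policy`, `Context`, `mapping_function`, and the `SecuredTransaction` policy are hypothetical and are not taken from the WSPL listing. It shows a context as a (name, policy) pair and a mapping function that collects the policies related to a set of contexts.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Policy:
    name: str
    rule: str  # free-text stand-in for a WSPL rule


@dataclass(frozen=True)
class Context:
    name: str       # context name, e.g. taken from a context ontology
    policy: Policy  # policy attached to this context


def mapping_function(contexts: Iterable[Context]) -> List[Policy]:
    """Map a set of contexts to the policies related to them (MFs or MFc)."""
    seen: Dict[str, Policy] = {}
    for ctx in contexts:
        seen.setdefault(ctx.policy.name, ctx.policy)  # keep each policy once
    return list(seen.values())


# Hypothetical customer contexts that both express a security requirement.
secured = Policy("SecuredTransaction", "payment must use an encrypted channel")
customer_contexts = [
    Context("security-requirement", secured),
    Context("payment-channel", secured),
]
policies = mapping_function(customer_contexts)
```

Keeping the policy inside the context object mirrors the pair notation of the definitions; a real implementation would presumably reference WSPL documents rather than free-text rules.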
4.3.3 Controlled Aspect Injection through Policies
We show how policies and AOP can work hand in hand. Policies related to customer and collaboration contexts are used to control the aspect injection within an SD. An SD provides a set of adaptation actions that are context dependent. We implement these actions as a set of aspects in order not to introduce any invasive code into the functional service implementation. An aspect includes a pointcut that matches a given ITS or
data service and one or more advices. These advices refer to the context-dependent adaptation actions of this service. Advices are defined as Java methods and pointcuts are specified in XML format. Our implementation approach for the controlled aspect injection through policies is presented in Fig. 7. In this figure, the Aspect Activator Module, previously presented in Fig. 3, includes the Aspect Manager Module (AMM), the Matching Module (MM), and the Weaver Module (WM).
− The AMM is responsible for adding new aspects to a corresponding aspect registry. In addition, the AMM can deal with a new advice implementation, which could be added to this registry. The aspect registry contains the method names of the different advices related to a given ITS or data service.
− The MM is the cornerstone of the proposed aspect injection approach. It receives matching requests from the AMM and returns one matched aspect or a list of matched aspects.
− The WM is based on an AOP mechanism known as weaving. The WM performs runtime weaving, which consists of injecting an advice implementation into the core logic of an ITS or data service.
The control of an aspect injection into a DS or ITS is as follows. Once a context-dependent ITS or data service operation is reached, the Context Manager Module sends the AAM the service’s ID and the ID of its context-dependent operation. Then, the AMM identifies the set of aspects that can be applied to the ITS or DS based on the information sent by the Context Manager Module (i.e., the service’s ID and the operation’s ID) (actions 1 and 2). The set of aspects as well as the customer policies are
Fig. 7. Controlled aspect injection through policies
transmitted to the Matching Module, which returns the aspects that match the customer and collaboration policies. The Matching Module is based on a matching algorithm and uses a domain ontology. Finally, the WM integrates the advice implementation into the core logic of the service. By doing so, the service will execute the appropriate aspect in response to the current context (customer and collaboration contexts). For illustration purposes, consider a payment ITS which is aware of the past interactions with customers. For loyal customers, credit card payment is accepted, but a bank transfer is required for new customers. Hence, the payment operation depends on the customer context, i.e., loyal or new. The context-dependent behaviors of the payment ITS are exposed as a set of aspects. Three of them are depicted in Listing 4.
Listing 4. The three aspects related to the payment ITS (Aspect 1, Aspect 2, and Aspect 3)
For example, the advice of Aspect 1 is expressed as a Java class, which is executed instead of the operation captured by the pointcut (line 9). The join point, where the advice is woven, is the payment operation (line 10). The pointcut is expressed as a condition: if Customer = "Loyal" (i.e., "past interaction = Yes"), the advice uses the credit card number to perform the customer payment. Consider now the customer-related context that specifies a security requirement, as previously described. Based on this requirement, when executing the payment service, the Matching Module will determine that only Aspect 3, with secured transaction, should be applied. This aspect is then transmitted to the Weaver Module in order to be injected into the payment service.
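The selection and injection steps described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Java and XML implementation of the paper: the `Aspect` class, the registry contents, the exact-match `match` function (the paper's Matching Module relies on a matching algorithm and a domain ontology instead), and the behavior attributed to Aspect 2 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Aspect:
    name: str
    service_id: str            # pointcut: the targeted service ...
    operation_id: str          # ... and the operation the advice replaces
    condition: Dict[str, str]  # context values under which the aspect applies
    advice: Callable[[float], str]


def pay_by_credit_card(amount: float) -> str:
    return f"credit card payment of {amount:.2f}"


def pay_by_bank_transfer(amount: float) -> str:
    return f"bank transfer of {amount:.2f}"


def pay_secured(amount: float) -> str:
    return f"secured payment of {amount:.2f}"


def default_pay(amount: float) -> str:
    return f"default payment of {amount:.2f}"


# Hypothetical aspect registry for the payment ITS.
REGISTRY: List[Aspect] = [
    Aspect("Aspect 1", "payment", "pay", {"customer": "loyal"}, pay_by_credit_card),
    Aspect("Aspect 2", "payment", "pay", {"customer": "new"}, pay_by_bank_transfer),
    Aspect("Aspect 3", "payment", "pay", {"security": "required"}, pay_secured),
]


def candidates(service_id: str, operation_id: str) -> List[Aspect]:
    """Aspect manager step: aspects registered for this service operation."""
    return [a for a in REGISTRY
            if a.service_id == service_id and a.operation_id == operation_id]


def match(aspects: List[Aspect], context: Dict[str, str]) -> List[Aspect]:
    """Matching step: keep aspects whose condition holds in the context."""
    return [a for a in aspects
            if all(context.get(k) == v for k, v in a.condition.items())]


def weave(operation: Callable[[float], str],
          matched: List[Aspect]) -> Callable[[float], str]:
    """Weaving step: run the first matched advice instead of the operation."""
    return matched[0].advice if matched else operation


context = {"security": "required"}  # customer context with a security requirement
selected = match(candidates("payment", "pay"), context)
pay = weave(default_pay, selected)
result = pay(120.0)
```

With the security requirement as the only context entry, only the third aspect is selected, which corresponds to the scenario discussed above; a context of `{"customer": "loyal"}` would select the first aspect instead.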
5 Related Work
We identify two types of work related to our proposal: on the one hand, proposals that come from the data engineering field and propose approaches for data service modeling and development; and, on the other hand, proposals that focus specifically on the adaptation of ITSs (Web services).
Data services & the SOA software industry. Data services have gained considerable attention from SOA software industry leaders over the last three years. Many products are currently offered or being developed to make the creation of Data services easier, for example AquaLogic by BEA Systems [29], Astoria by Microsoft [30], MetaMatrix by Red Hat [31], Composite Software [9], Xcalia [32], and IBM [33]. These products integrate the enterprise’s data sources and provide uniform access to data through Data services. As a representative example,
a BEA AquaLogic data service is a collection of functions that all have a common output schema, accept different sets of parameters, and are implemented via individual XQuery expressions. In a simplified example, a Data Service exports a set of functions returning Customer objects, where one function takes as input the customer’s last name, another one her city and state, and so on. AquaLogic exports these Data services to SOA application developers as Data Web Services, where functions become operations. Microsoft’s Astoria project, also known as ADO.NET Data Services, is a REST-based framework that allows releasing data via flexible data services and well-known industry standards (JSON and Atom). As opposed to message-oriented frameworks like SOAP-based services, REST-based services use basic HTTP requests (GET, POST, PUT, and DELETE) to perform CRUD (Create, Read, Update, and Delete) operations. Such query patterns allow navigating through data by following the links established in the data schema. For example, /Customers('PKEY1')/Orders(1)/Employees returns the employees that created sales order 1 for the customer with a key of 'PKEY1' (source: http://msdn.microsoft.com/en-us/library/cc907912.aspx).
In addition, most commercial database products incorporate mechanisms to export database functionalities as Data Web services. Representative examples are the IBM Document Access Definition Extension (DADX) technology (DB2 XML Extender; see footnote 1) and the Native XML Web Services for Microsoft SQL Server 2005 [34]. DADX is part of the IBM DB2 XML Extender, an XML/relational mapping layer, and facilitates the development of Web services on top of relational databases that can, among other things, execute SQL queries and retrieve relational data as XML.
Web services adaptation. Regarding the adaptation of Web services according to context changes, a considerable amount of research has been reported [35, 36].
In the proposed work, we focus specifically on the adaptation of a process. Some research efforts from the workflow community address the need for adaptability. They focus on formal methods to make the workflow process able to adapt to changes in the environment conditions. For example, the authors in [37] propose eFlow with several constructs to achieve adaptability. They use parallel execution of multiple equivalent services and the notion of a generic service that can be replaced by a specific set of services at runtime. However, adaptability remains insufficient and vendor specific. Moreover, many adaptation triggers considered by workflow adaptation, such as infrastructure changes, are not relevant for Web services because services hide all implementation details and only expose interfaces described in terms of the types of exchanged messages and message exchange patterns.
In addition, the authors in [38] extend existing process modeling languages to add context-sensitive regions (i.e., parts of the business process that may have different behaviors depending on context). They also introduce context change patterns as a means to identify the contextual situations (and especially context change situations) that may have an impact on the behavior of a business process. In addition, they propose a set of transformation rules that allow generating a BPEL-based business process from a context-sensitive business process. However, the context change patterns that regulate the context changes are specific to their running example, with no emphasis on proposing more generic patterns.
Footnote 1. Go online to http://www.306.ibm.com/software/data/db2/extenders/xmlext/
A few works use aspect-based adaptability in BPEL. In [39], the authors present an aspect-oriented extension to BPEL, AO4BPEL, which allows dynamically adaptable BPEL orchestration. The authors combine business rules modeled as aspects with a BPEL orchestration engine. When implementing rules, the choice of the pointcut depends only on the activities (invoke, reply, or sequence). Business rules in this work are very simple and do not express a pragmatic adaptability constraint such as the context change considered in our case. Another work is proposed in [40], in which the authors propose a policy-driven adaptation and dynamic specification of aspects to enable instance-specific customization of the service composition. However, they do not mention how aspect advices are represented or how pointcuts are handled.
6 Conclusion
In this paper, we presented a multi-level architecture that supports the design and development of a high-level type of service known as Service Domain, which orchestrates a set of related ITSs and DSs. Service Domain enhances the Web service concept to tackle the challenges that E-Business collaboration poses. In addition, to address enterprise adaptability to context changes, we made Service Domain sensitive to context by enhancing BPEL execution with AOP mechanisms. We have shown that AOP enables crosscutting and context-sensitive logic to be factored out of the service orchestration and modularized into aspects. Last but not least, we illustrated the role of policies and context in a Service Domain. Different types of policies were proposed and then used, first, to manage the participation of DSs and ITSs in SDs and, second, to control aspect injection within the SD. In terms of future work, we plan to complete the SD multi-level architecture and conduct a complete empirical study of our approach.
References
1. Carey, M., et al.: Integrating enterprise information on demand with xQuery. XML Journal 2(6/7) (2003)
2. Yang, S.J.H., et al.: A new approach for context aware SOA. In: Proc. e-Technology, e-Commerce and e-Service, EEE 2005, pp. 438–443 (2005)
3. Gorton, S., et al.: StPowla: SOA, Policies and Workflows, pp. 351–362 (2007)
4. Arsanjani, A.: Service-oriented modeling and architecture (2004), http://www.ibm.com/developerworks/library/ws-soa-design1/
5. Erl, T.: Service-Oriented Architecture (SOA): Concepts, Technology, and Design, p. 792. Prentice Hall, Englewood Cliffs (2005)
6. Huang, Y., et al.: A Service Management Framework for Service-Oriented Enterprises. In: Proceedings of the IEEE International Conference on E-Commerce Technology (2004)
7. Braun, C., Winter, R.: Integration of IT Service Management into Enterprise Architecture. In: Proc. The 22nd Annual ACM Symposium on Applied Computing, SAC 2007 (2007)
8. Gilpin, M., Yuhanna, N.: Information-As-A-Service: What’s Behind This Hot New Trend? (2007), http://www.forrester.com/Research/Document/Excerpt/0,7211,41913,00.html
9. Composite Software: SOA Data Services Solutions, technical report (2008), http://compositesoftware.com/solutions/soa.html
10. Lupu, E., Sloman, M.: Conflicts in Policy-Based Distributed Systems Management. IEEE Transactions on Software Engineering 25(6) (1999)
11. Maamar, Z., et al.: Using policies to manage composite Web services. IT Professional 8(5) (2006)
12. Coutaz, J., et al.: Context is key. Communications of the ACM 48(3) (2005)
13. Keidl, M., Kemper, A.: A Framework for Context-Aware Adaptable Web Services (Demonstration). In: Bertino, E., Christodoulakis, S., Plexousakis, D., Christophides, V., Koubarakis, M., Böhm, K., Ferrari, E. (eds.) EDBT 2004. LNCS, vol. 2992, pp. 826–829. Springer, Heidelberg (2004)
14. AOP: Aspect-Oriented Software Development (2007), http://www.aosd.net
15. AspectJ: The AspectJ Programming Guide (2007), http://dev.eclipse.org/viewcvs/indextech.cgi/~checkout~aspectj-home/doc/progguide/index.html
16. Andrews, T., et al.: Business Process Execution Language for Web Services (2003), http://www.ibm.com/developerworks/library/specification/ws-bpel/
17. Maamar, Z., Sutherland, J.: Toward Intelligent Business Objects. Communications of the ACM 43(10)
18. Papazoglou, M.P., Heuvel, W.-J.v.d.: Service-oriented design and development methodology. International Journal of Web Engineering and Technology (IJWET) 2(4), 412–442 (2006)
19. Benslimane, D., Arara, A., Falquet, G., Maamar, Z., Thiran, P., Gargouri, F.: Contextual Ontologies: Motivations, Challenges, and Solutions. In: Yakhno, T., Neuhold, E.J. (eds.) ADVIS 2006. LNCS, vol. 4243, pp. 168–176. Springer, Heidelberg (2006)
20. Mostefaoui, S.K., Mostefaoui, G.K.: Towards A Contextualisation of Service Discovery and Composition for Pervasive Environments. In: Proc. the Workshop on Web-services and Agent-based Engineering (2003)
21. Dan, A.: Use of WS-Agreement in Job Submission (September 2004)
22. Maamar, Z., et al.: Towards a context-based multi-type policy approach for Web services composition. Data & Knowledge Engineering 62(2) (2007)
23. Damianou, N., Dulay, N., Lupu, E.C., Sloman, M.: The Ponder policy specification language. In: Sloman, M., Lobo, J., Lupu, E.C. (eds.) POLICY 2001. LNCS, vol. 1995, pp. 18–38. Springer, Heidelberg (2001)
24. Ryan, N.: ConteXtML: Exchanging contextual information between a Mobile Client and the FieldNote Server, http://www.cs.kent.ac.uk/projects/mobicomp/fnc/ConteXtML.html
25. Turner, R.M.: Context-mediated behavior for intelligent agents. Human-Computer Studies 48(3), 307–330 (1998)
26. Gonzales, A.J., Ahlers, R.: Context-based representation of intelligent behavior in training simulations. International Transactions of the Society for Computer Simulation, 153–166 (1999)
27. Brezillon, P.: Context-based modeling of operators’ Practices by Contextual Graphs. In: Proc. 14th Mini Euro Conference in Human Centered Processes (2003)
28. Bucur, O., et al.: What Is Context and How Can an Agent Learn to Find and Use it When Making Decisions? In: Proc. International Workshop of Central and Eastern Europe on Multi-Agent Systems, pp. 112–121 (2005)
29. Carey, M.: Data delivery in a service-oriented world: the BEA AquaLogic data services platform. In: Proc. The 2006 ACM SIGMOD International Conference on Management of Data (2006)
30. Microsoft: ADO.NET Data Services (also known as Project Astoria) (2007), http://astoria.mslivelabs.com/
31. Red Hat: MetaMatrix Enterprise Data Services Platform (2007), http://www.redhat.com/jboss/platforms/dataservices/
32. Xcalia Inc.: Xcalia Data Access Services (2009), http://www.xcalia.com/products/xcalia-xdasdata-access-service-SDO-DAS-data-integration-through-web-services.jsp
33. Williams, K., Daniel, B.: SOA Web Services - Data Access Service. Java Developer’s Journal (2006)
34. Microsoft: Native XML Web services for Microsoft SQL Server (2005), http://msdn2.microsoft.com/en-us/library/ms345123.aspx
35. Maamar, Z., et al.: Towards a context-based multi-type policy approach for Web services composition. Data & Knowledge Engineering 62(2), 327–351 (2007)
36. Bettini, C., et al.: Distributed Context Monitoring for the Adaptation of Continuous Services. World Wide Web 10(4), 503–528 (2007)
37. Casati, F., Ilnicki, S., Jin, L., Krishnamoorthy, V., Shan, M.-C.: Adaptive and Dynamic Service Composition in eFlow. In: Wangler, B., Bergman, L.D. (eds.) CAiSE 2000. LNCS, vol. 1789, p. 13. Springer, Heidelberg (2000)
38. Modafferi, S., et al.: A Methodology for Designing and Managing Context-Aware Workflows. In: Mobile Information Systems II, pp. 91–106 (2005)
39. Charfi, A., Mezini, M.: AO4BPEL: An Aspect-oriented Extension to BPEL. World Wide Web 10(3), 309–344 (2007)
40. Erradi, A., et al.: Towards a Policy-Driven Framework For Adaptive Web Services Composition. In: Proceedings of the International Conference on Next Generation Web Services Practices 2005, pp. 33–38 (2005)
Facilitating Controlled Tests of Website Design Changes Using Aspect-Oriented Software Development and Software Product Lines
Javier Cámara¹ and Alfred Kobsa²
¹ Department of Computer Science, University of Málaga, Campus de Teatinos, 29071 Málaga, Spain
[email protected]
² Dept. of Informatics, University of California, Irvine, Bren School of Information and Computer Sciences, Irvine, CA 92697, USA
[email protected]
Abstract. Controlled online experiments in which envisaged changes to a website are first tested live with a small subset of site visitors have proven to predict the effects of these changes quite accurately. However, these experiments often require expensive infrastructure and are costly in terms of development effort. This paper advocates a systematic approach to the design and implementation of such experiments in order to overcome the aforementioned drawbacks by making use of Aspect-Oriented Software Development and Software Product Lines.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 116–135, 2009. © Springer-Verlag Berlin Heidelberg 2009
1 Introduction
During the past few years, e-commerce on the Internet has experienced remarkable growth. For online vendors like Amazon, Expedia and many others, creating a user interface that maximizes sales is therefore crucially important. Different studies [10,11] revealed that small changes to the user interface can cause surprisingly large differences in the number of purchases made, and even minor differences in sales can make a big difference in the long run. Therefore, interface modifications must not be taken lightly but should be carefully planned.
Experience has shown that it is very difficult for interface designers and marketing experts to foresee how users react to small changes in websites. The behavioral difference that users exhibit at Web pages with minimal differences in structure or content quite often deviates considerably from all plausible predictions that designers had initially made [22,30,27]. For this reason, several techniques have been developed by industry that use actual user behavior to measure the benefits of design modifications [17]. These techniques for controlled online experiments on the Web can help to anticipate users’ reactions without putting a company’s revenue at risk. This is achieved by implementing and studying the effects of modifications on a tiny subset of users rather than testing new ideas directly on the complete user base.
Although the theoretical foundations of such experiments have been well established, and interesting practical lessons compiled in the literature [16], the infrastructure required to implement such experiments is expensive in most cases and does not support a systematic approach to experimental variation. Rather, the support for each test is usually crafted for specific situations.
In this work, we advocate a systematic approach to the design and implementation of such experiments based on Software Product Lines [7] and Aspect-Oriented Software Development (AOSD) [12]. Section 2 provides an overview of the different techniques involved in online tests, and Section 3 points out some of their shortcomings. Section 4 describes our systematic approach to the problem, giving a brief introduction to software product lines and AOSD. Section 5 introduces a prototype tool that we developed to test the feasibility of our approach. Section 6 compares our proposal to currently available solutions, and Section 7 presents some conclusions and future work.
2 Controlled Online Tests on the Web: An Overview
The underlying idea behind controlled online tests of a Web interface is to create one or more different versions of it by incorporating new or modified features, and to test each version by presenting it to a randomly selected subset of users in order to analyze their reactions. User response is measured along an overall evaluation criterion (OEC) or fitness function, which indicates the performance of the different versions or variants. A simple yet common OEC in e-commerce is the conversion rate, that is, the percentage of site visits that result in a purchase. OECs may however also be very elaborate, and consider different factors of user behavior. Controlled online experiments can be classified into two major categories, depending on the number of variables involved:
Fig. 1. Checkout screen: variants A (original, left) and B (modified, right). © 2007 ACM, Inc. Included by permission.
– A/B, A/B/C, ..., A/../N split testing: These tests compare one or more variations of a single site element or factor, such as a promotional offer. Site developers can quickly see which variation of the factor is most persuasive and yields the highest conversion rates. In the simplest case (A/B test), the original version of the interface is served to 50% of the users (A or Control Group), and the modified version is served to the other 50% (B or Treatment Group). (In reality, the treatment group will only comprise a tiny fraction of the users of a website, so as to keep losses low if the conversion rate of the treatment version should turn out to be poorer than that of the existing version.) While A/B tests are simple to conduct, they are often not very informative. For instance, consider Figure 1, which depicts the original version and a variant of a checkout example taken from [11]. (Eisenberg reports that Interface A resulted in 90% fewer purchases, probably because potential buyers who had no promotion code were put off by the fact that others could get lower prices.) This variant has been obtained by modifying 9 different factors. While an A/B test tells us which of two alternatives is better, it does not yield reliable information on how combinations of the different factors influence the performance of the variant.
– Multivariate testing: A multivariate test can be viewed as a combination of many A/B tests, whereby all factors are systematically varied. Multivariate testing extends the effectiveness of online tests by allowing the impact of interactions between factors to be measured. A multivariate test can, e.g., reveal that two interface elements yield an unexpectedly high conversion rate only when they occur together, or that an element that has a positive effect on conversion loses this effect in the presence of other elements.
The execution of a test can be logically separated into two steps, namely (a) the assignment of users to the test, and to one of the subgroups for each of the interfaces to be tested, and (b) the subsequent selection and presentation of this interface to the user. The implementation of online tests partly blurs the two different steps.
The assignment of users to different subgroups is generally randomized, but different methods exist, such as:
– Pseudo-random assignment with caching: consists of the use of a pseudo-random number generator coupled with some form of caching in order to preserve consistency between sessions (i.e., a user should be assigned to the same interface variant on successive visits to the site); and
– Hash and partitioning: assigns a unique user identifier that is either stored in a database or in a cookie. The entire set of identifiers is then partitioned, and each partition is assigned to a variant.
The second method is usually preferred due to scalability problems with the first.
Three implementation methods are used for the selection and presentation of the interface to the user:
– Traffic splitting: In order to generate the different variants, different implementations are created and placed on different physical or virtual servers. Then, by using a proxy or a load balancer which invokes the randomization algorithm, a user’s traffic is diverted to the assigned variant.
– Server-side selection: All the logic which invokes the randomization algorithm and produces the different variants for users is embedded in the code of the site.
– Client-side selection: Assignment and generation of variants is achieved through dynamic modification of each requested page at the client side using JavaScript.
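Whichever of these methods presents the variant, the hash and partitioning assignment mentioned earlier can be sketched as follows. This is an illustrative Python sketch and not part of the approach described in this paper: the function name, the use of SHA-256, the bucket resolution, and the `experiment` label are assumptions.

```python
import hashlib
from typing import Dict

BUCKETS = 10_000  # resolution of the partition (assumed value)


def assign_variant(user_id: str, shares: Dict[str, float],
                   experiment: str = "checkout-test") -> str:
    """Deterministically map a user identifier to a variant.

    `shares` gives the fraction of users per variant and should sum to 1.
    """
    if not shares:
        raise ValueError("at least one variant is required")
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Hash the identifier into one of BUCKETS equally likely buckets.
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    upper = 0.0
    for variant, share in shares.items():
        upper += share * BUCKETS  # upper bound of this variant's partition
        if bucket < upper:
            return variant
    return list(shares)[-1]  # guard against rounding in the shares


variant = assign_variant("user-42", {"A": 0.5, "B": 0.5})
```

Because the variant is a pure function of the identifier, a returning user with the same cookie or database identifier sees the same interface without any per-user assignment having to be cached. Unequal shares, such as a small treatment group, only require different values in `shares`.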
3 Problems with Current Online Test Design and Implementation
The three implementation methods discussed above entail a number of disadvantages, which are a function of the choices made at the architectural level and not of the specific characteristics of an online experiment (such as the chosen OEC or the interface features being modified):
– Traffic splitting: Although traffic splitting does not require any changes to the code in order to produce the different user assignments to variants, the implementation of this approach is relatively expensive. The website and the code for the measurement of the OEC have to be replicated n times, where n is the number of tested combinations of different factors (number of possible variants). In addition to the complexity of creating each variant for the test manually by modifying the original website’s code (impossible in the case of multivariate tests involving several factors), there is also a problem associated with the hardware required for the execution of the test. If physical servers are used, a fleet of servers will be needed so that each of the variants tested will be hosted on one of them. Likewise, if virtual servers are being used, the amount of system resources required to accommodate the workload will easily exceed the capacity of the physical server, requiring the use of several servers and complicating the supporting infrastructure.
– Server-side selection: Extensive code modification is required if interface selection and presentation is performed at the server side. Not only do randomization and user assignment have to be embedded in the code, but branching logic also has to be added in order to produce the different interfaces corresponding to the different combinations of variants. In addition, the code may become unnecessarily complex, particularly if different combinations of factors are to be considered at the same time when tests are being run concurrently.
However, if these problems are solved, server-side selection is a powerful alternative which allows deep modifications to the system and is cheap in terms of supporting infrastructure.
– Client-side selection: Although client-side selection is to some extent easier to implement than server-side selection, it suffers from the same shortcomings. In addition, the features subject to experimentation are far more
J. Cámara and A. Kobsa
limited (e.g., modifications which go beyond the mere interface are not possible, JavaScript must be enabled in the client browser, execution is error-prone, etc.).
Independent of the chosen form of implementation, substantial support for systematic online experimentation at a framework level is urgently needed. The framework will need to support the definition of the different factors and their possible combinations at the test design stage, and their execution at runtime. Being able to evolve a site safely by keeping track of each of the variants’ performance as well as maintaining a record of the different experiments is very desirable when contrasted with the execution of isolated tests on an ad-hoc basis.
4 A Systematic Approach to Online Test Design and Implementation
To overcome the various limitations described in the previous section, we advocate a systematic approach to the development of online experiments. For this purpose, we rely on two different foundations: (i) software product lines provide the means to properly model the variability inherent in the design of the experiments, and (ii) aspect-oriented software development (AOSD) helps to reduce the effort and cost of implementing the variants of the test by capturing variation factors in aspects. The use of AOSD will also help in presenting variants to users, as well as simplifying user assignment and data collection. By combining these two foundations we aim at supplying developers with the necessary tools to design tests in a systematic manner, enabling the partial automation of variant generation and the complete automation of test deployment and execution.
4.1 Test Design Using Software Product Lines
Software Product Line models describe all requirements or features in the potential variants of a system. In this work, we use a feature-based model similar to the models employed by FODA [13] or FORM [14]. This model takes the form of a lattice of parent-child relationships which is typically quite large. Single systems or variants are then built by selecting a set of features from the model. Product line models allow the definition of the directly reusable (DR) or mandatory features which are common to all possible variants, and three types of discriminants or variation points, namely:
– Single adaptors (SA): a set of mutually exclusive features from which only one can be chosen when defining a particular system.
– Multiple adaptors (MA): a list of alternatives which are not mutually exclusive. At least one must be chosen.
– Options (O): a single optional feature that may or may not be included in a system definition.
Facilitating Controlled Tests of Website Design Changes
F1(MA) The cart component must include a checkout screen.
  – F1.1(SA) There must be an additional “Continue Shopping” button present.
    • F1.1.1(DR) The button is placed on top of the screen.
    • F1.1.2(DR) The button is placed at the bottom of the screen.
  – F1.2(O) There must be an “Update” button placed under the quantity box.
  – F1.3(SA) There must be a “Total” present.
    • F1.3.1(DR) Text and amount of the “Total” appear in different boxes.
    • F1.3.2(DR) Text and amount of the “Total” appear in the same box.
  – F1.4(O) The screen must provide discount options to the user.
    • F1.4.1(DR) There is a “Discount” box present, with amount in a box next to it on top of the “Total” box.
    • F1.4.2(DR) There is an “Enter Coupon Code” input box present on top of “Shipping Method”.
    • F1.4.3(DR) There must be a “Recalculate” button left of “Continue Shopping.”
Fig. 2. Feature model fragment corresponding to the checkout screen depicted in Figure 1
In order to define the different interface variants that are present in an online test, we specify all common interface features as DR features in a product line model. Varying elements are modeled using discriminants. Different combinations of interface features will result in different interface variants. An example for such a feature model is given in Figure 2, which shows a fragment of a definition of some of the commonalities and discriminants of the two interface variants depicted in Figure 1. Variants can be manually created by the test designer through the selection of the desired interface features in the feature model, or automatically by generating all the possible combinations of feature selections. Automatic generation is especially interesting in the case of multivariate testing. However, it is worth noting that not all combinations of feature selections need to be valid. For instance, if we intend to generate a variant which includes F1.3.1 in our example, that same selection cannot include F1.3.2 (single adaptor). Likewise, if F1.4 is selected, it is mandatory to include F1.4.1-F1.4.3 in the selection. These restrictions are introduced by the discriminants used in the product line model. If restrictions are not satisfied, we have generated an invalid variant that should not be presented to users. Therefore, generating all possible feature combinations for a multivariate test is not enough for our purposes. Fortunately, the feature model can be easily translated into a logical expression by using features as atomic propositions and discriminants as logical connectors. The logical expression of a feature model is the conjunction of the logical expressions for each of the sub-graphs in the lattice and is achieved using logical AND. If Gi and Gj are the logical expressions for two different sub-graphs, then the logical expression for the lattice is:
$$G_i \wedge G_j$$

Parent-child dependency is expressed using a logical AND as well. If $a_i$ is a parent requirement and $a_j$ is a child requirement such that the selection of $a_j$ is dependent on $a_i$, then $a_i \wedge a_j$. If $a_i$ also has other children $a_k \ldots a_z$, then:

$$a_i \wedge (a_k \wedge \ldots \wedge a_z)$$

The logical expression for a single adaptor discriminant is exclusive OR. If $a_i$ and $a_j$ are features such that $a_i$ is mutually exclusive to $a_j$, then $a_i \oplus a_j$. Multiple adaptor discriminants correspond to logical OR. If $a_i$ and $a_j$ are features such that at least one of them must be chosen, then $a_i \vee a_j$. The logical expression for an option discriminant is a bi-conditional⁴. If $a_i$ is the parent of another feature $a_j$, then the relationship between the two features is $a_i \leftrightarrow a_j$. Table 1 summarizes the relationships and logical definitions of the model.

The general expression for a product line model is $G_1 \wedge G_2 \wedge \ldots \wedge G_n$, where $G_i$ is $a_i \, R \, a_j \, R \, a_k \, R \ldots R \, a_n$ and $R$ is one of $\wedge$, $\vee$, $\oplus$, or $\leftrightarrow$. The logical expression for the checkout example feature model shown in Figure 2 is:

$$F1 \wedge (\, F1.1 \wedge (F1.1.1 \oplus F1.1.2) \vee F1.2 \vee F1.3 \wedge (F1.3.1 \oplus F1.3.2) \vee F1.4 \leftrightarrow (F1.4.1 \wedge F1.4.2 \wedge F1.4.3) \,)$$

By instantiating all the feature variables in the expression to true if selected, and false if unselected, we can generate the set of possible variants and then test their validity using the algorithm described in [21]. A valid variant is one for which the logical expression of the complete feature model evaluates to true.

Table 1. Feature model relations and equivalent formal definitions

Feature Model Relation    Formal Definition
Sub-graph                 $G_i \wedge G_j$
Dependency                $a_i \wedge a_j$
Single adaptor            $a_i \oplus a_j$
Multiple adaptor          $a_i \vee a_j$
Option                    $a_i \leftrightarrow a_j$
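As a rough illustration of the validity test, the following Python sketch evaluates two of the sub-expressions of the checkout model, the F1.3 single adaptor and the F1.4 option, by brute-force enumeration of truth assignments. It is not the algorithm of [21]. Restricting the check to these two sub-graphs, joining them with a conjunction, and the helper names `xor`, `iff`, `is_valid`, and `valid_variants` are all assumptions made for the example.

```python
from itertools import product

# Features of the two sub-graphs considered in this sketch (names from Figure 2).
FEATURES = ["F1.3", "F1.3.1", "F1.3.2", "F1.4", "F1.4.1", "F1.4.2", "F1.4.3"]


def xor(a, b):
    """Single adaptor: exactly one of the two alternatives is selected."""
    return a != b


def iff(a, b):
    """Option: bi-conditional, true when both sides have the same value."""
    return a == b


def is_valid(v):
    """Evaluate the two sub-expressions for an assignment v (feature -> bool)."""
    g_total = v["F1.3"] and xor(v["F1.3.1"], v["F1.3.2"])
    g_discount = iff(v["F1.4"], v["F1.4.1"] and v["F1.4.2"] and v["F1.4.3"])
    # Sub-graphs are combined with a conjunction (Table 1, first row).
    return g_total and g_discount


def valid_variants():
    """Yield every truth assignment over FEATURES that satisfies is_valid."""
    for values in product([False, True], repeat=len(FEATURES)):
        assignment = dict(zip(FEATURES, values))
        if is_valid(assignment):
            yield assignment
```

Of the 2^7 = 128 assignments over these seven features, 16 satisfy both sub-expressions, and none of them selects F1.3.1 and F1.3.2 together.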
Manual selection can also benefit from this approach, since the test administrator can be guided in the process of feature selection: inconsistencies in the resulting variant can be pointed out as features are selected or unselected. Figure 3 depicts the feature selections for variants A and B of our checkout example. In the feature model, mandatory features are represented with black circles, whereas options are represented with white circles. White triangles express alternatives (single adaptors), and black triangles multiple adaptors.
⁴ $a_i \leftrightarrow a_j$ is true when $a_i$ and $a_j$ have the same value.
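The guidance during manual selection could, for instance, take the form of a list of messages recomputed after every change to the selection. The sketch below is one hypothetical way to do this in Python for the single adaptors and the option with sub-features in Figure 2. The tables `SINGLE_ADAPTORS` and `OPTIONS`, the function name `inconsistencies`, and the message wording are assumptions for illustration, not part of the approach described here.

```python
# Hypothetical encoding of the discriminants in Figure 2.
SINGLE_ADAPTORS = {
    "F1.1": ["F1.1.1", "F1.1.2"],
    "F1.3": ["F1.3.1", "F1.3.2"],
}
OPTIONS = {"F1.4": ["F1.4.1", "F1.4.2", "F1.4.3"]}


def inconsistencies(selected):
    """Return human-readable problems for a set of selected feature names."""
    problems = []
    for parent, alternatives in SINGLE_ADAPTORS.items():
        picked = [a for a in alternatives if a in selected]
        if parent in selected and len(picked) != 1:
            problems.append(
                f"{parent}: exactly one of {alternatives} must be selected, got {picked}"
            )
        elif parent not in selected and picked:
            problems.append(f"{parent}: {picked} selected without their parent")
    for parent, parts in OPTIONS.items():
        picked = [p for p in parts if p in selected]
        if parent in selected and len(picked) != len(parts):
            missing = [p for p in parts if p not in selected]
            problems.append(f"{parent}: missing required sub-features {missing}")
        elif parent not in selected and picked:
            problems.append(f"{parent}: {picked} selected without their parent")
    return problems
```

An empty list means that no inconsistency was found among the encoded discriminants; a user interface could display the returned messages next to the offending features.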
[Figure 3: two feature trees, Variant A (Original) and Variant B. Each is rooted at (F1) Checkout Screen with children (F1.1) Continue Button, with alternatives (F1.1.1) Placed Top and (F1.1.2) Placed Bottom; (F1.2) Update Button; (F1.3) Total Display, with alternatives (F1.3.1) Split Box and (F1.3.2) Same Box; and (F1.4) Discount, with sub-features (F1.4.1) Discount Box, (F1.4.2) Coupon Code Box, and (F1.4.3) Recalculate Button.]
Fig. 3. Feature selections for the generation of variants A and B from Figure 1
As regards automatic variant generation, we must bear in mind that full factorial designs (i.e., testing every possible combination of interface features) provide the greatest amount of information about the individual and joint impacts of the different factors. However, obtaining a statistically meaningful number of cases for this type of experiment takes time, and handling a huge number of variants aggravates this situation. In our approach, the combinatorial explosion in multivariate tests is dealt with by bounding the parts of the hierarchy which descend from an unselected feature. This avoids the generation of all the variations derived from that specific part of the product line. In addition, our approach does not confine the test designer to a particular selection strategy. It is possible to integrate any optimization method for reducing the complexity of full factorial designs, such as, for instance, hill climbing strategies like the Taguchi approach [28].
4.2 Case Study: Checkout Screen
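Before turning to the shopping cart implementation, the bounding strategy from Section 4.1 (sub-trees below an unselected feature are never expanded) can be illustrated on the checkout feature model of Figure 2. The following Python sketch is a simplified reading of that model, not the generator used by the authors. The `Feature` class, the `rule` values, and the treatment of F1.2 and F1.4 as optional children of F1 are assumptions made for the example.

```python
from dataclasses import dataclass, field
from itertools import combinations, product


@dataclass
class Feature:
    name: str
    rule: str = "all"       # how children are chosen: "all", "one" (SA), "some" (MA)
    optional: bool = False  # an "all" parent may leave this child out
    children: list = field(default_factory=list)


def child_groups(node):
    """List the admissible sets of children once node itself is selected."""
    kids = node.children
    if node.rule == "one":   # single adaptor: exactly one child
        return [[c] for c in kids]
    if node.rule == "some":  # multiple adaptor: any non-empty subset
        return [list(g) for r in range(1, len(kids) + 1) for g in combinations(kids, r)]
    required = [c for c in kids if not c.optional]
    optional = [c for c in kids if c.optional]
    return [required + list(g)
            for r in range(len(optional) + 1) for g in combinations(optional, r)]


def variants(node):
    """Yield frozensets of feature names; unselected children are never expanded."""
    for group in child_groups(node):
        for parts in product(*(list(variants(c)) for c in group)):
            yield frozenset({node.name}).union(*parts)


checkout = Feature("F1", children=[
    Feature("F1.1", rule="one", children=[Feature("F1.1.1"), Feature("F1.1.2")]),
    Feature("F1.2", optional=True),
    Feature("F1.3", rule="one", children=[Feature("F1.3.1"), Feature("F1.3.2")]),
    Feature("F1.4", optional=True,
            children=[Feature("F1.4.1"), Feature("F1.4.2"), Feature("F1.4.3")]),
])
```

Under this encoding, `list(variants(checkout))` contains 16 variants, compared with 2^12 = 4096 raw truth assignments over the twelve features, because the three discount sub-features are only visited when F1.4 is selected.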
Continuing with the checkout screen example described in Section 1, we introduce a simplified implementation of the shopping cart in order to illustrate our approach. We define a class ‘shopping cart’ (Cart) that allows for the addition and removal of different items (see Figure 4). This class contains a number of methods that render the different elements in the cart at the interface level, such as printTotalBox() or printDiscountBox(). These are private class methods called from within the public method printCheckoutTable(), which is intended
[Figure 4: class diagram with four classes]
General: +printHeader(), +printBanner(), +printMenuTop(), +printMenuBottom()
Cart: -shippingmethod, -subtotal, -tax, -total, +addItem(), +removeItem(), -printDiscountBox(), -printTotalBox(), -printCouponCodeBox(), -printShippingMethodBox(), -recalculateButton(), -continueShoppingButton(), +printCheckoutTable(), +doCheckout()
Item: -Id, -name, -price
User: -name, -email, -username, -password
[Multiplicity labels on the associations in the diagram: 1, *, 1, 1, 1, 1; their endpoints are not recoverable from the extracted text.]
Fig. 4. Classes involved in the shopping cart example
to render the main body of our checkout screen. A user’s checkout is completed when doCheckout() is invoked. On the other hand, the General class contains auxiliary functionality, such as representing common elements of the site (e.g., headers, footers and menus).
4.3 Implementing Tests with Aspects
Aspect-Oriented Software Development (AOSD) is based on the idea that systems are better programmed by separately specifying their different concerns (areas of interest), using aspects and a description of their relations with the rest of the system. Those specifications are then automatically woven (or composed) into a working system. This weaving process can be performed at different stages of the development, ranging from compile-time to run-time (dynamic weaving) [26]. The dynamic approach (Dynamic AOP or d-AOP) implies that the virtual machine or interpreter running the code must be aware of aspects and control the weaving process. This represents a remarkable advantage over static AOP approaches, considering that aspects can be applied and removed at run-time, modifying application behaviour during the execution of the system in a transparent way. With conventional programming techniques, programmers have to explicitly call methods available in other component interfaces in order to access their functionality, whereas the AOSD approach offers implicit invocation mechanisms for behavior in code whose writers were unaware of the additional concerns (obliviousness). This implicit invocation is achieved by means of join points. These are regions in the dynamic control flow of an application (method calls or executions, exception handling, field setting, etc.) which can be intercepted by an aspect-oriented program by using pointcuts (predicates which allow the quantification of join points) to match with them. Once a join point has been matched, the program can run the code corresponding to the new behavior
(advices) typically before, after, instead of, or around (before and after) the matched join point.
In order to test and illustrate our approach, we use PHP [25], one of the predominant programming languages in Web-based application development. It is an easy-to-learn language specifically designed for the Web, and has excellent scaling capabilities. Among the variety of AOSD options available for PHP, we have selected phpAspect [4], which is to our knowledge the most mature implementation so far, providing AspectJ⁵-like syntax and abstractions. Although there are other popular languages and platforms available for Web application development (Java Servlets, JSF, etc.), most of them provide similar abstractions and mechanisms. In this sense, our proposal is technology-agnostic and easily adaptable to other platforms.
Aspects are especially suited to overcome many of the issues described in Section 3. They are used for different purposes in our approach, which are described below.
Variant implementation. The different alternatives that have been used so far for variant implementation have important disadvantages, which we discussed in Section 3. These disadvantages include the need to produce different versions of the system code either by replicating and modifying it across several servers, or by using branching logic on the server or client sides. Using aspects instead of the traditional approaches offers the advantage that the original source code does not need to be modified, since aspects can be applied as needed, resulting in different variants. In our approach, each feature described in the product line is associated with one or more aspects which modify the original system in a particular way. Hence, when a set of features is selected, the appropriate variant is obtained by weaving the set of aspects associated with the selected features into the base code⁶, thereby modifying the original implementation.
⁵ AspectJ [9,15] is the de-facto standard in aspect-oriented programming languages.
⁶ That is, the code of the original system.

To illustrate how these variations are achieved, consider for instance the features labeled F1.3.1 and F1.3.2 in Figure 2. These two features are mutually exclusive and state that in the total box of the checkout screen, text and amount should appear in different boxes or in the same box, respectively. In the original implementation (Figure 1.A), text and amount appeared in different boxes, and hence there is no need to modify the behavior if F1.3.1 is selected. When F1.3.2 is selected though, we merely have to replace the behavior that renders the total box (implemented in the method Cart.printTotalBox()). We achieve this by associating an appropriate aspect with this feature. In Listing 1, by defining a pointcut that intercepts the execution of the total box rendering method, and applying an around-type advice, we are able to replace the method through which this particular element is being rendered at the interface.

Listing 1. Rendering code replacement aspect

    aspect replaceTotalBox{
        /* Matches every execution of Cart::printTotalBox */
        pointcut render:exec(Cart::printTotalBox(*));
        around(): render{
            /* Alternative rendering code */
        }
    }

This approach to the generation of variants results in better code reusability (especially in multivariate testing) as well as reduced costs and efforts, since developers do not have to replicate or generate complete variant implementations. In addition, this approach is safer and cleaner since the system logic does not have to be temporarily (or manually) modified, thus avoiding the resulting risks in terms of security and reliability. Finally, not only interface modifications such as the ones depicted in Figure 1, but also backend modifications are easier to perform, since aspect technology allows a behavior to be changed even if it is scattered throughout the system code.

The practical implications of using AOP for this purpose can be easily seen in an example. Consider for instance Amazon’s recommendation algorithm, which is invoked in many places throughout the website such as its general catalog pages, its shopping cart, etc. Assume that Amazon’s development team wonders whether an alternative algorithm that they developed would perform better than the original. With traditional approaches they could modify the source code only by (i) replicating the code on a different server and replacing all the calls⁷ made to the recommendation algorithm, or (ii) including a condition contingent on the variant that is being executed in each call to the algorithm. Using aspects instead enables us to write a simple statement (pointcut) to intercept every call to the recommendation algorithm throughout the site, and replace it with the call to the new algorithm.

Experimenting with variants may require going beyond mere behavior replacement though. This means that any given variant may require for its implementation the modification of data structures or method additions to some classes. Consider for instance a test in which developers want to monitor how customers react to discounts on products in a catalog. Assume that discounts can be different for each product and that the site has not initially been designed to include any information on discounts, i.e., this information needs to be introduced somewhere in the code.
To solve this problem we can use inter-type declarations. Aspects can declare members (fields, methods, and constructors) that are owned by other classes. These are called inter-type members. As can be observed in Listing 2, we introduce an additional discount field in our item class, and also a getDiscountedPrice() method which will be used whenever the discounted price of an item is to be retrieved. Note that we need to introduce a new method, because it should still be possible to retrieve the original, non-discounted price.

Listing 2. Item discount inter-type declarations

    aspect itemDiscount{
        /* Field and method introduced into the Item class */
        private Item::$discount;
        public function Item::getDiscountedPrice(){
            return ($this->price - $this->discount);
        }
    }

⁷ In the simplest case, only the algorithm’s implementation would be replaced. However, modifications on each of the calls may also be required, e.g., due to differences in the signature with respect to the original algorithm’s implementation.

Data Collection and User Interaction. The code in charge of measuring and collecting data for the experiment can also be written as aspects in a concise manner. Consider a new experiment with our checkout example in which we want to calculate how much customers spend on average when they visit our site. To this end, we need to add up the amount of money spent on each purchase. One way to implement this functionality is again inter-type declarations.

Listing 3. Data collection aspect

    aspect accountPurchase{
        private $dbtest;
        pointcut commitTrans:exec(Cart::doCheckout(*));
        /* Method introduced into the Cart class */
        function Cart::accountPurchase(DBManager $db){
            $db->insert($this->getUserName(), $this->total);
        }
        around($this): commitTrans{
            /* Record the purchase only if the checkout succeeds */
            if (proceed()){
                $this->accountPurchase($thisAspect->dbtest);
            }
        }
    }
When the aspect in Listing 3 intercepts the method that completes a purchase (Cart.doCheckout()), the associated advice inserts the sales amount into a database that collects the results from the experiment (but only if the execution of the intercepted method succeeds, which is represented by proceed() in the advice). It is worth noting that while the database reference belongs to the aspect, the method used to insert the data belongs to the Cart class. Aspects permit the easy and consistent modification of the methods that collect, measure, and synthesize the OEC from the gathered data to be presented to the test administrator in order to be analyzed. Moreover, data collection procedures do not need to be replicated across the different variants, since the system will weave this functionality across all of them.
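For readers without access to an aspect weaver, the three mechanisms used in Listings 1 to 3 (behavior replacement, inter-type declarations, and data collection around a join point) can be approximated in a dynamic language by patching classes at run time. The Python sketch below is such an analogy only: it is not phpAspect, it offers no pointcut language, and the helper `weave_around`, the simplified `Cart` and `Item` classes, and the in-memory `purchases` list standing in for the experiment database are all assumptions made for illustration.

```python
import functools


class Item:
    def __init__(self, name, price):
        self.name = name
        self.price = price


class Cart:
    """Simplified stand-in for the Cart class of Figure 4."""

    def __init__(self, user_name):
        self.user_name = user_name
        self.total = 0.0

    def add_item(self, item):
        self.total += item.price

    def print_total_box(self):
        # Original rendering: text and amount in different boxes (F1.3.1).
        return f"[Total] [{self.total:.2f}]"

    def do_checkout(self):
        # In this sketch a checkout succeeds only for a non-empty cart.
        return self.total > 0


def weave_around(cls, method_name, advice):
    """Make advice(self, proceed) run in place of cls.method_name.

    Returns a function that restores the original method.
    """
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def woven(self, *args, **kwargs):
        proceed = functools.partial(original, self, *args, **kwargs)
        return advice(self, proceed)

    setattr(cls, method_name, woven)
    return lambda: setattr(cls, method_name, original)


def same_box_total(self, proceed):
    # Replacement advice (cf. Listing 1): the original method is never called.
    return f"[Total: {self.total:.2f}]"


def introduce_discount(item_cls):
    # Analogue of the inter-type declarations in Listing 2.
    item_cls.discount = 0.0
    item_cls.get_discounted_price = lambda self: self.price - self.discount


purchases = []  # stands in for the experiment database


def account_purchase(self, proceed):
    # Data collection advice (cf. Listing 3): record successful checkouts only.
    succeeded = proceed()
    if succeeded:
        purchases.append((self.user_name, self.total))
    return succeeded
```

Calling `weave_around(Cart, "print_total_box", same_box_total)` switches every cart to the same-box rendering without editing `Cart`, and the returned function undoes the change, which loosely mirrors applying and removing an aspect.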
User Assignment. Rather than implementing user assignment in a proxy or load balancer that routes requests to different servers, or including it in the implementation of the base system, we experimented with two different alternatives of aspect-based server-side selection:
– Dynamic aspect weaving: A user routing module acts as an entry point to the base system. This module looks up which aspects have to be woven to produce the particular variant to which the current user has been assigned. The module then incorporates these aspects dynamically upon each request received by the server, flexibly producing variants in accordance with the user’s assignment. Although this approach is elegant and minimizes storage requirements, it does not scale well. Having to weave a set of aspects (even if they are only a few) on the base system upon each request to the server is computationally very demanding, and the process is prone to errors.
– Static aspect weaving: The different variants are computed offline, and each of them is uploaded to the server. In this case the routing module just forwards the user to the corresponding variant stored on the server (the base system is treated just like another variant for the purpose of the experiment). This method does not slow down the operation of the server and is a much more robust approach to the problem. The only downside of this alternative is that the code corresponding to the different variants has to be stored temporarily on the server (although this is a minor inconvenience since usually the amount of space required is negligible compared to the average server storage capacity). Furthermore, this alternative is cheaper than traffic splitting, since it does not require the use of a fleet of servers nor the modification of the system’s logic. This approach still allows one to spread the different variants across several servers in case of high traffic load.
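A routing module for the static alternative could be as small as the following Python sketch. The text does not specify the randomization algorithm, so the hash-based assignment used here is an assumption, chosen because it keeps a returning user on the same variant without storing any state. The function names, the experiment label, and the directory layout under `/srv/variants` are likewise hypothetical.

```python
import hashlib


def assign_variant(user_id, variants, experiment="checkout-test"):
    """Deterministically map a user to one of the variants.

    Hashing the experiment label together with the user id gives a stable,
    roughly uniform assignment (an assumed scheme, not the authors').
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return variants[bucket % len(variants)]


def variant_root(user_id, variants, base_dir="/srv/variants"):
    """Return the directory holding the pre-woven variant for this user."""
    return f"{base_dir}/{assign_variant(user_id, variants)}"
```

With `variants = ["base", "variant_b"]`, the base system is simply listed as one more variant, as in the static weaving alternative, and each request is forwarded to the directory returned by `variant_root`.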
5 Tool Support
The approach for online experiments on websites that we presented in this article has been implemented in a prototype tool, called WebLoom. It includes a graphical user interface, to build and visualize feature models that can be used as the structure upon which controlled experiments on a website can be defined. In addition, the user can write aspect code which can be attached to the different features. Once the feature model and associated code have been built, the tool supports both automatic and manual variant generation, and is able to deploy aspect code which lays out all the necessary infrastructure to perform the designed test on a particular website. The prototype has been implemented in Python, using the wxWidgets toolkit technology for the development of the user interface. It both imports and exports simple feature models described in an XML format specific to the tool. The prototype tool’s graphical user interface is divided into three main working areas:
Fig. 5. WebLoom displaying the product line model depicted in Figure 2
– Feature model. This is the main working area where the feature model can be specified (see Figure 5). It includes a toolbar for the creation and modification of discriminants and a code editor for associated modifications. This area also allows the selection of features in order to generate variants.
– Variant management. Variants generated on the site model area can be added or removed from the current test, renamed or inspected. A compilation of the descriptions of all features contained in a variant is automatically presented to the user, based on feature selections, when the variant is selected (Figure 6, bottom).
– Overall Estimation Criteria. One or more OEC to be measured in the experiments can be defined in this section. Each OEC is labeled so that it can be identified later on, and the associated code for gathering and processing data is directly defined by the test administrator.
In Figure 7, we can observe the interaction with our prototype tool. The user enters a description of the potential modifications to be performed on the website, in order to produce the different variants under WebLoom’s guidance. This results in a basic feature model structure, which is then enriched with code associated with the aforementioned modifications (aspects). Once the feature model is complete, the user can freely select a number of features using the interface,
Fig. 6. Variant management screen in WebLoom
[Figure 7: diagram with three stages. 1. Design (the Designer working in WebLoom): 1.a Specify Feature Model, 1.b Add Feature Code, 1.c Define Variants 1..n (by Selecting Features), 1.d Define OECs. 2. Aspect Code Generation: Aspect Code for Variants 1..n and Data Collection Aspect Code. 3. Aspect Weaving: the Weaver combines the aspect code with the System Logic into the Test Implementation.]
Fig. 7. Operation of WebLoom
and take snapshots of the current selections in order to generate variants. These variants are automatically checked for validity before being incorporated into the variant collection. Alternatively, the user can ask the tool to generate all the valid variants for the current feature model and then remove the ones which are not interesting for the experiment. Once all necessary input has been received, the tool gathers the code for each particular variant to be tested in the experiment, by collecting all the aspects associated with the features that were selected for the variant. It then invokes
the weaver to produce the actual variant code for the designed test by weaving the original system code with the collection of aspects produced by the tool.
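The collection step can be pictured with a short Python sketch. It is not WebLoom's code: the function `aspects_for_variant`, the mapping `FEATURE_ASPECTS`, and the aspect names other than `replaceTotalBox` and `accountPurchase` (which appear in Listings 1 and 3) are hypothetical, and the call to the weaver itself is deliberately left out.

```python
def aspects_for_variant(selected_features, feature_aspects, common_aspects=()):
    """Collect the aspects to weave for one variant, without duplicates.

    selected_features: names of the features selected for the variant
    feature_aspects:   mapping from feature name to its associated aspects
    common_aspects:    aspects woven into every variant (e.g. data collection)
    """
    ordered = []
    for feature in sorted(selected_features):
        for aspect in feature_aspects.get(feature, ()):
            if aspect not in ordered:
                ordered.append(aspect)
    for aspect in common_aspects:
        if aspect not in ordered:
            ordered.append(aspect)
    return ordered


# Hypothetical association of features with aspects. F1.3.1 has no entry
# because the original system already renders the total in separate boxes.
FEATURE_ASPECTS = {
    "F1.3.2": ["replaceTotalBox"],
    "F1.4": ["discountBox", "couponCodeBox"],
}
```

A variant that keeps the original total display would then receive only the shared data collection aspect, whereas one that selects F1.3.2 and F1.4 would receive their aspects followed by the data collection aspect.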
6 Related Work
Software product lines and feature-oriented design and programming have already been successfully applied in the development of Web applications, to significantly boost productivity by exploiting commonalities and reusing as many assets (including code) as possible. For instance, Trujillo et al. [29] present a case study of Feature Oriented Model Driven Design (FOMDD) on a product line of portlets (Web portal components). In this work, the authors expressed variations in portlet functionality as features, and synthesized portlet specifications by composing them conveniently. Likewise, Petersson and Jarzabek [24] present an industrial case study in which their reuse technique XVCL was incrementally applied to generate a Web architecture from the initial code base of a Web portal. The authors describe the process that led to the development of the Web Portal product line. Likewise, aspect-oriented software development has been previously applied to the development of Web applications. Valderas et al. [31] present an approach for dealing with crosscutting concerns in Web applications from requirements to design. Their approach aims at decoupling requirements that belong to different concerns. These are separately modeled and specified using the task-based notation, and later integrated into a unified requirements model that is the source of a model-to-model and model-to-code generation process yielding Web application prototypes that are built from task descriptions. Although the aforementioned approaches meet their purpose of boosting productivity by taking advantage of commonalities, and of easing maintenance by properly encapsulating crosscutting concerns, they do not jointly exploit the advantages of both approaches. Moreover, although they are situated in the context of Web application development, they are not well suited to the specific characteristics of online test design and implementation which have been described in previous sections. 
The idea of combining software product lines and aspect-oriented software development techniques does already have some tradition in software engineering. In fact, Lee et al. [18] present some guidelines on how feature-oriented analysis and aspects can be combined. Likewise, Loughran and Rashid [19] propose framed aspects as a technique and methodology that combines AOSD, frame technology, and feature-oriented domain analysis in order to provide a framework for implementing fine-grained variability. In [20], they extend this work to support product line evolution using this technique. Other approaches such as [32] aim at implementing variability, and the management and tracing of requirements for implementation by integrating model-driven and aspect-oriented software development. The AMPLE project [1] takes this approach one step further along the software lifecycle and maintenance, aiming at traceability during product line evolution. In the particular context of Web applications, Alférez and
Suesaowaluk [8] introduce an aspect-oriented product line framework to support the development of software product lines of Web applications. This framework is similarly aimed at identifying, specifying, and managing variability from requirements to implementation.
Although both the aforementioned approaches and our own proposal employ software product lines and aspects, there is a key difference in the way these elements are used. The earlier approaches are concerned with the general process of system construction by identifying and reusing aspect-oriented components, whereas our approach deals with the specific problem of online test design and implementation, where different versions of a Web application with a limited lifespan are generated to test user behavioral response. Hence, our framework is intended to generate lightweight aspects which are used as a convenient means for the transient modification of parts of the system. In this sense, it is worth noting that system and test designs and implementations are completely independent of each other, and that aspects are only involved as a means to generate system variants, but are not necessarily present in the original system design. In addition, our approach provides automatic support for the generation of all valid variants within the product line, and does not require the modification of the underlying system, which stays online throughout the whole online test process.
To the best of our knowledge, no research has so far been reported on treating online test design and implementation in a systematic manner. A number of consulting firms have already specialized in analyzing companies’ Web presence [2,6,3]. These firms offer ad-hoc studies of Web retail sites with the goal of achieving higher conversion rates. Some of them use proprietary technology that is usually focused on the statistical aspects of the experiments, requiring significant code refactoring for test implementation⁸.
Finally, SiteSpect [5] is a software package which takes a proxy-based approach to online testing. When a Web client makes a request to the Web server, it is first received by the software and then forwarded to the server (this is used to track user behavior). Likewise, responses with content are also routed through the software, which injects the HTML code modifications and forwards the modified responses to the client. Although the manufacturers claim that it does not matter whether content is generated dynamically or statically by the server since modifications are performed by replacing pieces of the generated HTML code, we find this approach adequate for trivial changes to a site only, and not very suitable for user data collection and measurement. Moreover, no modifications can be applied to the logic of the application. These shortcomings severely impair this method which is not able to go beyond simple visual changes to the site.
Footnote 8: It is, however, not easy to thoroughly compare these techniques from an implementation point of view, since firms tend to be quite secretive about them.

7 Concluding Remarks

In this paper, we presented a novel and systematic approach to the development of controlled online tests for the effects of webpage variants on users, based
on software product lines and aspect-oriented software development. We also described how the drawbacks of traditional approaches, such as high costs and development effort, can be overcome with our approach. We believe that its benefits are especially valuable for the specific problem domain that we address. On the one hand, testing is performed on a regular basis for websites in order to continuously improve their conversion rates. On the other hand, a very high percentage of the tested modifications are usually discarded since they do not improve the site performance. As a consequence, a lot of effort is lost in the process. We believe that WebLoom will save Web developers time and effort by reducing the amount of work they have to put into the design and implementation of online tests. Although there is a wide range of choices available for the implementation of Web systems, our approach is technology-agnostic and most likely deployable to different platforms and languages. However, we observed that in order to fully exploit the benefits of this approach, it should first be checked whether the implementation of a website meets the modularity principle. This is of special interest at the presentation layer, where user interface component placement, user interface style elements, event declarations and application logic traditionally tend to be mixed up [23]. Regarding future work, a first perspective aims at enhancing our basic prototype with additional WYSIWYG extensions for its graphical user interface. Specifically, developers should be enabled to immediately see the effects that code modifications and feature selections will have on the appearance of their website. This is intended to help them deal with variant generation in a more effective and intuitive manner.
A second perspective is refining the variant validation process so that variation points in feature models that are likely to cause significant design variations can be identified, thus reducing the variability.
References

1. Ample project, http://www.ample-project.net/
2. Offermatica, http://www.offermatica.com/
3. Optimost, http://www.optimost.com/
4. phpAspect: Aspect oriented programming for PHP, http://phpaspect.org/
5. Sitespect, http://www.sitespect.com
6. Vertster, http://www.vertster.com/
7. Software product lines: practices and patterns. Addison-Wesley Longman Publishing Co., Boston (2001)
8. Alférez, G.H., Suesaowaluk, P.: An aspect-oriented product line framework to support the development of software product lines of web applications. In: SEARCC 2007: Proceedings of the 2nd South East Asia Regional Computer Conference (2007)
9. Colyer, A., Clement, A., Harley, G., Webster, M.: Eclipse AspectJ: Aspect-Oriented Programming with AspectJ and the Eclipse AspectJ Development Tools. Pearson Education, Upper Saddle River (2005)
10. Eisenberg, B.: How to decrease sales by 90 percent, http://www.clickz.com/1588161
11. Eisenberg, B.: How to increase conversion rate 1,000 percent, http://www.clickz.com/showPage.html?page=1756031
12. Filman, R.E., Elrad, T., Clarke, S., Aksit, M. (eds.): Aspect-Oriented Software Development. Addison-Wesley, Reading (2004)
13. Kang, K., Cohen, S., Hess, J., Novak, W., Peterson, S.: Feature-oriented domain analysis (FODA) feasibility study. Technical Report CMU/SEI-90-TR-21, Software Engineering Institute, Carnegie Mellon University (November 1990)
14. Kang, K.C., Kim, S., Lee, J., Kim, K., Shin, E., Huh, M.: FORM: A feature-oriented reuse method with domain-specific reference architectures. Ann. Software Eng. 5, 143–168 (1998)
15. Kiczales, G., Hilsdale, E., Hugunin, J., Kersten, M., Palm, J., Griswold, W.G.: An Overview of AspectJ. In: Knudsen, J.L. (ed.) ECOOP 2001. LNCS, vol. 2072, pp. 327–353. Springer, Heidelberg (2001)
16. Kohavi, R., Henne, R.M., Sommerfield, D.: Practical Guide to Controlled Experiments on the Web: Listen to your Customers not to the HIPPO. In: Berkhin, P., Caruana, R., Wu, X. (eds.) Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, pp. 959–967. ACM, New York (2007)
17. Kohavi, R., Round, M.: Front Line Internet Analytics at Amazon.com (2004), http://ai.stanford.edu/~ronnyk/emetricsAmazon.pdf
18. Lee, K., Kang, K.C., Kim, M., Park, S.: Combining feature-oriented analysis and aspect-oriented programming for product line asset development. In: SPLC 2006: Proceedings of the 10th International on Software Product Line Conference, Washington, DC, USA, pp. 103–112. IEEE Computer Society, Los Alamitos (2006)
19. Loughran, N., Rashid, A.: Framed aspects: Supporting variability and configurability for AOP. In: Bosch, J., Krueger, C. (eds.) ICSR 2004. LNCS, vol. 3107, pp. 127–140. Springer, Heidelberg (2004)
20. Loughran, N., Rashid, A., Zhang, W., Jarzabek, S.: Supporting product line evolution with framed aspects. In: Lorenz, D.H., Coady, Y. (eds.) ACP4IS: Aspects, Components, and Patterns for Infrastructure Software, March, pp. 22–26
21. Mannion, M., Cámara, J.: Theorem proving for product line model verification. In: van der Linden, F.J. (ed.) PFE 2003. LNCS, vol. 3014, pp. 211–224. Springer, Heidelberg (2004)
22. McGlaughlin, F., Alt, B., Usborne, N.: The power of small changes tested (2006), http://www.marketingexperiments.com/improving-website-conversion/power-small-change.html
23. Mikkonen, T., Taivalsaari, A.: Web applications – spaghetti code for the 21st century. In: Dosch, W., Lee, R.Y., Tuma, P., Coupaye, T. (eds.) Proceedings of the 6th ACIS International Conference on Software Engineering Research, Management and Applications, SERA 2008, Prague, Czech Republic, pp. 319–328. IEEE Computer Society, Los Alamitos (2008)
24. Pettersson, U., Jarzabek, S.: Industrial experience with building a web portal product line using a lightweight, reactive approach. In: Wermelinger, M., Gall, H. (eds.) Proceedings of the 10th European Software Engineering Conference held jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Lisbon, Portugal, pp. 326–335. ACM, New York (2005)
25. PHP: Hypertext preprocessor, http://www.php.net/
26. Popovici, A., Frei, A., Alonso, G.: A Proactive Middleware Platform for Mobile Computing. In: Endler, M., Schmidt, D. (eds.) Middleware 2003. LNCS, vol. 2672. Springer, Heidelberg (2003)
27. Roy, S.: 10 Factors to Test that Could Increase the Conversion Rate of your Landing Pages (2007), http://www.wilsonweb.com/conversion/suman-tra-landing-pages.htm
28. Taguchi, G.: The role of quality engineering (Taguchi Methods) in developing automatic flexible manufacturing systems. In: Proceedings of the Japan/USA Flexible Automation Symposium, Kyoto, Japan, July 9-13, pp. 883–886 (1990)
29. Trujillo, S., Batory, D.S., Díaz, O.: Feature oriented model driven development: A case study for portlets. In: Proceedings of the 30th International Conference on Software Engineering (ICSE 2007), Leipzig, Germany, pp. 44–53. IEEE Computer Society, Los Alamitos (2007)
30. Usborne, N.: Design choices can cripple a website (2005), http://alistapart.com/articles/designcancripple
31. Valderas, P., Pelechano, V., Rossi, G., Gordillo, S.E.: From crosscutting concerns to web systems models. In: Benatallah, B., Casati, F., Georgakopoulos, D., Bartolini, C., Sadiq, W., Godart, C. (eds.) WISE 2007. LNCS, vol. 4831, pp. 573–582. Springer, Heidelberg (2007)
32. Voelter, M., Groher, I.: Product line implementation using aspect-oriented and model-driven software development. In: SPLC 2007: Proceedings of the 11th International Software Product Line Conference, Washington, DC, USA, pp. 233–242. IEEE Computer Society, Los Alamitos (2007)
Frontiers of Structured Business Process Modeling

Dirk Draheim
Central IT Services Department, University of Innsbruck
[email protected]

Abstract. In this article we investigate to what extent a structured approach can be applied to business process modelling. We try to contribute to a better understanding of the driving forces on business process specifications.
1 Introduction
Isn’t it compelling to apply the structured programming arguments to the field of business process modelling? Our answer to this question is ‘no’. The principle of structured programming emerged in the computer science community. From today’s perspective, the discussion of structured programming had the characteristics of a maturing process rather than those of a debate, although there have also been some prominent sceptical comments on the unrestricted validity of the structured programming principle. Structured programming is a well-established design principle in the field of program design, just as the third normal form is in the field of database design. It is commonly accepted that structured programming is better than unstructured programming (or, let’s say, structurally unrestricted programming), and this is what is taught as foundational knowledge in many standard curricula of software engineering study programmes. With respect to business process modelling, in practice, you find huge business process models that are arbitrary nets. How come? Is it somehow due to a lack of knowledge transfer from the programming language community to the information system community? For computer scientists, it might be tempting to state that structured programming is a proven concept and that it is therefore necessary to eventually promote a structured business process modelling discipline; however, care must be taken. In this article, we want to contribute to the understanding of the extent to which a structured approach can be applied to business process modelling and the extent to which such an approach is naive. We attempt to clarify that the arguments of structured programming are about the pragmatics of programming and that they often relied on evidence in the past. Consequently, our reasoning is at the level of the pragmatics of business process modelling.
We try to avoid getting lost in superficial comparisons of modelling language constructs and instead try to understand the core problems of structuring business process specifications. As an example, so to speak as a taster of our discussion, we bring forward one of our arguments here, which is subtle but important: there are some diagrams expressing behaviour that cannot be transformed into a structured diagram expressing the same behaviour solely in terms of the same primitives as the original structurally unrestricted diagram. These are all those diagrams that contain a loop which is exited via more than one exit point. This is a known result from the literature, encountered [1] by Corrado Böhm and Giuseppe Jacopini, proven for a special case [5] by Donald E. Knuth and Robert W. Floyd, and proven in general [6] by S. Rao Kosaraju.

A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 136–155, 2009. © Springer-Verlag Berlin Heidelberg 2009
2 Basic Definitions
In this section we explain the notions of program, structured program, flowchart, D-flowchart, structured flowchart, business process model and structured business process model as used in this article. The section is mostly about syntactical issues. You might want to skip this section and use it as a reference; however, you should at least glance at the formation rules of structured flowcharts defined in Fig. 1, which are also the basis for structured business process modelling. In the course of this article, programs are imperative programs with go-to-statements, i.e., they consist of basic statements, sequences, case constructs, loops and go-to-statements. Structured programs are those programs that abstain from go-to-statements. In programs with go-to-statements, loops do not add to the expressive power; in the presence of go-to-statements, loops are syntactic sugar. Flowcharts correspond to programs. Flowcharts are directed graphs with nodes being basic activities, decision points or join points. A directed cycle in a flowchart can be interpreted as a loop or as the usage of a go-to-statement. In general flowcharts it is allowed to place join points arbitrarily, which makes it possible to create spaghetti structures, i.e., arbitrary jump structures, just as go-to-statements allow for the creation of spaghetti code. It is a matter of taste whether to make decision and join points explicit nodes or not. If you strictly use decision and join points, the basic activities always have exactly one incoming and one outgoing edge. In concrete modelling languages like event-driven process chains, there are usually some more constraints, e.g., a constraint on decision points not to have more than one incoming edge or a constraint on join points not to have more than one outgoing edge. If you allow basic activities to have more than one incoming edge, you do not need join points any more.
Similarly, you can get rid of a decision point by directly connecting its several branches as outgoing edges to a basic activity and labelling them with appropriate flow conditions. For example, in formcharts [3] we have chosen the option not to use explicit decision and join points. The discussion in this article is independent of the detailed question of having explicit or implicit decision and join points, because both concepts are interchangeable. Therefore, in this article, we feel free to use both options.
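The graph view of flowcharts described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class name `Flowchart`, the node-kind strings and the two checks are hypothetical helpers and not notation from this article. The sketch uses explicit decision and join points and encodes a single while-loop as an example.

```python
class Flowchart:
    """Directed graph whose nodes are basic activities, decision points or join points."""

    def __init__(self):
        self.kind = {}   # node name -> "start", "end", "activity", "decision" or "join"
        self.succ = {}   # node name -> list of (edge label, target node)

    def add_node(self, name, kind):
        self.kind[name] = kind
        self.succ[name] = []

    def add_edge(self, source, target, label=""):
        self.succ[source].append((label, target))

    def in_degree(self, name):
        return sum(1 for edges in self.succ.values() for _, target in edges if target == name)

    def activities_are_simple(self):
        """True if every basic activity has exactly one incoming and one outgoing edge."""
        return all(self.in_degree(n) == 1 and len(self.succ[n]) == 1
                   for n, k in self.kind.items() if k == "activity")

    def has_cycle(self):
        """Depth-first search for a directed cycle (a loop or a go-to back edge)."""
        state = {}  # "open" while a node is on the search stack, "done" afterwards

        def visit(node):
            state[node] = "open"
            for _, target in self.succ[node]:
                if state.get(target) == "open":
                    return True
                if target not in state and visit(target):
                    return True
            state[node] = "done"
            return False

        return any(n not in state and visit(n) for n in self.kind)


# Example (our own): the loop "WHILE alpha DO A" with explicit join and decision points.
loop = Flowchart()
for name, kind in [("start", "start"), ("join", "join"), ("alpha", "decision"),
                   ("A", "activity"), ("end", "end")]:
    loop.add_node(name, kind)
loop.add_edge("start", "join")
loop.add_edge("join", "alpha")
loop.add_edge("alpha", "A", "y")
loop.add_edge("A", "join")
loop.add_edge("alpha", "end", "n")
```

With explicit join and decision points, the activity ‘A’ in this example has one incoming and one outgoing edge, and the back edge from ‘A’ to the join point shows up as a directed cycle.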
2.1 D-Charts
It is possible to define formation rules for a restricted class of flowcharts that correspond to structured programs. In [6] these diagrams are called Dijkstra-flowcharts or D-flowcharts for short, named after Edsger W. Dijkstra. Figure 1 summarizes the semi-formal formation rules for D-flowcharts.

[Figure 1 panels: (i) basic activity, (ii) sequence, (iii) case, (iv) do-while, (v) repeat-until]

Fig. 1. Semi-formal formation rules for structured flowcharts
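The formation rules (i) to (v) can be read as a small grammar. The following Python sketch models them as an abstract syntax tree and prints the corresponding structured program text. All class and function names are our own, and the case construct is assumed to be two-way for simplicity; the figure itself may be more general.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Basic:            # rule (i): basic activity
    name: str


@dataclass(frozen=True)
class Seq:              # rule (ii): sequence
    first: "Chart"
    second: "Chart"


@dataclass(frozen=True)
class Case:             # rule (iii): case (assumed two-way here)
    cond: str
    then: "Chart"
    otherwise: "Chart"


@dataclass(frozen=True)
class While:            # rule (iv): do-while
    cond: str
    body: "Chart"


@dataclass(frozen=True)
class Repeat:           # rule (v): repeat-until
    body: "Chart"
    until: str


Chart = Union[Basic, Seq, Case, While, Repeat]


def to_text(chart, indent=0):
    """Return the program text of a structured flowchart as a list of lines."""
    pad = "  " * indent
    if isinstance(chart, Basic):
        return [f"{pad}{chart.name};"]
    if isinstance(chart, Seq):
        return to_text(chart.first, indent) + to_text(chart.second, indent)
    if isinstance(chart, Case):
        return ([f"{pad}IF {chart.cond} THEN BEGIN"] + to_text(chart.then, indent + 1)
                + [f"{pad}END ELSE BEGIN"] + to_text(chart.otherwise, indent + 1)
                + [f"{pad}END;"])
    if isinstance(chart, While):
        return [f"{pad}WHILE {chart.cond} DO BEGIN"] + to_text(chart.body, indent + 1) + [f"{pad}END;"]
    if isinstance(chart, Repeat):
        return [f"{pad}REPEAT"] + to_text(chart.body, indent + 1) + [f"{pad}UNTIL {chart.until};"]
    raise TypeError(f"not a structured flowchart: {chart!r}")


# Example (our own): a repeat-until loop over A and B, followed by C.
example = Seq(Repeat(Seq(Basic("A"), Basic("B")), "alpha"), Basic("C"))
```

Because every value of type `Chart` is built only from these five constructors, it is structured by construction: there is simply no way to express a jump into or out of a loop.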
Actually, the original definition of D-flowcharts in [6] consists of the formation rules (i) to (iv), with one formation rule for each programming language construct of a minimal structured imperative programming language with basic statements, sequences, case-constructs and while-loops, and with basic activities in the flowchart corresponding to basic statements in the programming language. We have added a formation rule (v) for the representation of repeat-until-loops and call flowcharts resulting from rules (i) to (v) structured flowcharts in the sequel. The flowchart in Fig. 2 is not a structured flowchart, i.e., it cannot be derived from the formation rules in Fig. 1. The flowchart in Fig. 2 can be interpreted as consisting of a repeat-until-loop exited via the α-decision point and followed by further activities ‘C’ and ‘D’. In this case, the β-decision point can lead to a branch that jumps into the repeat-until-loop in addition to the regular loop entry point via activity ‘A’, which infringes the structured programming and structured modelling principle and gives rise to a spaghetti structure. This way, the flowchart in Fig. 2 visualizes the program in Listing 1.

[Figure 2: flowchart with activities ‘A’, ‘B’, ‘C’ and ‘D’ and decision points α and β, each with a ‘y’ and an ‘n’ branch]

Fig. 2. Example flowchart that is not a D-flowchart

Listing 1
01 REPEAT
02   A;
03   B;
04 UNTIL alpha;
05 C;
06 IF beta THEN GOTO 03;
07 D;
The flowchart in Fig. 2 can also be interpreted as consisting of a while-loop exited via the β-decision point, where the while-loop is surrounded by a preceding activity ‘A’ and a succeeding activity ‘D’. In this case, the α-decision point can lead to a branch that jumps out of the while-loop in addition to the regular loop exit via the β-decision point, which again infringes the structured modelling principle. This way, the flowchart in Fig. 2 visualizes the program in Listing 2.

Listing 2
01 A;
02 REPEAT
03   B;
04   IF NOT alpha THEN GOTO 01;
05   C;
06 UNTIL NOT beta;
07 D;
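That Listings 1 and 2 are two readings of the same flowchart can be checked experimentally. The following Python sketch is our own illustration, not part of the original argument: it emulates the go-to-statements with a program counter and records the sequence of executed activities. The helper `scripted`, which feeds predetermined answers to the decision points, is a hypothetical device introduced only for this sketch.

```python
def scripted(**answers):
    """Return a decision function that replays the given answers per decision point."""
    pending = {name: iter(values) for name, values in answers.items()}
    return lambda name: next(pending[name])


def listing1(decide):
    """Listing 1: repeat-until loop over A and B, with a jump from line 06 into the loop."""
    trace, pc = [], 2
    while True:
        if pc == 2:
            trace.append("A")
            pc = 3
        elif pc == 3:
            trace.append("B")
            pc = 4
        elif pc == 4:                         # UNTIL alpha
            pc = 5 if decide("alpha") else 2
        elif pc == 5:
            trace.append("C")
            pc = 6
        elif pc == 6:                         # IF beta THEN GOTO 03
            pc = 3 if decide("beta") else 7
        else:
            trace.append("D")
            return trace


def listing2(decide):
    """Listing 2: loop over B and C, with a jump from line 04 out of the loop."""
    trace, pc = [], 1
    while True:
        if pc == 1:
            trace.append("A")
            pc = 3
        elif pc == 3:
            trace.append("B")
            pc = 4
        elif pc == 4:                         # IF NOT alpha THEN GOTO 01
            pc = 5 if decide("alpha") else 1
        elif pc == 5:
            trace.append("C")
            pc = 6
        elif pc == 6:                         # UNTIL NOT beta
            pc = 3 if decide("beta") else 7
        else:
            trace.append("D")
            return trace


trace1 = listing1(scripted(alpha=[False, True, True], beta=[True, False]))
trace2 = listing2(scripted(alpha=[False, True, True], beta=[True, False]))
```

For the same scripted answers, both functions visit the activities in the same order. Agreement on a few runs is of course only evidence and not a proof of equivalence.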
Flowcharts are visualizations of programs. In general, a flowchart can be interpreted ambiguously as the visualization of several different program texts, because, for example, an edge from a decision point to a join point can be interpreted as a go-to-statement on the one hand or as the back branch from an exit point of a repeat-until loop to the start of the loop on the other hand. Structured flowcharts are visualizations of structured programs. Loops in structured programs and structured flowcharts enjoy the property that they have exactly one entry point and exactly one exit point. Whereas the entry point and the exit point of a repeat-until loop are different, the entry point and exit point of a while-loop are the same, so that a while-loop in a structured flowchart has exactly one contact point. That might be the reason that structured flowcharts that use only while-loops instead of repeat-until loops appear more normalized. Similarly, in a structured program and flowchart all case-constructs have exactly one entry point and one exit point. In general, additional entry and exit points can be added to loops and case constructs by the usage of go-to-statements in programs and by the usage of arbitrary decision points in flowcharts. In structured flowcharts, decision points are introduced as part of the loop constructs and as part of the case construct. In structured programs and flowcharts, loops and case-constructs are strictly nested along the lines of the derivation of their abstract syntax tree. Business process models extend flowcharts with further modelling elements like a parallel split, parallel join or non-deterministic choice. Basically, we discuss the issue of structuring business process models in terms of flowcharts in this article, because flowcharts actually are business process model diagrams, i.e., flowcharts form a subset of business process models. Like the constructs in the formation rules of Fig. 1, further business process modelling elements can also be introduced in a structured manner, with the result of having again only such diagrams that are strictly nested in terms of their looping and branching constructs. For example, in such a definition the parallel split and the parallel join would not be introduced separately but as belonging to a parallel modelling construct.

2.2 A Notion of Equivalence for Business Processes
Bisimilarity has been defined formally in [15] as an equivalence relation for infinite automaton behaviour, i.e., process algebra [12,13]. Bisimilarity expresses that two processes are equal in terms of their observable behaviour. Observable behaviour is the appropriate notion for the comparison of automatic processes. The semantics of a process can also be understood as opportunities of one process interacting with another process. Observable behaviour and experienced opportunities are different viewpoints on the semantics of a process; however, whichever viewpoint is chosen, it does not change the basic concept of bisimilarity. Business processes can be fully automatic; however, business processes can also be descriptions of human actions and therefore can also be rather a protocol of possible steps undertaken by a human. We therefore choose to explain bisimilarity in terms of opportunities of an actor, or, as a metaphor, from the perspective of a player that uses the process description as a game board, which neatly fits the notions of simulation and bisimulation, i.e., bisimilarity. In general, two processes are bisimilar if, starting from the start node, they reveal the same opportunities and each pair of same opportunities leads again to bisimilar processes. More formally, bisimilarity is defined on labelled transition systems as the existence of a bisimulation, which is a relation that enjoys the aforementioned property, i.e., nodes related by the bisimulation lead via same opportunities to nodes that are related again, i.e., recursively, by the bisimulation. In the non-structured models in this article the opportunities are edges leading out of an activity and the two edges leading out of a decision point. For our purposes, bisimilarity can be characterized by the rules in Fig. 3.
[Figure 3: bisimilarity rules (i), (ii) and (iii)]

Fig. 3. Characterization of bisimilarity for business process models
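For finite models, the definition of bisimilarity given in the text can be checked mechanically. The following Python sketch is a naive fixpoint computation under our own assumptions: a labelled transition system is a dictionary from states to sets of (label, successor) pairs, and decision outcomes are encoded as labels such as "alpha:y". Neither the encoding nor the function name comes from this article.

```python
def bisimilar(lts1, start1, lts2, start2):
    """Naive check for bisimilarity of two finite labelled transition systems.

    Each system maps a state to a set of (label, successor) pairs.
    Start from all pairs of states and delete pairs that violate the
    bisimulation condition until nothing changes.
    """
    related = {(p, q) for p in lts1 for q in lts2}
    changed = True
    while changed:
        changed = False
        for p, q in list(related):
            # Every move of p must be matched by an equally labelled move of q ...
            forth = all(any(a == b and (p2, q2) in related for b, q2 in lts2[q])
                        for a, p2 in lts1[p])
            # ... and every move of q by an equally labelled move of p.
            back = all(any(a == b and (p2, q2) in related for a, p2 in lts1[p])
                       for b, q2 in lts2[q])
            if not (forth and back):
                related.discard((p, q))
                changed = True
    return (start1, start2) in related


# A while-loop "WHILE alpha DO A" ...
loop = {
    "test": {("alpha:y", "body"), ("alpha:n", "exit")},
    "body": {("A", "test")},
    "exit": set(),
}
# ... and the same loop with its first iteration unfolded.
unfolded = {
    "test0": {("alpha:y", "body0"), ("alpha:n", "exit")},
    "body0": {("A", "test1")},
    "test1": {("alpha:y", "body1"), ("alpha:n", "exit")},
    "body1": {("A", "test1")},
    "exit": set(),
}
# Classic counter-example: a.(b + c) versus a.b + a.c.
late_choice = {"p": {("a", "p1")}, "p1": {("b", "p2"), ("c", "p3")}, "p2": set(), "p3": set()}
early_choice = {"q": {("a", "q1"), ("a", "q2")}, "q1": {("b", "q3")}, "q2": {("c", "q4")},
                "q3": set(), "q4": set()}
```

The unfolded loop is bisimilar to the original loop, which is the kind of transformation used to resolve jump structures. The last two systems have the same traces but offer their choices at different moments, so they are not bisimilar.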
3 Resolving Arbitrary Jump Structures
Have a look at Fig. 4. Like Fig. 2, it shows a business process model that is not a structured business process model. The business process described by the business process model in Fig. 4 can also be described in the style of a program text, as we did in Listing 3. In this interpretation the business process model consists of a while-loop followed by a further activity ‘B’, a decision point that might branch back into the while-loop and eventually an activity ‘C’. Alternatively, the business process can also be described by structured business process models. Fig. 5 shows two examples of such structured business process models, and Listings 4 and 5 show the corresponding program text representations that are visualized by the business process models in Fig. 5.

Listing 3
01 WHILE alpha DO
02   A;
03 B;
04 IF beta THEN GOTO 02;
05 C;

[Figure 4: business process model with activities ‘A’, ‘B’ and ‘C’ and decision points α and β, each with a ‘y’ and an ‘n’ branch]

Fig. 4. Example business process model that is not structured
[Figure 5: two structured business process models, (i) and (ii), built from activities ‘A’, ‘B’ and ‘C’ and decision points α and β]

Fig. 5. Structured business process models that replace a non-structured one
The business process models in Figs. 4 and 5, resp. Listings 3, 4 and 5, describe the same business process. They describe the same business process because they are bisimilar, i.e., in terms of their nodes, which are, basically, activities and decision points, they describe the same observable behaviour resp. the same opportunities to act for an actor; we have explained the notion of equality and the more precise approach of bisimilarity in more detail in Sect. 2.2. The derivation of the business process models in Fig. 5 from the formation rules given in Fig. 1 can be understood by having a look at the abstract syntax tree, which appears as tree ψ in Fig. 6. The proof that the process models in Figs. 4 and 5 are bisimilar is left to the reader as an exercise. The reader is also invited to find structured business process models that are less complex than the ones given in Fig. 5, where complexity is an informal concept that depends heavily on the perception and opinion of the modeller. For example, model (ii) in Fig. 5 results from an immediate simple attempt to reduce the complexity of model (i) in Fig. 5 by eliminating the ‘A’-activity which follows the α-decision point and connecting the succeeding ‘yes’-branch of the α-decision point directly back with the ‘A’-activity preceding the decision point, i.e., by reducing a while-loop-construct with a preceding statement to a repeat-until-construct. Note that model (i) in Fig. 5 has been obtained from the model in Fig. 4 by straightforwardly unfolding it behind the β-decision point as much as necessary to yield a structured description of the business process. In what sense the transformation from model (i) to model (ii) in Fig. 5 has lowered complexity, and whether it has lowered the complexity actually or rather superficially, has to be discussed in the sequel.
In due course, we will also discuss another structured business process model with auxiliary logic that is oriented towards identifying repeat-until-loops in the original process descriptions.
Listing 4
WHILE alpha DO
  A;
B;
WHILE beta DO BEGIN
  A;
  WHILE alpha DO
    A;
  B;
END;
C;

Listing 5
WHILE alpha DO
  A;
B;
WHILE beta DO BEGIN
  REPEAT
    A;
  UNTIL NOT alpha;
  B;
END;
C;
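As a plausibility check for the claim that Listings 3, 4 and 5 describe the same process, the three listings can be transcribed into Python and run against the same decision outcomes. This is a sketch under our own assumptions: the go-to-statement of Listing 3 is emulated with a program counter, and `scripted` is a hypothetical helper that replays predetermined answers for the decision points.

```python
def scripted(**answers):
    """Return a decision function that replays the given answers per decision point."""
    pending = {name: iter(values) for name, values in answers.items()}
    return lambda name: next(pending[name])


def listing3(decide):
    """Listing 3: the jump 'GOTO 02' re-enters the body of the while-loop."""
    trace, pc = [], 1
    while True:
        if pc == 1:                           # WHILE alpha DO
            pc = 2 if decide("alpha") else 3
        elif pc == 2:                         # loop body, then back to the loop test
            trace.append("A")
            pc = 1
        elif pc == 3:
            trace.append("B")
            pc = 4
        elif pc == 4:                         # IF beta THEN GOTO 02
            pc = 2 if decide("beta") else 5
        else:
            trace.append("C")
            return trace


def listing4(decide):
    """Listing 4: structured version using only while-loops."""
    trace = []
    while decide("alpha"):
        trace.append("A")
    trace.append("B")
    while decide("beta"):
        trace.append("A")
        while decide("alpha"):
            trace.append("A")
        trace.append("B")
    trace.append("C")
    return trace


def listing5(decide):
    """Listing 5: structured version with an inner repeat-until loop."""
    trace = []
    while decide("alpha"):
        trace.append("A")
    trace.append("B")
    while decide("beta"):
        while True:                           # REPEAT A; UNTIL NOT alpha
            trace.append("A")
            if not decide("alpha"):
                break
        trace.append("B")
    trace.append("C")
    return trace


script = dict(alpha=[True, False, True, False], beta=[True, False])
traces = [f(scripted(**script)) for f in (listing3, listing4, listing5)]
```

All three functions produce the same sequence of activities for this script. Such runs illustrate the equivalence but do not replace the bisimilarity proof that the text leaves as an exercise.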
The above remark on the vagueness of the notion of complexity is not just a side-remark or disclaimer but is at the core of the discussion. If the complexity of a model is a cognitive issue, a straightforward approach would be to let people vote on which of the models is more complex. If there is a sufficiently precise method to test whether a person has understood the semantics of a process specification, this method can be exploited by testing groups of people that have been given different kinds of specifications of the same process and concluding from the test results which of the process specifications should be considered more complex. Such an approach relies on the preciseness of the semantics and eventually on the quality of the test method. It is a real challenge to search for a definition of complexity of models or their representations. What we expect is that less complexity has something to do with better quality, and before we undertake efforts in defining the complexity of models we should first understand the possibilities for measuring the quality of models. The usual categories in which modellers and programmers often judge the complexity of models, like understandability or readability, are vague concepts themselves. Other categories like maintainability or reusability are more telling than understandability or readability but still vague. Of course, we can define metrics for the complexity of diagrams. For example, it is possible to define that the number of activity nodes used in a business process model increases the complexity of a model. The problem with such metrics is that it follows immediately that the models in Fig. 5 are more complex than the model in Fig. 4. Actually, this is what we believe.
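The activity-count metric mentioned above can be made concrete for the running example. The following Python sketch is deliberately naive and rests on our own assumptions: it counts activity statements in the program texts of Listings 3, 4 and 5 as a stand-in for activity nodes in the diagrams, and it recognizes an activity as a single capital letter followed by a semicolon.

```python
import re

# Program texts of Listings 3, 4 and 5, written on one line each.
LISTINGS = {
    "Listing 3": "WHILE alpha DO A; B; IF beta THEN GOTO 02; C;",
    "Listing 4": "WHILE alpha DO A; B; WHILE beta DO BEGIN A; WHILE alpha DO A; B; END; C;",
    "Listing 5": "WHILE alpha DO A; B; WHILE beta DO BEGIN REPEAT A; UNTIL NOT alpha; B; END; C;",
}


def activity_count(program_text):
    """Count basic activities, i.e. stand-alone capital letters followed by ';'."""
    return len(re.findall(r"\b[A-Z];", program_text))


counts = {name: activity_count(text) for name, text in LISTINGS.items()}
```

Under this metric the non-structured Listing 3 uses three activities, whereas the structured Listings 5 and 4 use five and six, which is exactly the ordering the text describes.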
4 Immediate Arguments for and against Structure
We believe that the models in Fig. 5 are more complex than the model in Fig. 4. A structured approach to business process models would make us believe that structured models are somehow better than non-structured models, in the same way that the structured programming approach believes that structured programs are somehow better than non-structured programs. So either less complexity need not always be better, or the credo of the structured approach must be loosened to a rule of thumb, i.e., the belief that structured models are in general better than non-structured models, despite some exceptions like our current example. An argument in favour of the structured approach could be that our current example is simply too small, i.e., that the aforementioned exceptions are made of small models or, to say it differently, that the arguments of a structured approach become valid for models beyond a certain size. We do not think so. We rather believe that our discussion scales, i.e., that the arguments that we will give in the sequel also work, or work even better, for larger models. We want to approach these questions more systematically. In order to do so, we need to answer why we believe that the models in Fig. 5 are more complex than the model in Fig. 4. Of course, the immediate answer is simply that they are larger and therefore harder to grasp, i.e., a very direct cognitive argument. But there is another important argument why we believe this. The model in Fig. 4 shows an internal reuse that the models in Fig. 5 do not show. The crucial point is the reuse of the loop consisting of the ‘A’-activity and the α-decision point in Fig. 4. We need to delve into this important aspect and will actually do this later. First, we want to discuss the dual question, which is of equal importance, i.e., we must also try to understand or try to answer the question why modellers and programmers might find that the models in Fig. 5 are less complex than the model in Fig. 4.
A standard answer to this latter question could typically be that the edge from the β-decision point to the ‘A’-activity in Fig. 4 is an arbitrary jump, i.e., a spaghetti, whereas the diagrams in Fig. 5 do not show any arbitrary jumps or spaghetti phenomena. But the question is whether this vague argument can be made more precise. A structured diagram consists of strictly nested blocks. All blocks of a structured diagram form a tree-like structure according to their nesting, which corresponds also to the derivation tree in terms of the formation rules of Fig. 1. The crucial point is that each block can be considered a semantic capsule from the viewpoint of its context. This means that once the semantics of a block is understood by the analyst studying the model, the analyst can forget about the inner modelling elements of the block. This is not so for diagrams in general. This has been the argument of looking from outside onto a block in the case that a modeller wants to know its semantics in order to understand the semantics of the context where it is utilized. Also, the dual scenario can be convincing. If an analyst is interested in understanding the semantics of a block, he can do this in terms of the inner elements of the block only. Once the analyst has identified the block, he can forget about its context to understand it. This is not so easy in a non-structured language. When passing an element, in general you do not know where you end up in following the several paths behind it. It is also possible to subdivide a non-structured diagram into chunks that are smaller than the original diagram and that make sense to understand as capsules. For example, this can be done, if possible, by transforming the diagram into a structured one, in which you will find regions of your original diagram. However, it is extra effort to do this partition. With the current set of modelling elements, i.e., those introduced by the formation rules in Fig. 1, all this can be seen particularly easily, because each block has exactly one entry point, i.e., one edge leading into it. Fortunately, standard building blocks found in process modelling would have one entry point in a structured approach. If you have, in general, also blocks with more than one entry point, it would make the discussion interesting. The above argument would not be completely infringed. Blocks still are capsules, with a semantics that can be understood locally with respect to their appearance in a strictly nested structure of blocks. The scenario itself remains neat and tidy; the difference lies in the fact that a block with more than one entry has a particularly complex semantics in a certain sense. The semantics of a block with more than one entry is manifold, e.g., the semantics of a block with two entries is threefold.
Given that, in general, we also have concurrency phenomena in a business process model, the semantics of a block with two entry points, i.e., its behaviour or opportunities, must be understood for the case that the block is entered via one or the other entry point and for the case that the block is entered via both simultaneously. But this is actually not a problem; it just means a more sophisticated semantics and more documentation. Despite a more complex semantics, a block with multiple entries still remains an anchor in the process of understanding a business process model, because it is possible, e.g., to understand the model from inside to outside following strictly the tree-like nesting, which is a canonical way to understand the diagram, i.e., a way that is always defined. It is also always possible to understand the diagram sequentially from the start node to the end node in a controlled manner. The case constructs make such sequential proceeding complex, because they open alternative paths in a tree-like manner. The advantage of a structured diagram with respect to case-constructs is that each of the alternative paths that are spawned is again a block, and it is therefore possible to understand its semantics in isolation from the other paths. This is not so in a non-structured diagram, in general, where there might be arbitrary jumps between the alternative paths. Similarly, when analyzing a structured diagram in a sequential manner, you do not get into arbitrary loops and therefore have to deal with a minimized risk of losing track.
D. Draheim
The discussion of blocks with several entry points immediately reminds us of the discussion within the business process community on multiple versus unique entry points for business processes in a setting of hierarchical decomposition. The relationship between blocks in a flat structured language and sub-diagrams in a hierarchical approach, and how they play together in a structured approach, is an important strand of discussion that we will come back to in due course. For the time being, we just want to point out the relationship between the discussion we just had on blocks with multiple entries and sub-diagrams with multiple entries. An argument against sub-diagrams with multiple entries would be that they are more complex. Opponents of this argument would say that it is not a real argument, because the complexity of the semantics, i.e., its aforementioned manifoldness, must be described anyhow. With sub-diagrams that may have no more than one entry point, you would need to introduce a manifold of diagrams, each with a single entry point. We do not discuss here how to transform a given diagram with multiple entries into a manifold of diagrams; all we want to remark here is that this easily becomes complicated because of the necessity to appropriately handle the aforementioned concurrency phenomena that may exist. Eventually it turns out to be a problem of transforming the diagram together with its context, i.e., transforming a set of diagrams and sub-diagrams with possibly multiple entry points into another set of diagrams and sub-diagrams with only unique entry points. Defenders of diagrams with unique entry points would state that it is better to have a manifold of such diagrams instead of a single diagram with multiple entries, because the manifold of diagrams better documents the complexity of the semantics of the modelled scenario.
For a better comparison of the discussed models against the above statements, we have redrawn the diagram from Fig. 4 and diagram (ii) from Fig. 5 in Fig. 6, together with the blocks they are made of and their abstract syntax tree or quasi-abstract syntax tree, respectively. The diagram of Fig. 4 appears on the left in Fig. 6 as diagram Φ, and diagram (ii) from Fig. 5 appears on the right as diagram Ψ. Accordingly, the left abstract syntax tree φ in Fig. 6 corresponds to the diagram from Fig. 4 and the right abstract syntax tree ψ corresponds to diagram (ii) from Fig. 5. Blocks are surrounded by dashed lines in Fig. 6. If you proceed in understanding the model Φ in Fig. 6, you first have to understand a while-loop that encompasses the ‘A’-activity, i.e., the block labelled with number ‘5’ in model Φ. After that, you are not done with that part of the model. Later, after the β-decision point, you are branched back to the ‘A’-activity and you have to understand the loop it belongs to once more, however, this time in a different manner, i.e., as a repeat-until loop, which is the block labelled with number ‘1’ in model Φ. It is possible to argue that, in some sense, this makes the model Φ harder to read than model Ψ. To say it differently, it is possible to view model Ψ as an instruction manual on how to read the model Φ. Actually, model Ψ is a bloated version of model Φ. It contains some modelling elements of model Φ redundantly; however, it enjoys the property that each modelling element has to be understood only in the context of one block and
Frontiers of Structured Business Process Modeling
Fig. 6. Block-structured versus arbitrary business process model
its encompassing blocks. We can restate these arguments a bit more formally by analyzing the abstract syntax trees φ and ψ in Fig. 6. Blocks in Fig. 6 correspond to constructs that can be generated by the formation rules in Fig. 1. The abstract syntax tree ψ is an alternative presentation of the nesting of blocks in model Ψ. A node stands for a block and for the corresponding construct according to the formation rules. The graphical model Φ cannot be derived from the formation rules in Fig. 1. Therefore it does not possess an abstract syntax tree in which each node represents a unique graphical block and a construct at the same time. The tree φ shows the problem. You can match the region labelled ‘1’ in model Φ as a block against the while-loop rule (iv) and you can subsequently match the region labelled ‘2’ against the sequence rule (iii). But then you get stuck. You can form a further do-while loop with rule (iv) out of the β-decision point and block ‘2’ as in model Ψ, but the resulting graphical model can no longer be interpreted as a part of model Φ. This is because the edge from activity ‘B’ to the β-decision point graphically serves both as input branch to the decision point and as back branch to the decision point. This graphical problem is resolved in the abstract syntax tree φ by reusing the activity ‘B’ in the node that corresponds to node ‘5’ in tree ψ, forming a sequence according to rule (ii), with the result that the tree φ is actually no longer a tree. Similarly, the reuse of the modelling elements in forming node ‘6’ in the abstract syntax tree φ visualizes the double interpretation of this graphical region as both a do-while loop and a repeat-until loop.
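The observation that φ stops being a tree once a modelling element is reused can be checked mechanically. The following Python sketch is only an illustration: the two nesting structures and all node names are hypothetical and do not reproduce the trees φ and ψ of Fig. 6. It reports every node that occurs under more than one parent, which is exactly what a proper abstract syntax tree must not contain.

```python
def shared_nodes(children):
    """Return the nodes that occur under more than one parent.

    `children` maps each node to the list of its sub-nodes. A proper
    abstract syntax tree has no shared nodes.
    """
    parents = {}
    for parent, subs in children.items():
        for sub in subs:
            parents.setdefault(sub, []).append(parent)
    return sorted(n for n, ps in parents.items() if len(ps) > 1)


# Hypothetical nesting of a structured model: every block has one parent.
psi_like = {"root": ["loop1", "C"], "loop1": ["seq", "beta"], "seq": ["A", "B"]}
# Hypothetical nesting in which activity 'B' is reused by two constructs.
phi_like = {"root": ["loop1", "seq2"], "loop1": ["A", "B"], "seq2": ["B", "C"]}

print(shared_nodes(psi_like))  # []
print(shared_nodes(phi_like))  # ['B']
```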
5
Structure for Text-Based versus Graphical Specifications
In Sect. 4 we have said that an argument for a structured business process specification is that it is made of strictly nested blocks and that each identifiable block forms a semantic capsule. In that argumentation we have looked at the graphical presentation of the models only; now we will also have a look at the textual representations. This section needs a disclaimer. We are convinced that, in the discussion of the quality of models, it is risky to give arguments in terms of cognitive categories like understandability, readability, cleanness, well-designedness and well-definedness. These categories tend to have an insufficient degree of definedness themselves, so that argumentations based on them easily suffer from a lack of falsifiability. Nevertheless, in this section, for the sake of brevity, we need to speak directly about the reading ease of specifications. The judgements are our very own opinion, an opinion that expresses our perception of certain specifications. The reader may have a different opinion, and this would be interesting in its own right. At least, the expression of our own opinion may encourage the reader to judge the reading ease of certain specifications. As we said in terms of complexity, we think that the model in Fig. 4 is easier to understand than the models in Fig. 5. We think it is easier to grasp. Somewhat paradoxically, we think the opposite about the respective text representations, at least at first sight, i.e., as long as we have not yet internalized too much of the different graphical models behind the listings. This means that we think that the text representation of the model in Fig. 4, i.e., Listing 3, is definitely harder to understand than the text representations of both models in Fig. 5, i.e., Listings 4 and 5. How come? Maybe the following observation helps: we also think that the graphical model in Fig. 4 is easier to read than the model's textual representation in Listing 3 and also easier to read than the two other Listings 4 and 5.
Why is Listing 3 so relatively hard to understand? We think it is because there is no explicitly visible connection between the jumping-off point in line ‘04’ and the jump target in line ‘02’. Actually, the first thing we would recommend in order to understand Listing 3 better is to draw its visualization, i.e., the model in Fig. 4, or to concentrate and to visualize it in our mind. By the way, we think that drawing some arrows in Listing 3, as we did in Fig. 7, also helps. The two arrows already help despite the fact that they make explicit only a part of the jump structure: one possible jump, from line ‘01’ to line ‘03’ in case the α-condition becomes invalid, must still be understood from the indentation of the text. All this is said for such a small model consisting of a total of five lines. Imagine you had to deal with a model consisting of several hundred lines with arbitrary goto-statements all over the text. If it is true that the model in Fig. 4 is easier to understand than the models in Fig. 5 and at the same time Listing 3 is harder to understand than Listings 4 and 5, this may lead us to the assumption that the understandability of graphically presented models follows other rules than the understandability of textual representations. Reasons for this may be, on the
01 WHILE alpha DO
02     A;
03 B;
04 IF beta THEN GOTO 02;
05 C;
Fig. 7. Listing enriched with arrows for making jump structure explicit
one hand, the aforementioned lack of explicit visualization of jumps, and on the other hand, the one-dimensional layout of textual representations. We have not given all of these arguments in this section in order to promote visual modelling. The reason is that we see a chance that they might explain why the structured approach has been so easily adopted in the field of programming. The field of programming was and still is dominated by text-based specifications, despite the fact that we have seen many initiatives, from syntax-directed editors over computer-aided software engineering to model-driven architecture. It is fair to remark that the crucial characteristic of merely textual specification in the discussion of this section, i.e., the lack of explicit visualization of jumps, or, to say it in a more general manner, of support for the understanding of jumps, is actually addressed in professional coding tools like integrated development environments with their maintenance of links, code analyzers and profiling tools. The mere text-orientation of specification has been partly overcome by today’s integrated development environments. Let us express once more that we are no promoters of visual modelling or even visual programming. In [3] we have deemphasized visual modelling. We strictly believe that visualizations add value, in particular if they are combined with visual meta-modelling [10,11]. But we also believe that mere visual specification is no silver bullet, in particular because it does not scale. We believe in the future of a syntax-directed abstract platform with visualization capabilities that overcomes the gap between modelling and programming from the outset, as proposed by the work on AP1 [8,9] of the Software Engineering research group at the University of Auckland.
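The kind of support for understanding jumps that is mentioned above can be sketched for the five-line listing of Fig. 7. The following Python fragment is a minimal illustration and not a tool description: the successor table is our reading of the listing, under the assumption that line 02 is the body of the loop opened in line 01, and the names `SUCCESSORS` and `jumps` are introduced here for illustration. The function lists every control-flow edge that does not simply lead to the next line.

```python
# Control-flow successors of each line in the five-line listing of Fig. 7.
# Assumption: line 02 is the loop body of line 01, as the indentation and
# the remark about the jump from line '01' to line '03' suggest.
SUCCESSORS = {
    1: [2, 3],  # WHILE alpha DO: enter the body, or skip it when alpha fails
    2: [1],     # A; then back to the loop test
    3: [4],     # B;
    4: [2, 5],  # IF beta THEN GOTO 02; otherwise fall through
    5: [],      # C; end of the listing
}


def jumps(successors):
    """Return all edges that do not simply lead to the next line."""
    return [(src, dst)
            for src, dsts in sorted(successors.items())
            for dst in dsts
            if dst != src + 1]


print(jumps(SUCCESSORS))  # [(1, 3), (2, 1), (4, 2)]
```

The result contains the explicit goto from line 04 to line 02 as well as the two jumps that the while-construct hides, including the one from line 01 to line 03 that the plain text conveys only by indentation.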
6
Structure and Decomposition
The models in Fig. 5 are unfolded versions of the model in Fig. 4. Some modelling elements of the diagram in Fig. 4 occur redundantly in each model in Fig. 5. Such unfolding violates the reuse principle. Let us concentrate on the comparison of the model in Fig. 4 with model (i) in Fig. 5. The arguments are similar for diagram (ii) in Fig. 5. The loop made of the α-decision point and the activity ‘A’ occurs twice in model (i). In the model in Fig. 4 this loop is reused by the jump from the β-decision point, albeit via an auxiliary entry point. It is important to understand that reuse is not about the cost savings of avoiding the redrawing of modelling elements but about increasing maintainability.
Imagine that, in the lifecycle of the business process, a change to the loop consisting of the activity ‘A’ and the α-decision point becomes necessary. Such a change could be the change of the condition to another one, the change of the activity ‘A’ to another one, or the refinement of the loop, e.g., the insertion of a further activity into it. Imagine that you encounter the necessity for changes by reviewing the start of the business process. In analyzing the diagram in Fig. 4, you know that the loop structure is not only used at the beginning of the business process but also later by a possible jump from the β-decision point to it. You will now further analyze whether the necessary changes are only appropriate at the beginning of the business process or also later when the loop is reused from other parts of the business process. In the latter case you are done. This is the point where you can get into trouble with the other version of the business process specification, i.e., diagram (i) in Fig. 5. You can more easily overlook that the loop is used twice in the diagram; this is particularly true for similar examples in larger or even distributed models. So, you should have extra documentation for the several occurrences of the loop in the process. Even in the case that the changes are relevant only at the beginning of the process, you would like to review this fact and investigate whether the changes are relevant for other parts of the process. It is fair to remark that, in the case that the changes to the loop in question are only relevant to the beginning of the process, the diagram in Fig. 4 bears the risk that this leads to an invalid model if the analyst overlooks its reuse from later stages in the process, whereas the model (i) in Fig. 5 does not bear that risk. But we think this kind of weird fail-safeness can hardly be sold as an advantage of model (i) in Fig. 5.
Furthermore, it is also fair to remark that the documentation of multiple occurrences of a model part can be replaced by appropriate tool support or methodology, like a pattern search feature or hierarchical decomposition, as we will discuss in due course. All this amounts to saying that the maintainability of a model cannot be reduced to its presentation but depends on a consistent combination of presentational issues, appropriate tool support and defined maintenance policies and guidelines in the framework of a mature change management process. We now turn the reused loop consisting of the activity ‘A’ and the α-decision point in Fig. 4 into a sub-diagram of its own in the sense of hierarchical decomposition, give it a name, let us say ‘DoA’, and replace the relevant regions in diagram (i) in Fig. 5 by the respective expandable sub-diagram activity. The result is shown in Fig. 8. Now, it is possible to state that this solution combines the advantages of both kinds of models in question, i.e., it consists of structured models at all levels of the hierarchy and offers an explicit means of documenting the places of reuse. But a caution is necessary. First, the solution does not free the analyst from actually having a look at all the places where a diagram is used after he or she has made a change to the model, i.e., an elaborated change policy is still needed. In the small toy example, such checking is provoked, but in a tool you usually do not see all sub-diagrams at once, but rather step through the levels of the hierarchy and the sub-diagrams via links. Remember that the usual motivation to introduce hierarchical decomposition and tool support for hierarchical decomposition is the
Fig. 8. Example business process hierarchy
Fig. 9. Example for a deeper business process hierarchy
desire to deal with the complexity of large and very large models. Second, the tool should not only support the reuse direction but should also support the inverse use direction, i.e., it should support the analyst with a report feature that lists all places of reuse for a given sub-diagram. Now let us turn to a comparative analysis of the complexity of the modelling solution in Fig. 8 and the model in Fig. 4. The complexity of the top-level diagram in the model hierarchy in Fig. 8 is no longer significantly higher than that of the model in Fig. 4. However, together with the sub-diagram, the modelling solution in Fig. 8 again shows a certain complexity. It would be
possible to deny completely that the solution in Fig. 8 reduces complexity, with the hint that the disappearance of the edge representing the jump from the β-decision point into the loop in Fig. 4 is bought by another complex construct in Fig. 8, to wit, the dashed line from the activity ‘DoA’ to the targeted sub-diagram. The jump itself can, somehow, still be seen unchanged in Fig. 8 as an edge from the β-decision point to the activity ‘A’. We do not think so. The advantage of the diagram in Fig. 8 is that the semantic capsule made of the loop in question is already made explicit as a named sub-diagram, which means an added documentation value. Also, have a look at Fig. 9. Here the above explanations are even more substantive. The top-level diagram is even less complex than the top-level diagram in Fig. 8, because the activity ‘A’ has now moved to a level of the hierarchy of its own. However, this comes at the price that the jump from the β-decision point to the activity ‘A’ in Fig. 4 now re-appears in Fig. 9 as the concatenation of the ‘yes’-branch in the top-level diagram, the dashed line leading from the activity ‘Ado’ to the corresponding sub-diagram at the next level, and the entry edge of this sub-diagram.
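The report feature for the inverse use direction that is asked for above can be sketched in a few lines of Python. The hierarchy below is hypothetical; it only loosely follows the naming of Fig. 9 (‘DoA’, ‘Ado’) and does not claim to reproduce the figure, and the function name `places_of_reuse` is an assumption introduced for illustration.

```python
# Hypothetical model hierarchy loosely following the naming in Fig. 9:
# each diagram is a list of element names, and a name that is itself a
# key of the dictionary refers to a sub-diagram.
HIERARCHY = {
    "Top": ["DoA", "B", "beta", "Ado", "B", "C"],
    "Ado": ["A", "DoA"],
    "DoA": ["alpha", "A"],
}


def places_of_reuse(hierarchy, sub_diagram):
    """List (diagram, position) pairs at which `sub_diagram` is referenced."""
    return [(name, pos)
            for name, elements in hierarchy.items()
            for pos, element in enumerate(elements)
            if element == sub_diagram]


print(places_of_reuse(HIERARCHY, "DoA"))  # [('Top', 0), ('Ado', 1)]
```

Such a report does not replace a change policy; it merely gives the analyst the list of places that have to be reviewed after a sub-diagram has been changed.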
7
On Business Domain-Oriented versus Documentation-Oriented Modeling
In Sects. 3 through 6 we have discussed structured business process modelling for those processes that actually have a structured process specification in terms of a chosen fixed set of activities. In this Section we will learn about processes that do not have a structured process specification in that sense. In the running example of Sects. 3 through 6 the fixed set of activities was given by the activities of the initial model in Fig. 4 and again we will explain the modelling challenge addressed in this Section as a model transformation problem.
[Figure with two panels. Node labels in panel (i): ‘handle workpiece’, ‘quality insurance’, ‘finish workpiece’, ‘quality must be improved’, ‘reject workpiece due to defects’, ‘amount exceeds threshold’, ‘dispose deficient workpiece’. Node labels in panel (ii): ‘prepare purchase order’, ‘approve purchase order’, ‘revision is necessary’, ‘submit purchase order’.]
Fig. 10. Two example business processes without structured presentation with respect to no other than their own primitives
Fig. 11. Business process with cycle that is exited via two distinguishable paths
Consider the example business process models in Fig. 10. Each model contains a loop with two exits to paths that lead to the end node without the opportunity to come back to the originating loop before reaching the end state. It is known [1,5,6] that the behaviour of such loops cannot be expressed in a structured manner, i.e., by a D-chart as defined in Fig. 1, solely in terms of the same primitive activities as those occurring in the loop. Extra logic is needed to formulate an alternative, structured specification. Fig. 11 shows this loop pattern abstractly, and we proceed to discuss these issues with respect to this abstract model. Assume that there is a need to model the behaviour of a business process in terms of a certain fixed set of activities, i.e., the activities ‘A’ through ‘D’ in Fig. 11. For example, assume that they are taken from an accepted terminology of a concrete business domain. Other reasons could be that the activities stem from existing contracts or service level agreement documents. You can also assume that they are simply the natural choice as primitives for the considered work to be done. We do not delve here into the issue of natural choice and just take for granted that the task is to model the observed or desired behaviour in terms of these activities. For example, we could imagine an appropriate notion of cohesion of more basic activities to which the primitives that we are restricted to, or let us say self-restricted to, adhere. Actually, as it will turn out, the conclusiveness of our current argumentation does not need an explanation of how a concrete fixed set of activities arises. What it does need is the demand on the activities that they are only about actions and objects that are relevant in the business process. Fig. 12 shows a structured business process model that is intended to describe the same process as the specification in Fig. 11. In a certain sense it fails.
The extra logic introduced in order to get the specification into a structured shape does not belong to the business process that the specification aims to describe. The model in Fig. 12 introduces some extra state, i.e., the Boolean variable δ, extra activities to set this variable so that it has the desired steering effect, and an extra δ-decision point. Furthermore, the original β-decision point in the model of Fig. 11 has been changed to a new β∧δ-decision point. Actually, the restriction of the business process described by Fig. 12 onto those particles used in the model in Fig. 11 is bisimilar to the process described by Fig. 11. The problem is that the model in Fig. 12 is a hybrid. It is no longer only a business domain-oriented model; it now also has some merely documentation-related parts. The extra logic and state only serve
Fig. 12. Resolution of business process cycles with multiple distinguishable exits by the usage of auxiliary logic and state
the purpose of getting the diagram into shape. The semantics needs clarification. Obviously, it is not intended to change the business process. If the auxiliary state and logic were also about the business process, this would mean, for example, that a mechanism is introduced in the workshop, for example a machine or a human actor, that is henceforth responsible for tracking and monitoring a piece of information δ. So, what we need at least is to explicitly distinguish those elements in such a hybrid model. The question is whether the extra complexity of a hybrid domain- and documentation-oriented modelling approach is justified by the result of having a structured specification.
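The effect of the auxiliary state can be illustrated in code. The following Python sketch is one possible construction under stated assumptions, not a transcription of Fig. 11 or Fig. 12: we assume a cycle in which activity ‘A’ is followed by an α-decision whose ‘no’ branch exits towards ‘C’, and activity ‘B’ is followed by a β-decision whose ‘no’ branch exits towards ‘D’. The helper `outcomes`, which scripts the decision outcomes, and both function names are hypothetical.

```python
from itertools import chain, product, repeat


def outcomes(values):
    """Scripted decision outcomes, padded with False so every run ends."""
    return chain(values, repeat(False))


def two_exit_loop(alpha, beta):
    """Cycle with two distinguishable exits, written with jumps."""
    alpha, beta, trace = outcomes(alpha), outcomes(beta), []
    while True:
        trace.append("A")
        if not next(alpha):        # first exit, towards C
            return trace + ["C"]
        trace.append("B")
        if not next(beta):         # second exit, towards D
            return trace + ["D"]


def structured_with_flag(alpha, beta):
    """Same traces with single-exit constructs and auxiliary state."""
    alpha, beta, trace = outcomes(alpha), outcomes(beta), []
    delta = True                   # auxiliary Boolean, not a business fact
    trace.append("A")
    if next(alpha):
        trace.append("B")
    else:
        delta = False
    while delta and next(beta):    # combined decision on delta and beta
        trace.append("A")
        if next(alpha):
            trace.append("B")
        else:
            delta = False
    trace.append("D" if delta else "C")  # extra decision on delta
    return trace


print(two_exit_loop([True, False], [True]))         # ['A', 'B', 'A', 'C']
print(structured_with_flag([True, False], [True]))  # ['A', 'B', 'A', 'C']
```

Both functions produce the same sequence of business activities, but the second one carries the variable `delta` and two decisions on it that have no counterpart in the business domain; it also repeats the activities ‘A’ and ‘B’ in its text, which is the kind of unfolding discussed in Sect. 6.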
8
Conclusion
At first sight, structured programs and flowcharts appear neat, and programs and flowcharts with arbitrary jumps appear obfuscated, muddle-headed, spaghetti-like etc. But the question is not to identify a subset of diagrams and programs that look particularly fine. The question is, given a behaviour that needs description, whether it always makes sense to replace a description of this behaviour by a new structured description. What efforts are needed to search for a good alternative description? Is the resulting alternative structured description as nice as the original non-structured description? Furthermore, we need to gain more systematic insight into which metrics we want to use to judge the quality of a description of a behaviour, because categories like neatness or prettiness are not satisfactory for this purpose if we take seriously that our domain of software development should be oriented towards engineering [14,2] rather than towards the arts and that our domain of business management should be oriented towards science [4], though, admittedly, both fields are currently still in the stage of pre-paradigmatic research [7]. All these issues form the topic of investigation of this article. For us, the definitely working theory of quality of business process models would be strictly pecuniary, i.e., it would enable us to define a style guide for
business process modelling that eventually saves costs in system analysis and software engineering projects. The greater the cost savings realized by the application of such a style guide, the better the theory. Because our ideal is pecuniary, we deal merely with functionality. There is no cover, no aesthetics, no mystics. This means there is no form in the sense of Louis H. Sullivan [16], just function.
References
1. Böhm, C., Jacopini, G.: Flow Diagrams, Turing Machines and Languages With Only Two Formation Rules. Communications of the ACM 3(5) (1966)
2. Buxton, J.N., Randell, B.: Software Engineering – Report on a Conference Sponsored by the NATO Science Committee, Rome, October 1969. NATO Science Committee (April 1970)
3. Draheim, D., Weber, G.: Form-Oriented Analysis – A New Methodology to Model Form-Based Applications. Springer, Heidelberg (2004)
4. Gulick, L.: Management is a Science. Academy of Management Journal 1, 7–13 (1965)
5. Knuth, D.E., Floyd, R.W.: Notes on Avoiding ‘Go To’ Statements. Information Processing Letters 1(1), 23–31, 177 (1971)
6. Rao Kosaraju, S.: Analysis of Structured Programs. In: Proceedings of the 5th Annual ACM Symposium on Theory of Computing, pp. 240–252 (1973)
7. Kuhn, T.S.: The Structure of Scientific Revolutions. University of Chicago Press (December 1996)
8. Lutteroth, C.: AP1 – A Platform for Model-based Software Engineering. In: Draheim, D., Weber, G. (eds.) TEAA 2006. LNCS, vol. 4473, pp. 270–284. Springer, Heidelberg (2007)
9. Lutteroth, C.: AP1 – A Platform for Model-based Software Engineering. PhD thesis, University of Auckland (March 2008)
10. Himsl, M., Jabornig, D., Leithner, W., Draheim, D., Regner, P., Wiesinger, T., Küng, J.: A Concept of an Adaptive and Iterative Meta- and Instance Modeling Process. In: Proceedings of DEXA 2007 - 18th International Conference on Database and Expert Systems Applications. Springer, Heidelberg (2007)
11. Himsl, M., Jabornig, D., Leithner, W., Draheim, D., Regner, P., Wiesinger, T., Küng, J.: Intuitive Visualization-Oriented Metamodeling. In: Proceedings of DEXA 2009 - 20th International Conference on Database and Expert Systems Applications. Springer, Heidelberg (2009)
12. Milner, R.: A Calculus of Communication Systems. LNCS, vol. 92. Springer, Heidelberg (1980)
13. Milner, R.: Communication and Concurrency. Prentice-Hall, Englewood Cliffs (1989)
14. Naur, P., Randell, B. (eds.): Software Engineering – Report on a Conference Sponsored by the NATO Science Committee, Garmisch, October 1968. NATO Science Committee (January 1969)
15. Park, D.: Concurrency and Automata on Infinite Sequences. In: Deussen, P. (ed.) GI-TCS 1981. LNCS, vol. 104, pp. 167–183. Springer, Heidelberg (1981)
16. Sullivan, L.H.: The Tall Office Building Artistically Considered. Lippincott’s Magazine 57, 403–409 (1896)
Information Systems for Federated Biobanks
Johann Eder¹, Claus Dabringer¹, Michaela Schicho¹, and Konrad Stark²
¹ Alps Adria University Klagenfurt, Department of Informatics Systems
{Johann.Eder,Claus.Dabringer,Michaela.Schicho}@uni-klu.ac.at
² University of Vienna, Department of Knowledge and Business Engineering
[email protected]
Abstract. Biobanks store and manage collections of biological material (tissue, blood, cell cultures, etc.) and manage the medical and biological data associated with this material. Biobanks are invaluable resources for medical research. The diversity, heterogeneity and volatility of the domain make information systems for biobanks a challenging application domain. Information systems for biobanks are foremost integration projects of heterogeneous, fast-evolving sources. The European project BBMRI (Biobanking and Biomolecular Resources Research Infrastructure) has the mission to network European biobanks, to improve resources for biomedical research, and thus to contribute to improving the prevention, diagnosis and treatment of diseases. We present the challenges for interconnecting European biobanks and harmonizing their data. We discuss some solutions for searching for biological resources, for managing provenance and for guaranteeing the anonymity of donors. Furthermore, we show how to support the exploitation of such a resource in medical studies with specialized CSCW tools.
Keywords: biobanks, data quality and provenance, anonymity, heterogeneity, federation, CSCW.
1
Introduction
Biobanks are collections of biological material (tissue, blood, cell cultures, etc.) together with data describing this material and its donors and data derived from this material. Biobanks are of eminent importance for medical research, for discovering the processes in living cells, the causes and effects of diseases, the interaction between genetic inheritance and lifestyle factors, or the development of therapies and drugs. Information systems are an integral part of any biobank, and efficient and effective IT support is mandatory for the viability of biobanks. For example, a medical researcher wants to find out why a certain liver cancer generates a great number of metastases in some patients and not in others. This knowledge would help to improve the prognosis, the therapy, the selection
The work reported here was partially supported by the European Commission 7th Framework program - project BBMRI and by the Austrian Ministry of Science and Research within the program Gen-Au - project GATIB.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 156–190, 2009. © Springer-Verlag Berlin Heidelberg 2009
of therapies and drugs for a particular patient, and help to develop better drugs. For such a study the researcher needs, besides biological material (cancer tissue), an enormous amount of data: clinical records of the patients donating the tissue, lab analyses, microscopic images of the diseased cells, information about the lifestyle of patients, genotype information (e.g. genetic variations), phenotype information (e.g. gene expression profiles), etc. Gathering all these data in the course of a single study would be highly inefficient and costly. A biobank is supposed to deliver the data needed for this type of research and to share the data and material among researchers. From the example above it is clear that information systems for biobanks are huge applications. The challenge is to integrate data stemming from very different autonomous sources. So biobanks are foremost integration and interoperability projects. Another important issue is the dynamics of the field: new insight leads to more differentiated diagnoses, and new analysis methods allow the assessment of additional measurements or improve the accuracy of measurements. So an information system for biobanks will be continuously evolving. And last but not least, biobanks store very detailed personal information about donors. Protecting the privacy and anonymity of the donors is mandatory, and misuse of the stored information has to be precluded. In recent years biobanks have been set up in various organizations, mainly hospitals and medical and pharmaceutical research centers. Since material and data are a scarce resource for medical research, the sharing of the available material within the research community has increased. This leads to the desire to organize the interoperation of biobanks in a better way.
The European project BBMRI (Biobanking and Biomolecular Resources Research Infrastructure) has the mission to network European biobanks, to improve resources for biomedical research, and thus to contribute to improving the prevention, diagnosis and treatment of diseases. BBMRI is organized in the framework of the European Strategy Forum on Research Infrastructures (ESFRI). In this paper we give a broad overview of the requirements for IT systems for biobanks, present the architecture of information systems supporting biobanks, discuss possible integration strategies for connecting European biobanks and discuss the challenges of this integration. Furthermore, we show how such an infrastructure can be used and present a support system for medical research using data from biobanks. The purpose of this paper is to paint the whole picture of the challenges of data management for federated biobanks rather than to present detailed technical solutions. This paper is an extended version of [20].
2 What Are Biobanks?
A biobank, also known as a biorepository, can be seen as an interdisciplinary research platform that collects, stores, processes and distributes biological materials and the data associated with those materials. In short: biobank = biological material + data. Typically, those biological materials are human biospecimens such as tissue, blood or body fluids - and
J. Eder et al.
the data are the donor-related clinical information of that biological material. Human biological samples in combination with donor-related clinical data are essential resources for the identification and validation of biomarkers and the development of new therapeutic approaches (drug discovery), especially in the development of systems biology approaches to study disease mechanisms. Furthermore, they are used to explore and understand the function and medical relevance of human genes, their interaction with environmental factors and the molecular causes of diseases [16]. Besides human-driven biobanks, a biobank can also include samples from animals, cell and bacterial cultures, or even environmental samples. In recent years biobanks have become a major issue in the fields of genomics and biotechnology. Biobanks can differ in many ways, depending on the type of stored samples and the medical-scientific domain.
2.1 The Variety of Biobanks and Medical Studies
The development of biobanks has resulted in a very heterogeneous concept. Each biobank pursues its own strategy and has specific demands on the quality and annotation of the collected samples. According to [39] we distinguish between three major biobank types, considering exclusively human-driven biobanks:
1. Population based biobanks. Population based cohorts are valuable for assessing the natural occurrence and progression of common diseases. They contain a huge number of biological samples from healthy or diseased donors, representative of a concrete region or ethnic cohort (population isolated) or of the general population over a large period of time (longitudinal population). Examples of large population based biobanks are the Icelandic DeCode Biobank and UK Biobank.
2. Disease-oriented biobanks. Their mission is to obtain biomarkers of disease through prospective and/or retrospective collections of e.g. tumour and non-tumour samples with derivatives such as DNA/RNA/proteins. The collected samples are associated with clinical data and clinical trials [39]. This group of biobanks typically consists of pathology archives like the MUG Biobank Graz.
A special kind of biobank are twin registries such as the GenomeEUtwin biobank, which contain approximately equal numbers of monozygotic (MZ) and dizygotic (DZ) twins. With such biobanks the parallel dissection of effects of genetic variation in a homogeneous environment (DZ twins) and of environmental effects against an identical genetic background (MZ twins) is possible [52]. These registries are also partially suited to distinguish between the genetic and non-genetic basis of diseases.
In [18] Cambon-Thomsen shows that biobanks can vary in size, access mode, status of institution or scientific sector in which the samples were collected:
1. Medical and academic research. In medical genetic studies of disease usually small case- or family-based repositories are involved.
Population-based collections, which are also usually small, have been used for academic research
Information Systems for Federated Biobanks
for a long period of time. Some large epidemiological studies have also involved the collection of a large number of samples.
2. Clinical case/control studies. The primary use of samples collected in hospitals is for informing diagnosis, for clinical or therapeutic follow-up, and for the discovery or validation of genetic and non-genetic risk factors. Large numbers of tissue sections have been collected by pathology departments over the years. Transplantations using cells, tissues or even organs from unrelated donors have also led to the development of tissue and cell banks.
3. Biotechnology domain. Within this domain, collections of reference cell lines (e.g. cancer cell lines or antibody-producing cell lines) and stem cell lines of various origins are obtained. They are mainly used in biotechnology research and development.
4. Judiciary domain. Biobanks host large collections of different sources of biological material, data and DNA fingerprints, which have very restricted usage.
2.2 Collected Material and Stored Data
Biobanks are not something new in the world of medicine and biological research. The systematic collection of human samples goes back to the 19th century, including formaldehyde-fixed, paraffin-embedded or frozen material [27]. Most biobanks are developed in order to support a research program on a specific type of disease or to collect samples from a particular group of donors. Due to the large resource requirements, biobanks within an institution usually merge to reduce costs. As a result of this merging, biobanks typically hold many kinds of samples and many different types (also called domains) of data.
Material / Sample Types. Samples can include any kind of tissue, fluid or other material that can be obtained from an individual. Usually, biospecimens in a biobank are blood and blood components (serum), solid tissues such as small biopsies and so on. An important collection in biobanks are the so-called normal samples. These are tissue samples that are free of diagnosed disease. For instance, in some areas of medical research (e.g. case/control studies) it is important that corresponding normal samples exist for the diseased samples, so that they can be used as controls in order to get to the bottom of specific diseases or gene mutations. The biological samples can be collected in various ways. Samples may be taken in the course of diagnostic investigations as well as during the treatment of diseases. For instance, in biopsies small human tissue specimens are obtained in order to determine the type of a cancer. Surgical resections of tumours provide larger tissue samples, which may be used to specify the type of disease treatment. Autopsies are another valuable source of human tissues, where specimens may be taken from various locations which reflect the effects of a disease in different organs of a patient. The obtained biological materials are specially preserved to keep them durable over a long time.
Sample Data. The stored data from a donor, which accompany the collected sample, can be very extensive and varied. According to [1] this data includes:
– General information (e.g. race, gender, age, ...)
– Lifestyle and environmental information (e.g. smoker or non-smoker, living in a big city with high environmental pollution or living in rural areas)
– History of present illnesses, treatments and responses (e.g. prescribed drugs and adverse reactions)
– Longitudinal information (e.g. a sequence of blood tests after tissue collection in order to track the progress of diseases)
– Clinical outcomes (e.g. success of the treatment: Is the donor still living?)
– Data from gene expression profiles, laboratory data, ...
Technically, the types of data range from typical record keeping, through text and various forms of images, to gene vectors and 3-D chemical structures.
Ethical and Legal Issues. Donors of biological materials must be informed about the purpose and intended use of their samples. Typically, the donor signs an informed consent [10], which allows the use of samples for research and obliges the biobank institution to guarantee the privacy of the donor. The usage of material in studies usually requires approval by ethics boards. Since the identity of donors must be protected, the relationship between a donor and their sample must not be revealed. Technical solutions for guaranteeing privacy are discussed in section 6.
2.3 Samples as a Scarce Resource
Biobanks prove to be a key resource for increasing the effectiveness of medical research. Human biological samples are not only costly but also available only in limited amounts. For example, once a piece of liver tissue is cut off and used for a study, this piece of tissue is expended. Therefore, it is important to avoid redundant analysis and achieve the most efficient and effective use of non-renewable biological material [17]. A common and synergetic usage of this resource will enable many research projects, especially in the case of rare diseases with very limited material available. In silico experiments [46] play an important role in the context of biobanks. The aim is to answer as many research questions as possible without access to the samples themselves. Therefore, already acquired data of samples are stored in databases and shared among interested researchers. Modern biobanks thus offer the possibility to decrease long-term costs of research and development and to make data acquisition and usage more effective.
3 Biobanks as Integration Project
In section 2 we already mentioned that biobanks may contain various types of collections of biological materials. Depending on the type of biobank, its organizational and research environment, human tissue, blood, serum, isolated
RNA/DNA, cell lines or others can be archived. Apart from the organizational challenges, an elaborate information system is required for capturing all relevant information about samples, managing borrow and return activities and supporting complex search enquiries.
3.1 Sample Management in Biobanks
An organizational entity that is commissioned to establish and operate a biobank requires suitable storage facilities (e.g. cryo-tanks, paraffin block storage systems) as well as security measures for protecting samples from damage and for preventing unauthorized access. If a biobank is built on the basis of existing resources (material and data), a detailed evaluation is essential. The collection process, the inventory and the documentation of samples have to be assessed, evaluated and optimized. The increasing number of biobanks all over the world has drawn the attention of international organizations, encouraging the standardization of processes and of sample and data management in biobanks.
Standardization of Processes. Managing a biobank is a dynamic process. The biological collections may grow continuously, additional collections may be integrated and samples may be used in medical studies and research projects. Standard operating procedures are required for the most relevant processes, defining competencies, roles, control mechanisms and documentation protocols. For example, the comparison of two or more gene expression profiles computed by different institutes is only meaningful if all gene expressions were determined by the same standardized process. The Organization for Economic Cooperation and Development (OECD) released the definition of Biological Resource Centers (BRC), which "must meet the high standards of quality and expertise demanded by the international community of scientists and industry for the delivery of biological information and materials" [8]. BRCs are certified institutions providing high quality biological material and information. The model of BRCs may assist the consolidation process of biobanks by defining quality management and quality assurance measures [38]. Guidelines for implementing standards for BRCs may be found in recent works such as [12,35].
Data and Data Sources. Clinical records and diagnoses are frequently available as semi-structured data or even stored as plain text in legacy medical information systems. The data available from diverse biobanks include not only numeric and alphabetic information but also complex images such as microphotographs of pathology sections, pictures generated by medical imaging procedures and graphical representations of the results of analytical diagnostic procedures [11]. The volumes of data involved are very large. In some places, data or findings are archived only in printed form or stored in heterogeneous formats.
3.2 Different Kinds of Heterogeneity
Since biobanks may involve many different datasources it is obvious that heterogeneity is ever-present. Biobanks may comprise interfaces to sample management
systems, laboratory information systems, research information systems, etc. The origin of the heterogeneity lies in different data sources (clinical systems, laboratory systems, etc.), hospitals, research institutes and also in the evolution of the involved disciplines. Heterogeneity appearing in biobanks comes in various forms and can be divided into two different classes. The first class of heterogeneity can be found between different datasources. This kind of heterogeneity is mostly caused by the independent development of the different datasources. Here we have to deal with several different types of mismatches which all lead to heterogeneity between the systems, as shown in [43,32]. Typical mismatches can be found in:
– Attribute namings (e.g. disease vs DiseaseCode)
– Different attribute encodings (e.g. weight in kg vs lbm)
– Content of attributes (e.g. homonyms, synonyms, ...)
– Precision of attributes (e.g. sample size: small, medium, large vs mm³, cm³)
– Different attribute granularity
– Different modeling of schemata
– Multilingualism
– Quality of the data stored
– Semi-structured data (incompleteness, plain-text, ...)
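The first two mismatch types in the list above (attribute naming and attribute encoding) can be illustrated with a minimal sketch. It assumes two hypothetical sources, `clinic_a` and `lab_b`, and a hypothetical global schema with the attributes `disease_code` and `weight_kg`; none of these names come from an actual biobank system, and the codes used are placeholders.

```python
# Exact definition of the avoirdupois pound in kilograms.
LB_TO_KG = 0.45359237

# Hypothetical per-source rules: local attribute -> (global attribute, converter).
SOURCE_RULES = {
    "clinic_a": {
        "disease": ("disease_code", str.upper),
        "weight": ("weight_kg", float),
    },
    "lab_b": {
        "DiseaseCode": ("disease_code", str.upper),
        "weight_lb": ("weight_kg", lambda v: float(v) * LB_TO_KG),
    },
}


def harmonize(source, record):
    """Map a local record to the assumed global schema.

    Attributes without a rule are dropped in this sketch; a real
    integration layer would more likely log or quarantine them.
    """
    rules = SOURCE_RULES[source]
    out = {}
    for name, value in record.items():
        if name in rules:
            target, convert = rules[name]
            out[target] = convert(value)
    return out


a = harmonize("clinic_a", {"disease": "d01", "weight": "70"})
b = harmonize("lab_b", {"DiseaseCode": "D01", "weight_lb": 100})
print(a)
print(b)
```

Both records end up with the same attribute names and the same unit, so they can be compared directly. The remaining mismatch types (semantics, granularity, multilingualism, data quality) generally need more than such a lookup table.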
The second class of heterogeneity is the heterogeneity within one datasource. This kind of heterogeneity may not be recognized at first glance. But as [39] shows, the scientific value of biobank content increases with the amount of clinical data linked to certain samples (see figure 1). The longer data are kept in biobanks, the greater their scientific value. On the other hand, keeping data in biobanks for a long time leads to heterogeneity, because medical progress leads to changes in database structures and in the modeled domain. Modern biobanks support time series analysis and the use of material harvested over a long period of time. A particular difficulty for these uses of material and data is the correct
Fig. 1. The correlation of scientific value of biobank content and availability of certain content is shown. One can clearly see that the scientific value increases where the availability of data decreases [39].
representation and treatment of changes. Some exemplary changes that arise in this context are:
– Changes in disease codes (e.g. ICD-9 to ICD-10 [6] in the year 2000)
– Progress in biomolecular methods results in higher accuracy of measurements
– Extension of relevant knowledge (e.g. GeneOntology [4] is changed daily)
– Treatments and standard procedures change
– Quality of sample conservation increases, etc.
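A change of coding scheme, such as the first item above, could be represented by explicit mappings between versions, so that stored codes can be brought to one selected version before records are compared. The following is only a sketch under stated assumptions: the version labels `v1` to `v3` and all codes are placeholders, not real ICD entries, and real code-system transitions are rarely one-to-one.

```python
# Hypothetical mapping tables between consecutive versions of a coding scheme.
# All codes are placeholders and do not correspond to real ICD entries.
VERSION_ORDER = ["v1", "v2", "v3"]
VERSION_MAPS = {
    ("v1", "v2"): {"A1": "B10", "A2": "B20"},
    ("v2", "v3"): {"B10": "C100", "B20": "C200"},
}


def to_version(code, source, target):
    """Translate a code forward from version `source` to version `target`.

    Returns None if some step has no mapping for the code, which signals
    that the record cannot be compared automatically.
    """
    i, j = VERSION_ORDER.index(source), VERSION_ORDER.index(target)
    if i > j:
        raise ValueError("only forward mappings are sketched here")
    # Chain the mappings of all consecutive version pairs between i and j.
    for a, b in zip(VERSION_ORDER[i:j], VERSION_ORDER[i + 1:j + 1]):
        code = VERSION_MAPS[(a, b)].get(code)
        if code is None:
            return None
    return code


print(to_version("A1", "v1", "v3"))
print(to_version("ZZ", "v1", "v2"))
```

Returning `None` rather than guessing makes unmapped codes visible, which matters when deciding whether a set of data may be used together in one study.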
Furthermore, the technical aspects within one biobank are also volatile: data structures, semantics of data, parameters collected, etc. When starting a biobank project one must be aware of the above-mentioned changes. Biobanks should be augmented to represent these changes. This representation can then be used to reason about the appropriateness of using a certain set of data or material together in specific studies.
3.3 Evolution
Wherever possible, biobanks should provide transformations to map data between different versions. Using ontologies to annotate the content of biobanks can be quite useful. By providing mapping support between different ontologies, the longevity problem can be addressed. Furthermore, versioning and transformation approaches can help to support the evolution of biobanks. Techniques from temporal databases and temporal data warehouses can be used for the representation of volatile data together with version mappings to transform all data to a selected version [19,26,51,21,22]. This knowledge can be directly applied to biobanks as well.
3.4 Provenance and Data Quality
From the perspective of medical research, the value of biological material is tightly coupled with the amount, type and quality of associated data. However, medical studies usually require additional data that cannot be directly provided by biobanks or is not available at all. The process of collecting relevant data for biospecimens is denoted as sample annotation and is usually done in the context of prospective studies based on specified patient cohorts. For instance, if the family anamnesis is to be collected for a cohort of liver cancer patients, various preprocessing and filtering steps are necessary. In some cases, different medical institutes or even hospitals have to be contacted. Patients have to be identified correctly in external information systems, and anamnesis data is extracted and collected in predefined data structures or simple spreadsheets. The collected data is combined with the data supplied by the biobank and constitutes the basis for hypotheses, analyses and experiments. Thus, additional data is created in the context of studies: predispositions for diseases, gene expression profiles, survival analyses, publications etc. The collected and created data represents an added value for biospecimens, since it may be used in related or retrospective studies. Therefore, if a biobank is committed to support collaborative research
activities, an appropriate research platform is required. Generally, the aspects of contextualization and provenance have to be considered by such a system.
Contextualization is the ability to link and annotate an object (sample) in different contexts. A biospecimen may be used in various studies and projects and may be assigned different annotations in each of them. Further, these annotations have to be processable to allow efficient retrieval. Contextualization also allows contexts to be organized and interrelated. That is, study data is accessible only to selected groups and persons having predefined access rights. Related studies and projects may be linked to each other, so that collaboration and data sharing are supported. The MUG biobank uses a service-oriented CSCW system as an integrated research platform. More details about the system are given in section 5.1.
Data provenance is used to document the origin of data and tracks all transformation steps that are necessary to re-access or reproduce the data. It may be defined as the background knowledge that enables a piece of data to be interpreted and used correctly within context [37]. Alternatively, data provenance may be seen from a process-oriented perspective. Data may be transformed by a sequence of processing steps which could be small operations (SQL joins, aggregations), the result of tools (analysis services) or the product of a human annotation. Thus, these transformations form a "construction plan" of a data object, which could be reasonably used in various contexts. Traceable transformation processes are useful for documentation purposes, for instance, for the materials and methods section of publications. Generally, the data quality may be improved due to the transparency of data access and transformation. Moreover, relevant processes may be marked as standard or learning processes. That is, new project participants may be introduced to research methodology or processes by using dedicated processes.
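The process-oriented view of provenance described above can be sketched as a small log that records, for every derived data object, the operation, its parameters and its input objects. This is an illustrative sketch only: the class and method names (`ProvenanceLog`, `record`, `construction_plan`, `sources`) and the object names in the usage part are hypothetical, and the sketch assumes that the recorded dependencies are acyclic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded transformation that produced `output` from `inputs`."""
    output: str
    operation: str
    inputs: tuple
    params: tuple = ()


class ProvenanceLog:
    def __init__(self):
        self._by_output = {}

    def record(self, output, operation, inputs, **params):
        step = Step(output, operation, tuple(inputs), tuple(sorted(params.items())))
        self._by_output[output] = step

    def construction_plan(self, obj):
        """Return the steps needed to rebuild `obj`, in execution order."""
        plan, seen = [], set()

        def visit(name):
            step = self._by_output.get(name)
            if step is None or name in seen:
                return  # an original input, or already scheduled
            seen.add(name)
            for parent in step.inputs:
                visit(parent)
            plan.append(step)  # appended after all of its inputs

        visit(obj)
        return plan

    def sources(self, obj):
        """Return the objects behind `obj` that no recorded step produced."""
        result, visited, stack = set(), set(), [obj]
        while stack:
            name = stack.pop()
            if name in visited:
                continue
            visited.add(name)
            step = self._by_output.get(name)
            if step is None:
                result.add(name)
            else:
                stack.extend(step.inputs)
        return result


log = ProvenanceLog()
log.record("joined", "sql_join", ["samples", "diagnoses"], on="patient_id")
log.record("survival", "analysis_service", ["joined", "follow_up"])
print([s.operation for s in log.construction_plan("survival")])
print(sorted(log.sources("survival")))
```

Because each step keeps its parameters, the plan returned for an object contains the information needed to document how the object was derived, for example in a materials and methods section.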
Another important aspect is the repeatability of processes. If a data object is the result of an execution sequence of services, and all input parameters are known, the entire transformation process may be repeated. Further, processes may be restarted with slightly modified parameters, or intermediate results may be preserved to optimize re-executions [14]. Stevens et al. [46] point out an additional type of provenance: organizational provenance. Organizational provenance comprises information about who has transformed which data in which context. This kind of provenance is closely related to collaborative work and is applicable to medical cooperation projects. As data provenance has attracted more and more attention, it has been integrated into well-established scientific workflow systems [44]. For instance, provenance recording was integrated into the Kepler [14], Chimera [24] and Taverna [34] workflow systems. Therefore, scientific research platforms for biobanks could learn from the progress of data provenance research and incorporate suitable provenance recording mechanisms. In the context of biobanks, provenance recording can be applied to sample or medical-case data. If a data object is integrated from external sources (for instance, the follow-up data of a patient), its associated provenance record may include an identification of its source as well as its access method. Additionally, the process of data generation during research activities may be documented.
That is, if a data object is the result of a certain analysis, or is based on the processing of several other data objects, the transformation is captured in corresponding provenance records. If all data transformations are collected by recording the input and output data in relations, data dependency graphs may be built. From these graphs, data derivation graphs may be easily computed to answer provenance queries like: which input data was used to produce result X? Data provenance is closely related to the above-mentioned contextualization of objects. While contextualization makes it possible to combine and structure objects from different sources, data provenance provides the inverse operation: it allows the origins of the data objects used to be traced. Thus, provenance has an important role regarding data quality assurance, as it documents the process of data integration.
3.5 Architecture - Local Integration
Sample Management Systems. The necessity of elaborate sample management systems for biobanks was recognized in several biobank initiatives. However, depending on the type of biobank and its research focus, different systems have been implemented. For instance, the UK biobank adapted a laboratory information management system (LIMS) supporting high-throughput data capturing and automated quality control of blood and urine samples [23]. Since the UK biobank is based on prospective collections of biological samples from more than 500,000 participants, the focus is clearly on optimized data capturing of samples and automation techniques such as barcode reading. The UK Biorepository Information system [31] aims to support multicenter studies in the context of lung cancer research. A manageable amount of samples is captured and linked to various types of data (lifestyle, anamnesis data). As commercial systems lack flexibility and customization capabilities, a proprietary information system was designed and implemented. Another interesting system was presented in [15], supporting blood sample management in the context of cancer research studies. The system clearly separates donor-related information (informed consents, contact information) from storage information of blood specimens and data extracted from epidemiologic questionnaires. In the context of the European Human Frozen Tissue Bank (TuBaFrost), a central tissue database for storing patient-related, tissue-related and image data was established [30]. Since biobanks typically provide material for medical research projects and studies, they are confronted with very detailed requirements from the medical domain. For instance, the following enquiry was sent to the MUG Biobank: A researcher requires about 10 samples with the following characteristics:
– male patients
– paraffin tissue
– with diagnosis liver cancer
– including follow-up data (e.g. therapy description) from the oncology
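To make the structure of such an enquiry concrete, the criteria listed above can be written as a filter over sample records. This is a minimal sketch with a hypothetical, flattened `SampleRecord`; the attribute names and values are assumptions for illustration and do not reflect the actual schema of the MUG biobank.

```python
from dataclasses import dataclass


@dataclass
class SampleRecord:
    # Hypothetical flattened view joining sample, case and patient data.
    sample_id: str
    sample_type: str      # e.g. "paraffin", "cryo"
    diagnosis: str        # e.g. "liver cancer"
    patient_sex: str      # e.g. "male", "female"
    has_follow_up: bool   # follow-up data available?
    available: bool = True  # material still in stock?


def find_samples(records, *, sample_type, diagnosis, patient_sex,
                 need_follow_up, limit):
    """Return up to `limit` available samples matching all criteria."""
    matches = [
        r for r in records
        if r.available
        and r.sample_type == sample_type
        and r.diagnosis == diagnosis
        and r.patient_sex == patient_sex
        and (r.has_follow_up or not need_follow_up)
    ]
    return matches[:limit]


records = [
    SampleRecord("s1", "paraffin", "liver cancer", "male", True),
    SampleRecord("s2", "cryo", "liver cancer", "male", True),
    SampleRecord("s3", "paraffin", "liver cancer", "female", True),
    SampleRecord("s4", "paraffin", "liver cancer", "male", False),
    SampleRecord("s5", "paraffin", "liver cancer", "male", True, available=False),
]
hits = find_samples(records, sample_type="paraffin", diagnosis="liver cancer",
                    patient_sex="male", need_follow_up=True, limit=10)
print([r.sample_id for r in hits])
```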
This example illustrates the diversity of search criteria that may be combined in a single enquiry. Criteria are defined on the type of sample (= paraffin tissue), the availability of material (= quantity available), on the medical case
diagnosis (= liver cancer) and on the patient, as patient sex (= male) and follow-up data (= therapy description) are required. A catalogue of example enquiries should be included in the requirements analysis of the sample management system, as the system specification and design have to be strongly tailored to medical domain requirements. Another challenge lies in an appropriate representation of courses of disease. Patients suffering from cancer may be treated over several years, and various tissue samples may be extracted and diagnosed. If a medical study on cancer is based on tissues of primary tumors and corresponding metastases, it is important that the temporal dependency between the diagnosis of primary tumors and metastases is captured correctly. Further, the causal dependency between the primary tumor and the metastasis needs to be represented. Otherwise, a query would also return tissues with metastases of secondary tumors. However, the design of a sample management system is strongly determined by the type, the quality and the structure of available data. As already mentioned, clinical records and diagnoses are frequently available as semi-structured data or even stored as plain text in legacy medical information systems.
SampleDB MUG. In the following we give a brief overview of the sample management system (SampleDB) of the MUG biobank. We present this system as an exemplary solution for a biobank integration project. For the sake of simplicity we only present the core elements of the UML database schema in Figure 2. Generally, we may distinguish between three main perspectives: the sample-information perspective, the medical case perspective and the sample-management perspective. The sample-information perspective comprises information immediately related to the stored samples. All samples have a unique identifier, which is a histological number assigned by the instantaneous section of the pathology.
Depending on the type of sample, different quality and quantity-related attributes may be stored. For instance, cryo tissue samples are conserved in vials in liquid nitrogen tanks. By contrast, paraffin blocks may be stored in large-scale robotic storage systems. Further, different attributes specifying the quality of samples exist, such as the ischemic time of tissue or the degree of contamination of cell lines. That is, a special table exists for each type of sample (paraffin, cryo tissue, blood samples etc.). Samples may be used as basic material for further processing. For instance, RNA and DNA may be extracted from paraffin-embedded tissues. The different usage types of samples are modelled by the bottom classes in the schema. Operating a biobank requires efficient sample management, including inventory changes, documentation of sample usage and storing of cooperation contracts. The classes Borrow, Project and Coop Partner document which samples have left the biobank in which cooperation project and how many samples were returned. Since samples are a limited resource, they may be used up in the context of a research project. For instance, paraffin-embedded tissues may be used to construct a tissue microarray, a high-throughput analysis platform for epidemiology-based studies or gene expression profiling of tumours [28]. Thus, for ensuring the sustainability of sample collections, appropriate guidelines and policies are required. In this context, samples of rare diseases are of special interest, since
Fig. 2. Outline of the CORE schema from the SampleDB (MUG biobank)
they represent invaluable resources that may be used in multicenter studies [17]. The medical case perspective allows for assessing the relevant diagnostic information of samples. In the case of the MUG biobank, pathological diagnoses are captured and assigned to the corresponding samples. Since the MUG biobank has a clear focus on cancerous diseases, tumour-related attributes such as tumour grading and staging or the International Classification of Diseases for Oncology (ICD-O-3) classification are used [7]. Patient-related data is split into two tables: sensitive data such as personal data is stored in a separate table, while an anonymous patient table contains a unique patient identifier. Personal data of patients are not accessible to staff members of biobanks. However, medical doctors may access sample and person-related data as part of their diagnostic or therapeutic work.
3.6 Data Integration
Data integration in the biomedical area is an emerging topic, as the linkage of heterogeneous research, clinical and biobank information systems becomes more and more important. Generally, several integration architectures may be applied
for incorporating heterogeneous data sources. Data warehouses extract and consolidate data from distributed sources in a global database that is optimized for fast access. In database federations, on the other hand, data is left at the sources, and data access is accomplished by wrappers that map a global schema to distributed local schemas [33]. Although database federations deliver data that is up-to-date, they do not provide the same performance as data warehouses. However, they do not require redundant data storage and expensive periodic data extraction. In the context of the MUG biobank several types of information systems are accessed, as illustrated in figure 3. The different data sources are integrated in a database federation, with interface wrappers created for the relevant data. On the one hand, there are large clinical information systems which are used for routine diagnostic and therapeutic activities of medical doctors. Patient records from various medical institutes are stored in the OpenMedocs system, pathological data in the PACS system and laboratory data in the laboratory information system LIS. On the other hand, research databases from several institutes (e.g. the Archimed system) containing data about medical studies are incorporated, as well as the biological sample management system SampleDB and diverse robot systems. Further, survival data of patients is provided by the external institution Statistics Austria. Clinical and routine information systems (at the bottom of figure 3) are strictly separated from operational information systems of the biobank. That is, sensitive patient-related data is only accessible for medical
Fig. 3. Data Integration in context of the MUG Biobank
staff and anonymized otherwise. The MUG Biobank operates its own documentation system in order to record and coordinate all cooperation projects. The CSCW system at the top of the figure provides a scientific workbench for internal and external project partners, allowing them to share data, documents, analysis results and services. A more detailed description of the system is given in section 5.1. A modified version of the CSCW workbench will be used as the user interface for the European Biobank initiative BBMRI, described in section 4.1.
3.7 Related Work
UK-Biobank. The aim of UK Biobank is to store health information about 500,000 people from all around the UK who are aged between 40 and 69. UK Biobank has evolved over several years. Many subsystems, processes and even the system architecture have been developed from experience gathered during pilot operations [5]. UK Biobank integrated many different subsystems to work together. To ensure access to a broad range of third party data sets, it was essential that UK Biobank meets the needs of other relevant groups (e.g. the Patient Information Advisory Group). Many external requirements had to be taken into consideration to fulfil these needs [13]. Figure 4 shows a system overview of the UK Biobank and its most important components. The recruitment system is responsible for processing patient invitation data received from the National Health Service. The received data has to be cleaned and passed to the Participant Booking System. The Booking System securely transfers appointment data (name, date of birth, gender, address, ...) to the
Fig. 4. System architecture showing the most important system components of the UK Biobank [13]
J. Eder et al.
Assessment Data Collection System. The Assessment Center also handles the informed consent of each participant. The task of LIMS is to store identifiers for all received samples without any participant-identifying data such as name, address, etc. The UK Biobank also provides interfaces to clinical and non-clinical external data repositories. The Core Data Repository, which contains different data repositories, forms the basis for several different Data Warehouses. These Data Warehouses provide all parameters needed to generate appropriate datasets for answering validated research requests. Disclosure control, which prevents the re-identification of patients, is also performed on these Data Warehouses. The research community is able to post requests with the help of a User Portal which is positioned on top of the Data Warehouses. Additional Query Tools allow investigating the Data Warehouses as well as the Core Data Repository [13]. caBIG. The cancer Biomedical Informatics Grid (caBIG) has been initiated by the National Cancer Institute (NCI) as a national-scale effort to develop a federation of interoperable research information systems. The approach to reach federated interoperability is a grid middleware infrastructure called caGrid. It is designed as a service-oriented architecture. Resources are exposed to the environment as grid services with well-defined interfaces. Interaction between services and clients is supported by grid communication and service invocation protocols. The caGrid infrastructure consists of data, analytical and coordination services which are required by clients and services for grid-wide functions. According to [40], coordination services include services for metadata management, advertisement and discovery, query and security. A key characteristic of the framework is its focus on metadata and model driven service development and deployment.
This aspect of caGrid is particularly important for the support of syntactic and semantic interoperability across heterogeneous collections of applications. For more information see [40,3]. CRIP. The concept of CRIP (Central Research Infrastructure for molecular Pathology) enables biobanks to annotate projects with additional necessary data and to turn them into valuable research resources. CRIP was started at the beginning of 2006 by the departments of Pathology of the Charité and the Medical University of Graz (MUG) [41]. CRIP offers virtual simultaneous access to the tissue collections of participating pathology archives. Annotated data comes from different heterogeneous data sources and is stored in a central CRIP database. Academics and researchers with access rights are allowed to search for material of interest. Workflows and data transfers of CRIP projects are regulated in a special contract between CRIP partners and Fraunhofer IBMT.
4 Federation of Biobanks
Currently established national biobanks and biomolecular resources are a unique European strength, but valuable collections typically suffer from fragmentation of the European biobanking-related research community. This hampers the collation of
biological samples and data from different biobanks required to achieve sufficient statistical power. Moreover, it results in duplication of effort and jeopardises sustainability due to the lack of long-term funding. To overcome the issues stated above, a federation of biobanks can be used to provide access to comprehensive data and sample sets, thus achieving results with better statistical power. Furthermore, it becomes possible to investigate rare and highly diverse diseases and to save the high costs caused by duplicate analysis of the same material or data. To benefit European health-care, medical research, and ultimately the health of the citizens of the European Union, the European Commission is funding a biobank integration project called BBMRI (Biobanking and Biomolecular Resources Infrastructure).
4.1 Biobanking and Biomolecular Resources Infrastructure
The aim of BBMRI is to build a coordinated, large scale European infrastructure of biomedically relevant, quality-assessed, mostly already collected samples as well as different types of biomolecular resources (antibody and affinity binder collections, full ORF clone collections, siRNA libraries). In addition to biological materials and related data, BBMRI will facilitate access to detailed and internationally standardised data sets of sample donors (clinical data, lifestyle and environmental exposure) as well as data generated by analysis of samples using standardised analysis platforms. A large number of platforms (such as high-throughput sequencing, genotyping, gene expression profiling technologies, proteomics and metabolomics platforms, tissue microarray technology etc.) will be available through the BBMRI infrastructure [2]. Benefits. The benefits of BBMRI are manifold. In the short term, BBMRI leads to an increased quality of research as well as to a reduction of costs. The mid-term impacts of BBMRI can be seen in an increased efficacy of drug discovery/development. Long-term benefits of BBMRI are improved health care possibilities in the area of personalized medicine/health care [11]. Data Harmonisation and IT-infrastructure. An important part of BBMRI is responsible for designing the IT-infrastructure and database harmonisation, which also includes solutions for data and process standardization. The harmonization of data deals with the identification of the scope of needed information and data structures. Furthermore, it analyses how available nomenclature and coding systems can be used for storing and retrieving (heterogeneous) biobank information. Several controlled terminologies and coding systems may be used for organizing the information about biobanks [11,35]. Since not all medical information is fully available in the local databases of biobanks, the retrieval of data involves big challenges.
That implies the necessity of flexible data sharing and collaboration between centers.
4.2 Enquiries in a Federation of Biobanks
Within an IT-infrastructure for federated biobanks authorized researchers should have the possibility to search and obtain required material and data from all
participating biobanks, as needed e.g. to perform biomedical studies. Furthermore, it should even be possible to update or link data from already performed studies in the European federation. In the following we distinguish between five different kinds of use cases:
1. Identification of biobanks. Retrieves a list with contact data from participating biobanks which have desired material for a certain study.
2. Identification of cases. Retrieves the pseudonym identifiers of cases¹ stored in biobanks which correspond to a given set of parameters.
3. Retrieval of data. Obtains available information (material, data, etc.) directly from a biobank for a given set of parameters.
4. Upload or linking of data. Connects samples with data generated from these samples internally and externally.
5. Statistical queries. Performs analytical queries on a set of biobanks.
Extracting, categorizing and standardizing largely semi-structured records is a strenuous task that requires medical domain knowledge and strong quality control. Therefore, an automated retrieval of data involves big challenges. Furthermore, to enable the retrieval, upload and linking of data, a lot of research, harmonization and integration has to be done. For the moment we assume that researchers use the contact information and pseudonym identifiers to retrieve data from a biobank. An important issue within the upload and linking of data is the question of how the data has been generated. The generation of new data must follow a standardized procedure with preferably uniform tools, ontologies etc. In this context data quality as well as data provenance also play an important role. In section 3 we discussed issues within one biobank as an integration project; now we are concerned with a set of heterogeneous biobanks as an integration project. There exist several different proposals for the handling of enquiries within the BBMRI project. Our approach for enquiries is to ascertain where desired material or data is located.
Subsequently the researchers can get in contact with that biorepository by themselves. This comprises the first two use cases mentioned above. Workflow for Enquiries. In figure 5 we have modeled a possible workflow for the identification of biobanks and cases, separated into different responsibility parts. The most important participants within this workflow are the requestor (researcher), the requestor’s BBMRI host, other BBMRI hosts and biobanks. Hosts act as global coordinators within the federation. The registration of biobanks on BBMRI hosts takes place via a hub and spoke structure. In the first step of our workflow an authenticated researcher chooses a service request from a list of available services. Since a request for material or medical data can have different conditions, a suggestion is to provide a list of possible request templates like:
– Biobanks with diseased samples (cancer)
– Biobanks with diseased samples (metabolic)
¹ In our context a case is a set of jointly harvested samples of one donor.
– Cases with behavioral progression of a specific kind of tumor – Cases with commonalities of two or more tumors – ... After the selection of an appropriate service request the researcher can declare service-specific filter criteria to constrain the result according to the needs. Additionally, the researcher is able to specify a level of importance for each filter
Fig. 5. Workflow for identification of biobanks and cases separated into different responsibility parts
criteria. This level of importance is a value between 1 and 5, with 1 denoting the lowest and 5 the highest relevance. If nothing is specified, the importance of a filter criterion defaults to 3 (relevant). The level of importance has direct effects on the output of the query. It is used for three major purposes:
– Specifying must-have values for the result. If the requestor defines the highest level of importance for an attribute, the query only returns databases that match exactly.
– Specifying nice-to-have values for the result. This feature relaxes query formulations in order to accommodate semi-structured data.
– Ranking the result to show the best matches at the topmost position. The ranking algorithm takes the resulting data of the query invocation process and sorts the output according to the predefined levels of importance.
Researchers formulate their requests using query by example. BBMRI then presents itself as one single system, performing query processing and disclosure of information from the participating biobanks transparently. Accordingly, the formulated query of the researcher is sent to the requestor’s national host as an XML document (see the XML document below). Afterwards the national host (1) distributes the query to the other participating hosts in the federation using disclosure information, (2) queries its own meta data repository as well as the local databases of all registered biobanks, (3) applies a disclosure filter and (4) ranks the result. Each invoked BBMRI-Host in the federation performs the same procedure as the national host, but without distributing the incoming query. All distributed ranked query results are sent back to the requestor’s host and are merged there. Depending on how the policy is specified, the researcher gets a list of biobanks or individual cases of biobanks as the final result of the enquiry. Afterwards the researcher can get in contact with the desired biobanks.
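The effect of the importance levels on filtering and ranking can be illustrated with a minimal Python sketch. The text does not specify the ranking algorithm, so the scoring rule used here (sum of the importance levels of all matching criteria) is an assumption, as are the names Criterion, score and rank and the example records.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

DEFAULT_IMPORTANCE = 3  # default level stated in the text
MUST_HAVE = 5           # highest level: the attribute must match exactly


@dataclass
class Criterion:
    attribute: str
    value: str
    importance: int = DEFAULT_IMPORTANCE  # 1 (lowest) to 5 (highest)


def score(record: dict, criteria: list[Criterion]) -> Optional[int]:
    """Return a ranking score, or None if a must-have criterion fails.

    Assumption: the score is the sum of the importance levels of all
    matching criteria; the source does not define the actual formula.
    """
    total = 0
    for c in criteria:
        matches = record.get(c.attribute) == c.value
        if c.importance == MUST_HAVE and not matches:
            return None  # must-have value missing: exclude the record
        if matches:
            total += c.importance  # nice-to-have value raises the rank
    return total


def rank(records: list[dict], criteria: list[Criterion]) -> list[dict]:
    """Drop records that fail a must-have criterion; best matches first."""
    scored = [(score(r, criteria), r) for r in records]
    kept = [(s, r) for s, r in scored if s is not None]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in kept]


# Hypothetical example data.
criteria = [
    Criterion("diagnosis", "liver cancer", importance=5),
    Criterion("sex", "male"),  # default importance 3
    Criterion("material", "paraffin tissue", importance=2),
]
biobanks = [
    {"name": "BB-a", "diagnosis": "liver cancer", "sex": "male", "material": "cryo tissue"},
    {"name": "BB-b", "diagnosis": "liver cancer", "sex": "male", "material": "paraffin tissue"},
    {"name": "BB-c", "diagnosis": "breast cancer", "sex": "male", "material": "paraffin tissue"},
]
ranked = rank(biobanks, criteria)
```

In this sketch BB-c is excluded because it fails the must-have diagnosis, while BB-b is ranked above BB-a because it additionally satisfies the nice-to-have material criterion.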
In case of an insufficient result set the researcher has the opportunity to constrain the result set or even to refine the query. In the following we discuss different scenarios for enquiries in a federation of biobanks. The scenarios differ in the – kind of data accessed (only accessing the host’s meta-database or additionally accessing the local databases from registered biobanks) – information contained in the result (list of biobanks and its contact data or individual anonymized cases from biobanks). We assume that the meta-database stored on each host within the federation only contains the uploaded schema information of the registered biobanks, in order to avoid enormous data redundancies and rigidity in the system. Scenario 1 - The Identification of Biobanks. Within the identification of biobanks enquiries from authorized researchers are performed only by searching the meta-databases from the federated hosts. Unfortunately this may lead to very sketchy requests. An idea is to provide a small set of attributes in the
meta-database with the opportunity to specify a certain content, given that this content is an enumeration. A good candidate, for example, is the attribute “diagnose” standardized as ICD-Code (ICD-9, ICD-10, ICDO-3), because it may be very useful to know which biobank(s) store information about specific diseases. Another good candidate may be “Sex” with the values “Female”, “Male”, “Unknown”, etc. An important aspect of enquiries for the identification of biobanks is support for specifying an order of magnitude for the desired material. Example enquiry 1: A researcher wants to know which biobanks store 10 samples with the following characteristics:
– male patients
– paraffin tissue
– with diagnose liver cancer
– including follow-up data (e.g. therapy description) from the oncology
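As an illustration of how such an enquiry might be serialised for the national host, the following Python sketch builds a request document with the standard library. Only the service element with its s-id and name attributes is taken from the text; the filter and quantity elements, their attribute names and the chosen importance levels are hypothetical and do not describe the actual BBMRI format.

```python
import xml.etree.ElementTree as ET

# <service s-id="1" name="..."> follows the text; everything below it is a
# hypothetical structure chosen for illustration only.
service = ET.Element(
    "service",
    {"s-id": "1", "name": "Biobanks with diseased samples (cancer)"},
)

# (attribute, value, importance); importance 3 is the default level.
criteria = [
    ("sex", "male", 3),
    ("material", "paraffin tissue", 3),
    ("diagnosis", "liver cancer", 5),
    ("follow-up", "oncology", 3),
]
for attribute, value, importance in criteria:
    ET.SubElement(
        service,
        "filter",  # hypothetical element name
        {"attribute": attribute, "value": value, "importance": str(importance)},
    )

# Desired order of magnitude of the material (hypothetical element name).
ET.SubElement(service, "quantity", {"min-samples": "10"})

document = ET.tostring(service, encoding="unicode")
```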
XML Output for Example Enquiry 1 after definition of Filter Criteria
<service s-id="1" name="Biobanks with diseased samples (cancer)">
Scenario 2 - The Identification of Cases. Since the meta-database located on the hosts does not store any case or donor related information it is necessary to additionally query the local databases. Take note that no information
of the local databases will be sent to the requestor except unique identifiers of the appropriate cases. Querying the local databases allows a more detailed search and therefore yields a more exact result list. A special case within the identification of cases is determined by a slight variance in the result set. Depending on a policy the result set can also contain only a list of biobanks with their contact information as discussed in scenario 1. Example enquiry 2: A researcher requires the ID of about 20 cases and their location (biobank) with the following characteristics:
– paraffin tissue
– with diagnose breast cancer
– staging T1 N2 M0
– from donors of age 40-50 years
– including follow-up data (e.g. therapy) from the oncology
There are two types of relationships between the samples: donor-related and case-related relationships. Donor-related means that two or more samples have been taken from the same donor. However, the samples may have been taken in various medical contexts (different diseases, surgeries, etc.). In contrast, samples are case-related when the associated diagnoses belong to the same disease.
4.3 Data Sharing and Collaboration between Different Biobanks
A central goal in our discussions was to build an IT-infrastructure which can easily be adapted to different research needs. With regard to this we designed a feasible environment for the collaboration between different biobanks within BBMRI as a hybrid of a peer-to-peer and a hub and spoke structure. In our approach a BBMRI-Host (figure 6) represents a domain hub in the IT-infrastructure and uses a meta structure to provide data sharing. Several domain hubs are connected via a peer-to-peer structure and communicate with each other via standardized and shared Communication Adapters. Each participating European biobank is connected with its specific domain hub, i.e. its BBMRI-Host, via a hub and spoke structure. Biobanks provide their obtainable attributes and contents as well as their contact data and biobank specific information via the BBMRI Upload Service of the associated BBMRI-Host. A Mediator coordinates the interoperability issues between the BBMRI-Host and the associated biobanks. The information about uploaded data from each associated biobank is stored in the BBMRI Content-Meta-Structure. Permissions related to the uploaded data as well as contracts between a BBMRI-Host and a specific biobank are managed by the Disclosure Filter. A researcher can use the BBMRI Query Service for sending requests to the federated system. The BBMRI Query Service is the entry point for such requests. It can be accessed via the local workbenches of connected biobanks as well as via the BBMRI Scientific Workbenches.
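The hybrid topology can be sketched in a few lines of Python: biobanks register with one host (hub and spoke), hosts know their peers (peer-to-peer), and a host that receives a query from a peer answers it without distributing it further. The class and method names (Biobank, Host, register, query) and the example data are assumptions for illustration; Communication Adapters, the Mediator and the Disclosure Filter are deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Biobank:
    name: str
    contact: str
    attributes: set[str]  # schema information uploaded to the host


@dataclass
class Host:
    name: str
    biobanks: list[Biobank] = field(default_factory=list)  # spokes of this hub
    peers: list[Host] = field(default_factory=list)        # other domain hubs

    def register(self, biobank: Biobank) -> None:
        """Upload service: store the biobank's schema information."""
        self.biobanks.append(biobank)

    def query(self, wanted: set[str], distribute: bool = True) -> list[tuple[str, str]]:
        """Return (name, contact) of all biobanks declaring the wanted attributes.

        Only the requestor's host distributes the query; invoked peers
        answer locally, as described for the enquiry workflow.
        """
        hits = [(b.name, b.contact) for b in self.biobanks if wanted <= b.attributes]
        if distribute:
            for peer in self.peers:
                hits.extend(peer.query(wanted, distribute=False))
        return hits


# Hypothetical federation with two hosts and three biobanks.
host_1, host_2 = Host("host-1"), Host("host-2")
host_1.peers.append(host_2)
host_2.peers.append(host_1)
host_1.register(Biobank("BB-x", "x@example.org", {"ICD-Code", "PatientSex", "TNM"}))
host_2.register(Biobank("BB-y", "y@example.org", {"ICD-Code", "TNM"}))
host_2.register(Biobank("BB-z", "z@example.org", {"ICD-Code"}))

hits = host_1.query({"ICD-Code", "TNM"})
```

Because invoked peers are called with distribute=False, the mutual peer links between the two hosts do not lead to queries being forwarded in a loop.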
4.4 Data Model
Our proposed data models for the BBMRI Content-Meta-Structure (in figure 6) have the ability to hold a sufficient (complete) set of needed information structures. The idea is that obtainable attributes (schema information) from local databases of biobanks can be mapped with the BBMRI Content-Meta-Structure in order to provide a federated knowledge-base for life-science-research. To avoid data overkill we designed a kind of lower-bound schema that contains attributes usually occurring in most or even all of the participating biobanks. Our approach was to accomplish a hybrid-solution of a federated system and an additional data warehouse as a kind of index to primarily reduce the query overhead. This decision led to the design of a class (named ContentInformation,
Fig. 6. Architecture of IT-infrastructure for BBMRI
figure 7) which contains attributes with different meanings, similar to online analytical processing (OLAP), including:
– Content-attributes. These are a small set of attributes which provide information about their content in the local database (cf. 4.2). All content-attributes must be provided by each participating biobank. The type of the attributes stored in the meta-dataset with content information must be an enumeration like ICD-Code, patient sex or BMI-category.
– Number of Cases (NoC). This is an order of magnitude for all available cases of a specific disease in combination with all content-attributes.
– Existence-attributes. These accept two different kinds of characteristics:
1. Value as quantity. This kind of existence-attribute tells how many occurrences of a given attribute are available in a local database for a specific instance tuple of all content-attributes. The knowledge is represented by a numeric value greater than a defined k-value for an aggregated set of cases. Conditions on this kind of existence-attribute are OR-connected because they are independent of each other. They do not give information on their inter-relationship.
2. Value as availability. With this kind of existence-attribute the values are not stored in an aggregated form as mentioned above, but as a bitmap (0 / 1), with 0 meaning not available and 1 meaning available. As a consequence, each row in the relation contains one specific case. Due to this fact AND-connected conditions on existence-attributes can be answered.
We compared two different approaches for the data model of the BBMRI Content-Meta-Structure, a static and a dynamic one. The Static Approach. In comparison to an OLAP data-cube our class ContentInformation (see figure 7) acts as the fact-table with the content-attributes as dimensions and existence-attributes (including the Number of Cases) as measures.
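The difference between the two kinds of existence-attributes described above can be made concrete with a small Python sketch. The rows, attribute names and the k-value are hypothetical; the point is only that aggregated counts support OR-connected conditions, whereas per-case availability bits also support AND-connected ones.

```python
K = 5  # hypothetical k-value: counts of K or fewer are treated as not disclosed

# Value as quantity: one row per combination of content-attributes.
# The counts are independent, so they say nothing about how many cases
# have staging AND follow-up data at the same time.
aggregated = [
    {"icd": "C50.8", "sex": "female", "noc": 120, "staging": 80, "follow_up": 40},
    {"icd": "C22.0", "sex": "male", "noc": 30, "staging": 3, "follow_up": 12},
]


def has_any(row, existence_attrs, k=K):
    """OR-connected: at least one requested attribute occurs more than k times."""
    return any(row.get(a, 0) > k for a in existence_attrs)


# Value as availability: one row per case, 0 = not available, 1 = available.
per_case = [
    {"case": "c1", "icd": "C50.8", "staging": 1, "follow_up": 1},
    {"case": "c2", "icd": "C50.8", "staging": 1, "follow_up": 0},
    {"case": "c3", "icd": "C50.8", "staging": 0, "follow_up": 1},
]


def has_all(row, existence_attrs):
    """AND-connected: every requested attribute is available for this case."""
    return all(row.get(a, 0) == 1 for a in existence_attrs)


or_hits = [r["icd"] for r in aggregated if has_any(r, ["staging", "follow_up"])]
and_hits = [r["case"] for r in per_case if has_all(r, ["staging", "follow_up"])]
```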
We call this approach static because it enforces a data model which includes a common set of attributes on which all participating biobanks have to agree. Biobanks have the opportunity to register their schema information via an attribute catalogue and provide their content information as shown in figure 7. Within the static approach the splitting of the set of obtainable attributes from the participating biobanks into content-attributes and existence-attributes
Fig. 7. Example data of class ContentInformation as static approach
makes data analysis more performant because queries do not get too complex. OLAP operations like roll-up and drill-down are also enabled. A serious handicap is the missing flexibility to store information of biobanks that work in different areas, since all biobanks must use the same content information. The Dynamic Approach. The data model for the BBMRI Content-Meta-Structure described above works on a statically defined set of attributes coupled as ContentInformation. Thus, all participating biobanks have the same ContentInformation. However, a metabolism-oriented biobank does not necessarily need or have a TNM-Classification, and conversely a cancer-oriented biobank does not necessarily need or have metabolism-specific attributes. Because of this, the dynamic data model generates the ContentInformation dynamically (figure 9). That is, each biobank first declares the attributes it stores in its local databases. In particular, it declares which of them are content-attributes and which of them are existence-attributes. However, this declaration affects requests for material and must therefore be made carefully.
– Requests on existence of attributes. For a request on the existence of several attributes it does not matter whether the requested attributes are declared as content-attributes or existence-attributes. The only condition for a query-hit is that the searched attributes are declared by a biobank.
– Requests on content of attributes. For a request on the content of several attributes, all requested attributes must be declared as content-attributes by a biobank in order to get a query-hit. For example, a request on female patients who suffer from C50.8 (breast cancer) would not get a query-hit from BB-y (figure 8) because the attribute PatientSex is declared as existence-attribute and thus has no information about its content.
Fig. 8. Explicit declaration of attributes in the dynamic approach
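The two request types of the dynamic approach can be expressed as simple checks over the declarations of each biobank. In the following Python sketch only the declaration of PatientSex as existence-attribute by BB-y follows the text; the biobank BB-x, the remaining declarations and the function names are assumptions for illustration.

```python
# Declared attributes per biobank: attribute name -> kind of declaration.
declarations = {
    "BB-x": {"ICD-Code": "content", "PatientSex": "content"},    # hypothetical
    "BB-y": {"ICD-Code": "content", "PatientSex": "existence"},  # PatientSex as in the text
}


def existence_hit(declared, requested):
    """Request on existence: every requested attribute is declared at all."""
    return all(attribute in declared for attribute in requested)


def content_hit(declared, requested):
    """Request on content: every requested attribute is a content-attribute."""
    return all(declared.get(attribute) == "content" for attribute in requested)


requested = ["ICD-Code", "PatientSex"]  # e.g. female patients with C50.8
existence_hits = [b for b, d in declarations.items() if existence_hit(d, requested)]
content_hits = [b for b, d in declarations.items() if content_hit(d, requested)]
```

Here both biobanks answer the existence request, but only BB-x answers the content request, which mirrors the BB-y example above.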
Fig. 9. Dynamic generation for content information of BBMRI content meta structure
With the dynamic data model it is possible to support different kinds of content information depending on the needs of the biobanks. Moreover, introducing a new attribute does not lead to changes in the database schema. In the following table 1 we compare the static data model with the dynamic data model.
Table 1. Comparison between static and dynamic approach
Approach    Flexibility in maintenance    Simplicity in query    Anonymity issues
static      + +
dynamic     + +
4.5 Disclosure Filter
The disclosure filter is a software component that helps the BBMRI-Hosts to answer the following question: Who is allowed to receive what from whom under which circumstances? For example, since it is planned to provide information exchange across national borders, the disclosure filter has to ensure that no data (physical or electronic) leaves the country illegally. The disclosure filter takes into account laws, contracts between participants, policies of participants and even rulings (e.g. by courts or ethics boards). The disclosure filter on a BBMRI-Host plays three different roles:
1. Provider host and local biobank remove items from query answers that are not supposed to be seen by requestors.
2. Technical Optimization: Query system optimizes query processing using disclosure information.
3. Requestor host removes providers which do not provide sufficient information to the requestor. This role can be switched on / off.
The disclosure filter plays a central role in the workflow (figure 5) for use cases 1 and 2 as well as in the architecture (figure 6). Depending on the role of the disclosure filter the location within the workflow can change. During requirements analysis the possibility to switch the disclosure filter off on demand turned out to be an important feature. With the help of this feature it is possible to operate a more relaxed system.
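Roles 1 and 3 of the disclosure filter can be sketched as two small filter functions in Python. The rule representation (provider, requestor role, item), the function names and the example data are assumptions made for illustration; laws, contracts and rulings, i.e. the circumstances under which a rule applies, are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    provider: str        # from whom
    requestor_role: str  # who
    item: str            # what


def provider_filter(answer, provider, requestor_role, rules):
    """Role 1: drop items the requestor is not supposed to see."""
    return {
        item: value
        for item, value in answer.items()
        if Rule(provider, requestor_role, item) in rules
    }


def requestor_filter(answers, required, enabled=True):
    """Role 3: drop providers lacking required items; can be switched off."""
    if not enabled:
        return answers
    return {p: a for p, a in answers.items() if required <= set(a)}


# Hypothetical rules and query answers.
rules = {
    Rule("BB-x", "researcher", "contact"),
    Rule("BB-x", "researcher", "number_of_cases"),
}
raw_answer = {"contact": "x@example.org", "number_of_cases": 40, "case_ids": ["p-17"]}
visible = provider_filter(raw_answer, "BB-x", "researcher", rules)

answers = {"BB-x": visible, "BB-y": {"contact": "y@example.org"}}
sufficient = requestor_filter(answers, {"contact", "number_of_cases"})
relaxed = requestor_filter(answers, {"contact", "number_of_cases"}, enabled=False)
```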
5 Working with Biobanks
In this chapter, we want to point out some application areas of IT infrastructures in the context of biobanks. We mainly focus on support of medical research, since there is a strong demand for assisting, documenting and interconnecting research activities. Additionally, these activities are tightly coupled with the data management of a biobank, providing research results for samples and thereby enhancing the scientific value of samples.
5.1 Computer Supported Collaborative System (CSCW) for Medical Research
Medical research is a collaborative process in an interdisciplinary environment that may be effectively supported by a CSCW system. Such a system imposes specific requirements in order to allow flexible integration of data, analysis services and communication mechanisms. Persons with different expertise and access rights cooperate in mutually influencing contexts (e.g. clinical studies, research cooperations). Thus, appropriate virtual environments are needed to facilitate context-aware communication, deployment of biomedical tools as well as data and knowledge sharing. In cooperation with the University of Paderborn we were able to leverage a CSCW system, that covers our demands, on the flexible service-oriented architecture Wasabi, a reimplementation of Open sTeam (www.open-steam.org) which is widely used in research projects to share data and knowledge and cooperate in virtual knowledge spaces. We use Wasabi as a middleware integrating distributed data sources and biomedical services. We systematically elaborated the main requirements of a medical CSCW system and designed a conceptual model, as well as an architectural proposal satisfying our demands [42,45]. Finally we implemented a virtual workbench to support collaboration activities in medical research and routine work. This workbench had to fulfill several important requirements, in particular: – R(1) User and Role Management. The CSCW has to be able to cope with the organisational structure of the institutes and research groups of the hospital. Data protection directives have to fit in the access right model of
the system. However, the model has to be flexible enough to allow the creation of new research teams and information sharing across organisational borders.
– R(2) Transparency of physical Storage. Although data may be stored in distributed locations, data retrieval and data storage should be solely dependent on access rights, irrespective of the physical location. That is, the complexity of data structures is hidden from the end user. The CSCW system has to offer appropriate search, join and transformation mechanisms.
– R(3) Flexible Data Presentation. Since data is accessed by persons having different scientific backgrounds (biological, medical, technical expertise) in order to support a variety of research and routine activities, flexible capabilities to contextualise data are required. Collaborative groups should be able to create on-demand views and perspectives, and to annotate and change data in their contexts without interfering with other contexts.
– R(4) Flexible Integration and Composition of Services. A multitude of data processing and data analysis tools exist in the biomedical context. Some tools act as complementary parts in a chain of processing steps. For example, to detect genes correlated with a disease, gene expression profiles are created by measuring and quantifying gene activities. The resulting gene expression ratios are normalised and candidate genes are preselected. Finally, significance analysis is applied to identify relevant genes [49]. Each function may be provided by a separate tool, for example by Genespring® and Genesis® [9,47]. In some cases tools provide equal functionality and may be chosen as alternatives. Through flexible integration of tools as services with standardised input and output interfaces a dynamic composition of tools may be accomplished. From the systems perspective services are technology neutral, loosely coupled and support location transparency [36]. The execution of services is not limited to proprietary operating systems and service callers do not know the internal structure of a service. Further, services may be physically distributed over departments and institutes, e.g. image scanning and processing is executed in a dedicated laboratory where the gene expression slides reside.
– R(5) Support of cooperative Functions. In order to support collaborative work suitable mechanisms have to be supplied. One of the main aspects is common data annotation. Thus, data is augmented and shared within a group and new content is created cooperatively. Web 2.0 technologies like wikis and blogs therefore provide a flexible framework for facilitating intra- and inter-group activities.
– R(6) Data-coupled Communication Mechanisms. Cooperative working is tightly coupled with extensive information exchange. Appropriate communication mechanisms are useful to coordinate project activities, organise meetings and enable topic-related discussions. On the one hand, a seamless integration of email exchange, instant messaging and VoIP tools facilitates communication activities. We propose to reuse the organisational data defined in R(1) within the communication tools. On the other hand, persons should be able to include data objects in their communication acts. For example, images of diseased tissues may be diagnosed cooperatively, with marking and annotating of image sections supporting the decision making process.
– R(7) Knowledge Creation and Knowledge Processing. Cooperative medical activities frequently comprise the creation of new knowledge. Data sources are linked with each other, similarities and differences are detected, and involved factors are identified. Consider a set of genes that is assumed to be strongly correlated with the genesis of a specific cancer subtype. If the hypothesis is verified the information may be reused in subsequent research. Thus, methods to formalise knowledge, share it in arbitrary contexts and deduce new knowledge are required. In the following Figure 10, the concept of virtual knowledge spaces is illustrated. The main idea is to contextualize documents, data, services, annotations in virtual knowledge spaces and make them accessible for cooperating individuals. Rooms may be linked to each other or nested in each other. Communication is tightly coupled to the shared resources, enabling discussion, quality control and collaborative annotations.
Fig. 10. Wasabi virtual room
5.2 Workflow for Gene Expression Analysis for the Breast Cancer Project
A detailed breast cancer data set was annotated at the Pathology Graz. In this context much emphasis is put on detecting deviations in the behaviour of gene groups. We support the entire analysis workflow by supplying an IT research platform that allows users to select and group patients arbitrarily, preprocess and link the related gene expressions and finally run state-of-the-art analysis algorithms. We developed an appropriate database structure with import/export methods for managing arbitrary medical data sets and gene expressions. We also implemented web service interfaces to various gene expression analysis algorithms. Currently, we are able to support the following steps of the research workflow: (1) Case Selection: In the first step relevant medical cases are selected. The set of available breast cancer cases with associated cryo tissue samples is selected by querying the SampleDB. Since only cases with follow-up data from the oncology
Fig. 11. Workflow for the support of gene expression analysis
are included in the project, those cases are filtered. Further filter criteria are metastasis and therapeutic documentation. Case selection is a composed activity, as two separate databases of two different institutes (pathology and oncology) are accessed, filtered and joined. After selecting the breast cancer cases, gene expression profiles may be created.
Output description: Set of medical cases.
Output type: File or list of appropriate medical cases identified by unique keys like patient ID.
(2) Normalization of Gene Expression Profiles: A set of GPR files is defined as input source and the preferred normalisation method is applied. We use the normalisation methods offered by the Bioconductor library of the R project. The result of the normalisation is stored for further processing.
Output description: Normalised gene expression matrix.
Output type: The result of the normalisation is a matrix where rows correspond to genes and columns to medical cases. The matrix may be stored as file or table.
(3) Gene Annotation: In order to link genes with other resources, unique gene identifiers are required (e.g. Ensembl GeneID, RefSeq). Therefore, we integrated mapping data supplied by the Operon chip producer.
Output description: Annotated gene expression matrix.
Output type: Gene expression matrix with chosen gene identifiers. The matrix may be stored as file or table.
(4) Link Gene Ontologies: We use gene ontologies (www.geneontology.org) in order to map single genes to functional groups. Therefore, we imported the most recent gene ontologies into our database. We also plan to integrate pathway data from the KEGG database (www.genome.jp/kegg/) as an alternative way of grouping genes into functional groups.
Output description: Mapping from gene groups (gene ontologies, KEGG functional groups) to single genes.
Output type: List of mappings, where in each mapping a group is mapped to a list of single genes.
(5) Link Annotated Patient Data: Each biological sample corresponds to a medical case and has to be linkable to the gene expression matrix. A file containing medical parameters is imported; these parameters may be used to define groups of interest for the analysis.
Output description: A table storing all medical parameters for all cases.
Output type: A database table that links medical parameters to cases of the annotated gene expression matrix.
(6) Group Samples: A hypothesis is formulated by defining groups of medical cases that are compared in the analysis. The subsequent analysis tries to detect significant differences in gene groups between the medical groups.
Output description: The medical cases are grouped according to the chosen medical parameters.
Output type: A list of mappings, where each mapping assigns a unique case identifier to a group identifier.
(7) Analysis: We implemented web service interfaces to the Bioconductor packages ’Global Test’ [25] and ’Global Ancova’ [29]. We use the selected medical parameters for sample grouping and the GO categories for gene grouping together with the gene expression matrix as input parameters for both algorithms. After the analysis is finished, the results are written into an analysis database and exported as Excel files. We also plan to integrate an additional analysis tool called Matisse from Tel Aviv University [50].
Output description: A list of significant gene groups. The number of returned gene groups may be customized. For instance, only the top 10 significant gene groups are returned.
Output type: A list of significant gene groups, together with a textual description of each group and its p-value.
(8) Plotting: Gene plots may be created, visualizing the influence of single genes in a significant gene group. The plots are created using Bioconductor libraries which are encapsulated in a web service.
Output description: Gene plots of significant gene groups.
Output type: PNG image files that may be downloaded and saved into an analysis database.
We are able to show that a service-oriented CSCW system provides the functionality to build a workbench for medical research supporting the collaboration
of researchers, allowing the definition of workflows and gathering all necessary data for maintaining provenance information.
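The data structures passed between steps (3), (4), (6) and (7) of the workflow can be illustrated with a minimal Python sketch. All identifiers and values below are invented for illustration, and the scoring function is a deliberately simple placeholder (mean absolute difference of group means); it is not the Global Test or GlobalAncova procedure used in the project, which are Bioconductor packages accessed via web services.

```python
from statistics import mean

# Hypothetical toy data; identifiers and values are invented for illustration.
# Step (3): annotated expression matrix, rows = genes, columns = medical cases.
expression = {
    "geneA": {"case1": 2.0, "case2": 2.2, "case3": 0.5, "case4": 0.7},
    "geneB": {"case1": 1.0, "case2": 1.2, "case3": 1.1, "case4": 0.9},
    "geneC": {"case1": 3.0, "case2": 2.8, "case3": 1.0, "case4": 1.2},
}
# Step (4): gene group -> list of single genes.
gene_groups = {"groupX": ["geneA", "geneC"], "groupY": ["geneB"]}
# Step (6): unique case identifier -> medical group identifier.
sample_groups = {
    "case1": "metastasis", "case2": "metastasis",
    "case3": "no_metastasis", "case4": "no_metastasis",
}


def group_score(genes, expression, sample_groups, group_a, group_b):
    """Placeholder statistic: mean absolute difference of per-gene group means."""
    diffs = []
    for gene in genes:
        row = expression[gene]
        a = [v for case, v in row.items() if sample_groups[case] == group_a]
        b = [v for case, v in row.items() if sample_groups[case] == group_b]
        diffs.append(abs(mean(a) - mean(b)))
    return mean(diffs)


def rank_gene_groups(gene_groups, expression, sample_groups,
                     group_a, group_b, top=10):
    """Step (7) stand-in: return up to `top` (group, score) pairs, best first."""
    scored = [
        (name, group_score(genes, expression, sample_groups, group_a, group_b))
        for name, genes in gene_groups.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]


ranking = rank_gene_groups(gene_groups, expression, sample_groups,
                           "metastasis", "no_metastasis")
```

The `top` parameter mirrors the customizable number of returned gene groups described in step (7). A real analysis would replace `group_score` with a proper test statistic and report p-values rather than raw differences.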
6 Data Privacy and Anonymization
When releasing patient-specific data (e.g. in medical research cooperations) privacy protection has to be guaranteed for ethical and legal reasons. Even when immediately identifying attributes like name, address or date of birth are eliminated, other attributes (quasi-identifying attributes) may be used to link the released data with external data to re-identify individuals. In recent research much effort has been put into privacy-preserving and anonymization methods. In this context, k-anonymity [48] was introduced, which protects sensitive data by generating a sufficient number k of data twins, i.e. records that share the same quasi-identifying values. These data twins prevent sensitive data from being linked to individuals. K-anonymity may be accomplished by:
– transforming attribute values to more general values: nominal and categorical attributes may be transformed by taxonomy trees or user-defined generalization hierarchies
– mapping numerical attributes to intervals (for instance, age 45 may be transformed to age interval 40-50)
– replacing a value with a less specific but semantically consistent value (e.g. replace numeric (continuous) data for blood pressure with categorical data like ’high’ blood pressure)
– combining several attributes to make them more coarse-grained (e.g. replace height and weight with BMI)
– fragmentation of the attribute vector
– data blocking (i.e. replacing certain attributes of some data items with a null value)
– dropping a sample from the selection set
– dropping an attribute
For a given data set, several k-anonymous anonymizations may be created depending on how attributes are generalized. Transformations of attribute values are always accompanied by an information loss, which may be used as a quality criterion for an anonymization. That is, an optimal anonymization may be defined as the k-anonymous anonymization with the minimal information loss. Information, the value of information and the significance of information loss are in the eye of the beholder, i.e.
it depends on the requirements of the intended analysis. Only the purpose can tell which of the generalizations is more suited and gives more accurate results. Therefore, we developed a tool called Open anonymizer (see https://sourceforge.net/projects/openanonymizer) which is based on individual attribution of information loss. We implemented the anonymization algorithm as a Java web application that may be deployed on a web application server and accessed by a web browser. Open anonymizer is a highly customizable anonymization tool providing the best anonymization for a certain
context. The anonymization process is strongly influenced by the data quality requirements of users. We allow users to specify the importance of attributes as well as transformation limits for attributes. These parameters are considered in the anonymization process, which delivers a solution that is guaranteed to fulfil the user requirements and has a minimal information loss. Open anonymizer provides a wizard-based, intuitive user interface which guides the user through the anonymization process. Instead of anonymizing the entire data set of a data repository, a simple query interface allows users to extract relevant subsets of data to be anonymized. For instance, in a biomedical context, diagnoses of a certain carcinoma type may be selected, anonymized and released without considering the rest of the diagnoses.
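The interplay of generalization, the k-anonymity check and information loss described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm implemented in Open anonymizer: the records, the attribute names, the candidate interval widths and the toy loss measure (interval width relative to the widest candidate, scaled by a user-chosen weight) are all hypothetical.

```python
from collections import Counter

# Hypothetical records; attribute names and values are invented for illustration.
records = [
    {"age": 45, "zip": "8010", "diagnosis": "C50"},
    {"age": 47, "zip": "8010", "diagnosis": "C50"},
    {"age": 52, "zip": "8020", "diagnosis": "C18"},
    {"age": 58, "zip": "8020", "diagnosis": "C50"},
]
QUASI_IDENTIFIERS = ("age", "zip")


def generalize_age(age, width):
    """Map a numeric age to an interval label, e.g. 45 -> '40-50' for width 10."""
    low = (age // width) * width
    return f"{low}-{low + width}"


def anonymize(records, age_width):
    """Return copies of the records with the age generalized to intervals."""
    return [{**r, "age": generalize_age(r["age"], age_width)} for r in records]


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


def weighted_loss(age_width, max_width, weight):
    """Toy information loss: relative interval width scaled by a user weight."""
    return weight * age_width / max_width


def best_age_width(records, k, candidate_widths, weight=1.0):
    """Pick the k-anonymous candidate with minimal toy loss, or None if none exists."""
    max_width = max(candidate_widths)
    feasible = [
        w for w in candidate_widths
        if is_k_anonymous(anonymize(records, w), QUASI_IDENTIFIERS, k)
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda w: weighted_loss(w, max_width, weight))
```

For example, `best_age_width(records, 2, [5, 10, 20])` considers three generalization levels and keeps the narrowest interval width that still gives every record a data twin. The `weight` argument stands in for the user-specified attribute importance; with several attributes, each would carry its own weight and transformation limit.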
7 Conclusion
Biobanks are challenging application areas for advanced information technology. The foremost challenges for the information system support in a network of biobanks as envisioned in the BBMRI project are the following:
– Partiality: A biobank is intended to be one node in a federation (cooperative network) of biobanks. It needs the descriptive capabilities to be useful for other nodes in the network and it needs the capability to make use of other biobanks. This requires careful design of metadata about the contents of the biobank, as well as the acceptance and interoperability of heterogeneous partner resources. On the other hand, a biobank will rely on data generated and maintained in other systems (other centres, hospital information systems, etc.).
– Auditability: Managing the provenance of data will be essential for advanced biomedical studies. Documenting the origins and the quality of data and specimens, the sources used for studies, and the methods, tools and results of studies is essential for the reproducibility of results.
– Longevity: A biobank is intended to be a long-lasting research infrastructure and thus many changes will occur during its lifetime: new diagnostic codes, new therapies, new analytical methods, new legal regulations, and new IT standards. The biobank needs to be ready to incorporate such changes and to be able to make best use of already collected data in spite of such changes.
– Confidentiality: A biobank stores or links to patient-related data. Personal data and genomic data are considered highly sensitive in many countries. The IT infrastructure must on the one hand provide means to protect the confidentiality of protected data and on the other enable the best possible use of data for studies respecting confidentiality constraints.
We presented biobanks and discussed the requirements for biobank information systems.
We have shown that many different research areas within the Databases and Information Systems field contribute to this endeavor. We were only able to show some examples: advanced information modeling, (semantic) interoperability, federated databases, approximate query answering, result ranking, computer supported cooperative work (CSCW), and security and privacy. Some well
known solutions from different application areas have to be revisited given the size, heterogeneity, diversity, dynamics, and complexity of data to be organized in biobanks.
References
1. Biobankcentral, http://www.biobankcentral.org
2. Biobanking and biomolecular resources research infrastructure (bbmri), http://www.bbmri.eu
3. Cabig - cancer biomedical informatics grid, https://cabig.nci.nih.gov
4. Geneontology, http://www.geneontology.org
5. Uk-biobank, http://www.ukbiobank.ac.uk
6. Who: International statistical classification of diseases and related health problems. 10th revision version for 2007 (2007)
7. Who: International classification of diseases for oncology, 3rd edn., icd-o-3 (2000)
8. Organisation for economic cooperation and development: Biological resource centres: Underpinning the future of life sciences and biotechnology (2001)
9. Genespring: Cutting-edge tools for expression analysis (2005), http://www.silicongenetics.com
10. Nih guide: Informed consent in research involving human participants (2006)
11. Bbmri: Construction of new infrastructures - preparatory phase. INFRA–2007–2.2.1.16: European Bio-Banking and Biomolecular Resources (April 2007)
12. Organisation for economic cooperation and development: Best practice guidelines for biological resource centres (2007)
13. Uk biobank: Protocol for a large-scale prospective epidemiological resource. Protocol No: UKBB-PROT-09-06 (March 2007)
14. Altintas, I., Barney, O., Jaeger-Frank, E.: Provenance collection support in the kepler scientific workflow system, pp. 118–132 (2006)
15. Ambrosone, C.B., Nesline, M.K., Davis, W.: Establishing a cancer center data bank and biorepository for multidisciplinary research. Cancer epidemiology, biomarkers & prevention: a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology 15(9), 1575–1577 (2006)
16. Asslaber, M., Abuja, P., Stark, K., Eder, J., Gottweis, H., Trauner, M., Samonigg, H., Mischinger, H., Schippinger, W., Berghold, A., Denk, H., Zatloukal, K.: The genome austria tissue bank (gatib). Pathobiology 74, 251–258 (2007)
17. Asslaber, M., Zatloukal, K.: Biobanks: transnational, european and global networks. Briefings in functional genomics & proteomics 6(3), 193–201 (2007)
18. Cambon-Thomsen, A.: The social and ethical issues of post-genomic human biobanks. Nat. Rev. Genet. 5(11), 866–873 (2004)
19. Chamoni, P., Stock, S.: Temporal structures in data warehousing. In: Mohania, M., Tjoa, A.M. (eds.) DaWaK 1999. LNCS, vol. 1676, pp. 353–358. Springer, Heidelberg (1999)
20. Eder, J., Dabringer, C., Schicho, M., Stark, K.: Data management for federated biobanks. In: Proc. DEXA 2009 (2009)
21. Eder, J., Koncilia, C.: Changes of dimension data in temporal data warehouses. In: Kambayashi, Y., Winiwarter, W., Arikawa, M. (eds.) DaWaK 2001. LNCS, vol. 2114, pp. 284–293. Springer, Heidelberg (2001)
22. Eder, J., Koncilia, C., Morzy, T.: The comet metamodel for temporal data warehouses. In: Pidduck, A.B., Mylopoulos, J., Woo, C.C., Ozsu, M.T. (eds.) CAiSE 2002. LNCS, vol. 2348, pp. 83–99. Springer, Heidelberg (2002)
23. Elliott, P., Peakman, T.C.: The uk biobank sample handling and storage protocol for the collection, processing and archiving of human blood and urine. International Journal of Epidemiology 37(2), 234–244 (2008)
24. Foster, I., Vöckler, J., Wilde, M., Zhao, Y.: Chimera: A virtual data system for representing, querying, and automating data derivation. In: Proceedings of the 14th Conference on Scientific and Statistical Database Management, pp. 37–46 (2002)
25. Goeman, J.J., van de Geer, S.A., de Kort, F., van Houwelingen, H.C.: A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20(1), 93–99 (2004)
26. Goos, G., Hartmanis, J., Sripada, S., Leeuwen, J.V., Jajodia, S.: Temporal Databases: Research and Practice. Springer, New York (1998)
27. Gottweis, H., Zatloukal, K.: Biobank governance: Trends and perspectives. Pathobiology 74, 206–211 (2007)
28. Hewitt, S.: Design, construction, and use of tissue microarrays. Protein Arrays: Methods and Protocols 264, 61–72 (2004)
29. Hummel, M., Meister, R., Mansmann, U.: Globalancova: exploration and assessment of gene group effects. Bioinformatics 24(1), 78–85 (2008)
30. Isabelle, M., Teodorovic, I., Morente, M., Jaminé, D., Passioukov, A., Lejeune, S., Therasse, P., Dinjens, W., Oosterhuis, J., Lam, K., Oomen, M., Spatz, A., Ratcliffe, C., Knox, K., Mager, R., Kerr, D., Pezzella, F.: Tubafrost 5: multifunctional central database application for a european tumor bank. Eur. J. Cancer 42(18), 3103–3109 (2006)
31. Kim, S.: Development of a human biorepository information system at the university of kentucky markey cancer center. In: International Conference on BioMedical Engineering and Informatics, vol. 1, pp. 621–625 (2008)
32. Litwin, W., Mark, L., Roussopoulos, N.: Interoperability of multiple autonomous databases. ACM Comput. Surv. 22(3), 267–293 (1990)
33. Louie, B., Mork, P., Martin-Sanchez, F., Halevy, A., Tarczy-Hornoch, P.: Methodological review: Data integration and genomic medicine. J. of Biomedical Informatics 40(1), 5–16 (2007)
34. Missier, P., Belhajjame, K., Zhao, J., Roos, M., Goble, C.: Data lineage model for taverna workflows with lightweight annotation requirements. In: Freire, J., Koop, D., Moreau, L. (eds.) IPAW 2008. LNCS, vol. 5272, pp. 17–30. Springer, Heidelberg (2008)
35. Muilu, J., Peltonen, L., Litton, J.: The federated database - a basis for biobank-based post-genome studies, integrating phenome and genome data from 600 000 twin pairs in europe. European Journal of Human Genetics 15, 718–723 (2007)
36. Papazoglou, M.P.: Service-oriented computing: concepts, characteristics and directions. In: Proceedings of the Fourth International Conference on Web Information Systems Engineering, WISE 2003, pp. 3–12 (2003)
37. Ram, S., Liu, J.: A semiotics framework for analyzing data provenance research. Journal of computing Science and Engineering 2(3), 221–248 (2008)
38. Rebulla, P., Lecchi, L., Giovanelli, S., Butti, B., Salvaterra, E.: Biobanking in the year 2007. Transfusion Medicine and Hemotherapy 34, 286–292 (2007)
39. Riegman, P., Morente, M., Betsou, F., de Blasio, P., Geary, P.: Biobanking for better healthcare. In: The Marble Arch International Working Group on Biobanking for Biomedical Research (2008)
40. Saltz, J., Oster, S., Hastings, S., Langella, S., Kurc, T., Sanchez, W., Kher, M., Manisundaram, A., Shanbhag, K., Covitz, P.: Cagrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics 22(15), 1910–1916 (2006)
41. Schroeder, C.: Vernetzte gewebesammlungen fuer die forschung crip. Laborwelt 5, 26–27 (2007)
42. Schulte, J., Hampel, T., Stark, K., Eder, J., Schikuta, E.: Towards the next generation of service-oriented flexible collaborative systems – a basic framework applied to medical research. In: Cordeiro, J., Filipe, J. (eds.) ICEIS 2008 - Proceedings of the Tenth International Conference on Enterprise Information Systems, number 978-989-8111-36-4, Barcelona, Spain, June 2008, pp. 232–239 (2008)
43. Sheth, A.P., Larson, J.A.: Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Comput. Surv. 22(3), 183–236 (1990)
44. Simmhan, Y.L., Plale, B., Gannon, D.: A framework for collecting provenance in data-centric scientific workflows. In: ICWS 2006: Proceedings of the IEEE International Conference on Web Services, Washington, DC, USA, pp. 427–436. IEEE Computer Society, Los Alamitos (2006)
45. Stark, K., Schulte, J., Hampel, T., Schikuta, E., Zatloukal, K., Eder, J.: GATiB-CSCW, medical research supported by a service-oriented collaborative system. In: Bellahsène, Z., Léonard, M. (eds.) CAiSE 2008. LNCS, vol. 5074, pp. 148–162. Springer, Heidelberg (2008)
46. Stevens, R., Zhao, J., Goble, C.: Using provenance to manage knowledge of in silico experiments. Briefings in bioinformatics 8(3), 183–194 (2007)
47. Sturn, A., Quackenbush, J., Trajanoski, Z.: Genesis: cluster analysis of microarray data. Bioinformatics 18(1), 207–208 (2002)
48. Sweeney, L., Samarati, P.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In: Proceedings of the IEEE Symposium on Research in Security and Privacy (1998)
49. Tusher, V.G., Tibshirani, R., Chu, G.: Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U S A 98(9), 5116–5121 (2001)
50. Ulitsky, I., Shamir, R.: Identification of functional modules using network topology and high-throughput data. BMC Systems Biology 1(1) (2007)
51. Yang, J.: Temporal Data Warehousing. Stanford University (2001)
52. Zatloukal, K., Yuille, M.: Information on the proposal for european research infrastructure. In: European Bio-Banking and Biomolecular Resources (2007)
Exploring Trust, Security and Privacy in Digital Business
Simone Fischer-Hübner1, Steven Furnell2, and Costas Lambrinoudakis3,*
1 Department of Computer Science, Karlstad University, Karlstad, Sweden
[email protected]
2 School of Computing & Mathematics, University of Plymouth, Plymouth, United Kingdom
[email protected]
3 Department of Information and Communication Systems Engineering, University of the Aegean, Samos, Greece
[email protected]
Abstract. Security and privacy are widely held to be fundamental requirements for establishing trust in digital business. This paper examines the relationship between the factors, and the different strategies that may be needed in order to provide an adequate foundation for users’ trust. The discussion begins by recognising that users often lack confidence that sufficient security and privacy safeguards can be delivered from a technology perspective, and therefore require more than a simple assurance that they are protected. One contribution in this respect is the provision of a Trust Evaluation Function, which supports the user in reaching more informed decisions about the safeguards provided in different contexts. Even then, however, some users will not be satisfied with technology-based assurances, and the paper consequently considers the extent to which risk mitigation can be offered via routes, such as insurance. The discussion concludes by highlighting a series of further open issues that also require attention in order for trust to be more firmly and widely established.
Keywords: Trust, Security, Privacy, Digital Business.
1 Introduction
The evolution in the way information and communication systems are currently utilised and the widespread use of web-based digital services drive the transformation of modern communities into modern information societies. Nowadays, personal data are available and/or can be collected at different sites around the world. Even though the utilisation of personal information leads to several advantages, including improved customer services, increased revenues and lower business costs, it can be misused in several ways and may lead to violation of privacy. For instance, in the framework of
* Authors are listed in alphabetical order.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 191–210, 2009. © Springer-Verlag Berlin Heidelberg 2009
e-commerce, several organisations develop new methods for collecting and processing personal data in order to identify the preferences of their customers and adapt their products accordingly. Modern data mining techniques can then be utilised to further process the collected data, generating databases of consumer profiles through which each person’s preferences can be uniquely identified. Such information can therefore be utilised for invading users’ privacy, thereby violating European Union Directive 95/46/EC on the protection of individuals with regard to the processing of personal and sensitive data. In order to avoid confusion, it is important to stress the difference between privacy and security: a piece of information is secure when its content is protected, whereas it is private when the identity of its owner is protected. Irrespective of the application domain (e.g. e-commerce, e-health), the major reservation of users about using the Internet is due to the lack of privacy rather than cost, difficulties in using the service or undesirable marketing messages. Considering that conventional security mechanisms, like encryption, cannot ensure privacy protection (encryption, for instance, can only protect the message’s confidentiality), new Privacy-Enhancing Technologies (PETs) have been developed. However, the sole use of technological countermeasures is not enough. For instance, even if a company that collects personal data stores them in an ultra-secure facility, the company may at any point in time decide to sell or otherwise disseminate the data, thus violating the privacy of the individuals involved. Therefore security and privacy are intricately related. Privacy as an expression of human dignity is considered a core value in democratic societies and is recognized either explicitly or implicitly as a fundamental human right by most of their constitutions.
Today, in many legal systems, privacy is in fact defined as the right to informational self-determination, i.e. the right of individuals to determine for themselves when, how, to what extent and for what purposes information about them is communicated to others. For reinforcing their right to informational self-determination, users need technical tools that allow them to manage their (partial) identities and to control what personal data about them is revealed to others under which conditions. Identity Management (IDM) can be defined to subsume all functionality that supports the use of multiple identities, by the identity owners (user-side IDM) and by those parties with whom the owners interact (services-side IDM). According to Pfitzmann and Hansen, identity management means managing various partial identities (i.e. sets of attributes, usually denoted by pseudonyms) of a person, i.e. administration of identity attributes including the development and choice of the partial identity and pseudonym to be (re-)used in a specific context or role (Pfitzmann and Hansen 2008). Privacy-enhancing identity management technologies enforcing the legal privacy principles of data minimisation, purpose binding and transparency have been developed within the EU FP6 project PRIME1 (Privacy and Identity Management for Europe) and the EU FP7 project PrimeLife2 (Privacy and Identity Management for Life). Trust has been playing an important role in PRIME and PrimeLife, because users not only need to trust their own platforms (i.e. the user-side IDM) to manage their data accordingly, but also need to trust that the services sides process their data in a privacy-friendly and secure manner and according to the business agreements with the users.
1 https://www.prime-project.eu/
2 http://www.primelife.eu/
In considering the forms of protection that are needed, it is important to recognise that user actions will often be based upon their perceptions of risk, which may not always align very precisely with the reality of the situation. For example, they may under- or over-estimate the extent of the threats facing them, or be under- or over-assured by the presence of technical safeguards. Some people simply need to be told that a service is secure in order to use it with confidence. Meanwhile, others will only be reassured by seeing an abundance of explicit safeguards in use. As such, if trust is to be established, the security and privacy measures need to be provided in accordance with what users expect to see and are comfortable to use in a given context. Furthermore, how much each person values her privacy is a subjective issue. When a bank uses the credit history of a client without her consent in order to issue a pre-signed credit card, it is subjective whether the client will feel upset about it and press charges for breach of the personal data protection act or not. Providing a way to model this subjective nature of privacy would be extremely useful for organisations, in the sense that they would be able to estimate the financial losses that they may experience after a potential privacy violation incident. This would allow them to reach cost-effective decisions about how much money to invest in security and privacy protection. This paper examines the relationship between the factors, and the different strategies that may be needed in order to provide an adequate foundation for users’ trust. It has been recognised that users often lack confidence that sufficient security and privacy safeguards can be delivered from a technology perspective, and therefore require more than a simple assurance that they are protected.
In this respect, Section 2 first investigates social trust factors for establishing reliable end user trust and then presents a Trust Evaluation Function, which utilises these trust factors and supports the user in reaching more informed decisions about the trustworthiness of online services. Even then, however, some users will not be satisfied with technology-based assurances. As a consequence, Section 3 considers the extent to which risk mitigation can be offered via routes such as insurance. The discussion concludes with Section 4, which highlights a series of further open issues that also require attention in order for trust to be more firmly and widely established.
2 Trust in Online Services
2.1 Users’ Perception of Security and Privacy and Lack of Trust
“Trust is important because if a person is to use a system to its full potential, be it an e-commerce site or a computer program, it is essential for her to trust the system” (Johnston et al. 2004). For establishing trust, a significant issue will be the user’s perception of security and privacy within a given context. Indeed, the way users feel about a given site or service is very likely to influence their ultimate decision about whether or not to use it. While some may be fully reassured by the presence of security technology, others may be more interested in other facts, such as the mitigation and restitution available to them in the event of breaches.
Research conducted in the UK as part of the Trustguide project has investigated citizens’ trust in online services and their resultant views on the risks present in this context (Lacohee et al. 2006). Trustguide was part-funded by the UK Government (through what was then the Department for Trade & Industry), and involved collaboration between British Telecom, HP Labs, and the University of Plymouth. The aim of the project was to better understand attitudes towards online security and thus enable the development of more effective ICT-based services. In order to investigate the issue, a series of ten focus groups were run with different types of UK citizen across six geographic locations. The categories of participant were: undergraduate students, postgraduate students, SMEs (three groups), farmers, ICT novices, ICT experts, and citizens (two groups). All of the groups followed the same discussion guide, and were professionally facilitated. The topic areas addressed included a significant focus upon trust (from the perspective of citizens’ own use of online services, as well as their trust in those organisations that might gather their private data), as well as surrounding issues such as identity management and authentication, and the collection, storage and protection of data. One of the most significant findings was that the degree to which trust and confidence could be built into systems from the users’ perspective was likely to be limited. The research actually revealed a high degree of distrust in ICT-based services, with the focus group participants repeatedly voicing a belief that it is impossible to guarantee that electronic transactions or stored data are secure against attack. Indicative examples of the comments that emerged on this theme are presented below:
“The real issue is that we know from our experience of the Internet and everything else that nobody has ever yet made anything secure.
Whatever kind of encryption you’ve got, it can be broken.”
“We know that no one has ever built a secure system, nothing held electronically can be secure. Banks (etc) should be more honest and say that data is never secure, and they should be open about the risks.”
“Given that it’s actually impossible to make a secure system, perhaps banks and all the rest of them should stop telling us that it is secure, and rather, they should be taking measures to try and make it as secure as possible but assume that sooner or later it will be hacked, it will be broken into and with that assumption in mind, then what are the procedures?”
Usability tests of privacy-enhancing identity management prototypes performed within the EU FP6 project PRIME have also shown that it is difficult to make people trust the claims about the privacy-enhancing features of the systems (see Fischer-Hübner and Pettersson 2004, Andersson et al. 2005). Although test users were first introduced to the aims and scope of privacy-enhancing identity management, the tests revealed that many of them did not trust the claim that the tested system would really protect their data and their privacy. Some participants voiced doubts over the whole idea of attempting to stay private on the Net: “Internet is insecure anyway” because people must get information even if it is not traceable by the identity management application, explained one test participant in a post-test interview. Another test subject stated: “It did not agree with my mental picture that I could buy a book anonymously”.
Exploring Trust, Security and Privacy in Digital Business
Another factor contributing to the lack of trust that was revealed by our usability tests was that test subjects generally had difficulty mentally differentiating between user side and services side identity management. In post-test interviews the test subjects sometimes referred to functionalities from both the website and the user side identity management system as if these were one. Consequently, they also had difficulty understanding that the user side identity management console, where the user can manage her electronic identities, can be trusted by the user because it is within the user’s control, whereas the website is under the service provider’s control. Similar findings of a lack of trust in privacy-enhancing technologies were also reported by others, e.g. by Günther and Spiekermann in a study on the perception of user control with privacy-enhancing identity management solutions for RFID environments, even though the test users considered the PETs in this study fairly easy to use (Günther and Spiekermann 2005). To help users evaluate the trustworthiness of a services side, the focus has to be on mediating to the users those factors that measure the services side’s actual trustworthiness and that support trustworthy behavior of a side (Riegelsberger et al. 2005).

2.2 Social Trust Factors

In this section, we investigate suitable parameters corresponding to social trust factors for measuring the actual trustworthiness of a communication partner, in terms of privacy practices and of reliability as a business partner, and for establishing reliable trust. Social trust factors in the context of e-Commerce have already been researched by others. For instance, Turner (2001) showed that for ordinary users to feel secure when transacting with a website the following factors play a role:
1. the company’s reputation,
2. their experiences with the website, and
3. recommendations from independent third parties.
Riegelsberger et al. (2005) present a trust framework which is based on contextual properties (based on temporal, social and institutional embeddedness) and the services side’s intrinsic properties (ability, motivation based on internalized norms, such as privacy policies, and benevolence) that form the basis of trustworthy behavior. Temporal embeddedness can be signalled by visible investment in the business and the side, e.g. as visualised by professional website design, which can also be seen as a symptom of the vendor’s intrinsic property of competence or ability to fulfill a contract. Taking into consideration the phenomenon that many users have problems differentiating between user and services side, these factors of professional design should in general be taken into account for the UI design of the PrimeLife trust evaluation function, even though it is not part of the vendor’s website but part of the user side identity management system. Social embeddedness, i.e. the exchange of information about a side’s performance among users, can be addressed by reputation systems. Institutional embeddedness refers to the assurance of trustworthiness by institutions, as done with trust seal programs. A model of social trust factors, which was developed by social science researchers in the PRIME project (Leenes et al. 2005), (Andersson et al. 2005), has identified 5 layers on which trust plays a role in online services: socio-cultural, institutional, service area, application, and media. Service area-related trust aspects, which concern the trust put in a particular branch or sector of economic activity, as well as socio-cultural trust aspects
S. Fischer-Hübner, S. Furnell, and C. Lambrinoudakis
can, however, not be directly influenced by system designers. More suitable factors for establishing reliable trust can be achieved on the institutional and application layers of the model, which also refer to trust properties (contextual property based on institutional embeddedness as well as certain intrinsic properties of a web application) of the framework by Riegelsberger et al. (2005). As discussed by Leenes et al. (2005), on the institutional layer, trust in a service provider can be established by monitoring and enforcing institutions, such as data protection commissioners, consumer organisations and certification bodies. Besides, on the application layer, trust in an application can be enhanced if procedures are clear, transparent and reversible, so that users feel in control. This latter finding also corresponds to the results of the aforementioned Trustguide project, which also provides guidelines on how cybertrust can be enhanced and also concludes that increased transparency brings increased user confidence. Moreover, rather than receiving assurances of technological security, which many would perceive to be unrealistic, the research of the Trustguide project suggests that three other elements would contribute to helping people feel more secure in their use of e-commerce transactions:
• confidence that restitution can be made by a third party. Hence, measures that are in place in the event of something going wrong should be clearly stated;
• assurances about what can and cannot be guaranteed;
• the presence of fallback procedures if something goes wrong.
2.3 A Trust Evaluation Function

In this section, we will present a trust evaluation function that has been developed within the PrimeLife EU project. This function has the purpose of communicating reliable information about trustworthiness and assurance (that the stated privacy functionality is provided) of services sides. For the design of this trust evaluation function, we have followed an interdisciplinary approach by investigating social factors for establishing reliable trust, technical and organizational means, as well as HCI concepts for mediating evaluation results to the end users.

Trust Parameters Used: Taking results of the studies on social trust factors presented in section 2.2 as well as available technical and organisational means into consideration, we have chosen the following parameters for evaluating the trustworthiness of communication partners that mainly refer to the institutional and application layers of the social trust factor model. Information provided by trustworthy independent monitoring and enforcing institutions, which we are utilising for our trust evaluation function, comprises:
• Privacy and trust seals certified by data protection commissioners or independent certifiers (e.g., the EuroPrise seal³, the TRUSTe seal⁴ or the ULD Gütesiegel⁵).
• Blacklists maintained by consumer organisations (such blacklists exist for example in Sweden and Denmark).
• Privacy and security alert lists, such as lists of alerts raised by data protection commissioners or Google’s anti-phishing blacklist.

³ https://www.european-privacy-seal.eu/
⁴ http://www.truste.org/
⁵ https://www.datenschutzzentrum.de/guetesiegel/index.htm
The European Consumer Centres have launched a web-based solution, Howard the owl, for checking trust marks and other signs of trustworthiness, which could also be used when evaluating a web shop⁶. Static seals can be complemented by dynamic seals (generated in real time) conveying assurance information about the current security state of the services side’s system and its implemented privacy and security functions. Such dynamic seals can be generated in real time by an “Assurance Evaluation” component that has been implemented within the PRIME framework (Pearson 2006). Dynamic seals that are generated by tamper-resistant hardware can be regarded as third-party endorsed assurances, as the tamper-resistant hardware device can be modeled as a third party that is not under full control of the services side. Such dynamic assurance seals can measure the intrinsic property of a side’s benevolence to implement privacy-enhancing functionality. Such functionality can also comprise transparency-enhancing tools that allow users to access their personal data online and to request its rectification or deletion (as implemented within the PrimeLife project), which will allow users to “undo” personal data releases and to feel in control. As discussed above, this is an important prerequisite for establishing trust. For our trust evaluation function, we therefore used dynamic assurance seals informing about the PrimeLife privacy-enhancing functions that the services side’s system has implemented. Reputation metrics based on other users' ratings can also influence user trust, as discussed above. Reputation systems, such as the one in eBay, can however often be manipulated by reputation forging or poisoning. Besides, the calculated reputation values are often based on subjective ratings by non-experts, for whom it might, for instance, be difficult to judge the privacy-friendliness of communication partners.
So far, we have therefore not considered reputation metrics for the PrimeLife trust evaluation function, even though we plan to address them in future research and versions of trust evaluations within the PrimeLife project. Following the process of trust and policy negotiation of the PRIME technical architecture (on which PrimeLife systems are also based), privacy seals, which are digitally signed by the issuing institution, as well as dynamic assurance seals can be requested from a services side directly (see steps 4-5 in Figure 1), whereas information about blacklisting and alerts needs to be retrieved from the third party list providers (see steps 6-7 in Figure 1). After the user requests a service (step 1), the services side replies with a request for personal data and a proposal of a privacy policy (step 2). For evaluating the side’s trustworthiness, the user can then in turn request trust and assurance data and evidence from the services side, such as privacy seals and dynamic assurance seals (steps 4-5), and information about blacklisting or alerts concerning this side from alert list or blacklist providers (steps 6-7). Information about the requested trust parameters is then evaluated at the user side and displayed via the trust evaluation user interfaces along with the privacy policy information of the services side within the “Send Personal Data?” dialogue window (see below), with which the user’s informed consent for releasing the requested data for the stated policy is also solicited. Based on the trust evaluation results and policy information, the user can then decide on releasing the requested personal data items and possibly adapt the proposed policy, which is then returned to the service provider (step 8).

⁶ ready21.dev.visionteam.dk
Fig. 1. Privacy and trust policy negotiation in PRIME and PrimeLife
Design Principles and Test Results: For the design of our trust evaluation function mock-ups, we followed design principles comprising general HCI principles as well as principles that should in particular address challenges and usability problems that we have encountered in previous usability tests:
• Use a multi-layered structure for displaying evaluation results, i.e. trust evaluation results should be displayed in increasing detail on multiple layers in order to prevent an information overload for users not interested in the details of the evaluation. Our mockups have been structured into three layers: a short status view with the overall evaluations for inclusion in status bars and in the “Send Personal Data?” window, which also displays the services side’s short privacy policy and data request (1st layer, see Figure 2); a compressed view displaying the overall results within the categories privacy seals, privacy & security alert lists, support of PRIME functions and blacklisting (2nd layer); and a complete view showing the results of sub categories (3rd layer, see Figure 3).
Fig. 2. “Send Personal Data?” window displaying the overall trust evaluation result (1st Layer)
• Use a selection of meaningful overall evaluation results. For example, in our mockups, we use a trust meter with a range of three possible overall evaluation results whose names convey their meaning (which should be more meaningful than, for instance, percentages as used by some reputation metrics). The three overall results that we are using are (see trust meter in Figure 2):
  o “Poor”, symbolised with a sad-looking emoticon and red background colour (if there are negative evaluation results, i.e. the side is blacklisted or appears on alert lists);
  o “Good”, symbolised with a happy-looking smiley and green background colour (if there are no negative, but some positive results, i.e. the side has a seal or supports PrimeLife functions and is not appearing on black/alert lists);
  o “Fair”, symbolised with a white background colour (for all other cases, i.e. the side has no seal, is not supporting PrimeLife functions, and is not appearing on black/alert lists).
• Make clear who is evaluated. This is especially important because, as we mentioned above, our previous usability tests have revealed that users often have difficulty differentiating between user and services side (Pettersson et al. 2005). Hence, the user interface should make clear by its structure (e.g., by surrounding all information referring to a requesting services side, as illustrated in the “Send Personal Data?” window in Figure 2), and by wording, that the services side and not the user side is evaluated. If this is not made clear, a bad trust evaluation result for a services side might also lead to reduced trust in the user side IDM system.
Fig. 3. Complete view of the Trust Evaluation Function (3rd layer) displaying the overall results
• Structure the trust parameters visible on the second and third layers into the categories “Business reliability” (comprising the parameter “blacklisted”) and “Privacy” (comprising the parameters of security & privacy alert lists, privacy seals and PrimeLife function support). This structure should illustrate that the trust parameters used have different semantics and that scenarios with companies that are “blacklisted” for bad business practices, even though they have a privacy seal and/or support PrimeLife functions, do not have to be contradictory, as they refer to different aspects of trustworthiness.
• Inform the user without unnecessary warnings: our previous usability tests showed that extensive warnings can be misleading and can even result in users losing their trust in the PrimeLife system. It is a very difficult task for the systems designer to find a good way of showing an appropriate level of alerting: for instance, if a web vendor lacks any kind of privacy seal, this in itself is not a cause for alarm, as most sites at present do not have any kind of trust sealing. We also did not choose the colour “yellow” for our trust meter for symbolizing such an evaluation result that we called “fair” (i.e. we did not use the traffic light metaphor), as yellow already symbolises a state before an alarming “red” state.

First usability tests for three iterations of our PrimeLife trust evaluation function mockups were performed in the Ozlab testing environment of Karlstad University in two rounds with ten test persons each and one round with 12 test persons. The tests clearly showed that such a function is much appreciated by end users. The presentation of overall evaluation results on the top level, especially the green and red emoticons, as well as the fact that the services side was evaluated, were well understood. Some users, though, had problems understanding the “neutral” evaluation result (in case a side has no seal, is not supporting PrimeLife functions, is not blacklisted and does not appear on alert lists), which we first phrased as “ok”, and then as “fair”. However, in the post-test interviews, there were no clear preferences for other names (such as “Not bad”, “No alert”). Hence, the illustration of “neutral” results is one of the most difficult issues and still needs to be investigated further (see also (Fischer-Hübner et al. 2009)).
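The rule behind the three-valued trust meter described in the design principles above can be written down compactly. The following is only a minimal sketch of that decision rule, not the PrimeLife implementation: the class name `TrustParameters`, its field names and the function `overall_rating` are hypothetical and chosen for illustration, and each parameter is assumed to be already reduced to a yes/no value.

```python
from dataclasses import dataclass


@dataclass
class TrustParameters:
    """Hypothetical container for the four trust parameters named in the text."""
    blacklisted: bool                   # listed on a consumer organisation's blacklist
    on_alert_list: bool                 # listed on a privacy/security alert list
    has_privacy_seal: bool              # holds a valid privacy or trust seal
    supports_primelife_functions: bool  # dynamic seal confirms PrimeLife functions


def overall_rating(p: TrustParameters) -> str:
    """Map the trust parameters to the overall result shown on the trust meter."""
    # Any negative result dominates, even if a seal is present.
    if p.blacklisted or p.on_alert_list:
        return "Poor"
    # No negative result and at least one positive result.
    if p.has_privacy_seal or p.supports_primelife_functions:
        return "Good"
    # Neither negative nor positive evidence: the "neutral" case.
    return "Fair"


if __name__ == "__main__":
    shop = TrustParameters(blacklisted=False, on_alert_list=False,
                           has_privacy_seal=True,
                           supports_primelife_functions=False)
    print(overall_rating(shop))
```

Checking the negative results first reflects the description of “Poor” as applying whenever the side is blacklisted or appears on alert lists; the second and third layers of the interface would still show the individual parameters separately.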
3 Mitigating and Transferring Security and Privacy Risks

The fact that many participants in the Trustguide project were nonetheless using services that they did not ultimately trust (from a technological perspective) was frequently linked to their beliefs that risk was mitigated in some other way. This was most clearly evident in relation to financial transactions involving credit cards, as illustrated by the following quotes:
“If I’m buying something online with a credit card in the back of my mind, even if I don’t trust the site, is that my credit card is protected against that, they say in their blurb that if there is an unfortunate incident like that then they will pay.”
“I’m not worried about giving out my card details, I’ve got mitigation insurance, if a card gets cloned, kill the account, it’s the bank’s problem.”
Thus, from the users’ perspective there will be a demonstrable reduction in perceived risk in cases where the responsibility for handling any negative outcome is thought to rest with a third party. In reality, however, there will be limits to the extent to which such beliefs can realistically hold true. For example, both of the viewpoints quoted above overlook the potential for wider impacts, which the bank or credit card company may not be able to rectify. Even if the bank can prevent the victim from suffering the direct financial impact of an incident, it could still take months to clear up issues such as consequent damage to credit ratings. Past estimates of the cost of identity theft have suggested that such incidents cost victims an average of $808 and require 175 hours of effort to put things right (Benner et al. 2000).
The Trustguide research as a whole revealed that assumptions regarding naivety of users are becoming less valid, and although many may be novices from the perspective of ICT knowledge, this does not mean that they are innocent in their use of online services. Indeed, many are well informed regarding the potential risks, and do not enter with a belief that they are secure. Their engagement is based upon a personal risk assessment (albeit conducted to varying levels of detail and effectiveness), in which they are weighing the perceived risk against the potential benefits, and engagement is more likely to occur in cases where they can clearly see the potential problems and understand how they could be rectified. There are, of course, two challenges posed by these findings. Firstly, the effects of some security breaches cannot be easily mitigated (e.g. overcoming a case of identity theft may be a non-trivial proposition, and beyond the capabilities that a single online service provider could guarantee to provide), and thus the user really needs to fall back upon a primary reliance upon the security technology to prevent certain categories of incident from occurring in the first place. Secondly, if the user’s default position is to consider that the technology cannot provide them with sufficient protection, it places them at a disadvantage in terms of placing trust in technologies that may be genuinely fit for purpose (i.e. they will not be getting as much reassurance from the presence of the technology as they should do). From the organisations’ side, the typical reaction of IT officials to the rapidly increasing number of threats and the highly sophisticated methods utilised for realising new attacks is to protect their systems through a series of technical security measures. 
However, in the absence of a scientifically sound methodology for evaluating the cost-effectiveness of the security measures employed, the problem is that they are unable to quantify the security level of their system and thus to determine the appropriate amount that they should invest for its protection. Cavusoglu et al. (2004) have calculated that, on average, compromised organisations lost approximately 2.1% of their market value within two days from the day of the incident. Moitra and Konda (2003) have demonstrated that as organisations start investing in information system security their protection increases rapidly, while it increases at a much slower rate as the investments reach a much higher level. It is therefore essential to facilitate ways for identifying how far organisations should go in investing for security, as well as for evaluating the effectiveness of the security measures that they implement. An additional issue for organizations that collect, store and process personal and/or sensitive data of their customers is the protection of their customers’ privacy. Some people are seriously concerned about privacy protection issues, while others are not. This diversity results in different estimations of the consequences that may occur in case of a privacy violation incident. It would therefore be useful to provide organizations with appropriate models for estimating the expected impact level, in terms of the compensation that an individual may claim after a privacy breach. However, privacy valuation is by no means a trivial issue. A usual security incident may be valued in an objective way. For example, if due to some security incident the internet site of a company is unavailable for 1 hour, then it is quite straightforward to estimate the financial value of this incident in an objective manner by statistical estimation of the possible number of clients and their potential purchases within this period.
The situation is more difficult when it comes to privacy. For instance, when somebody’s telephone number is disclosed, it is rather subjective whether she should care or not, or whether she will decide to press charges and ask for compensation. And given that this has happened, what is the likely amount of the compensation? Again, this is a very personal matter (no matter whether the court grants the compensation or not) and should somehow be related to how much the particular client values her privacy. Given that the beliefs of users (and indeed organisations) are turning towards risk mitigation rather than a reliance upon technological safeguards, it is relevant to consider alternative options that exist and how they can be used. One such option for organizations is to insure their information systems against potential security and privacy violation incidents, aiming to balance the consequences that they will experience, in terms of financial losses, through the compensation that they will get from the insurance company. It should be emphasized that such an approach cannot and will not “replace” the technical security and privacy-enhancing measures; it will complement them. Even in that case, though, the difficulty for the insurance company is the calculation of the appropriate premium.

3.1 Insuring an Information System: Premium Calculation

Recently, there has been considerable interest from the Economics community in addressing the issue of insurance contracts for information systems. For instance, Anderson (2001) applies economic analysis and employs the language of microeconomics (network externalities, asymmetric information, moral hazard, adverse selection, liability dumping etc.) for explaining a number of phenomena that security researchers had previously found to be pervasive but perplexing. Gordon and Loeb (2002) present an economic model for determining the optimal amount to invest for protecting a given set of information. Finally, Varian (2004) constructs a model based on economic agents’ decision-making on effort spent, to study systems reliability.
To calculate a premium that covers a car against theft or fire, an insurance company must, at least, have an accurate estimate of the car’s current value. If the client provides additional information, for instance that a car alarm is installed, this is evaluated by the insurance company and may result in a reduced premium. By analogy, in order to calculate the premium for an information system, an insurance company will seek the following information:
• What is the financial loss that the organisation will experience as a result of every possible security incident?
• How secure (i.e. how well protected against potential risks) is the information system?
However, none of the above questions can be answered in a straightforward and accurate way, mainly because of the following facts:
a) Every day new threats are appearing. How can someone quantify the consequences of a potential security incident if she doesn’t even know which are the major threats that the information system is facing?
b) The effectiveness of a security measure cannot be presented in quantitative terms. It can only be evaluated during real attacks against the system, after it has been installed and integrated into the system’s operation. However, even in this case the evaluation cannot be accurate since there is no way to know if a specific security measure has really prevented a security incident or not. This is analogous to a home alarm system. If there is no record of a theft attempt, we don’t really know if this is because the home alarm has prevented it or because it simply didn’t happen irrespective of the home alarm.
c) Finally, the environment of the information system has a significant impact on both the number and severity of potential threats and on the effectiveness of the security measures. For instance, the security requirements identified for an internet-based system are not the same if a wireless network is utilised instead. Also, an authentication mechanism may be extremely effective for the internet-based system but not for the wireless environment.

In (Lambrinoudakis et al. 2005) a probabilistic structure in the form of a Markov model is presented for facilitating the estimation of the insurance premium as well as the valuation of the security investment. Let us assume that the information system may end up in one of $N$ different states after possible security incidents that affect a single asset $A_k$, where $k = 1, \ldots, M$. We will denote these states by $i$, where $i = 1, \ldots, N$. By $i = 0$ we will denote the state where no successful attack has been made on the information system and thus it is fully operational. We assume that at time $t = 0$ the information system is in the fully operational state $i = 0$ and as time passes it will end up in different states of non-fully operational status, that is, it will end up in one of the states $i = 1, \ldots, N$. We assume that the transition rates from state 0 to any other state $i$, as a result of a security incident compromising asset $A_k$, are known. Furthermore, the impact (financial loss) $L_i$ of every possible security incident has been computed.
Assuming that the transitions allowed are from the fully operational state to some other non-fully operational state and that the non-fully operational states are absorbing states, the use of the Markov model allows us to find the probability of the system being in different states and thus find the probability of different financial losses (Lambrinoudakis et al. 2005). Let us assume that the organisation has a utility function for its data, let us say $U$. Since in this simple model we assume that all consequences resulting from a security incident can be translated to financial losses, it is reasonable to assume that this utility function expresses the views of the organisation towards financial losses, i.e. it provides a way to evaluate how important a financial loss caused by a security incident is for the organisation. Then the optimal security investment can be calculated by maximizing the expected utility for the organisation:

$$\max_{I} \; E\left[\, U\big(W - L(I) - I\big) \,\right]$$

where:
$I$ is the maximum amount available for security measures,
$W$ is the initial wealth of the company, and
$L$ is the expected loss, which of course depends on the amount $I$.

Similarly, the optimal insurance contract should satisfy the following equation, which can thus be utilised for calculating the insurance premium:

$$U(W - \pi) = E\left[\, U(W - L + C - \pi) \,\right]$$

where:
$W$ is the initial wealth of the company,
$\pi$ is the premium that the company has to pay to the insurer,
$L$ is the expected loss, and
$C$ is the compensation that the insurer will pay in case of a security incident.
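To make the state probabilities of the Markov model above concrete, the following sketch computes them in closed form under one simplifying assumption that goes beyond the text: the transition rates out of state 0 are constant in time. With rates $\lambda_1, \ldots, \lambda_N$ and $\Lambda = \sum_i \lambda_i$, the system is still in state 0 at time $t$ with probability $e^{-\Lambda t}$ and has been absorbed in state $i$ with probability $(\lambda_i / \Lambda)(1 - e^{-\Lambda t})$. The function and parameter names (`state_probabilities`, `expected_loss`, `rates`, `losses`) are hypothetical and are not taken from (Lambrinoudakis et al. 2005).

```python
import math


def state_probabilities(rates, t):
    """Return [P_0(t), P_1(t), ..., P_N(t)] for constant transition rates.

    rates[i - 1] is the assumed constant rate from the fully operational
    state 0 to the absorbing state i; t is the elapsed time.
    """
    if t < 0 or any(r < 0 for r in rates):
        raise ValueError("rates and time must be non-negative")
    total = sum(rates)
    if total == 0:
        # No incident can occur: the system stays fully operational.
        return [1.0] + [0.0] * len(rates)
    p0 = math.exp(-total * t)
    # Each absorbing state receives a share of the probability of having
    # left state 0, proportional to its rate.
    return [p0] + [(r / total) * (1.0 - p0) for r in rates]


def expected_loss(rates, losses, t):
    """Expected financial loss at time t, where losses[i - 1] is L_i."""
    probs = state_probabilities(rates, t)
    return sum(p * loss for p, loss in zip(probs[1:], losses))


if __name__ == "__main__":
    # Two hypothetical incident types with rates per year and losses in EUR.
    print(state_probabilities([0.1, 0.3], t=2.0))
    print(expected_loss([0.1, 0.3], [1000.0, 5000.0], t=2.0))
```

The resulting distribution over losses is what the expected utility in the two equations above would be taken over; how the rates depend on the investment $I$ is left open here, as the text does not specify it.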
This approach is useful only in cases where the transition rates and the loss (impact value) figures are accurate (objective).

3.2 Insuring an Information System: Taking into Account Privacy Violation Incidents

If we then want to study the risk that a firm is undergoing due to potential privacy incidents that it may have caused, we definitely need a subjective theory that will allow us to find out how much individuals value their privacy. In an attempt to describe this complicated task, keeping it as free as possible from technicalities, Yannakopoulos et al. (2008) introduce a simple model that incorporates the personalized view of how individuals perceive a possible privacy violation and, if that happens, how much they value it. Our basic working framework is the random utility model (RUM) that has been extensively used in the past for modelling personalized decisions regarding financial issues, for instance: how much is someone prepared to pay for this type of car? This model takes into account cases where the same individual may at one time consider a privacy violation as annoying whereas at another time she may not be bothered by it at all. Therefore it allows for the subjective nature of the problem. We assume that the individual $j$ may be in two different states: State 0 refers to the state where no personal data is disclosed, while State 1 refers to the state where personal data has been disclosed. The level of satisfaction of the individual $j$ in state $i = 0, 1$ is given by the random utility function $u_{i,j}(y_j, z_j) + \varepsilon_{i,j}$, where:
$y_j$ is the income (wealth) of the individual, and
$z_j$ is a vector related to the characteristics of the individual, e.g. age, occupation, technology aversion etc.
The term $\varepsilon_{i,j}$ is considered to be a random variable and models the personalized features of the individual $j$ at state $i$.
State 1, the state of privacy loss, will be disturbing to individual $j$ as long as $u_{1,j}(y_j, z_j) + \varepsilon_{1,j} < u_{0,j}(y_j, z_j) + \varepsilon_{0,j}$, and that may happen with probability $P\big(\varepsilon_{1,j} - \varepsilon_{0,j} < u_{0,j}(y_j, z_j) - u_{1,j}(y_j, z_j)\big)$. This is the probability that an individual will be bothered by a privacy violation; it may be calculated as long as we know the distribution of the error term, and it will depend on the general characteristics of the individual. Given that an individual $j$ is bothered by a privacy violation, how much would she value this privacy violation, i.e. how much would she like to be compensated for it? If the compensation is $C_j$, then it would satisfy the random equation $u_{1,j}(y_j + C_j, z_j) + \varepsilon_{1,j} = u_{0,j}(y_j, z_j) + \varepsilon_{0,j}$, the solution of which will yield a random variable $C_j$, which is the (random) compensation that an individual may ask for a privacy violation. The distribution of the compensation will depend on the distribution of the error terms as well as on the functional form of the deterministic part of the utility function. We now assume that a series of claims $C_j$ may arrive at certain random times $N_j$. We need a satisfactory model for $N_j$ and for $N(t)$, the total number of claims up to time $t$. Assuming that the distribution of the arrival times is modelled as a Poisson distribution $\mathrm{Pois}(\lambda)$, the total claim up to time $t$ will be given by the random sum:

$$L(t) = \sum_{i=1}^{N(t)} C_i$$
The distribution of $L(t)$ depends on the distribution of the $C_i$ and on the distribution of the counting process $N(t)$, assuming that our population is homogeneous, i.e. the $C_i$'s are independent. Assuming independence between $N(t)$ and the size of the arriving claims $C_j$, we may calculate the expected total claim and its variance:

$$E[L(t)] = E[N(t)]\,E[C]$$

$$\mathrm{Var}(L(t)) = \mathrm{Var}(N(t))\,(E[C])^2 + E[N(t)]\,\mathrm{Var}(C)$$
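As an illustration of the two moment formulas, the following minimal sketch evaluates them for a Poisson claim count and compares the result with a small simulation. The exponential claim-size distribution, the numerical values and all function names are assumptions made for this example only; they are not taken from (Yannakopoulos et al. 2008).

```python
import random
from statistics import fmean, pvariance


def compound_moments(mean_n, var_n, mean_c, var_c):
    """Mean and variance of L = C_1 + ... + C_N, with N independent of the i.i.d. C_i."""
    mean_l = mean_n * mean_c
    var_l = var_n * mean_c ** 2 + mean_n * var_c
    return mean_l, var_l


def simulate_total_claim(rng, lam, t, mean_claim):
    """One draw of L(t): Poisson arrivals at rate lam, exponential claim sizes (assumed)."""
    total = 0.0
    clock = rng.expovariate(lam)            # time of the first claim
    while clock <= t:
        total += rng.expovariate(1.0 / mean_claim)
        clock += rng.expovariate(lam)       # exponential inter-arrival times
    return total


lam, t, mean_claim = 3.0, 2.0, 100.0        # illustrative values only
# For a Poisson count, E[N(t)] = Var(N(t)) = lam * t;
# for an exponential claim, Var(C) = E[C]^2.
mean_l, var_l = compound_moments(lam * t, lam * t, mean_claim, mean_claim ** 2)

rng = random.Random(0)
draws = [simulate_total_claim(rng, lam, t, mean_claim) for _ in range(20000)]
print("closed form:", mean_l, var_l)
print("simulation: ", fmean(draws), pvariance(draws))
```

The simulated mean and variance should lie close to the closed-form values; the agreement is only statistical and tightens as the number of draws grows.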
Consider that an organization enters into an insurance contract with an insurance firm that undertakes the total claim $X = L(t)$ that its clients may ask for, as a consequence of privacy breaches, over the period t of validity of the contract. This of course should be done at the expense of a premium paid by the IT business to the insurer, but how much should this premium be? Examples of premium calculations are $\pi(X) = (1 + \alpha)\,E[X]$ or $\pi(X) = E[X] + f(\mathrm{Var}(X))$, where $\alpha$ is called the safety loading factor, and usual choices for $f$ are $f(x) = a x$ or $f(x) = a\sqrt{x}$ for a constant $a$. Since the expectation and the variance of $X = L(t)$ can be calculated within the context of the Random Utility Model, the premium calculation is essentially done.
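The premium rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, the loading values are arbitrary, and reading the second choice of f as a multiple of the square root of the variance (the standard-deviation principle) is an assumption about the intended formula.

```python
import math


def expected_value_premium(mean_x, alpha):
    """pi(X) = (1 + alpha) * E[X], with alpha the safety loading factor."""
    return (1.0 + alpha) * mean_x


def variance_premium(mean_x, var_x, a):
    """pi(X) = E[X] + f(Var(X)) with f(x) = a * x."""
    return mean_x + a * var_x


def std_dev_premium(mean_x, var_x, a):
    """pi(X) = E[X] + f(Var(X)) with f(x) = a * sqrt(x) (assumed reading)."""
    return mean_x + a * math.sqrt(var_x)


# Illustrative moments of the total claim X = L(t); not data from the paper.
mean_x, var_x = 600.0, 120000.0
print(expected_value_premium(mean_x, alpha=0.2))
print(variance_premium(mean_x, var_x, a=0.001))
print(std_dev_premium(mean_x, var_x, a=0.5))
```

The first rule ignores the variability of the claims, whereas the other two charge more for a riskier portfolio with the same expected loss.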
4 Conclusions and Further Issues

The paper has highlighted the importance of ensuring trust in digital business and identified the important contributions made by security and privacy in this context. It has also demonstrated that while technological safeguards have a part to play in providing the trust basis, they cannot be considered to be a complete solution from the user perspective. It should also be recognised that although the technological aspects presented in this paper represent key issues, they are far from the only ones that need to be considered. Indeed, several research challenges remain open, and some further areas of current activity are listed below:

• Multi-Lateral Secure Reputation Systems: As discussed above, reputation schemes based on other users' ratings can influence user trust. For this, meaningful reputation metrics and ways of aggregating reputation values in a fair manner are needed, which protect against reputation poisoning and forging attacks. Moreover, reputation schemes need to be transferable and interoperable (allowing reputations to be transferred between communities/groups), and as reputation information
also comprises personal data, reputation schemes need to be developed that are privacy-respecting. Reputation systems addressing the latter requirements are currently being researched within the scope of the FIDIS⁷ and PrimeLife EU projects (see (Steinbrecher 2009)).

• Transparency Tools for Enhancing Privacy for End Users: Transparency of personal data processing is not only a basic privacy requirement that can be derived from European data protection legislation (cf. the rights of data subjects to be informed/notified about the processing of their data (Art. 10-11 EU Directive 95/46/EC) and the right to access their data (Art. 12 EU Directive 95/46/EC)). As discussed in section 2.2, increased transparency also brings increased user confidence and trust. Transparency tools for enhancing privacy include privacy policy evaluation tools (e.g., based on the P3P standard (P3P 2006)) and systems that keep track, on behalf of users, of what personal data they have released to which services sides under which privacy policy, and that allow users to access and correct or delete online their data stored at services sides (such as the Data Track developed in PRIME (Pettersson et al. 2006)). Further research is currently being conducted within the PrimeLife EU project on tools for informing users about the privacy implications of future actions, based on computing the linkability of their transactions (Hansen 2008b). In addition, transparency tools are currently being developed within the PrimeLife project that allow users to check whether their data have been processed in a legally compliant manner; for this, secure logs have to be stored at the services sides that can only be accessed by the users or their proxies (e.g. data protection commissioners). Further research and development on tools for informing users about passive (hidden) data collection will also be needed for preserving transparency in future ubiquitous computing environments. (See (Hansen 2008a), (Hedbom 2009), (Hildebrandt 2009) for further surveys of transparency tools for enhancing privacy.)
• Usable security and privacy: Given that many of the reservations about performing online transactions relate to a lack of confidence in technology, users need to be in a position to accept and trust the safeguards that are provided to compensate. However, it is widely recognised that much of the reassurance that ought to be provided by security technologies can quickly be undermined if users cannot actually understand or use them appropriately (Furnell et al. 2006). Usability challenges can arise from a variety of directions, including over-reliance upon users’ technical knowledge and capability, presenting solutions that are over-complex or cumbersome to use, and technologies that simply get in the way of what the user is actually trying to do. These aspects consequently introduce the possibility of mistakes and mis-configuration, as well as the potential for safeguards to be turned off altogether if they are perceived to be too inconvenient. Usable solutions therefore demand attention in terms of clarity at the user interface level, as well as in terms of limiting the overheads that may be introduced in terms of system performance and interruptions to user activity.
⁷ www.fidis.net
• Multilateral security addressing privacy & trust in social communities: Social communities, such as online auctions and marketplaces, are increasingly used for conducting Digital Business. In addition, companies increasingly perform social network analyses for profiling business partners or job applicants, or for the purpose of conducting direct marketing with the help of social network profiles. Whereas today’s privacy-enhancing technologies usually enforce the privacy principle of data minimisation (i.e. allowing users to release as little data as possible), novel kinds of privacy-enhancing technologies will be needed for protecting social community users, who often prefer to present themselves in their social communities (i.e. their primary goal is usually not data minimisation). Furthermore, tools for evaluating the trustworthiness of social community partners will be needed. In the context of social communities, further research on lifelong privacy needs to be conducted. Due to the low costs and technical advances of media storage, masses of data can easily be stored and processed, and are hardly ever deleted or forgotten. Therefore the question remains how a “right to start over” can be enforced in future. Finally, after one’s death, legal and technical questions of digital heritage (Who inherits my data/social community account, and how can the ownership be transferred?) still remain.
Successful attention to these issues, in conjunction with those discussed in the preceding sections, will enable significantly more flexible and friendly approaches to achieving security and privacy in digital business. Combined with the additional reassurance that users can obtain from risk mitigation and transfer options, the result will be a far more comprehensive foundation for trust in the related online services.

Acknowledgements. Thanks are due to the PrimeLife Activity 4 (HCI) project partners, particularly John Sören Pettersson, Erik Wästlund, Jenny Nilsson, Maria Lindström, Christina Köffel and Peter Wolkerstorfer, who contributed to the discussion of the design of the Trust Evaluation Function. Jenny Nilsson and Maria Lindström also conducted most of the usability tests for this function. We would also like to express our thanks to our colleagues Athanassios Yannakopoulos, Stefanos Gritzalis and Sokratis Katsikas, who have contributed to the work on privacy insurance contract modelling. Parts of the research leading to these results have received funding from the EU 7th Framework programme (FP7/2007-2013) for the project PrimeLife. The information in this document is provided "as is", and no guarantee or warranty is given that the information is fit for any particular purpose. The PrimeLife consortium members shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials subject to any liability which is mandatory due to applicable law.
References

1. Anderson, R.: Why Information Security is Hard – An Economic Perspective. In: 17th Annual Computer Security Applications Conference, New Orleans, Louisiana (2001)
2. Andersson, C., Camenisch, J., Crane, S., Fischer-Hübner, S., Leenes, R., Pearson, S., Pettersson, J.S., Sommer, D.: Trust in PRIME. In: Proceedings of the 5th IEEE Int. Symposium on Signal Processing and IT, Athens, Greece, December 18-21 (2005)
3. Benner, J., Givens, B., Mierzwinski, E.: Nowhere to Turn: Victims Speak Out on Identity Theft. CALPIRG/Privacy Rights Clearinghouse Report (May 2000)
4. Cavusoglu, H., Mishra, B., Raghunathan, S.: The effect of internet security breach announcements on shareholder wealth: Capital market reactions for breached firms and internet security developers. To appear in International Journal of Electronic Commerce (2004)
5. Fischer-Hübner, S., Pettersson, J.S., Bergmann, M., Hansen, M., Pearson, S., Casassa-Mont, M.: In: Aquisti, et al. (eds.) Digital Privacy – Theory, Technologies, and Practices. Auerbach Publications (2008)
6. Fischer-Hübner, S., Köffel, C., Wästlund, E., Wolkerstorfer, P.: PrimeLife HCI Research Report, Version V1, PrimeLife EU FP7 Project Deliverable D4.1.1 (February 26, 2009)
7. Furnell, S.M., Jusoh, A., Katsabas, D.: The challenges of understanding and using security: A survey of end-users. Computers & Security 25(1), 27–35 (2006)
8. Gordon, L., Loeb, M.: The Economics of Information Security Investment. ACM Transactions on Information and System Security 5(4), 438–457 (2002)
9. Günther, O., Spiekermann, S.: RFID and the perception of control: The consumer’s view. Communications of the ACM 48(9), 73–76 (2005)
10. Hansen, M.: Marrying transparency tools with user-controlled identity management. In: Proc. of Third International Summer School organized by IFIP WG 9.2, 9.6/11.7, 11.6 in cooperation with FIDIS Network of Excellence and HumanIT, Karlstad, Sweden, 2007. Springer, Heidelberg (2008)
11. Hansen, M.: Linkage Control – Integrating the Essence of Privacy Protection into Identity Management Systems. In: Cunningham, P., Cunningham, M. (eds.) Collaboration and the Knowledge Economy: Issues, Applications, Case Studies; Proceedings of eChallenges 2008, pp. 1585–1592. IOS Press, Amsterdam (2008)
12. Hedbom, H.: A survey on transparency tools for privacy purposes. In: Fourth FIDIS International Summer School 2008, in cooperation with IFIP WG 9.2, 9.6/11.7, 11.6. Springer, Heidelberg (2009)
13. Hildebrandt, M.: FIDIS EU Project Deliverable D 7.12: Behavioural Biometric Profiling and Transparency Enhancing Tools (March 2009), http://www.fidis.net
14. Johnston, J., Eloff, J.H.P., Labuschagne, L.: Security and human computer interfaces. Computers & Security 22(8), 675–684 (2003)
15. Köffel, C., Wästlund, E., Wolkerstorfer, P.: PRIME IPv3 Usability Test Report V1.2 (July 25, 2008)
16. Lacohee, H., Phippen, A.D., Furnell, S.M.: Risk and Restitution: Assessing how users establish online trust. Computers & Security 25(7), 486–493 (2006)
17. Lambrinoudakis, C., Gritzalis, S., Hatzopoulos, P., Yannacopoulos, A., Katsikas, S.: A formal model for pricing information systems insurance contracts. Computer Standards and Interfaces (indexed in ISI/SCI-E) 7(5), 521–532 (2005)
18. Leenes, R., Lips, M., Poels, R., Hoogwout, M.: User aspects of Privacy and Identity Management in Online Environments: towards a theoretical model of social factors. In: Fischer-Hübner, S., Andersson, C., Holleboom, T. (eds.) PRIME Framework V1 (ch. 9), June 2005, PRIME project Deliverable D14.1.a (2005)
19. Moitra, S., Konda, S.: The survivability of network systems: An empirical analysis, Carnegie Mellon Software Engineering Institute, Technical Report, CMU/SEI-200-TR-021 (2003)
20. Pearson, S.: Towards Automated Evaluation of Trust Constraints. In: Stølen, K., Winsborough, W.H., Martinelli, F., Massacci, F. (eds.) iTrust 2006. LNCS, vol. 3986, pp. 252–266. Springer, Heidelberg (2006)
21. Pettersson, J.S., Fischer-Hübner, S., Danielsson, N., Nilsson, J., Bergmann, M., Clauß, S., Kriegelstein, T., Krasemann, H.: Making PRIME Usable. In: SOUPS 2005 Symposium on Usable Privacy and Security, Carnegie Mellon University, Pittsburgh, July 6-8. ACM Digital Library (2005)
22. Pettersson, J.S., Fischer-Hübner, S., Bergmann, M.: Outlining Data Track: Privacy-friendly Data Maintenance for End-users. In: Proceedings of the 15th International Information Systems Development Conference (ISD 2006), Budapest, 31 August - 2 September 2006. Springer Scientific Publishers, Heidelberg (2006)
23. Pfitzmann, A., Hansen, M.: Anonymity, Unlinkability, Undetectability, Unobservability, Pseudonymity, and Identity Management – A Consolidated Proposal for Terminology, Version v0.31 (February 15), http://dud.inf.tu-dresden.de/literatur/Anon_Terminology_v0.31.doc#_Toc64643839
24. The Platform for Privacy Preferences 1.1 (P3P1.1) Specification, W3C Working Group Note (November 13, 2006)
25. Riegelsberger, J., Sasse, M.A., McCarthy, J.D.: The Mechanics of Trust: A Framework for Research and Design. International Journal of Human-Computer Studies 62(3), 381–422 (2005)
26. Steinbrecher, S.: Enhancing multilateral security in and by reputation systems. In: Fourth FIDIS International Summer School 2008, in cooperation with IFIP WG 9.2, 9.6/11.7, 11.6. Springer, Heidelberg (2009)
27. Turner, C.W., Zavod, M., Yurcik, W.: Factors that Affect the Perception of Security and Privacy of E-commerce Web Sites. In: Proceedings of the Fourth International Conference on Electronic Commerce Research, Dallas, TX (November 2001)
28. Varian, H.R.: Systems reliability and free riding. Working Paper (2004)
29. Yannakopoulos, A., Lambrinoudakis, C., Gritzalis, S., Xanthopoulos, S., Katsikas, S.: Modeling Privacy Insurance Contracts and Their Utilization in Risk Management for ICT Firms. In: Jajodia, S., Lopez, J. (eds.) ESORICS 2008. LNCS, vol. 5283, pp. 207–222. Springer, Heidelberg (2008)
Evolution of Query Optimization Methods

Abdelkader Hameurlain and Franck Morvan

Institut de Recherche en Informatique de Toulouse IRIT, Paul Sabatier University, 118, Route de Narbonne, 31062 Toulouse Cedex, France
Ph.: 33 (0) 5 61 55 82 48/74 43, Fax: 33 (0) 5 61 55 62 58
[email protected], [email protected]

Abstract. Query optimization is the most critical phase in query processing. In this paper, we try to give a synthetic description of the evolution of query optimization methods from uniprocessor relational database systems to data Grid systems, through parallel, distributed and data integration systems. We point out a set of parameters to characterize and compare query optimization methods, mainly: (i) size of the search space, (ii) type of method (static or dynamic), (iii) modification types of execution plans (re-optimization or re-scheduling), (iv) level of modification (intra-operator and/or inter-operator), (v) type of event (estimation errors, delay, user preferences), and (vi) nature of decision-making (centralized or decentralized control). The major contributions of this paper are: (i) understanding the mechanisms of query optimization methods with respect to the considered environments and their constraints (e.g. parallelism, distribution, heterogeneity, large scale, dynamicity of nodes), (ii) pointing out their main characteristics, which allow them to be compared, and (iii) explaining why the proposed methods have become very sophisticated.

Keywords: Relational Databases, Query Optimization, Parallel and Distributed Databases, Data Integration, Large Scale, Data Grid Systems.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 211–242, 2009. © Springer-Verlag Berlin Heidelberg 2009

1 Introduction

At present, most relational database application programs are written in high-level languages integrating a relational language. Relational languages generally offer a declarative interface (or declarative language like SQL) to access the data stored in a database. Three steps are involved in query processing: decomposition, optimization and execution. The first step decomposes a relational query (a SQL query), using the logical schema, into an algebraic query. During this step, syntactic, semantic and authorization checks are performed. The second step is responsible for generating an efficient execution plan for the given SQL query from the considered search space. The third step consists in implementing the efficient execution plan (or operator tree) [51]. In this paper, we focus only on query optimization methods. We consider multi-join queries without “group” and “order by” clauses. Work related to relational query optimization goes back to the 70s, and began mainly with the publications of Wong et al. [138] and Selinger et al. [112]. These papers
motivated a large part of the database scientific community to focus their efforts on this subject. The optimizer’s role is to generate, for a given SQL query, an optimal (or close to optimal) execution plan from the considered search space. The optimization goal is to minimize response time and maximize throughput while minimizing optimization costs. The general problem of query optimization can be expressed as follows [41]: given a query q, a space of execution plans E, and a cost function cost(p) associated with the execution of a plan p ∈ E, find the execution plan p ∈ E computing q such that cost(p) is minimum. An optimizer can be decomposed into three elements [41]: a search space [85] corresponding to the virtual set of all possible execution plans for a given query, a search strategy generating an optimal (or close to optimal) execution plan, and a cost model used to annotate the operator trees in the considered search space.

Because of the importance and the complexity of the query optimization problem [21, 75, 82, 103], the database community has made a considerable effort to develop approaches, methods and techniques of query optimization for various Database Management Systems (DBMSs), i.e. relational, deductive, distributed, object, parallel [7, 9, 21, 26, 47, 52, 61, 62, 79, 82, 103, 125]. The quality of query optimization methods depends strongly on the accuracy and the efficiency of cost models [1, 42, 43, 66, 99, 141]. There are two types of query optimization approaches [27]: static and dynamic. For more than twenty years, most DBMSs have used the static optimization approach, which consists of generating an optimal (or close to optimal) execution plan and then executing it until termination. All the methods using this approach suppose that the values of the parameters used to generate the execution plan (e.g. sizes of temporary relations, selectivity factors, availability of resources) remain valid during its execution. However, this hypothesis is often unwarranted. Indeed, the values of these parameters can become invalid during the execution due to several causes [98]:

1. Estimation errors: the estimates of the sizes of the temporary relations and of the relational operator costs of an execution plan can be erroneous because of the absence, the obsolescence, or the inaccuracy of the statistics describing the data, or because of errors in the hypotheses made by the cost model. One such hypothesis is the dependence or independence between the attributes appearing in a selection clause (e.g. town = ‘Paris’ and country = ‘France’). These estimation errors are propagated through the rest of the execution plan. Moreover, [70] showed that the propagation of these errors is exponential in the number of joins.
2. Unavailability of resources: at compile-time, the optimizer does not have any information about the system state when the query will run, in particular about the availability of the resources to allocate (e.g. available memory, CPU load).
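The town/country example in the first cause can be made concrete with a small sketch. The toy relation, its figures and the helper name are made up for illustration; the last function is only a crude picture of compounding errors and does not reproduce the analysis of [70].

```python
# Hypothetical toy relation of (town, country) pairs; the data are made up.
rows = (
    [("Paris", "France")] * 20
    + [("Lyon", "France")] * 20
    + [("Berlin", "Germany")] * 30
    + [("Rome", "Italy")] * 30
)
n = len(rows)

# Selectivity factor of each predicate taken in isolation.
sel_town = sum(town == "Paris" for town, _ in rows) / n
sel_country = sum(country == "France" for _, country in rows) / n

# Independence assumption: multiply the two selectivity factors.
estimated = sel_town * sel_country * n
# Actual cardinality: the two attributes are fully dependent here.
actual = sum(row == ("Paris", "France") for row in rows)

print(round(estimated), actual)


def compounded_error(per_operator_factor, operators):
    """Crude illustration: a constant relative error multiplies at each operator."""
    return per_operator_factor ** operators


print(compounded_error(actual / estimated, 3))
```

Because every Paris tuple is also a France tuple, the independence assumption underestimates the result size, and an error of this kind made at each operator grows multiplicatively along the plan.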
For the reasons given above, the execution plans generated by a static optimizer can be sub-optimal. To correct this sub-optimality, some recent research suggests improving the accuracy of the parameter values used when choosing the execution plan. A first solution consists in improving the quality of the statistics on the data by using previous executions [1]. This solution was used by [20] to improve the estimation accuracy of the operator selectivity factors and by [117] to estimate the correlation between predicates. The second solution proposed by [80]
concentrates on distributed queries. The optimizer generates an optimal (or close to optimal) execution plan after having deduced the data transfer costs and the cardinalities of the temporary relations. In this solution, the query operators are executed on a subset of the tuples of the operands to estimate the data transfer costs and the cardinalities of the temporary relations. In both solutions, the selected execution plan is executed until termination, whatever the changes in the execution environment.

The dynamic optimization approach, for its part, consists in modifying sub-optimal execution plans at run-time. The main motivations for introducing ‘dynamicity’ into query optimization [27], particularly during the resource allocation process, are: (i) the wish to use information concerning the availability of resources, (ii) the exploitation of the relative quasi-exactness of parameter values, and (iii) the relaxation of certain hypotheses that are too drastic and unrealistic in a dynamic context (e.g. infinite memory). In this approach, several methods have been proposed in different environments: uni-processor, distributed, parallel, and large scale [3, 4, 6, 7, 8, 12, 14, 15, 17, 18, 27, 30, 37, 48, 50, 56, 59, 72, 73, 74, 76, 87, 95, 98, 101, 102, 105, 106, 107, 140]. All these methods are able to detect the sub-optimality of execution plans and to modify these plans to improve their performance. They make the query optimization process more robust with respect to estimation errors and to changes in the execution environment.

The rest of this paper is devoted to providing a state of the art concerning the evolution of query optimization methods in different environments (e.g. uni-processor, parallel, distributed, large scale). For each environment, we try to describe some methods synthetically and to point out their main characteristics [67, 98], especially the nature of decision-making (centralized or decentralized), the type of modification (re-optimization or re-scheduling), the level of modification (intra-operator and/or inter-operator), and the type of event (estimation errors, delay, user preferences). The major contributions of this paper are: (i) understanding the mechanisms of query optimization methods with respect to the considered environments and their constraints, (ii) pointing out their main characteristics, which allow them to be compared, and (iii) explaining why the proposed methods have become very sophisticated.

This paper is organized as follows. In section 2, we introduce the two main search strategies (enumerative strategies and random strategies) for uni-processor relational query optimization. In section 3, we present a synthesis of some methods in a parallel relational environment, distinguishing the two-phase and one-phase approaches. Section 4 provides global optimization methods for distributed queries. Section 5 describes both types of dynamic optimization methods, centralized and decentralized, in data integration (mediation) systems. Section 6 gives an overview of query optimization in large scale environments, particularly in data grid environments. Lastly, before presenting our conclusion, in section 7 we provide a qualitative analysis of the described optimization methods and point out their main characteristics, which allow them to be compared.
2 Uni-processor Relational Query Optimization

In uni-processor relational systems, the query optimization process consists of two steps: (i) logical optimization, which consists in applying the classic transformation
rules of the algebraic trees to reduce the manipulated data volume, and (ii) physical optimization, whose roles are [90]: (a) determining an appropriate join method for each join operator by taking into account the size of the relations, the physical organization of the data, and the access paths, and (b) generating the order in which the joins are performed [69, 84] with respect to a cost model. In this section, we focus on physical optimization methods. We begin by defining, characterizing, and estimating the size of the search space. Then, we present search strategies, which are based either on enumerative approaches or on random approaches. Finally, we synthesize some analyses and comparisons stemming from performance evaluations of the proposed strategies.

2.1 Search Space

2.1.1 Characteristics

In relational database systems [31, 120, 130], each query execution plan can be represented by a processing tree where the leaf nodes are the base relations and the internal nodes represent operations. Different tree shapes have been considered: left-deep tree, right-deep tree, and bushy tree. Fig. 1 illustrates tree structures of relational operators associated with the multi-join query R1 ⋈ R2 ⋈ R3 ⋈ R4.
Fig. 1. Tree shape: left-deep tree (((R1 ⋈ R2) ⋈ R3) ⋈ R4), right-deep tree (R1 ⋈ (R2 ⋈ (R3 ⋈ R4))), and bushy tree ((R1 ⋈ R2) ⋈ (R3 ⋈ R4))
A search space can be restricted according to the nature of the execution plans and the applied search strategy. The nature of the execution plans is determined according to two criteria: the shape of the tree structures (i.e. left-deep tree, right-deep tree and bushy tree) and the consideration of plans with Cartesian products. Queries with a large number of join predicates make the associated search space difficult to manage because it becomes too large. That is the reason why some authors [122, 123] chose to eliminate bushy trees. This reduced space is called the valid space. This choice rests on the assumption that the valid space represents a significant portion of the search space and contains the optimal solution. However, this assertion was never validated. Others, such as [100], think that these methods decrease the chances of obtaining optimal solutions. Several examples [100] show the importance of the impact of this restrictive choice.
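To give an order of magnitude for the effect of this restriction, the following minimal sketch counts join trees over n relations. It assumes that Cartesian products are allowed (or that the query is a clique), that the order of the two operands of a join matters, and that "bushy" denotes all binary trees, linear ones included; the function names are hypothetical and the counts are not the bounds tabulated in the cited works.

```python
from itertools import combinations
from math import factorial


def count_left_deep(n):
    """Left-deep trees over n relations: one per permutation of the relations."""
    return factorial(n)


def count_bushy_closed_form(n):
    """All binary join trees with n labelled leaves: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


def count_bushy_by_enumeration(relations):
    """Same count, obtained by splitting the set into every (left, right) pair."""
    rels = tuple(relations)
    if len(rels) == 1:
        return 1
    total = 0
    for k in range(1, len(rels)):
        for left in combinations(rels, k):
            right = tuple(r for r in rels if r not in left)
            total += (count_bushy_by_enumeration(left)
                      * count_bushy_by_enumeration(right))
    return total


for n in range(2, 9):
    print(n, count_left_deep(n), count_bushy_closed_form(n))
```

Under these assumptions both counts grow faster than exponentially with n, and the left-deep space is a rapidly shrinking fraction of the full space.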
2.1.2 Search Space Size

The query shape¹ (i.e. linear, star or clique) and the nature of the execution plans are important because of their effect on the size of the search space. If we have N relations in a multi-join query, the question is how many execution plans can be built, taking into account the nature of the search space. The size of this space also varies according to the shape of the query. In this respect, [85, 124] proposed a table giving the lower and upper bounds of the search space size according to the nature of the space and to the characteristics of the queries, which are: the type of query (i.e. repetitive, ad-hoc), the query shape, and the size of the query (i.e. simple, medium, complex). The results presented in [85, 124] point out the exponential growth of the number of execution plans with the number of relations. This shows the difficulty of managing a solution set which is sometimes very large, and therefore the necessity of adapting the search strategy to the query characteristics.

2.2 Search Strategies

In the literature, we generally distinguish two classes of strategies for solving the join scheduling problem in query optimization:

- Enumerative strategies
- Random strategies.
The description of the principles of the search strategies relies on the generic search algorithms described in [84] and on the comparative study of the random algorithms proposed by [69, 70, 86, 122, 123].

2.2.1 Enumerative Strategies

These strategies are based on the generative approach. They use the principle of dynamic programming (e.g. the optimizer of System R). For a given query, the set of all possible execution plans is enumerated, which can lead to a search space that is too large to manage in the case of complex queries. They build execution plans from already optimized sub-plans, starting with all or part of the base relations of a query. Among all the generated solutions, only the optimal execution plan is returned for execution. However, the exponential complexity of such strategies has led many authors to propose more efficient strategies. Thus, enumerative strategies can discard bad states by introducing heuristics (e.g. depth-first search with different heuristics [123]). Several strategies are described in [84].

2.2.2 Random Strategies

The enumerative strategies are inadequate for optimizing complex queries because the number of execution plans quickly becomes too large [85, 124]. To resolve this problem, random strategies are used. The transformational approach characterizes this kind of strategy. Several transformation rules (e.g. Swap, 3Cycle, join commutativity/associativity) were proposed [69, 70, 122], whose validity depends on the nature of the considered search space [86].

¹ The query shape indicates the way in which the relations are joined by means of predicates, as well as the number of referenced relations.
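To make the enumerative principle of Section 2.2.1 concrete before turning to the randomized algorithms, the following minimal sketch builds the cheapest left-deep join order from already optimized sub-plans. It is a strong simplification and not the System R optimizer: there is a single join method, no access paths or interesting orders, Cartesian products are allowed, the cost of a plan is assumed to be the sum of its estimated intermediate result sizes, and the cardinalities and selectivity factors are invented.

```python
from itertools import combinations


def result_size(subset, card, sel):
    """Estimated cardinality of joining `subset` (predicate independence assumed)."""
    size = 1.0
    for r in subset:
        size *= card[r]
    for a, b in combinations(sorted(subset), 2):
        size *= sel.get((a, b), 1.0)     # 1.0 means no predicate: Cartesian product
    return size


def best_left_deep_plan(card, sel):
    """Dynamic programming over subsets of relations, smallest subsets first."""
    rels = sorted(card)
    best = {frozenset([r]): (0.0, r) for r in rels}      # subset -> (cost, plan)
    for k in range(2, len(rels) + 1):
        for combo in combinations(rels, k):
            subset = frozenset(combo)
            size = result_size(subset, card, sel)
            candidates = []
            for r in combo:                              # r is the relation joined last
                sub_cost, sub_plan = best[subset - {r}]  # optimal sub-plan is reused
                candidates.append((sub_cost + size, (sub_plan, r)))
            best[subset] = min(candidates, key=lambda c: c[0])
    return best[frozenset(rels)]


# Hypothetical chain query R1 - R2 - R3 - R4; selectivity keys are sorted pairs.
card = {"R1": 1000, "R2": 100, "R3": 10, "R4": 10000}
sel = {("R1", "R2"): 0.01, ("R2", "R3"): 0.1, ("R3", "R4"): 0.001}
cost, plan = best_left_deep_plan(card, sel)
print(cost, plan)
```

The table `best` holds one entry per subset of relations, which is what makes the approach exponential in the number of relations and motivates the heuristics and random strategies discussed in the text.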
The random strategies generally start with an initial execution plan which is iteratively improved by the application of a set of transformation rules. The start plan(s) can be obtained through an enumerative strategy like Augmented Heuristics. Two optimization techniques have been studied and compared extensively: Iterative Improvement and Simulated Annealing [68, 69, 70, 84, 122, 123]. The performance evaluation of these strategies is very hard because of the strong simultaneous influence of random parameters and factors. The main difficulty lies in the choice of these parameters (e.g. local/global minimum detection, algorithm termination criterion, initial temperature, termination criterion of the inner iteration). Indeed, the quality of the execution plan and the optimization cost depend on the quality of this choice. Once the parameters have been tuned, comparing the algorithms should make it possible to determine the most efficient random algorithm for the optimization of complex queries. However, the results obtained by [122] and by [69] differ radically: for [122], the Iterative Improvement algorithm is better than Simulated Annealing, while for [69] the opposite holds (even if the conclusion of the latter remains more moderate). The parameters were determined through experiments with various alternatives in the case of [69], and by applying the methodology of factorial experiments in [122]. An example of the use of these factorial experiments is given in [121].

2.3 Discussion

In [69, 70, 122, 123], the authors concentrated their efforts on the performance evaluation of the random algorithms Iterative Improvement and Simulated Annealing. However, the difference between their results underlines the difficulty of such an evaluation.
Indeed, for Swami and Gupta [122, 123], Simulated Annealing is never superior to Iterative Improvement, whatever the time dedicated to optimization, while for Ioannidis and Kang [69, 70], it becomes better than Iterative Improvement after some optimization time. In [69, 70], the authors try to explain this difference. First, the considered search space is restricted to left-deep trees in the case of Swami and Gupta [122, 123], while Ioannidis and Kang [69, 70] study the search space in its entirety. In [70], the authors extend their work to the study of the shape of the cost function, stressing the analysis of the linear and bushy spaces, and take into account only the results obtained in this restricted portion of the search space in order to keep the comparison coherent. The second difference concerns the join method. Swami and Gupta choose the hash join method, while Ioannidis and Kang [69, 70] use two join methods: nested loop and sort-merge join. They even integrate the hash join method to show that their results do not depend on the chosen method. Another difference, in the cost evaluation of the execution plan (CPU time for the former and I/O time for the latter), has no significant impact on the difference between the results either. On the other hand, they intuitively think that the number of neighboring plans, the determination of the local minimum in the case of Iterative Improvement, and the definition of the transformation rules to be applied are important elements in explaining this difference. For example, if the number of neighboring plans is not large enough, potential local minima may be discarded, and plans may even be reported as local minima when in reality they are not. In that case, the results are skewed. The transformation rules applied by Swami
and Gupta [122, 123] generate neighboring execution plans with a significant difference in cost [69]. Hence, Simulated Annealing can no longer spend a long time in this zone of low-cost plans and then offers insufficient improvement. Iterative Improvement, however, can easily reach a local minimum. Moreover, the termination criterion of Simulated Annealing defined in [122] does not give the acceptance probability enough time to decrease sufficiently. Indeed, when the time limit is reached, the probability of accepting high-cost execution plans is still too high, and the plan produced as optimal is still too expensive.
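To make the role of the tuning parameters discussed above more tangible, the following is a minimal sketch of Iterative Improvement and Simulated Annealing over left-deep join orders. It is not the implementation evaluated in [69] or [122]: the toy catalog `CARD`, the uniform `SELECTIVITY`, the cost function, and all parameter defaults (`tries`, `temp`, `cooling`, `inner`, `min_temp`) are assumptions chosen only for illustration. Only the Swap rule is used as a transformation.

```python
import math
import random

# Hypothetical toy catalog (relation -> cardinality); not taken from the paper.
CARD = {"A": 1000, "B": 50, "C": 200, "D": 10, "E": 500}
SELECTIVITY = 0.01  # assumed uniform join selectivity


def plan_cost(order):
    """Toy cost: sum of intermediate result sizes of a left-deep join order."""
    size = CARD[order[0]]
    total = 0.0
    for rel in order[1:]:
        size = size * CARD[rel] * SELECTIVITY
        total += size
    return total


def swap_neighbor(order, rng):
    """Swap rule: exchange two relations to obtain a neighboring plan."""
    i, j = rng.sample(range(len(order)), 2)
    new = list(order)
    new[i], new[j] = new[j], new[i]
    return tuple(new)


def iterative_improvement(start, rng, tries=200):
    """Accept only downhill moves. A plan is declared a local minimum after
    `tries` consecutive failed attempts, which is only an approximation."""
    current, failures = start, 0
    while failures < tries:
        cand = swap_neighbor(current, rng)
        if plan_cost(cand) < plan_cost(current):
            current, failures = cand, 0
        else:
            failures += 1
    return current


def simulated_annealing(start, rng, temp=1000.0, cooling=0.9,
                        inner=20, min_temp=0.01):
    """Accept uphill moves with probability exp(-delta / temp)."""
    current = best = start
    while temp > min_temp:              # termination criterion
        for _ in range(inner):          # inner iteration at fixed temperature
            cand = swap_neighbor(current, rng)
            delta = plan_cost(cand) - plan_cost(current)
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current = cand
                if plan_cost(current) < plan_cost(best):
                    best = current
        temp *= cooling
    return best
```

Stopping the annealing loop early (for example with a large `min_temp`) leaves the acceptance probability high, which is the effect attributed above to the termination criterion of [122].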
3 Parallel Relational Query Optimization

Parallel relational query optimization methods [57] can be seen as an extension of the relational query optimization methods developed for centralized systems, integrating the parallelism dimension. Indeed, the generation of an optimal (or close to optimal) parallel execution plan is based on either a two-phase approach [60, 63] or a one-phase approach [24, 86, 111, 142]. A two-phase approach consists of two sequential steps: (i) generation of an optimal sequential execution plan (i.e. logical optimization followed by physical optimization), and (ii) resource allocation to this plan. The last step consists first in extracting the various sources of parallelism, and then in assigning the resources to the operations of the execution plan while trying to meet the allocation constraints (i.e. data locality and the various sources of parallelism). In the one-phase approach, steps (i) and (ii) are packed into one integrated component [90]. The fundamental distinction between the two approaches is based on the query characteristics and the shape of the search space [57].

Few authors [55, 61, 79] have proposed a synthesis dedicated to parallel relational query optimization methods. Hasan et al. [61] briefly introduce what they consider the major issues to be addressed in parallel query optimization. The issues tackled in [79] mainly include the placement of data in memory, concurrent access to data, and some algorithms for parallel query processing, restricted to parallel joins. In [55], the authors describe, in a very synthetic way, data placement, static and dynamic query optimization methods, and the accuracy of the cost model. Nevertheless, they do not show how the two optimization approaches can be compared, nor how the appropriate one can be chosen. Last year, Taniar et al.
[126] provided the latest principles, methods and techniques of parallel query processing in their book. The rest of this section gives an overview of some static and dynamic query optimization methods in a parallel relational environment, distinguishing the two-phase and one-phase approaches [57].

3.1 Static Parallel Query Optimization Methods

In this sub-section, we describe some one-phase and two-phase optimization strategies for parallel queries in a static context.
3.1.1 One-Phase Optimization

In a one-phase approach, Schneider et al. [111] propose a parallel algorithm to process a query composed of N joins for each search space shape (i.e. left-deep tree, right-deep tree and bushy tree, cf. Fig. 1). The authors consider two hash join methods: the simple hash join and the hybrid hash join. For each search space shape, [111] reports the memory requirements, the potential scheduling, and the capacity to exploit the different forms of parallelism. The study covers the case where the memory resource is unlimited and the more realistic case where memory is limited. In the first case, the right-deep tree is the best suited to exploiting parallelism. However, this structure is no longer the best when memory is limited. Several strategies make it possible to exploit the capabilities of right-deep trees when memory is limited. The strategy named "Static Right Deep Scheduling" [111] consists in cutting the right-deep tree into several separate sub-trees such that the sum of the sizes of all the hash tables of a sub-tree fits in memory. The temporary results of the execution of sub-trees T1, T2, …, Tn are stored on disk. The drawback of this strategy is that the number of sub-trees increases with the number of base relations that cannot be held in memory. Hence, this method shortens the pipeline chain and increases the response time. Two other methods were proposed: one is based on segmented right-deep trees [24], and the other on zigzag trees [142]. The objective of these two methods is to avoid exploring the bushy tree search space and thus to simplify the optimization process.

3.1.2 Two-Phase Optimization

In the two-phase approach, Hasan et al. [23, 60] propose several scheduling strategies for pipelined operators. To improve the response time, they develop an execution model ensuring the best trade-off between parallel execution and communication overhead.
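Returning briefly to the Static Right Deep Scheduling strategy of Sect. 3.1.1, the cutting step can be sketched as follows. This is only an illustration under stated assumptions: the greedy policy, the function name `cut_right_deep_tree`, and the representation of the tree as a list of hash table sizes in pipeline order are hypothetical and are not taken from [111].

```python
def cut_right_deep_tree(hash_table_sizes, memory):
    """Cut a right-deep join chain into sub-trees T1..Tn so that the hash
    tables of each sub-tree fit together in memory (assumed greedy policy).

    hash_table_sizes: sizes of the hash tables built on the base relations,
                      listed in pipeline order.
    memory:           available memory, in the same unit as the sizes.
    """
    segments, current, used = [], [], 0
    for size in hash_table_sizes:
        if size > memory:
            raise ValueError("a single hash table exceeds the memory budget")
        if used + size > memory:
            # Close the current sub-tree; its result would be stored on disk.
            segments.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        segments.append(current)
    return segments
```

For example, `cut_right_deep_tree([40, 30, 50, 20, 60], 100)` yields three sub-trees, so two temporary results would be written to disk and the pipeline chain is cut twice.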
Several scheduling algorithms (i.e. processor allocation) are then proposed. They are inspired by the LPT (Largest Processing Time) heuristic. These algorithms exploit pipeline and intra-operation (partitioned) parallelism. The authors first propose scheduling algorithms exploiting only pipeline parallelism (POT: Pipelined Operator Tree scheduling), then show how to extend these algorithms to take into account intra-operation parallelism and communication costs. The scheduling principle of the POT is decomposed into several steps [23]: (i) generation of a monotone operator tree [60] from the operator tree, (ii) fragmentation, which consists in cutting the monotone tree into a set of fragments, and (iii) scheduling, which consists in assigning processors to fragments. The main difficulty lies in determining the number of fragments and the size of each fragment while ensuring the best trade-off between parallel execution and communication overhead.

The works of Garofalakis et al. [44, 45] can be seen as an elegant extension of the proposals of [23, 41, 60]. Indeed, [44, 45] take into account the fact that parallel query execution requires the allocation of several resource types. They also introduce an original way to solve this resource allocation problem through a simultaneous scheduling (e.g. parallelism extraction) and mapping method. First, [44] presents a static scheduling and mapping strategy on a shared-nothing parallel architecture, considering the allocation of several "preemptive" resources (e.g. processors). Next, the authors extend their own work in [45] to hybrid multi-processor architectures. This
extension consists mainly in taking into account "non-preemptive" resources (e.g. memory) in their scheduling and mapping method.

3.2 Dynamic Parallel Query Optimization Methods

The main motivations for introducing 'dynamicity' into query optimization [27], in particular into resource allocation strategies, are: (i) the wish to use information concerning the availability of the resources to allocate, (ii) the exploitation of the relative quasi-exactness of the metrics, and (iii) the relaxation of certain hypotheses that are too drastic and unrealistic in a dynamic context. This sub-section describes in a synthetic way some one-phase and two-phase parallel query optimization strategies. It should be pointed out that the proposed resource allocation methods become very complex and sophisticated in such a dynamic context.

3.2.1 One-Phase Optimization

In this approach, the majority of works point out the importance of determining the parallelism degree of the join operation and of the resource allocation method (e.g. processors and memory). It is thus interesting to synthesize some methods proposed in the literature, mainly [19, 81, 96, 107]. In their most recent work, Brunie et al. [18, 19, 81] are not only interested in multi-join processing in a multi-user context, but also consider the current system state in terms of multi-resource contention. [18] studied, more generally, relational query optimization on a shared-nothing architecture. The MPO optimizer (Modular Parallel query Optimizer) [81] dynamically determines the intra-operation parallelism degree of the join operators of a bushy tree. The authors suggest a dynamic resource allocation heuristic in four steps, applied in the following order: (i) preservation of data locality (or "data localization"), (ii) size of the memory, (iii) I/O reduction, and (iv) operation serialization of a bushy tree.

The proposals of Mehta et al. [96] and Rahm et al.
[107] were developed independently of the one-phase and two-phase approaches. Since their proposals are very representative and describe relevant and original solutions to the problems identified above (i.e. determination of the parallelism degree and resource allocation methods), we chose to include them in the one-phase approach. Mehta et al. [96] propose four algorithms (Maximum, MinDp, MaxDp, and RateMatch) to determine the join parallelism degree independently of the initial data placement. The originality of the RateMatch algorithm is that it tries to match the rate at which an operator produces result tuples with the rate at which the next operator consumes them. Then, the authors describe six alternative methods for allocating processors to the clones of a single join operator. They are based on heuristics such as random or round-robin strategies, and on a model taking into account the effect of resource contention. Rahm et al. [107], who extend the work of [95], tackle the problem of dynamic workload balancing for several queries, each composed of a single hash join, on a shared-nothing architecture. The intra-operation parallelism degree of a join as well as the choice of the processors executing the join are determined in an "integrated" way (i.e. in a single step) by considering the current system state. This state is characterized by the "bottleneck" resources: CPU, memory, and disk.
3.2.2 Two-Phase Optimization

XPRS adaptive scheduling method. In the XPRS system (eXtended Postgres on Raid and Sprite) [118], implemented on a shared-memory parallel architecture, Hong [63] proposes an adaptive scheduling method for the fragments stemming from the best sequential execution plan, represented by a bushy tree. Fragments are used as the unit of parallel execution and are also called tasks in this sub-section. The adaptive scheduling algorithm is based on the following three elements: (i) classification of tasks as "IO-bound" or "CPU-bound", (ii) a method for computing the IO-CPU balance point of two tasks, and (iii) a mechanism for dynamically adapting the parallelism degree of a task. The strategy proposed by [63] consists in finding a task schedule which maximizes the use of the resources (i.e. processors and disks), and thus minimizes the response time. For that purpose, [63] defines two types of tasks: IO-bound tasks (limited by input/output) and CPU-bound tasks (limited by the number of processors). To maximize resource utilization (e.g. when one of the two tasks ends, part of the resources remains unused), [63] proposes a method for dynamically adapting the parallelism degree of a task according to the implemented distribution methods (i.e. round-robin, interval). This method is used in the adaptive scheduling so that the system always works at the IO-CPU balance point.

Dynamic re-optimization methods for sub-optimal execution plans. Kabra et al. [77], whose idea is close to that of Brunie et al. [18], propose a dynamic re-optimization algorithm which detects and corrects the sub-optimality of the execution plan produced by the optimizer at compile time. This algorithm is implemented in the Paradise system [33], which is based on the static optimizer OPT++ [78].
The authors show that the sub-optimality of an execution plan can result from: (i) poor join scheduling, (ii) an inappropriate choice of join algorithms, or (iii) poor resource allocation (CPU and memory). These three problems are caused by erroneous or obsolete cost estimations, or by a lack of information about the system state that is necessary for static optimization. The basic idea of this algorithm is to collect statistics at some key points during query execution. The collected statistics correspond to the real values (observed during execution) of quantities whose estimation is subject to error at compile time (e.g. the size of a temporary relation). These statistics are used to improve the resource allocation or to change the execution plan of the remainder of the query (i.e. the part of the query which has not been executed yet). The re-optimization process is engaged only when estimation errors really make the execution plan sub-optimal. Indeed, if the new, improved estimations differ significantly from those supplied by the static optimizer, a new execution plan for the remainder of the query is generated, provided that it brings a minimum benefit.
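The two conditions stated above (a significant estimation error and a minimum benefit) can be sketched as a simple decision function. This is not the algorithm of [77]: the function name, its parameters, and the threshold values `error_threshold` and `min_benefit` are hypothetical and serve only to illustrate the reasoning.

```python
def should_reoptimize(estimated_size, observed_size,
                      remaining_cost, new_plan_cost, reopt_cost,
                      error_threshold=0.5, min_benefit=0.05):
    """Decide whether to re-plan the remainder of a query (illustrative only).

    estimated_size: compile-time estimate (e.g. size of a temporary relation)
    observed_size:  value collected at a key point during execution
    remaining_cost: cost of the rest of the current plan, recomputed with
                    the observed statistics
    new_plan_cost:  cost of the best alternative plan for the remainder
    reopt_cost:     overhead of re-optimizing and switching plans
    """
    relative_error = abs(observed_size - estimated_size) / max(estimated_size, 1)
    if relative_error < error_threshold:
        return False  # estimates were good enough, keep the current plan
    benefit = remaining_cost - (new_plan_cost + reopt_cost)
    return benefit >= min_benefit * remaining_cost
```

With these assumed thresholds, a temporary relation observed at 5000 tuples against an estimate of 1000 triggers re-optimization only if the alternative plan, including the switching overhead, is at least 5% cheaper than continuing with the current plan.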
4 Distributed Query Optimization

The main motivation of distributed databases is to present data which are distributed over a network of type LAN (Local Area Network) or of type WAN (Wide Area
Network) in an integrated way to a user. One of the objectives is to make data distribution transparent to the user. In this environment, the main steps of the evaluation process of a distributed query are data localization and optimization. The optimization process [82, 103] takes into account network particularities. Indeed, contrary to the interconnection network of a multi-processor, networks have a lower bandwidth and a higher latency. For example, with a satellite connection the latency exceeds half a second. These particularities weigh heavily on the cost of a distributed execution plan, and this is what the authors of [10, 103] focus on. They suppose that the communication cost is far greater than the I/O and CPU costs. So, many works focus on the communication cost to the detriment of CPU and I/O costs. At present, with the improvement of network performance, the cost functions used by the optimization process take into account both processing time (i.e. CPU and I/O) and communication time. The optimization process of a distributed query is composed of two steps [103]: a global optimization step and a local optimization step. Global optimization consists of: (i) determining the best execution site for each local sub-query considering data replication, (ii) finding the best inter-site operator scheduling, and (iii) placing these operators. Local optimization optimizes the local sub-queries on each site involved in the query evaluation. The inter-site operator scheduling and placement are very important in a distributed environment because they make it possible to reduce the data volumes exchanged on the network and consequently the communication costs. Hence, the accuracy of the estimated sizes of the temporary relations that must be transferred from one site to another is important. In the rest of this section, we present global optimization methods for distributed queries.
They differ in the objective function used by the optimization process and in the type of approach: static or dynamic.

4.1 Static Distributed Query Optimization

In distributed environments, research on static query optimization focuses mainly on the optimization of inter-site communication costs. The idea is to minimize the data volume transferred between sites. In this perspective, there are two methods to process inter-site joins [103]: (i) the direct join, which moves one or both relations, and (ii) the join based on semi-join. The latter alternative consists in replacing a join, whatever the class of algorithm implementing it, by the combination of a projection and a semi-join, followed by a join [25]. The cost of the projection can be minimized by encoding the result [133]. The benefit of a join based on semi-join with respect to a direct join is proportional to the selectivity of the join operator [134]. According to the relation profiles (e.g. relation size), the optimizer chooses the approach which minimizes the data volume transferred between sites. For example, the SDD-1 system [10] often uses the join based on semi-join, whereas System R* [113] avoids it. Indeed, the use of a join based on semi-join can increase the query processing time: Mackert and Lohman [91] showed the importance of the local processing cost in the performance of a distributed query. Furthermore, considering this alternative in the optimizer significantly increases the size of the search space. Indeed, in a query, there are several possible joins based on semi-join for a given relation. The number of joins based on semi-join is an exponential function
which depends on the number of temporary relations resulting from local sub-queries [103]. This explains why many optimizers do not use this alternative.

The quality of a distributed execution plan generated by the global optimization process depends on the accuracy of the estimations used. However, it is difficult to estimate the parameters (e.g. relation profile, resource availability) used by the optimizer. Generally, the cost models assume processor and network uniformity: all processors and network connections have the same speed and bandwidth, as in a parallel environment. Furthermore, they take into account neither the workload of the processors nor that of the network. Based on these observations, several works [80, 119] try to improve the accuracy of these parameters. To this end, the Mariposa distributed DBMS [119] relies on an economic model in which querying servers buy data from data servers. Each query Q, which is decomposed into several sub-queries Q1, Q2, …, QN, is administered by a broker. A broker obtains bids for a sub-query Qi from various sites. After choosing the best bid, the broker notifies the winning site. The advantage of this method is that it relies on the local cost models of every DBMS which can participate in the query evaluation. So, it considers processor heterogeneity and takes into account their workload. [80] proposes that the optimizer generate an optimal (or close to optimal) execution plan after having deduced the data transfer costs and the cardinalities of temporary relations. In this solution, the operators of a query are executed on a subset of the tuples of the operands to estimate the data transfer costs and the cardinalities of temporary relations. Once these parameters have been deduced, an optimal execution plan is generated and executed until termination, whatever the changes in the execution environment.
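The difference between the direct join and the join based on semi-join described in this sub-section can be illustrated with a small sketch. It is a simplification under stated assumptions: relations are lists of dictionaries held in memory, the "sites" are only implied, the function names are hypothetical, and the transfer volume is counted crudely as the number of shipped items (join values or tuples) rather than bytes.

```python
def direct_join(r, s, r_key, s_key):
    """Ship all of S to the site of R, then join there.
    Returns (result, number of items shipped)."""
    shipped = len(s)
    result = [{**a, **b} for a in r for b in s if a[r_key] == b[s_key]]
    return result, shipped


def semijoin_based_join(r, s, r_key, s_key):
    """Projection, then semi-join, then final join.
    Returns (result, number of items shipped)."""
    # 1. Projection of R on the join attribute, shipped to the site of S.
    keys = {a[r_key] for a in r}
    # 2. Semi-join at the site of S: keep only the matching tuples of S.
    reduced_s = [b for b in s if b[s_key] in keys]
    # 3. Ship the reduced S back to the site of R and perform the join.
    result = [{**a, **b} for a in r for b in reduced_s if a[r_key] == b[s_key]]
    shipped = len(keys) + len(reduced_s)
    return result, shipped


# Illustrative data (hypothetical): 3 employees in 2 of 8 departments.
R = [{"emp": "ann", "dept": 1}, {"emp": "bob", "dept": 2}, {"emp": "cy", "dept": 1}]
S = [{"dept_id": i, "dname": f"d{i}"} for i in range(1, 9)]
```

On this data both functions return the same join result, while the semi-join variant ships fewer items because few tuples of S match. When almost every tuple of S matches, the extra projection step makes the semi-join variant ship more, which is consistent with the remark that the optimizer must choose according to the relation profiles.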
4.2 Dynamic Distributed Query Optimization

A solution to correct the sub-optimality of an execution plan consists in changing the operation scheduling at run-time. In the MIND multi-database system, Ozcan et al. [102] proposed strategies for the dynamic re-scheduling of inter-site operators (e.g. join, union) to react to inaccurate estimations. An inter-site operator can be executed as soon as two sub-queries executed on different sites have produced their results. These strategies use the partial results available at run-time to define the scheduling of the inter-site operators. The query processing is done in two steps [37]:

1. Compilation. During this step, a global query is decomposed into local sub-queries. The sub-queries are sent to different sites to be executed in parallel.
2. Dynamic scheduling. This step defines a dynamic schedule for the operations consuming the results of the sub-queries sent to the sites. When a sub-query produces its result, a threshold is associated with the result. This threshold is used to determine whether the result must be consumed immediately to execute a join with another result already available, or whether its consumption should be delayed while waiting for another result which is not yet available. The threshold associated with a result is calculated according to the costs and selectivity factors of all the joins connected to this result.
This scheduling strategy reduces the uncertainty of estimations since it is based on the execution times of local sub-queries. Moreover, it avoids the need to know the cost models of the various databases.
5 Query Optimization in Data Integration Systems

Data integration systems [22, 53, 88, 127, 128, 136] extend the distributed database approach to multiple, autonomous, and heterogeneous data sources by providing uniform access (the same read-only query interface to all sources). We use the term data source to refer to any data collection which its owner wishes to share with other users. The main differences from the distributed database approach are the number of data sources and their heterogeneity. The distributed database approach addresses tens of distributed databases, while the data integration approach can scale up to hundreds of data sources [104]. In addition to the hardware heterogeneity (i.e. CPU, I/O, network) due to the environment, the data sources are heterogeneous in their data structure (e.g. relational or object). Moreover, the software infrastructures providing access to the data sources have different query processing capabilities. For example, a phone book service which requires the name of a person to return a phone number is a data source whose access is restricted. In this context, we need new operators in order to access data sources and, for instance, to join two relations. Consider an execution plan that needs a relational join between the Employee (empId, name) and Phone (name, phoneNumber) tables on their name attribute. With a standard join, both of the following fragments are valid, since join is a commutative operator: Join (Employee, Phone) and Join (Phone, Employee). However, with restricted sources, the second fragment Join (Phone, Employee) on the name attribute is not valid, since Phone requires the value of the name attribute in order to return the value of phoneNumber. Consequently, we need a new join operator which is asymmetric in nature, also known as the dependent join Djoin [46].
The asymmetry of this operator restricts the search space and raises the issue of capturing valid (feasible) execution plans [92, 93, 139]. In an environment with hundreds of data sources connected over the Internet, it is even more difficult to estimate, at compile time, the availability of resources such as network, CPU or memory. Hence, many authors propose dynamic optimization strategies to correct the sub-optimality of execution plans at runtime. The methods proposed initially are centralized [3, 4, 7, 14, 15, 32, 74, 109]. A dynamic optimization method is said to be centralized if there is a unique process, generally the optimizer, which is in charge of supervising, controlling and modifying the execution plans. This process can rely on other modules producing the information necessary for modifying and controlling an execution plan. On the other hand, in this environment, two frequently occurring phenomena are significant: initial delays before data start arriving, and bursty arrival of data thereafter [72]. In order to react to these unpredictable data arrival rates, several authors propose to decentralize the control inside the operator [72, 131, 132]. The idea is to produce part of the result as quickly as possible with the tuples that have already arrived while waiting for the remaining operand tuples.
In the rest of the section, we first present the operators specific to data integration, then we describe both types of dynamic optimization methods: centralized and decentralized.

5.1 Operators for Restricted Source Access

Consider the execution plan presented previously, which needs a relational join between the Employee (empId, name) and Phone (name, phoneNumber) tables. The tables can be modeled with the concept of 'binding patterns' introduced in [108]. A binding pattern can be attached to a relational table to describe its access restrictions, which are due to confidentiality or performance reasons. A binding pattern for a table R(X1, X2, ..., Xn) is a partial mapping from {X1, X2, ..., Xn} to the alphabet {b, f} [93]. For the attributes mapped to 'b', values must be supplied in order to get information from R, while the attributes mapped to 'f' do not require any input in order to return tuples from R. If all the attributes of R are mapped to 'f', it is possible to get all the tuples of R without any restriction (e.g. with a relational scan operator). The binding patterns of the tables of our example are as follows: Employee (empId^f, name^f) and Phone (name^b, phoneNumber^f). This means that the Employee table is ready to return the values of empId and name, while the Phone table can give the phoneNumber only if the value of the name attribute is known. The regular set of relational operators is insufficient to answer queries in the presence of restricted sources. Although we can model the restricted sources with the formalism of binding patterns, the access restrictions of the sources prevent us from using the usual query processing operators, such as relational scan and relational join. In the example, in order to get the phoneNumber we have to supply the values of the name attribute. So we need a new scan operator which is able to deal with restricted sources.
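The binding patterns of the Employee and Phone example can be made concrete with a small sketch. The class `RestrictedSource`, its `access` method and the function `dependent_join` are hypothetical names introduced only for illustration; they model a source that refuses to answer unless values are supplied for its 'b' attributes, and an asymmetric join that feeds names from Employee into Phone.

```python
class RestrictedSource:
    """A table with a binding pattern: attribute -> 'b' (bound) or 'f' (free)."""

    def __init__(self, rows, pattern):
        self.rows = rows
        self.pattern = pattern

    def access(self, **bound):
        """Return the tuples matching the supplied values.
        Values must be supplied for every attribute mapped to 'b'."""
        missing = [a for a, mode in self.pattern.items()
                   if mode == "b" and a not in bound]
        if missing:
            raise ValueError(f"values required for bound attributes: {missing}")
        return [row for row in self.rows
                if all(row[a] == v for a, v in bound.items())]


def dependent_join(outer_rows, source, attr):
    """Asymmetric join: values of `attr` from the outer input are supplied
    to the restricted source, one access per distinct value."""
    result, cache = [], {}
    for row in outer_rows:
        value = row[attr]
        if value not in cache:
            cache[value] = source.access(**{attr: value})
        for match in cache[value]:
            result.append({**row, **match})
    return result


# Illustrative data (hypothetical).
employee = RestrictedSource(
    [{"empId": 1, "name": "ann"}, {"empId": 2, "name": "bob"}],
    {"empId": "f", "name": "f"})
phone = RestrictedSource(
    [{"name": "ann", "phoneNumber": "555-0100"},
     {"name": "zed", "phoneNumber": "555-0199"}],
    {"name": "b", "phoneNumber": "f"})
```

Here `employee.access()` behaves like an unrestricted scan, whereas `phone.access()` without a name raises an error. This is why Join (Employee, Phone) can be evaluated by `dependent_join(employee.access(), phone, "name")` while the commuted fragment cannot.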
We call this operator DAccess, where D indicates its dependency on the values of the input attribute(s). While the relational scan operator always returns the same result set, DAccess returns different sets depending on its input set. The formal semantics of DAccess is as follows: consider a table R(X^b, Y^f) and let χ be a set of values for X. Then DAccess(R(X^b, Y^f))_χ = σ_{X ∈ χ}(R(X, Y)) [93]. To join Employee (empId^f, name^f) and Phone (name^b, phoneNumber^f), we need a new join operator known as the dependent join [46], written Djoin here. The dependent join is represented as T ← Scan(R1(U^f, V^f)) Djoin_{V=X} DAccess(R2(X^b, Y^f)). The hash dependent join consists in building a hash table from R1 while, at the same time, retrieving the distinct values of the attribute(s) V and storing them in a table P. P is given to the DAccess operator to compute R2' = σ_{X ∈ P}(R2(X, Y)). Then the hash table is probed with R2' to compute the result.

5.2 Centralized Dynamic Optimization Methods in Data Integration Systems

In this sub-section, we present some dynamic optimization methods and techniques where the decision-making is centralized. We classify these methods according
to the modification level of execution plans. This modification can take place either at the intra-operator level or at the inter-operator level.

5.2.1 Modification of Execution Plans on the Intra-operator Level

The sub-optimality of execution plans can be corrected during the execution of an operator (intra-operator). With this objective, two approaches were proposed: the first one, named Eddy [7], is based on the routing of tuples, and the second one is based on the dynamic partitioning of data [74].

Avnur and Hellerstein [7] proposed a query processing mechanism named Eddy which continuously updates the execution schedule of operators in order to adapt to changes in the execution environment. An Eddy can be considered as a router of tuples positioned between a number of data sources and a set of operators. Each operator must have one or two input queues to receive the tuples sent by the Eddy and an output queue to return the result tuples to the Eddy. The tuples received by an Eddy are redirected towards the operators in different orders. Thus, the scheduling of operators is encapsulated by the dynamic routing of tuples. The key point in Eddy is the routing of tuples: the tuple routing policy must be efficient and intelligent in order to minimize the query response time. For that purpose, several authors [32, 109] suggest extending Eddy's mechanism to improve the quality of the routing.

Dynamic data partitioning was proposed by Ives et al. [74] to correct the sub-optimality of execution plans. In this method, a set of execution plans is associated with each query; these plans are executed either in parallel or in sequence on separate data partitions. The execution plan of a query is constantly supervised at runtime, and it can be replaced by a new plan when the current plan is considered to be sub-optimal. The tuples processed by each plan used form a data partition.
When an execution plan is replaced, a new data partition is produced. During query execution, each plan used produces a part of the total result from its associated data partition. The union of the tuples produced by the various plans used provides only part of the total result. Thus, to calculate the final result of the query, the results of all the combinations of the various data partitions must also be calculated. This method is similar to Eddy [7]. But contrary to Eddy, which uses local routing decisions, this method is based on more global information to generate the new plans. The main difference is that the decision to suspend an execution plan or replace it by another one is made by the optimizer.

5.2.2 Modification of Execution Plans on the Inter-operator Level

A solution to correct the sub-optimality of execution plans consists in changing the operation scheduling at runtime. The works of Amsaleg et al. [3] take into account the delays in data arrival rates. They identified three types of delays: (i) initial delay, which occurs before the arrival of the first tuple, (ii) bursty arrival, where the data arrive in bursts, each followed by a long period without any arrival, and (iii) slow delivery, where the data arrive regularly but slower than normal. To deal with these delays, two methods were proposed by Amsaleg et al. [4] and by Bouganim et al. [14, 15].
The technique of query scrambling [3, 4] was proposed to handle the blocking caused by delays in data arrival rates. It tries to mask these delays by executing other portions of the execution plan until the delays end. Query scrambling handles the initial delay and the bursty arrival in two phases [3]:

1. Re-scheduling: this phase is invoked as soon as a delay is detected. It begins by separating the relational operators of the execution plan into two disjoint sets: (i) the set of blocked operators, which contains all the ancestors of the unavailable operands, and (ii) the set of executable operators, which contains the remaining operators. Then, a maximal executable sub-tree is extracted from the set of executable operators. This sub-tree is executed and its intermediate result is materialized.
2. Synthesis: this phase is invoked if the set of executable operators is empty and the set of blocked operators is not empty. Contrary to the re-scheduling phase, the synthesis phase can significantly change the execution plan by adding new operators and/or by removing existing operators. It starts by constructing a graph of the joins which are ready to be executed. Then, a join is processed and its result is materialized. The synthesis phase ends when all delays have ended, or when the graph is reduced to a single node or to several nodes without join predicates.

The technique of query scrambling supposes that an execution plan is executed without taking into account the delays in data arrival rates during plan execution. For that, Bouganim et al. [14, 15] proposed a strategy where the available memory and the data arrival rates are constantly supervised. This information is used to produce a new scheduling between the various fragments of the execution plan or to re-optimize the remainder of the query.
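To make the re-scheduling phase of query scrambling more concrete, the following Python sketch separates the operators of a plan into blocked and executable sets and extracts the roots of the maximal executable sub-trees. It is only an illustration under simplifying assumptions: the PlanNode class, the delayed flag and both function names are hypothetical and do not come from [3, 4], and the execution and materialization of the selected sub-tree are not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class PlanNode:
    """Node of an execution plan: an operator (has children) or a data source (leaf)."""
    name: str
    children: List["PlanNode"] = field(default_factory=list)
    delayed: bool = False  # meaningful for leaves only: the source delivers no tuples


def split_operators(root: PlanNode) -> Tuple[Set[str], Set[str]]:
    """Split the operators into (blocked, executable) sets.

    An operator is blocked when it is an ancestor of a delayed source.
    """
    blocked: Set[str] = set()
    executable: Set[str] = set()

    def contains_delay(node: PlanNode) -> bool:
        if not node.children:  # leaf = data source
            return node.delayed
        # Build the list first so that every child is visited and classified.
        delayed_below = any([contains_delay(child) for child in node.children])
        (blocked if delayed_below else executable).add(node.name)
        return delayed_below

    contains_delay(root)
    return blocked, executable


def maximal_executable_subtrees(root: PlanNode, blocked: Set[str]) -> List[str]:
    """Roots of the executable sub-trees that are not contained in a larger one."""
    roots: List[str] = []

    def visit(node: PlanNode) -> None:
        if not node.children:
            return
        if node.name in blocked:
            for child in node.children:
                visit(child)
        else:
            roots.append(node.name)  # every operator below is executable too

    visit(root)
    return roots


# Plan J4(J1(A, B), J3(J2(C, D), E)) in which source A is delayed.
plan = PlanNode("J4", [
    PlanNode("J1", [PlanNode("A", delayed=True), PlanNode("B")]),
    PlanNode("J3", [PlanNode("J2", [PlanNode("C"), PlanNode("D")]), PlanNode("E")]),
])
blocked, executable = split_operators(plan)
roots = maximal_executable_subtrees(plan, blocked)
print(sorted(blocked), sorted(executable), roots)
```

In this example J1 and J4 are blocked because they are ancestors of the delayed source A, while J2 and J3 remain executable; J3 is the root of the maximal executable sub-tree, so it would be the candidate to execute and materialize while waiting for A.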
The paper of Ives et al. [72] describes a dynamic optimization method which is able to deal with most of the changes in the execution environment (delays, errors and unavailable memory). This method interleaves the optimization and execution phases and uses specific dynamic operators. The optimizer transforms a query into an annotated execution plan [77] and generates the associated rules of Event-Condition-Action type. These rules determine the behavior of the execution plan according to the changes at runtime: when events occur (e.g. delay, memory unavailable), they check certain conditions (e.g. comparison of the sizes of the current temporary relations with those estimated during compilation) and start actions (e.g. memory re-allocation, re-scheduling or re-optimization).

5.3 Decentralized Dynamic Optimization Methods in Data Integration Systems

The decentralized dynamic optimization methods correct the sub-optimality of execution plans by decentralizing the control. The conventional hash join algorithm [16] requires the reception of all the tuples of the first operand in order to build the hash table before beginning the probe step. Thus, the time to produce the first tuple can be long if: (i) the size of the operands is large, or (ii) the data arrival rate is irregular. Contrary to the conventional hash join, the double hash join (DHJ) introduced by Ives et al. [72] builds a hash table for each operand. When a tuple arrives, it is first inserted in the associated hash table. Then, it is used to probe the other hash table. If the probe step produces result tuples, these tuples are immediately delivered. DHJ was proposed in the TUKWILA project [72] to deal with the problems of the conventional hash join in the context of data integration: (i) the production time of the first tuple is minimized, (ii) the optimizer does not need to know the sizes of the operands in order to choose the operand used to build the hash table, and (iii) it masks the slow arrival rate of tuples from one operand by processing the tuples of the other operand. However, DHJ requires both hash tables to be maintained in memory. This can limit the use of DHJ with large operands or with queries composed of several joins. To solve this problem, parts of the hash tables residing in memory are moved to secondary storage: when the memory becomes saturated, a partition of one of the two tables is chosen and moved to secondary storage.

DHJ reduces the time needed to produce the first result tuple. Moreover, it makes it possible to continue producing result tuples in spite of the unavailability of either one of the two operands. However, it can lead to poor performance if the tuple production of both operands is blocked. For that, the XJoin operator was proposed by Urhan and Franklin [131]. When XJoin detects that the tuples of both operands are unavailable, the tuples of a partition resident in secondary storage are joined with the tuples of the same partition of the second operand residing in memory. To accelerate the production of result tuples, it is worthwhile to define scheduling mechanisms between the various phases of the XJoin operator. For that purpose, Urhan and Franklin [132] proposed a scheduling technique using the notion of Stream. A Stream is an execution unit which consumes and produces tuples. The execution schedule of Streams is determined at runtime and changes according to the variations of the system behaviour (production of tuples, terminated streams).
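The insert-then-probe behaviour of the double hash join described in this sub-section can be illustrated with a short Python sketch. It is a minimal in-memory version written for this survey's description, not the TUKWILA implementation: the function name, the ("L"/"R", tuple) arrival format and the key functions are assumptions, and the overflow of hash table partitions to secondary storage is deliberately left out.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Iterator, List, Tuple

Row = Tuple
Arrival = Tuple[str, Row]  # ("L" or "R", tuple): which operand the tuple comes from


def double_hash_join(
    arrivals: Iterable[Arrival],
    left_key: Callable[[Row], Hashable],
    right_key: Callable[[Row], Hashable],
) -> Iterator[Tuple[Row, Row]]:
    """Join two operands whose tuples arrive interleaved, in any order."""
    tables: Dict[str, Dict[Hashable, List[Row]]] = {
        "L": defaultdict(list),
        "R": defaultdict(list),
    }
    key_of = {"L": left_key, "R": right_key}

    for side, row in arrivals:
        other = "R" if side == "L" else "L"
        key = key_of[side](row)
        tables[side][key].append(row)             # 1. insert into its own hash table
        for match in tables[other].get(key, []):  # 2. probe the other hash table
            # Result tuples are delivered as soon as they are found.
            yield (row, match) if side == "L" else (match, row)


arrivals = [
    ("L", (1, "a")),
    ("R", (1, "x")),
    ("R", (2, "y")),
    ("L", (2, "b")),
    ("L", (1, "c")),
]
results = list(double_hash_join(arrivals, lambda r: r[0], lambda r: r[0]))
print(results)
```

The first result is produced as soon as the second tuple arrives, without waiting for either operand to be fully received, and neither operand has to be designated as the build side in advance.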
6 Query Optimization in Large Scale Environments

6.1 Query Optimization in Large Scale Data Integration Systems

A large scale environment means [58]: (i) high numbers of data sources (e.g. databases, XML files), users, and computing resources (i.e. CPU, memory, network and I/O bandwidth) which are heterogeneous and autonomous, (ii) a network which presents, on average, a low bandwidth and a high latency, and (iii) huge volumes of data.

In a large scale distributed environment, the performance of the previous optimization methods decreases because of: (i) the relatively large number of messages exchanged on a network with low bandwidth and high latency, and (ii) the bottleneck formed by the optimizer. It thus becomes appropriate to make the query execution autonomous and self-adaptable. In this perspective, two close approaches have been investigated: the broker approach [28], and the mobile agent approach [6, 76, 101, 110]. The second approach consists in using a programming model based on mobile agents [40], knowing that at present the mobile agent platforms supply only migration mechanisms and do not offer a proactive migration decision policy. The rest of this sub-section describes the execution models associated with the broker and mobile agent approaches [6, 28, 66, 76, 98, 101, 110].
Broker Approach

In the context of large scale mediation systems, Collet and Vu [28] proposed an execution model based on brokers. The broker, which is the basic unit of query execution, supervises the execution of a sub-query. It detects estimation inaccuracies and adapts itself accordingly. Moreover, it communicates with the other brokers to take into account the updates of the execution environment. The principal components of a broker are: (i) a context including the annotations and constraints necessary for the execution of a sub-query, (ii) the operator of the sub-query, (iii) a buffer used to synchronize the data exchange between brokers, and (iv) rules which define the behavior of the broker according to changes in the execution environment.

Mobile Agent Approaches

A mobile agent [40] is an autonomous software entity which can move (code, data, and execution state) from one site to another in order to carry out a task. In a traditional operating system, the decision to migrate an activity is controlled by another process. In a mobile agent, however, the migration decision is made by the agent itself. The double hash join and XJoin operators improve the local processing cost by adapting the use of CPU, I/O and memory resources to the changes of the execution environment (e.g. estimation errors, delays in data arrival rates), but they do not take into account the network resource. In order to take the network resource into account, the works of Arcangeli et al. [6], Hussein et al. [66] and Ozakar et al. [101], based on mobile agents, extend the algorithms of direct join, semi-join based join and dependent join (in the presence of binding patterns). This extension allows them to change their execution sites proactively. Each mobile agent executing a join chooses its own execution site by adapting to the execution environment (e.g. CPU load, bandwidth) and to the accuracy of the estimations of temporary relation sizes.
Hence, the control which decides to change the execution site is exercised in a decentralized and autonomous way. Furthermore, for dynamic query optimization, Morvan et al. [97] proposed three cooperation methods between mobile join agents. These methods allow a mobile agent to decide whether or not to migrate according to the decisions of the other agents communicating with it. These methods minimize the number of messages exchanged between agents.

As for Jones and Brown [76], they propose, for large scale distributed queries, an execution model based on mobile agents which react to estimation inaccuracies. The mobile agents are in charge of executing the local sub-queries of an execution plan. These agents compare the partial results (e.g. size, execution costs) with the estimations used during compilation in order to detect sub-optimality. By taking into account the possibility of migration of mobile agents, two strategies were proposed:

1. Decentralized execution without migration: the agents executing sub-queries communicate with one another, by broadcasting their partial execution states, in order to produce an execution plan for the remainder of the query.
2. Decentralized execution with migration: this strategy extends the previous one by allowing the agents to migrate from one site to another before beginning their execution. The migration decision can be made in a distributed, individual or centralized way.
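The detection step on which both strategies rely, namely comparing partial results with compile-time estimations, can be sketched as follows. This is a hypothetical illustration only: the relative-error criterion, the tolerance value of 0.5 and all names are assumptions made here, and are not taken from [76].

```python
from typing import Dict


def relative_error(estimated: float, observed: float) -> float:
    """Relative deviation of an observed value from its compile-time estimate."""
    if estimated <= 0:
        raise ValueError("the estimate must be positive")
    return abs(observed - estimated) / estimated


def detects_suboptimality(
    estimates: Dict[str, float],
    observations: Dict[str, float],
    tolerance: float = 0.5,  # assumed threshold, chosen only for the example
) -> bool:
    """True if any observed metric deviates from its estimate by more than `tolerance`."""
    return any(
        relative_error(estimates[metric], observations[metric]) > tolerance
        for metric in observations
    )


# Estimates made at compile time for one local sub-query (illustrative numbers).
estimates = {"size": 10_000, "cost": 2.0}
ok = detects_suboptimality(estimates, {"size": 11_000, "cost": 2.2})
bad = detects_suboptimality(estimates, {"size": 60_000, "cost": 2.1})
print(ok, bad)
```

An agent for which the check succeeds would then trigger one of the two strategies above, for example by broadcasting its partial execution state to the other agents.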
Another method based on mobile agents was proposed in [110] in order to execute queries in a web context. In this context, the result of a query can correspond to a new query that must be processed by another server. For this, two mechanisms were proposed, which are also known as parts of LDAP (Lightweight Directory Access Protocol) [64]: (i) referral, which consists in returning to the user the new query and the address of the server that can process it, and (ii) chaining, which consists in cooperating with the server executing the new query to produce the result. In this approach [110], mobile agents are used to exploit these two mechanisms in query processing. Each query is processed by a mobile agent which can choose the best adapted mechanism (referral or chaining).

6.2 Query Optimization in Data Grid Systems

For more than ten years, grid systems have been a very active research topic. The main objective of grid computing [39] is to provide a powerful platform which supplies resources (i.e. computational resources, services, metadata and data sources). Grid computing is very important for large scale distributed systems and applications that require effective management of distributed and heterogeneous resources [58]. Grid systems are characterized by their large scale and by the dynamicity of nodes (system instability), which means that a node can join, leave or fail at any time. Today, grid computing, initially intended for intensive computing, is opening up towards the management of voluminous, heterogeneous, and distributed data in a large scale environment. Grid data management [104] raises new problems and presents real challenges such as resource discovery and selection, query processing and optimization, autonomic management, security, and benchmarking. To tackle these fundamental problems [104], several methods have been proposed [5, 30, 48, 49, 65, 94, 129].
A thorough overview addressing most of the above fundamental problems is given in [104]. The authors discuss a set of open problems and new issues related to Grid data management using, mainly, Peer-to-Peer (P2P) techniques [104]. Focusing on the specific and very active problem of resource discovery, [129] proposes a complete review of the most promising Grid systems that include P2P resource discovery methods, by considering the three main classes of P2P systems: unstructured, structured, and hybrid (super-peer). The advantages and weaknesses of some of the proposed methods are described in [104, 129].

The rest of this sub-section provides an overview of query processing and optimization in data grid systems. Several approaches have been proposed for distributed query processing (DQP) in data grid environments [2, 5, 48, 49, 50, 65, 115, 135]. Smith et al. [115] tackle the role of DQP within the Grid and determine the impact of using the Grid for each step of DQP (e.g. resource selection). Properties of grid systems such as flexibility and power make them suitable platforms for DQP [115].

In recent years, the convergence between grid technologies and web services has led researchers to develop standardized grid interfaces. The Open Grid Services Architecture (OGSA) [38] is one of the best-known standards used in grids. Many applications have been developed by using OGSA standards [2, 5, 135]. OGSA-DQP [2] is a high level data integration tool for service-based Grids. It is built on a Grid middleware named OGSA-DAI [5], which assists its users in accessing and integrating data from separate sources via the Grid. [135] describes the concepts that provide virtual data sources on the Grid and that implement a Grid data mediation service which is integrated into OGSA-DAI.

By analyzing the approaches of DQP on the Grid, the research community focused on the current adaptive query processing approaches [7, 47, 62, 67, 74] and proposed extensions for grid environments [29, 48, 50]. These studies achieve query optimization by providing efficient resource utilization, without considering parallelization. Although they use different techniques, most of these studies rely on existing monitoring systems to determine the progress of queries. In [48], Gounaris et al. highlighted the importance and challenges of DQP in Grids. They argued for the necessity of grids by emphasizing the increasing demand for computation in distributed databases. They also explained the challenges in developing adaptive query processing systems by describing the weaknesses of existing studies and the key points for solutions. After presenting these challenges, Gounaris et al. [50] proposed an adaptive query processing algorithm which provides both a resource discovery/allocation mechanism and a dynamic query processing service. In [114], Slimani et al. developed a cost model by modeling the network characteristics and heterogeneity. By using this cost model, they also introduced a query optimization method on top of Beowulf clusters [34]. They considered both logical and physical costs and deployed the distributed query according to the cheapest cost model. In [29], Cybula et al. introduced a different technique for query optimization, which is based on caching of query results. They developed a query optimizer which stores the results of queries inside the middleware and uses the cache registry to identify queries that need not be re-evaluated.
As far as the integration of the parallelism dimension is concerned, many authors have re-studied DQP so that it can be efficiently adapted to the properties (e.g. heterogeneity) of grids. Several methods have been proposed in this direction [13, 30, 49, 89, 106, 116], which define different algorithms for parallel query processing in grid environments. The proposed methods consider different forms of parallelism (e.g. pipelined parallelism), and all of them also consider resource discovery and load balancing. In [13], Bose et al. examined the problem of efficient resource allocation for query sub-plans. Their algorithm exploits bushy query trees and incrementally distributes the sub-queries until a stopping condition is satisfied. In [30, 106], the authors introduced an adaptive parallel query processing middleware for the Grid. They developed a distributed query optimization strategy which is then integrated with a grid node scheduling algorithm that considers runtime statistics of the grid nodes. Gounaris et al. [49] proposed an algorithm which optimizes parallel query processing in grids by iteratively increasing the number of nodes which execute the parallelizable sub-plans. In [89], Liu et al. presented a query optimization algorithm which grades the nodes according to their capacities. They determined the serial and parallel parts of the queries and proposed an execution sequence on the highest ranked nodes. Soe et al. [116] proposed a parallel query optimization algorithm. In their study, they considered resource allocation, intra-query parallelism and inter-query parallelism by analyzing bushy query trees.
7 Discussion

According to the discussion in section 2.4 and the results in [122, 123, 68, 69, 70, 84], it is difficult to conclude that one search strategy (e.g. for scheduling the join operators) is superior to another. However, each of these works proposes a solution to improve the performance of these algorithms. Ioannidis and Kong [69, 70] chose to propose a new algorithm, called Two Phase Optimization [69], which consists in applying first the Iterative Improvement algorithm and then the Simulated Annealing algorithm. As for Swami [123], he chose to experiment with a set of heuristics with the aim of improving the performance of the Iterative Improvement and Simulated Annealing algorithms [123]. The work of Ioannidis and Kong showed that the choice of a join method has no direct influence on the performance of the search strategies. In a parallel environment, Lanzelotte et al. [86] showed that the breadth-first search strategy is not applicable in a bushy search space for queries with 9 relations or more. The use of a randomized algorithm is then indispensable. The authors thus developed a randomized algorithm called Toured Simulated Annealing in a context of parallel processing [86].

The search strategies find the optimal solution more or less quickly according to their capacity to face the various problems. They must be adaptable to queries of diverse sizes (simple, medium, complex) and to various types of use (i.e. ad-hoc or repetitive) [54, 83]. A solution to this problem is the parameterization and the extensibility of query optimizers [71, 83] possessing several search strategies, each being adapted to a type of query. The major contributions in this domain arise, mainly, from the Rodin project [83, 84, 85, 86] as well as from Ioannidis and Kong's results [69].
Indeed, one of the main aspects studied by Lanzelotte in [83] concerns the extensibility of the search strategy of the optimizer, demonstrated by the implementation of four different strategies: System R, Augmented Heuristic, Iterative Improvement and Simulated Annealing. Lanzelotte is especially interested in query optimization in new systems such as object-oriented and deductive DBMS, and proposes an extensible optimizer, OPUS (OPtimizer for Up-to-date database Systems) [83], for these non-conventional DBMS. Recently, Bizarro et al. [11] proposed “Progressive Parametric Query Optimization”, a novel framework to improve the performance of processing parameterized queries.

As far as parallel database systems are concerned, a synthesis dedicated to parallel relational query optimization methods and approaches [57] has been provided in section 3. In a static context [57], the most advanced works are certainly those of Garofalakis and Ioannidis [44, 45]. They elegantly extend the propositions of [23, 41, 60], where the parallel query algorithms are based on a uni-dimensional cost model. Furthermore, [45] tackles the scheduling problem (i.e. parallelism extraction) and the resource allocation in a context which can be multi-query, by considering a multidimensional model of the resources used (i.e. preemptive and non-preemptive). The proposals of [45] seem to be the richest in terms of categories of considered resources (i.e. multi-resource allocation), exploited parallelisms, and various allocation constraints. In a dynamic context, the efforts were mainly centered on the handling of the following problems: (i) the determination and the dynamic adaptation of the intra-operation parallelism degree, (ii) the methods of resource allocation, and (iii) the dynamic query re-optimization. We identified a set of relevant parameters, mainly: search space, generation strategy of a parallel execution plan, optimization cost for parallel execution, and cost model. These parameters make it possible: (i) to compare the two optimization approaches (i.e. one-phase, two-phase), and (ii) to help in choosing how best to exploit parallel optimization approaches according to the query characteristics and the shape of the search space.

In a distributed database environment, static query optimization methods focus mainly on the optimization of inter-site communication costs, by reducing the data volume transferred between sites. Dynamic query optimization methods are based on dynamic scheduling (or re-scheduling) of inter-site operators to correct the sub-optimality due to estimation inaccuracies and variations of available resources. The introduction of a new operator, the semi-join based join [10, 25], certainly provides more flexibility to optimizers. However, it considerably increases the size of the search space.

Heterogeneity and autonomy of data sources characterize data integration systems. Sources might be restricted due to the limitations of their query interfaces, or certain attributes must be hidden for privacy reasons. To handle the limited query capabilities of data sources, new mechanisms have been introduced [46, 93], such as the Dependent Join operator, which is asymmetric in nature. The asymmetry of this operator causes the search space to be restricted and raises the issue of capturing valid (feasible) execution plans [92, 93, 139]. As for the optimization methods, the community quickly noticed that the centralized optimization methods [4, 7, 14, 15, 72, 73, 74, 77] could not be scaled up, for the reasons previously pointed out. So, dynamic optimization methods were decentralized by relying, mainly, on brokers or on mobile agents, which allow decentralizing the control and scaling up.
However, it is important to observe that the decentralized dynamic methods described in sub-section 5.3 build two hash tables (one for each operand relation). So, they do not apply to restricted data sources. Indeed, a restricted data source returns a result only if all the attributes which are mapped to 'b' are given.

In grid environments, which are characterized by their large scale and the dynamicity of nodes (system instability), distributed query optimization methods focus on two aspects: (i) the proposed execution models react to the state of resources by using monitoring services [36, 49, 137], and (ii) different forms and types of parallelism are considered (inter-query parallelism, intra-query parallelism).

Moreover, heterogeneity, autonomy, large scale and dynamicity of nodes raise new problems and present real challenges for the design and development of acceptable cost models [1, 35, 42, 43, 99, 114, 141]. For instance, the statistics describing the data stemming from sources and the formulae associated with the operations processed by these sources often cannot be published [35]. In a large scale environment, whatever the cost model approach used (i.e. history approach [1], calibration approach [43, 141], generic approach [99]), the statistics stored in the catalog are subject to obsolescence [66], which generates large variations between the parameters estimated at compile time and the parameters computed at runtime. Consequently, it is not realistic to replicate a cost model on all sites. This cost model should be distributed and partially replicated [66, 58]. In an execution model based on mobile agents, a part of the cost model should be embedded in the mobile agents. This ensures the autonomy of mobile joins and avoids distant interactions with the site on which the query was issued [66].
Finally, from this state of the art, we can point out the following main characteristics of query optimization methods [98]:
− Environment: query optimization methods have been designed and implemented in different environments: uni-processor, parallel, distributed, and large scale.
− Type of method: a query optimization method can be static or dynamic.
− Search space: this space can be restricted according to the nature of the considered execution plans, the limited capabilities of data sources, and the applied search strategy.
− Nature of decision-making: can be centralized or decentralized. The decentralized dynamic optimization methods correct the sub-optimality of execution plans by decentralizing the control.
− Type of modification: can be, mainly, re-optimization or re-scheduling. When the sub-optimality of an execution plan is detected, the correction can be made by a re-optimization process or by a re-scheduling process. The re-optimization process consists in producing a new execution plan for the remainder of the query [77]: the physical implementation, the scheduling and the tree structure of the operators which have not yet been executed can be updated. With the re-scheduling process, the tree structure of the remainder of the execution plan remains unchanged, but the scheduling between the operators can be modified.
− Level of modification: can occur at the intra-operator level or at the inter-operator level. The sub-optimal execution plan can be corrected during the execution of an operator and/or at the sub-query level.
− Type of event: a dynamic query optimization method can react to the following events: (i) estimation errors, (ii) available memory, (iii) delays in data arrival rates, and (iv) user preferences.
These parameters make it possible to compare the proposed optimization methods and to point out their advantages and weaknesses. A comparative study of dynamic optimization methods is described in detail in [98]. Furthermore, in a large scale environment, mobile agents seem very promising due to their autonomy and proactive behavior; their benefits depend on the estimation errors of temporary relation sizes, the network bandwidth, and the processor frequency.
8 Conclusion

Research related to relational query optimization goes back to the 70s, and began with the publication of two papers [112, 138]. These papers and the requirements of relevant applications motivated a large part of the database community to focus their efforts and energies on this topic. Because of the importance and the complexity of the query optimization problem, the database community has proposed approaches, methods and techniques in different environments (uni-processor, parallel, distributed, large scale). In this paper, we provided a survey of the evolution of query optimization methods from centralized relational database systems to data grid systems, through parallel and distributed database systems and data integration (mediation) systems. For each environment, we described some query optimization methods and pointed out their main characteristics, which allow them to be compared.
Acknowledgement

We would like to warmly thank Professor Roland Wagner for his kind invitation to write this paper.
Permissions

57. Hameurlain, A., Morvan, F.: Parallel query optimization methods and approaches: a survey. Journal of Computers Systems Science & Engineering 19(5), 95–114 (2004)
58. Hameurlain, A., Morvan, F., El Samad, M.: Large Scale Data management in Grid Systems: a Survey. In: IEEE Intl. Conf. on Information and Communication Technologies: from Theory to Applications, pp. 1–6. IEEE CS, Los Alamitos (2008)
98. Morvan, F., Hameurlain, A.: Dynamic Query Optimization: Towards Decentralized Methods. Intl. Jour. of Intelligent Information and Database Systems (to appear, 2009)
Section 1 contains materials from [98] with kind permission from Inderscience. Section 3 contains materials from [57] with kind permission from CRL Publishing. Section 5 contains materials from [98] with kind permission from Inderscience. Sections 6 and 7 contain materials from [58, 98] with kind permission from IEEE and Inderscience.
References

1. Adali, S., Candan, K.S., Papakonstantinou, Y., Subrahmanian, V.S.: Query Caching and Optimization in Distributed Mediator Systems. In: Proc. of ACM SIGMOD Intl. Conf. on Management of Data, pp. 137–148. ACM Press, New York (1996)
2. Alpdemir, M.N., Mukherjee, A., Gounaris, A., Paton, N.W., Fernandes, A.A.A., Sakellariou, R., Watson, P., Li, P.: Using OGSA-DQP to support scientific applications for the grid. In: Herrero, P., S. Pérez, M., Robles, V. (eds.) SAG 2004. LNCS, vol. 3458, pp. 13–24. Springer, Heidelberg (2005)
3. Amsaleg, L., Franklin, M.J., Tomasic, A., Urhan, T.: Scrambling query plans to cope with unexpected delays. In: Proc. of the Fourth Intl. Conf. on Parallel and Distributed Information Systems, pp. 208–219. IEEE CS, Los Alamitos (1996)
4. Amsaleg, L., Franklin, M., Tomasic, A.: Dynamic query operator scheduling for wide-area remote access. Distributed and Parallel Databases 6(3), 217–246 (1998)
5. Antonioletti, M., et al.: The design and implementation of Grid database services in OGSA-DAI. In: Concurrency and Computation: Practice & Experience, vol. 17, pp. 357–376. Wiley InterScience, Hoboken (2005)
6. Arcangeli, J.-P., Hameurlain, A., Migeon, F., Morvan, F.: Mobile Agent Based Self-Adaptive Join for Wide-Area Distributed Query Processing. Jour. of Database Management 15(4), 25–44 (2004)
7. Avnur, R., Hellerstein, J.-M.: Eddies: Continuously Adaptive Query Processing. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, vol. 29, pp. 261–272. ACM Press, New York (2000)
8. Babu, S., Bizarro, P., De Witt, D.J.: Proactive re-optimization. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 107–118. ACM Press, New York (2005)
9. Bancilhon, F., Ramakrishnan, R.: An Amateur’s Introduction to Recursive Query Processing Strategies. In: Proc. of the 1986 ACM SIGMOD Conf. on Management of Data, vol. 15, pp. 16–52. ACM Press, New York (1986)
10. Bernstein, P.A., Goodman, N., Wong, E., Reeve, C.L., Rothnie Jr.: Query Processing in a System for Distributed Databases (SDD-1). ACM Trans. Database Systems 6(4), 602–625 (1981)
11. Bizarro, P., Bruno, N., De Witt, D.J.: Progressive Parametric Query Optimization. IEEE Transactions on Knowledge and Data Engineering 21(4), 582–594 (2009)
12. Bonneau, S., Hameurlain, A.: Hybrid Simultaneous Scheduling and Mapping in SQL Multi-query Parallelization. In: Bench-Capon, T.J.M., Soda, G., Tjoa, A.M. (eds.) DEXA 1999. LNCS, vol. 1677, pp. 88–99. Springer, Heidelberg (1999)
13. Bose, S.K., Krishnamoorthy, S., Ranade, N.: Allocating Resources to Parallel Query Plans in Data Grids. In: Proc. of the 6th Intl. Conf. on Grid and Cooperative Computing, pp. 210–220. IEEE CS, Los Alamitos (2007)
14. Bouganim, L., Fabret, F., Mohan, C., Valduriez, P.: A dynamic query processing architecture for data integration systems. Journal of IEEE Data Engineering Bulletin 23(2), 42–48 (2000)
15. Bouganim, L., Fabret, F., Mohan, C., Valduriez, P.: Dynamic query scheduling in data integration systems. In: Proc. of the 16th Intl. Conf. on Data Engineering, pp. 425–434. IEEE CS, Los Alamitos (2000)
16. Bratbergsengen, K.: Hashing Methods and Relational Algebra Operations. In: Proc. of 10th Intl. Conf. on VLDB, pp. 323–333. Morgan Kaufmann, San Francisco (1984)
17. Brunie, L., Kosch, H.: Control Strategies for Complex Relational Query Processing in Shared Nothing Systems. SIGMOD Record 25(3), 34–39 (1996)
18. Brunie, L., Kosch, H.: Intégration d’heuristiques d’ordonnancement dans l’optimisation parallèle de requêtes relationnelles. Revue Calculateurs Parallèles, numéro spécial: Bases de données Parallèles et Distribuées 9(3), 327–346 (1997); Ed. Hermès
19. Brunie, L., Kosch, H., Wohner, W.: From the modeling of parallel relational query processing to query optimization and simulation. Parallel Processing Letters 8, 2–24 (1998)
20. Bruno, N., Chaudhuri, S.: Efficient Creation of Statistics over Query Expressions. In: Proc. of the 19th Intl. Conf. on Data Engineering, Bangalore, India, pp. 201–212. IEEE CS, Los Alamitos (2003)
21. Chaudhuri, S.: An Overview of Query Optimization in Relational Systems. In: Symposium in Principles of Database Systems PODS 1998, pp. 34–43. ACM Press, New York (1998)
22. Chawathe, S.S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J.D., Widom, J.: The TSIMMIS Project: Integration of Heterogeneous Information Sources. In: Proc. of the 10th Meeting of the Information Processing Society of Japan, pp. 7–18 (1994)
236
A. Hameurlain and F. Morvan
23. Chekuri, C., Hassan, W.: Scheduling Problem in Parallel Query Optimization. In: Symposium in Principles of Database Systems PODS 1995, pp. 255–265. ACM Press, New York (1995) 24. Chen, M.S., Lo, M., Yu, P.S., Young, H.S.: Using Segmented Right-Deep Trees for the Execution of Pipelined Hash Joins. In: Proc. of the 18th VLDB Conf., pp. 15–26. Morgan Kaufmann, San Francisco (1992) 25. Chiu, D.M., Ho, Y.C.: A Methodology for Interpreting Tree Queries Into Optimal SemiJoin Expressions. In: Proc. of the 1980 ACM SIGMOD, pp. 169–178. ACM Press, New York (1980) 26. Christophides, V., Cluet, S., Moerkotte, G.: Evaluating Queries with Generalized Path Expression. In: Proc. of the 1996 ACM SIGMOD, vol. 25, pp. 413–422. ACM Press, New York (1996) 27. Cole, R.L., Graefe, G.: Optimization of dynamic query evaluation plans. In: Proc. of the 1994 ACM SIGMOD, vol. 24, pp. 150–160. ACM Press, New York (1994) 28. Collet, C., Vu, T.-T.: QBF: A Query Broker Framework for Adaptable Query Evaluation. In: Christiansen, H., Hacid, M.-S., Andreasen, T., Larsen, H.L. (eds.) FQAS 2004. LNCS, vol. 3055, pp. 362–375. Springer, Heidelberg (2004) 29. Cybula, P., Kozankiewicz, H., Stencel, K., Subieta, K.: Optimization of Distributed Queries in Grid Via Caching. In: Meersman, R., Tari, Z., Herrero, P. (eds.) OTM-WS 2005. LNCS, vol. 3762, pp. 387–396. Springer, Heidelberg (2005) 30. Da Silva, V.F.V., Dutra, M.L., Porto, F., Schulze, B., Barbosa, A.C., de Oliveira, J.C.: An adaptive parallel query processing middleware for the Grid. In: Concurrence and Computation: Pratique and Experience, vol. 18, pp. 621–634. Wiley InterScience, Hoboken (2006) 31. Date, C.J.: An Introduction to Database Systems, 6th edn. Addison-Wesley, Reading (1995) 32. Deshpande, A., Hellerstein, J.-M.: Lifting the Burden of History from Adaptive Query Processing. In: Proc. of the 13th Intl. Conf. on VLDB, pp. 948–959. Morgan Kaufmann, San Francisco (2004) 33. 
De Witt, D.J., Kabra, N., Luo, J., Patel, J.M., Yu, J.B.: Client-Server Paradise. In: Proc. of the 20th VLDB Conf., pp. 558–569. Morgan Kaufmann, San Francisco (1994) 34. Dinquel, J.: Network Architectures for Cluster Computing. Technical Report 572, CECS, California State University (2000) 35. Du, W., Krishnamurthy, R., Shan, M.-C.: Query Optimization in a Heterogeneous DBMS. In: Proc. of the 18th Intl. Conf. on VLDB, pp. 277–291. Morgan Kaufmann, San Francisco (1992) 36. El Samad, M., Gossa, J., Morvan, F., Hameurlain, A., Pierson, J.-M., Brunie, L.: A monitoring service for large-scale dynamic query optimisation in a grid environment. Intl. Jour. of Web and Grid Services 4(2), 222–246 (2008) 37. Evrendilek, C., Dogac, A., Nural, S., Ozcan, F.: Multidatabase Query Optimization. Journal of Distributed and Parallel Databases 5(1), 77–113 (1997) 38. Foster, I.: The Grid: A New Infrastructure for 21st Century Science. Physics Today 55(2), 42–56 (2002) 39. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco (2004) 40. Fuggetta, A., Picco, G.-P., Vigna, G.: Understanding Code Mobility. IEEE Transactions on Software Engineering 24(5), 342–361 (1998)
Evolution of Query Optimization Methods
237
41. Ganguly, S., Hasan, W., Krishnamurthy, R.: Query Optimization for Parallel Execution. In: Proc. of the 1992 ACM SIGMOD int’l. Conf. on Management of Data, vol. 21, pp. 9– 18. ACM Press, San Diego (1992) 42. Ganguly, S., Goel, A., Silberschatz, A.: Efficient and Accurate Cost Models for Parallel Query Optimization. In: Symposium in Principles of Database Systems PODS 1996, pp. 172–182. ACM Press, New York (1996) 43. Gardarin, G., Sha, F., Tang, Z.-H.: Calibrating the Query Optimizer Cost Model of IRODB, an Object-Oriented Federated Database System. In: Proc. of 22nd Intl. Conf. on VLDB, pp. 378–389. Morgan Kaufmann, San Francisco (1996) 44. Garofalakis, M.N., Ioannidis, Y.E.: Multi-dimensional Resource Scheduling for Parallel Queries. In: Proc. of the 1996 ACM SIGMOD intl. Conf. on Management of Data, vol. 25, pp. 365–376. ACM Press, New York (1996) 45. Garofalakis, M.N., Ioannidis, Y.E.: Parallel Query Scheduling and Optimization with Time- and Space - Shared Resources. In: Proc. of the 23rd VLDB Conf., pp. 296–305. Morgan Kaufmann, San Francisco (1997) 46. Goldman, R., Widom, J.: WSQ/DSQ: A practical approach for combined querying of databases and the web. In: Proc. of ACM SIGMOD Conf., pp. 285–296. ACM Press, New York (2000) 47. Gounaris, A., Paton, N.W., Fernandes, A.A.A., Sakellariou, R.: Adaptive Query Processing: A Survey. In: Eaglestone, B., North, S.C., Poulovassilis, A. (eds.) BNCOD 2002. LNCS, vol. 2405, pp. 11–25. Springer, Heidelberg (2002) 48. Gounaris, A., Paton, N.W., Sakellariou, R., Fernandes, A.A.A.: Adaptive Query Processing and the Grid: Opportunities and Challenges. In: Proc. of the 15th Intl. Dexa Workhop, pp. 506–510. IEEE CS, Los Alamitos (2004) 49. Gounaris, A., Sakellariou, R., Paton, N.W., Fernandes, A.A.A.: Resource Scheduling for Parallel Query Processing on Computational Grids. In: Proc. of the 5th IEEE/ACM Intl. Workshop on Grid Computing, pp. 396–401 (2004) 50. 
Gounaris, A., Smith, J., Paton, N.W., Sakellariou, R., Fernandes, A.A.A., Watson, P.: Adapting to Changing Resource Performance in Grid Query. In: Pierson, J.-M. (ed.) VLDB DMG 2005. LNCS, vol. 3836, pp. 30–44. Springer, Heidelberg (2006) 51. Graefe, G.: Query Evaluation Techniques for Large Databases. ACM Computing Survey 25(2), 73–170 (1993) 52. Graefe, G.: Volcano - An Extensible and Parallel Query Evaluation System. IEEE Trans. Knowl. Data Eng. 6(1), 120–135 (1994) 53. Haas, L.M., Kossmann, D., Wimmers, E.L., Yang, J.: Optimizing Queries Across Diverse Data Sources. In: Proc. of 23rd Intl. Conf. on VLDB, pp. 276–285. Morgan Kaufmann, San Francisco (1997) 54. Hameurlain, A., Bazex, P., Morvan, F.: Traitement parallèle dans les bases de données relationnelles: concepts, méthodes et applications. Cépaduès Editions (1996) 55. Hameurlain, A., Morvan, F.: An Overview of Parallel Query Optimization in Relational Systems. In: 11th Intl Worshop on Database and Expert Systems Applications, pp. 629– 634. IEEE CS, Los Alamitos (2000) 56. Hameurlain, A., Morvan, F.: CPU and incremental memory allocation in dynamic parallelization of SQL queries. Journal of Parallel Computing 28(4), 525–556 (2002) 57. Hameurlain, A., Morvan, F.: Parallel query optimization methods and approaches: a survey. Journal of Computers Systems Science & Engineering 19(5), 95–114 (2004) 58. Hameurlain, A., Morvan, F., El Samad, M.: Large Scale Data management in Grid Systems: a Survey. In: IEEE Intl. Conf. on Information and Communication Technologies: from Theory to Applications, pp. 1–6. IEEE CS, Los Alamitos (2008)
238
A. Hameurlain and F. Morvan
59. Han, W.-S., Ng, J., Markl, V., Kache, H., Kandil, M.: Progressive optimization in a shared-nothing parallel database. In: Proc.of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 809–820 (2007) 60. Hasan, W., Motwani, R.: Optimization Algorithms for Exploiting the Parallelism - Communication Tradeoff in Pipelined Parallelism. In: Proc. of the 20th int’l. Conf. on VLDB, pp. 36–47. Morgan Kaufmann, San Francisco (1994) 61. Hasan, W., Florescu, D., Valduriez, P.: Open Issues in Parallel Query Optimization. SIGMOD Record 25(3), 28–33 (1996) 62. Hellerstein, J.M., Franklin, M.J.: Adaptive Query Processing: Technology in Evolution. Bulletin of Technical Committee on Data Engineering 23(2), 7–18 (2000) 63. Hong, W.: Exploiting Inter-Operation Parallelism in XPRS. In: Proc. ACM SIGMOD Conf. on Management of Data, pp. 19–28. ACM Press, New York (1992) 64. Howes, T., Smith, M.C., Good, G.S., Howes, T.A., Smith, M.: Understanding and Deploying LDAP Directory Services. MacMillan, Basingstoke (1999) 65. Hu, N., Wang, Y., Zhao, L.: Dynamic Optimization of Sub query Processing in Grid Database, Natural Computation. In: Proc of the 3rd Intl Conf. on Natural Computation, vol. 5, pp. 8–13. IEEE CS, Los Alamitos (2007) 66. Hussein, M., Morvan, F., Hameurlain, A.: Embedded Cost Model in Mobile Agents for Large Scale Query Optimization. In: Proc. of the 4th Intl. Symposium on Parallel and Distributed Computing, pp. 199–206. IEEE CS, Los Alamitos (2005) 67. Hussein, M., Morvan, F., Hameurlain, A.: Dynamic Query Optimization: from Centralized to Decentralized. In: 19th Intl. Conf. on Parallel and Distributed Computing Systems, ISCA, pp. 273–279 (2006) 68. Ioannidis, Y.E., Wong, E.: Query Optimization by Simulated Annealing. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 9–22. ACM Press, New York (1987) 69. Ioannidis, Y.E., Kang, Y.C.: Randomized Algorithms for Optimizing Large Join Queries. In: Proc of the 1990 ACM SIGMOD Conf. on the Manag. of Data, vol. 
19, pp. 312–321 (1990) 70. Ioannidis, Y.E., Christodoulakis, S.: On the Propagation of Errors in the Size of Join Results. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 268–277. ACM Press, New York (1991) 71. Ioannidis, Y.E., Ng, R.T., Shim, K., Sellis, T.K.: Parametric Query Optimization. In: 18th Intl. Conf. on VLDB, pp. 103–114. Morgan Kaufmann, San Francisco (1992) 72. Ives, Z.-G., Florescu, D., Friedman, M., Levy, A.Y., Weld, D.S.: An adaptive query execution system for data integration. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 299–310. ACM Press, New York (1999) 73. Ives, Z.-G., Levy, A.Y., Weld, D.S., Florescu, D., Friedman, M.: Adaptive query processing for internet applications. Journal of IEEE Data Engineering Bulletin 23(2), 19–26 (2000) 74. Ives, Z.-G., Halevy, A.-Y., Weld, D.-S.: Adapting to Source Properties in Processing Data Integration Queries. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 395–406. ACM Press, New York (2004) 75. Jarke, M., Koch, J.: Query Optimization in Database Systems. ACM Comput. Surv. 16(2), 111–152 (1984) 76. Jones, R., Brown, J.: Distributed query processing via mobile agents (1997), http://www.cs.umd.edu/~rjones/paper.html
Evolution of Query Optimization Methods
239
77. Kabra, N., Dewitt, D.J.: Efficient Mid - Query Re-Optimization of Sub-Optimal Query Execution Plans. In: Proc. of the ACM SIGMOD intl. Conf. on Management of Data, vol. 27, pp. 106–117. ACM Press, New York (1998) 78. Kabra, N., De Witt, D.J.: OPT++: An Object-Oriented Implementation for Extensible Database Query Optimization. VLDB Journal 8, 55–78 (1999) 79. Khan, M.F., Paul, R., Ahmed, I., Ghafoor, A.: Intensive Data Management in Parallel Systems: A Survey. Distributed and Parallel Databases 7, 383–414 (1999) 80. Khan, L., Mcleod, D., Shahabi, C.: An Adaptive Probe-Based Technique to Optimize Join Queries in Distributed Internet Databases. Journal of Database Management 12(4), 3–14 (2001) 81. Kosch, H.: Managing the operator ordering problem in parallel databases. Future Generation Computer Systems 16(6), 665–676 (2000) 82. Kossmann, D.: The State of the Art in Distributed Query Processing. ACM Computing Surveys 32(4), 422–469 (2000) 83. Lanzelotte, R.S.G.: OPUS: an extensible Optimizer for Up-to-date database Systems. PhD Thesis, Computer Science, PUC-RIO, available at INRIA, Rocquencourt, n° TU-127 (1990) 84. Lanzelotte, R.S.G., Valduriez, P.: Extending the Search Strategy in a Query Optimizer. In: Proc. of the Int’l Conf. on VLDB, pp. 363–373. Morgan Kaufmann, San Francisco (1991) 85. Lanzelotte, R.S.G., Zaït, M., Gelder, A.V.: Measuring the effectiveness of optimization. Search Strategies. In: BDA 1992, Trégastel, pp. 162–181 (1992) 86. Lanzelotte, R.S.G., Valduriez, P., Zaït, M.: On the Effectiveness of Optimization Search Strategies for Parallel Execution Spaces. In: Proc. of the Intl Conf. on VLDB, pp. 493– 504. Morgan Kaufmann, San Francisco (1993) 87. Lazaridis, I., Mehrotra, S.: Optimization of multi-version expensive predicates. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 797–808. ACM Press, New York (2007) 88. Levy, A.Y., Rajaraman, A., Ordille, J.J.: Querying Heterogeneous Information Sources Using Source Descriptions. 
In: Proc. of the Intl. Conf. on VLDB, pp. 251–262. Morgan Kaufmann, San Francisco (1996) 89. Liu, S., Karimi, H.A.: Grid query optimizer to improve query processing in grids. Future Generation Computer Systems 24(5), 342–353 (2008) 90. Lu, H., Ooi, B.C., Tan, K.-L.: Query Processing in Parallel Relational Database Systems. IEEE CS Press, Los Alamitos (1994) 91. Mackert, L.F., Lohman, G.M.: R* Optimizer Validation and Performance Evaluation for Distributed Queries. In: Proc. of the 12th Intl. Conf. on VLDB, pp. 149–159 (1986) 92. Manolescu, I.: Techniques d’optimisation pour l’interrogation des sources de données hétérogènes et distribuées, Ph-D Thesis, Université de Versailles Saint-Quentin-enYvlenies, France (2001) 93. Manolescu, I., Bouganim, L., Fabret, F., Simon, E.: Efficient querying of distributed resources in mediator systems. In: Meersman, R., Tari, Z., et al. (eds.) CoopIS 2002, DOA 2002, and ODBASE 2002. LNCS, vol. 2519, pp. 468–485. Springer, Heidelberg (2002) 94. Marzolla, M., Mordacchini, M., Orlando, S.: Peer-to-Peer for Discovering resources in a Dynamic Grid. Jour. of Parallel Computing 33(4-5), 339–358 (2007) 95. Mehta, M., Dewitt, D.J.: Managing Intra-Operator Parallelism in Parallel Database Systems. In: Proc. of the 21th Intl. Conf. on VLDB, pp. 382–394 (1995) 96. Mehta, M., Dewitt, D.J.: Data Placement in Shared-Nothing Parallel Database Systems. The VLDB Journal 6, 53–72 (1997)
240
A. Hameurlain and F. Morvan
97. Morvan, F., Hussein, M., Hameurlain, A.: Mobile Agent Cooperation Methods for Large Scale Distributed Dynamic Query Optimization. In: Proc. of the 14th Intl. Workshop on Database and Expert Systems Applications, pp. 542–547. IEEE CS, Los Alamitos (2003) 98. Morvan, F., Hameurlain, A.: Dynamic Query Optimization: Towards Decentralized Methods. Intl. Jour. of Intelligent Information and Database Systems (to appear, 2009) 99. Naacke, H., Gardarin, G., Tomasic, A.: Leveraging Mediator Cost Models with Heterogeneous Data Sources. In: Proc. of the 14th Intl. Conf. on Data Engineering, pp. 351–360. IEEE CS, Los Alamitos (1998) 100. Ono, K., Lohman, G.M.: Measuring the Complexity of Join Enumeration in Query Optimization. In: Proc. of the Int’l Conf. on VLDB, pp. 314–325. Morgan Kaufmann, San Francisco (1990) 101. Ozakar, B., Morvan, F., Hameurlain, A.: Mobile Join Operators for Restricted Sources. Mobile Information Systems: An International Journal 1(3), 167–184 (2005) 102. Ozcan, F., Nural, S., Koksal, P., Evrendilek, C., Dogac, A.: Dynamic query optimization in multidatabases. Data Engineering Bulletin CS 20(3), 38–45 (1997) 103. Özsu, M.T., Valduriez, P.: Principles of Distributed Database Systems, 2nd edn. PrenticeHall, Englewood Cliffs (1999) 104. Pacitti, E., Valduriez, P., Mattoso, M.: Grid Data Management: Open Problems and News Issues. Intl. Journal of Grid Computing 5(3), 273–281 (2007) 105. Paton, N.W., Chávez, J.B., Chen, M., Raman, V., Swart, G., Narang, I., Yellin, D.M., Fernandes, A.A.A.: Autonomic query parallelization using non-dedicated computers: an evaluation of adaptivity options. VLDB Journal 18(1), 119–140 (2009) 106. Porto, F., da Silva, V.F.V., Dutra, M.L., Schulze, B.: An Adaptive Distributed Query Processing Grid Service. In: Pierson, J.-M. (ed.) VLDB DMG 2005. LNCS, vol. 3836, pp. 45–57. Springer, Heidelberg (2006) 107. Rahm, E., Marek, R.: Dynamic Multi-Resource Load Balancing in Parallel Database Systems. In: Proc. 
of the 21st VLDB Conf., pp. 395–406 (1995) 108. Rajaraman, A., Sagiv, Y., Ullman, J.D.: Answering queries using templates with binding patterns. In: The Proc. of ACM PODS, pp. 105–112. ACM Press, New York (1995) 109. Raman, V., Deshpande, A., Hellerstein, J.-M.: Using State Modules for Adaptive Query Processing. In: Proc. of the 19th Intl. Conf. on Data Engineering, pp. 353–362. IEEE CS, Los Alamitos (2003) 110. Sahuguet, A., Pierce, B., Tannen, V.: Distributed Query Optimization: Can Mobile Agents Help? (2000), http://www.seas.upenn.edu/~gkarvoun/dragon/publications/ sahuguet/ 111. Schneider, D., Dewitt, D.J.: Tradeoffs in Processing Complex Join Queries via Hashing in Multiprocessor Database Machines. In: Proc. of the 16th VLDB Conf., pp. 469–480. Morgan Kaufmann, San Francisco (1990) 112. Selinger, P.G., Astrashan, M., Chamberlin, D., Lorie, R., Price, T.: Access Path Selection in a Relational Database Management System. In: Proc. of the 1979 ACM SIGMOD Conf. on Management of Data, pp. 23–34. ACM Press, New York (1979) 113. Selinger, P.G., Adiba, M.E.: Access Path Selection in Distributed Database Management Systems. In: Proc. Intl. Conf. on Data Bases, pp. 204–215 (1980) 114. Slimani, Y., Najjar, F., Mami, N.: An Adaptable Cost Model for Distributed Query Optimization on the Grid. In: Meersman, R., Tari, Z., Corsaro, A. (eds.) OTM-WS 2004. LNCS, vol. 3292, pp. 79–87. Springer, Heidelberg (2004)
Evolution of Query Optimization Methods
241
115. Smith, J., Gounaris, A., Watson, P., Paton, N.W., Fernandes, A.A.A., Sakellariou, R.: Distributed Query Processing on the Grid. In: Parashar, M. (ed.) GRID 2002. LNCS, vol. 2536, pp. 279–290. Springer, Heidelberg (2002) 116. Soe, K.M., New, A.A., Aung, T.N., Naing, T.T., Thein, N.L.: Efficient Scheduling of Resources for Parallel Query Processing on Grid-based Architecture. In: Proc. of the 6th Asia-Pacific Symposium, pp. 276–281. IEEE CS, Los Alamitos (2005) 117. Stillger, M., Lohman, G.M., Markl, V., Kandil, M.: LEO - DB2’s LEarning Optimizer. In: Proc.of 27th Intl. Conf. on Very Large Data Bases, pp. 19–28. Morgan Kaufmann, San Francisco (2001) 118. Stonebraker, M., Katz, R.H., Paterson, D.A., Ousterhout, J.K.: The Design of XPRS. In: Proc. of the 4th VLDB Conf., pp. 318–330. Morgan Kaufmann, San Francisco (1988) 119. Stonebraker, M., Aoki, P.M., Litwin, W., Pfeffer, A., Sah, A., Sidell, J., Staelin, C., Yu, A.: Mariposa: A Wide-Area Distributed Database System. VLDB Jour. 5(1), 48–63 (1996) 120. Stonebraker, M., Hellerstein, J.M.: Readings in Database Systems, 3rd edn. Morgan Kaufmann, San Francisco (1998) 121. Swami, A.: Optimization of large join queries. Technical report, Software Techonology Laboratory, H-P Laboratories, Report STL-87-15 (1987) 122. Swami, A.N., Gupta, A.: Optimization of Large Join Queries. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 8–17. ACM Press, New York (1988) 123. Swami, A.N.: Optimization of Large Join Queries: Combining Heuristic and Combinatorial Techniques. In: Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pp. 367–376 (1989) 124. Tan, K.L., Lu, H.: A Note on the Strategy Space of Multiway Join Query Optimization Problem in Parallel Systems. SIGMOD Record 20(4), 81–82 (1991) 125. Taniar, D., Leung, C.H.C.: Query execution scheduling in parallel object-oriented databases. Information & Software Technology 41(3), 163–178 (1999) 126. 
Taniar, D., Leung, C.H.C., Rahayu, J.W., Goel, S.: High Performance Parallel Database Processing and Grid Databases. John Wiley & Sons, Chichester (2008) 127. Tomasic, A., Raschid, L., Valduriez, P.: Scaling Heterogeneous Databases and the Design of Disco. In: Proc. of the 16th Intl. Conf. on Distributed Computing Systems, pp. 449–457. IEEE CS, Los Alamitos (1996) 128. Tomasic, A., Raschid, L., Valduriez, P.: Scaling Access to Heterogeneous Data Sources with DISCO. IEEE Trans. Knowl. Data Eng. 10(5), 808–823 (1998) 129. Trunfio, P., et al.: Peer-to-Peer resource discovery in Grids: Models and systems. Future Generation Computer Systems 23(7), 864–878 (2007) 130. Ullman, J.D.: Principles of Database and Knowledge-Base Systems, vol. I. Computer Science Press (1988) 131. Urhan, T., Franklin, M.: XJoin: A reactively-scheduled pipelined join operator. IEEE Data Engineering Bulletin 23(2), 27–33 (2000) 132. Urhan, T., Franklin, M.: Dynamic pipeline scheduling for improving interactive query performance. In: Proc.of 27th Intl. Conf. on VLDB, pp. 501–510. Morgan Kaufmann, San Francisco (2001) 133. Valduriez, P.: Semi-Join Algorithms for Distributed Database Machines. In: Proc. of the 2nd Intl. Symposium on Distributed Data Bases, pp. 22–37. North-Holland Publishing Company, Amsterdam (1982) 134. Valduriez, P., Gardarin, G.: Join and Semijoin Algorithms for a Multiprocessor Database Machine. ACM Trans. Database Syst. 9(1), 133–216 (1984)
242
A. Hameurlain and F. Morvan
135. Wohrer, A., Brezany, P., Tjoa, A.M.: Novel mediator architectures for Grid information systems. Future Generation Computer Systems, 107–114 (2005) 136. Wiederhold, G.: Mediators in the Architecture of Future Information Systems. Journal of IEEE Computer 25(3), 38–49 (1992) 137. Wolski, R., Spring, N.T., Hayes, J.: The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing. Journal of Future Generation Computing Systems 15(5-6), 757–768 (1999) 138. Wong, E., Youssefi, K.: Decomposition: A Strategy for Query Processing. ACM Transactions on Database Systems 1, 223–241 (1976) 139. Yerneni, R., Li, C., Ullman, J.D., Garcia-Molina, H.: Optimizing Large Join Queries in Mediation Systems. In: Beeri, C., Bruneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 348–364. Springer, Heidelberg (1998) 140. Zhou, Y., Ooi, B.C., Tan, K.-L., Tok, W.H.: An adaptable distributed query processing architecture. Data & Knowledge Engineering 53(3), 283–309 (2005) 141. Zhu, Q., Motheramgari, S., Sun, Y.: Cost Estimation for Queries Experiencing Multiple Contention States in Dynamic Multidatabase Environments. Journal of Knowledge and Information Systems Publishers 5(1), 26–49 (2003) 142. Ziane, M., Zait, M., Borlat-Salamet, P.: Parallel Query Processing in DBS3. In: Proc of the 2nd Intl. Conf. on Parallel and Distributed Information Systems, pp. 93–102. IEEE CS, Los Alamitos (1993)
Holonic Rationale and Bio-inspiration on Design of Complex Emergent and Evolvable Systems

Paulo Leitao

Polytechnic Institute of Bragança, Campus Sta Apolonia, Apartado 1134, 5301-857 Bragança, Portugal
[email protected]

Abstract. Traditional centralized and rigid control structures are too inflexible to meet the requirements of reconfigurability, responsiveness and robustness imposed by customer demands in the current global economy. The Holonic Manufacturing Systems (HMS) paradigm, which has been pointed out as a suitable solution to meet these requirements, translates concepts inherited from social organizations and biology to the manufacturing world. It offers an alternative way of designing adaptive systems in which the traditional centralized control is replaced by decentralization over distributed and autonomous entities organized in hierarchical structures formed by intermediate stable forms. In spite of its enormous potential, methods regarding the self-adaptation and self-organization of complex systems are still missing. This paper discusses how insights from biology, in connection with new fields of computer science, can be useful to enhance holonic design with the aim of achieving more self-adaptive and evolvable systems. Special attention is devoted to the discussion of emergent behavior and self-organization concepts, and the way they can be combined with the holonic rationale.

Keywords: Holonic Manufacturing Systems, Bio-inspiration, Emergent Behavior, Self-organization.
A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 243–266, 2009. © Springer-Verlag Berlin Heidelberg 2009

1 Introduction

To be competitive in the current global economy and to meet customized customer demands, manufacturing companies must compete on cost, quality and responsiveness [1]. Several studies, e.g. the one prepared by the Manufuture High Level Group of experts, promoted by the European Commission [2], reinforce this idea by identifying reconfigurable manufacturing as the highest priority for future research in manufacturing. Re-configurability, that is, the ability of a system to dynamically change its configuration, usually in response to dynamic changes in its environment, provides a way to achieve a rapid and adaptive response to change, which is a key enabler of competitiveness.

Traditional manufacturing control approaches typically rely on large, monolithic and centralized systems that adapt poorly to dynamic changes in their environment and thus do not efficiently support these requirements. The quest for re-configurability requires a new class of intelligent and distributed
manufacturing control systems that operate in a totally different way from the traditional ones, with the centralized and rigid control structures replaced by the decentralization of control functions over distributed entities. Multi-agent systems [3], derived from distributed artificial intelligence and based on the ideas proposed by Marvin Minsky in his seminal work “The Society of Mind” [4], suggest the definition of distributed control based on autonomous agents that realize efficient, flexible, reconfigurable and robust overall plant control, without any need for centralized control. Other emergent manufacturing paradigms have been proposed in recent years, namely Bionic Manufacturing Systems (BMS) [5] and Holonic Manufacturing Systems (HMS) [6], which use biological and social organization theories as sources of inspiration. These paradigms agree in proposing distributed, autonomous and adaptive manufacturing systems, while suggesting that hierarchy is needed to guarantee conflict resolution between entities and to maintain the overall system coherence and objectivity in the face of the individual and autonomous attitude of the entities [7]. Among others, the Product-Resource-Order-Staff Architecture (PROSA) [6], the ADAptive holonic COntrol aRchitecture for distributed manufacturing systems (ADACOR) [8], the Holonic Component-Based Architecture (HCBA) [9] and P+2000 [10] are successful examples of the application of the HMS principles. See, for example, [11] and [12] for a deeper analysis of the state of the art in agent-based manufacturing control.

The application of multi-agent systems and holonic paradigms does not, by itself, completely solve the current manufacturing problems; these paradigms must be combined with mechanisms that support dynamic structure re-configuration, thus dealing more effectively with unpredicted behavior and minimizing its effects.
In other words, questions such as how global production optimization is achieved in decentralized systems, how temporary hierarchies are dynamically formed, evolved and removed, how individual components self-organize and evolve to support evolution and emergence, and how to adapt their emergent behavior using learning algorithms are still far from being answered. In fact, in spite of the interesting potential introduced by these paradigms, methods and tools are required to facilitate their design and maintenance, in particular with regard to their self-adaptation and self-organization properties.

Biology and nature seem suitable sources of inspiration to answer the above questions and to achieve better solutions for self-adaptive and evolvable complex systems. A recent article published in National Geographic magazine reinforces this idea, stating that “the study of swarm intelligence is providing insights that can help humans to manage complex systems” based on the idea that “a single ant or bee isn't smart but their colonies are” [13]. Biological theories are being successfully used to develop complex adaptive applications, namely in economics, logistics and military applications. As examples, Air Liquide uses an ant-based strategy to manage the truck routes for delivering industrial and medical gases, the Sky Harbor International Airport in Phoenix uses an ant-based model to improve airline scheduling [13], and swarm intelligence principles were used to forecast energy demands in Turkey.

In this context, this paper discusses the benefits that bio-inspired theories can bring to the manufacturing world, analyzing how insights from biology, such as emergence and self-organization, in connection with new fields of computer science,
such as artificial life and evolutionary computing, can be applied to strengthen the design of holonic systems with the aim of achieving more adaptive and re-configurable systems. The ADACOR holonic approach is used to illustrate the application of some biologically inspired concepts to design an adaptive production control system that evolves dynamically between a more hierarchical and a more heterarchical control architecture, based on the self-organization concept.

The rest of the paper is organized as follows: Section 2 overviews the holonic rationale applied to complex manufacturing systems. Section 3 illustrates the use of biological theories, namely the swarm intelligence principles, to build complex systems exhibiting emergent behavior, and Section 4 discusses how self-organization principles can be applied to achieve self-adaptive, re-configurable and evolvable complex systems. Section 5 summarizes the differences and similarities between the emergence and self-organization concepts, pointing out the benefits of combining them with the holonic rationale. Section 6 presents the ADACOR example of applying biologically inspired concepts to achieve an adaptive production control system. Finally, Section 7 rounds up the paper with the conclusions.
2 Back to the Roots of Holonic Rationale

The manufacturing domain is, and will remain, one of the main wealth generators of the world economy [14]. The analysis, design and implementation of new and innovative manufacturing systems are therefore of crucial importance in order to maintain those wealth levels and to establish a solid base for economic growth. The manufacturing environment is usually characterized as being:

• Non-linear, since it is based on processes that are regulated by non-linear equations where effects are not proportional to causes.
• Complex, since the parameters that regulate those processes are interdependent: changing one changes the others, resulting in unstable and unpredictable systems.
• Chaotic, since some processes work as amplifiers, i.e. small causes may provoke large effects. For example, the occurrence of a small disturbance in a machine may affect the system’s productivity. Additionally, as some disturbance effects can remain in the system after the disturbance is resolved, its occurrence may have a severe impact on the performance of manufacturing systems.
• Uncertain, i.e. decisions are made based on incomplete and inaccurate information and their execution is subject to uncertainty, e.g. the occurrence of deviations.

As a result, manufacturing systems are usually unpredictable and very difficult to control. In parallel, manufacturing is constantly subject to pressure from the market, which demands customized products with shorter delivery times; this requires the ability to respond and adapt promptly and efficiently to changes. This observation is illustrated by the current tendency to reduce batch sizes as a consequence of the mass customization era.
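The remark that the effects of a small machine disturbance can remain in the system after the disturbance is resolved can be illustrated with a toy calculation. The following is a minimal sketch and is not taken from the paper: it assumes a hypothetical single machine that receives a fixed number of jobs per tick and completes at most a fixed number per tick, and the names `simulate_backlog` and `ticks_to_recover` as well as all parameter values are illustrative assumptions.

```python
def simulate_backlog(arrival, capacity, down_start, down_len, horizon):
    """Queue length after each tick for one machine with a single breakdown.

    Hypothetical model: `arrival` jobs arrive per tick and the machine
    finishes at most `capacity` jobs per tick, except while it is down.
    """
    backlog = 0
    history = []
    for tick in range(horizon):
        backlog += arrival
        is_down = down_start <= tick < down_start + down_len
        served = 0 if is_down else min(backlog, capacity)
        backlog -= served
        history.append(backlog)
    return history


def ticks_to_recover(history, repair_tick):
    """Ticks the repaired machine needs before the queue is empty again."""
    for tick in range(repair_tick, len(history)):
        if history[tick] == 0:
            return tick - repair_tick + 1
    return None  # queue never emptied within the simulated horizon


if __name__ == "__main__":
    # Machine loaded at 90% of capacity, down for 5 ticks starting at tick 10.
    history = simulate_backlog(arrival=9, capacity=10, down_start=10,
                               down_len=5, horizon=100)
    print("peak backlog:", max(history))
    print("ticks to recover after repair:", ticks_to_recover(history, 15))
```

Under these assumed values, a 5-tick stoppage builds a queue of 45 jobs that takes a further 45 ticks to drain, because only the spare capacity (10 − 9 = 1 job per tick) is available to absorb it. The closer the machine runs to full capacity, the longer a short disturbance lingers.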
P. Leitao
The described observations show that manufacturing systems are complex adaptive systems working in a very dynamic and demanding environment with limited, or bounded, rationality [15]. Given the particularities of the manufacturing domain, suitable paradigms are required to address this challenge. HMS is a paradigm, developed under the international Intelligent Manufacturing Systems (IMS) collaborative research programme, which translates the concepts developed by Arthur Koestler into a set of appropriate concepts for the manufacturing domain. Koestler introduced the word holon to describe a basic unit of organization in living organisms and social organizations [16], based on Herbert Simon’s theories and on his own observations. Simon observed that complex systems are hierarchical systems formed by intermediate stable forms (see the parable of the watchmakers, footnote 1), which do not exist as self-sufficient and non-interactive elements but, on the contrary, are simultaneously a part and a whole. In fact, complex systems evolve from simple systems much more rapidly if there are stable intermediate forms than if there are not, i.e. if they are hierarchically organized. Simon’s observation becomes more applicable the more complex the product or the environment is. Koestler concluded that, although it is easy to identify sub-wholes or parts, wholes and parts in an absolute sense do not exist anywhere. The word holon, proposed by Koestler, represents this hybrid nature, being a combination of the Greek word holos, which means whole, and the suffix on, which means particle:

holon = holos (whole) + on (particle)

Koestler also identified important properties of a holon:
• Autonomy, where the stability of the holons, i.e. holons as stable forms, results from their ability to act autonomously in unpredictable circumstances.
• Self-reliance, meaning that the holons are intermediate forms providing a context for the proper functionality of the larger whole.
• Cooperation, which is the ability to have holons cooperating, transforming these holons into effective components of bigger wholes.
The holonic theory reconciles the holistic and reductionist approaches, describing their simultaneous application. Reductionism states that a complex system is nothing but the sum of its parts, and that an account of it can be reduced to accounts of its individual constituents. Holism, on the other hand, is the idea that the properties of a given system cannot all be determined or explained by its parts alone; instead, the system as a whole determines in an important way how the parts behave.

Footnote 1: The parable tells the story of two excellent watchmakers, Tempus and Hora. While Hora gets richer and richer, Tempus gets poorer and poorer. A team of analysts visits both shops and notices the following. Both watches consist of 1000 parts, but Tempus designed his watch such that, when he had to put down a partly assembled watch, it immediately fell to pieces and had to be reassembled from the basic elements. Hora designed his watches so that he could put together subassemblies of about ten components each. Ten of these subassemblies could be put together to make a larger subassembly. Finally, ten of the larger subassemblies constituted the whole watch. Each subassembly could be put down without falling apart [17].
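The advantage of stable intermediate forms in the parable can be made concrete with a small calculation. The following is only an illustrative sketch, not part of Simon's original account: it assumes that every part placement is interrupted independently with a fixed probability `p_interrupt` (a hypothetical parameter), that an interruption destroys only the unfinished (sub)assembly, and that the number of parts is an exact power of the group size.

```python
def expected_placements(run_length: int, p_interrupt: float) -> float:
    """Expected number of part placements needed to complete `run_length`
    consecutive placements without interruption.

    Assumption: each placement is interrupted with probability p_interrupt,
    and an interruption makes the unfinished (sub)assembly fall apart.
    """
    q = 1.0 - p_interrupt  # probability that a single placement succeeds
    return (q ** -run_length - 1.0) / p_interrupt


def tempus(parts: int, p_interrupt: float) -> float:
    """Flat design: the whole watch is one uninterrupted run of placements."""
    return expected_placements(parts, p_interrupt)


def hora(parts: int, group: int, p_interrupt: float) -> float:
    """Hierarchical design: stable subassemblies of `group` components each."""
    assemblies, units = 0, parts
    while units > 1:
        units //= group       # number of assemblies built at this level
        assemblies += units   # 100 + 10 + 1 = 111 for 1000 parts in tens
    return assemblies * expected_placements(group, p_interrupt)


if __name__ == "__main__":
    p = 0.01  # assumed interruption probability per placement
    print(f"Tempus: {tempus(1000, p):.0f} expected placements")
    print(f"Hora:   {hora(1000, 10, p):.0f} expected placements")
```

Under these assumptions the flat design requires far more work per finished watch than the hierarchical one, which is the quantitative core of the argument that hierarchically organized systems evolve faster.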
Holonic Rationale and Bio-inspiration
In a manufacturing environment, a holon can represent a physical or logical activity, such as a robot, a machine, an order, a flexible manufacturing system, or even an operator. The holon has a partial view of the world, containing information about itself and the environment. It may comprise an information processing part and a physical processing part, the latter only if the holon represents a physical device, such as an industrial robot [18], as illustrated in Fig. 1.
Fig. 1. Constitution of a Holon
Koestler defines a holarchy as a hierarchically organized structure of self-regulating holons that function first as autonomous wholes in supra-ordination to their parts, secondly as dependent parts in sub-ordination to controls on higher levels, and thirdly in coordination with their local environment [16]. The HMS is a holarchy that integrates the entire range of manufacturing activities, combining the best features of hierarchical and heterarchical organization, i.e. it preserves the stability and predictability of hierarchy while providing the dynamic flexibility and robustness of heterarchy. In HMS, the behaviors and activities of the holons are determined through cooperation with other holons, as opposed to being determined by a centralized mechanism. Considering the Janus effect (footnote 2), i.e. that holons are simultaneously self-contained wholes to their subordinated parts and dependent parts when seen from the higher levels, it is possible to recursively decompose a holon into several other holons, thus reducing the complexity of the problem. As an example, illustrated in Fig. 2, a holon belonging to a certain holarchy may represent a manufacturing cell, being simultaneously the whole, encapsulating holons representing the cell resources, and the part, when considering the shop floor system. The implementation of the HMS concepts, mainly at a high level of abstraction, can be done using agent technology, which is appropriate for implementing modularity, decentralization, re-use and complex structures [19].

Footnote 2: Janus was a Roman god with two faces: one looking forward and the other looking back. In this context, one side is looking “down” and acting as an autonomous system giving directions to “lower” components, and the other side is looking “up” and serving as a part of a “higher” holon.
[Fig. 2 labels: manufacturing cell, holon, resource]
Fig. 2. Holarchy and the Janus Effect of a Holon
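The recursive decomposition and the Janus effect described above can be sketched as a simple composite data structure. This is a minimal illustration only: the class `Holon` and its methods are hypothetical names chosen here, not an API of any HMS or agent platform.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)
class Holon:
    """A holon that may be a whole (has parts) and a part (has a parent)."""

    name: str
    parent: Holon | None = field(default=None, repr=False)
    parts: list[Holon] = field(default_factory=list)

    def add_part(self, part: Holon) -> Holon:
        """Attach `part` as a subordinate holon and return it."""
        part.parent = self
        self.parts.append(part)
        return part

    def is_whole(self) -> bool:
        # The face looking "down": the holon encapsulates other holons.
        return bool(self.parts)

    def is_part(self) -> bool:
        # The face looking "up": the holon belongs to a higher-level holon.
        return self.parent is not None

    def flatten(self) -> list[str]:
        """Names of this holon and all holons below it, depth first."""
        names = [self.name]
        for part in self.parts:
            names.extend(part.flatten())
        return names


if __name__ == "__main__":
    shop_floor = Holon("shop floor")
    cell = shop_floor.add_part(Holon("manufacturing cell"))
    cell.add_part(Holon("robot"))
    cell.add_part(Holon("conveyor"))
    print(cell.is_whole(), cell.is_part())  # the cell shows both faces
    print(shop_floor.flatten())
```

In this sketch the manufacturing cell is a whole with respect to its resources and a part with respect to the shop floor, mirroring the example of Fig. 2.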
In spite of its promising perspectives, the industrial adoption of these approaches has so far fallen short of expectations, and the implemented functionalities are normally restricted [20]. This weak adoption is due, amongst others, to questions related to the required investment, the acceptance of distributed thinking, technology maturity and engineering development tools, and to more technical issues such as interoperability and scalability [11; 20]. However, one important reason contributes significantly to this weak adoption: the traditional design of holonic systems misses the application of the self-adaptation and self-organization properties that make systems increasingly re-configurable, adaptive, organized and efficient. The challenge, to overcome this problem, is to go back to the roots of holonics provided by Koestler and to look for other sources of inspiration that enhance the holonic rationale in order to design more adaptive and evolvable systems, which can be easily deployed into real environments. The vision sustained in this work is that, as illustrated in Fig. 3, besides the infrastructure technologies that support the ubiquitous, modular and distributed features inherent to these applications, the engineering of emergent and evolvable complex systems will consider the combination of existing distributed collaborative paradigms, such as the holonic rationale, with bio-inspired theories, such as emergent behavior and self-organization, in connection with emerging fields of computer science, such as Artificial Life. Artificial Life [21] is a discipline that studies natural life in artificial environments, e.g. through simulations using computer models, in order to understand such complex systems.
Note that Artificial Life is neither equivalent to nor a part of the Artificial Intelligence field: the latter is mostly related to perception, cognition and the generation of actions, while the former focuses on evolution, reproduction, morphogenesis and metabolism processes [22]. The following sections discuss some concepts and mechanisms found in biology and nature, and illustrate how they can be combined with the holonic rationale to build complex systems that behave in a simple way, as occurs in nature.
[Fig. 3 panel labels: Infrastructure technologies (wireless sensor networks, RFID); Collaborative control paradigms (HMS, MAS, SoA); Biologically inspired techniques (self-organization, emergent behavior, swarm intelligence)]
Fig. 3. Engineering of Complex Distributed Adaptive Systems
3 Emergent Behavior in Complex Systems

Biology and nature offer plenty of powerful mechanisms, refined by millions of years of evolution, to handle emergent and evolvable environments. In nature, almost everything is distributed: complex systems are built upon entities that exhibit simple behaviors and have reduced cognitive abilities, and a small number of rules can generate systems of surprising complexity [23]. As an example, an ant or a bee exhibits very simple behavior, but their colonies exhibit smart and complex behavior. The emergence concept reflects this phenomenon and defines the way complex systems arise out of a multiplicity of interactions among entities exhibiting simple behavior. Emergent behavior considers a two-level structure with close interdependencies:
• Macro level, considering the system as a whole, i.e. the global patterns of organization that result from the lower-level interactions.
• Micro level, considering the system from the point of view of the local components and their interactions.
Emergent behavior occurs without the guidance of a central entity, and only when the resulting behavior of the whole is greater and much more complex than the sum of the behaviors of its parts [23]. In the manufacturing domain, a typical example of emergent behavior is a robot able to perform pick and place operations as a result of the aggregation of a robot that is able to move in space and grippers that are able to open and close. Broadly, in the characterization of emergent structures, patterns and properties, three aspects should be considered:
• More than the sum of effects, which means that the emergent properties are not just the predictable result of summing the properties of the individual parts.
• Supervenience, which means that the emergent properties are novel, additional or unexpected, and will no longer exist if the micro level is removed (i.e. the emergent properties are irreducible to properties of the micro level).
• Causality, which means that the macro level properties should have causal effects on the micro level ones, known as downward causation (i.e. emergent properties are not epiphenomenal).
During the operation of a system exhibiting emergent behavior, a large number of non-linear interactions occur among the individual entities, leading to a global behavior that is complex and difficult to predict, due to the large number of possible non-deterministic ways in which the system can behave. In spite of this unpredictability, when dealing with emergent behavior it is desirable to ensure that the expected properties will actually emerge, and that unexpected and undesired properties will not. Swarm intelligence, a concept found in colonies of insects, exhibits this emergent behavior, being defined as the emergent collective intelligence of groups of simple entities [24]. In fact, swarm intelligence is typically made up of a community of simple entities, following very simple rules, interacting locally with one another and with their environment. This bottom-up approach offers an alternative way of designing intelligent systems, in which the traditional centralized pre-programmed control is replaced by a distributed functioning where the interactions between such individuals lead to the emergence of "intelligent" global behavior, unknown to them [24]. Examples of swarm intelligence include ant colonies, bird flocking, fish shoaling and bacterial growth [13].
A more widespread example of the application of the swarm intelligence principles is Wikipedia: a huge number of people contribute to the encyclopedia with their individual knowledge; no single person knows everything, but collectively it is possible to know far more than any individual could be expected to know. In these environments, swarm intelligence is achieved more from the coordination of activities and less from the use of decision-making mechanisms. A typical example is the movement of a group of birds, where individuals coordinate their movements according to the movement of the others. For this purpose, simple mechanisms are used to coordinate the individual behavior in order to achieve the global one: the resulting structure is essentially a highly non-linear configuration (i.e. many-to-many interactions), where feedback processes (both positive and negative) interact. Positive and negative feedback assume crucial importance in these systems: with positive feedback, the system responds to a perturbation in the same direction as the perturbation, and with negative feedback, the system responds in the opposite direction. Translating these ideas to the manufacturing world, manufacturing systems can be seen as a community of autonomous and cooperative entities, the holons in the HMS paradigm, each one regulated by a small number of simple rules and representing a manufacturing component, such as a robot, a conveyor, a pallet or an order. The degree of complexity of the behavior of each entity is strongly dependent on its embodied intelligence and learning skills. Very complex and adaptive systems can emerge from the interaction between the individual entities, as illustrated in Fig. 4.
Fig. 4. Emergence in Complex Systems
The achieved emergent behavior results from the capability of the individual entities to change their properties dynamically and autonomously, coordinated towards a common goal. In fact, even if all individuals perform their tasks, the sum of their activities may be chaos (disorder) if they are not coordinated according to a common goal [25]. Also, the emergent behavior will not be smart if the members belonging to the micro level imitate one another or wait for someone to tell them what to do. Since no one is in charge (i.e. there is no central control), each member should do its own part, its role being important for the whole. In manufacturing, the coordination of these systems, for example for task allocation, is usually related to the regulation of the expectations of entities with conflicting interests: some entities (usually products or orders) have operations to be executed and others (usually resources) have the skills to execute them. Several algorithms can be used for this purpose, namely those based on the Contract Net Protocol (CNP) [26], those based on market laws [27] and those based on the attraction fields concept [28]. Systems exhibiting emergent behavior, as observed in nature, operate in a very flexible and robust way [24]:
• Flexible, since they allow adaptation to changing environments by adding, removing or modifying entities on the fly, i.e. without the need to stop, re-program and re-initialize the other components.
• Robust, since the society of entities is able to keep working even if some individuals fail to perform their tasks.
The achievement of emergent systems guarantees the fulfillment of the flexibility and robustness requirements. A step ahead in designing these complex adaptive systems is related to how the system can evolve to adapt quickly and efficiently to the volatility of the environment, addressing the responsiveness property.
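The task allocation between orders and resources mentioned above can be sketched as a single round of a contract-net style negotiation: an operation is announced, each resource answers with a bid or a refusal, and the best bid is awarded. The sketch below is a simplified illustration and not the protocol specification of [26]; the class `Resource`, the function `allocate` and the bid criterion (earliest completion time) are assumptions made here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Resource:
    """A resource holon that owns skills and a local schedule."""

    name: str
    skills: set[str]
    busy_until: float = 0.0

    def bid(self, operation: str, duration: float) -> float | None:
        """Propose a completion time, or refuse (None) if the skill is missing."""
        if operation not in self.skills:
            return None
        return self.busy_until + duration


def allocate(operation: str, duration: float,
             resources: list[Resource]) -> Resource | None:
    """One negotiation round: announce, collect bids, award the best one."""
    bids = []
    for resource in resources:                  # announcement to all resources
        proposal = resource.bid(operation, duration)
        if proposal is not None:                # refusals are simply dropped
            bids.append((proposal, resource))
    if not bids:
        return None                             # nobody can execute the task
    finish, winner = min(bids, key=lambda b: b[0])
    winner.busy_until = finish                  # award: winner commits locally
    return winner


if __name__ == "__main__":
    mills = [Resource("mill-1", {"milling"}, busy_until=5.0),
             Resource("mill-2", {"milling", "drilling"})]
    for _ in range(3):
        print(allocate("milling", 3.0, mills).name)
```

No entity in this sketch holds a global schedule: the allocation results only from local bids, and adding or removing a `Resource` from the list requires no change to the others, which corresponds to the flexibility and robustness properties listed above.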
4 Evolution and Self-organization in Complex Systems

Evolution is the process of change, namely development, formation or growth, over generations, leading to a more advanced or complex form. The evolution phenomenon is observed in several domains, such as biology, mathematics and economics. In biological systems there are two different approaches for the adaptation to the dynamic evolution of the environment [29]: evolutionary systems and self-organization. The next sections discuss the concepts of evolution and self-organization to answer the question of how complex systems can be adaptive and evolvable.

4.1 Evolutionary Theory

The evolutionary approach derives from the theory of evolution introduced by Charles Darwin 150 years ago in his book “On the Origin of Species” [30]. According to Darwin, nature is not immutable but, on the contrary, is in a state of permanent transformation, a continuous movement in which species change from generation to generation, evolving to suit their environment. The mechanism of evolution proposed by Darwin, natural selection, is based on the following points:
• Since populations tend to produce more descendants than will survive, the individuals of a given population struggle for survival (for food, space and other environmental factors).
• Individuals with more favorable characteristics (i.e. more suitable for the conditions in which they live) live longer and reproduce more and, as such, their characteristics are passed on to the next generation. On the contrary, individuals which do not have advantageous features are progressively eliminated. Only the fittest survive.
• Differentiated reproduction allows, through a slow accumulation of characteristics, the emergence of new species, with specific features being retained or eliminated depending on how well they suit the environment.
Basically, Darwin saw evolution as the result of selection by the environment acting on a population of organisms competing for resources. A statement commonly attributed to him holds that the species that survive evolution and changes in the environment are not the strongest or the most intelligent, but those that are most responsive to change. In this evolution process, the selection is natural in the sense that it is purely spontaneous, without a predefined plan. The punctuated equilibrium theory introduced by Stephen Jay Gould and Niles Eldredge [31] constitutes an advance on Darwin’s theory of evolution. In contrast to the Darwinian theory, where evolution is a slow, continuous process without sudden jumps, evolution in punctuated equilibrium tends to be characterized by long periods in which nothing changes, "punctuated" by episodes of very fast development of new forms. Later, Darwinian natural selection was combined with Mendelian inheritance (i.e. the set of principles relating to the transmission of hereditary characteristics from parents to their offspring) to form the modern evolutionary synthesis, connecting the units of evolution (genes) with the mechanism of evolution (natural selection).
Translating these theories to the manufacturing world, the companies better prepared to survive in the current competitive markets are those that respond better to emergent and volatile environments, by dynamically adapting their behavior [8]. Complex manufacturing systems should evolve continuously or punctually, driven by stimuli that force their re-organization and adaptation to environmental conditions. The distributed entities are subject to the application of evolutionary techniques, belonging to evolutionary computing, which gradually select a better system. Evolutionary computing is a class of computational techniques that uses the Darwinian principles of biological evolution and natural selection to solve complex problems, namely combinatorial optimization problems. Evolutionary algorithms (e.g. genetic algorithms, evolutionary strategies and genetic programming) and swarm intelligence (concretely the Ant Colony Optimization (ACO) and Particle Swarm Optimization (PSO) algorithms, designed using the swarm principles) are evolutionary techniques. In a similar way, neural networks are a typical example of computing techniques that use insights from the central nervous system of animals. Here the focus is more on learning than on natural selection.

4.2 Self-organization

In biological systems, self-organization plays an important role in achieving the system’s adaptation to the dynamic evolution of the environment. Self-organization is basically a process of evolution where the effect of the environment is minimal, i.e. where the development of novel, complex structures takes place primarily through the system itself (as the term “self” suggests), being normally triggered by internal variation processes, which are usually called "fluctuations" or "forces". Self-organization is not a new concept, being applied in different domains such as economics, sociology, computing and robotics.
Several distinct, but not necessarily contradictory, definitions are found in the literature (e.g. see [32] and the references therein). However, to better understand the self-organization phenomenon it is necessary to identify some important types of self-organization observed in nature [33]: stigmergy, decrease of entropy and autopoiesis.
[Fig. 5 labels: Action (deposit, reinforcement); Perception; pheromone; flow field gradient]
Fig. 5. Ant-based Interaction (adapted from [34])
Stigmergy, a phenomenon found in the behavior of social insects, is a form of self-organization involving indirect coordination between entities, where the trace left in the environment stimulates the execution of a subsequent action, by the same or a different entity. As an example, ants exchange information by depositing pheromones (i.e. chemical substances that spread odor) on their way back to the nest when they have found food, as illustrated in Fig. 5. The flow field gradient is characterized by the reduction of the intensity of the odor and the increase of the entropy (i.e. reduction of information). This form of self-organization produces complex, intelligent and efficient collaboration, without the need for planning, control or even direct communication between the entities. In the field of thermodynamics, systems that are constantly exchanging energy (i.e. receiving, transforming and dissipating it) exhibit a self-organizing behavior far from thermodynamic equilibrium, decreasing their entropy (i.e. disorder) when an external pressure, e.g. temperature, is applied, and thus reaching a new stable state [35]. In fact, the 2nd law of thermodynamics introduces the concept of entropy (from the Greek entrope, meaning change), stating that everything in the universe tends to go from states of order towards states of chaos (in other words, the quality of the energy is irreversibly degraded). This explains why pieces of ice tend towards room temperature in the absence of an outside energy source (e.g. a refrigerator). In the nature of living systems, autopoiesis, which literally means self-creation, represents the capability of a system to maintain itself through the self-generation of the system’s entities (i.e. transformation and destruction), for example through cell reproduction [36]. Autopoietic systems are closed systems characterized by a changing structure but an invariant organization.
Self-organization appears as a result of the interaction of the system’s entities, which self-(re)generate the system’s entities in order to survive in their given environment. Analyzing the above types of self-organization found in nature, in spite of the similarity of the concepts, some differences can be identified. As an example, comparing stigmergy with the decrease of entropy, it is possible to observe an interesting difference in the trigger mechanism: in stigmergy, the initiative occurs within the system (the pheromones deposited by the system members), while in the decrease of entropy, the initiative occurs as a response to external pressures (e.g. an increase or decrease of the temperature). Having in mind the previously described types of self-organization, a possible definition, used in this work, is to consider self-organization as the ability of an entity/system to dynamically adapt its behavior to external conditions by re-organizing its structure, through the modification of the relationships among entities, without external intervention and driven by local and global forces. The integration of self-organization capabilities may require the use of distributed architectures [37], such as those provided by the holonic design, which do not follow a rigid and predictable organization. In fact, autonomous systems, such as our brain, have to constantly optimize their behavior, involving the combination of non-linear and dynamic processes. The self-organizing behavior of each entity is based on an adaptive mechanism that dynamically interprets and responds to perturbations [37]. These characteristics imply the management and control of behavioral complexity as well. However, it is important to notice that the system re-organization may also occur in structures with centralized control.
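The stigmergic mechanism described above, in which deposited pheromone reinforces a trail (positive feedback) while evaporation weakens it (negative feedback), can be illustrated with a very small deterministic model. This is a sketch under simplifying assumptions made here, not a model taken from [34]: the colony is treated as a continuous flow of ants that splits over the paths in proportion to the current pheromone, and each path receives a deposit inversely proportional to its length. The function name and parameters are hypothetical.

```python
def simulate_trails(lengths, steps=200, evaporation=0.1, deposit=1.0):
    """Return the final share of traffic on each path between nest and food.

    lengths     -- length of each alternative path
    evaporation -- fraction of pheromone lost per step (negative feedback)
    deposit     -- pheromone laid per step by the whole colony (positive feedback)
    """
    pheromone = [1.0] * len(lengths)  # no initial preference
    for _ in range(steps):
        total = sum(pheromone)
        # Perception: ants choose a path in proportion to its pheromone level.
        shares = [level / total for level in pheromone]
        # Action: evaporation, plus a deposit that favours shorter paths
        # because they are completed more often per unit of time.
        pheromone = [
            (1.0 - evaporation) * level + deposit * share / length
            for level, share, length in zip(pheromone, shares, lengths)
        ]
    total = sum(pheromone)
    return [level / total for level in pheromone]


if __name__ == "__main__":
    short_share, long_share = simulate_trails([1.0, 2.0])
    print(f"short path: {short_share:.3f}, long path: {long_share:.3f}")
```

In this model the traffic concentrates on the shorter path although no entity compares the two paths or communicates directly with another one; the initiative comes entirely from within the system, in line with the trigger mechanism attributed to stigmergy above.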
Evolution and self-organization can be confusing concepts. Some authors, such as Stuart Kauffman, state that natural selection must be complemented by self-organization in order to explain evolution [38]. In Darwin’s theory of evolution, evolution is the result of selection by the environment, acting on a population of organisms competing for resources, i.e. encompassing external forces, while in self-organization, evolution is purely due to the internal configuration, without any external forces or pressures. Translating the self-organization concept found in nature to the manufacturing domain, the network of entities that represents the control system is established by the entities themselves. Ideally, re-configuration should be done on the fly, leaving unchanged the behavior of the entire system, which should continue to run smoothly after the change, as illustrated in Fig. 6.
Fig. 6. Evolution in Manufacturing Complex Adaptive Systems
In the manufacturing domain, the need for re-configuration and evolution can appear in several situations. In particular, self-organization can contribute to the design of adaptive manufacturing systems in the following main areas [28]:
• Shop floor layout, where the manufacturing resources present in the shop floor are movable, i.e. the producer and transporter resources move physically in order to minimize the transportation distances.
• Adaptive control, where the goal is to find an adaptive and dynamic production control strategy based on dynamic, on-line scheduling, adapted when unexpected disturbances occur.
• Product demand, where the manufacturing system re-organizes itself in order to adapt to changes in the product demand, increasing or reducing the number of manufacturing resources, or modifying their capabilities.
Other manufacturing-related domains can also be mentioned, such as supply chain optimization, virtual organizations management and logistics management, which require the frequent re-organization of partners in order to achieve optimization and responsiveness to unexpected situations. Driving forces guide the re-organization process according to the environment conditions and to the control properties of the distributed entities. Several self-organization mechanisms can be found in nature [39]: foraging, nest building, molding
and aggregation, morphogenesis, web weaving, brood sorting, flocking and quorum. Each of these mechanisms presents different driving forces to support the evolution. For example, in foraging the driving force is the deposit of pheromones, and in flocking and schooling the self-organization is driven by collision avoidance, speed matching and flock centering. Embodied intelligence, a concept used in the Artificial Life field, may play another important role in the design of these systems. Embodied intelligence suggests that intelligence requires a body to interact with [40], with intelligent behavior emerging from the interaction of brain, body and environment. The key issue is to define powerful intelligence mechanisms, including not only static intelligence mechanisms but also learning capabilities, that enable the system to behave better in the future as a result of its experience and knowledge. Learning is not just a prerogative of humans (whose brain comprises tens of billions of neurons) or higher animals; it also occurs in worms (with a simple 302-cell nervous system) and even in single-celled bacteria. The degree of efficiency of the self-organization capability, and consequently the improvement of the entity’s performance when evolving dynamically in case of emergency, is strongly dependent on how the learning mechanisms are implemented. Learning capabilities play an important role in the evolution process, e.g. in the identification of re-configuration opportunities and in the way the system evolves.

4.3 Equilibrium and Stability in the Evolution Process

In dynamic and complex systems, in which emergence and evolution play key roles, besides considering mechanisms to identify the re-configuration and evolution opportunities, it is important to discuss equilibrium, stability and predictability during the evolution process.
Evolution and equilibrium are apparently contradictory concepts: evolution is related to how things change over time, and equilibrium is related to how things attain a steady and balanced state of being. Systems exhibiting evolution features imply non-equilibrium dynamics and, in certain situations, specific driving forces will push the system far away from equilibrium. However, equilibrium is different from stability. The concept of stability concerns the condition in which a slight disturbance or modification in the system does not produce a significant disrupting effect on that system. This is especially important in chaotic and non-linear systems, where a small perturbation may cause a large effect (see the butterfly effect, footnote 3), consequently resulting in the system becoming unstable. In evolution processes, like those resulting from emergent behavior, positive and negative feedbacks are crucial to fuel them: the former increase the number of configurations and the latter stabilize these configurations. The interaction between them may create intricate and unpredictable patterns (chaos), which can develop very quickly until a stable configuration (known as an attractor) is reached, according to a goal or objective. On the other hand, during the evolution process some instability and unpredictability can appear as the result of evolution processes that are not properly synchronized.

Footnote 3: Introduced by Edward Lorenz to illustrate the notion of sensitive dependence on initial conditions in chaos theory: a butterfly flapping its wings in one part of the world (e.g. in Chicago) can contribute to the evolution of a tornado in another part of the world (e.g. in Tokyo).
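The difference between a stable regime, where a slight disturbance dies out, and a chaotic one, where it is amplified, can be illustrated numerically. The logistic map used below is a standard textbook example of a non-linear system and is introduced here only for illustration; it is not a model of a manufacturing system, and the function names are hypothetical.

```python
def logistic_trajectory(x0, r, steps):
    """Iterate the logistic map x -> r * x * (1 - x) starting from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def max_separation(x0, delta, r, steps=60):
    """Largest distance between two trajectories that start `delta` apart."""
    a = logistic_trajectory(x0, r, steps)
    b = logistic_trajectory(x0 + delta, r, steps)
    return max(abs(u - v) for u, v in zip(a, b))


if __name__ == "__main__":
    tiny = 1e-9  # the "butterfly": a very small initial perturbation
    print("stable regime  (r = 2.5):", max_separation(0.2, tiny, r=2.5))
    print("chaotic regime (r = 4.0):", max_separation(0.2, tiny, r=4.0))
```

For `r = 2.5` both trajectories settle on the same attractor and the perturbation stays negligible, whereas for `r = 4.0` the same perturbation grows until the two trajectories are unrelated, which is the sensitive dependence on initial conditions referred to in the footnote.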
The phenomenon of emergence is associated to the tendency of systems to create order from chaos (concept known as extropy in opposite to entropy). In such dissipative systems, the system self-organize into an ordered state since this actually increases the rate of entropy production (as greater is the energy that flows in such systems as greater is the order generated). In fact, a self-organizing system which decreases its entropy must necessarily, in analogy to the 2nd law of thermodynamics, dissipate such entropy to its surroundings. Note that the order can also be regarded as the quantity of information available. Regulation mechanisms are crucial to support the emergence from chaos to order during the evolution process, achieving stability and avoiding the increase of entropy and consequently the chaotic or instable states. Although being chaotic and unpredictable, evolution moves preferentially in the direction of increasing a fitness objective, which depends of the system context and strategy: e.g. the objective can be to reduce the thermo-dynamical energy or to increase the system’s productivity. The evolution process, i.e. the achieved organization, must be evaluated, according to a specific criterion, if the achieved organization solution is better than the previous one [41]. In the complex systems described in the paper, each individual has partial knowledge, i.e. none of them has a global view of the system, introducing uncertainty in the system. Being the entropy a measure of uncertainty of information or ignorance, these systems are normally associated to disorder (chaos). Using mechanisms that combine efficiently the individual knowledge hosted by distributed entities, it is possible to reduce the entropy of the system, becoming the system complex, organized and ordered. 
Trust-based systems and reputation systems, which take their inspiration from human behavior, can be suitable for handling information uncertainty and can be associated with emergence and re-organization processes.
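The relation between partial knowledge and entropy stated above can be made concrete with Shannon entropy. The following is a minimal sketch under stated assumptions, not a mechanism taken from the paper: the four candidate system states, the two entities' beliefs and the product-and-renormalize fusion rule are all hypothetical and chosen only for illustration.

```python
import math


def shannon_entropy(dist):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return sum(p * math.log2(1.0 / p) for p in dist if p > 0)


def combine(belief_a, belief_b):
    """Fuse two independent beliefs: element-wise product, then renormalize.

    Assumed fusion rule, for illustration only.
    """
    product = [a * b for a, b in zip(belief_a, belief_b)]
    total = sum(product)
    return [p / total for p in product]


# Hypothetical example with four possible system states (0..3).
# Entity A has ruled out states 2 and 3; entity B has ruled out states 1 and 3.
belief_a = [0.5, 0.5, 0.0, 0.0]
belief_b = [0.5, 0.0, 0.5, 0.0]
combined = combine(belief_a, belief_b)

print(shannon_entropy(belief_a))  # 1.0 bit of uncertainty for A alone
print(shannon_entropy(combined))  # 0.0 bits: only state 0 remains possible
```

Each entity alone is left with one bit of uncertainty, whereas the combined belief has none, which is the sense in which combining distributed knowledge lowers the entropy of the system.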
5 Combining Emergence and Self-organization Concepts
Emergence and evolution, especially self-organization, are two different concepts that are often incorrectly treated as synonyms in the literature. In spite of their similarities, the difference between self-organization and emergence should be stated: both lead to systems that evolve over time and cannot be directly controlled from the exterior, but while emergent systems consist of a set of individuals that collaborate to exhibit a higher-level behavior, self-organized systems exhibit a goal-directed behavior. Additionally, both exhibit robustness properties, but in different ways [42]:
• In the emergence concept, robustness is related to the flexibility of the local components that contribute to the emergent properties (i.e. the failure of one component will not result in the complete failure of the emergent property).
• In the self-organization concept, robustness is related to the capability to adapt dynamically to change.
The emergent behavior and the evolution capability can be exhibited independently or combined. This emergence/self-organization relationship can be expressed in the bi-dimensional approach to complex evolvable systems illustrated in Fig. 7.
P. Leitao
Fig. 7. Bi-Dimensional Approach to Complex Evolvable Systems [figure not reproduced; recoverable axis label: evolution]
Traditional centralized and rigid control systems exhibit neither self-organization nor emergent behavior, and consequently they are not able to re-organize to adapt to environmental changes. These systems are not sufficient to respond to the current demands for flexibility, responsiveness and re-configurability. Self-organization can appear without emergence, in the so-called evolvable systems of Fig. 7, essentially when the system works under central or strictly hierarchical control. In fact, new structure patterns can be identified and adopted under central control to respond to changes. On the other hand, it is also possible to build systems exhibiting the emergent phenomenon without self-organization, the so-called emergent systems in Fig. 7. In this case, the emergent behavior results from the interaction between distributed entities, but the whole system is unable to self-organize to face changes. The most interesting systems are those exhibiting self-organization and emergent behavior simultaneously. Here, the system works under decentralized control that emerges from the interactions among individual entities, which are autonomous, active and responsive to change, leading to the dynamic self-organization of the system. The application of self-organization associated with emergent behavior allows achieving [32]:
• Dynamic self-configuration, i.e. the adaptation to changing conditions by changing its own configuration, permitting the addition/removal of resources on the fly and without service disruption.
• Self-optimization, i.e. tuning itself in a pro-active way to respond to environmental stimuli.
• Self-healing, i.e. the capacity to diagnose deviations from normal conditions and to take proactive actions to normalize them and avoid service disruptions.
The holonic and multi-agent applications, according to the bi-dimensional approach of Fig. 7, normally address the first dimension (i.e. the emergent behavior) but rarely consider the second one (i.e. the evolution and self-organization). For example, self-organization in multi-agent systems normally occurs according to swarm intelligence principles, using very simple agents and interaction rules.
In these emergent and evolvable environments, as self-organizing holonic systems are, a pertinent question is how emergent behavior and self-organization can be combined. According to Wolf and Holvoet, two different perspectives can be considered [42]:
• Self-organization as the cause, with the emergent behavior resulting from the self-organization of the interactions among components; in this case self-organization is situated at the micro-level of the emergent process (i.e. self-organization leads to emergence).
• Self-organization as the effect, being achieved as a consequence of the emergent behavior (i.e. it is an emergent property); in this case self-organizing behavior occurs at the macro-level (i.e. emergence leads to self-organization).
However, an additional possibility is to have self-organization decoupled from the emergent behavior. In this case, self-organization is placed on top of the emergent behavior, like a cherry on top of the cake, arising not only from the self-organization of the interactions among components but also from the self-organization exhibited by the behaviors of local components and the mechanisms that drive these local self-organization capabilities. In some situations, the emergence of complex adaptive systems requires the reproduction of their members, aiming to evolve over time to increase their fitness, recalling the theory of evolution and introducing the theory of autocatalytic sets. An autocatalytic set is a group of elements that catalyzes the creation of its own elements, which in biology is referred to as the generation of offspring. The autocatalytic set is a system characterized by positive feedback, i.e. the presence of its members increases the rate at which new set elements are created, which in turn increases this rate even further.
An undesirable consequence of positive feedback is that the system becomes locked into the solution that it selects first, and a huge effort is required to switch at a later instant. Here some similarities can be found between Darwin’s theory and Simon’s observations: life forms are not created from scratch, but instead they create small sets of structures that catalyze themselves and are sufficiently stable to survive until the next energy input (i.e. the autocatalytic sets). The autocatalysis process requires the existence of the autonomy property in the members, and its improvement requires the existence of cooperation. Recalling the holonic rationale, it is possible to verify that the notion of holon already considers autonomy and cooperation as important properties of complex adaptive systems. In the holonic rationale, due to the concepts of holons and holarchies, self-organization occurs as an emergent process where order appears from disorder due to simple relations that statistically evolve through complex relations, progressively organizing themselves [33]. More powerful self-organizing holonic systems can be devised using more intelligent agents and more complex interaction patterns between local components. The current challenge facing the research community is to investigate the combination of self-organization mechanisms with emergent behavior that enhances the holonic rationale, aiming to achieve emergent and evolvable complex systems.
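The lock-in effect of positive feedback mentioned in this section can be illustrated with a toy model. The sketch below is an assumption made for illustration and is not taken from the paper: two competing solutions hold shares of the population, and at each step every share is amplified in proportion to a power of itself (the hypothetical `gain` parameter) before renormalization, so that the presence of members speeds up the creation of further members.

```python
def reinforce(shares, steps, gain=2.0):
    """Apply superlinear positive feedback to competing shares.

    Each step raises every share to the power `gain` and renormalizes,
    so a larger share grows faster than a smaller one.
    """
    for _ in range(steps):
        weights = [s ** gain for s in shares]
        total = sum(weights)
        shares = [w / total for w in weights]
    return shares


# A perfectly balanced start stays balanced ...
print(reinforce([0.5, 0.5], steps=20))

# ... but a small initial advantage is amplified until one solution dominates.
locked = reinforce([0.51, 0.49], steps=20)
print(locked)
```

In this model the ratio between the two shares is raised to the power `gain` at every step, so the solution that is slightly ahead at the start ends up with practically the whole population. Reversing the outcome at a later instant would require overcoming that accumulated advantage.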
6 The ADACOR Example
In the manufacturing domain, a few examples can be cited as attempts to introduce biologically inspired insights. Namely, Valckenaers et al. combined stigmergy concepts with the PROSA architecture to achieve emergent forecasting in manufacturing coordination and control [43], Parunak and Brueckner use stigmergic learning to achieve self-organization in mobile ad-hoc networks [34], and Ulieru et al. use emergence concepts to cover both vertical and horizontal integration in distributed organizations to enable the dynamic creation, refinement and optimization [44]. This section describes the use of concepts inherited from biology, namely swarm intelligence and self-organization, in the ADACOR architecture, to achieve an adaptive production control approach that addresses system re-configurability and evolution, especially when operating in emergent environments. The ADACOR architecture is built upon a community of autonomous and cooperative holons representing manufacturing entities, e.g. robots, pallets and orders. In analogy with insect colonies, where an individual usually does not perform all tasks but rather specializes in a set of tasks [24] (a concept known in biology as division of labour), the ADACOR architecture identifies four manufacturing holon classes, each one possessing its own roles, objectives and behaviors [8]: product (PH), task (TH), operational (OH) and supervisor (SH). The product holons represent the products available in the factory catalogue, the task holons represent the production orders launched to the shop floor to produce the requested products, and the operational holons represent the physical resources available on the shop floor. Supervisor holons provide co-ordination and optimization services to the group of holons under their supervision, thus introducing hierarchy in a decentralized system.
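The four holon classes and the grouping role of the supervisor holon can be pictured as a small data structure. The following is only a sketch: the class names `Holon` and `Supervisor`, the `supervise` method and the holon identifiers are hypothetical and do not reproduce the ADACOR implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class HolonClass(Enum):
    """The four ADACOR holon classes named in the text."""
    PRODUCT = "PH"
    TASK = "TH"
    OPERATIONAL = "OH"
    SUPERVISOR = "SH"


@dataclass
class Holon:
    name: str          # hypothetical identifier, for illustration only
    kind: HolonClass


@dataclass
class Supervisor(Holon):
    # A supervisor coordinates a group of holons, which is what adds
    # a layer of hierarchy to the otherwise decentralized community.
    kind: HolonClass = HolonClass.SUPERVISOR
    group: List[Holon] = field(default_factory=list)

    def supervise(self, holon: Holon) -> None:
        """Place a holon under this supervisor's coordination."""
        self.group.append(holon)


cell = Supervisor(name="cell-1")
cell.supervise(Holon("robot-A", HolonClass.OPERATIONAL))
cell.supervise(Holon("order-17", HolonClass.TASK))
print([h.kind.value for h in cell.group])  # ['OH', 'TH']
```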
The modularity provided by ADACOR is similar to that exhibited by the Lego™ concept: grouping elementary and inter-connectable entities in a particular way allows building bigger and more complex systems. Emergent behavior arises from the interactions between ADACOR holons exhibiting intelligent behavior. The system’s re-configurability or evolution is achieved by the dynamic re-aggregation of the elementary components or systems. Since ADACOR holons are pluggable (i.e. there is no need to re-initialize and re-program the system when a holon is added), the architecture offers enormous flexibility and re-configurability to support emergent behavior on the fly.
6.1 Driving Forces for Self-organization
The system self-organization is only achieved if the distributed entities have stimuli to drive their local self-organization capabilities. In ADACOR, the local driving forces to achieve self-organization are the autonomy factor and the learning capability, which are inherent characteristics of each ADACOR holon. The autonomy factor, α, is a parameter that fixes the level of autonomy of each ADACOR holon [8], and evolves dynamically in order to adapt the holon behavior to the changes in the environment where it is placed. The autonomy factor is regulated by a function, α = f(α, τ, ρ), where:
• τ is the reestablishment time, which is the estimated time to recover from the disturbance.
• ρ is the pheromone parameter, which is an indication of the level of impact of the disturbance.
The evolution into a new organization, triggered by the rules described above, is governed by a decision mechanism in which learning mechanisms play a crucial role in detecting evolution opportunities and ways to evolve. The power of the self-organization mechanism depends closely on how the learning mechanisms are implemented and on how new knowledge influences the decision parameters. These two local driving forces (i.e. autonomy and learning) allow the dynamic self-adaptation of the holon, contributing to the re-configuration of the system as a whole. However, the global self-organization of the system is only achieved if global forces drive the local self-organization capabilities. The global driving force used in ADACOR to support the system’s self-organization is a pheromone-like spreading mechanism, recalling the stigmergy concept. The holons cooperating through this type of mechanism propagate the need for re-organization by spreading the information to the other holons, as ants deposit pheromones in the environment. The quantity of pheromone deposited in the neighbor supervisor holon is proportional to the forecasted impact of the disturbance. The holons associated with each supervisor holon sense the information dissipated by the other holons (as ants sense pheromone odors) and, accordingly, trigger a self-adaptation of their behavior (e.g. increasing their autonomy) and propagate the pheromone to other neighbor holons. The intensity of the pheromone odor decreases with the distance from the epicenter of the evolution trigger (similar to distance in the original pheromone techniques), according to a defined flow field gradient.
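The spreading mechanism described above can be sketched as a breadth-first propagation over the holon neighborhood graph. This is a minimal illustration under stated assumptions: the text does not specify the flow field gradient, so a geometric decay per hop (the hypothetical `decay` parameter) is assumed here, and the topology, the proportionality constant `k` and the sensing threshold are likewise invented for the example.

```python
from collections import deque


def spread_pheromone(neighbors, epicenter, impact, decay=0.5, k=1.0):
    """Propagate a pheromone from the epicenter to all reachable holons.

    The deposit at the epicenter is proportional to the forecasted impact
    (k * impact); each further hop multiplies the intensity by `decay`
    (an assumed gradient, not the one used in ADACOR).
    """
    intensity = {epicenter: k * impact}
    queue = deque([epicenter])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in intensity:  # first visit is the shortest path
                intensity[nxt] = intensity[node] * decay
                queue.append(nxt)
    return intensity


# Hypothetical topology: a chain of four supervisor holons.
neighbors = {
    "SH1": ["SH2"],
    "SH2": ["SH1", "SH3"],
    "SH3": ["SH2", "SH4"],
    "SH4": ["SH3"],
}
rho = spread_pheromone(neighbors, epicenter="SH1", impact=8.0)
print(rho)  # {'SH1': 8.0, 'SH2': 4.0, 'SH3': 2.0, 'SH4': 1.0}

# Holons sensing at least the (assumed) threshold adapt their behavior.
adapting = [name for name, level in rho.items() if level >= 2.0]
print(adapting)  # ['SH1', 'SH2', 'SH3']
```

Because every holon only forwards the pheromone to its direct neighbors, no global broadcast is needed, which matches the claim that this style of propagation keeps the communication overhead low.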
The use of pheromone-like techniques for the propagation of information is suitable for the dynamic and continuous adaptation of the system to disturbances, supporting the global self-organization and reducing the communication overhead [8]. A simple implementation of the decision function associated with the autonomy factor can use a fuzzy rule-based engine that considers a binary discrete variable for the autonomy factor, comprising the states {Low, High}, and a discrete variable for the pheromone parameter, comprising the states {Very Low, Low, Medium, High, Very High}. In this case, the evolution of the autonomy factor is determined by the following set of simplified rules [32]:
IF (ρ >= HIGH AND α == LOW) THEN α := HIGH AND evolveIntoNewStructure
IF (ρ >= HIGH AND α == HIGH AND τ == ELAPSED) THEN α := HIGH AND τ := value
IF (ρ