Lecture Notes in Business Information Processing 8

Series Editors
Wil van der Aalst, Eindhoven Technical University, The Netherlands
John Mylopoulos, University of Trento, Italy
Norman M. Sadeh, Carnegie Mellon University, Pittsburgh, PA, USA
Michael J. Shaw, University of Illinois, Urbana-Champaign, IL, USA
Clemens Szyperski, Microsoft Research, Redmond, WA, USA
Joaquim Filipe José Cordeiro (Eds.)
Web Information Systems and Technologies Third International Conference, WEBIST 2007 Barcelona, Spain, March 3-6, 2007 Revised Selected Papers
Volume Editors

Joaquim Filipe
José Cordeiro
Polytechnic Institute of Setúbal – INSTICC
2910-761 Setúbal, Portugal
E-mail: {j.filipe,jcordeiro}@est.ips.pt
Library of Congress Control Number: 2008930563
ACM Computing Classification (1998): H.3.5, H.4, J.1, K.4.4
ISSN 1865-1348
ISBN-10 3-540-68257-0 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-68257-8 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2008 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12272464 06/3180 543210
Preface
This book contains the best papers from the International Conference on Web Information Systems and Technologies (WEBIST 2007), organized by the Institute for Systems and Technologies of Information, Control and Communication (INSTICC), endorsed by IW3C2, and held in Barcelona, Spain.

The purpose of WEBIST is to bring together researchers, engineers, and practitioners interested in the technological advances and business applications of web-based information systems. It has four main topic areas, covering different aspects of web information systems, namely, internet technology; web interfaces and applications; society, e-business and e-government; and e-learning.

WEBIST 2007 received 367 submissions from more than 50 countries across all continents. After a double-blind review process, with the help of more than 200 experts from the international program committee, and also after presentation at the conference, 23 papers were finally selected. Their extended and revised versions are published in this book. This strict selection made the conference appealing to a global audience of engineers, scientists, business practitioners, and policy experts. The papers accepted and presented at the conference demonstrated a number of new and innovative solutions for e-business and web information systems in general, showing that the technical problems in this field are challenging and worth further R&D effort. The program of this conference also included three outstanding keynote lectures presented by internationally renowned distinguished researchers. Their keynote speeches reinforced the overall quality of the event.

The organization of these conferences required the dedicated effort of many people. Firstly, we must thank the authors, whose research and development efforts are recorded here. Secondly, we thank the members of the program committee and the additional reviewers for their diligence and expert reviewing. Thirdly, we congratulate the INSTICC organizing team for their excellent work. Last but not least, we thank the invited speakers for their invaluable contribution and for taking the time to synthesize and prepare their talks.

We hope that you will find this collection of papers interesting and a helpful source of inspiration and that you will look forward to the book containing the best papers of next year’s conference.

March 2008
Joaquim Filipe José Cordeiro
Organization
Conference Co-chairs Joaquim Filipe, Polytechnic Institute of Setúbal/INSTICC, Portugal Julià Minguillón, Open University of Catalonia, Spain
Program Co-chairs José Cordeiro, Polytechnic Institute of Setúbal/INSTICC, Portugal Slimane Hammoudi, E.S.E.O., France
Organizing Committee Paulo Brito, INSTICC, Portugal Marina Carvalho, INSTICC, Portugal Helder Coelhas, INSTICC, Portugal Bruno Encarnação, INSTICC, Portugal Vítor Pedrosa, INSTICC, Portugal Mónica Saramago, INSTICC, Portugal
Program Committee Jacky Akoka, France Abdullah Alghamdi, Saudi Arabia A.Y. Al-Zoubi, Jordan Rachid Anane, UK Margherita Antona, Greece Azita Bahrami, USA Matteo Baldoni, Italy Ken Barker, Canada Cristina Baroglio, Italy Sean Bechhofer, UK Andreas Becks, Germany David Bell, UK Orlando Belo, Portugal Bharat Bhargava, USA Paolo Bouquet, Italy Christos Bouras, Greece Amy Bruckman, USA Jack Brzezinski, USA Maria Claudia Buzzi, Italy
Elena Calude, New Zealand Diego Calvanese, Italy John Carroll, USA Nunzio Casalino, Italy Maiga Chang, Taiwan Weiqin Chen, Norway Adrian David Cheok, Singapore Evangelos Christou, Greece Jae Chung, USA Christophe Claramunt, France Isabelle Comyn-Wattiau, France Michel Crampes, France Alexandra Cristea, UK Daniel Cunliffe, UK John Philip Cuthell, UK Alfredo Cuzzocrea, Italy Georgios Dafoulas, UK Vasilios Darlagiannis, Switzerland Alessandro D'Atri, Italy
Valeria de Antonellis, Italy Sergio de Cesare, UK Steven Demurjian, USA Darina Dicheva, USA Y. Ding, Austria Asuman Dogac, Turkey Josep Domingo-Ferrer, Spain Chyi-Ren Dow, Taiwan Schahram Dustdar, Austria Ali El Kateeb, USA A. El Saddik, Canada Atilla Elci, Turkey Luis Ferreira Pires, The Netherlands Filomena Ferrucci, Italy Stefan Fischer, Germany Akira Fukuda, Japan Giovanni Fulantelli, Italy Erich Gams, Austria Daniel Garcia, Spain Franca Garzotto, Italy Dragan Gasevic, Canada José A. Gil Salinas, Spain Karl Goeschka, Austria Nicolás González-Deleito, Belgium Anna Grabowska, Poland Begoña Gros, Spain Vic Grout, UK Francesco Guerra, Italy Aaron Gulliver, Canada Pål Halvorsen, Norway Ioannis Hatzilygeroudis, Greece Stylianos Hatzipanagos, UK Dominik Heckmann, Germany Nicola Henze, Germany Dominic Heutelbeck, Germany David Hicks, Denmark Pascal Hitzler, Germany Kathleen Hornsby, USA Wen Shyong Hsieh, Taiwan Brian Hudson, Sweden Tanko Ishaya, UK Alexander Ivannikov, Russia Kai Jakobs, Germany Anne James, UK Narayana Jayaram, UK Ivan Jelinek, Czech Republic Jingwen Jin, USA
Qun Jin, Japan Carlos Juiz, Spain Michail Kalogiannakis, France Jussi Kangasharju, Germany Vassilis Kapsalis, Greece George Karabatis, USA Frank Kargl, Germany Roland Kaschek, New Zealand Sokratis Katsikas, Greece David Kennedy, China Ralf Klamma, Germany Mamadou Tadiou Kone, Canada Tsvi Kuflik, Israel Axel Küpper, Germany Daniel Lemire, Canada Tayeb Lemlouma, France Weigang Li, Brazil Leszek Lilien, USA Claudia Linnhoff-Popien, Germany Hui Liu, USA Pascal Lorenz, France Heiko Ludwig, USA Vicente Luque-Centeno, Spain Anna Maddalena, Italy George Magoulas, UK Johannes Mayer, Germany Carmel McNaught, China Ingo Melzer, Germany Elisabeth Metais, France Alessandro Micarelli, Italy Debajyoti Mukhopadhyay, India Kia Ng, UK David Nichols, New Zealand Andreas Ninck, Switzerland Dusica Novakovic, UK Vladimir Oleshchuk, Norway Andrea Omicini, Italy Kok-Leong Ong, Australia Jose Onieva, Spain Vasile Palade, UK Jun Pang, Germany Kalpdrum Passi, Canada Viviana Patti, Italy Chang-Shyh Peng, USA Guenther Pernul, Germany Rajesh Pillania, India Carsten Pils, Ireland
Volkmar Pipek, Germany Pierluigi Plebani, Italy Bhanu Prasad, USA Joshua Pun, Hong Kong Mimi Recker, USA Frank Reichert, Norway Werner Retschitzegger, Austria Norman Revell, UK Yacine Rezgui, UK Thomas Risse, Germany Marco Roccetti, Italy Dumitru Roman, Austria Danguole Rutkauskiene, Lithuania Wasim Sadiq, Australia Maytham Safar, Kuwait Eduardo Sanchez, Spain Abdolhossein Sarrafzadeh, New Zealand Anthony Savidis, Greece Vittorio Scarano, Italy Alexander Schatten, Austria Alexander Schill, Germany Heiko Schuldt, Switzerland Magy Seif El-Nasr, USA Jochen Seitz, Germany Tony Shan, USA Jamshid Shanbehzadeh, Iran Quan Z. Sheng, Australia Charles A. Shoniregun, UK Keng Siau, USA Miguel-Angel Sicilia, Spain Marianna Sigala, Greece Eva Söderström, Sweden
Miguel Soriano, Spain J. Michael Spector, USA Martin Sperka, Slovakia Frank Stowell, UK Carlo Strapparava, Italy Eleni Stroulia, Canada Hussein Suleman, South Africa Aixin Sun, Singapore Junichi Suzuki, USA Ramayah T., Malaysia Taro Tezuka, Japan Dirk Thissen, Germany Ivan Tomek, Canada Bettina Törpel, Denmark Arun-Kumar Tripathi, Germany Andre Trudel, Canada Thrasyvoulos Tsiatsos, Greece Klaus Turowski, Germany Michail Vaitis, Greece Christelle Vangenot, Switzerland Athanasios Vasilakos, Greece Jari Veijalainen, Finland Juan D. Velasquez, Chile Maria Esther Vidal, Venezuela Maurizio Vincini, Italy Viacheslav Wolfengagen, Russia Martin Wolpers, Belgium Huahui Wu, USA Lu Yan, UK Sung-Ming Yen, Taiwan Janette Young, UK Weihua Zhuang, Canada
Auxiliary Reviewers Isaac Agudo, Spain Gunes Aluc, Turkey Simona Barresi, UK Solomon Berhe, USA Simone Braun, Germany Tobias Buerger, Austria Patrícia Dockhorn Costa, The Netherlands Volker Derballa, Germany Wolfgang Dobmeier, Germany Thanh Tran Duc, Germany Stefan Dürbeck, Germany
Seth Freeman, USA Vittorio Fuccella, Italy Carmine Gravino, Italy Fu-Hau Hsu, Taiwan Larissa Ismailova, Russia Yildiray Kabak, Turkey Jan Kolter, Germany Jacek Kopecky, Austria Sergey Kosikov, Russia Damien Magoni, France Zoubir Mammeri, France
Jaime Pavlich Mariscal, USA Sergio di Martino, Italy Jing Mei, China Michele Melchiori, Italy Alexandre Miège, Canada Alper Okcan, Turkey Wladimir Palant, Norway Andreas Petlund, Norway Vassilis Poulopoulos, Greece Sylvie Ranwez, France
Cyril Ray, France Christos Riziotis, Greece Claudio Schifanella, Italy Christian Schläger, Germany Paolo Spagnoletti, Italy Agusti Solanas, Spain Heiko Stoermer, Italy Fulya Tuncer, Turkey Knut-Helge Vik, Norway Christian Werner, Germany
Invited Speakers Azzelarabe Taleb-Bendiab, Liverpool John Moores University, UK Schahram Dustdar, Information Systems Institute, Vienna University of Technology, Austria Ernesto Damiani, University of Milan, Italy
Table of Contents
Invited Papers

Programming Support and Governance for Process-Oriented Software Autonomy (page 3)
A. Taleb-Bendiab, P. Miseldine, M. Randles, and Thar Baker

Representing and Validating Digital Business Processes (page 19)
Ernesto Damiani, Paolo Ceravolo, Cristiano Fugazza, and Karl Reed

Part I: Internet Technology

Semi-automated Content Zoning of Spam Emails (page 35)
Claudine Brucks, Michael Hilker, Christoph Schommer, Cynthia Wagner, and Ralph Weires

Security and Business Risks from Early Design of Web-Based Systems (page 45)
Rattikorn Hewett

Designing Decentralized Service Compositions: Challenges and Solutions (page 60)
Ustun Yildiz and Claude Godart

Grid Infrastructure Architecture: A Modular Approach from CoreGRID (page 72)
Augusto Ciuffoletti, Antonio Congiusta, Gracjan Jankowski, Michal Jankowski, Ondrej Krajiček, and Norbert Meyer

The Privacy Advocate: Assertion of Privacy by Personalised Contracts (page 85)
Michael Maaser, Steffen Ortmann, and Peter Langendörfer

Efficient Queries on XML Data through Partitioning (page 98)
Olli Luoma

Contract Based Behavior Model for Services Coordination (page 109)
Alberto Portilla, Genoveva Vargas-Solar, Christine Collet, José-Luis Zechinelli-Martini, and Luciano García-Bañuelos

Transparent Admission Control and Scheduling of e-Commerce Web Services (page 124)
Dmytro Dyachuk and Ralph Deters

Part II: Web Interfaces and Applications

Life Cases: A Kernel Element for Web Information Systems Engineering (page 139)
Klaus-Dieter Schewe and Bernhard Thalheim

Finding Virtual Neighbors in 3D Blog with 3D Viewpoint Similarity (page 157)
Rieko Kadobayashi

Model Driven Formal Development of Digital Libraries (page 169)
Esther Guerra, Juan de Lara, and Alessio Malizia

Predicting the Influence of Emerging Information and Communication Technologies on Home Life (page 184)
Michele Cornacchia, Vittorio Baroncini, and Stefano Livi

RDF Collections (page 201)
Saket Kaushik and Duminda Wijesekera

Logging and Analyzing User’s Interactions in Web Portals (page 213)
Gennaro Costagliola, Filomena Ferrucci, Vittorio Fuccella, and Luigi Zurolo

A Semantics-Based Automatic Web Content Adaptation Framework for Mobile Devices (page 230)
Chichang Jou

Part III: Society, e-Business and e-Government

Summarizing Online Customer Reviews Automatically Based on Topical Structure (page 245)
Jiaming Zhan, Han Tong Loh, and Ying Liu

The Structure of Web-Based Information Systems Satisfaction: An Application of Confirmatory Factor Analysis (page 257)
Christy M.K. Cheung and Matthew K.O. Lee

Part IV: e-Learning

Introducing Communities to e-Learning (page 277)
Gavin McArdle, Teresa Monahan, and Michela Bertolotto

Evaluating an Online Module on Copyright Law and Intellectual Property (page 292)
Carmel McNaught, Paul Lam, Shirley Leung, and Kin-Fai Cheng

Using Games-Based Learning to Teach Software Engineering (page 304)
Thomas M. Connolly, Mark Stansfield, and Tom Hainey

Agility in Serious Games Development with Distributed Teams: A Case Study (page 314)
Manuel Oliveira and Heiko Duin

Improving the Flexibility of Learning Environments: Developing Applications for Wired and Wireless Use (page 327)
David M. Kennedy and Doug Vogel

Evaluating an e-Learning Experience Based on the Sakai Environment (page 338)
Félix Buendía García and Antonio Hervás Jorge

Author Index (page 347)
Invited Papers
Programming Support and Governance for Process-Oriented Software Autonomy

A. Taleb-Bendiab, P. Miseldine, M. Randles, and Thar Baker

School of Computing and Mathematical Sciences, Liverpool John Moores University, Byrom Street, Liverpool, L3 3AF, UK
Abstract. Business Process models seek to orchestrate business functions through the development of automated task completion, which is becoming increasingly used for Service-Oriented Architectures. This has led to many advances in the methods and tools available for software and language support in process modelling and enactment. Recent developments in Business Process Execution languages, such as WS-BPEL 2.0, have widened the scope of process modelling to encompass cross-enterprise and inter-enterprise processes, with a wide spread of often heterogeneous business processes together with a range of associated modules for enactment, governance and assurance, to name but a few, to address non-functional requirements. Hence, the task of provisioning and managing such systems far outstrips the capabilities of human operatives, with most adaptations to operational circumstances requiring the system to be taken offline, reprogrammed, recompiled and redeployed. This work focuses on the application of recent developments in language support for software autonomy whilst guaranteeing autonomic software behaviour. The issues to be addressed are stated, with a supporting framework and language, Neptune. This is illustrated through a representative example, and a case study evaluation is reported upon.

Keywords: Self-governance, programming language, process models.
1 Introduction

Business Process Management is presently a well-established method for designing, maintaining and evolving large enterprise Business Process Models. Business Process Modelling or Management (BPM) originally consisted of workflow-type graphical notations based on flow charts. In recent years, however, the scope of BPM in general has widened to support an enterprise integration technology complementing Service-Oriented Architecture (SOA), Enterprise Application Integration (EAI), and Enterprise Service Bus (ESB). The current understanding of a process is one that orchestrates complex system interactions, and is itself a service capable of communicating and interacting with other processes according to industry standard protocols. Many languages and software tools have been proposed to develop process-oriented
software. Currently there is much interest in web-based service orchestration to describe business processes through SOA and the emerging Service Component Architecture model, for example. The primary language for business process support in this field is the Business Process Execution Language (BPEL), most usually represented by BPEL4WS and its successor WS-BPEL 2.0 [1]; other alternatives exist, though, to support process-oriented software for implementing business processes on top of web services. In view of the complexity engendered by the heterogeneity in processes and their functional and non-functional requirements to support enterprise agility and business ecosystems, this work focuses on support for runtime self-management and its predictable governance [2]. The aim is to contribute to new models of autonomic process-oriented Service Component Architecture (SCA). This approach is seen as vital to support current service-intensive systems providing the core infrastructure for commerce and government [3], with their need for constant availability (to be always on) and continuous evolution based on current knowledge and process. Such requirements are evident in a healthcare decision process model, which is often changing to reflect new understandings, and the software used to aid the decision process workflow is required to remain up to date to produce best-practice care [4]. Delays in updating software mean patients are diagnosed using out-of-date medical knowledge and guidelines. Coupled with distribution, where different parts of the application may be executed and stored on different machines, the complexity of deployment and its operating environment drives the need for computational techniques to aid in the analysis and actuation of changes in requirement whilst the software remains operational. Research into autonomic software has yielded many potential solutions to adaptation with respect to complexity [5], often from the viewpoint of providing self-aware behaviour to predict and react to changes in requirements for governance [6]. As identified in [7], however, a systematic approach to the adaptation of software at runtime to new or emergent requirements is still an outstanding issue for research. In addition, providing adaptation of a business process based on assured, bounded governance requires knowledge of not only what should be adapted, but also its impact on other processes and the safety of producing such adaptation within its defined boundaries. In effect, knowledge of the construction, and the meanings behind the formation, of a business process system is vital for bounded autonomy through adaptation to be accomplished [8]. Issues arise when this knowledge for autonomy is to be inferred at runtime and adaptation is required to be instigated. Process implementations are most often deployed in a compiled denotation unsuited for reasoning upon to perform coordinated adaptation. Similarly, the compilation process and language used remove much of the semantic direction of the code; code is commonly written as instructions to achieve a goal, without explicitly specifying why those specific instructions were chosen for the task. In modern, distributed systems, it is often the case that source code is not available for adaptation, with the use of the software-as-a-service model hiding the implementation from consumers.
As a result, much research focuses on behaviour-driven adaptation and tuning of an application, rather than on the construction and adaptation of its deployed code base.
It is the contention of this paper that the functionality and knowledge required for autonomic adaptation of process models can be achieved via semantic analysis of the core components and code base of the software itself. This analysis can in turn inform traditional techniques of critique used in autonomic software, such as emergence identification and signal-grounding [9]. The paper highlights the limitations in current research into computational runtime adaptation and moves to identify the properties of a new language, Neptune, that attempts to provide the requisite features to meet the goal of runtime computational adaptation of a system in general and process models in particular. Several proof-of-concept scenarios are given to further illustrate the approach advocated by Neptune and to justify a semantics-based approach. The paper concludes with a discussion of the limitations introduced by the approach, with work relating to possible future solutions given.
2 Business Process Runtime Requirements for Self-governance

The inherent complexity of Business Process Model enactment, where process interactions and cross-enterprise applications are involved, contributes towards the increasing interest in emerging autonomic SOA provision.

Table 1. Language Support Issues
1. Deployment Platform Reliance: There must be no reliance on an extension to, or specialism of, a specific language, operating system, or other system component to function. In effect, it must be deployment-platform agnostic.
2. Development Platform Reliance: The method of code adaptation for autonomic management must be accessible such that it has the ability to be dispatched by, and respond via, standard methods of invocation and communication. In effect, it must be development-platform agnostic.
3. Expressiveness: The semantic labelling of the process's ultimate purpose must be expressive enough to represent the semantics of base language components – which themselves may be implemented in any programming language that is supported in Issue 2 – such that the effect, requirement, and mechanism of actuation of the component can be ascertained at runtime for use by the autonomic manager.
4. Application Representation: A service application must be expressed in terms of the processes it is to represent using an activity-based classification, using the same semantic forms defined in Issue 3.
5. Component Effects: The intended purpose of a component as provided in Issue 3 must be defined by its side effects on the state of the process it is used within.
6. Data Distribution: There must be support for runtime distribution of all information expressed within the process model via computational methods and communication standards.
Autonomic software capabilities attempt to predict and react to changes in software requirements for self-preservation. Drawing on the naturally derived autonomous behaviour exhibited by living organisms, research into autonomic computing is often linked with runtime adaptation and analysis to provide solutions that imbue self-aware behaviour in systems, machines and processes [10]. A wide variety of solutions exist, often targeting different aspects of behaviour in terms of implementation and deployment. As stated in the introduction, autonomic computing has traditionally viewed systems from a behavioural perspective via monitoring. There are techniques, however, that use code base analysis: Toskana [11] and similar approaches use AOP as a basis for adaptable change. Further holistic paradigms for autonomic software have been built using agent theory [12], treating a complete system as a series of agents, each with their own behaviour models, to collectively perform set tasks and behaviour. Such an approach can also be applied to view business processes as sets of agent tasks and roles. Other problems arise from the lack of availability of source code. Often in systems such as service-oriented architectures, only consumption of behaviour is available, with the actual implementation and its source code hidden. In these cases, adaptive middleware components that marshal and re-route calls are popular solutions [13]; these, however, often work at an abstraction layer above the code base, via policy documentation, forming a metamodel of behaviour reliant on observation. In line with the highlighted concerns associated with providing autonomic management of implemented business process applications, there are a number of issues that need to be addressed; these have been collated in Table 1. A more detailed description may be found in the associated technical report [14].
3 Implementation of Autonomic Management

As a conceptual basis, the qualities specified previously have been formed to provide the foundation of a scripting language, Neptune, to solve many of the outstanding issues highlighted in current research into runtime analysis and adaptation. Neptune can enable full runtime adaptation for autonomic self-management by addressing the issues previously identified. Issues 1 and 2 both indicate that the solution should sit above, and be separate from, any single language used to produce language components. In effect, the components are hidden from Neptune by way of a semantic interface: Neptune need only be able to dispatch calls defined within the semantic interface, rather than needing to directly host and manage the component itself. In this way, a bridge between Neptune and a base language is required, both to facilitate communication between the base language and Neptune, so that the language can instruct Neptune and interpret results from it, and so that Neptune can dispatch a call that is required to be handled by the base language. It is also clear that Neptune requires a form of expression to be defined such that it is capable of representing the ontological model desired to fulfil the remaining qualities. In keeping with established, classical works [15], Neptune will present information using a context-free language, NeptuneScript.
Figure 1 shows a schematic view of the method by which Neptune communicates with base language components. The Base Language Connectors are themselves written in a base language rather than in NeptuneScript, and encapsulate the structure and communication methods needed to interact fully with Neptune and to be able to interpret and write NeptuneScript. Neptune itself provides numerous interfaces using standardised protocols such as COM, SOAP, and RPC forms such as Java RPC or .NET Remoting, to enable a base language connector to dispatch a call to Neptune and receive responses from it, as governed by Issue 2.
Fig. 1. Neptune/Base Language Interface
These calls are marshalled either directly to the higher Neptune Framework for processing, or situated as Neptune Base Language Objects (NBLOs), a semantic form that describes the intentions and conditions needed by a base language component. In addition, the NBLOs contain the requisite information to determine how to interoperate with the appropriate base language (BL) connector to marshal a call to the base language component. This information can be used by the Neptune Framework to invoke the behaviour described in the NBLO. A semantic linker module within the framework encapsulates the functionality required to match and relate the semantics defined within process models to those contained within the NBLOs, to provide solutions for activities or tasks given in the process. Once a match is found, the calls to the base language contained within the NBLO are actuated. In effect, the axiomatic and operational semantics of a task are related to the denotational semantics given as an actuation within an NBLO's definition. As discussed in Issue 3, the processes of semantic manipulation and inference are well understood in the literature, and similar techniques are used to inform the linking module. This process is illustrated in Figure 2.
Fig. 2. The Neptune Framework
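The connector contract itself is not given in the paper's figures. Purely as an illustration of the kind of interface a base language connector might expose to the framework (the interface and method names below are assumptions, not Neptune's actual API), a minimal Java sketch could read:

```java
// Hypothetical sketch of a base-language connector contract; Neptune's real API is not
// shown in this text, so every name here is illustrative only.
public interface BaseLanguageConnector {

    // The base language this connector bridges to (e.g. "C#" or "Java").
    String language();

    // Dispatches a call described by an NBLO actuation to the underlying component and
    // returns the result marshalled back as a value the framework can reason over.
    Object dispatch(String componentName, String methodName, Object... arguments) throws Exception;
}

// A trivial in-process connector for Java components, resolving calls by reflection.
class ReflectiveJavaConnector implements BaseLanguageConnector {

    @Override
    public String language() {
        return "Java";
    }

    @Override
    public Object dispatch(String componentName, String methodName, Object... arguments) throws Exception {
        Class<?> componentClass = Class.forName(componentName);
        Object instance = componentClass.getDeclaredConstructor().newInstance();
        Class<?>[] parameterTypes = new Class<?>[arguments.length];
        for (int i = 0; i < arguments.length; i++) {
            parameterTypes[i] = arguments[i].getClass();
        }
        return componentClass.getMethod(methodName, parameterTypes).invoke(instance, arguments);
    }
}
```

A SOAP- or RPC-based connector would implement the same contract remotely, which is how the framework can remain deployment-platform agnostic (Issue 1) while staying callable from any base language (Issue 2).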
3.1 A Simple Example

In order to illustrate the design basis of Neptune, and to show the form NeptuneScript takes, the following example is used: a simple task is portrayed as a process that formats a given string as a title, using different styles that are implemented in distinct languages: a method bold in C# and a method italic in Java, as shown below (Figures 3 and 4).
Fig. 3. Code for format.cs
Fig. 4. Code for format.java
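The listings of Figures 3 and 4 are not reproduced in this text. As a rough reconstruction of the kind of components being described (the paper implements bold in C# and italic in Java; the class name, method signatures and HTML-style tags below are assumptions), both formatters reduce to simple wrappers of this form, shown here in Java:

```java
// Illustrative stand-ins for the formatting components of Figures 3 and 4.
// Each method adds one presentation "feature" to a string, which is the side effect
// the corresponding NBLO would describe on the Neptune type NString.
public class Formatter {

    // Marks a string as bold by surrounding it with tags (assumed HTML-style markup).
    public static String bold(String text) {
        return "<b>" + text + "</b>";
    }

    // Marks a string as italic by surrounding it with tags (assumed HTML-style markup).
    public static String italic(String text) {
        return "<i>" + text + "</i>";
    }

    public static void main(String[] args) {
        // A title carrying both features, as a task requiring bold and italic would demand.
        System.out.println(italic(bold("Process-Oriented Software Autonomy")));
    }
}
```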
Both are described using operational and axiomatic semantics within an NBLO definition to show that the purpose is to provide a feature to a type defined in Neptune, NString, which transcends the types used in the base languages. In effect, the NBLO is defined by its effect on the state of any application it is part of – in keeping with
Issue 5 – such that the type NString represents the state of a string through the process being executed. This is shown for the C# bold method in Figure 5. Tasks that require particular features of the type, requirements that are axiomatically defined, can thus be related to the relevant NBLO that can operationally ensure conformance to the requirement.
Fig. 5. NBLO Definitions
Evaluations, if possible, of the operational semantics defined can be included to determine if the goal of the identified process has been successfully actuated. Details of the base language component are provided in the form of an actuation and actual call needed by the base language connector, to provide the denotational semantics that the semantic linker module should produce. Each base language connector defines the meaning of a set of base types provided by Neptune which are related to equivalent types in the base language, such that NString represents a string as understood by Neptune, which is related to System.String in C# and the String type in Java via their respective base language connectors, to maintain interoperability. With this behavioural basis, a task can be written that requires a string to have two features, bold and italic. These are named a and b respectively, so that the requirements can be referenced within the body of the task for adaptation purposes. Neptune can rewrite the task to provide an action element that can satisfy the requirements specified in the task by inferring suitable denotational semantics by linking them to the purposes defined in the NBLOs such that the requirements hold. In this regard, Neptune relates the semantic description given in a task to the denotational semantics provided in the NBLOs given as actual source code invocations, without human intervention. This provides the maximum level of semantic-based assurance to the solution of the process task: each operation to take place via the base language components can be traced and asserted to individual requirements. As the requirements expressed in the task are met, Neptune can then marshal the appropriate calls to the base language connectors to instigate the actual functionality. To highlight the adaptive qualities of Neptune, the task can be rewritten at runtime to only require the code shown in Figure 6:
Fig. 6. Adapted Task Definition
As the actions produced before no longer satisfy the single requirement, such that nblo.Java.Italic provides a feature not required by the task, this can be treated as a trigger for adaptation. This is important, as the requirements defined in the task can thus act as a governance mechanism ensuring that the behaviour required is actually actuated. In this way, a developer can specify actions, which are asserted against the requirements and refined accordingly. The process of semantic linking can repeat given this trigger, to refine the action element of the task. Furthermore, a requirement can be specified to only be applied if the NBLO evaluates appropriately using the logic outlined in the evaluate element of the NBLO definition. As such, the task can be written as shown in Figure 7.
Fig. 7. Adapted Task Definition with New Inferred Action
In this way, the feature Bold is defined to evaluate to true if a string starts with the opening bold tag and ends with the corresponding closing tag. If the string does not contain this feature – it evaluates to false – then the requirement is actuated. This control structure is repeated within the action element of the task to optimise the process; without the control structure, the action element would require adaptation upon each execution of the task with a given string. To show how a Neptune process can be adapted to reflect new behaviour, a new assembly can be added to the runtime environment of the application and described via a new NBLO that is registered with the controller at runtime, with a similar underline class in C#. The task FormatStringAsTitle can be rewritten at runtime, due to its exposition, to include the requirement of the functionality provided by the new behaviour, which is semantically linked and adapted by the respective modules. Thus, the semantic definition of NString.Underline implies that NString.Bold must also be present, allowing the task to be adapted as shown in Figure 8.
Fig. 8. Inferred Bolded Behaviour
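Outside NeptuneScript, the guarded evaluation just described (apply a feature only when its evaluate logic reports it absent) can be pictured as follows. This Java fragment is only an analogy, and the tag-based check is an assumption, since the original figures are not reproduced here:

```java
// Analogy for the evaluate/requirement guard: the Bold feature is only actuated when the
// string does not already evaluate as bold, mirroring the control structure added to the task.
public class GuardedFormatting {

    // Assumed evaluation logic for the Bold feature of an NString-like value.
    static boolean isBold(String text) {
        return text.startsWith("<b>") && text.endsWith("</b>");
    }

    // Apply the requirement only if its evaluation currently fails.
    static String ensureBold(String text) {
        return isBold(text) ? text : "<b>" + text + "</b>";
    }

    public static void main(String[] args) {
        System.out.println(ensureBold("plain title"));             // feature actuated
        System.out.println(ensureBold("<b>already bold</b>"));     // left untouched
    }
}
```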
4 A Case Study

Neptune has been used to develop a number of autonomic process models to test and evaluate its design and performance, and to further analyse its approach. In particular, Neptune has been used to implement an autonomic service-oriented decision-support process, which enables, via situation and role awareness, the reconfiguration of medical decision agents to adapt to the individual preferences of expert clinicians and hospitals' local clinical governance rules. Below is a survey of Neptune's suitability for producing self-adaptive autonomic business process systems bounded within organisational guidelines, via examples of architectural and behavioural adaptation required to be implemented at runtime.

4.1 Architectural Autonomy

In partnership with several NHS dental trusts, the triage process of dental access services was assessed and implemented in Neptune, using an activity-based BPM. The model defined individual tasks that represented data input, such that the process required knowledge of a patient. The data could feasibly be read from a database, if it was available, or from the operator directly asking the patient. As such, two NBLOs were defined to describe these methods of data input for the patient's data, given their NHS ID, as shown in Figure 9. Both NBLOs have the same purpose, but define different base language methods of obtaining the data. In addition, the getPatientFromOperator purpose specifies that it should only be enacted if the data is not available in the database. A task was set up to obtain this data; however, how the data was to be found was left to Neptune to deliberate upon. In this way, the actual behaviour within the system for locating a patient was left to autonomous control: it is not specified in any source code other than via the intentions and semantics of its desired effect. During the lifetime of the application, a new Electronic Patient Register (EPR) was produced. The existing mechanism for retrieving data within the process now had to use not the local database but the new EPR, via a web service interface written in Java. Accordingly, the existing process model still held; however, any reference to nblo.getPatientFromDatabase was required to be adapted to point towards a new
Fig. 9. NBLO Definition for Patient Lookup
NBLO, getPatientFromEPR, which semantically defined the new mechanism for retrieving data. For brevity, its source is not fully included; however, due to its web service basis, it set a requirement that a SOAP endpoint be made available to it before it could execute, giving it a different semantics from that of the database. The refactor types are pre-defined blocks of utility, executed by Neptune via its sources. In this way, when called, the NBLO will rewrite any reference to the existing NBLO in the tasks and processes to the new method required. The original NBLO is then removed from Neptune, giving a redirection of behaviour at runtime. These mechanisms are the same as those employed by IDEs such as Eclipse and Visual Studio, and as such follow classical research into adaptation. In Neptune, however, such adaptation is machine-led and performed at runtime on the existing source base without recompilation, as the code is simply re-interpreted rather than being at a pre-compile stage. However, the interpreter handles refactoring tasks such that, when called, it modifies the source code directly to instigate the change. Thus, rather than applying the policy in a meta-form, either via additional policy or monitoring, the changes are applied at the source code level, leading to a cleaner and more manageable code base. To conclude, the actions are re-interpreted after the refactor commands were interpreted. As the getPatientFromEPR NBLO required a SOAP service endpoint to be discovered, the actions, after re-interpretation, contain the new requirement set forth by the NBLO, yielding line 7 in Figure 10. Thus, any point at which the process required a feature from the database was automatically adapted at runtime to provide the same feature from a different approach, one that itself required modifications to the source based on its own requirements, to maintain integrity. As all inferences are semantics-based, changes to the requirements and properties of the new system could easily be instigated, and via the checks made by Neptune, full adaptation within the
task could be made. In addition, Neptune can be used to describe components implemented in different languages; these could be combined to produce a single solution to a set process. This was achieved without requiring the execution sequence of components to be specified. It was, instead, inferred autonomically from the boundaries described semantically via BPMs.
Fig. 10. Task with New Re-factored Actions
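Neither the component originally behind nblo.getPatientFromDatabase nor its EPR replacement is listed in this text. Purely as an illustrative sketch (the class names, SQL, and HTTP call are assumptions, and a plain HTTP request stands in for the SOAP exchange actually described), the two interchangeable implementations of the same purpose might resemble:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical data-access components sharing one purpose: resolve a patient by NHS ID.
// Either can satisfy the task's requirement; only their described semantics differ.
public class PatientLookup {

    // Original mechanism: read the patient record from the local triage database.
    public static String getPatientFromDatabase(Connection triageDb, String nhsId) throws Exception {
        try (PreparedStatement statement =
                     triageDb.prepareStatement("SELECT name FROM patient WHERE nhs_id = ?")) {
            statement.setString(1, nhsId);
            try (ResultSet row = statement.executeQuery()) {
                return row.next() ? row.getString("name") : null;
            }
        }
    }

    // Replacement mechanism: query the Electronic Patient Register over its web service.
    // The extra precondition (a discovered service endpoint) is exactly the requirement the
    // getPatientFromEPR NBLO adds to the task after refactoring.
    public static String getPatientFromEPR(URI eprEndpoint, String nhsId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(eprEndpoint + "/patients/" + nhsId))
                .GET()
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```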
4.2 Behavioural Autonomy

Using Neptune, the process that occurred after pain levels were ascertained could be modified directly, producing a unified decision process model for each clinician. Axioms were set forth in the base decision model, allowing behaviour, such as instructing a patient to go to hospital, to be ratified against institutional procedures. In this way, as long as these axioms were met, the requirements of the model could be instigated. Thus, the task afterPain could be rewritten at runtime to invoke the method p.ToHospital (say). Other tasks and NBLOs could be written to extend the definition of p.ToHospital such that this can only occur if there is space in the hospital, or if waiting times are lower than a specified value. As each aspect of information is defined independently, the cause and effect of such a policy does not need to be determined: Neptune relates the semantics to an end-goal, rather than requiring human intervention to achieve runtime autonomic refactoring of behaviour. Where a behaviour is deemed unsuitable, a boundary is formed: if no hospital were available to send a patient to, the preference could not be safely realised, and as such control defaults back to the original model, which is rigorously defined. In this way, concerns can be autonomously introduced to the system with assured behaviour, through maintaining the purposes and requirements of the system.
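The decision model itself is expressed in NeptuneScript and is not listed; the bounded-autonomy idea can nonetheless be paraphrased in ordinary code. In the hedged Java sketch below (all names and thresholds are invented for illustration), the clinician's preferred action is only taken when the institutional guards hold; otherwise control falls back to the rigorously defined base model:

```java
// Paraphrase of bounded behavioural autonomy: a preference (send the patient to hospital)
// is only realised inside institutional boundaries; otherwise the base model applies.
public class AfterPainPolicy {

    interface Hospital {
        boolean hasSpace();
        int waitingTimeMinutes();
    }

    static final int MAX_ACCEPTABLE_WAIT_MINUTES = 60;   // invented threshold for illustration

    // Returns the action actually taken after pain levels are ascertained.
    static String afterPain(Hospital hospital, boolean clinicianPrefersHospital) {
        boolean withinBoundary = hospital != null
                && hospital.hasSpace()
                && hospital.waitingTimeMinutes() < MAX_ACCEPTABLE_WAIT_MINUTES;
        if (clinicianPrefersHospital && withinBoundary) {
            return "toHospital";          // preference realised: guards ratified the behaviour
        }
        return "baseDecisionModel";       // boundary not met: default to the original model
    }
}
```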
5 Evaluation

Given the increasing requirements for future distributed computing systems to be endowed with runtime adaptation capabilities, there are many applications for the Neptune framework and language, in that autonomic middleware will incorporate a logical reasoning mechanism to deduce responses, throughout runtime, without halting the system. Presently, the rules for autonomic response are often portrayed to the
system as a static rule set consisting of norms or modalities, such as permissible or obligatory. In many distributed settings, as the system evolves and moves away from its starting configuration, these rules will quickly become outdated. It is at this point that the logical mechanisms within Neptune become paramount, to dynamically evolve the rule set and code base in line with the system's evolution. There have, therefore, been many case study evaluations of the described approach in autonomic middleware applications. The following is a further evaluation of Neptune's performance against Sun Microsystems' Pet Store and Microsoft's PetShop benchmark for enterprise applications, which is outlined below; the full description of this evaluation is outside the scope of this paper and can be found in [16].

5.1 Adaptation Scenario

Pet Store [17] is an architectural blueprint from Sun Microsystems highlighting many of the advanced capabilities available with Java, as well as detailing the design guidelines for the language. The application produced by the Pet Store blueprint builds an imaginary web-based solution for browsing and purchasing animals within a storefront interface. There are facilities within the application architecture for administrators of the system to add, remove, and modify the animals available for sale, with their data stored in a database. Thus the Pet Store is a publicly available, best-practice system design architecture for e-commerce and enterprise systems in general, against which Neptune may be evaluated. Furthermore, shortly after the release of Pet Store, Microsoft implemented its own interpretation of the system, PetShop [18], such that a comparison could be made between the effectiveness of a Java-implemented enterprise system and a Microsoft .NET implementation of the same behaviour. The design of the Pet Store model is one that advocates the use of a 3-tier architecture. In this approach, a logic layer, provided by language support, sits between a data layer, consisting of database functions, and a presentation layer displaying dynamic web pages to the user. To assess the benefits of Neptune, it is the logic layer that is primarily of interest here, in order to evaluate the impact on the system of adapting a process. In the PetShop model, thin classes, which directly represent the data types used within the software, act as isolators of dependency between layers. The logic layer consists of classes that each encapsulate an entity used within the shop model; a customer class represents all the features and data model required of a customer, whereas the order class represents the processes that pertain to order transactions within the system. Methods are used to encapsulate processes, such that Product.getProductFromID encapsulates the process needed to obtain a product from the database given an ID. As such, to adapt a process, its associated class and method contained within the logic layer need to be determined.

5.2 Introducing Neptune to the PetShop/Pet Store

Other implementations of the original Pet Store model exist; this model is particularly of interest because Neptune is required to be introduced to an existing model. In this way, Neptune can be considered as a concern, and thus crosscut against the existing classes representing the processes requiring adaptation. The logic layer in PetShop performs
several key processes: the process of ordering a product, where customer and product functions are executed to produce an order, is an example of such a process. It can be seen that adapting the behaviour of the order process, such that a modified process is generated in reference to a new behaviour using a Process-Oriented Architecture (POA), may highlight many of the design qualities given in Neptune, and contrast these against the current best practice advocated by the original designs. In the comparison between PetShop and Pet Store, several criteria were considered to compare the designs. In effect, the benefits of the PetShop were determined by the number of lines of code produced, with performance and scalability judged by response time per user and the CPU resources required by both approaches. The scenario outlines that a process should be chosen from the PetShop example to be modelled within Neptune, and an analysis made, using the criteria set by Microsoft and Sun, of the relative impact of introducing Neptune to an existing system. The model should then be adapted to introduce a new behaviour, with a comparison made between the method required in Neptune to achieve this adaptation and the method used in the traditional system design, with benefits and limitations discussed accordingly. The process of performing an order within the PetShop is split between the logic layer and the presentation layer. Data is retrieved to inform the process via discrete web pages, such that OrderShipping.aspx, for example, contains the trigger code that moves the process towards another task when the user has completed the page and clicked a relevant button. The execution of the process then passes to the logic layer to interpret the data and store it within the database. Once the data is stored, the logic moves the execution on to produce another page for the user. As such, the classes within the logic layer can be said to change the state of the process by signifying that a task is complete and that data is added to the data layer structures, as these are the side effects of their actuation. Similarly, pages within the presentation layer can be said to introduce data to the state of the process, and to signify that the task of producing data is complete. In this way, exposing the logic layer and presentation layer classes to Neptune via NBLOs allows the operational semantics of the classes to be imparted.

5.3 Results

Using the criteria set forth in the original release of PetShop, the following results were obtained for the execution of a complete order process. These results will, therefore, show the impact of using Neptune in terms of complexity and performance. For response times, Microsoft ACT was used to provide dummy information to the pages so that the process could complete via machine instruction. Averages over 10,000 process executions were taken with 100 concurrent visitors. Whilst the original PetShop did not include any need for reflection, reflection can be considered vital for a system to be available for adaptation, due to the relationship between runtime analysis and adaptation. As such, the same test was performed using a modified PetShop implementation that called its methods via reflection. In this way, a benchmark for any form of adaptation could be ascertained.
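The measurement setup (averages over 10,000 executions with 100 concurrent virtual users, driven by Microsoft ACT) can be approximated by any load-generation harness. The minimal Java sketch below reproduces the style of measurement only; it is not the authors' tooling, and the simulated work stands in for a real order-process invocation:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Minimal load-generation harness: a fixed number of executions spread over a pool of
// concurrent workers, reporting the mean latency of the exercised process.
public class OrderProcessBenchmark {

    static final int TOTAL_EXECUTIONS = 10_000;
    static final int CONCURRENT_USERS = 100;

    public static void main(String[] args) throws InterruptedException {
        AtomicLong totalNanos = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(CONCURRENT_USERS);
        for (int i = 0; i < TOTAL_EXECUTIONS; i++) {
            pool.submit(() -> {
                long start = System.nanoTime();
                executeOrderProcess();                        // one complete order process
                totalNanos.addAndGet(System.nanoTime() - start);
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        System.out.printf("mean latency: %.2f ms%n",
                totalNanos.get() / (double) TOTAL_EXECUTIONS / 1_000_000.0);
    }

    // Placeholder for the system under test, e.g. submitting dummy data to the order pages.
    static void executeOrderProcess() {
        try {
            Thread.sleep(1);                                  // simulated work only
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```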
Fig. 11. Performance Relationship
The results, shown in Figure 11, indicate that the overhead of using Neptune is significant upon the first execution of the process. This is due to the semantic linker having to process the available NBLO descriptions and refactor the description to perform the task. Upon further execution, however, as the tasks have already been provided with the NBLOs that satisfy their requirements, execution time drops dramatically, leaving only a slight overhead in terms of computational performance over the adaptation benchmark.
Fig. 12. Neptune Scalability Testing
In terms of CPU usage, the processes of Neptune produced more computational load than the traditionally compiled application. This is expected, due to the interpretation that occurs at runtime via use of Neptune. The use of reflection produced a slight increase in CPU consumption; however, the compiled Neptune system – produced after the first semantic linking – performed relatively worse. It should be noted that the implementation of the Neptune framework was designed as a prototype, rather than being optimised for performance. As such, it could be expected, theoretically, that any overhead can be reduced further with subsequent revisions of the implementation of the framework. The scalability of Neptune in terms of performance (blue line) is measured against a baseline execution rate (green line) in Figure 12. This shows that at present the initial overhead of Neptune is quite high. This, however, may be offset by the benefits of continuous availability later in the system's evolution.
6 Conclusions

This work has proposed applying autonomic runtime support to Business Process Models to provide methods that address self-governance in service-oriented architectures. Specifically, this has involved adaptation of the actual code base of the system by employing a scripting language and framework, Neptune, to reason over the perceived goals of the deployed software. This means that interacting or complex business processes specified in any manner can benefit from runtime adaptation. Further work is required around the issues of type safety and scalability in the semantic linking process; as the volume of NBLOs and semantic concerns increases, there will be a corresponding rise in the complexity associated with maintaining the optimal behaviour required. In this work the method is shown to have practical and theoretical value in providing autonomous management in SOAs. Further work is required to make the approach more scalable: the present position is that a high initial computational overhead is encountered. Optimisation of the approach will aid scalability, offsetting the initial costs against future savings in system downtime due to the added robustness gained from the runtime adaptation facilities.
References

1. Arkin, A., Askary, S., Bloch, B., Curbera, F., Goland, Y., Kartha, N., Liu, C.K., Thatte, S., Yendluri, P., Yiu, A. (eds.): Web Services Business Process Execution Language Version 2.0. WS-BPEL TC, OASIS (December 2005), http://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.pdf
2. Baresi, L., Di Nitto, E., Ghezzi, C.: Toward Open-World Software: Issues and Challenges. Computer 39(10), 36–43 (2006)
3. Adeshara, P., Juric, R., Kuljis, J., Paul, R.: A Survey of Acceptance of E-government Services in the UK. In: Proceedings of the 26th International Conference on Information Technology Interfaces, pp. 415–420 (2004)
4. Rousseau, N., McColl, E., Newton, J., Grimshaw, J., Eccles, M.: Practice based, longitudinal, qualitative interview study of computerised evidence based guidelines in primary care. British Medical Journal 326(7384), 1–8 (2003)
5. Balasubramaniam, D., Morrison, R., Kirby, G., Mickan, K., Warboys, B., Robertson, I., Snowdon, B., Greenwood, R.M., Seet, W.: A software architecture approach for structuring autonomic systems. In: DEAS 2005: Proceedings of the 2005 Workshop on Design and Evolution of Autonomic Application Software, pp. 1–7. ACM Press, New York (2005)
6. Kephart, J.O., Chess, D.M.: The vision of autonomic computing. Computer 36(1), 41–50 (2003)
7. Salehie, M., Tahvildari, L.: Autonomic computing: emerging trends and open problems. In: DEAS 2005: Proceedings of the 2005 Workshop on Design and Evolution of Autonomic Application Software, pp. 1–7. ACM Press, New York (2005)
8. Sharon, A., Estrin, D.: On supporting autonomy and interdependence in distributed systems. In: Proceedings of the 3rd ACM SIGOPS European Workshop, pp. 1–4. ACM Press, New York (1988)
9. Randles, M., Taleb-Bendiab, A., Miseldine, P.: Addressing the signal grounding problem for autonomic systems. In: Proceedings of the International Conference on Autonomic and Autonomous Systems (ICAS 2006), July 2006, pp. 21–27 (2006)
10. Parashar, M., Hariri, S.: Autonomic Computing: An Overview. Springer, Heidelberg (2005)
11. Engel, M., Freisleben, B.: Supporting autonomic computing functionality via dynamic operating system kernel aspects. In: AOSD 2005: Proceedings of the 4th International Conference on Aspect-Oriented Software Development, pp. 51–62. ACM Press, New York (2005)
12. Castelfranchi, C.: Guarantees for autonomy in cognitive agent architecture. In: ECAI 1994: Proceedings of the Workshop on Agent Theories, Architectures, and Languages on Intelligent Agents, pp. 56–70. Springer, New York (1995)
13. Caseau, Y.: Self-adaptive middleware: Supporting business process priorities and service level agreements. Advanced Engineering Informatics 19(3), 199–211 (2005)
14. Miseldine, P., Taleb-Bendiab, A.: Neptune: Supporting Semantics-Based Runtime Software Refactoring to Achieve Assured System Autonomy. Technical Report, Liverpool John Moores University, http://neptune.cms.livjm.ac.uk/dasel
15. Chomsky, N.: Three models for the description of language. IRE Transactions on Information Theory 2(3), 113–124 (1956)
16. Miseldine, P.: Language Support for Process-Oriented Programming of Autonomic Software Systems. PhD Thesis, Liverpool John Moores University (2007)
17. Sun Microsystems: Java Pet Store Demo. Java Blueprints (accessed November 2007), https://blueprints.dev.java.net/petstore/
18. Microsoft: Using .NET to Implement Sun Microsystems' Java Pet Store J2EE Blueprint Application (accessed November 2007), http://msdn2.microsoft.com/en-us/library/ms954626.aspx
Representing and Validating Digital Business Processes

Ernesto Damiani (1), Paolo Ceravolo (1), Cristiano Fugazza (1), and Karl Reed (2)

(1) Department of Information Technology, University of Milan, via Bramante 65, 26013 Crema (CR), Italy
{damiani,ceravolo,fugazza}@dti.unimi.it
(2) Department of Computer Science, La Trobe University, Bundoora, Melbourne, Australia
1 Introduction Today, the term “extended enterprise” is used to designate any collection of organizations sharing a common set of goals. In this broad sense, an enterprise can be a whole corporation, a government organization, or a network of geographically distributed entities. Extended enterprise applications support digitalization of traditional business processes, adding new processes enabled by e-business technology (e.g. large scale Customer Relationship Management). Often, they span company boundaries, supporting a web of relationships between a company and its employees, managers, partners, customers, suppliers, and markets. Digital modeling improves the enterprise's ability to dynamically
set up new business processes or to adjust existing ones to changing market conditions, increasing their ability to compete. Companies are therefore developing a growing interest in concepts, technologies and systems that help them to flexibly align their businesses and engineering processes to meet changing needs and to optimize their interactions with customers and business partners. In such a scenario, Business Process Modeling (BPM) techniques are becoming increasingly important. Less than a decade ago, BPM was known as “workflow design” and was aimed at describing human-based processes within a corporate department. Today, BPM is used to design orchestration of complex system interactions, including communicating with the processes of other companies according to well-defined technical contracts. Also, it is used to check compatibility and consistency of the business processes of collaborating business entities. A number of methodologies, languages and software tools have been proposed to support digital BPM; nonetheless, a lot remains to be done for assessing a business process model's validity with respect to an existing organizational structure or w.r.t. external constraints like the ones imposed by security compliance regulations. In particular, web-based business coalitions and other inter-organizational transactions pose a number of research problems. Recently, the BPM and Web services communities have proposed many languages and standardization approaches that aim at describing business processes, especially from the perspective of Web services orchestration. Well-known examples include BPEL4WS, BPML, XLANG and WSFL. Most languages focus on the representation of the process message exchanges (the so-called process choreography) and on the control and data flow among the involved Web services (the process orchestration). The OMG's Model Driven Architecture (MDA) provides a framework for representing processes at different levels of abstraction. In this paper we rely on an MDA-driven notion of business process model, composed of a static domain model including the domain entities and actors, plus a platform-independent workflow model providing a specification of process activities. We shall focus on semantics-aware business process representation techniques, introducing logics-based static domain models and their relationship with Description Logics and current Semantic Web metadata formats. Then, the problem of implicit knowledge and of its capture in a manner which allows it to be included in business process design is also discussed, presenting some open research issues. The paper is structured as follows: in Section 2 we introduce rule-based business modeling, while in Section 3 we discuss some semantic models for rules. In Section 4, the relation of these semantic models with the ones proposed for Semantic Web metadata is discussed; Section 5 discusses how these two approaches can be successfully hybridized to meet modeling requirements. Finally, Section 6 shows a worked-out hybrid model example, Section 7 reviews related work, and Section 8 draws the conclusions and outlines our future work.
2 Rule-Based Business Modeling Business modeling languages are aimed at providing companies with the capability of describing their internal business procedures using a formal syntax. Rule-based formalization, being close to controlled natural language statements, will ensure that businesses involved in processes fully understand descriptions of themselves and other
participants in their business; also, it will enable organizations to flexibly adjust to new internal and B2B business circumstances. More specifically, rule-based business modeling is aimed at producing machine-understandable business statements in a standard text-based notation. Business modeling activity usually starts from a shared, domain-wide business vocabulary and derives an object-oriented model from it. The Object Management Group's (OMG) Model Driven Architecture (MDA) decomposes the modeling activity's outcome into two main work products:
– Business Rules (BR): declarative statements describing business domains, in a controlled English syntax easy enough to be understood by humans
– Business Process Models (BPM): stepwise specifications of individual business processes, machine-understandable and detailed enough to be translated into computer-supported workflows
Business Rules determine what states and state transitions are possible or permitted for a given business domain, and may be of alethic or deontic modality. Alethic rules are used to model necessities (e.g. implied by physical laws) which cannot be violated, even in principle. For example, an alethic rule may state that an employee must be born on at most one date. Deontic rules are used to model obligations (e.g., resulting from company policy) which ought to be obeyed, but may be violated in real world scenarios. For example, a deontic rule may state that it is forbidden that any person smokes inside any company building. It is important to remark that widespread domain modeling languages such as the Unified Modeling Language (UML) typically express alethic statements only. When drawing a UML class diagram, for instance, the modeler is stating that domain objects belonging to each UML class MUST have the attribute list reported in the class definition, implicitly taking an alethic approach to domain modeling. In business practice, however, many statements are deontic, and it is often important (e.g., for computing metrics) to know if and how often they are violated. The recognition of the importance of this distinction was a major driver of OMG's SBVR (Semantics of Business Vocabulary and Rules) proposal, aimed at specifying a business semantics definition layer on top of its software-oriented layers. We will now focus on BR syntax, postponing a discussion on their semantics to Section 3. To fix our ideas, in this Section we shall mention a few simple examples based on the (fictitious) global car rental company EU-Rent used in the Semantics of Business Vocabulary and Business Rules (SBVR) draft standard1. The basic types of business rules are listed below, together with some illustrative examples:
– Business Rule (deontic: obligation or prohibition). Example: It is required that the drop-off date of a rental precedes the expiration date on the driver license of the customer reserving the rental.
– Definition (alethic: something true by definition). Example: The sales tax rate for a rental is the sales tax rate at the pick-up branch of the rental on the drop-off date of the rental.
The EU-Rent case study was developed by Model Systems, Ltd., along with several other organizations, and has been used in many papers. The body of descriptions and examples may be freely used, providing its source is clearly acknowledged.
– Authorization (permission). Example: A customer reserving a rental may provide what credit card guarantees the rental if the name on the credit card is the name of the customer or the credit card is for a corporate account authorized for use by the customer.
In business rules, a structured vocabulary is used whose symbols belong to three lexical types, namely:
– Terms. Examples: account, assigned car, branch, credit card, driver license, drop-off branch, drop-off date, pick-up branch
– Names. Examples: EU-Rent, MilanBranch
– Verbs (Forms of Expression). Examples: credit card guarantees rental, customer has driver license
Of course, in business rules the relation between terms and concepts is not one-to-one; a single concept can be represented by many terms. For instance, the concept “car” can be represented by alternative synonyms like “automobile” or by translations like “coche” or “voiture”. The BR language must also include a way to express facts as well as rules, and also metadata (annotations) about documents, facts, and rules. However, the OMG SBVR specification specifies neither how to parse (controlled) natural language, nor how to phrase business rules. This is a source of flexibility, as the same fact can be represented in many equivalent sentential forms. For example, the assertion “the driver's license expires on date XX/XX/XX” can be alternatively expressed as “The expiration date xx/xx/xxxx is on the driver's license” or as “The driver's license has expiration date xx/xx/xxxx”.
3 Logical Models for BR Languages The semantics of BRs has been traditionally described by providing a mapping from the business rules syntax to some well-known logic formalism. Three (potentially conflicting) basic requirements for the logical models underlying BRs have been identified:
1. High expressive power. A basic overall requirement for the underlying logics is a high expressive power in specifying rules, in order to provide most of the description capabilities of controlled English. Business modelers accustomed to using English syntax would not accept any overly severe limitation of the modeling language's expressive power.
2. Tractability. The need for performing rule checking and for executability, however, implies a strong desire for computational tractability. In other words, the underlying logic's expressive power has to be carefully balanced against tractability.
3. Non-falsifiability. BR logical semantics should rely on monotonic reasoning, i.e. on inference whose conclusions cannot be contradicted by simply adding new knowledge.
These three requirements have often been emphasized by business analysts and researchers as guidelines toward finding a correct logical interpretation of business models, but they are NOT satisfied by current SBVR logical modeling. In order to understand this paradoxical situation, let us briefly discuss the state of research in this area. Much research work has been done to provide a logics-based model for BR including modalities. Indeed, supporting modalities does not mean that it is mandatory to map the BRs to a modal logic. Consider, for instance, the work by the OMG BR team and specifically by Terry Halpin, including his open source package NORMA, which supports deontic and alethic rules (see sourceforge.net/projects/orm). The approach behind NORMA, Object-Role Modeling (ORM), starts from controlled natural language to express the available information in terms of simple or elementary facts. The information is then converted into an abstract model in terms of natural concepts, like objects and roles. NORMA addresses logical formalization for SBVR by mapping BR's deontic modalities Required and Prohibited into the modal operators obligatory (O), permitted (P) (used when no modality is specified in the rule) and forbidden (F). Deontic modal operators have the following rules w.r.t. negation: ¬Op ≡ P¬p
(1)
¬P p ≡ O¬p ≡ F p
(2)
Other modal operators used for mapping BR alethic rules are necessary (true in all possible states of the business domain, □), possible (true in some state of the business domain, ♦), and impossible (¬♦). Alethic operators' negation rules are as follows: ¬♦p ≡ □¬p
(3)
¬□p ≡ ♦¬p
(4)
Halpin's NORMA approach represents BRs as rules where the only modal operator is the main rule operator, thus avoiding the need for a modal logic model. Some allowed SBVR formulations that violate this restriction may be transformed into an equivalent NORMA expression by applying modal negation rules and the Barcan formulae and their converses, i.e. ∀p □Fp ≡ □∀p Fp (5) ∃p ♦Fp ≡ ♦∃p Fp
(6)
For instance, the BR “For each RentedCar, it is necessary that RentedCar was rented in at most one Date” is transformed into “It is necessary that each RentedCar was hired in at most one Date”2. However, BR rules emerging from business modeling cannot always be transformed into rules where the only modal operator is the main operator. To support such cases, in principle there is no alternative but to adopt a semantics based on a modal logic; but
Another transformation that could be used in this context is the one based on the Barcan formulae's deontic variations, i.e. ∀p OFp ≡ O∀p Fp. We shall not discuss here in detail the application of these transformations to normalizing modal logic formulas; the interested reader is referred to [12].
the choice of the “right” modal logic is by no means a trivial exercise, due to tractability and expressive power problems [12]. Modal logic reasoning engines do exist; for instance, MOLOG has been developed by the Applied Logic Group at IRIT from 1985 on, initially supported by the ESPRIT project ALPES. MOLOG is a general inference machine for building a large class of meta-interpreters. It handles Prolog clauses (with conjunctions and implications) qualified by modal operators. The language used by MOLOG can be multi-modal (i.e., contain several modal operators at the same time). Classical resolution is extended with modal resolution rules defining the operations that can be performed on the modal operators, according to the chosen modal logic. However, instead of choosing a modal logic and applying MOLOG-style modal reasoning, most current approaches to BR semantics are based on Horn Logic, the well-studied sublanguage of First-Order Logic (FOL) which is the basis of Logic Programming and counts on thousands of robust implementations. Of course, not every software rule engine is (or should be) able to process the entire Horn Logic; full Horn Logic rules are well known to be Turing complete, hence undecidable. Again, here we shall not attempt to describe the BR-to-Horn mapping in full. Below we give some intuitive correspondences:
– Each, Some: quantifiers.
– and, or, not: logical connectives.
– The, An: quantification and variable binding.
For example, let us consider an EU-Rent rule expressed in natural language as follows: “For class C rentals, if pick-up branch does not coincide with the drop-off branch, then rental duration cannot be less than two days”. This rule can be parsed to obtain a semi-formal (parsed) rule syntax as follows: For each rental X, if X.class = C and X.pickup <> X.dropoff, then days(X.duration) > 1. The corresponding FOL formula can be written as follows: ∀x ∈ Rental, IsEqual(Class(x), C) ∧ ¬IsEqual(PickUp(x), DropOff(x)) → IsGreater(Duration(x), 1). In many practical applications, it is possible to carry out deductions over a knowledge base composed of rules like the one above by standard Prolog-style forward- and backward-chaining rule reasoning. With respect to the requirements we identified at the beginning of the section, however, we remark that the OMG proposed format for encoding BR as generic Horn rules is undecidable in its full expressivity, so we can expect only partial implementations of SBVR reasoners. Also, this reasoning is clearly not monotonic (see the next Section 4).
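As a concrete illustration of this kind of rule evaluation, the parsed EU-Rent rule can be checked against a handful of rental records with a few lines of conventional code; the records and field names below are invented for the example, and the sketch merely mimics the forward application of a single Horn rule.

# Illustrative only: checking the parsed EU-Rent rule
# "for each rental X, if X.class = 'C' and X.pickup <> X.dropoff
#  then days(X.duration) > 1" over invented rental records.
rentals = [
    {"id": "R1", "cls": "C", "pickup": "Milan", "dropoff": "Rome", "duration_days": 1},
    {"id": "R2", "cls": "C", "pickup": "Milan", "dropoff": "Milan", "duration_days": 1},
    {"id": "R3", "cls": "B", "pickup": "Milan", "dropoff": "Rome", "duration_days": 1},
]

def violates_min_duration(rental):
    """True when the rule body holds but the consequent does not."""
    body = rental["cls"] == "C" and rental["pickup"] != rental["dropoff"]
    head = rental["duration_days"] > 1
    return body and not head

violations = [r["id"] for r in rentals if violates_min_duration(r)]
print(violations)  # only R1 triggers the rule body and violates it -> ['R1']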
4 BR vs. Semantic Web-Style Knowledge Representation Semantic Web (SW) knowledge representation formats aim to integrate heterogeneous information sources by enabling decidable reasoning on metadata. Since individual Web-based systems are not expected to store an exhaustive “mirror” of external data
sources, Semantic Web languages (unlike FOL-based business rules) rely on monotonic reasoning. Intuitively, deductions that are made on data structures that are not exhaustively known to the system must not be contradicted by mere extension of the knowledge base (KB)3. For applications, the most important effect of taking the SW modeling approach (as opposed to the SBVR one) is that monotonic reasoning yields a stronger notion of negation (¬): if individual i is not known to be a member of class C (i.e., the assertion C(i) cannot be found in the KB), then it cannot be assumed to be a member of class ¬C. This requirement, generally referred to as the open-world assumption (OWA), has often been indicated as the correct interpretation of digital business models (see Requirement 3 in Section 3) because of the frequent under-specification of local business models and the possibly partial knowledge of remote data sources. In Sec. 6, concept Competitor will be defined as an example of an open-world entity that may be included in a business model to adjust its dynamic behavior by considering external data structures. Open-world, SW-oriented knowledge representation languages like OWL [17] are based on a different subset of FOL than business rules. Namely, they rely on Description Logics (DL) [1]. However, mature implementations of DL reasoners exist and are constantly developing new features. OWL reasoners are capable of processing data structures that cannot be expressed in rule-based systems, such as recursive definitions and individuals not explicitly defined in the KB. As an example, consider the following SBVR definition and the corresponding DL axiom4: a product is produced by a manufacturer and is distributed in a country
Product ≡ ∃producedBy.Manufacturer ⊓ ∃distributedIn.Country
If provided with the assertion Product(PROD-01), a DL reasoner will derive the following assertions with regard to individual PROD-01: ∃producedBy.Manufacturer(PROD-01)
(7)
∃distributedIn.Country(PROD-01)
(8)
Conversely, if individuals ITALY and COMP-01 are known to be related to PROD-01 as in the following assertions: distributedIn(PROD-01,ITALY) producedBy(PROD-01,COMP-01)
It is interesting to remark that, besides being different from BR semantics, monotonicity is also not the way mass market data storage applications, such as relational and object-oriented databases, interpret information when queried, e.g. by means of a SQL-like interface. Here we concatenate terms to obtain valid tokens, such as distributedIn and distinguish concept names with a leading uppercase letter.
a DL reasoner will derive, as a consequence of the definition above, the following assertion: Product(PROD-01)
(9)
The FOL counterpart of these derivations requires the biunivocal relationship to be split in two rules, because Horn clauses require the separation of the conditions triggering a rule (the antecedent or body) from the corresponding derivation (the consequent or head). Moreover, variables are assumed to be universally quantified; therefore it is not possible to achieve the exact semantics of the DL axiom above. Product(x) ← producedBy(x, y) ∧ Manufacturer(y) ∧ distributedIn(x, z) ∧ Country(z) Product(x) → producedBy(x, y) ∧ Manufacturer(y) ∧ distributedIn(x, z) ∧ Country(z)
Here only the first rule can be implemented, essentially because in Horn engines all variables in the consequent must also be present in the antecedent5. Consequently, only the implication in (9) can be derived using rule languages. While DL is technically a smaller FOL fragment than Horn logics, the above discussion shows that traditional production rules used by Horn rule engines are able to render only part of the semantics of DL-based formalisms. More importantly, rule systems share a weaker notion of negation (∼) according to which lack of knowledge is considered false (negative) knowledge, thus leading to a non-monotonic reasoning generally referred to as closed-world assumption (CWA): If individual i is not known to be member of class C (i.e., the assertion C(i) cannot be found in the KB), then it is assumed to be member of class ∼ C 6 . When modeling business domains, it should be possible to declare concepts and roles as closed (i.e., characterized by complete knowledge). Although in principle it would be possible to apply rule reasoning in order to evaluate a DL-based knowledge base in a closed-world fashion (e.g., by asserting negative facts on closed concepts), a more straightforward approach is the one of the so-called epistemic operators K and A. The former allows for evaluating DL concepts according to facts that are explicitly known to hold (i.e., have not been inferred by reasoning); as an example, the following concept definition identifies individuals whose filler of role producedBy is an individual that is known to be a Manufacturer instance: K-QUERY ≡ KproducedBy.KManufacturer
Unlike what happens in DL reasoners, which are capable of handling “hypothetical” individuals as fillers of roles producedBy and distributedIn. Note that we distinguish between classical negation as implemented by DL reasoners (¬) and the weaker notion implemented by rule systems (∼).
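To make the difference between the two notions of negation concrete, the toy query functions below contrast a closed-world and an open-world style of answering; the fact base and all names in it are invented for the example.

# Invented toy KB: only explicitly asserted class memberships are stored.
kb = {("Manufacturer", "COMP-01"), ("Country", "ITALY")}

def holds(concept, individual):
    return (concept, individual) in kb

def cwa_negation(concept, individual):
    # closed world: not asserted  =>  assumed false (weak negation ~)
    return not holds(concept, individual)

def owa_negation(concept, individual):
    # open world: without an explicit (or derivable) negative assertion
    # the answer is neither true nor false
    if holds(concept, individual):
        return False
    return "unknown"

print(cwa_negation("Manufacturer", "COMP-02"))  # True: ~Manufacturer(COMP-02)
print(owa_negation("Manufacturer", "COMP-02"))  # 'unknown' under OWA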
The epistemic operator K has been implemented (albeit only in queries, not as a concept constructor) in the Pellet reasoner [14]. In turn, the auto-epistemic operator A is related to the notion of negation-as-failure. Intuitively, AC corresponds to ¬ ∼ C. The following concept definition identifies individuals having as filler of property producedBy an individual that is assumed to be (i.e., is not explicitly asserted as not being) a Manufacturer instance. A-QUERY ≡ KproducedBy.AManufacturer
However, to the best of our knowledge, the autoepistemic operator A has not been implemented so far in any DL reasoner. While DL reasoner extensions look promising for their applications to business modeling, Horn rule reasoning remains useful to manage application-dependent data structures that need not be shared with the structural (DL-based) component of the KB. Also, when translating business rules into the corresponding logic representation, we may need to model constructs not expressible with DL alone. In fact, leaving aside the restriction to monadic and binary relations, it is very easy to derive from SBVR rules role concatenations that are not captured by the tree-shaped models that can be expressed with DL. As an example, the following SBVR rule configures a simple cyclic pattern that nonetheless requires a Horn rule for its definition: a direct product is a product that is produced by a manufacturer that is a reseller of the product
DirectProduct(x) ← Product(x) ∧ producedBy(x, y) ∧ Manufacturer(y) ∧ resellerOf(y, x)
5 Combining DL and Horn Rules As we have seen in Section 4, Horn rule engines do not entirely render the semantics of DL-based formalisms. On the other hand, when writing business rules we may need to model constructs which are not expressible with DL alone7. This suggests the idea of representing complex business systems using separate structural (i.e., DL-based) and relational (i.e., rule-based) components. However, it is difficult to obtain all the expected logic entailments by mere composition of structural and relational reasoning on the corresponding portions of the KB. To clarify this, consider a barebones structural component constituted by concept definitions Manufacturer and Distributor, together with the following assertion, expressing that individual K-MART is known to be either a Manufacturer or a Distributor (although the system cannot tell which one): (Manufacturer ⊔ Distributor)(K-MART)
Actually, the OWL 1.1 proposal extends the relational expressivity of the language to encompass role composition and additional features. We do not address the OWL 1.1 proposal in this paper as no mature software tools based on it are likely to be available in the short term.
Now we define concept Reseller in the relational component of the KB, as follows: a reseller is a manufacturer or a distributor Reseller(x) ← Manufacturer(x) Reseller(x) ← Distributor(x)
It is clear that it is not possible to derive the assertion Reseller(K-MART) by applying the rules to the structural component of the KB or the other way around. The above example shows that even simple implications are not necessarily preserved when dividing the KB into separate components. Furthermore, exploiting the full expressive power of Horn clauses easily leads to undecidability. In order to achieve sound and complete reasoning (see Requirement 2 of Section 3) on hybrid KBs, it is essential to monitor the flow of assertions between the two components by constraining the possible clauses that populate the relational component of the KB. This can be done by requiring variables referred to by clauses to be associated with at least one non-DL atom in the rule body. The approach proposed in CARIN [11] requires variables from role predicates to satisfy this constraint. Another formalism, DL+log [16], generalizes existing results (namely, AL-log and DL-safe rules [4]) and proposes a weaker safety condition by applying this requirement only to predicates in the rule consequent. Finally, MKNF [5] is a recent proposal integrating all the above formalisms and relies on CARIN's stronger notion of safety.
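As a rough illustration of what checking such a safety condition involves, consider the sketch below; the rule encoding (atoms as tuples with an is_dl flag, variables written as single uppercase letters) is invented for the illustration and is not taken from any of the cited systems.

# Invented encoding: a rule is (head_atoms, body_atoms); an atom is
# (predicate, args, is_dl), where is_dl marks predicates of the DL component
# and variables are single uppercase letters.
def variables(atoms):
    return {a for (_, args, _) in atoms for a in args if len(a) == 1 and a.isupper()}

def is_dl_safe(head, body):
    """Strong (CARIN / DL-safe style) condition: every variable of the rule
    must occur in at least one non-DL atom of the body."""
    covered = variables([atom for atom in body if not atom[2]])
    return variables(head) | variables(body) <= covered

# DirectProduct(X) <- resellerOf(Y, X), Product(X), producedBy(X, Y), Manufacturer(Y)
head = [("DirectProduct", ("X",), False)]
body = [("resellerOf", ("Y", "X"), False),   # non-DL (rule) predicate
        ("Product", ("X",), True),            # DL predicates
        ("producedBy", ("X", "Y"), True),
        ("Manufacturer", ("Y",), True)]
print(is_dl_safe(head, body))  # True: X and Y both occur in resellerOf

DL+log's weaker condition would only require the variables of the rule consequent to be covered, i.e. variables(head) <= covered in the sketch above.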
6 An Example of Hybrid Reasoning Let us now consider a hybrid knowledge base (HKB) composed of a structural component S, that is a theory in a subset of FOL (e.g. DL), and a rule component R (e.g. Datalog¬∨). In this example we follow the approach proposed in [16], where the rule component is constrained in its interaction with the structural component according to a safeness condition. The alphabet of predicates is A = AS ∪ AR, with AS ∩ AR = ∅, where p ∈ AS is a structural predicate and r ∈ AR is a rule predicate. A HKB H is a pair (S, R), where the following safeness condition holds. Each rule R has the following form:
p1(X1) ∨ ... ∨ pn(Xn) ← r1(Y1), ..., rn(Yn), s1(Z1), ..., sk(Zk), ∼u1(W1), ..., ∼un(Wn)
such that:
– each pi is a predicate of A;
– each ri, ui is a predicate of AR;
– each si is a predicate of AS;
– each variable occurring in R must occur in one of the ri's.
6.1 Hybrid Reasoning Let H be a KB where S defines the following predicates:
∀x PRODUCT(x) → ∃y. PRODUCED_BY(x, y) ∧ MANUFACTURER(y) [A1]
∀x PRODUCT(x) → ∃y. DISTRIBUTED_IN(x, y) ∧ COUNTRY(y) [A2]
∀x COMPETITOR(x) → ∃y. SELL_IN(x, y) ∧ TARGET_COUNTRY(y) [A3]
∀x TARGET_COUNTRY(x) → COUNTRY(x) [A4]
∀x MANUFACTURER(x) → RESELLER(x) [A5]
∀x DISTRIBUTOR(x) → RESELLER(x) [A6]
∀x IRRELEVANT_COUNTRY(x) → COUNTRY(x) [A7]
∀x IRRELEVANT_COUNTRY(x) → ¬TARGET_COUNTRY(x) [A8]
TARGET_COUNTRY(ITALY) [F1]
TARGET_COUNTRY(FRANCE) [F2]
TARGET_COUNTRY(SPAIN) [F3]
COUNTRY(HUNGARY) [F4]
MANUFACTURER(COMP01) [F5]
DISTRIBUTOR(COMP02) [F6]
DISTRIBUTOR(COMP03) [F7]
SELL_IN(COMP01, ITALY) [F8]
SELL_IN(COMP01, FRANCE) [F9]
SELL_IN(COMP03, HUNGARY) [F10]
PRODUCED_BY(PROD01, COMP01) [F11]
PRODUCED_BY(PROD03, COMP03) [F12]
Also, in H, R defines the following predicates:
DirectProduct(x) ← resellerOf(y, x), PRODUCT(x), MANUFACTURER(y), PRODUCED_BY(x, y) [R1]
DirectCompetitor(y) ← resellerOf(y, x), DirectProduct(x), COMPETITOR(y) [R2]
resellerOf(COMP01, PROD01) [F13]
resellerOf(COMP03, PROD03) [F14]
It can be easily verified that H satisfies the following ground assertions:
– COMPETITOR(COMP01). Based on axiom A3, selling in a target country is sufficient to be a Competitor.
– DirectProduct(PROD01). Based on axiom A1 together with facts F5 and F11, PROD01 is a Product; so rule R1 in conjunction with F13 satisfies this assertion.
– DirectProduct(PROD03). Based on A1 and F12, PROD03 is a Product, so rule R1 in conjunction with F14 satisfies this assertion. Note that it is not relevant that COMP03 is not a Manufacturer, because no assertion contradicts this possibility and, according to the open world semantics, a fact cannot be negated only on the basis of the lack of an assertion stating it.
– DirectCompetitor(COMP01). Since, by A3, COMP01 is a Competitor, rule R2 in conjunction with F13 and R1 satisfies this assertion.
On the contrary, the following assertion cannot be derived in H:
– ¬COMPETITOR(COMP03). Due to the open world semantics, a company selling in irrelevant countries (i.e., in countries that are outside our company's target) cannot be inferred not to be a Competitor, because the partial knowledge assumption does not justify considering the asserted facts as the only possible ones. Note that this semantics prevents deriving false information without justification from asserted facts, which can be important in a strategic context.
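For readers who want to replay the rule-component side of this example, a naive forward-chaining sketch is given below. It is only an approximation: the facts marked as coming from S stand in for conclusions of the structural (DL) component, and, being a closed-world evaluator, the sketch reproduces DirectProduct(PROD01) and DirectCompetitor(COMP01) but not DirectProduct(PROD03), whose derivation in H rests on the open-world treatment of MANUFACTURER discussed above.

# Naive forward chaining over the rule component R; the facts marked "from S"
# stand in for conclusions of the DL reasoner (an assumption of this sketch,
# not an implementation of the structural component S).
facts = {
    ("Product", "PROD01"), ("Product", "PROD03"),            # from S
    ("Manufacturer", "COMP01"), ("Competitor", "COMP01"),     # from S (F5, A3)
    ("producedBy", "PROD01", "COMP01"), ("producedBy", "PROD03", "COMP03"),
    ("resellerOf", "COMP01", "PROD01"), ("resellerOf", "COMP03", "PROD03"),  # F13, F14
}

def apply_rules(facts):
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        new = set()
        for (_, y, x) in {f for f in derived if f[0] == "resellerOf"}:
            # R1: DirectProduct(x) <- resellerOf(y,x), Product(x), Manufacturer(y), producedBy(x,y)
            if {("Product", x), ("Manufacturer", y), ("producedBy", x, y)} <= derived:
                new.add(("DirectProduct", x))
            # R2: DirectCompetitor(y) <- resellerOf(y,x), DirectProduct(x), Competitor(y)
            if {("DirectProduct", x), ("Competitor", y)} <= derived:
                new.add(("DirectCompetitor", y))
        if new - derived:
            derived |= new
            changed = True
    return derived

print(sorted(f for f in apply_rules(facts) if f[0].startswith("Direct")))
# -> [('DirectCompetitor', 'COMP01'), ('DirectProduct', 'PROD01')]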
7 Related Work Both the business and the software engineering communities have produced much work on Business Process Management. Providing a complete overview of the business management approach to process representation is clearly outside the scope of this paper. The reader interested in a business vision of BPM, seen as a tool for innovative management of virtual organizations, is referred to [8]. Software engineering research has long been trying to add business process models to the palette of the system designer. For instance, the UML Profile for Automated Business Processes allows BPEL4WS processes to be modeled using ordinary UML tools. The BPEL4WS representation can be used to automatically generate web services artifacts (BPEL, WSDL, XSD) from a UML model meeting the profile, as described for instance in [7]. More recent work by Christian Huemer and Birgit Hofreiter [9] represents the semantics of UN/CEFACT's Modeling Methodology (UMM) business processes using BPEL, showing that BPEL is appropriate to capture business collaborations. An important aspect of inter-organizational business processes is governing different evolution paces due to distributed administration. Work by Stefanie Rinderle, Andreas Wombacher and Manfred Reichert [15] addresses one important challenge which has not been adequately addressed by other approaches, i.e. independent evolution of business process choreographies. If modifications are applied to processes in an uncontrolled manner, inconsistencies or errors might ensue. In particular, modifications of part of a shared process may affect the implementation of the other part of the process by other partners. The authors sketch a framework that allows process engineers to detect how changes to process fragments affect the whole process, relying on change semantics. Other works from the software engineering research community focus on the alignment between business modeling and software development or execution. As discussed in [10], the general idea is that requirements models are built from organizational goals. Some of these works, such as for instance [13] or [6], are strongly related to UML and do not clearly specify some organizational aspects such as the technology that implements business processes or the relationships among the different organizational views. Our own Knowledge Management group at the Software Engineering and Software Architectures Research (SESAR) lab, at the University of Milan's Department of Information Technology (http://ra.crema.unimi.it), has been carrying out work in business process representation and monitoring in the framework of the TEKNE and SecureSCM projects, as in [3] or [2].
8 Conclusions Over the past 30 years, many techniques have grown up around the development, management, and execution of business rules. In parallel, DL-based formalisms have been developed as a part of the Semantic Web proposal. Both technologies seem to have their own strengths and weaknesses. In this paper we discussed the state of the art of business modeling formalisms, outlining the potential role of hybrid representations in business domain descriptions. We provided preliminary but hopefully useful answers to the following questions:
– Are there ways to combine the two technologies to make them more beneficial than when they are used alone?
– Are there enhancements that can be done to further leverage the rules technologies?
These questions are being addressed by a collective interdisciplinary effort of the computer science and business management communities. At SESAR lab, we are developing answers to these questions as a part of our current research work. Acknowledgements. This work was partly funded by the Italian Ministry of Research under FIRB contract RBNE05FKZ2_004 (project TEKNE), and by the European Commission under project SecureSCM. The authors wish to thank Claudio Pizzi for his valuable comments on modal logic reasoning. Also, many thanks are due to Schahram Dustdar and Azzelarabe Taleb-Bendiab, Ernesto Damiani's fellow keynote speakers at WEBIST, and to the conference chair Joaquim Filipe for creating a unique atmosphere for exchanging and discussing research ideas.
References 1. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, Cambridge (2003) 2. Ceravolo, P., Damiani, E., Viviani, M.: Bottom-up extraction and trust-based refinement of ontology metadata. IEEE Transactions on Knowledge and Data Engineering 19(2), 149–163 (2007) 3. Ceravolo, P., Fugazza, C., Leida, M.: Modeling semantics of business rules. In: Proceedings of the Inaugural IEEE International Conference on Digital Ecosystems and Technologies (IEEE-DEST) (February 2007) 4. Donini, F.M., Lenzerini, M., Nardi, D., Schaerf, A.: AL-log: Integrating Datalog and Description Logics. J. of Intelligent Information Systems 10(3), 227–252 (1998) 5. Donini, F.M., Nardi, D., Rosati, R.: Description logics of minimal knowledge and negation as failure. ACM Trans. Comput. Logic 3(2), 177–225 (2002) 6. Eriksson, H.-E., Penker, M.: Business Modeling With UML: Business Patterns at Work. John Wiley & Sons, Inc., New York (1998) 7. Gardner, T.: UML modelling of automated business processes with a mapping to BPEL4WS. In: Proceedings of the First European Workshop on Object Oriented Systems (2003) 8. Fingar, P., Smith, H.: Business Process Management: The Third Wave. Meghan-Kiffer Press (2003)
9. Hofreiter, B., Huemer, C.: Transforming UMM business collaboration models to BPEL. In: Meersman, R., Tari, Z., Corsaro, A. (eds.) OTM-WS 2004. LNCS, vol. 3292, pp. 507–519. Springer, Heidelberg (2004) 10. Diaz, J.S., De la Vara Gonzalez, J.L.: Business process-driven requirements engineering: a goal-based approach. In: Proceedings of the 8th Workshop on Business Process Modeling, Development, and Support (2007) 11. Levy, A.Y., Rousset, M.-C.: CARIN: A representation language combining horn rules and description logics. In: European Conference on Artificial Intelligence, pp. 323–327 (1996) 12. Linehan, M.: Semantics in model-driven business design. In: Proceedings of 2nd International Semantic Web Policy Workshop (2006) 13. Marshall, C.: Enterprise modeling with UML: designing successful software through business analysis. Addison-Wesley Longman Ltd., Essex (2000) 14. Parsia, B., Sivrin, E., Grove, M., Alford, R.: Pellet OWL Reasoner (2003) 15. Rinderle, S., Wombacher, A., Reichert, M.: On the controlled evolution of process choreographies. In: ICDE 2006: Proceedings of the 22nd International Conference on Data Engineering (ICDE 2006), Washington, DC, USA, p. 124. IEEE Computer Society, Los Alamitos (2006) 16. Rosati, R.: DL+log: Tight Integration of Description Logics and Disjunctive Datalog. In: KR 2006, pp. 68–78 (2006) 17. W3C. OWL Web Ontology Language Overview, http://www.w3.org/TR/owl-features/
Part I
Internet Technology
Semi-automated Content Zoning of Spam Emails Claudine Brucks, Michael Hilker, Christoph Schommer, Cynthia Wagner, and Ralph Weires University of Luxembourg MINE Research Group, ILIAS Laboratory Dept. of Computer Science and Communication 6, Rue Richard Coudenhove-Kalergi, 1359 Luxembourg, Luxembourg Abstract. With the use of electronic mail as a communicative medium, the occurrence of spam emails has increased and remains at a high level; and although spam can be detected by spam filtering, an element of risk of continuous network overload remains. To add insult to injury, spammers keep adding novel, clever and astute techniques, which currently leads to a neck-and-neck race between protection and intrusion. Novel approaches are constantly needed to enhance spam filters towards a more efficient and effective prevention. In this respect, we argue for a zoning of emails that classifies a text into regions, where each region bears a meaning. Here, we use this technique on spam emails to identify these zones and to evaluate whether an email is spam. We introduce content zoning and support our approach with a prototypical implementation including first results. Keywords: Content Zoning, Spam Email Detection, Text Analysis and Statistics, Human-Computer Interaction.
1 Introduction Originally, the word spam referred to a meat product and was an abbreviation of Shoulder of Pork and hAM (SPAM), and later of Spiced Pork And Meat. The meaning of the word changed with a sketch by Monty Python's comedy team, in which the dialogue is spammed by a group of Vikings repeating a song containing ". . . spam, lovely spam. . . ". Due to this excessive use, the word spam has become an Internet term for any kind of unsolicited messages, junk emails or bulk emails. The distribution of spam emails is understood as spamming, and the sender is named a spammer. While speaking about spam emails, a multitude of different spam types exist. Unsolicited commercial mails are usually dubious emails that offer products at special price rates, like the advertisement of pharmaceuticals. Chain letters comprise texts with rumors or news that instruct the recipient to forward them. Malicious forms are worms and viruses, which cause damage to computers, and phishing emails that may cause personal damage - mostly of a financial kind or by requesting sensitive data. Both normal and spam emails consist of several parts, for example the email header and the email content (see Figure 1): the header includes the technical information, protocols, or spam flags, the subject line, and the sender/recipient information. The email content includes the technical information (source code), a readable text (content), including images, and the signature. In recent years, the importance of electronic messaging has been growing, and the question of how to control spam emails arises more and more.
Fig. 1. Source code of an electronic mail: Zone 1 represents the HEADER of the email including SUBJECT (marked in green). Zone 3 is the CONTENT, Zone 4 the SIGNATURE.
Many solutions already exist, and improvements to avoid this kind of spam messaging are mostly of a similar nature. In [6], honeypot accounts are created and positioned to receive spam. Spam filter engines often rely on a rule-based knowledge structure [1,7,10]. Intrusion Detection Systems (IDS) have been proposed [5] that simulate the natural body's system control by generating different kinds of cells and cell communication. In this respect, each delivered email packet, which is sent from one network computer to another, is controlled: if a certain (spam) pattern is present, then the packet is deleted from the network to minimize the risk of intrusion. Besides that, spam filters exist that detect spam directly, for example by white lists and black lists that contain accepted and denied email addresses, or by fixed patterns that spammers are expected to use. Another alternative is to analyse the header and the content of emails to classify them as spam or non-spam; natural language filter systems, for instance, remove emails that are written in a foreign language or that match user-defined rules. Indeed, spammers continuously introduce novel techniques in order to defeat spam filters and to hide spam emails. For example, spam emails can contain text hidden between complex HTML tags or replaced characters, e.g. zeros instead of O's - a challenge for commonly used spam filters. Another example is the recent trend in spam emails of using scrambled words, which means that text - correctly understood by human beings and identified by machines
with access to natural language dictionaries (thesauri) - cannot be processed seriously by normal spam filters: Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. . . 1 In this article, we follow the idea of content zoning as it is shown in Figure 1 to improve the quality of spam filtering. With content zoning, a text becomes composed of different regions (zones) where each region receives a meaning. For example, a zone SUBJECT may describe the subject of an email, OFFER to the advertised product itself. Therefore, we introduce our content zoning solution and demonstrate it with COZO, an interactive content zoning system [3], on two disjunctive domains: finance and pharmacy. We conclude the paper with an outlook.
2 Content Zoning In [4,9], Argumentative Zoning - which uses annotated zones to describe and explain scientific documents - is presented. Its aim is to provide an intelligent library towards the generation of summaries of scientific articles and the detection of intellectual ownership of documents, e.g. for better document management or for plagiarism detection. 2.1 Annotated Regions in Texts Our approach is based on similar principles to analyse spam emails, as a means to separate spam from non-spam mails. A part of this work can also be seen in [3], which deals with the manual analysis of up-to-date spam emails to check whether they contain significant patterns in the subject line and the content section. We focus on discovering content structures to describe common patterns in spam emails. With this, we intend to gain information that can be used to perform automatic recognition of zones in further work. Automatically derived information from spam emails can then support a classification of spam emails while separating them from non-spam emails. The aim of the content zoning technique is to provide a useful function or add-on to normal spam filters for the detection of patterns in the content of spam emails, and to answer the question whether spam emails have characteristic content structures. The separation into zones is done according to the given structure of the documents (e.g. different paragraphs) and especially according to the semantic meaning of the document parts. In the end, the zones are supposed to represent the various semantic units of the document. Besides the plain division of a document into zones, additional analysis can also be performed to gain further information about the separate zones. Rather simple information to extract could be statistics about the size and layout of zones, but a more sophisticated analysis of their text content is also possible. The latter can lead to an extraction of the semantic content and purpose of a zone. Such kind of information can be used for various purposes,
http://www.mrc-cbu.cam.ac.uk/∼mattd/Cmabrigde/ - November 2007.
such as comparing documents to each other (regarding their analysed zone structure and content). In the following, we refer to such information about the zones as zone variables. 2.2 Identification of Spam Emails We focus on a test set of spam emails to which the process of zoning has been applied; the test set mainly covers the two categories of pharmaceutical (i.e. drug offerings) and financial email texts (more precisely, stock spam emails). In the following, we give a more detailed description of how the zones for these two types of spam emails are defined and which zone variables are used. A zone can be defined as a region in a document that is allocated to a specific piece of information in this document. For example, in a text document with a price offer, this price offer can be annotated by a tag; a logical name for that zone could be price offer. Of course, in a large document, zones can reoccur multiple times. Figure 1 illustrates an example of a pharmaceutical spam email; each zone is represented by a color and an annotated text (HEADER, SUBJECT, CONTENT, SIGNATURE). Financial emails typically present themselves as texts of large content, where specific categories like the price, the offerings or the expectation of profits have their place, including a very detailed and accurate description. Pharmaceutical texts, on the other hand, are texts of shorter size; their content includes note-like texts with direct offerings and less descriptive text. Sometimes, pharmaceutical spam emails are composed of an image, or the text is written completely in white color. Generally, while analyzing financial and pharmaceutical spam emails and characterizing their content, it has become obvious that pharmaceutical spam emails have a lower number of zones than financial spam emails [3]. Looking more closely at pharmaceutical and financial spam emails, we have taken the most recurrent text rubrics as the default zones. Since irrelevant text is often included in spam emails, some content may remain uncovered by zoning. The annotation tags have logical names, so that no supplementary information is needed to understand what a zone is about:
– INFORMATION: The introductory zone that informs the reader about the intention of the email. As an example, the text motivates some kind of lack or absence.
– OFFER: This zone represents the offer itself, for example a product or a service. It is accompanied by specific advantages and auspicious expectations.
– PRICE: In the PRICE zone, the effective price is given. In many cases, it is a sub-zone of the zone OFFER.
– LINK: This zone acts as a connecting link to external web sources; when clicked, the user is automatically connected.
However, a definite and overall definition of the zones is hard to give, because content zoning refers more to informational positions in the text; it does not analyze content via text mining functionalities. We therefore introduce zone variables to identify parts of the text. A zone variable is defined as a parameter that describes specific information of a zone. By describing the zones with zone variables, more structured information can be
extracted from the zone content. The zone variables are one of the most important factors in content zoning, because they contribute to the statistical evaluation and finally to the detection of similarities in spam emails. In this work, two kinds of variables have been defined. Zone-independent variables are defined as parameters which can be applied to every zone. Examples of zone-independent variables are the position of the zone in the text, the length of the zone expressed in number of characters, the number of words contained in the zone, or the most frequently occurring word (after stopword elimination). Zone-dependent variables, on the other hand, are defined as parameters which are specific to a certain type of zone and hence can only be applied to zones of that type. Mostly, these variables represent semantic parameters, for example the top-level domain for link zones (e.g. .com, .net, .lu), telephone numbers for address zones, the value for price zones, the currency for price zones (€, £, $), and so on.
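As an illustration of how such zone-independent variables can be computed, a short sketch is given below; the sample zone text and the tiny stopword list are invented for the example.

from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "is"}   # toy list for the example

def zone_variables(zone_text, start_offset):
    """Compute simple zone-independent variables for one annotated zone."""
    words = [w.lower() for w in zone_text.split()]
    content_words = [w for w in words if w not in STOPWORDS]
    most_common = Counter(content_words).most_common(1)
    return {
        "position": start_offset,            # where the zone starts in the email
        "length_chars": len(zone_text),      # length in characters
        "word_count": len(words),            # number of words
        "top_word": most_common[0][0] if most_common else None,
    }

print(zone_variables("Cheap meds cheap prices order now", 5))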
3 The COZO Explorer The COZO Explorer (see Figure 2) integrates a content-wide spam thesaurus and a graphical user interface for zoning texts. Furthermore, it supports automatic image sorting. 3.1 A Spam Thesaurus The spam thesaurus refers to the content of electronic mails and follows the idea that transforming different writing styles into a consistent form of spam keywords may
Fig. 2. The COZO Explorer consists of two panels: the reader panel (left) and the options panel (right) with the predefined 18 zones. A zone can be marked and colorized, the result is shown in the reader panel.
improve the early detection of spam and advance spam filtering. In COZO, the thesaurus is a list that is continuously expanded by reading spam email streams. It consists of the most frequently occurring words of all extracted subject lines plus their standardized meaning. Currently, the thesaurus has a keyword list of more than one thousand entries, excluding their flections, which vary a lot: for example, the word antibiotics has no flections, whereas the word Viagra has more than 120. 3.2 The Graphical User Interface The most commonly used email formats (HTML, EML, etc.) are supported; irrelevant information, like the email header or HTML tags, can automatically be removed on demand, so that only the content remains. The user can then zone the content, and the results can be exported and stored in XML. In Figure 2, the zones are selected for a given text. To facilitate the zoning, 18 zones have been predefined. However, if there is no convenient zone for a certain email region, the user has the possibility to create his own specific zones or to leave that region unzoned. Each zone is visualised by a different colour; for each zone, the zone variables, i.e. statistical information like the position of the zone in the email, the density of the zone, and the number of characters in the zone, are computed. The output - the definition of zones, the calculated variables, and a picture of the zoned email with colours - is also stored in files for further analysis, e.g. picture matching. At this point in time, the zoning must be done manually.
4 Selected Results In the following, we present some results from a set of spam emails captured over a period of nine months. The data set contains emails from various domains (e.g. credit offers, movie downloads, porn emails), but only spam emails from pharmacy and finance have been tested. 4.1 Zoning Pharmaceutical Spam Emails The mechanism of zoning is first applied to spam emails pre-classified as pharmaceutical emails. The test set comprises 100 spam emails; we have principally taken the four general zones (see Section 2.2), slightly adapted to the mentioned domain. For pharmacy, the OFFER zone represents the part where the pharmaceutical product is promoted, LINK the position where the linked homepage is placed; the zone INFORMATION refers to pharmaceutical product descriptions, PRICE to the indication of the price: interestingly, this zone occurs rather seldom. While zoning the spam emails, diverse observations can be made, for example that 64% of the spam emails share a sequential pattern SUBJECT - OFFER - LINK - INFORMATION, whereas 16% share the sequential pattern SUBJECT - OFFER - LINK; 8% have no pattern at all. For the tests, no supplementary zones have been generated; the initial 18 zones mostly cover each email's content. The zones of the emails from the data set have been colorized as in Figure 3. After having computed the zoning, the statistics are calculated and stored in an XML file; an illustrative fragment of such an export is sketched below.
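The tag names in the following per-zone entry are hypothetical and the values are merely exemplary; the fragment only sketches what such an export could look like.

<zone name="OFFER">
  <length>399</length>
  <position>5</position>
  <density>80</density>
</zone>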
Fig. 3. Content zoning of a spam email from finance: the zones are colorized, marked and annotated by the rectangles. The pattern ZONE 3 → ZONE 4 is continuously present.
We have obtained different results, for example that pharmaceutical spam emails quite often share a similar zoning. Consequently, the order and the size of the zones are similar, as are the calculated variables of each zone. 4.2 Zoning Financial Spam Emails For financial spam emails, the number of emails in the test set has been set to 100 as well. Here, we have principally established a list of the following eight zones:
– ZONE 1 (NAME): Complete name of the company whose shares are offered.
– ZONE 2 (DESCRIPTION): Description of the company's offering.
– ZONE 3 (CURRENT): Current price of the stock.
– ZONE 4 (PROMISED): Promised price, i.e. the speculative price that the share will reach in a few days' time.
– ZONE 5 (PROMOTION): Promotion, i.e. arguments to motivate the reader to buy the shares.
– ZONE 6 (FACTS): Facts about the company, i.e. a short historical review of the company and its products.
– ZONE 7 (NEWS): News about the company and its activities.
– ZONE 8 (FORWARD): Forward-looking statement, i.e. a clear statement that no responsibility is taken over.
These different zones have been set up manually, and at the beginning 20% of the financial emails have been analyzed to discover any kind of sequential zoning pattern. As financial, especially stock market, spam emails are assumed to be similar in their structure, nine different zones could be determined quite easily and then matched to the rest of the stock mails. Surprisingly, diverse patterns of zone sequences have been discovered, for example that ZONE 3 is always followed by ZONE 4 (see Figure 3)2.
Fig. 4. Coincidental arrangement of zoned spam emails before the image sort has taken place; financial and pharmaceutical spam emails are disordered
4.3 Automated Image Sorting As another alternative for spam email classification, we have downloaded and deployed the freeware software product ImageSorter3 to inspect collections of zoned spam emails and to sort them according to their colorized zones. As we have observed, financial and pharmaceutical zoned spam emails differ in the number, occurrence and size of their zones. We take advantage of this and follow the idea of accomplishing classification by computing the similarity between images; the similarity between zoned spam emails is thus geared to the color gradient of the images. Figures 4 and 5 show some zoned financial and zoned pharmaceutical spam emails before and after an image sort has been applied. Zoned pharmaceutical spam emails are placed at the top; they are separated from zoned financial spam emails, which are placed at the bottom.
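A rough sketch of the kind of comparison such a sorting relies on is given below, using color histograms as a simple stand-in for the color gradient; the file names are invented and the Pillow imaging library is assumed to be available.

from PIL import Image  # Pillow is assumed to be installed

def histogram(path, size=(64, 64)):
    """RGB color histogram of a (zoned-email) image, normalised to sum 1."""
    img = Image.open(path).convert("RGB").resize(size)
    h = img.histogram()                      # 3 x 256 bins for R, G, B
    total = float(sum(h))
    return [v / total for v in h]

def similarity(path_a, path_b):
    """Histogram intersection: 1.0 for identical color distributions."""
    ha, hb = histogram(path_a), histogram(path_b)
    return sum(min(a, b) for a, b in zip(ha, hb))

# Invented file names: images of colorized (zoned) emails
print(similarity("zoned_pharma_01.png", "zoned_finance_01.png"))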
The presented rectangles are just for a better understanding; COZO colorizes the zones but does not place a rectangle around them. ImageSorter is a freeware software that is currently being developed at the Center of Human-Computer Interaction at the FHTW Berlin.
Fig. 5. Sorted zoned spam emails where zoned financial spam emails are put to the bottom, pharmaceutical to the top
5 Conclusions There are many advanced possibilities to classify spam emails using content zoning, for example to further automate the content zoning process, including the discovery of annotated labels, or to define a stronger similarity metric based on the calculated zone statistics (zone ordering, zone sizes, etc.). The expansion of the idea of image sorting in order to realize the comparison between electronic mails is another aim: the actual comparison is of course not limited to basic (exact) picture matching but must include more robust image sorting techniques as well. With this, we are able to perform a comparison of a new email to our existing spam email types (financial and pharmaceutical here) to decide whether or not the email is spam. Acknowledgments. This work is funded by the University of Luxembourg within the project TRIAS. We thank all the colleagues and friends of the MINE research group within the Intelligent and Adaptive Systems Laboratory (ILIAS).
References 1. Androutsopoulos, I., Koutsias, J., Chandrinos, K., Paliouras, G., Spyropoulos, C.: An evaluation of naive bayesian anti-spam filtering (2000) 2. Bayerl, S., Bollinger, T., Schommer, C.: Applying Models with Scoring. In: Proceedings on 3rd International Conference on Data Mining Methods and Databases for Engineering, Finance and Other Fields, Bologna, Italy, September 25-27 (2002)
3. Brucks, C., Wagner, C.: Spam analysis for network protection. University of Luxembourg (2006) 4. Feltrim, V., Teufel, S., Graças Nunes, G., Aluisio, S.: Argumentative Zoning applied to Critiquing Novices' Scientific Abstracts. In: Computing Attitude and Affect in Text: Theory and Applications, pp. 233–245. Springer, Dordrecht (2005) 5. Hilker, M., Schommer, C.: SANA security analysis in internet traffic through artificial immune systems. In: Proceedings of the Trustworthy Software Workshop, Saarbruecken, Germany (2006) 6. Lambert, A.: Analysis of Spam. Master Thesis, University of Dublin (2003) 7. Rigoutsos, I., Huynh, T.: Chung-Kwei: A pattern-discovery-based system for the automatic identification of unsolicited e-mail messages (spam). In: Proc. of the Conference on Email and Anti-Spam (CEAS) (2004) 8. Sun, Q., Schommer, C., Lang, A.: Integration of Manual and Automatic Text Classification - A Text Categorization for Emails and Spams. In: Biundo, S., Frühwirth, T., Palm, G. (eds.) KI 2004. LNCS (LNAI), vol. 3238, pp. 156–167. Springer, Heidelberg (2004) 9. Teufel, S.: Argumentative zoning: Information extraction from scientific text. PhD Thesis, University of Edinburgh, England (1999) 10. Wittel, G., Wu, S.: On attacking statistical spam filters. In: Proc. of the Conference on Email and Anti-Spam (CEAS) (2004)
Security and Business Risks from Early Design of Web-Based Systems
Rattikorn Hewett
Department of Computer Science, Texas Tech University, Abilene TX 79601, U.S.A.
[email protected]
Abstract. This paper presents a systematic approach for the automated assessment of security and business risks of web-based systems at the early design stage. The approach combines risk concepts in reliability engineering with heuristics using characteristics of software and hardware deployment design to estimate security and business risks of the system to be developed. It provides a mechanism that can help locate high-risk software components. We discuss limitations of the approach and give an illustration in an industrial engineering and business-to-business domain using a case study of a web-based material requirements planning system for a manufacturing enterprise.
Keywords: Software risk assessment, software security, business risks, software architecture, internet-based system.
1 Introduction
Web application systems play important roles in providing functions and business services for many organizations. The scope and complexity of web applications have grown significantly from small-scale information dissemination to large-scale sophisticated systems that drive services, collaboration and business on a global scale. Unlike conventional software systems, web-based systems grow and change rapidly. They also tend to be living systems having long life cycles with no specific releases and thus, require continuous development and maintenance [7]. Designing web applications involves not only technical but also social and political factors of the system to be built. Poorly developed web applications that continue to expand could cause many serious consequences. Because they are ubiquitous and deal with a huge and diverse population of users, they must be tolerant to errors, adverse interactions and malicious attacks. Even if they perform perfectly, they can still put the business of the enterprise deploying them at risk if they do not support sound business logic. Effective web-based system design involves both technical and managerial decisions that require a balanced integration of economic flexibility, sound engineering judgment and organizational support. A thorough understanding of risk can help provide the connection between technical concepts, user values and values created by the design. The ability to quickly estimate security and business risks early in the system development life cycle can help software developers and managers make risk-acceptable,
cost-effective and informed decisions. In this paper, the terms security risk and business risk refer to the possibilities of adverse consequences, due to software behavior, to resource ownerships, and to business objectives of the system in which the software is used, respectively. Various techniques and guidelines for security risk analysis are available [10,16] and activities to increase security assurance of software are integrated throughout the software development life cycle (e.g., security requirements, code analysis, security and penetration testing). However, most developers perform security risk assessments of software systems at a system level after the systems have been implemented. For business challenges, efforts on incorporating business processes into enterprise applications are mostly confined to the requirements stage in order to create a design [4,6,13,15]. However, analyses of business implications of the design products are seldom addressed structurally. Software vulnerabilities can occur in code (e.g., buffer overflow, SQL injection) and designs (e.g., internal controls that attackers can compromise). They can even exist in components that are not involved with security procedures. As many software-intensive infrastructure systems increasingly utilize COTS (Commercial Off-The-Shelf) components for rapid application development, the ability to analyze security risks at a software component level becomes essential. As web applications are implemented on standard platforms, they inherit similar security weaknesses that are published and available publicly [1]. These known vulnerability databases are used widely as parts of basic principles for guidelines, standards, and vulnerability scanning tools such as Nessus [2]. Currently, like other information systems, most web-based systems rely on standard checklists, firewall rules and scanners for system security risk testing. One drawback of these tools is their lack of ability to assess vulnerabilities specific to custom web applications [17]. Most traditional risk analysis techniques are difficult to apply directly to modern software design, as they do not necessarily address potential vulnerabilities and threats at a component or environment level [18]. Particularly, web-based information systems are designed on multi-tier architectures spanning over multiple boundaries of trust. A vulnerability of a component depends on both its platform (e.g., J2EE on Tomcat/Apache/Linux, C# on a Windows .NET server) and environment (e.g., exposed network such as LAN versus secure DMZ). To adapt risk methodology to web application design, security risk analysis should consider risk factors from different design levels including software components, hardware platforms and network environments. Risk analysis provides assessment of risk to serve as a basis for decision-making to prevent, mitigate, control and manage risks of loss. This paper presents a systematic approach for automated security and business risk assessment of a web application system from its component-based designs. By combining risk concepts in reliability engineering with heuristics using characteristics of software design and hardware platforms, the security and business risks of a proposed system can be estimated. In contrast with rigorous risk analysis during the operational phase, our approach provides a simple early estimate of risks that can help locate design components that are at risk.
Thus, mitigating strategies (e.g., allocating resources for rigorous testing, or redesigning the high-risk component) can follow.
Section 2 describes related work. Section 3 defines relevant security terminologies and risk concepts, and describes the proposed approach for security risk assessment. Section 4 illustrates our approach to a web application in industrial engineering. Section 5 describes how the proposed approach can be adapted for assessment of business risks. Conclusion and future work are discussed in Section 6.
2 Related Work
Much research in web information systems is concerned with techniques and methodologies for engineering web-based systems effectively by adapting principles from software engineering to web application design and development [3,12,7]. Risk methodologies have been employed to assess safety and performance associated with software systems in various application domains including industrial engineering, business and space science [14,5,19]. Risk methodology including FMEA (Failure Mode and Effect Analysis) has been applied to gain understanding about usability of web information systems [20]. Unlike our approach, none of these works addresses risks in security contexts. The identification of likelihoods in the risk methodologies of these web information systems is mostly subjective rather than analytical and objective. Our work is most similar to a methodology and framework proposed by Yacoub et al. [19]. However, instead of assessing the reliability of software components, we assess the security and business risks associated with software. Thus, their risk models contain different attributes than ours. Reliability risks are estimated from the complexity of software components, whereas security and business risks are based on characteristics of the software design as well as the hardware platforms to be deployed.
3 Security Risk Analysis
In this section, we define basic risk analysis concepts and relevant security elements in Section 3.1. Section 3.2 describes the proposed methodology.
3.1 Preliminaries
Risk is defined as the product of the probability of occurrence of an undesired event and its consequences [9]. Risk analysis is a process that aims to assess, manage and communicate risk information to assist decision-making. Risk assessment involves identification of adverse situations (or hazards or threats) and quantifying their risks by determining all likelihoods of their causes and the severity of their impacts. The risk measures obtained can assist in a cost-benefit study for resource planning to prevent, eliminate or mitigate these undesired situations. These are activities of risk management. Many techniques for risk assessment exist (e.g., FMEA, HAZOP (Hazard and Operability), FTA (Fault Tree Analysis) [8]), but they have been mainly applied to hardware rather than software artifacts. Risks associated with software (or software risks) are risks that account for undesired consequences of software behavior in the system in which the software is used.
Security and safety are two related perspectives of value judgments of acceptable risks. Security is concerned with the protection of ownerships of assets and privacy, while safety is more concerned with the protection of human lives and environments, factors that cannot be evaluated easily in monetary terms. Examples also include some assets in web applications, e.g., trust and a company's reputation. Sources of threats include malicious (e.g., deliberate attack, code injection) as well as non-malicious actions (e.g., natural disaster, human error, software system failure) that may lead to a security breach. Security attacks, whether intentionally targeted or not, can originate from internal and external threats depending on the degree of exposure of assets and how vulnerable the system is to attack. A system's vulnerability refers to its weakness (i.e., a specific attribute/environment) that makes a system more prone to attack or makes an attack more likely to succeed. Exploited vulnerabilities can compromise security. In finance, risks are estimated in terms of expected loss per year, ALE (Annualized Loss Expectancy), which is a product of the annual rate of occurrence and single loss expectancy. The former reflects probability, and the latter consequences, over a period of time. In security risk analysis, risks are estimated from adapted ALE formulae. Basic probability factors are related to exposure degrees, vulnerabilities and threats, whereas basic consequence factors are measured by asset values. Thus, for a given threat/adverse action, its security risk can be roughly estimated by:
Risk = threat × exposure × vulnerability × severity
where threat actually measures the threat likelihood. The first three contribute to probability measures, whereas the last measures the consequence. In this paper we omit threat, which can be easily integrated into our analysis at a later stage of the design. While the ALE concept works well for general financial and security risks, it may not be applicable for cases like terrorist attacks, which are rare or may never occur.
3.2 Methodology
Quantitative risk assessment methodologies, such as the ALE-based approach, provide useful mechanisms to facilitate risk ranking for decision-making in risk management. However, they are often criticised for producing results that are too general (statistic-like) to be useful for specific decision-making, and for producing results that are not credible enough as they rely heavily on subjective value assignments. The proposed methodology given below has two objectives: (1) to estimate security risks of a web-based software system from its early design configuration, and (2) to estimate probability measures more objectively. The latter is achieved by using heuristics based on relevant design attributes (more details later). Because web applications are typically designed with multiple architectural layers and trust boundaries, our risk methodology must incorporate relevant attributes from multiple design levels (e.g., component, network and hardware platforms). Figure 1 shows the basic steps of the methodology. The methodology requires two inputs: a software design and a hardware deployment design. We assume that the software design is expressed in terms of an extended activity
Fig. 1. Proposed risk methodology
or workflow diagram, where the software components of the web application and use case actors are represented as well as activities in use cases [3]. The software design includes all use cases, each of which may have one or more scenarios. The hardware deployment design provides information about hardware platforms including hosts or servers, their attributes (e.g., services, operating environments), and network attributes (e.g., IP addresses for network configuration, and LAN, WAN or DMZ for connection). Security scanning tools use this information to identify vulnerabilities of each host in the system. The hardware deployment plan also indicates software components designated to use, or run on, corresponding hosts. Examples of a workflow diagram and a hardware deployment diagram are shown in Figures 3 and 4, respectively. As shown in Figure 1, the first two steps use software design characteristics as heuristic measures of probability factors of risk. Step 1) computes interaction rates among software components for each scenario based on the design workflow. For simplicity, we assume that each outgoing interaction from the same component is equally likely (when insights about interaction likelihoods are available, appropriate weights can be applied). Step 2) constructs a graph whose nodes represent software components. Transition rates among components are computed by taking scenario likelihoods into account when combining interaction rates from each scenario. In security risk contexts, scenario likelihoods are heuristic measures of probabilities of potential threats (e.g., when scenarios involve adverse actions), whereas interaction rates are heuristics that partially measure degrees of component exposure. Step 3) uses characteristics of the hardware deployment design as a heuristic measure of an additional probability factor of security risk. We first determine vulnerability sources of each host. This can be done by a table lookup procedure on publicly available vulnerability databases based on relevant host and network attributes (e.g., IP addresses, services). Examples of vulnerability sources include Apache Chunked-Code software on web servers, operating environments Windows XP SP2 and
JRE 1.4.2 for running JAVA programs on application servers, and Oracle and TNS Listener software for database servers. Such vulnerabilities and their sources can be easily identified by scanning tools when the hardware platforms to be deployed are already in operation. The next step is to measure the vulnerabilities of each software component. Each component may require one or more hosts. For example, a component that runs on one application server may also need to query data on another database server. From the deployment design, we can determine all hosts required by each component. Let Hi be the set of hosts required by software component i and vh the number of vulnerabilities of host h. For component i, the number of vulnerabilities vi and its normalization to a vulnerability ratio Vi are estimated as:
vi = Σh∈Hi vh  and  Vi = vi / Σi vi .
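As an illustration of this Step 3 computation, consider the following sketch (our own; the component/host names and vulnerability counts below are hypothetical, not the values of Table 1):

def vulnerability_ratios(hosts_of, vulns_of):
    # hosts_of: component -> set of hosts it requires (from the deployment plan)
    # vulns_of: host -> number of known vulnerabilities (from a scanner or a
    #           lookup in a public vulnerability database)
    v = {c: sum(vulns_of[h] for h in hosts) for c, hosts in hosts_of.items()}
    total = sum(v.values())
    return {c: vc / total for c, vc in v.items()}

# Hypothetical example: CI uses the web server W and database D2, MI uses A1.
print(vulnerability_ratios({"CI": {"W", "D2"}, "MI": {"A1"}},
                           {"W": 3, "D2": 4, "A1": 2}))
# -> {'CI': 0.777..., 'MI': 0.222...}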
Here vulnerability ratios of software components are heuristic measures of likelihood factors of security risks. The more vulnerable a software component is, the more likely it is that the system security can be compromised. Step 4) combines all probability factors obtained in Steps 2) and 3) to estimate the likelihood of each component being attacked (or of a security breach). Here we interpret this likelihood as the chance that the component under investigation is the first to be attacked in the system. Thus, all components and links along the path to reach the component are secured (unexposed to attack and with unexploited vulnerabilities). The likelihood of a node being secured is estimated by subtracting its exposure degree, estimated by its weaknesses (i.e., vulnerability ratio), from one.
Fig. 2. Likelihoods of security breach in components
We elaborate this step by a small example of a dependency graph shown in Figure 2, where nodes S, A, B, C and D represent components of a software design. A double-lined arrow indicates that S is an entry point to software system execution. Each component and transition is annotated with a corresponding vulnerability ratio (in bold) and a transition rate, respectively. As shown in Figure 2, the likelihood of S to breach security is determined by the vulnerability ratio of S itself. The chance of A being attacked (via S → A) is estimated by its chance to be reached without attacks (i.e., S being secured and S transiting to A) and its exposure factor (weaknesses). The former is estimated by the likelihoods of S being secured and S transiting to A, whereas the latter is estimated by A's vulnerability ratio. Thus, the chance of A being attacked is estimated
by (1 − 0.3) × 0.2 × 0.1 = 0.014. Similarly, the likelihood estimate of security breach in B is 0.224. The chance of C being attacked is estimated by the sum of the chances of C being attacked via S → A → C (i.e., (1 − 0.3) × 0.2 × (1 − 0.1) × 1 × 0.1 = 0.0126) and via S → B → C (i.e., (1 − 0.3) × 0.8 × (1 − 0.4) × 0.1 × 0.1 = 0.00336). This gives 0.01596. In general, the estimated probability along an access path is analogous to the propagation of probabilistic effects in a Bayesian network using Bayes' rule [11]. The transition rate plays a similar role to a conditional probability of a destination being attacked given a source being reached. For simplicity, we do not explicitly represent access privilege. We assume that any activity in the same component can be equally accessed. However, not every component deployed on the same host can be accessed equally. We also assume that network connections between hosts do not have vulnerabilities. Such vulnerabilities can be easily incorporated in our proposed approach. Note that since all probability factors (transition rate, vulnerability ratio) are no greater than one, the longer the path to reach a component is, the harder it is to attack (i.e., its chance of being attacked decreases). For a path that contains cycles, we consider the maximum likelihood that can be obtained. Thus, the heuristic function we use for estimating likelihoods conforms to intuition in security contexts. Step 5), shown in Figure 1, estimates consequences of each adverse action in each software component using approaches similar to HAZOP or FMEA. In this paper we focus on probability estimation. Thus, we apply a typical severity analysis technique, and categorize and quantify component consequences according to security standards [9]. By using the maximum severity obtained in each component, we can compute an estimate of the security risk of each component of the system being designed. For a risk estimate of the overall system, we multiply the likelihood of the system terminating safely by the vulnerability ratio of the overall system components and by the system severity. We estimate the system severity by taking the majority severity category of the components in the overall system. Section 4 illustrates our methodology in a case study of web application design.
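Before turning to the case study, the worked example of Figure 2 can be reproduced with a short sketch. This is our own illustration of Steps 2-4; the graph below encodes only the values quoted above, and cyclic revisits are skipped, which corresponds to taking the worst case for paths containing cycles.

def breach_likelihoods(entry, vuln, rate):
    # vuln: component -> vulnerability ratio; rate: (source, target) -> transition rate.
    # A component's likelihood sums, over acyclic paths from the entry, the product of
    # (1 - vuln[u]) * rate[(u, w)] along the path, times the component's own ratio.
    like = {n: 0.0 for n in vuln}
    def walk(node, reach, seen):
        like[node] += reach * vuln[node]
        for (u, w), r in rate.items():
            if u == node and w not in seen:
                walk(w, reach * (1 - vuln[node]) * r, seen | {w})
    walk(entry, 1.0, {entry})
    return like

vuln = {"S": 0.3, "A": 0.1, "B": 0.4, "C": 0.1}
rate = {("S", "A"): 0.2, ("S", "B"): 0.8, ("A", "C"): 1.0, ("B", "C"): 0.1}
print(breach_likelihoods("S", vuln, rate))
# -> S: 0.3, A: 0.014, B: 0.224, C: 0.01596, matching the numbers above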
4 A Case Study
This section presents an illustration in an industrial engineering and business-to-business (B2B) domain with a case study, adapted from [12], of a web-based material requirements planning system for a manufacturing enterprise.
4.1 Web Application Design
A web-based material requirements planning (MRP) system takes a customer's product order through the B2B web front and performs material requirements planning. The goal of the system is to respond quickly and effectively to changing requirements to minimize delay in production and product delivery. The system automatically updates orders and predicts the daily planned orders, which may be overwritten by a manager. The daily planned orders are sent to a job shop that simulates real-time scheduling and production. At the same time, the system determines materials required for the daily
planned orders and sends orders of raw materials to appropriate suppliers. Job shop simulator reports on finished parts/products to the MRP system, where the manager may adjust some initial parameters (e.g., lead time, lot size) for scheduling production plan to meet customer order requirements. A supplier can view material order status and acknowledge the receipt of the order. The web application design consists of seven software components: MRP, MPS (Master Production Schedule), BOM (Bill of Material), JOB (Job shop simulator), CI (Customer Interface), MI (Manager Interface), and SI (Supplier Interface). MRP validates logins, records product orders and determines customer’s billing. MPS schedules appropriate daily production plan from a given product demand (e.g., from customer order) and other input parameters. The details of MPS are beyond the scope of this paper. BOM determines parts required for each product to be produced and sends order for raw materials to suppliers. JOB is a job shop component that simulates real-time scheduling and production in a manufacturing plant. CI, MI and SI facilitate web interfaces for different users. Customers and suppliers are required to login via a web front, while a manager can do remote login to MRP.
Fig. 3. Workflow diagram of a customer use case
To focus our illustration on risk methodology, we simplify a B2B process by assuming that all payment between a customer and a manufacturer, or a manufacturer and a supplier are off-line transactions. We represent the design in an extended workflow diagram as described in Section 3.2. To make it easy to understand our analysis approach, we present the workflow in three use cases, each of which has one scenario. The three use cases are customer orders products, manager adjusts demands/requirements, and supplier acknowledges material order. We assume that the likelihood of occurrence of each scenario is 0.4, 0.2, and 0.4, respectively. Figure 3 shows a workflow diagram of a process to handle customer orders for products. We assume customers have established business relationships with the manufacturing enterprise and therefore MRP is restricted for use with these customers only. Thus, customers and suppliers are required to login via a web front. The workflow starts at
a dark circle and ends at a white circle. It represents both sequential (by directed link) and concurrent (indicated by double bar) activities among different components. For example, after determining the parts required for the products to be produced, BOM sends the results to the job shop to simulate production of the product orders and concurrently places orders for raw materials with the appropriate suppliers. The workflow diagram is extended to include the software components responsible for the functions/executions of these activities. The components are shown in oval shapes on the right of the figure. The arrows between components indicate the interactions among them.
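Steps 1 and 2 of the methodology (Section 3.2) can be sketched for workflows like the one in Fig. 3 as follows. This is our own illustration: it counts the interactions observed in each scenario, treats every outgoing interaction of a component as equally likely, and combines the per-scenario rate tables with the scenario likelihoods as a weighted sum, which is one natural reading of the description given earlier.

from collections import Counter, defaultdict

def interaction_rates(interactions):
    # interactions: list of (source, target) component interactions in one scenario.
    # Rate of an edge = number of occurrences / total outgoing interactions of its source.
    out = Counter(src for src, _ in interactions)
    rates = defaultdict(float)
    for src, dst in interactions:
        rates[(src, dst)] += 1 / out[src]
    return dict(rates)

def transition_rates(scenario_rates, scenario_likelihoods):
    # Combine the per-scenario tables (M1, M2, ...) weighted by the scenario likelihoods.
    combined = defaultdict(float)
    for rates, w in zip(scenario_rates, scenario_likelihoods):
        for edge, r in rates.items():
            combined[edge] += w * r
    return dict(combined)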
Fig. 4. Hardware architecture and deployment plan
The second part of the design is a hardware deployment plan. Figure 4 shows the design of the hardware platforms required. Like many web applications, the MRP system has a three-tier architecture. The presentation layer requires a web server, W. The business logic layer contains three application servers: A1 − A3. Finally, the data layer employs four database servers: D1 − D4, each of which facilitates data retrieval from a different database, as summarized at the bottom of the figure. The servers/hosts are connected via LAN with a firewall between the web server W and the LAN access to the three application servers. The deployment plan annotates each server with the software components that rely on its services, as shown in the square box on top of each server. For example, as shown in Figure 4, a manager can log in remotely to MRP via MI, both of which run on application server A1. MRP and MI allow the manager to review production status and adjust product demands. Since MRP is also responsible for login validation, it needs to retrieve password data. Thus, unlike other application servers, A1 needs to be connected with D1. The next section describes the analysis for assessing the security risks associated with each of these software components from a given design.
4.2 Estimating Security Risks
The first step of the proposed risk methodology is to calculate interaction rates from each scenario of the web application design. As shown in Figure 3, MRP has one interaction with MPS but a total of four outgoing interactions (two from MRP to a terminal state, one to CI and one to MPS). Thus, the interaction rate from MRP to MPS is 1/4, as shown in table M1 of Figure 5. M1 represents interaction rates among components obtained from the "customer orders product" scenario, where T is a terminal state in the
Fig. 5. Tables of interaction rates in each scenario
Table 1. Vulnerability Measures
workflow. Similarly, we can obtain tables M2 and M3 representing interaction rates in the manager and supplier scenarios, respectively. Step 2) constructs a dependency graph of software components. For each pair of components, we compute a transition rate using the corresponding interaction rates obtained from each scenario (M1, M2 and M3) and the given likelihood of each scenario, as described in Figure 1. In Step 3), we determine vulnerabilities of hardware platforms from the given deployment plan shown in Figure 4 by applying a scanning tool (if the required host and network configurations of the system are already in operation), or (otherwise) a table lookup procedure on public vulnerability databases. Table 1(a) shows vulnerability sources and their counts for our application deployment plan. As described earlier, the table lookup for vulnerability sources is based on the software and environments required on each host to provide its services. Next, to find the vulnerability ratio Vi of component i, we determine the host services required by the component. For example, the customer interface CI requires the web server W for login and ordering products, and also requires query retrievals of product information from the database D2 for browsing. Similar results can be obtained for the other components, as summarized in Table 1(b). The second to last column shows the number of vulnerabilities associated with each component. This is estimated by summing the numbers of vulnerabilities of the hosts required by the component. The last column shows the vulnerability ratios, with a total ratio of one for the overall system vulnerability.
Fig. 6. A dependency graph of software components
Table 2. Severity analysis of the web application design
Figure 2. For example, the chance of CI being attacked (via S → CI) is 1×0.4×0.09 = 0.036. Note that the path to CI from MRP (via S → CI → MRP → CI) is ignored since it produces a lower likelihood and is thus not the worst case as desired. The chance of MRP being attacked is computed as a sum of likelihoods of MRP being reached from CI, MI, SI, and JOB, i.e., 0.06 + 0.02 + 0.04 + 0.01 = 0.14. A summary of component security breach likelihoods is given in the first row of Table 4. In Step 5), we estimate consequences of adverse actions in each component by categorizing consequences into four categories: CT (catastrophic: 0.95), CR (critical: 0.75), MG (marginal: 0.5), and MN (minor: 0.25) [9]. The estimated severity for each component is shown in Table 2. By selecting the maximum severity obtained in each component to be the component severity, and applying the standard quantification to it, we can obtain the risk of each component as shown in Table 3. Here MRP is the highest-risk component with 34.31%
Table 3. Risk analysis results
risk, whereas MI is the lowest with 2.64% risk. The risk of the overall system is computed by multiplying the likelihood of reaching the terminating point T with its vulnerability and the severity of the overall system. Here we use CR, which is the severity category of the majority of the components. The result in Table 3 shows that the security risk of the overall system is about 50%, which can be used as a baseline for comparison with alternatives. These risks give relative measures for assisting decision-making on security options from different designs.
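The last two steps can be summarized in a few lines. This is our own sketch; the severity quantification follows the four categories quoted above, and expressing the results as relative percentages, as Table 3 appears to do, is our assumption rather than something spelled out in the text.

SEVERITY = {"CT": 0.95, "CR": 0.75, "MG": 0.50, "MN": 0.25}

def component_risks(breach_likelihood, worst_category):
    # Step 4 likelihoods combined with the Step 5 worst-case severity per component.
    return {c: breach_likelihood[c] * SEVERITY[worst_category[c]]
            for c in breach_likelihood}

def as_relative_percentages(risks):
    # One possible reading of the percentages in Table 3: each component's share
    # of the total estimated risk.
    total = sum(risks.values())
    return {c: 100.0 * r / total for c, r in risks.items()}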
5 Business Risk Analysis
Although both safety and security can impact the business success of the organization deploying the software, business risks have additional characteristics that deserve to be considered separately. Despite many technology advances, business application designers are faced with organizational challenges in customizing the application to fit in with existing business processes and rules in order to ensure effective coordination of work across the enterprise. This section illustrates how the proposed risk estimation framework of Section 3 can be applied to analyzing the business implications of the design. The methodology for business risks can be simplified from the methodology shown in Figure 1 by omitting Step 3, which computes system vulnerabilities, and eliminating the vulnerability factor in Step 4. However, a more detailed severity analysis is necessary. In particular, we are interested in analyzing the consequences of a business failure. The analysis can help identify countermeasures for design improvement. This is accomplished by distinguishing different types of data-flow terminating points of a given component-based software design. For example, consider the web application design described in Section 4 of our case study. Figure 7 illustrates three possible types of exits:
• Quit before enter (q1): the customer either did not have login information at hand or had no time, or did not like the early login compulsion (especially for first-time visitors). From a business perspective, one countermeasure to prevent losing potential customers is to remove the early login requirement.
• Quit after product order (q2): the customer did not have sufficient credit, or changed their decision (e.g., due to the products or the cost after adding the shipment charge). One strategy to counter failures at this stage could be to give customers more delivery and shipment options, including discounts on large orders. Alternatively, the inventory could be running out of items. In such a case, a countermeasure can be marketing research and increasing the inventory or supply of high-demand items.
Fig. 7. Determining business implications from software data flow design
• Quit after payment or after material order (q3): this is a normal exit that meets the business objectives, i.e., to obtain transactions from customers and to allow a manager to order the supply materials required for the products.
Consequently, the separation of different types of failures yields a slightly different dependency graph of software components compared to our previous security analysis. The new dependency graph is shown on the right of Figure 7. The likelihoods of the software components can be estimated in a straightforward manner. For severity analysis, we analyze the consequences of failures. Table 4 gives an example of failure consequences in business contexts for our example, where q1 is not considered a failure state. Severity analysis is organization dependent and, in practice, severity can be quantified from experience (e.g., using expected loss per year). If such data is not available, a common practice is to categorize severity into four standard categories: catastrophic (CT), critical (CR), marginal (MG), and minor (MN), each of which is respectively quantified by 0.95, 0.75, 0.50 and 0.25. For example, as shown in Table 4, quitting after payment in q2 results in the loss of one order transaction, which is critical but not as severe as the loss of opportunity for numerous new customers when quitting before entering in q1. The rest of the proposed steps can be applied similarly to obtain business risks for not meeting business objectives as designed. These are reflected in quitting in q1 and q2.
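A sketch of this simplified business-risk variant follows (our own illustration; the exit likelihoods and severity categories are to be supplied by the designer, e.g., from the transition rates and from an analysis like Table 4):

SEVERITY = {"CT": 0.95, "CR": 0.75, "MG": 0.50, "MN": 0.25}

def business_risks(exit_likelihood, exit_category):
    # exit_likelihood: exit point ("q1", "q2", ...) -> likelihood of ending there,
    #                  computed from transition rates alone (no vulnerability factor).
    # exit_category:   exit point -> severity category from the business severity analysis.
    return {q: exit_likelihood[q] * SEVERITY[exit_category[q]] for q in exit_likelihood}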
Table 4. Severity of different stages of business failures
6 Concluding Remarks
This paper attempts to provide a general framework for security risk analysis of web application software design. We present a systematic methodology for automated preliminary risk assessment. Our focus here is not to estimate precise risk measures but to provide a quick and early rough estimate of security risks by means of heuristics based on characteristics of software design and hardware platforms, to help locate high-risk components for further rigorous testing. Our risk model is far from complete, partly due to inherent limitations of the design phase. Although we illustrate our approach on web applications, it is general enough to apply to other software-intensive systems and to extend the framework and the methodology to include additional security elements in the model. The approach has some limitations. Besides obvious requirements on knowledge about system usage scenarios and component-based design, the approach is limited by the insufficient data available in the early phases of the software life cycle to provide precise estimation of the system security. However, the risk analysis methodology suggested in this paper can be used to estimate security risks at an early stage, and in a systematic way. Finally, the approach does not consider failure (to satisfy security criteria) dependencies between components. Thus, the risks of a component connected to attacked components are determined in the same way as those of components that are not. We plan to extend this framework to address this issue by providing mechanisms that take user roles, access privileges and vulnerability exploits into account. Additional future work includes refinement of explicit severity analysis and incorporation of attack scenarios in the risk models.
Acknowledgements. This research is partially supported by the Sheldon Foundation. Thanks to Phongphun Kidsanayothin and Meinhard Peters for their help with the previous version of this paper, parts of which were presented at the 3rd Conference on Web Information Systems and Technologies.
References
1. Bugtraq (October 2006), www.securityfocus.com/archive/1
2. Nessus Vulnerability Scanner (October 2006), www.nessus.org
3. Barna, P., Frasincar, F., Houben, G.-J.: A workflow-driven design of web information systems. In: ICWE 2006: Proceedings of the 6th International Conference on Web Engineering, pp. 321–328. ACM, New York (2006)
4. Bleistein, S.J., Cox, K., Verner, J.: Requirements engineering for e-business systems: Integrating Jackson problem diagrams with goal modeling and BPM. In: 11th Asia Pacific Software Engineering Conference, Busan, Korea (2004)
5. Cortellessa, V., Appukkutty, K., Guedem, A.R., Elnaggar, R., Goseva-Popstojanova, K., Hassan, A., Abdelmoez, W., Ammar, H.H.: Model-based performance risk analysis. IEEE Trans. Softw. Eng. 31(1), 3–20 (2005)
6. Csertan, G., Pataricza, A., Harang, P., Doban, O., Biros, G., Dancsecz, A., Friedler, F.: BPM based robust E-Business application development (2002)
7. Ginige, A., Murugesan, S.: Web engineering: An introduction. Multimedia 8, 14–18 (2001)
8. Haimes, Y.Y.: Risk Modeling, Assessment, and Management. Wiley-IEEE (2004)
9. ISO: Risk Management - Vocabulary - Guidelines for Use in Standards. ISO Copyright Office, Geneva (2002)
10. Landoll, D.J.: The Security Risk Assessment Handbook: A Complete Guide for Performing. CRC Press, Boca Raton (2006)
11. Pearl, J.: Graphical models for probabilistic and causal reasoning. In: Handbook of Defeasible Reasoning and Uncertainty Management Systems, vol. 1, pp. 367–389 (1998)
12. Qiang, L., Khong, T.C., San, W.Y., Jianguo, W., Choy, C.: A web-based material requirements planning integrated application. In: EDOC 2001: Proceedings of the 5th IEEE International Conference on Enterprise Distributed Object Computing, Washington, DC, USA, p. 14. IEEE Computer Society Press, Los Alamitos (2001)
13. Russell, N., van der Aalst, W.M.P., ter Hofstede, A.H.M., Wohed, P.: On the suitability of UML 2.0 activity diagrams for business process modelling, pp. 95–104. Australian Computer Society, Inc., Hobart (2006)
14. Shahrokhi, M., Bernard, A.: Risk assessment/prevention in industrial design processes. In: 2004 IEEE International Conference on Systems, Man and Cybernetics, vol. 3, pp. 2592–2598 (2004)
15. Singh, I., Stearns, B., Johnson, M.: Designing enterprise applications with the J2EE platform, p. 417. Addison-Wesley Longman Publishing Co., Inc., Amsterdam (2002)
16. Stoneburner, G., Goguen, A., Feringa, A.: Risk management guide for information technology systems. NIST Special Publication 800-30 (2002)
17. van der Walt, C.: Assessing Internet Security Risk, Part 4: Custom Web Applications, securityfocus.com (October 2002)
18. Verdon, D., McGraw, G.: Risk analysis in software design. Security & Privacy Magazine 2, 79–84 (2004)
19. Yacoub, S.M., Cukic, B., Ammar, H.H.: Scenario-based reliability analysis of component-based software. In: ISSRE 1999: Proceedings of the 10th International Symposium on Software Reliability Engineering, Washington, DC, USA, p. 22. IEEE Computer Society Press, Los Alamitos (1999)
20. Zhang, Y., Zhu, H., Greenwood, S., Huo, Q.: Quality modelling for web-based information systems. In: FTDCS 2001: Proceedings of the 8th IEEE Workshop on Future Trends of Distributed Computing Systems, Washington, DC, USA, p. 41. IEEE Computer Society, Los Alamitos (2001)
Designing Decentralized Service Compositions: Challenges and Solutions
Ustun Yildiz and Claude Godart
INRIA-LORIA, BP 239, Campus Scientifique, F-54506 Vandœuvre-lès-Nancy, France
{yildiz,godart}@loria.fr
Abstract. This paper reports a new approach to the decentralized execution of service compositions. The motivation of our work is to provide a systematic solution for the advanced configuration of P2P service interactions when they are involved in a composition. In order to do so, we consider a service composition as a centralized workflow specification and derive corresponding cooperating processes in such a way that the dependencies of the centralized specification are implemented as P2P interactions between the underlying services that execute the derived processes. More precisely, we present the most important issues of the deriving operation and propose corresponding solutions that run counter to naive intuition.
1 Introduction
Service-oriented architectures are emerging as a new standardized way to design and implement business processes within and across enterprise boundaries. Basic XML-based service technologies (e.g., SOAP [1], WSDL [2]) provide simple but powerful means to model, discover and access software applications over the Web, analogous to well-known concepts of traditional RPC-like and component-oriented computing. The term composition is used to describe the design and implementation issues of the combination of distinct services into a coherent whole in order to fulfill sophisticated tasks that cannot be fulfilled by a single service. For developing business processes based on service compositions, there are two essential initiatives under the banners of Orchestration and Choreography. The orchestration approach considers executable processes (e.g., WS-BPEL [3] specifications) that specify the dependencies of composed services through a single execution unit, analogous to centralized workflow management [4]. In contrast to orchestration, the choreography approach (championed by WS-CDL [5]) is more decentralized in nature. It considers a global declarative composition that captures the interactions in which composed services can involve each other. Global choreographies are used to verify the consistency of composed services that work in a collaborative manner. As in application software modelling, different composition approaches are good for different things. Orchestration-based compositions, for example, are quite useful for control and monitoring purposes as they govern compositions from a centralized site, but are less useful for data-intensive compositions due to scalability limitations. Choreography-based compositions are useful in elucidating P2P interactions and parallelism, but are more
difficult to establish and control. The focus of this paper is on the design and implementation of decentralized service compositions that bridge the gap between these two approaches. Intuitively speaking, establishing peer-to-peer (P2P) interconnections between composed services is a major requirement in the management of business processes [6] [7]. However, the efficient configuration of service interactions is a difficult problem. With respect to the two essential approaches, this difficulty has multiple reasons. Following the orchestration approach, composed services are coordinated from a single execution unit and they are isolated from each other. This prevents their P2P interactions. Following the choreography approach, services can establish P2P interactions. However, their interactions may not correspond to those required by process designers or other contextual criteria. Decentralized workflow management has always been appealing for the conception and implementation of sophisticated configurations between process participants [8] [9] [10]. The principal reason that prevents their widespread use is their architectural assumptions, which make them unreasonable in the web context. Following the approach of decentralized workflow management, we developed a new approach to the decentralized execution of service compositions. This approach is based on deriving cooperating processes from a workflow-like centralized specification of a composition. Derived processes are deployed on composed services to implement the dependencies of the centralized specification as P2P interactions between the underlying services. In this paper, we discuss the most important issues of the deriving operation. First, we present the process specifications for which the deriving operation is intuitive. Next, we consider sophisticated specifications that combine control, data and conversational aspects, for most of which there is no intuitive deriving operation. Furthermore, we present our approach for deriving processes from the latter. The remainder of this paper is organized as follows. Section 2 reviews the related work and presents the background of our approach. Section 3 presents the formal concepts that we employ. Next, Section 4 is the core of our contribution, while the last section concludes.
2 Background
Figure 1 illustrates the overview of our approach. The left-hand side describes a composition as a centralized process specification. In this kind of implementation, services are isolated from each other. A centralized workflow engine executes the composition specification. Consequently, the messages that correspond to the fulfillment of control and data dependencies of the composition flow back and forth through the centralized workflow engine. The right-hand side of the same figure illustrates the execution of the same composition with the derived fragments. In contrast to existing solutions, this approach is reasonable and inexpensive in many ways. Intuitively, a composition specification that includes service interactions can be modeled by an organization and can be easily received and executed by another one, even over a different execution engine. This is a secondary outcome of standardization efforts, as they provide a common understanding of service compositions. However, the mobility of standard process models or process instances between different organizations and tools has not been handled from a decentralization point of view.
Fig. 1. Comparison of centralized and decentralized service compositions
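A deliberately naive sketch of the derivation idea illustrated in Fig. 1 follows. It is our own illustration, not the paper's algorithm: every activity is assumed to be assigned to the service that executes it, and every control or data dependency that crosses a service boundary is rewritten into a matching send/receive pair. The later sections of the paper deal precisely with the specifications for which such a straightforward split is not sufficient.

from collections import defaultdict

def derive_fragments(assigned_to, control_edges, data_edges):
    # assigned_to:   activity -> service executing it (e.g., {"a0": "s1", "a1": "s2"})
    # control_edges: list of (ai, aj) pairs of the centralized specification
    # data_edges:    list of (ai, aj, d) triples: data element d produced by ai, used by aj
    fragments = defaultdict(list)
    for a, b in control_edges:
        sa, sb = assigned_to[a], assigned_to[b]
        if sa == sb:
            fragments[sa].append(("control", a, b))
        else:                                   # dependency becomes a P2P interaction
            fragments[sa].append(("notify", a, "to", sb))
            fragments[sb].append(("wait", b, "from", sa))
    for a, b, d in data_edges:
        sa, sb = assigned_to[a], assigned_to[b]
        if sa == sb:
            fragments[sa].append(("data", a, b, d))
        else:
            fragments[sa].append(("send", d, "to", sb))
            fragments[sb].append(("receive", d, "from", sa))
    return dict(fragments)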
The idea of a decentralized execution setting is not new. Issues related to the decentralized execution of workflows were already considered very important in the relevant research literature a decade ago [11]. If we consider previous works that are not related to service-oriented paradigms and technologies, most of them were conceived to prevent the overloading of a central coordinator component. Their basic proposition is to migrate process instances from an overloaded coordinator to a more available one according to some metrics [12], or to execute sub-partitions of a centralized specification on distinct coordinators [13]. Although efficient, these approaches do not consider P2P interactions of workflow elements. The implementation of decentralization is relatively new for service-oriented applications. A simple but crucial observation is that some of the recent approaches deal with decentralization by using centralized services or additional software layers, which can be a strong hypothesis in the Web context, rather than reasoning about decentralization with decentralized processes [9]. Some recent works provide several different process partitioning techniques similar to ours. For example, [14] focuses on the partitioning of variable assignment activities of BPEL processes. However, they do not discuss how they deal with the elimination of dead paths (Dead Path Elimination [15]). The works presented in [16] [17] are similar, but limited to the decentralization of control flow. Our work is also relevant to traditional program partitioning for distributed computing. However, there are important differences. First, the aim of a decentralized execution is not to parallelize the execution, as the business logic of the centralized composition must be preserved. From this point of view, decentralized service composition is much more closely related to P2P computing than to program parallelization. The motivation of our approach highlights the problems of arriving at a definition of an accurate operation that derives cooperating processes to be executed by composed services (i.e., how can a control or data dependency of the centralized specification be reestablished and preserved in a decentralized composition?).
3 Formal Considerations
This section describes the formal underpinnings of our approach, which requires rigorous semantics on modeled processes. It is important to clarify that we do not intend to
propose a new process language or extend an existing one. The presented concepts are applicable to a variety of process modeling languages and execution engines. A process can be represented as a message-passing program modeled by a directed acyclic graph [15]. We assume that the centralized composition is modeled as a workflow. At this point, we assume that workflow models are structured processes [18]. A structured process abstracts the control flow of a process as properly nested split/join pairs of control connectors (OR-split, OR-join, AND-split, AND-join).

Definition 1 (Process). A process P is a tuple (A, C, Ec, Ed, D) where A is the set of activities, C is the set of control connectors, Ec ⊆ A × C × A is the set of control edges, Ed ⊆ (A × A) × D is the set of data edges and D is the set of data elements.

A process P has a unique start activity, denoted by as, that has no predecessors, and a unique final activity, denoted by af, with no successors. A control edge from an activity ai to another activity aj means that aj cannot be executed until ai has reached a certain execution state (termination by default). A data edge exists between two activities if a data element is defined by one activity and referenced by another without an interleaving definition of the same data element. For example, a data element received from a service si through the execution of an activity ai can be used as the input value of an activity aj that consists of an interaction with a service sj. In this case, there is a data edge between ai and aj. It should be noted that our process model does not consider control edges between control connectors. We associate functions source, target: Ec ∪ Ed → A that return the source and target activities of control and data edges. With respect to control flow we can also define a partial order, denoted by

An example document containing the elements <e>kl</e> and <e>ez</e> is presented as a tree in Fig. 1. The nodes are identified using their preorder and postorder numbers, which encode a lot of structural information¹. The tree traversals in XPath are based on 12 axes, which are presented in Table 1. In simple terms, an XPath query can be thought of as a series of location steps of the form /axis::nodetest[predicate] which start from a context node - initially the root of the tree - and select a set of related nodes specified by the axis. A node test can be used to restrict the name or the type of the selected nodes. An additional predicate can be used to filter the resulting node set further. The location step n/child::c[child::*], for example, selects all children of the context node n which are named "c" and have one or more child nodes. As a shorthand for the axes child, descendant-or-self, and attribute, one can use /, //, and /@, respectively.

¹ In preorder traversal, a node is visited before its subtrees are recursively traversed from left to right, and in postorder traversal, a node is visited after its subtrees have been recursively traversed from left to right.
Efficient Queries on XML Data through Partitioning
Fig. 1. An XML tree (the document <b><c d="y"/><c d="y"><e>kl</e></c><c><e>ez</e></c></b>, with every node labeled by its preorder and postorder numbers; for example, the root b is node (1, 10) and the e element containing "kl" is node (6, 5))

Table 1. XPath axes and their semantics

Axis                  Semantics of n/Axis
parent                Parent of n.
child                 Children of n, no attribute nodes.
ancestor              Transitive closure of parent.
descendant            Transitive closure of child, no attribute nodes.
ancestor-or-self      Like ancestor, plus n.
descendant-or-self    Like descendant, plus n, no attribute nodes.
preceding             Nodes preceding n, no ancestors or attribute nodes.
following             Nodes following n, no descendants or attribute nodes.
preceding-sibling     Preceding siblings of n, no attribute nodes.
following-sibling     Following siblings of n, no attribute nodes.
attribute             Attribute nodes of n.
self                  Node n.
Using XPath, it is also possible to set conditions on the string values of the nodes. The XPath query //record="John Scofield Groove Elation Blue Note", for instance, selects all element nodes with label "record" for which the value of all text node descendants concatenated in document order matches "John Scofield Groove Elation Blue Note". Notice also that the result of an XPath query is the concatenation of the parts of the document corresponding to the result nodes. In our example case, query //c//*, for example, would result in no less than <e>kl<e>ez<e>kl <e>ez.
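The pre-/postorder numbering of Fig. 1 can be reproduced with a short sketch (our own illustration; it walks the element tree in document order and numbers elements, attribute nodes and non-empty text nodes exactly as in the figure):

import xml.etree.ElementTree as ET
from itertools import count

def number_nodes(root):
    # Assign (preorder, postorder) numbers to every node of the tree: elements,
    # attribute nodes and non-empty text nodes, in document order.
    pre, post, table = count(1), count(1), []
    def expand(e):
        kids = [('@%s="%s"' % (k, v), []) for k, v in e.attrib.items()]
        if e.text and e.text.strip():
            kids.append(('"%s"' % e.text.strip(), []))
        for c in e:
            kids.append((c.tag, expand(c)))
            if c.tail and c.tail.strip():
                kids.append(('"%s"' % c.tail.strip(), []))
        return kids
    def visit(label, children):
        p = next(pre)
        for lab, kids in children:
            visit(lab, kids)
        table.append((p, next(post), label))
    visit(root.tag, expand(root))
    return sorted(table)

doc = ET.fromstring('<b><c d="y"/><c d="y"><e>kl</e></c><c><e>ez</e></c></b>')
for pre_no, post_no, label in number_nodes(doc):
    print(pre_no, post_no, label)
# 1 10 b, 2 2 c, 3 1 @d="y", 4 6 c, 5 3 @d="y", 6 5 e, 7 4 "kl", 8 9 c, 9 8 e, 10 7 "ez"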
4 Our Partitioning Method
As mentioned earlier, we rely on pre-/postorder encoding [17], i.e., we assign both preorder and postorder numbers to the nodes. This encoding provides us with enough information to evaluate the four major axes [4] as follows:

Proposition 1. Let pre(n) and post(n) denote the preorder and postorder numbers of node n, respectively. For any two nodes n and m, n ∈ m/ancestor::* iff pre(n) < pre(m) and post(n) > post(m), n ∈ m/descendant::* iff pre(n) > pre(m) and post(n) < post(m), n ∈ m/preceding::* iff pre(n) < pre(m) and post(n) < post(m), and n ∈ m/following::* iff pre(n) > pre(m) and post(n) > post(m).

This observation is actually slightly inaccurate since the attribute nodes are not in the result of, for instance, the descendant axis. However, if all nodes are treated equally, the proposition holds. The observation is exemplified on the left side of Fig. 2, which shows the partitioning created by the major axes using node (6,5) as the context node. The ancestors of node (6,5) can be found in the upper-left partition, the descendants in the lower-right partition, the predecessors in the lower-left partition, and the followers in the upper-right partition. Based on the previous observation, the following proposition is obvious:

Proposition 2. For any two nodes n and m and for any p > 0, ⌊pre(n)/p⌋ ≤ ⌊pre(m)/p⌋ and ⌊post(n)/p⌋ ≥ ⌊post(m)/p⌋ if n ∈ m/ancestor::*, ⌊pre(n)/p⌋ ≥ ⌊pre(m)/p⌋ and ⌊post(n)/p⌋ ≤ ⌊post(m)/p⌋ if n ∈ m/descendant::*, ⌊pre(n)/p⌋ ≤ ⌊pre(m)/p⌋ and ⌊post(n)/p⌋ ≤ ⌊post(m)/p⌋ if n ∈ m/preceding::*, and ⌊pre(n)/p⌋ ≥ ⌊pre(m)/p⌋ and ⌊post(n)/p⌋ ≥ ⌊post(m)/p⌋ if n ∈ m/following::*.

Proposition 2 provides us with simple means for partitioning the nodes into disjoint subsets. The partitioning is exemplified in Fig. 2 (right), which shows the nodes of our example tree partitioned using the value 4 for p. It should now be obvious that when searching, for instance, for the ancestors of node (6,5), it is sufficient to check the nodes in the shaded partitions, since the other partitions cannot contain any ancestors. In what follows, the values ⌊pre(n)/p⌋ and ⌊post(n)/p⌋, where n is a node residing in partition P, are simply referred to as the preorder and postorder number of P and denoted by pre(P) and post(P), respectively. Many minor axes can also benefit from partitioning. The ancestor-or-self and descendant-or-self axes, for example, behave similarly to the ancestor and descendant axes, and thus they can be accelerated using the same partition information. The results of preceding-sibling and following-sibling, on the
other hand, are subsets of preceding and following, and the results of parent and child are subsets of ancestor and descendant, so they can also be accelerated using the partitioning. However, minor axes other than ancestor-or-self and descendant-or-self are generally very easy to evaluate, and thus using the partition information to evaluate them is often just unnecessary overhead, especially if we have an index which can be used to locate the nodes with a given parent node efficiently [18]. One should also notice that in practice, the nodes are not distributed evenly and there are several empty partitions. Thus, the number of non-empty partitions is usually much smaller than p².
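A small sketch of this pruning follows, under the assumption (used throughout this illustration) that the partition number of a node is the integer division ⌊pre/p⌋, ⌊post/p⌋ from Proposition 2:

def partition(pre_no, post_no, p):
    # Partition a node falls into, following Proposition 2.
    return (pre_no // p, post_no // p)

def ancestor_candidate_partitions(context, partitions, p):
    # Keep only the partitions that may hold ancestors of the context node
    # (smaller-or-equal preorder block, larger-or-equal postorder block).
    cp, cq = partition(context[0], context[1], p)
    return [P for P in partitions if P[0] <= cp and P[1] >= cq]

# The nodes of Fig. 1, as (pre, post) pairs, partitioned with p = 4:
nodes = [(1, 10), (2, 2), (3, 1), (4, 6), (5, 3),
         (6, 5), (7, 4), (8, 9), (9, 8), (10, 7)]
parts = {partition(a, b, 4) for a, b in nodes}
print(ancestor_candidate_partitions((6, 5), parts, 4))
# -> only the partitions that can contain the ancestors (1, 10) and (4, 6)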
5 Implementation Options
5.1 Relational Implementation
To test our idea, we designed a relational database according to the principles described in the previous section. As the basis of our implementation, we chose the XPath accelerator [4]. In order to store the partition information, we employed relation Part, and thus we ended up with the following relational schema; the underlined attributes serve as the primary keys:
Node(Pre, Post, Par, Part, Type, Name, Value)
Part(Part, Pre, Post)
As in the original proposal, the database attributes Pre, Post, Par, Type, Name, and Value of the Node relation correspond to the pre- and postorder numbers, the preorder number of the parent, the type, the name, and the string value of the node, respectively. The database attributes Type, Name, and Value are needed to support node tests and string value tests; the axes can be evaluated using the database attributes Pre, Post, and Par. The partition information is contained in the Part table, in which the database attributes Pre and Post correspond to the preorder and postorder number of the partition, respectively. The database attribute Part in the Node relation is a foreign key referencing the primary key of the Part relation. For the sake of brevity, we do not describe the XPath-to-SQL query translation in detail; the details can be found in [4] and [6]. For our purposes, it is sufficient to say that in order to perform structural joins, these systems issue SQL queries which involve nonequijoins, i.e., joins using < or > as the join condition, which can easily lead to scalability problems [12,18,19]. In XPath accelerator, the XPath query n0/descendant::*, for instance, is transformed into the following SQL query:
SELECT DISTINCT n1.* FROM Node n0, Node n1
WHERE n1.Pre > n0.Pre AND n1.Post < n0.Post
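To make the partition idea concrete at the SQL level, here is a small, self-contained sketch (our own illustration in Python with sqlite3; the second query shows one way the Part relation could be used to prune a descendant search, and is not necessarily the exact SQL generated by the system described above):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Node (Pre INTEGER PRIMARY KEY, Post INTEGER, Par INTEGER,
                   Part INTEGER, Type TEXT, Name TEXT, Value TEXT);
CREATE TABLE Part (Part INTEGER PRIMARY KEY, Pre INTEGER, Post INTEGER);
""")

# Plain XPath accelerator form of n0/descendant::* (Proposition 1):
PLAIN = """
SELECT DISTINCT n1.* FROM Node n0, Node n1
WHERE n0.Pre = ? AND n1.Pre > n0.Pre AND n1.Post < n0.Post
"""

# Partition-pruned variant (Proposition 2): only nodes in candidate partitions
# are compared against the exact pre/post conditions.
PRUNED = """
SELECT DISTINCT n1.* FROM Node n0, Part p0, Part p1, Node n1
WHERE n0.Pre = ? AND p0.Part = n0.Part
  AND p1.Pre >= p0.Pre AND p1.Post <= p0.Post
  AND n1.Part = p1.Part
  AND n1.Pre > n0.Pre AND n1.Post < n0.Post
"""

print(con.execute(PRUNED, (1,)).fetchall())   # empty, since no rows were loaded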