Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6896
Stefanie Rinderle-Ma Farouk Toumani Karsten Wolf (Eds.)
Business Process Management 9th International Conference, BPM 2011 Clermont-Ferrand, France August 30 - September 2, 2011 Proceedings
Volume Editors Stefanie Rinderle-Ma University of Vienna Faculty of Computer Science Workflow Systems and Technology Vienna, Austria E-mail:
[email protected] Farouk Toumani Blaise Pascal University Laboratoire LIMOS - CNRS Clermont-Ferrand, France E-mail:
[email protected] Karsten Wolf University of Rostock Institute of Computer Science Rostock, Germany E-mail:
[email protected] ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-23058-5 e-ISBN 978-3-642-23059-2 DOI 10.1007/978-3-642-23059-2 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011933596 CR Subject Classification (1998): H.4, H.5, D.2, F.3, J.1 LNCS Sublibrary: SL 3 – Information Systems and Application, incl. Internet/Web and HCI
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
The BPM Conference series provides the most distinguished research forum for researchers and practitioners in all aspects of business process management (BPM) including theory, frameworks, methods, techniques, architectures, systems, and empirical findings. The series has a record of attracting innovative research of the highest quality, from a mix of disciplines including computer science, management information science, services computing, services science, and technology management. BPM 2011 was the ninth conference of the series. It took place from August 30 to September 2, 2011, on the Campus des Cézeaux in Clermont-Ferrand, France, and was organized by the LIMOS Laboratory of the Blaise Pascal University and CNRS.

BPM 2011 attracted 157 research paper submissions, out of which 22 papers were selected for this volume based on a thorough review process (each paper was reviewed by three to four Program Committee members and subject to subsequent discussion among the reviewers). Altogether the research track was extremely competitive, with an acceptance rate of less than 14%. Further, this volume contains a paper and two abstracts documenting the invited keynote talks as well as 5 industrial papers (2 invited industrial papers and 3 papers selected out of 14 submissions to the industrial track).

In conjunction with the main conference, 12 international workshops were held on August 29, 2011. The topics of the workshops covered a broad range of BPM-related subjects and stimulated the exchange and discussion of new and innovative ideas. The workshop proceedings are published as a separate volume of Springer's Lecture Notes in Business Information Processing series. Beyond that, the conference also included a doctoral consortium, an industry day, tutorials, panels, and demonstrations.

We want to thank everyone who made BPM 2011 a success by generously and voluntarily sharing their knowledge, skills and time: the Conference Committee for providing an excellent environment for the conference, and all other colleagues holding offices. In particular, we want to express our gratitude to the senior, regular and industrial Program Committee members as well as the additional reviewers for devoting their expertise and time to ensure the high quality of the conference scientific program through an extensive review and discussion process. Finally, we are grateful to all the authors who showed their appreciation and support of the conference by submitting their valuable work to it.

September 2011
Stefanie Rinderle-Ma Farouk Toumani Karsten Wolf
Organization
BPM 2011 was organized by the LIMOS Laboratory, CNRS, University of Blaise Pascal, Clermont-Ferrand, France.
Steering Committee Boualem Benatallah Fabio Casati Peter Dadam Wil van der Aalst Jörg Desel Schahram Dustdar Arthur ter Hofstede Barbara Pernici Matthias Weske
University of New South Wales, Sydney, Australia University of Trento, Italy University of Ulm, Germany Eindhoven University of Technology, The Netherlands Catholic University Eichstätt-Ingolstadt, Germany Vienna University of Technology, Austria Queensland University of Technology, Brisbane, Australia Politecnico di Milano, Italy Hasso Plattner Institut, University of Potsdam, Germany
Executive Committee General Chairs Mohand-Said Hacid Farouk Toumani
Claude Bernard University, France Blaise Pascal University, France
Organizing Chair Michel Schneider
Blaise Pascal University, France
Program Chairs Stefanie Rinderle-Ma Farouk Toumani Karsten Wolf
University of Vienna, Austria Blaise Pascal University, France University of Rostock, Germany
Industrial Track Chairs Francisco Curbera Alexander Dreiling Hamid Reza Motahari Nezhad
IBM Research, Hawthorne, USA SAP Research, Brisbane, Australia HP Labs, Palo Alto, USA
Workshop Chairs Kamel Barkaoui Florian Daniel Schahram Dustdar
Cédric, Paris, France University of Trento, Italy Vienna University of Technology, Austria
Tutorials/Panels Chairs Fabio Casati Marlon Dumas Stefan Tai Demonstration Chairs Heiko Ludwig Hajo A. Reijers
University of Trento, Italy University of Tartu, Estonia Karlsruhe Institute of Technology, Germany
IBM Research, San Jose, USA Eindhoven University of Technology, The Netherlands
PhD Symposium Chairs Anis Charfi Claude Godart Michael Rosemann
SAP Research, Darmstadt, Germany LORIA - University of Lorraine, France Queensland University of Technology, Brisbane, Australia
Publicity Chairs Laurent d’Orazio Rajiv Ranjan Liang Zhang
Blaise Pascal University, France University of New South Wales, Australia Fudan University, China
Senior Program Committee Boualem Benatallah Peter Dadam Wil Van Der Aalst Schahram Dustdar Claude Godart Arthur ter Hofstede Stefan Jablonski Frank Leymann Jan Mendling Hajo A. Reijers
University of New South Wales, Australia University of Ulm, Germany Eindhoven University of Technology, The Netherlands Vienna University of Technology, Austria LORIA - University of Lorraine, Nancy, France Queensland University of Technology, Brisbane, Australia University of Bayreuth, Germany University of Stuttgart, Germany Humboldt-Universität zu Berlin, Germany Eindhoven University of Technology, The Netherlands
Michael Rosemann Manfred Reichert Mathias Weske
Queensland University of Technology, Australia University of Ulm, Germany University of Potsdam, Germany
Program Committee Pedro Antunes Djamal Benslimane Shawn Bowers Christoph Bussler Fabio Casati Francisco Curbera Valeria De Antonellis Giuseppe De Giacomo J¨ org Desel Alin Deutsch Remco Dijkman Alexander Dreiling Johann Eder Gregor Engels Dirk Fahland Hans-Georg Fill Avigdor Gal Dimitrios Georgakopoulos Daniela Grigori Mohand-Said Hacid Kees Van Hee Richard Hull Marta Indulska Leonid Kalinichenko Gerti Kappel Dimka Karastoyanova Ekkart Kindler Agnes Koschmider John Krogstie Akhil Kumar Jana K¨ ohler Ana Liu Niels Lohmann
University of Lisbon, Portugal Lyon 1 University, France Gonzaga University, Spokane, USA Saba Software, Inc., USA University of Trento, Italy IBM Research, Hawthorne, USA University of Brescia, Italy University of Rome Sapienza, Italy Catholic University Eichst¨att-Ingolstadt, Germany University of California San Diego, USA University of Technology, Eindhoven, The Netherlands SAP Research, Brisbane, Australia University of Klagenfurt, Austria University of Paderborn, Germany Humboldt-Universit¨ at zu Berlin, Germany University of Vienna, Austria Technion, Haifa, Israel CSIRO, Canberra, Australia University of Versailles, France Universit´e Claude Bernard Lyon 1, France Technische Universiteit Eindhoven, The Netherlands IBM T.J. Watson Research Center, USA The University of Queensland, Australia Russian Academy of Science, Russia Vienna University of Technology, Austria University of Stuttgart, Germany Technical University of Denmark, Denmark Karlsruher Institute of Technology, Germany University of Science and Technology, Norway Penn State University, USA Hochschule Luzern, Switzerland National ICT, Australia Universit¨ at Rostock, Germany
Heiko Ludwig Hamid Motahari Bela Mutschler Prabir Nandi Andreas Oberweis Herv´e Panetto Oscar Pastor Lopez Cesare Pautasso Frank Puhlmann Jan Recker Stefanie Rinderle-Ma Domenico Sacca’ Shazia Sadiq Raghu Santanam Erich Schikuta Christian Stahl Mark Strembeck Stefan Tai Farouk Toumani Boudewijn Van Dongen Hagen Voelzer Harry Wang Barbara Weber Petia Wohed Eric Wohlstadter Karsten Wolf Andreas Wombacher Leonid Zhao Christian Zirpins
IBM Research, USA HP Labs, Palo Alto, USA University of Applied Sciences Ravensburg-Weingarten, Germany IBM Research, USA Karlsruhe Institute of Technology, Germany Nancy-Universit´e, CNRS, France University of Valencia, Spain University of Lugano, Switzerland inubit AG, Germany QUT, Brisbane, Australia University of Vienna, Austria University of Calabria, Italy The University of Queensland, Australia Arizona State University, USA University of Vienna, Austria Technische Universiteit Eindhoven, The Netherlands Vienna University of Economics and BA, Austria Karlsruhe Institute of Technology, Germany Blaise Pascal University, France Eindhoven University of Technology, The Netherlands IBM Research, Switzerland University of Delaware, USA University of Innsbruck, Austria Stockholm University, Sweden University of British Columbia, USA Rostock University, Germany University of Twente, The Netherlands University of Karlsruhe, Germany
Industrial Track Program Committee Soeren Balko Alistair Barros Christoph Bussler Anis Charfi François Charoy Jude Fernandez Sven Graupner Richard Goodwin
SAP Research, Brisbane, Australia SAP Research, Brisbane, Australia Xtime, USA SAP Research, Darmstadt, Germany LORIA - Henri Poincaré University, Nancy, France Infosys, Bangalore, India HP Labs, Palo Alto, USA IBM Research, Hawthorne, USA
Hans-Arno Jacobsen Christian Janiesch Dimka Karastoyanova Rania Khalaf Jana Koehler
University of Toronto, Canada SAP Research, Brisbane, Australia University of Stuttgart, Germany IBM Research, Hawthorne, USA Lucerne University of Applied Sciences and Art, Switzerland University of Toronto, Canada Fujitsu America, USA IBM Research, Hawthorne, USA SAP Research, Germany
Vinod Muthusamy Keith Swenson Roman Vaculin Birgit Zimmermann
Additional Reviewers Arsinte, Delia Aubry, Alexis Baird, Aaron Bals, Jan-Christopher Barukh, Moshe Chai Baumgartner, Peter Baumgrass, Anne Bergenthum, Robin Bianchini, Devis Bogoni, Leandro Paulo Buccafurri, Francesco Cai, Rainbow Chang, Wei Chiasera, Annamaria Clemens, Stephan De La Vara, Jose Luis De Masellis, Riccardo Di Ciccio, Claudio Doganata, Yurdaer Elabd, Emad Elliger, Felix Favre, Cedric Fazal-Baqaie, Masud Felli, Paolo Fischer, Robin Folino, Francesco Fortino, Gianfranco Gao, Zhiling Gater, Ahmed Gerth, Christian Gierds, Christian Governatori, Guido
Guabtni, Adnene Guzzo, Antonella Hantry, Francois Hartmann, Uwe Hipp, Markus Hoisl, Bernhard Hu, Daning Huang, Zan Kabicher, Sonja Kenig, Batya Klai, Kais Kohlborn, Thomas Koncilia, Christian Kriglstein, Simone Kuester, Jochen La Rosa, Marcello Lakshmanan, Geetika Lee, Gun-Woong Leitner, Maria Lemos, Fernando Lezoche, Mario Liegl, Philipp Lincoln, Maya Linehan, Mark Malets, Polina Mangler, Juergen Marrella, Andrea Mauser, Sebastian Mayrhofer, Dieter Mecella, Massimo Melchiori, Michele Menzel, Michael
Mesmoudi, Amin Michelberger, Bernd Musaraj, Kreshnik M¨ uller, Richard Nagel, Benjamin Nigam, Anil Nourine, Lhouari Nowak, Alexander Oro, Ermelinda Patrizi, Fabio Pichler, Christian Pontieri, Luigi Rodriguez, Carlos Roy Chowdhury, Soudip Roychowdhury, Soudip Sagi, Tomer Sanchez, Juan Sarnikar, Surendra Schefer, Sigrid Schleicher, Daniel Schreiber, Hendrik Schumm, David Schuster, Nelly Sebahi, Samir Soi, Stefano Soltenborn, Christian Sonntag, Mirko Spaulding, Trent Stahl, Christian Strommer, Michael S¨ urmeli, Jan Taher, Yehia
Thion, Romuald Torres, Victoria Tranquillini, Stefano Ul Haq, Irfan Vaisenberger, Avi
Wanek, Helmut Wang, Lei Widl, Magdalena Wimmer, Manuel Wittern, Erik
Yahia, Esma Zapletal, Marco Zhang, He
Table of Contents
Keynotes Some Thoughts on Behavioral Programming . . . . . . . . . . . . . . . . . . . . . . . . . David Harel The Changing Nature of Work: From Structured to Unstructured, from Controlled to Social . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sandy Kemsley Automatic Verification of Data-Centric Business Processes . . . . . . . . . . . . Elio Damaggio, Alin Deutsch, Richard Hull, and Victor Vianu
Industrial Track A Blueprint for Event-Driven Business Activity Management . . . . . . . . . . Christian Janiesch, Martin Matzner, and Oliver M¨ uller
On Cross-Enterprise Collaboration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lav R. Varshney and Daniel V. Oppenheim
Source Code Partitioning Using Process Mining . . . . . . . . . . . . . . . . . . . . . . Koki Kato, Tsuyoshi Kanai, and Sanya Uehara
Next Best Step and Expert Recommendation for Collaborative Processes in IT Service Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hamid Reza Motahari-Nezhad and Claudio Bartolini Process Variation Analysis Using Empirical Methods: A Case Study . . . . Heiko Ludwig, Yolanda Rankin, Robert Enyedi, and Laura C. Anderson
Research Track Stimulating Skill Evolution in Market-Based Crowdsourcing . . . . . . . . . . . Benjamin Satzger, Harald Psaier, Daniel Schall, and Schahram Dustdar Better Algorithms for Analyzing and Enacting Declarative Workflow Languages Using LTL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Westergaard Compliance by Design for Artifact-Centric Business Processes . . . . . . . . . Niels Lohmann
Refining Process Models through the Analysis of Informal Work Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simon Brander, Knut Hinkelmann, Bo Hu, Andreas Martin, Uwe V. Riss, Barbara Th¨ onssen, and Hans Friedrich Witschel Monitoring Business Constraints with Linear Temporal Logic: An Approach Based on Colored Automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fabrizio Maria Maggi, Marco Montali, Michael Westergaard, and Wil M.P. van der Aalst
Automated Error Correction of Business Process Models . . . . . . . . . . . . . . Mauro Gambini, Marcello La Rosa, Sara Migliorini, and Arthur H.M. ter Hofstede
Behavioral Similarity – A Proper Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matthias Kunze, Matthias Weidlich, and Mathias Weske
Event-Based Monitoring of Process Execution Violations . . . . . . . . . . . . . . Matthias Weidlich, Holger Ziekow, Jan Mendling, Oliver G¨ unther, Mathias Weske, and Nirmit Desai
Towards Efficient Business Process Clustering and Retrieval: Combining Language Modeling and Structure Matching . . . . . . . . . . . . . . . . . . . . . . . . . Mu Qiao, Rama Akkiraju, and Aubrey J. Rembert Self-learning Predictor Aggregation for the Evolution of People-Driven Ad-Hoc Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christoph Dorn, C´esar A. Mar´ın, Nikolay Mehandjiev, and Schahram Dustdar
Serving Information Needs in Business Process Consulting . . . . . . . . . . . . Monika Gupta, Debdoot Mukherjee, Senthil Mani, Vibha Singhal Sinha, and Saurabh Sinha
Clone Detection in Repositories of Business Process Models . . . . . . . . . . . Reina Uba, Marlon Dumas, Luciano Garc´ıa-Ba˜ nuelos, and Marcello La Rosa
Business Artifact-Centric Modeling for Real-Time Performance Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rong Liu, Roman Vacul´ın, Zhe Shan, Anil Nigam, and Frederick Wu A Query Language for Analyzing Business Processes Execution . . . . . . . . Seyed-Mehdi-Reza Beheshti, Boualem Benatallah, Hamid Reza Motahari-Nezhad, and Sherif Sakr Discovering Characteristics of Stochastic Collections of Process Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kees van Hee, Marcello La Rosa, Zheng Liu, and Natalia Sidorova
Wiki-Based Maturing of Process Descriptions . . . . . . . . . . . . . . . . . . . . . . . . Frank Dengler and Denny Vrandeˇci´c A-Posteriori Detection of Sensor Infrastructure Errors in Correlated Sensor Data and Business Workflows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andreas Wombacher Conformance Checking of Interacting Processes with Overlapping Instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dirk Fahland, Massimiliano de Leoni, Boudewijn F. van Dongen, and Wil M.P. van der Aalst Simplifying Mined Process Models: An Approach Based on Unfoldings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dirk Fahland and Wil M.P. van der Aalst Foundations of Relational Artifacts Verification . . . . . . . . . . . . . . . . . . . . . . Babak Bagheri Hariri, Diego Calvanese, Giuseppe De Giacomo, Riccardo De Masellis, and Paolo Felli
On the Equivalence of Incremental and Fixpoint Semantics for Business Artifacts with Guard-Stage-Milestone Lifecycles . . . . . . . . . . . . . . . . . . . . . Elio Damaggio, Richard Hull, and Roman Vacul´ın
Compensation of Adapted Service Orchestration Logic in BPEL’n’Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mirko Sonntag and Dimka Karastoyanova
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Some Thoughts on Behavioral Programming

David Harel

Weizmann Institute of Science, Israel
[email protected]

Abstract. The talk starts from a dream/vision paper I published in 2008, whose title, Can Programming be Liberated, Period?, is a play on that of John Backus' famous Turing Award Lecture (and paper). I will propose, or rather ask, whether programming can be made a lot closer to the way we humans think about dynamics, and the way we somehow manage to get others (e.g., our children, our employees, etc.) to do what we have in mind. Technically, the question is whether we can liberate programming from its three main straitjackets: (1) having to directly produce a precise artifact in some language; (2) having actually to produce two separate artifacts (the program and the requirements) and having then to pit one against the other; (3) having to program each piece/part/object of the system separately. The talk will then get a little more technical, providing some evidence of feasibility of the dream, via LSCs and the play-in/play-out approach to scenario-based programming, and its more recent Java variant. The entire body of work around these ideas can be framed as a paradigm, which we call behavioral programming.
The Changing Nature of Work: From Structured to Unstructured, from Controlled to Social

Sandy Kemsley

Kemsley Design Ltd., Toronto, Canada
[email protected]

Abstract. Traditional BPM systems perform well with structured processes and controlled interactions between participants. However, most business functions involve both structured processes and ad hoc activities, and can range from top-down delegation to social collaboration. A key trend in business is the increased focus on knowledge worker productivity as the routine work becomes more automated, which has led to fragmentation of the BPM market: in addition to traditional BPM systems, a new breed of adaptive case management (ACM) systems manages unstructured processes, collaboration is taking hold both in process design and runtime environments, and user interfaces are borrowing from social media to put a new face on BPM. But do the differences between structured and unstructured process management, or between authoritarian and collaborative interactions, warrant the use of different types of systems? Or do they just represent different parts of a spectrum of process functionality? How can they best be combined to manage a broad variety of business process types? What types of hybrid process systems will emerge to match these business needs?
Automatic Verification of Data-Centric Business Processes

Elio Damaggio (1), Alin Deutsch (1), Richard Hull (2), and Victor Vianu (1)

(1) University of California, San Diego, {elio,deutsch,vianu}@cs.ucsd.edu
(2) IBM T.J. Watson Research Center
[email protected]

1 Introduction

Recent years have witnessed the evolution of business process specification frameworks from the traditional process-centric approach towards data-awareness. Process-centric formalisms focus on control flow while under-specifying the underlying data and its manipulations by the process tasks, often abstracting them away completely. In contrast, data-aware formalisms treat data as first-class citizens. A notable exponent of this class is the business artifact model pioneered in [46, 34], deployed by IBM in professional services offerings with associated tooling, and further studied in a line of follow-up works [4, 5, 26, 27, 6, 38, 33, 35, 42]. Business artifacts (or simply "artifacts") model key business-relevant entities, which are updated by a set of services that implement business process tasks. A collection of artifacts and services is called an artifact system. This modeling approach has been successfully deployed in practice, yielding proven savings when performing business process transformations [5, 4, 15].

In recent work [21, 16], we addressed the verification problem for artifact systems, which consists in statically checking whether all runs of an artifact system satisfy desirable properties expressed in an extension of linear-time temporal logic (LTL). We consider a simplified model of artifact systems in which each artifact consists of a record of variables and the services may consult (though not update) an underlying database. Services can also set the artifact variables to new values from an infinite domain, thus modeling external inputs and tasks specified only partially, using pre- and post-conditions. Post-conditions permit non-determinism in the outcome of a service, to capture for instance tasks in which a human makes a final determination about the value of an artifact variable, subject to certain constraints. This setting results in a challenging infinite-state verification problem, due to the infinite data domain. Rather than relying on general-purpose software verification tools suffering from well-known limitations, the work of [21] and [16] addresses this problem by identifying relevant classes
[Footnotes: This work was supported by the National Science Foundation under award III-0916515, and by IBM Research through an Open Collaborative Research grant in conjunction with Project ArtiFact. This author was supported in part by the National Science Foundation under award IIS0812578. This author was supported in part by the European Research Council under grant Webdam, agreement 226513.]
of artifact systems and properties for which fully automatic verification is possible. This extended abstract informally summarizes these results.
Context and Related Work

Data-aware business process models. The specific notion of business artifact was first introduced in [46] and [34] (called there "adaptive documents"), and was further studied, from both practical and theoretical perspectives, in [4, 5, 26, 27, 6, 38, 33, 35, 42, 25, 17, 29, 3]. (Some of these publications use the term "business entity" in place of "business artifact".) Some key roots of the artifact model are present in "adaptive business objects" [43], "business entities", "document-driven" workflow [50] and "document" engineering [28]. The Vortex framework [31, 24, 30] also allows the specification of database manipulations and provides declarative specifications for when services are applicable to a given artifact. The artifact model considered here is inspired in part by the field of semantic web services. In particular, the OWL-S proposal [40, 39] describes the semantics of services in terms of input parameters, output parameters, pre- and post-conditions. In the artifact model considered here the services are applied in a sequential fashion. The Guard-Stage-Milestone (GSM) approach [25, 17, 29] to artifact lifecycles permits services with pre- and post-conditions, parallelism, and hierarchy. Results in [17] show that the verification results described in the current paper can be generalized to apply to GSM. On a more practical level there is a close correspondence between business artifacts and cases as in Case Management systems [18, 52]; GSM captures most aspects of the conceptual models underlying those systems.

Static analysis of data-aware business processes. Work on formal analysis of artifact-based business processes in restricted contexts has been reported in [26, 27, 6]. Properties investigated include reachability [26, 27], general temporal constraints [27], and the existence of complete execution or dead end [6]. For the variants considered in each paper, verification is generally undecidable; decidability results were obtained only under rather severe restrictions, e.g., restricting all pre-conditions to be "true" [26], restricting to bounded domains [27, 6], or restricting the pre- and post-conditions to refer only to artifacts (and not their variable values) [27]. None of the above papers permit an underlying database, integrity constraints, or arithmetic. [14] adopts an artifact model variation with arithmetic operations but no database. It proposes a criterion for comparing the expressiveness of specifications using the notion of dominance, based on the input/output pairs of business processes. Decidability is shown only by restricting runs to bounded length. [51] addresses the problem of the existence of a run that satisfies a temporal property, for a restricted case with no database, no arithmetic, and only propositional LTL properties. [3] considers another variety of the artifact model, with database but no arithmetic and no data dependencies, and limited modeling of the input from the environment. The work focuses on static verification of properties in a very powerful language (first order μ-calculus) which subsumes the temporal logic we consider (first order LTL), in particular allowing branching time. This expressivity comes at the cost of restricting verification decidability to the case when
the initial database contents are given. In contrast, we verify correctness properties for all possible initializations of the database. Static analysis for semantic web services is considered in [44], but in a context restricted to finite domains. More recently, [1] has studied automatic verification in the context of business processes based on Active XML documents. The works [22, 49, 2] are ancestors of [21] from the context of verification of electronic commerce applications. Their models could conceptually (if not naturally) be encoded as artifact systems, but they correspond only to particular cases of the model in [21]. Also, they limit artifact values to essentially come from the active domain of the database, thus ruling out external inputs, partially-specified services, and arithmetic. Infinite-state systems. Artifact systems are a particular case of infinite-state systems. Research on automatic verification of infinite-state systems has recently focused on extending classical model checking techniques (e.g., see [13] for a survey). However, in much of this work the emphasis is on studying recursive control rather than data, which is either ignored or finitely abstracted. More recent work has been focusing specifically on data as a source of infinity. This includes augmenting recursive procedures with integer parameters [9], rewriting systems with data [10, 8], Petri nets with data associated to tokens [36], automata and logics over infinite alphabets [12, 11, 45, 19, 32, 7, 8], and temporal logics manipulating data [19, 20]. However, the restricted use of data and the particular properties verified have limited applicability to the business artifacts setting.
2 Artifact Systems

We describe a minimalistic variant of the artifact systems model, adequate for illustrating our approach to verification. The presentation is informal, relying mainly on a running example (the formal development is provided in [21, 16]). The example, modeling an e-commerce process, features several characteristics that drive the motivation for our work.

1. The system routinely queries an underlying database, for instance to look up the price of a product and the shipping weight restrictions.
2. The validity checks and updates carried out by the services involve arithmetic operations. For instance, to be valid, an order must satisfy such conditions as: (a) the product weight must be within the selected shipment method's limit, and (b) if the buyer uses a coupon, the sum of product price and shipping cost must exceed the coupon's minimum purchase limit.
3. Finally, the correctness of the business process relies on database integrity constraints. For instance, the system must check that a selected triple of product, shipment type and coupon are globally compatible. This check is implemented by several local tests, each running at a distinct instant of the interaction, as user selections become available. Each local test accesses distinct tables in the database, yet they globally refer to the same product, due to the keys and foreign keys satisfied by these tables.
The example models an e-commerce business process in which the customer chooses a product and a shipment method and applies various kinds of coupons to the order. There are two kinds of coupons: discount coupons subtract their value from the total (e.g. a $50 coupon) and free-shipment coupons subtract the shipping costs from the total. The order is filled in a sequential manner (first pick the product, then the shipment, then claim a coupon), as is customary on e-commerce web-sites. After the order is filled, the system waits for the customer to submit a payment. If the payment matches the amount owed, the system proceeds to shipping the product.

As mentioned earlier, an artifact is an evolving record of values. The values are referred to by variables (sometimes called attributes). In general, an artifact system consists of several artifacts, evolving under the action of services, specified by pre- and post-conditions. In the example, we use a single artifact with the following variables: status, prod_id, ship_type, coupon, amount_owed, amount_paid, amount_refunded. The status variable tracks the status of the order and can take the following values: "edit product", "edit ship", "edit coupon", "processing", "received payment", "shipping", "shipped", "canceling", "canceled". Artifact variables ship_type and coupon record the customer's selection, received as an external input. amount_paid is also an external input (from the customer, possibly indirectly via a credit card service). Variable amount_owed is set by the system using arithmetic operations that sum up product price and shipment cost, subtracting the coupon value. Variable amount_refunded is set by the system in case a refund is activated.

The database includes the following tables, where underlined attributes denote keys. Recall that a key is an attribute that uniquely identifies each tuple in a relation.

PRODUCTS(id, price, availability, weight),
COUPONS(code, type, value, min_value, free_shiptype),
SHIPPING(type, cost, max_weight),
OFFERS(prod_id, discounted_price, active).

The database also satisfies the following foreign keys: COUPONS[free_shiptype] ⊆ SHIPPING[type] and OFFERS[prod_id] ⊆ PRODUCTS[id]. Thus, the first dependency says that each free_shiptype value in the COUPONS relation is also a type value in the SHIPPING relation. The second inclusion dependency states that every prod_id value in the OFFERS relation is the actual id of a product in the PRODUCTS relation.

The starting configuration of every artifact system is constrained by an initialization condition, which here states that status is initialized to "edit prod", and all other variables to "undefined". By convention, we model undefined variables using the reserved constant λ.
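To make the data model concrete, the following sketch (in Python; an illustration added here, not part of the original paper) shows one possible representation of the artifact record, the initialization condition, and the key and foreign key dependencies of the schema. The dictionary-based encoding and the explicit dependency checks are assumptions made purely for exposition.

# The artifact: a record of variables, with the initialization condition
# (status = "edit prod", all other variables undefined, written as λ).
UNDEFINED = "λ"

def initial_artifact():
    return {
        "status": "edit prod",
        "prod_id": UNDEFINED, "ship_type": UNDEFINED, "coupon": UNDEFINED,
        "amount_owed": UNDEFINED, "amount_paid": UNDEFINED,
        "amount_refunded": UNDEFINED,
    }

# Read-only database: relation name -> list of tuples (dicts).
KEYS = {"PRODUCTS": "id", "COUPONS": "code", "SHIPPING": "type", "OFFERS": "prod_id"}
FOREIGN_KEYS = [  # (relation, attribute) ⊆ (target relation, target attribute)
    ("COUPONS", "free_shiptype", "SHIPPING", "type"),
    ("OFFERS", "prod_id", "PRODUCTS", "id"),
]

def db_is_valid(db):
    """Check the key and inclusion (foreign key) dependencies of the schema."""
    for rel, key in KEYS.items():
        values = [t[key] for t in db.get(rel, [])]
        if len(values) != len(set(values)):   # key values must be unique
            return False
    for rel, attr, target, tattr in FOREIGN_KEYS:
        allowed = {t[tattr] for t in db.get(target, [])}
        if any(t[attr] not in allowed for t in db.get(rel, [])):
            return False
    return True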
The services. Recall that artifacts evolve under the action of services. Each service is specified by a pre-condition π and a post-condition ψ, both existential first-order (∃FO) sentences. The pre-condition refers to the current values of the artifact variables and the database. The post-condition ψ refers simultaneously to the current and next artifact values, as well as the database. In addition, both π and ψ may use arithmetic constraints on the variables, limited to linear inequalities over the rationals. The following services model a few of the business process tasks of the example. Throughout the example, we use primed artifact variables x′ to refer to the next value of variable x.

choose_product: The customer chooses a product.
π : status = "edit prod"
ψ : ∃p, a, w (PRODUCTS(prod_id′, p, a, w) ∧ a > 0) ∧ status′ = "edit shiptype"

choose_shiptype: The customer chooses a shipping option.
π : status = "edit ship"
ψ : ∃c, l, p, a, w (SHIPPING(ship_type′, c, l) ∧ PRODUCTS(prod_id, p, a, w) ∧ l > w) ∧ status′ = "edit coupon" ∧ prod_id′ = prod_id

apply_coupon: The customer optionally inputs a coupon number.
π : status = "edit coupon"
ψ : (coupon′ = λ ∧ ∃p, a, w, c, l (PRODUCTS(prod_id, p, a, w) ∧ SHIPPING(ship_type, c, l) ∧ amount_owed′ = p + c) ∧ status′ = "processing" ∧ prod_id′ = prod_id ∧ ship_type′ = ship_type)
∨ (∃t, v, m, s, p, a, w, c, l (COUPONS(coupon′, t, v, m, s) ∧ PRODUCTS(prod_id, p, a, w) ∧ SHIPPING(ship_type, c, l) ∧ p + c ≥ m ∧ (t = "free shipping" → (s = ship_type ∧ amount_owed′ = p)) ∧ (t = "discount" → amount_owed′ = p + c − v)) ∧ status′ = "processing" ∧ prod_id′ = prod_id ∧ ship_type′ = ship_type)

Notice that the pre-conditions of the services check the value of the status variable. For instance, according to choose_product, the customer can only input her product choice while the order is in "edit prod" status. Also notice that the post-conditions constrain the next values of the artifact variables (denoted by a prime). For instance, according to choose_product, once a product has been picked, the next value of the status variable is "edit shiptype", which will at a subsequent step enable the choose_shiptype service (by satisfying its pre-condition). Similarly, once the shipment type is chosen (as modeled by service choose_shiptype), the new status is "edit coupon", which enables the apply_coupon service. The interplay of pre- and post-conditions achieves a sequential filling of the order, starting from the choice of product and ending with the claim of a coupon.

A post-condition may refer to both the current and next values of the artifact variables. For instance, in service choose_shiptype, the fact that only the shipment type is picked while the product remains unchanged is modeled by preserving the product id: the next and current values of the corresponding artifact variable are set equal. Pre- and post-conditions may query the database. For instance, in service choose_product, the post-condition ensures that the product id chosen by the customer is that of an available product (by checking that it appears in a PRODUCTS tuple, whose availability attribute is positive).
Finally, notice the arithmetic computation in the post-conditions. For instance, in service apply_coupon, the sum of the product price p and shipment cost c (looked up in the database) is adjusted with the coupon value (notice the distinct treatment of the two coupon types) and stored in the amount_owed artifact variable. Observe that the first post-condition disjunct models the case when the customer inputs no coupon number (the next value coupon′ is set to undefined), in which case a different owed amount is computed, namely the sum of price and shipping cost.

Semantics. The semantics of an artifact system A consists of its runs. Given a database D, a run of A is an infinite sequence {ρ_i}_{i≥0} of artifact records such that ρ_0 and D satisfy the initial condition of the system, and for each i ≥ 0 there is a service S of the system such that ρ_i and D satisfy the pre-condition of S and ρ_i, ρ_{i+1} and D satisfy its post-condition. For uniformity, blocking prefixes of runs are extended to infinite runs by repeating forever their last record.

The business process in the example exhibits a flexibility that, while desirable in practice for a positive customer experience, yields intricate runs, all of which need to be considered in verification. For instance, at any time before submitting a valid payment, the customer may edit the order (select a different product, shipping method, or change/add a coupon) an unbounded number of times. Likewise, the customer may cancel an order for a refund even after submitting a valid payment.
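As a rough, heavily simplified illustration of the services and run semantics just described (added here for exposition, not taken from [21, 16]), the following Python sketch encodes the choose_product service as a pair of predicates over the current record, a candidate successor record and the database, and computes the successors allowed in one run step. The dictionary-based records and the finite list of candidate successors are assumptions: they only approximate external inputs drawn from an infinite domain.

UNDEFINED = "λ"

def choose_product_pre(cur, db):
    # π: status = "edit prod"
    return cur["status"] == "edit prod"

def choose_product_post(cur, nxt, db):
    # ψ: ∃p,a,w ( PRODUCTS(prod_id', p, a, w) ∧ a > 0 ) ∧ status' = "edit shiptype"
    return (nxt["status"] == "edit shiptype"
            and any(t["id"] == nxt["prod_id"] and t["availability"] > 0
                    for t in db["PRODUCTS"]))

def successors(cur, db, pre, post, candidates):
    """One run step: the successor records that a service (pre, post) allows,
    restricted to a finite set of candidate next records."""
    return [nxt for nxt in candidates if pre(cur, db) and post(cur, nxt, db)]

# A single step on a tiny database.
db = {"PRODUCTS": [{"id": "p1", "price": 10, "availability": 3, "weight": 2}]}
cur = {"status": "edit prod", "prod_id": UNDEFINED}
candidates = [dict(cur, status="edit shiptype", prod_id=t["id"]) for t in db["PRODUCTS"]]
print(successors(cur, db, choose_product_pre, choose_product_post, candidates))

A run is then an infinite sequence obtained by repeatedly choosing a service whose pre-condition holds and a successor satisfying its post-condition; the symbolic verifier discussed later never enumerates runs this way, but the sketch conveys the step relation.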
3 Temporal Properties of Artifact Systems

The properties we are interested in verifying are expressed in a first-order extension of linear temporal logic called LTL-FO. This is a powerful language, fit to capture a wide variety of business policies implemented by a business process. For instance, in our running example it allows us to express such desiderata as:

If a correct payment is submitted then at some time in the future either the product is shipped or the customer is refunded the correct amount.
A free shipment coupon is accepted only if the available quantity of the product is greater than zero, the weight of the product is in the limit allowed by the shipment method, and the sum of price and shipping cost exceeds the coupon's minimum purchase value.

In order to specify temporal properties we use an extension of LTL (linear-time temporal logic). Recall that LTL is propositional logic augmented with temporal operators such as G (always), F (eventually), X (next) and U (until) (e.g., see [47]). For example, Gp says that p holds at all times in the run, Fp says that p will eventually hold, and G(p → Fq) says that whenever p holds, q must hold sometime in the future. The extension of LTL that we use, called LTL-FO (see the footnote below), is obtained from LTL by replacing propositions with quantifier-free FO statements about particular artifact records in the run. The statements use the artifact variables and may use additional global variables,
[Footnote 1: The variant of LTL-FO used here differs from previous ones in that the FO formulas interpreting propositions are quantifier-free. By slight abuse we use here the same name.]
shared by different statements and allowing to refer to values in different records. The global variables are universally quantified over the entire property. For example, suppose we wish to specify the property that if a correct payment is submitted then at some time in the future either the product is shipped or the customer is refunded the correct amount. The property is of the form G(p → Fq), where p says that a correct payment is submitted and q states that either the product is shipped or the customer is refunded the correct amount. Moreover, if the customer is refunded, the amount of the correct payment (given in p) should be the same as the amount of the refund (given in q). This requires using a global variable x in both p and q. More precisely, p is interpreted as the formula amount_paid = x ∧ amount_paid = amount_owed and q as status = "shipped" ∨ amount_refunded = x. This yields the LTL-FO property

(ϕ1) ∀x G((amount_paid = x ∧ amount_paid = amount_owed) → F(status = "shipped" ∨ amount_refunded = x))

Note that, as one would expect, the global variable x is universally quantified at the end. We say that an artifact system A satisfies an LTL-FO sentence ϕ if all runs of the artifact system satisfy ϕ for all values of the global variables. Note that the database is fixed for each run, but may be different for different runs.

We now show a second property ϕ2 for the running example, expressed by the LTL-FO formula

(ϕ2) ∀v, m, s, p, a, w, c, l G((prod_id ≠ λ ∧ ship_type ≠ λ ∧ COUPONS(coupon, "free ship", v, m, s) ∧ PRODUCTS(prod_id, p, a, w) ∧ SHIPPING(ship_type, c, l)) → (a > 0 ∧ w ≤ l ∧ p + c ≥ m))

where the three conjuncts of the conclusion, a > 0, w ≤ l and p + c ≥ m, are referred to as (i), (ii) and (iii) below.
Property ϕ2 verifies the consistency of orders that use coupons for free shipping. The premise of the implication lists the conditions for a completely specified order that uses such coupons. The conclusion checks the following business rules: (i) the available quantity of the product is greater than zero, (ii) the weight of the product is in the limit allowed by the shipment method, and (iii) the total order value satisfies the minimum for the application of the coupon. Note that this property holds only due to the integrity constraints on the schema. Indeed, observe that (i) is guaranteed by the post-condition of service choose_product, (ii) by choose_shiptype, and (iii) by apply_coupon. In the post-conditions, the checks are performed by looking up in the database the weight/price/cost/limit attributes associated to the customer's selection of product id and shipment type (stored in artifact variables). The property performs the same lookup in the database, and it is guaranteed to retrieve the same tuples only because product id and shipment type are keys for PRODUCTS, respectively SHIPPING. The verifier must take these key declarations into account, to avoid generating a spurious counter-example in which the tuples retrieved by the service post-conditions are distinct from those retrieved by the property, despite agreeing on product id and shipment type.
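For illustration only (this is not the verification algorithm of the paper, which must consider all infinite runs over all databases), the sketch below checks the shape of ϕ1, ∀x G(p(x) → F q(x)), over a finite prefix of a run, approximating the universal quantification over the global variable x by the payment amounts that actually occur in the prefix. The record representation is the same assumption as in the earlier sketches.

def p(rec, x):
    # amount_paid = x ∧ amount_paid = amount_owed
    return rec["amount_paid"] == x and rec["amount_paid"] == rec["amount_owed"]

def q(rec, x):
    # status = "shipped" ∨ amount_refunded = x
    return rec["status"] == "shipped" or rec["amount_refunded"] == x

def holds_g_implies_f(run, x):
    """G(p -> F q) over a finite prefix, for one value of the global variable x."""
    return all(any(q(later, x) for later in run[i:])
               for i, rec in enumerate(run) if p(rec, x))

def check_phi1_prefix(run):
    """Approximate ∀x by ranging over the payment amounts occurring in the prefix."""
    return all(holds_g_implies_f(run, x) for x in {r["amount_paid"] for r in run})

run = [
    {"status": "processing", "amount_owed": 60, "amount_paid": 60, "amount_refunded": None},
    {"status": "shipping",   "amount_owed": 60, "amount_paid": 60, "amount_refunded": None},
    {"status": "shipped",    "amount_owed": 60, "amount_paid": 60, "amount_refunded": None},
]
print(check_phi1_prefix(run))  # True: the correct payment is eventually followed by shipment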
Other applications of verification. As discussed in [21], various useful static analysis problems on business artifacts can be reduced to verification of temporal properties. We mention some of them. Business rules. The basic artifact model is extended in [6] with business rules, in order to support service reuse and customization. Business rules are conditions that can be super-imposed on the pre-conditions of existing services without changing their implementation. They are useful in practice when services are provided by autonomous third-parties, who typically strive for wide applicability and impose as unrestrictive preconditions as possible. When such third-party services are incorporated into a specific business process, this often requires more control over when services apply, in the form of more restrictive pre-conditions. Such additional control may also be needed to ensure compliance with business regulations formulated by third parties, independently of the specific application. Verification of properties in the presence of business rules then becomes of interest and can be addressed by our techniques. A related issue is the detection of redundant business rules, which can also be reduced to a verification problem. Redundant attributes. Another design simplification consists of redundant attribute removal, a problem also raised in [6]. This is formulated as follows. We would like to test whether there is a way to satisfy a property ϕ of runs without using one of the attributes. This easily reduces to a verification problem as well. Verifying termination properties. In some applications, one would like to verify properties relating to termination, e.g. reachability of configurations in which no service can be applied. Note that one can state, within an LTL-FO property, that a configuration of an artifact system is blocking. This allows reducing termination questions to verification.
4 Automatic Verification of Artifact Systems

Verification without constraints or dependencies. We consider first artifact systems and properties without arithmetic constraints or data dependencies. This case was studied in [21], with a slightly richer model in which artifacts can carry some limited relational state information (however, here we stick for simplicity to the earlier minimalistic model). The main result is the following.

Theorem 1. It is decidable, given an artifact system A with no data dependencies or arithmetic constraints, and an LTL-FO property ϕ with no arithmetic constraints, whether A satisfies ϕ. The complexity of verification is PSPACE-complete for fixed-arity database and artifacts, and EXPSPACE otherwise.

This is the best one can expect, given that even very simple static analysis problems for finite-state systems are already PSPACE-complete. The main idea behind the verification algorithm is to explore the space of runs of the artifact system using symbolic runs rather than actual runs. This is based on the fact that the relevant information at each instant is the pattern of connections in the database
between attribute values of the current and successor artifact records in the run, referred to as their isomorphism type. Indeed, the sequence of isomorphism types in a run can be generated symbolically and is enough to determine satisfaction of the property. Since each isomorphism type can be represented by a polynomial number of tuples (for fixed arity), this yields a PSPACE verification algorithm.

It turns out that the verification algorithm can be extended to specifications and properties that use a total order on the data domain, which is useful in many cases. This however complicates the algorithm considerably, since the order imposes global constraints that are not captured by the local isomorphism types. The algorithm was first extended in [21] for the case of a dense countable order with no end-points. This was later generalized to an arbitrary total order by Segoufin and Torunczyk [48] using automata-theoretic techniques. In both cases, the worst-case complexity remains PSPACE.

Verification with arithmetic constraints and data dependencies. Unfortunately, Theorem 1 fails even in the presence of simple data dependencies or arithmetic. Specifically, as shown in [21, 16], verification becomes undecidable as soon as the database is equipped with at least one key dependency, or if the specification of the artifact system uses simple arithmetic constraints allowing to increment and decrement by one the value of some attributes. Therefore, a restriction is needed to achieve decidability. We discuss this next.

To gain some intuition, consider the undecidability of verification for artifact systems with increments and decrements. The proof of undecidability is based on the ability of such systems to simulate counter machines, for which the problem of state reachability is known to be undecidable [41]. To simulate counter machines, an artifact system uses an attribute for each counter. A service performs an increment (or decrement) operation by "feeding back" the incremented (or decremented) value into the next occurrence of the corresponding attribute. To simulate counters, this must be done an unbounded number of times. To prevent such computations, the restriction imposed in [16] is designed to limit the data flow between occurrences of the same artifact attribute at different times in runs of the system that satisfy the desired property.

As a first cut, a possible restriction would prevent any data flow path between unequal occurrences of the same artifact attribute. Let us call this restriction acyclicity. While acyclicity would achieve the goal of rendering verification decidable, it is too strong for many practical situations. In our running example, a customer can choose a shipping type and coupon and repeatedly change her mind and start over. Such repeated performance of a task is useful in many scenarios, but would be prohibited by acyclicity of the data flow. To this end, we define in [16] a more permissive restriction called feedback freedom. The formal definition considers, for each run, a graph capturing the data flow among variables, and imposes a restriction on the graph. Intuitively, paths among different occurrences of the same attribute are permitted, but only as long as each value of the attribute is independent of its previous values. This is ensured by a syntactic condition that takes into account both the artifact system and the property to be verified. We omit here the rather technical details (see the footnote below).
[Footnote: It is shown in [16] that feedback freedom of an artifact system together with an LTL-FO property can be checked in PSPACE by reduction to a test of emptiness of a two-way alternating finite-state automaton.]
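The following sketch is only a loose illustration of this kind of data-flow analysis; it is not the feedback freedom test of [16] (whose exact definition is omitted above and also depends on the property). It unfolds a crude, hand-extracted "next value depends on current value" relation over a bounded number of steps and reports the attributes that the stricter acyclicity restriction would reject, i.e. those with a flow path between two of their own occurrences.

def flow_paths(dependencies, horizon=10):
    """Unfold per-step dependencies (next_attr <- cur_attr) over `horizon` steps
    and return all pairs of attribute occurrences connected by a data-flow path."""
    edges = {((src, t), (dst, t + 1)) for t in range(horizon) for dst, src in dependencies}
    reach = set(edges)
    while True:
        new = {(a, d) for (a, b) in reach for (c, d) in reach if b == c} - reach
        if not new:
            return reach
        reach |= new

def rejected_by_acyclicity(dependencies, attributes, horizon=10):
    """Attributes with a path between two of their own occurrences (forbidden by
    the strict acyclicity restriction; allowed, under conditions, by feedback freedom)."""
    reach = flow_paths(dependencies, horizon)
    return sorted(a for a in attributes
                  if any(((a, i), (a, j)) in reach
                         for i in range(horizon + 1) for j in range(i + 1, horizon + 1)))

# Dependencies read off, very informally, from the example's post-conditions:
# prod_id' = prod_id and ship_type' = ship_type carry values across steps,
# while amount_owed' is recomputed from database values each time.
deps = [("prod_id", "prod_id"), ("ship_type", "ship_type")]
print(rejected_by_acyclicity(deps, ["prod_id", "ship_type", "amount_owed"]))
# -> ['prod_id', 'ship_type']: acyclicity is indeed too strong for the example,
#    which is why the more permissive feedback freedom condition is needed.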
There is evidence that the feedback freedom condition is permissive enough to capture a wide class of applications of practical interest. Indeed, this is confirmed by numerous examples of practical business processes modeled as artifact systems, encountered in our collaboration with IBM. Many of these, including typical e-commerce applications, satisfy the feedback freedom condition. The underlying reason seems to be that business processes are usually not meant to "chain" an unbounded number of tasks together, with the output of each task being input to the next. Instead, the unboundedness is usually confined to two forms, both consistent with feedback freedom:

1. Allowing a certain task to undo and retry an unbounded number of times, with each retrial independent of previous ones, and depending only on a context that remains unchanged throughout the retrial phase. A typical example is repeatedly providing credit card information until the payment goes through, while the order details remain unchanged. Another is the situation in which an order is filled according to sequentially ordered phases, where the customer can repeatedly change her mind within each phase while the input provided in the previous phases remains unchanged (e.g. changing her mind about the shipment type for the same product, the rental car reservation for the same flight, etc.)
2. Allowing a task to batch-process an unbounded collection of inputs, each processed independently, within an otherwise unchanged context (e.g. sending invitations to an event to all attendants on the list, for the same event details).

Feedback freedom turns out to ensure decidability of verification in the presence of arithmetic constraints, and also under a large class of data dependencies including key and foreign key constraints on the database.

Theorem 2. [16] It is decidable, given an artifact system A whose database satisfies a set of key and foreign key constraints, and an LTL-FO property ϕ such that (A, ϕ) is feedback free, whether every run of A on a valid database satisfies ϕ.

The intuition behind decidability is the following. Recall the verification algorithm of Theorem 1. Because of the data dependencies and arithmetic constraints, the isomorphism types of symbolic runs are no longer sufficient, because every artifact record in a run is constrained by the entire history leading up to it. This can be specified as an ∃FO formula using one quantified variable for each artifact attribute occurring in the history, referred to as the inherited constraint of the record. The key observation is that due to feedback freedom, the inherited constraint can be rewritten into an ∃FO formula with quantifier rank (see the footnote below) bounded by k², where k is the number of attributes of the artifact. This implies that there are only finitely many non-equivalent inherited constraints. This allows us to use again a symbolic run approach to verification, by replacing isomorphism types with inherited constraints.

Complexity and implementation. Unfortunately, there is a price to pay for the extension to data dependencies and arithmetic in terms of complexity. Recall that verification without arithmetic constraints or data dependencies can be done in PSPACE. In
[Footnote 2: The quantifier rank of a formula is the maximum number of quantifiers occurring along a path from root to leaf in the syntax tree of the formula, see [37].]
contrast, the worst-case complexity of the extended verification algorithm is very high: hyperexponential in k² (i.e. a tower of exponentials of height k²). The complexity is more palatable when the clusters of artifact attributes that are mutually dependent are bounded by a small w. In this case, the complexity is hyperexponential in w². If the more stringent acyclicity restriction holds, the complexity drops to 2-EXPTIME in k.

What does the high complexity say in terms of potential for implementation? As has often been noted, many static analysis tasks that have very high worst-case complexity turn out to be amenable to implementation that is reasonably efficient in most practical cases. Whether this holds for verification of business artifacts can only be settled by means of an implementation, which is currently in progress. Previous experience provides some ground for optimism. Indeed, a successful implementation of a verifier for Web site specifications that are technically similar to business artifacts without data dependencies or arithmetic is described in [23]. Despite the theoretical PSPACE worst-case complexity, the verifier checks properties of most specifications extremely efficiently, in a matter of seconds. This relies on a mix of database optimization and model checking techniques. We plan to build upon this approach in our implementation of a verifier for business artifacts.
5 Conclusions

We presented recent work on automatic verification of data-centric business processes, focusing on a simple abstraction of IBM's business artifacts. Rather than using general-purpose software verification tools, we focused on identifying restricted but practically relevant classes of business processes and properties for which fully automatic verification is possible. The results suggest that significant classes of data-centric business processes may be amenable to automatic verification. While the worst-case complexity of the verification algorithms we considered ranges from PSPACE to non-elementary, previous experience suggests that the behavior may be much better for realistic specifications and properties. We plan to further explore the practical potential of our approach using real-world business artifact specifications made available through our collaboration with IBM. More broadly, it is likely that the techniques presented here may be fruitfully used with other data-aware business process models, beyond artifact systems.
References 1. Abiteboul, S., Segoufin, L., Vianu, V.: Static analysis of active XML systems. ACM Trans. Database Syst. 34(4) (2009) 2. Abiteboul, S., Vianu, V., Fordham, B.S., Yesha, Y.: Relational transducers for electronic commerce. JCSS 61(2), 236–269 (2000), Extended abstract in PODS 1998 3. Hariri, B.B., Calvanese, D., De Giacomo, G., De Masellis, R., Felli, P.: Foundations of relational artifacts verification. In: Rinderle, S., Toumani, F., Wolf, K. (eds.) BPM 2011. LNCS, vol. 6896. Springer, Heidelberg (2011) 4. Bhattacharya, K., Caswell, N.S., Kumaran, S., Nigam, A., Wu, F.Y.: Artifact-centered operational modeling: Lessons from customer engagements. IBM Systems Journal 46(4), 703–721 (2007)
5. Bhattacharya, K., et al.: A model-driven approach to industrializing discovery processes in pharmaceutical research. IBM Systems Journal 44(1), 145–162 (2005) 6. Bhattacharya, K., Gerede, C.E., Hull, R., Liu, R., Su, J.: Towards formal analysis of artifactcentric business process models. In: Alonso, G., Dadam, P., Rosemann, M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 288–304. Springer, Heidelberg (2007) 7. Bojanczyk, M., Muscholl, A., Schwentick, T., Segoufin, L., David, C.: Two-variable logic on words with data. In: LICS, pp. 7–16 (2006) 8. Bouajjani, A., Habermehl, P., Jurski, Y., Sighireanu, M.: Rewriting systems with data. In: ´ Csuhaj-Varj´u, E., Esik, Z. (eds.) FCT 2007. LNCS, vol. 4639, pp. 1–22. Springer, Heidelberg (2007) 9. Bouajjani, A., Habermehl, P., Mayr, R.: Automatic verification of recursive procedures with one integer parameter. Theoretical Computer Science 295, 85–106 (2003) 10. Bouajjani, A., Jurski, Y., Sighireanu, M.: A generic framework for reasoning about dynamic networks of infinite-state processes. In: Grumberg, O., Huth, M. (eds.) TACAS 2007. LNCS, vol. 4424, pp. 690–705. Springer, Heidelberg (2007) 11. Bouyer, P.: A logical characterization of data languages. Information Processing Letters 84(2), 75–85 (2002) 12. Bouyer, P., Petit, A., Th´erien, D.: An algebraic approach to data languages and timed languages. Information and Computation 182(2), 137–162 (2003) 13. Burkart, O., Caucal, D., Moller, F., Steffen, B.: Verification of infinite structures. In: Handbook of Process Algebra, pp. 545–623. Elsevier Science, Amsterdam (2001) 14. Calvanese, D., De Giacomo, G., Hull, R., Su, J.: Artifact-centric workflow dominance. In: Baresi, L., Chi, C.-H., Suzuki, J. (eds.) ICSOC-ServiceWave 2009. LNCS, vol. 5900, pp. 130–143. Springer, Heidelberg (2009) 15. Chao, T., et al.: Artifact-based transformation of IBM Global Financing: A case study. In: Dayal, U., Eder, J., Koehler, J., Reijers, H.A. (eds.) BPM 2009. LNCS, vol. 5701, pp. 261– 277. Springer, Heidelberg (2009) 16. Damaggio, E., Deutsch, A., Vianu, V.: Artifact systems with data dependencies and arithmetic. In: Int’l. Conf. on Database Theory, ICDT (2011) 17. Damaggio, E., Hull, R., Vaculin, R.: On the equivalence of incremental and fixpoint semantics for business artifacts with guard-stage-milestone lifecycles. In: Rinderle, S., Toumani, F., Wolf, K. (eds.) BPM 2011. LNCS, vol. 6896. Springer, Heidelberg (2011) 18. de Man, H.: Case management: Cordys approach. BP Trends (February 2009), http://www.bptrends.com 19. Demri, S., Lazi´c, R.: LTL with the Freeze Quantifier and Register Automata. In: LICS, pp. 17–26 (2006) 20. Demri, S., Lazi´c, R., Sangnier, A.: Model checking freeze LTL over one-counter automata. In: Amadio, R.M. (ed.) FOSSACS 2008. LNCS, vol. 4962, pp. 490–504. Springer, Heidelberg (2008) 21. Deutsch, A., Hull, R., Patrizi, F., Vianu, V.: Automatic verification of data-centric business processes. In: ICDT, pp. 252–267 (2009) 22. Deutsch, A., Sui, L., Vianu, V.: Specification and verification of data-driven web applications. JCSS 73(3), 442–474 (2007) 23. Deutsch, A., Sui, L., Vianu, V., Zhou, D.: A system for specification and verification of interactive, data-driven web applications. In: SIGMOD Conference (2006) 24. Dong, G., Hull, R., Kumar, B., Su, J., Zhou, G.: A framework for optimizing distributed workflow executions. In: Connor, R.C.H., Mendelzon, A.O. (eds.) DBPL 1999. LNCS, vol. 1949, pp. 152–167. Springer, Heidelberg (2000) 25. 
Hull, R., et al.: Introducing the guard-stage-milestone approach for specifying business entity lifecycles. In: Proc. of 7th Intl. Workshop on Web Services and Formal Methods, WS-FM, pp. 1–24 (2010)
26. Gerede, C.E., Bhattacharya, K., Su, J.: Static analysis of business artifact-centric operational models. In: IEEE International Conference on Service-Oriented Computing and Applications (2007) 27. Gerede, C.E., Su, J.: Specification and verification of artifact behaviors in business process models. In: Kr¨amer, B.J., Lin, K.-J., Narasimhan, P. (eds.) ICSOC 2007. LNCS, vol. 4749, pp. 181–192. Springer, Heidelberg (2007), http://www.springerlink.com/content/c371144007878627 28. Glushko, R.J., McGrath, T.: Document Engineering: Analyzing and Designing Documents for Business Informatics and Web Services. MIT Press, Cmabridge (2005) 29. Hull, R., Damaggio, E., De Masellis, R., Fournier, F., Gupta, M., Heath III, F., Hobson, S., Linehan, M., Maradugu, S., Nigam, A., Sukaviriya, P., Vacul´ın, R.: Business artifacts with guard-stage-milestone lifecycles: Managing artifact interactions with conditions and events. In: ACM Intl. Conf. on Distributed Event-based Systems, DEBS (2011) 30. Hull, R., Llirbat, F., Kumar, B., Zhou, G., Dong, G., Su, J.: Optimization techniques for data-intensive decision flows. In: Proc. IEEE Intl. Conf. on Data Engineering (ICDE), pp. 281–292 (2000) 31. Hull, R., Llirbat, F., Simon, E., Su, J., Dong, G., Kumar, B., Zhou, G.: Declarative workflows that support easy modification and dynamic browsing. In: Proc. Int. Joint Conf. on Work Activities Coordination and Collaboration (1999) 32. Jurdzinski, M., Lazi´c, R.: Alternation-free modal mu-calculus for data trees. In: LICS, pp. 131–140 (2007) 33. Kumaran, S., Liu, R., Wu, F.Y.: On the duality of information-centric and activity-centric models of business processes. In: Bellahs`ene, Z., L´eonard, M. (eds.) CAiSE 2008. LNCS, vol. 5074, pp. 32–47. Springer, Heidelberg (2008) 34. Kumaran, S., Nandi, P., Heath, T., Bhaskaran, K., Das, R.: ADoc-oriented programming. In: Symp. on Applications and the Internet (SAINT), pp. 334–343 (2003) 35. K¨uster, J., Ryndina, K., Gall, H.: Generation of BPM for object life cycle compliance. In: Alonso, G., Dadam, P., Rosemann, M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 165–181. Springer, Heidelberg (2007) 36. Lazi´c, R., Newcomb, T., Ouaknine, J., Roscoe, A., Worrell, J.: Nets with tokens which carry data. In: Kleijn, J., Yakovlev, A. (eds.) ICATPN 2007. LNCS, vol. 4546, pp. 301–320. Springer, Heidelberg (2007) 37. Libkin, L.: Elements of Finite Model Theory. Springer, Heidelberg (2004) 38. Liu, R., Bhattacharya, K., Wu, F.Y.: Modeling business contexture and behavior using business artifacts. In: Krogstie, J., Opdahl, A.L., Sindre, G. (eds.) CAiSE 2007 and WES 2007. LNCS, vol. 4495, pp. 324–339. Springer, Heidelberg (2007) 39. Martin, D., et al.: OWL-S: Semantic markup for web services, W3C Member Submission (November 2003) 40. McIlraith, S.A., Son, T.C., Zeng, H.: Semantic web services. IEEE Intelligent Systems 16(2), 46–53 (2001) 41. Minsky, M.L.: Computation: finite and infinite machines. Prentice-Hall, Inc., Upper Saddle River (1967) 42. Nandi, P., et al.: Data4BPM, Part 1: Introducing Business Entities and the Business Entity Definition Language (BEDL) (April 2010), http://www.ibm.com/developerworks/websphere/library/ techarticles/1004 nandi/1004 nandi.html 43. Nandi, P., Kumaran, S.: Adaptive business objects – a new component model for business integration. In: Proc. Intl. Conf. on Enterprise Information Systems, pp. 179–188 (2005) 44. Narayanan, S., McIlraith, S.: Simulation, verification and automated composition of web services. In: Intl. World Wide Web Conf, (WWW 2002) (2002)
45. Neven, F., Schwentick, T., Vianu, V.: Finite State Machines for Strings Over Infinite Alphabets. ACM Transactions on Computational Logic 5(3), 403–435 (2004) 46. Nigam, A., Caswell, N.S.: Business artifacts: An approach to operational specification. IBM Systems Journal 42(3), 428–445 (2003) 47. Pnueli, A.: The temporal logic of programs. In: FOCS, pp. 46–57 (1977) 48. Segoufin, L., Torunczyk, S.: Automata based verification over linearly ordered data domains. In: Int’l. Symp. on Theoretical Aspects of Computer Science, STACS (2011) 49. Spielmann, M.: Verification of relational transducers for electronic commerce. JCSS 66(1), 40–65 (2003), Extended abstract in PODS 2000 50. Wang, J., Kumar, A.: A framework for document-driven workflow systems. In: van der Aalst, W.M.P., Benatallah, B., Casati, F., Curbera, F. (eds.) BPM 2005. LNCS, vol. 3649, pp. 285– 301. Springer, Heidelberg (2005) 51. Zhao, X., Su, J., Yang, H., Qiu, Z.: Enforcing constraints on life cycles of business artifacts. In: TASE, pp. 111–118 (2009) 52. Zhu, W.-D., et al.: Advanced Case Management with IBM Case Manager. Published by IBM, http://www.redbooks.ibm.com/redpieces/abstracts/ sg247929.html?Open
A Blueprint for Event-Driven Business Activity Management Christian Janiesch1, Martin Matzner2, and Oliver Müller3 1 Institute of Applied Informatics and Formal Description Methods (AIFB), Karlsruhe Institute of Technology (KIT), Englerstr. 11, Geb 11.40, Karlsruhe, Germany
[email protected] 2 University of Münster, European Research Center for Information Systems (ERCIS), Leonardo-Campus 3, 48149 Münster, Germany
[email protected] 3 University of Liechtenstein, Hilti Chair of Business Process Management, Fürst-Franz-Josef-Strasse 21, 9490 Vaduz, Liechtenstein
[email protected]
Abstract. Timely insight into a company’s business processes is of great importance for operational efficiency. However, companies still struggle today with the inflexibility of monitoring solutions and reacting to process information on time. We review the current state of the art of business process management and analytics and put it in relation to complex event processing to explore process data. Following the tri-partition in complex event processing of event producer, processor, and consumer, we develop an architecture for event-driven business activity management which is capable of delivering blueprints for flexible business activity monitoring as well as closed-loop action to manage the full circle of automated insight to action. We close with a discussion of future research directions. Keywords: Reference architecture, complex event processing, business process management, business process analytics, business activity management.
1 Introduction
Processes represent the core of an enterprise’s value creation. Thus, it is intuitively comprehensible that their performance makes a – if not the – major contribution to an enterprise’s success. Nowadays business process management (BPM) is not as simple as managing processes contained within one company but within a network of suppliers and customers [1]. This entails a multiplicity of companies with disparate locations and information technology. While process interoperability is obviously one major issue in this scenario [2], we want to focus on what is commonly perceived as a second step when realizing such a system: business process analytics that can provide automated insight to action. By law, enterprise systems have to provide some support to log their transactions. Consequently, most process engines store data about the enactment of processes in audit logs. Single log entries indicate e.g. the start of a process, the completion of a
task or the invocation of an application [3]. This entails that a large amount of data is already available which can be used to gain insight into processes. Business activity monitoring (BAM) strives to provide real-time insights into running process instances in order to reduce the latency between the occurrence of critical audit events and decision making on effective responses [4, 5]. However, the common architecture of BAM systems does not cover communication with the process engine but only observes execution. Also, BAM architectures usually require hardwiring the two systems to each other, which makes it difficult to extend the architecture to a network of process engines. We propose a more flexible approach of loosely coupling systems through business events. Business events represent state changes in process execution and can not only be monitored but managed by complex event processing (CEP) engines [6]. Hence, with an event-driven architecture, it is not only possible to observe process information and display it in (near) real-time but also to take immediate action as a result of the analysis, which goes beyond simple alerting capability. We propose an architecture based on the ideas of BPM and CEP which is scalable and flexible so that it can also be used in a networked scenario. The architecture provides a starting point for the design of business activity management solutions as opposed to business activity monitoring. It can cover more elaborate scenarios such as context-aware BPM [7]. The structure of the paper is as follows: In the subsequent section, we thoroughly review related work in BPM and include a wide array of approaches to business process analytics, ranging from controlling and monitoring to predictive methods. We include CEP fundamentals in this discussion. We detail our reference architecture in Section 3 and discuss interaction options between BPM and CEP systems. The architecture enables real-time business activity management rather than mere observation of processes. We close with an outlook to future research opportunities.
2 Related Work
2.1 Processes and Activities
Processes are generally seen as activities performed within or across companies or organizations [8]. Activities are considered to be the smallest entities used to build a process. These entities may be seen as complex chunks, which do not need to be analyzed further, or atomic activities, which cannot be divided any further. Furthermore, an activity can refer either to work that needs to be done manually or to work that can be done automatically. Other definitions of processes [9, 10] emphasize the temporal and local order of activities and also regard different kinds of inputs and value-added outputs. Business Process Management (BPM) refers to a collection of tools and methods for achieving an understanding and then for managing and improving an enterprise’s process portfolio [11]. It plans, controls, and monitors intra- and inter-organizational processes with regard to existing operational sequences and structures in a consistent, continuous, and iterative way of process improvement [12]. A BPM
system or engine allows for the definition, execution, and logging of business processes.
2.2 Business Process Analytics
Business process analysis is a means to fathom organizational behavior within BPM. From a process perspective, it investigates whether the actual behavior of single process instances is compliant with the prescribed business process design. Further, business process analysis might reveal whether resources are well utilized within processes, e.g. whether workload is assigned purposefully among resources. Finally – taking an object perspective – business process analysis helps to explore the life cycle of specific business objects, e.g. to find out whether orders or enquiries have been processed in time and at appropriate costs [13]. Today, enterprise architectures provide various and rich sources for such information, e.g. workflow management systems, enterprise resource planning systems, customer relationship management systems, and product data management systems [14]. The emerging paradigm of context-aware BPM suggests that even more sources of information relevant to process execution [7] may be available in the future via the Web. Single process instances can be analyzed through behavior data that informs on the state of operation of the processes’ constituting activities. Generally, this means that processes are analyzed with a focus on the behavior of completed processes, on currently running process instances, or on predicting the behavior of process instances in the future [3]. Each of these perspectives corresponds to a certain set of methods that is yet to be established in practice [3, 13, 15]:
• Process Controlling is concerned with the ex-post analysis of completed business processes.
• Business Activity Monitoring strives to support the real-time monitoring of currently active business processes’ behavioral properties.
• Process Intelligence is the post-execution forecast of the organization’s future behavior through scenario planning or simulation.
There are two design options [13]. A top-down approach starts with conceptual models to reason about BPM activities. The bottom-up concept utilizes the limited analysis functionality of present BPM solutions, specifically by integrating bypassed data warehouse concepts [16].
Process Controlling
(Historical) process information can be found in log files and event streams of workflow systems and wherever other transactional data of information systems is captured [3]. This information is then stored and processed in data warehouses to allow for ex-post analysis. Data warehouses store event-related data permanently, which might be subject to user-specified analytical roll-ups later on [5]. Thus, BI typically looks at process data on an aggregated level from an external perspective [14]. Process mining has a conceptual position as it strives for the “automatic construction of models explaining the behavior observed in the event log” [17]. Its focus is on concurrent processes rather than on static or mainly sequential structures
[14]. By its very nature, process mining is an ex-post analysis of organizational behavior. Process mining informs real-time business process analytics as it poses comparable questions to the log, namely pointing to the “process perspective (How?), the organizational perspective (Who?), and the case perspective (What?)” [17] of a business process. Thus, it rather has the goal of “looking inside the process” [14]. Typical methods of data mining are association analysis to discover significant associations between attributes of (process) objects, cluster analysis to build heterogeneous groups of homogeneous (process) objects, and classification to assign (process) objects to groups according to their attributes [18].
Business Activity Monitoring
Business Activity Monitoring (BAM) systems, in contrast, focus on capturing and processing events with minimum latency [5] and therefore have to be tightly integrated with operational data sources. BAM foremost strives for the monitoring of key operational business events for changes or trends indicating opportunities or problems, enabling business managers to then take corrective actions [5]. Single recorded events are usually not very meaningful in a BAM context; “instead many events have to be summarized to yield so-called key performance indicators” (KPI) [19]. BAM systems are then typically engaged with updating KPIs [20]. Therefore, a BAM system must be able to detect events occurring in enterprise systems which are relevant to the monitored activity, calculate information for temporal processing, integrate event and contextual business information on the fly while delivering quality data, execute business rules to set thresholds according to key performance indicators and other business-specific triggers, and provide intuitive interfaces for presenting rules and metrics [4, 5]. In contrast, the functionality offered by BAM tools in practice is limited to rather simple performance indicators such as flow time and utilization [17]. In parts, the BAM concept goes beyond tracking KPIs. Instead, software systems contain some event processing functionality, namely “filtering, transformation, and most importantly aggregation of events” [15]. By applying rules engines to KPIs, BAM systems can generate alerts and actions which inform managers of critical situations to possibly alter the behavior of the running process [3]. “This can be done in batch mode […] or in online mode so that the current value of the KPI can be continually tracked, typically on a dashboard” [15].
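As a minimal illustration of the rule-based KPI tracking just described (a hypothetical sketch, not tied to any particular BAM product), the following updates a flow-time KPI from process completion events and raises an alert when a threshold is exceeded; the event structure, window size, and threshold are invented for illustration.

from statistics import mean

FLOW_TIME_THRESHOLD_S = 3600          # assumed business rule: alert if mean flow time exceeds one hour
flow_times = []                        # flow time in seconds per completed process instance

def alert(message):
    print("ALERT:", message)           # stand-in for a dashboard update, e-mail, or SMS notification

def on_process_completed(event):
    # event is assumed to be a dict with 'start_ts' and 'end_ts' timestamps in seconds
    flow_times.append(event["end_ts"] - event["start_ts"])
    kpi = mean(flow_times[-100:])      # KPI over a sliding window of the last 100 instances
    if kpi > FLOW_TIME_THRESHOLD_S:    # business rule evaluated on every update
        alert(f"mean flow time {kpi:.0f}s exceeds threshold {FLOW_TIME_THRESHOLD_S}s")

on_process_completed({"start_ts": 0, "end_ts": 5400})    # triggers an alert (flow time 5400s)
on_process_completed({"start_ts": 100, "end_ts": 700})   # mean drops to 3000s, no alert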
Process Intelligence
Process Intelligence refers to the predictive analysis of process design to influence the behavior of future process instances. This also includes the detection of patterns and links within processes. One can distinguish the process intelligence methods of simulation and optimization as well as – again – data mining [3]. Simulations are commonly used either before a process is deployed for production or after a process has been executed for a reasonable amount of time. In the former case, simulation can assist in assessing the adequacy of the new process for the designated task. In the latter case, variants of the process model can be used for simulation in order to improve or optimize the process design. Simulation usually addresses resource allocation, process structure, and process execution context. Simulation of process instances with methods such as Monte Carlo offers a broader range of possible process simulation results than deterministic what-if scenarios, which focus on manually chosen best-case, worst-case, and common-case scenarios. Trend estimation and forecasting, on the other hand, offer insight into the longer-term horizon of process execution and can help identify future bottlenecks in a process or in the process landscape [21]. Simulation commonly uses log data batches for processing.
2.3 Complex Event Processing
Complex event processing (CEP), in general, comprises a set of techniques for making sense of events in a system by deriving higher-level knowledge from lower-level system events in a timely and online fashion [15, 19, 22]. The central concept in CEP is the event. In everyday usage, an event is something that happens. In CEP, an event is defined as a data object that is a record of an activity that has happened, or is thought of as happening, in a system [22]. The event signifies the activity. In the context of BPM, events represent “state changes of objects within the context of a business process” [3]. Such objects comprise, among others, activities, actors, data elements, information systems, or entire processes operated in a BPM system. Typical events in the context of BPM are, for example, process started, work item assigned, or activity completed [for more examples cf. 23]. Events comprise several data components called attributes [22]. Attributes of an event can include, for example, the ID of the signified activity, start and end time of the activity, the system in which the activity has been executed, or the user who carried out the activity. Another key idea of CEP is the aggregation of events. Through the definition of abstraction relationships it is possible to define complex events by aggregating a set of low-level events. Hence, a complex event signifies a complex activity which consists of all the activities that the aggregated individual events signified [22]. The complex event sub-process completed, for example, is created when the completion of all activities of the sub-process has been signified by the corresponding low-level activity completed events. Orthogonal to the concept of aggregation is the concept of causality. Low-level events might happen in dependence on each other [22]. This dependence relationship is called causality. An event is caused by another event if the former happened only because the latter event happened before. In a business process, the event order declined might be caused by the event credit card check failed. Causality plays an important role in any kind of root cause analysis, either online or ex-post. For extending in particular the BAM concept to more valuable, actionable business-level information, CEP is a good candidate [24], as BPM engines generate events while executing business processes and record them to logs [13]. Today’s BI approaches to process analysis differ from event processing systems in that they are request-driven [15]. They perform ex-post analysis, which can basically be distinguished into analysis based on and not based on a formal representation of the business processes [3].
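To illustrate the notions of event attributes and aggregation described above, here is a small hypothetical sketch that derives a complex "sub-process completed" event once an "activity completed" event has been observed for every activity of a sub-process; the event structure and the activity names are illustrative assumptions, not a standard event format.

from dataclasses import dataclass

@dataclass
class Event:
    type: str          # e.g. "activity completed"
    process_id: str    # the process instance the event belongs to
    activity: str      # the signified activity
    timestamp: float

SUBPROCESS_ACTIVITIES = {"check credit card", "reserve stock", "confirm order"}   # assumed sub-process
completed = {}   # process_id -> set of completed activities seen so far

def on_event(e: Event):
    # Aggregation: return a derived complex event once all low-level events have been observed.
    if e.type != "activity completed":
        return None
    done = completed.setdefault(e.process_id, set())
    done.add(e.activity)
    if SUBPROCESS_ACTIVITIES <= done:
        return Event("sub-process completed", e.process_id, "order handling", e.timestamp)
    return None

Feeding the three activity completed events of one process instance into on_event would yield the derived sub-process completed event on the last call.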
3 A Blueprint for Event-Driven Business Activity Management
3.1 Interaction Points of BPM and CEP
Only recently, so-called event-based or event-driven BI approaches emerged, involving variants of business process analytics which can run online and be triggered by events. These systems are referred to as operational BI by some vendors [15] and, by allowing for event-driven decision making, they overcome former shortcomings of BI approaches to event handling. However, event-driven BI falls short when it comes to the dynamic modeling and use of business rules to define actions upon event occurrences and especially in correlating events into metrics over time [5]. Current trends require BAM to extend its processing functionality and call for a tighter alignment of BPM and CEP in business process analytics: There is (1) “the drive to provide richer types of observation which necessitates more event processing functions such as event pattern detection” and (2) “the need to provide more online observations when data comes from multiple event sources” [15]. However, even when this information is retrieved, “business processes remain static and cannot be changed dynamically to adapt to the actual scenario” [6] and therefore require guidance for translating insight into action. BPM seeks to support the entire life cycle of business processes and therefore requires an integration of the design and execution phases. Accordingly, business process analytics “undoubtedly involves post-execution analysis and reengineering of process models” [13]. Such analysis requires an exploration of “time, cause, and belonging relationships” [6] among the various activities embedded into a BPM system as well as the environment the process is executed in. Here, CEP emerges as a valuable instrument for a more timely analysis of process-related events from a business-level perspective. As an independent system, CEP is “a parallel running platform that defines, analyses, processes events” [25]. BPM and CEP might interact via events. The interaction between BPM and event processing then features two facets, as BPM can act as (a) a producer and as (b) a consumer of events [15]. There are basically two challenges [13] in such a tight integration of BPM and CEP that are not yet addressed by practice: First, how to move from low-level monitoring and representation of rather technical events reported by various enterprise systems to higher-level business information. Second, business insights call for instant actions in terms of round-tripping to the BPM system for either human or automated response, which has to be supported by a tightly aligned BPM-CEP concept. This means that BPM systems generate events which signify state changes within the process of a BPM system. Events are then analyzed by the event processing system and the resulting events are either returned to the BPM system or sent to other systems. BPM systems can then react to situations detected by the event processing system after analyzing events, e.g. an event can trigger a new instance of a business process or it could affect a decision point within the flow of a business process that is already running. Also, it could cause an existing process instance to suspend or stop executing altogether.
The state of the art of business activity monitoring as introduced above primarily deals with the observation of events and the dissemination of results in terms of diagram/dashboard updates, as well as alerting. However, automated insight to action is out of scope for these systems. Thus, we extend upon the real-time concept of business activity monitoring by leveraging the capabilities of CEP to include decision-making and feedback mechanisms to the originating BPM engine or other systems. This effectively means that the CEP engine, in parts, manages rather than observes the execution of the business process. Hence, we deem it more appropriate to speak of business activity management than of monitoring when describing our architecture in the following section.
3.2 Event-Driven Business Activity Management
In the following, we describe how CEP can be applied to monitor and control business processes in real-time. Fig. 1 depicts the proposed blueprint for an event-driven business activity management architecture using the TAM notation. The architecture is based on the usual tri-partition in CEP. In most CEP applications, there is a distinction among the entities that introduce events into the actual CEP system (event producers), the entities that process the events (event processors), and the entities that receive the events after processing (event consumers).
(Figure 1 shows three columns: event producers – a BPM process engine with process definitions and instances, process logging with audit data, and an event publisher/output adapter, plus other producers such as sensors and databases; the event processor – a CEP system with event subscriber/input adapters, an input event stream, an event processing network of filter, transformation, and pattern-detect agents, KPIs, and an output event stream with output adapters; and event consumers – process dashboards/visualization, BPM process engines addressed via web service or API calls, and messaging and actuators.)
Fig. 1. Blueprint for Event-driven Business Activity Management
Event Producer
The major event producers in our scenario are BPM systems themselves. For analytic purposes, two components of a BPM system are of main interest: The process engine provides an environment for the execution of process instances, based on existing process definitions. The process monitoring component offers functions to track the
progress of single process instances and to log corresponding audit data, including data on occurred state transitions of process-relevant objects (e.g. processes, activities, resources, work items). This audit data constitutes a rich source of low-level process events for subsequent analysis, but is usually not accessible from outside the BPM system. Hence, a third component – an event publisher – is required to make the data available for external systems. The main functions of this output adapter are the sniffing, formatting, and sending of events [22]. In the sniffing step, the adapter detects events in the process audit data and other relevant sources (e.g. work item lists). Typically, the sniffer maintains a list of filters that defines which events should be observed and subsequently extracted, to prevent sensitive data from being published or simply to reduce the volume of events. Such filters can be defined on the type level (e.g. only events signifying that an activity has been completed shall be extracted; events of type running and cancelled shall be filtered out) or on the instance level (e.g. only events with a process definition ID from 4711 to 4742 shall be extracted). It is important that event sniffing be benign, i.e. it must not change the target system’s behavior. After detecting an event, the corresponding data has to be extracted and transformed to a specific event format. Most BPM systems use proprietary formats for event representation in their logs. Recently, there have been attempts to standardize event formats, e.g. the XML-based Business Process Analytics Format (BPAF) [23]. After an event has been detected and formatted, a corresponding message has to be sent to the CEP system. Communication can be done in a direct point-to-point fashion or via some kind of message-oriented middleware (for reasons of complexity not depicted in Figure 1). Either way, the use of a publish/subscribe messaging mechanism is a promising way to realize loosely coupled and scalable communication that allows for pushing events to processors. Besides BPM systems, there are a number of other systems that can act as event sources in the context of a business process. Sensors, for instance, can produce events that signify the actual execution of a business process activity or report on real-world events that are happening in the context of a business process (e.g. weather change, traffic conditions, position and movement of objects). Transactional systems, such as ERP or CRM systems, can produce events that are related to a business process but lie outside of the control sphere of a BPM system (e.g. low stock alert, creation of a master data record).
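The sniff-filter-format-send pipeline of the event publisher could look roughly like the following sketch; the BPAF-inspired XML layout, the filter values, and the publish() transport are assumptions for illustration, not an actual adapter implementation.

import xml.etree.ElementTree as ET

TYPE_FILTER = {"activity completed", "process started"}   # type-level filter (assumed)
INSTANCE_FILTER = range(4711, 4743)                        # instance-level filter (assumed)

def sniff(audit_records):
    # Benign read of audit data: select events without changing the source system.
    for rec in audit_records:
        if rec["type"] in TYPE_FILTER and rec["process_def_id"] in INSTANCE_FILTER:
            yield rec

def format_event(rec):
    # Render the record as a simple BPAF-inspired XML snippet (layout is illustrative).
    e = ET.Element("Event", ProcessDefinitionID=str(rec["process_def_id"]),
                   ProcessInstanceID=str(rec["instance_id"]), State=rec["type"])
    return ET.tostring(e, encoding="unicode")

def publish(topic, message):
    print(topic, message)   # stand-in for a publish/subscribe middleware client

for record in sniff([{"type": "activity completed", "process_def_id": 4711, "instance_id": 1}]):
    publish("bpm.events", format_event(record))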
Event Processor
The raw events sent out by the various event producers are received and processed by a central event processor, i.e. the actual CEP system. A typical CEP system comprises three logical components: an input adapter, an event processing network consisting of a set of connected event processing agents, and an output adapter. The input adapter is responsible for transforming incoming events into the internal event format of the CEP system. Adapters are essential to configure a CEP system to run in a given IT landscape. Usually, there is an input adapter for each connected event producer and/or each possible incoming event format. After adaptation, events are queued into an input event stream, ready to be processed by an event processing network. An event processing network (EPN) consists of a number of connected event processing agents (EPA). An EPA is an active component that monitors an event stream to detect and react to certain conditions. Its behavior is defined by a set of so-called event pattern rules. These reactive rules trigger on specific instances or patterns of input events. As a result of executing a rule, the agent creates output events and changes its local state. Three basic classes of EPAs can be distinguished [15]: A filter EPA performs filtering only. It passes on only those input events that are defined in its event pattern rules. A filter does not perform any transformation of events. A transformation EPA processes input events that are defined in its event pattern rules and creates output events that are functions of these input events. A transformation EPA can be stateless or stateful. A special case of transformation EPA is an aggregate EPA, which takes as input a collection of events and creates a single derived event by applying a function over the input events. Other specializations of transformation EPAs are translate, enrich, project, split, and compose. A pattern detect EPA performs a pattern matching function on a stream of input events. It emits one or more derived events if it detects a combination of events in the input stream that satisfies a specified pattern. A basic EPA is usually realized by programming it in a query language such as an SQL variant. In addition, functionality can be made available by implementing an EPA in a scripting or programming language inside the EPN. In certain cases it can even be beneficial to distribute this functionality and have EPAs call remote applications such as a business rules engine to extend their functionality. Having defined a number of EPAs, EPNs can be composed by linking EPAs so that the output of one agent is the input of another agent. The EPAs communicate by sending their output events to other agents. This allows for building complex event processing logic out of lightweight and reusable building blocks. Also, EPNs can be composed into even larger networks of event processors for scalability or query distribution purposes. This way, one EPN acts as the producer for another EPN, which is the consumer. After processing, events are queued into an output event stream, ready to be forwarded to possible event consumers. An event output adapter is responsible for transforming the outgoing events into external formats that can be processed by the connected event consumers. After adaptation, the outgoing data does not necessarily have to represent an event anymore; it can, for example, be in the form of metrics, messages, or function calls (see below).
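A toy event processing network along the lines of the three agent classes above might be wired together as follows; this is a schematic sketch only, and real CEP engines would express such agents in a continuous-query language rather than plain Python.

class FilterEPA:
    def __init__(self, predicate): self.predicate = predicate
    def process(self, events): return [e for e in events if self.predicate(e)]

class TransformEPA:
    def __init__(self, func): self.func = func
    def process(self, events): return [self.func(e) for e in events]

class PatternDetectEPA:
    # Stateful agent: emits a derived event when "order declined" follows "credit card check failed".
    def __init__(self): self.failed = set()
    def process(self, events):
        derived = []
        for e in events:
            if e["type"] == "credit card check failed":
                self.failed.add(e["process_id"])
            elif e["type"] == "order declined" and e["process_id"] in self.failed:
                derived.append({"type": "decline caused by failed card check",
                                "process_id": e["process_id"]})
        return derived

def run_epn(agents, input_stream):
    # The output of one agent is the input of the next, forming the network.
    events = input_stream
    for agent in agents:
        events = agent.process(events)
    return events

epn = [FilterEPA(lambda e: e["type"] != "heartbeat"), PatternDetectEPA()]
print(run_epn(epn, [{"type": "credit card check failed", "process_id": "p1"},
                    {"type": "order declined", "process_id": "p1"}]))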
Event Consumer
An event consumer is an entity that receives and visualizes or acts upon events from a CEP system. It is the logical counterpart of the event producer. As depicted in Fig. 1, dashboards, BPM systems, and messaging services are the main event consumers in the context of business activity management. “A dashboard is a visual display of the most important information needed to achieve one or more objectives; consolidated and arranged on a single screen so the information can be monitored at a glance” [26]. Dashboards provide high-level summaries of information in a concise, clear, intuitive way and often in real-time [26]. This real-time aspect makes dashboards especially appealing in the context of event-driven business activity management, as it fits the concept of event streams. Typical information to be displayed in a process dashboard includes KPIs (e.g. runtime, idle time), occurred or anticipated exceptions (e.g. when certain thresholds of KPIs are or will likely be exceeded), or visualizations of the actual flow/path of process instances. The visualized information is often averaged on the basis of the last n events of an event output stream or on the basis of a certain time window. Apart from being an event producer, BPM systems also represent a key event consumer in our reference architecture. Based on the insights derived from processing events produced by a BPM system, an event processor might initiate actions that influence processes in that very system (or other BPM systems, respectively). An event output adapter of a CEP system might, for instance, make an API call to start new processes or suspend, resume, or cancel running process instances. Likewise, gateways that control the flow of a process instance might be changed by manipulating or updating the conditions of a business rule. The last class of event consumers in our reference architecture comprises various messaging services that can be leveraged to send alert messages to proactively inform process owners about exceptions or disturbing KPIs. A CEP engine can basically use all available channels of communication, such as mobile phone short messaging services (SMS), instant messaging, e-mail, or mass messaging services such as Twitter, to actuate reactions.
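A schematic output-adapter routine on the consumer side could dispatch derived events to the channels discussed above; the consumer interfaces used here (dashboard push, BPM API, SMS gateway) are hypothetical stubs rather than actual product APIs.

class ConsoleDashboard:
    def push(self, kpi, value): print(f"dashboard: {kpi} = {value}")

class StubBpmApi:
    def suspend_instance(self, pid): print(f"BPM API: suspend instance {pid}")

class StubSmsGateway:
    def send(self, to, text): print(f"SMS to {to}: {text}")

def dispatch(event, dashboard=ConsoleDashboard(), bpm=StubBpmApi(), sms=StubSmsGateway()):
    if event["type"] == "kpi update":
        dashboard.push(event["kpi"], event["value"])                  # refresh a dashboard tile
    elif event["type"] == "decline caused by failed card check":
        bpm.suspend_instance(event["process_id"])                     # round-trip action to the BPM engine
    elif event["type"] == "threshold exceeded":
        sms.send("+10000000000", f"KPI {event['kpi']} exceeded its threshold")  # alert the process owner

dispatch({"type": "kpi update", "kpi": "mean flow time", "value": 2950})
dispatch({"type": "decline caused by failed card check", "process_id": "p1"})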
4 Conclusion
Timely insight into a company’s value chain is of great importance to be able to monitor and improve its operations. Business processes are the core building blocks of a company’s operations and provide a plethora of information that can be tapped into. However, business process analytics is often only a second step in BPM. Furthermore, real-time or at least near real-time insight into processes is almost nonexistent in practice. In the past, custom applications have been developed for this requirement. Recently, the topic of operational BI has gained attention through the emergence of more sophisticated CEP engines. However, its dedicated application to business process management is only in its infancy. Nevertheless, it promises cost savings due to the ability to react in a reasonable timeframe to errors, exceptions, or other indicators such as workload, waiting times, etc. We reviewed the current state of the art of BPM, CEP, and the different facets of business process analytics. Based on the existing work, we developed an architecture blueprint for business activity management. Following the tri-partition in CEP, we introduced an architecture that is capable of delivering applications ranging from real-time business activity monitoring to context-aware BPM. It consists of event producers, event processors, and event consumers. We elaborated on the requirements and different options for event producers which provide events to the CEP engine. We also highlighted the requirements for event processors and their integration options with additional systems such as business rules management or applications, e.g. for data mining and simulation. When
introducing event consumers, we elaborated on the different relevant output options for BPM and presented detailed insight into closed-loop action for performing the full circle of activity management. The reference architecture is a software-independent conceptual model that can be instantiated with basically any combination of BPM and CEP systems that are capable of communicating with each other. It enables the monitoring and management of multiple BPM engines through its loose coupling and can also serve as the basis for future context-aware BPM systems. We have implemented a proof of concept of this reference architecture using multiple SAP NetWeaver BPM servers as BPM engines, the Sybase Aleri Streaming Platform as a CEP engine, and SAP BusinessObjects Xcelsius Enterprise as a visualization frontend. All of the above can be downloaded for trial and evaluation and represent state-of-the-art BPM and CEP technology. The implementation is able to showcase the majority of the described use cases. Limitations are due to the shortness of the project and the lack of standard interfaces of the BPM engine to receive and understand process events. We refrained from hardwiring as much as possible in order to retain generalizability. We observed several shortcomings in the process: These include, but are not limited to, the lack of a common business event format, the lack of a standardized set of logging events, and the lack of standardized operations to trigger BPM engines for automated insight to action. On the conceptual side there is a clear need for an integrated or at least synchronized modeling paradigm for business processes and reporting on the one hand, and for business processes and their interaction with event processing networks on the other hand.
References 1. Word, J.: Business Network Transformation: Strategies to Reconfigure Your Business Relationships for Competitive Advantage. Jossey-Bass, San Francisco (2009) 2. Janiesch, C.: Implementing Views on Business Semantics: Model-based Configuration of Business Documents. In: 15th European Conference on Information Systems (ECIS), St. Gallen, pp. 2050–2061 (2007) 3. zur Muehlen, M., Shapiro, R.: Business Process Analytics. In: vom Brocke, J., Rosemann, M. (eds.) Handbook on Business Process Management: Strategic Alignment, Governance, People and Culture, pp. 137–157. Springer, Berlin (2010) 4. Costello, C., Molloy, O.: Towards a Semantic Framework for Business Activity Monitoring and Management. In: AAAI 2008 Spring Symposium: AI Meets Business Rules and Process Management, pp. 1–11. AAAI Press, Stanford (2008) 5. Nesamoney, D.: BAM: Event-driven Business Intelligence for the Real-Time Enterprise. DM Review 14, 38–40 (2004) 6. Hermosillo, G., Seinturier, L., Duchien, L.: Using Complex Event Processing for Dynamic Business Process Adaptation. In: 7th IEEE International Conference on Service Computing (SCC), Miami, FL, pp. 466–473 (2010) 7. Rosemann, M., Recker, J., Flender, C., Ansell, P.: Understanding Context-Awareness in Business Process Design. In: 17th Australasian Conference on Information Systems (ACIS), Adelaide, pp. 1–10 (2006)
8. Object Management Group Inc.: Business Process Modeling Notation (BPMN) Specification 1.0 (2006), http://www.bpmn.org/Documents/OMG%20Final% 20Adopted%20BPMN%201-0%20Spec%2006-02-01.pdf 9. Davenport, T.: Process Innovation: Reengineering Work Through Information Technology. Harvard Business School Press, Boston (1993) 10. Johansson, H.J., McHugh, P., Pendlebury, A.H., Wheeler, W.A.: Business Process Reengineering: Breakpoint Strategies for Market Dominance. John Wiley & Sons, Chichester (1993) 11. zur Muehlen, M., Indulska, M.: Modeling Languages for Business Processes and Business Rules: A Representational Analysis. Information Systems 35, 379–390 (2010) 12. Becker, J., Kugeler, M., Rosemann, M. (eds.): Process Management: A Guide for the Design of Business Processes. Springer, Berlin (2003) 13. Pedrinaci, C., Domingue, J., de Medeiros, A.K.A.: A Core Ontology for Business Process Analysis. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 49–64. Springer, Heidelberg (2008) 14. Verbeek, H.M.W., Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: XES Tools. In: CAiSE Forum 2010. CEUR, vol. 592, pp. 1–8 (2010) 15. Etzion, O., Niblett, P.: Event Processing in Action. Manning Publications, Cincinnati, OH (2010) 16. Grigori, D., Casati, F., Castellanos, M., Dayal, U.: Business Process Intelligence. Computers in Industry Journal 53, 321–343 (2004) 17. van der Aalst, W.M.P., Reijers, H.A., Weijters, A.: Business Process Mining: An Industrial Application. Information Systems 32, 713–732 (2007) 18. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, Amsterdam (2006) 19. Eckert, M.: Complex Event Processing with XChange EQ: Language Design, Formal Semantics, and Incremental Evaluation for Querying Events. LMU München: Faculty of Mathematics, München (2008) 20. McCoy, D.W.: Business Activity Monitoring: Calm Before the Storm. Gartner Research Note LE-15-9727 (2002), http://www.gartner.com/resources/105500/ 105562/105562.pdf 21. Bianchi, M., Boyle, M., Hollingsworth, D.: A Comparison of Methods for Trend Estimation. Applied Economics Letters 6, 103–109 (1999) 22. Luckham, D.: The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems. Addison-Wesley Professional, Boston (2002) 23. Workflow Management Coalition (WfMC): Business Process Analytics Format Draft Version 2.0 (2008), http://www.wfmc.org/index.php?option=com_docman &task=doc_download&gid=437&Itemid=72 24. Zang, C., Fan, Y., Liu, R.: Architecture, Implementation and Application of Complex Rvent Processing in Enterprise Information Systems Based on RFID. Information Systems Frontiers 10, 543–553 (2008) 25. von Ammon, R., Ertlmaier, T., Etzion, O., Kofman, A., Paulus, T.: Integrating Complex Events for Collaborating and Dynamically Changing Business Processes. In: 2nd Workshop on Monitoring, Adaptation and Beyond (MONA+), Stockholm, pp. 1–16 (2009) 26. Few, S.: Dashboard Confusion (2004), http://intelligent-enterprise. informationweek.com/showArticle.jhtml;jsessionid=BJXFCGOZCMU E3QE1GHPCKHWATMY32JVN?articleID=18300136
On Cross-Enterprise Collaboration Lav R. Varshney and Daniel V. Oppenheim (eds.) IBM Thomas J. Watson Research Center, Hawthorne NY 10532, USA {lrvarshn,music}@us.ibm.com
Abstract. Globalization, specialization, and rapid innovation are changing several aspects of business operations. Many large organizations that were once self-sufficient and that could dictate processes to partners and suppliers are now finding this model difficult to sustain. In the emerging model of global service delivery several competing service providers must work collaboratively to develop business solutions, either hierarchically or as peers. Overall project failures, uncontrollable delays, and significant financial losses have been observed in many global service delivery projects, indicating the importance of finding new ways to better support cross-enterprise work. This paper puts forth directions to explore the management and coordination of complex end-to-end processes carried out collaboratively by several organizations. Keywords: coordination, collaboration, work.
1
Introduction
Due to growing globalization [10], and the need for innovation and differentiation, the value of specialization through division of labor is more important now than ever before [12]. Specialized firms that concentrate on a narrow set of tasks can be more productive than firms that are jacks-of-all-trades. Indeed in founding modern economic theory, the opening of Adam Smith’s Wealth of Nations said that “The greatest improvement in the productive powers of labor, and the greater part of the skill, dexterity, and judgement with which it is anywhere directed or applied, seem to have been the effects of the division of labor.” With growing specialization, however, companies that once could do it on their own find they need to collaborate with others in order to achieve large business or societal objectives. As the doing of work shifts from within a single enterprise to a distributed ecosystem of service providers, there is also greater potential flexibility and responsiveness to changing conditions [7,9,8]. Moreover, ecosystems of firms may be able to achieve open innovation more easily than concentrated modes of organization [2]. This emerging model of work, cross-enterprise collaboration, is the focus herein. As way of definition, we consider cross-enterprise collaboration to be when companies partner to achieve a common goal, to realize an opportunity, through
This paper collects together contributions from several members of the SRII Special Interest Group on Cross-Enterprise Collaboration.
Fig. 1. Realizing opportunities through cross-enterprise collaboration
business partnership and joint operations, see Figure 1. From the perspective of this joint goal, they want to behave and operate as one organization. However, each enterprise is additionally separate and may have its own needs. Note that unlike simply integrating supply chains, cross-enterprise collaboration requires joint operations. With the growth in cross-enterprise collaboration and the potential productivity, responsiveness, and innovation benefits that this mode of doing work may provide, it is of interest to understand the nature of problems that arise in such collaborations, and how these problems can be mitigated. Collaboration within the enterprise is already very difficult. Recent IBM surveys of CIOs in top-performing organizations demonstrated that collaboration and integration are especially important. More than two-thirds of surveyed CIOs said they would focus more on internal collaboration and communication, compared to less than half in underperforming organizations [5]. When crossing organizational boundaries, collaboration is much harder. Problems that arise in cross-enterprise collaborations are well recognized in industry. Consider the design and development of the Boeing 787 Dreamliner, Figure 2. To reduce the 787 Dreamliner’s development time to four years from the traditional six years and development costs to 6 billion dollars from the traditional 10 billion dollars, it was developed and produced using unconventional methods new to aircraft manufacturing, where approximately 50 strategic partners of Boeing served as integrators that design/assemble different parts and subsystems. With this new model, communication and coordination among Boeing and its suppliers became critical for managing the progress of the development program [13]. A real-time collaboration system that linked together all of the various partners to a platform of product life-cycle management tools and a shared pool of design data was implemented [14, Chapter 8]. Yet there were a lot of coordination problems. Ultimately, Boeing had to redesign the entire aircraft subassembly process due to visibility, process, and management risks [13]. This is not confined simply to the case of the Boeing 787 Dreamliner. One recent market study by the Aberdeen Group examined more than 160 enterprise products that required collaboration among several engineering disciplines such as mechanical engineering, electronic engineering, computer engineering, control engineering, and system design. Among the people interviewed, 71% identified a need to improve communication and collaboration across silo disciplines; 49%
Fig. 2. Boeing 787 Dreamliner
required improved visibility; and 43% raised the need to implement or alter new product development processes for a multidisciplinary approach. The primary reason for breakdowns was the paucity of support for coordinating the overall work between the silo disciplines [8]. In studying global software development, Wiredu put forth four main causes of problems that also apply more broadly to cross-enterprise collaborations [17,16]. These are:
– Interdependencies that arise from distributed work processes,
– Conflicts of interest that arise due to distributed work teams with localized incentives,
– Technology representation problems that arise from distributed technologies with localized standards, and
– Uncertainties and equivocalities that arise due to geographically and organizationally distributed information.
In the language of economic theory, these are all different kinds of coordination costs. It is well known that coordination costs increase as the degree of specialization increases, thereby limiting the degree of specialization [1,3]. Through a deeper understanding of these potential problems in cross-enterprise collaboration, via the development of detailed use cases as well as general abstractions and theories, along with the development of standards and technologies that support cross-enterprise collaboration, coordination costs may be reduced. With reduced coordination costs allowing greater specialization of firms, society may garner the benefits of improved productivity, agility, and innovation.
2
Goal and Scope
The goal in studying cross-enterprise collaboration is to improve the ability of organizations to collaboratively realize new opportunities. Towards this goal, we have recently formed a special interest group on cross-enterprise collaboration, under the auspices of the Service Research and Innovation Institute (SRII). A call for action and an invitation for participation was issued at the recent SRII Global Conference 2011. We welcome practitioners, researchers, and enthusiasts
to (collaboratively) further the understanding of cross-enterprise collaboration. As discussed in Section 4, there are several work streams that have been initiated towards this goal. To clarify the scope of study, let us note that there are several kinds of work that are being done through cross-enterprise collaboration. One example is product design and development. Whether considering the design of aircraft such as by Airbus or Boeing, or the design of motor vehicles such as by GM and Ford, there is a growing trend towards involving suppliers not just as suppliers but as partners engaged in the design process. The growth of cross-enterprise collaboration is also happening in global software development, where projects are broken across globally distributed teams in several different organizations. Likewise for disaster recovery and its attendant command and control: whether considering multi-agency collaboration in response to the events of September 11, 2001 or recent earthquakes and tsunamis that warranted multinational collaboration. As another example, one can consider the domain of healthcare. There are hospital experts helping silo centers in developing countries, as well as multi-institutional collaborations in cancer research and treatment. We focus on the intersection between business, operations, services, people, and work in cross-enterprise collaborations, Figure 3. We include both collaborations between enterprises and collaborations between organizations within an enterprise. Indeed, as has been noted by others, any successful alliance involves at least three sets of interactions. The first is between the two organizations. But the second is between the negotiator for the first organization and other managers in her company, while the third is between the negotiator for the second organization and other managers in his company. Understanding the internal negotiations helps to structure and manage the alliance better over time [11]. Special focus is placed on implications for Services as modular units of work. By taking a service-oriented view of work, notions of value co-creation [15] become readily apparent. Moreover, we can build up a layered notion of cross-enterprise collaboration that incorporates several dimensions, Figure 4.
3
Existing Research
With such global importance, elements of cross-enterprise collaboration have certainly been studied before and several problems have already been solved. Writing in 2001, Hammer described a basic protocol issue in cross-enterprise collaboration [4, p. 170]: Expediters, schedulers, and a host of clerical personnel were required to manage the interface between Geon and OxyVinyls; a vast amount of work had to be performed twice, once on each side of the new divide. Unsurprisingly, this added work added costs. Such basic protocol problems of cross-collaboration have certainly already been solved using modern business process management methods [6].
Fig. 3. The scope of cross-enterprise collaboration includes the interaction of business, operations, services, people, and work
Fig. 4. Dimensions of cross-enterprise collaboration
More recent research and development, e.g., as presented at the 1st International Workshop on Cross Enterprise Collaboration, People, and Work, held in collaboration with the 8th International Conference on Business Process Management (BPM 2010), has led to deeper understanding; however, there is often a disciplinary separation. Current disciplines tend to focus on a limited subset of the aspects that typify cross-enterprise collaboration; the challenge is to bring them together. Business Process Management (BPM), for example, does not adequately support
teamwork around tasks or the creation and execution of dynamic service plans. Enterprise Architecture (EA) models use a layered approach to bridge between the business and IT that does not adequately consider the role of people, process, or organization. Computer Supported Collaborative Work (CSCW) focuses on people, awareness, and distributed collaboration to enable cooperative work; but does not adequately connect this with process, data, or organization. Services Oriented Computing (SOC) tends to focus on composable bite-size processes that can be executed by machines, but does not provide the flexibility required to scale and support complex cross-organizational work. BPEL4People and similar standards do not address the full scope of cross-enterprise work or the complex needs of humans in their various roles. The next section discusses several particular avenues of research.
4
Work Streams
As part of the SRII SIG on cross-enterprise collaboration, we have set out four major work streams. 4.1
Mapping the Domain
It is important to empirically understand, with some specificity, the demands placed on organizations and enterprises when working together. As such this stream will develop schema and taxonomies to qualitatively describe specific cross-enterprise collaborations. These schema and taxonomies will deal with the structural arrangement of collaboration, the dynamics of human interaction, and the issues/challenges that arise in cross-enterprise collaborations. This work stream will build on extant work in case management. It will also build on work in the field of CSCW by extending it from considering the individual as the main agent of study to considering the enterprise as the main agent of study. The concepts of Awareness, Articulation work, and Appropriation stand to be of central importance. The product of this work stream will include a standardized taxonomy of kinds of cross-enterprise collaborations, canonical use-cases for these several kinds of cross-enterprise collaborations, and a characterization of the main difficulties faced in these several kinds of cross-enterprise collaborations. Other expected outcomes include a study on the impact that emerging social computing technologies and methods have on collaboration in the context of service provision and a definition of qualitative metrics for assessing the effectiveness and efficiency of collaboration. 4.2
Formalisms and Standards
A central feature of cross-enterprise collaboration is the need to communicate across institutional boundaries, often governed by agreement frameworks, and always in support of the doing of work. As such, it is important to develop
formal standards and languages to facilitate communication for collaboration. Visibility is also a central concern of enterprises engaged in collaboration, and so it is important to develop standardized quantitative metrics for assessing the effectiveness and efficiency of collaborative activities. Good languages facilitate good metrics. This work stream will build on extant work in BPM on designing, modeling, executing, monitoring, and optimizing collaboration activities. It will also build on existing formalization structures in SOC and C2. This work stream will study cross-enterprise collaborations and develop formal languages to control and coordinate cross-enterprise collaboration in different domains. It will also develop quantitative metrics and formal end-to-end monitoring frameworks to measure the efficacy of collaboration. Finally, it will develop formalisms to allow the construction of governance, privacy, and service-level agreements among enterprises. In particular, SOA formalisms and constructs may be extended to facilitate the definition, dispatch, and orchestration of work as services that can be carried out by and for organizations. When mature, these could lead to cross-industry standards for collaboration. With the rapid introduction of new technologies and the adoption of cloud computing, scalable and cost-effective IT service depends on faultless automation and collaboration between many different service providers and partners. A main objective is to define a clear end-to-end view (standard interfaces and processes) of service delivery to enable companies (vendors, solution integrators/partners and end customers) to consistently integrate (or outsource to multiple IT partners) service components. Such standards should include consistent service monitoring and integration, and techniques to ensure consistent end-to-end service delivery at optimized costs. 4.3
Technologies
Although cross-enterprise collaboration is possible with little technological support, the introduction of advanced technologies can greatly facilitate the process. Indeed, IT middleware is often cast as a technology to mediate between different parties, such as different business enterprises. This work stream will build on extant work in BPM, CSCW, SOC, and Cloud Computing, as it can be applied specifically to the challenges faced in cross-enterprise collaboration, such as communication across institutional boundaries and coordination of workflows. This work stream will work on defining the properties and requirements of computer technologies needed to support cross-enterprise collaborations. In particular, it will study the context, data, and knowledge management requirements for managing and coordinating work across organizations and their interrelationship with domain data, tools, and processes. It will also serve to standardize the requirements for IT, middleware, systems, tools, and frameworks that support cross-enterprise collaboration, and delineate relationships with current enterprise or domain-specific tools and information technologies.
4.4
Abstractions and Models
Theoretical abstractions and mathematical models have had an uncanny ability to provide insights into the physical and social worlds. In particular, mathematical engineering theories have provided insights into the fundamental limits of engineered systems, no matter how much ingenuity is brought to bear on a problem. As such, developing abstractions and models of general cross-enterprise collaboration stands to provide fundamental insights. Moreover, the language of mathematics also allows the use of optimization theory which may lead to best practices and policies for cross-enterprise collaboration. This work stream will build on extant work in information theory, control theory, theories of the firm in economics and management science, value-oriented service science, as well as optimization theory. It will also allow the incorporation of mathematically-oriented data analysis methods. In a theoretical framework, new approaches of distributed work, such as crowdsourcing, can be compared to other forms of service organization. The product of this work stream will eventually be a mathematical theory of cross-enterprise collaboration. Shorter-term goals include optimization algorithms for particular kinds of collaboration structures and optimal control/coordination policies for cross-enterprise collaboration.
5
Closing Remarks
Whether due to cultural and organizational differences between potential partners or interoperability issues among their information technology infrastructures and business processes, there are severe challenges to making cross-enterprise collaborations work. Building on a well-founded taxonomy of cross-enterprise collaboration developed from detailed use cases, a fundamental understanding of the cross-enterprise collaboration problem may be possible from mathematical foundations. Further, standards and technological specifications may reduce coordination costs among enterprises and allow the emergence of more specialized firms working in partnership with others in a global marketplace. There is much work to be done. Acknowledgments. This paper collects together contributions from several members of the SRII Special Interest Group on Cross-Enterprise Collaboration.
References
1. Becker, G.S., Murphy, K.M.: The division of labor, coordination costs, and knowledge. The Quarterly Journal of Economics 107(4), 1137–1160 (1992)
2. Chesbrough, H.: Open Services Innovation. Jossey-Bass, San Francisco (2011)
3. Ehret, M., Wirtz, J.: Division of labor between firms: Business services, non-ownership-value and the rise of the service economy. Service Science 2(3), 136–145 (2010)
4. Hammer, M.: The Agenda. Crown Business, New York (2001)
5. IBM: The essential CIO: Insights from the global chief information officer study (May 2011)
6. Leymann, F., Roller, D.: Production Workflow: Concepts and Techniques. Prentice-Hall, Upper Saddle River (2000)
7. Mehandjiev, N., Grefen, P.: Dynamic Business Process Formation for Instant Virtual Enterprises. Springer, London (2010)
8. Oppenheim, D., Bagheri, S., Ratakonda, K., Chee, Y.M.: Coordinating distributed operations. In: Maximilien, E.M., Rossi, G., Yuan, S.T., Ludwig, H., Fantinato, M. (eds.) ICSOC 2010. LNCS, vol. 6568, pp. 213–224. Springer, Heidelberg (2011)
9. Oppenheim, D., Ratakonda, K., Chee, Y.M.: Enterprise oriented services. In: Dan, A., Gittler, F., Toumani, F. (eds.) ICSOC/ServiceWave 2009. LNCS, vol. 6275, pp. 82–95. Springer, Heidelberg (2010)
10. Palmisano, S.J.: The globally integrated enterprise. Foreign Affairs 85(3), 127–136 (2006)
11. Slowinski, G., Segal, M.: The Strongest Link. Amacom, New York (2003)
12. Stigler, G.J.: The division of labor is limited by the extent of the market. The Journal of Political Economy 59(3), 185–193 (1951)
13. Tang, C.S., Zimmerman, J.D.: Managing new product development and supply chain risks: The Boeing 787 case. Supply Chain Forum 10(2), 74–86 (2009)
14. Tapscott, D., Williams, A.D.: Wikinomics: How Mass Collaboration Changes Everything, expanded edn. Portfolio Penguin, New York (2006)
15. Vargo, S.L., Maglio, P.P., Akaka, M.A.: On value and value co-creation: A service systems and service logic perspective. European Management Journal 26(3), 145–152 (2008)
16. Varshney, L.R., Oppenheim, D.V.: Coordinating global service delivery in the presence of uncertainty. In: Proceedings of the 12th International Research Symposium on Service Excellence in Management (QUIS12) (June 2011)
17. Wiredu, G.O.: A framework for the analysis of coordination in global software development. In: Proceedings of the 2006 International Workshop on Global Software Development for the Practitioner, pp. 38–44 (May 2006)
Source Code Partitioning Using Process Mining
Koki Kato¹, Tsuyoshi Kanai¹, and Sanya Uehara²
¹ Software Innovation Lab., Fujitsu Laboratories Ltd., Japan
² Fujitsu Ltd., Japan
{kato koki,kanai.go,s.uehara}@jp.fujitsu.com
Abstract. Software maintenance of business application software, such as adding new functions and applying anti-aging measures, should be performed cost-effectively. Information such as the grouping of business activities that are executed as a unit, the source code that corresponds to those activities, and the execution volume of the activities is useful for deciding which areas of a business application to invest in and for prioritizing maintenance requests. We propose a new method which extracts such information using the BPM-E process mining tool we have developed. The method was applied to in-house business systems; the results showed that the method successfully extracted the grouping of events, but that there are accuracy issues in associating events with source code. Keywords: process mining, source code analysis, business application maintenance.
1
Introduction
Business application software in enterprises requires new features and functions to adapt to changing customer needs and corporate strategies. Adaptive maintenance [2] is applied to business applications to add new business functions that adapt to business trends. When an application is used for a long time, adaptive maintenance requires more human resources than developing the application itself [8]. As there may be many modification requests to fulfill, it is necessary to prioritize them within budget. It is also important to estimate the extent of the impact on the source code when adding a new business function. Maintenance of aging software tends to become more difficult (and expensive), since updates gradually destroy the original structure of the application [8]. Anti-aging methods such as refactoring and restructuring are kinds of perfective maintenance [2] which improve maintainability [5]. If perfective maintenance is applied, a cost reduction for subsequent function additions can be expected, although the perfective maintenance itself incurs a cost. To perform anti-aging methods cost-effectively, they should be applied only to those parts of the source code that are related to a group of business activities that are executed as a unit and to which many changes are expected in the future. We call such a unit of business activities a 'work package' in this paper. A work package whose execution volume shows an increasing trend can be regarded as a growing part of the business,
and new functions are likely to be added to such a work package. Therefore, a new technique is needed to extract work packages and their execution volume and to relate work packages to source code, so that the priorities of adaptive and perfective maintenance can be examined in terms of work packages, based on trends in business execution. Process mining [3] is a technology for analyzing business operations based on event logs, and it can extract useful data such as the types of events and their execution volume without placing any load on the business system. In this paper, we propose a new method which extracts work packages using BPM-E, a process mining tool, and associates work packages with portions of the source code of business application(s). After introducing BPM-E, we describe a method for extracting work packages and a method for extracting the source code corresponding to the events. We show and discuss the results of applying the method to two in-house business systems.
2
BPM-E
Business Process Management by Evidence (BPM-E, or Automated Process Discovery (APD))[1] is a process mining tool that has been developed at Fujitsu Laboratories since 2004, and has already been used to analyze customers’ business in the US, Europe and Japan. One of the distinguishing features of BPM-E is that it extracts business process flows directly from copies of transactional data (in Comma Separated Value (CSV) format); there is no need to convert data to a specific input format, and it is easy to dump databases in CSV format. 2.1
Events
A business activity is called an event in BPM-E. An event is expressed by three kinds of information (id, type, timestamp), where id is a business identifier (such as ‘ORD1234567’ of an ‘order number’), type is an identifier that shows which kind of event was generated (such as ‘acceptance of order’ and ‘packaging’) and is also called an ‘event class’, and timestamp is the time at which the event occurred. 2.2
Classification of Tables
There are two main types of table in enterprise systems:
– Master data tables: These maintain lists of identifiers such as bills of materials (BOM), and are updated infrequently.
– Transaction tables: These contain data (such as IDs and time-stamps) of daily business operations.
BPM-E uses transaction tables as input data. The transaction tables are categorized as follows based on the table structure:
– Log/history type: An event is recorded as a record which contains at least an ID, a time-stamp, and a type of event (an event class). There may be multiple records for one ID in a table. The event class for a record may be distinguished by the combination of multiple column values.
– No column for storing event class values: There are two ways to express the event class:
  • The table name indicates the event class: Events of a certain event class are recorded in a dedicated table (e.g., a table for 'acceptance of orders'). The event class is identified by the table name.
  • The column name indicates the event class: A table has multiple time-stamp columns. When an event of a certain event class arises, the corresponding column of the record is updated with the time-stamp.
– Summary table for progress reporting: A table for progress reporting, which has multiple time-stamp columns. Each time-stamp column corresponds to an event class. This is a variation of the above 'the column name indicates the event class', which differs in that the time-stamp values are duplicates; the values are written together with other transaction tables, or the values are replicated from other tables periodically.
The above-mentioned types may exist together, and BPM-E can process them. The main analysis functions of BPM-E include the business process flow diagram, the exceptional flow diagram, the narrow-down search, and the flow type display.
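To make the three table shapes concrete, the following Python sketch shows how events of the form (id, type, timestamp) could be derived from each shape. It is an illustrative reading of the classification above, not BPM-E's actual implementation; table, column, and function names are assumptions.

def events_from_log_table(rows, id_col, ts_col, class_cols):
    # Log/history type: one record per event; the event class may be the
    # combination of several column values.
    for r in rows:
        yield r[id_col], "|".join(str(r[c]) for c in class_cols), r[ts_col]

def events_from_dedicated_table(rows, table_name, id_col, ts_col):
    # "The table name indicates the event class": every record is one event.
    for r in rows:
        yield r[id_col], table_name, r[ts_col]

def events_from_timestamp_columns(rows, id_col, ts_cols):
    # "The column name indicates the event class": each filled time-stamp
    # column of a record yields one event.
    for r in rows:
        for col in ts_cols:
            if r.get(col):
                yield r[id_col], col, r[col]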
3
Problem and Notations
In this section, we define a problem and some notations. We assume that the business applications are written in Java and SQL. When using process mining tools such as BPM-E, we sometimes observe that some events (activities) form a unit, and such a unit is executed in different flow types. We call such a unit a 'work package' in this paper. A work package Gi is defined as a set of event classes {Ei1, Ei2, . . .}, and an event class belongs to only one work package. As the events of a work package are strongly tied from the viewpoint of business execution, it is presumed that the source code related to the events of a work package is also strongly related. Therefore, the impact of adding a function or refactoring is expected to occur at the granularity of work packages. The source code related to a work package may not correspond to the static source structure; sometimes a work package and its corresponding source code may extend over multiple business systems. When maintenance such as adding functions or refactoring is applied only to the source code corresponding to a work package, a cost reduction can be expected because the source code that must be surveyed is limited. On the other hand, there is utility-like source code that is called from various points; when such source code is changed, the change influences other source code.
Therefore, we would like to solve the problem of deriving a set of work packages, the source code (methods) that corresponds to each work package, and a set of commonly used methods corresponding to common events, from the flow types (as well as their execution volume) and a set of source code. The problem can be described as follows:
Problem 1. Find a tuple (G, MG, CM) from the tuple (F, code), where
– G = {G0, G1, . . .} is the set of work packages,
– MG = {MG0, MG1, . . .} is the set of sets of Java methods corresponding to work packages, and MGi is the set of methods corresponding to the work package Gi,
– CM is the set of common methods (described later),
– F = {F1, F2, . . . , FM} is the set of business process flow types, which has M elements and is ordered so that #(Fi) ≥ #(Fj) for i < j, where #(Fi) is the number of occurrences of business process flow type Fi,
– Fi = (Ei1, Ei2, . . . , Eini) is a business process flow type, and
– code is the set of source code of the business application(s).
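As a reading aid, the following Python sketch spells out the objects in Problem 1 under assumed type and function names; it fixes the notation only and leaves the actual derivation to the procedures of Section 4.

from typing import Dict, List, Set, Tuple

EventClass = str
FlowType = Tuple[EventClass, ...]            # Fi = (Ei1, Ei2, ..., Eini)
WorkPackage = Set[EventClass]                # Gi; G = [G0, G1, ...]
MethodSets = List[Set[str]]                  # MG = [MG0, MG1, ...], Java method names
CommonMethods = Set[str]                     # CM

def solve(flows: List[FlowType],             # F, ordered by decreasing #(Fi)
          counts: Dict[FlowType, int],       # #(Fi)
          code: List[str]                    # paths to the application source files
          ) -> Tuple[List[WorkPackage], MethodSets, CommonMethods]:
    """Derive (G, MG, CM) from (F, code); the three steps are given in Sec. 4."""
    raise NotImplementedError                # placeholder signature only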
4
Processing Procedures
We propose the following procedure for solving the problem:
1. Extract work packages from business process flow types.
2. Estimate the source code which corresponds to events directly.
3. Determine the relevant source code among the above source code.
4.1
Extracting Work Packages
When we examine business process flow types, we find that there are two types of work packages. One is a work package that characterizes flow type(s); for instance, in an order-receiving system, a manufacturing-request operation may characterize the business operation performed at a stock shortage. The other is a work package that does not depend on the variation of the business; for instance, an accepting-order operation will almost always occur regardless of the stock condition or the method of delivery. We classify work packages as follows:
– Common events (G0): The work package which is executed in almost all business process flow types. Depending on the circumstances, common events may not exist.
– (Other) work packages (G1, G2, . . .): The sets of event classes which are specific to business operations.
The work packages are extracted as follows:
1. Extract common events.
2. For the remaining events, create groups of the events using the business process flow types in order of the number of executions.
3. Compensate the groups.
P ← G0
for i = 1 to M do
  Gi ← {}
  for j = 1 to ni do
    if Eij ∉ P then
      Gi ← Gi + Eij
      P ← P + Eij
    end if
  end for
end for
(a) Extracting work packages.
for j = M downto 2 do
  for i = 1 to M do
    Ci ← 0 {clear counter}
  end for
  for i = 1 to M do
    if Fi ⊃ Gj then
      for k = 1 to j − 1 do
        if Fi ⊃ Gk then
          Ck ← Ck + #(Fi)
        end if
      end for
    end if
  end for
  Ctotal ← Σ(i=1..M) Ci
  Cmax ← max(C1, . . . , CM)
  maxPos ← x, where Cmax = Cx
  if Cmax / Ctotal > τm then
    Gmove.put(j, maxPos)
  end if
end for
for j = M downto 1 do
  maxPos ← Gmove.get(j)
  if maxPos ≠ null then
    GmaxPos ← GmaxPos + Gj
    Gj ← {}
  end if
end for
(b) Compensation.
Fig. 1. Algorithm of extracting work packages
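The following is a small, runnable Python sketch of the extraction in Fig. 1(a), combined with the common-event step described next; the compensation pass of Fig. 1(b) is omitted. The input layout, the threshold handling, and all names are illustrative assumptions rather than BPM-E code.

from collections import Counter

def extract_work_packages(flow_types, tau_c=0.8):
    """flow_types: list of event-class tuples F1..FM, already sorted by
    decreasing execution volume #(Fi). Returns [G0, G1, ...]."""
    if not flow_types:
        return [set()]
    # Common events G0: event classes present in more than tau_c of the flow types.
    presence = Counter(e for flow in flow_types for e in set(flow))
    g0 = {e for e, c in presence.items() if c / len(flow_types) > tau_c}
    processed = set(g0)                      # P <- G0
    packages = [g0]
    for flow in flow_types:                  # Fig. 1(a): greedy registration
        gi = set()
        for e in flow:
            if e not in processed:
                gi.add(e)
                processed.add(e)
        if gi:                               # the paper keeps empty Gi; we skip them
            packages.append(gi)
    return packages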
Extraction of Common Events. The common events are calculated and set as G0 by finding the event classes for which the ratio of business process flow types that include the event class exceeds a threshold (τc).
Extraction of Work Packages. We assume that the larger the execution volume of a certain business process flow type, the more important it is for the enterprise. Therefore, when extracting the work packages, it is appropriate to give priority to business process flow types with larger execution volumes. The sets of event classes which are not yet classified are then registered as work packages in the order of the execution frequency of the business process flow types. The pseudo code is shown in Fig. 1 (a). In the figure, P is a variable that stores already processed event classes.
Compensation Process. The work packages obtained in the above step may contain those that should be merged with other work packages to obtain larger work packages. Therefore, starting from the last registered work package, if a business process flow type contains a work package together with another work package with high
frequency (i.e., the ratio is more than τm), these work packages are merged. The pseudo code is shown in Fig. 1 (b). In the figure, 'Gmove' is an associative memory; Gmove.put(i, j) stores j in association with i, and Gmove.get(i) obtains the value associated with i (i.e., j). Ck accumulates the number of occurrences of flow types which include all events of work package Gk as well as Gj, for k < j. The ratio Cmax/Ctotal indicates how often both GmaxPos and Gj are included in flow instances simultaneously, and therefore whether Gj can be merged into GmaxPos. The execution volume of each work package is estimated from the execution volume of its events. 4.2
Estimating the Source Code Which Corresponds to Events Directly
The estimation makes use of the fact that the events of BPM-E, which correspond to the data in the table(s), are generated by write operations (the execution of INSERT/UPDATE), according to the SQL INSERT/UPDATE statements in Java methods. An event should be related to a method that generates an INSERT or UPDATE statement, not to a method that merely executes SQL statements, because the latter may be a common utility invoked from different methods with SQL statements passed as parameters. Methods which correspond to events directly are extracted as follows:
1. Extract the tuples (SQL statement, m) from the source code, where SQL statement is a string that is grammatically correct and contains the string "INSERT" or "UPDATE", and m is the method which contains that SQL statement.
2. Extract the table name, time-stamp column name, and event class which correspond to the event class Ei from the BPM-E definition information.
3. When the above relation exists, store the method m in MGi.
Note that a certain method may correspond to two or more event classes.
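A simplified sketch of step 1 is shown below: it scans Java source for string literals that look like INSERT/UPDATE statements and relates the enclosing method to an event class via the written table. The paper's implementation uses the Eclipse AST; this regex-based scan and the event_class_of_table mapping are rough, assumed stand-ins.

import re

METHOD_RE = re.compile(
    r'\b(?:public|protected|private)(?:\s+static|\s+final|\s+synchronized)*'
    r'\s+[\w<>\[\]]+\s+(\w+)\s*\(')
SQL_RE = re.compile(
    r'"([^"]*\b(?:INSERT\s+INTO|UPDATE)\s+(\w+)[^"]*)"', re.IGNORECASE)

def methods_writing_tables(java_source):
    """Yield (method_name, table_name) pairs for string literals that look like
    INSERT/UPDATE statements. A real implementation would parse the AST."""
    current_method = None
    for line in java_source.splitlines():
        m = METHOD_RE.search(line)
        if m:
            current_method = m.group(1)
        for _, table in SQL_RE.findall(line):
            if current_method:
                yield current_method, table.upper()

def relate_methods_to_events(java_source, event_class_of_table):
    """event_class_of_table: assumed mapping {TABLE_NAME: event class Ei}
    taken from the BPM-E definition information."""
    relation = {}
    for method, table in methods_writing_tables(java_source):
        ev = event_class_of_table.get(table)
        if ev:
            relation.setdefault(ev, set()).add(method)
    return relation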
4.3
Determine the Relevant Source Code
The escalations are done for each method that is directly related to the event, by finding caller methods and callee methods recursively¹ as follows:
1. Do the following process with MGi, i = 0, 1, . . .:
   (a) Find the methods that call the method m in MGi recursively. When an already registered method is reached, stop the search.
   (b) Add the found methods to MGi.
2. Do the following process with MGi, i = 0, 1, . . .:
   (a) Find the methods which are called from method m in MGi recursively, and store each newly found method m′ together with the work package as a pair.
¹ We used Eclipse (http://www.eclipse.org/) AST (Abstract Syntax Tree).
3. If multiple work packages are related to a method m′, then register m′ in the set of common methods CM; otherwise add m′ to the corresponding work package MGi.
4. Form a new work package and register all the methods that have not been registered yet.
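The escalation above can be sketched as a closure computation over a call graph. In the following hedged Python sketch, the call graph is assumed to be given as caller/callee adjacency maps (in the paper it is derived from the Eclipse AST), and the names loosely mirror MGi and CM; it is not the authors' implementation.

def escalate(direct, callers, callees):
    """direct:  {package_index: set of directly related methods}
    callers: {method: set of methods that call it}
    callees: {method: set of methods it calls}"""
    mg = {i: set(ms) for i, ms in direct.items()}
    registered = set().union(*mg.values()) if mg else set()

    def closure(seeds, graph):
        found, stack = set(), list(seeds)
        while stack:
            m = stack.pop()
            for n in graph.get(m, ()):
                if n not in registered and n not in found:
                    found.add(n)
                    stack.append(n)
        return found

    # Step 1: add (transitive) callers to each work package.
    for i in mg:
        mg[i] |= closure(mg[i], callers)
    registered = set().union(*mg.values()) if mg else set()

    # Steps 2-3: collect (transitive) callees; methods reached from several
    # packages become common methods CM, the rest join their single package.
    reached_by = {}
    for i in mg:
        for m in closure(mg[i], callees):
            reached_by.setdefault(m, set()).add(i)
    cm = {m for m, pkgs in reached_by.items() if len(pkgs) > 1}
    for m, pkgs in reached_by.items():
        if m not in cm:
            mg[next(iter(pkgs))].add(m)
    return mg, cm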
5
Experimental Results
Two in-house business systems are used for the experiment. Both are written in Java and SQL without stored procedures. The features are shown in Table 1.

Table 1. System features

                                           System A              System B
Software architecture                      Three-tiers + batch   Three-tiers
Framework                                  (not used)            Apache Struts + DbUtils
Number of classes                          577                   169
Number of methods                          6872                  3219
Lines of code                              129744                31817
Number of event classes                    36                    7
Number of business process flow types      6810                  11
Number of business process flow instances  307035                1709
Start year of development                  2000                  2008
5.1
Results of Work Package Extraction
The results are shown for System A only. We set the thresholds τc and τm to 0.8 by way of experiment. The results of work package extraction are shown in Table 2 (displayed using aliases). Nine work packages are extracted including the common events.

Table 2. Results of work package extraction for System A

WP   Event classes
G0   e1, e2, e3, e4, e5, e6
G1   e7, e8, e9, e10, e11, e12, e13, e14, e15, e16, e17, e18
G2   e19, e20, e21, e22, e23
G3   e24, e25, e26, e27, e28, e29, e30, e31
G4   e32
G5   e33
G6   e34
G7   e35
G8   e36
5.2
Results of Work Package–Source Code Relation
The numbers of methods which are associated with each work package for System A are shown in Table 3 (a). There are 27 duplicated methods in (a) because there were some methods related to multiple events. Similarly for System B, there are two duplicated methods in (b).
Table 3. Results of methods associated with work packages

(a) System A
Work packages        Number of methods
(Common Methods)     2600
G0                   3175
G1                   37
G2                   14
G3                   197
G4                   4
G5                   13
G6                   5
G7                   7
G8                   10
(Unrelated methods)  837

(b) System B
Work packages        Number of methods
(Common Methods)     584
G0                   0
G1                   98
G2                   1
G3                   3
(Unrelated methods)  2535
6
Discussion
First, we discuss the results of work package extraction compared with the business process flow diagrams drawn by the person in charge. Then, we discuss the accuracy of event–method relationships. 6.1
Evaluation of Work Package Extraction
To evaluate the presented method, we compared the results with work packages extracted manually from four business process flow diagrams (i.e., four flow types) which were drawn by the person in charge of System A.
Work Packages from Flow Diagrams for System A. One of the authors extracted the work packages from the diagrams manually, based on the similarity of the activities (events) among the diagrams. Similarities between distantly-positioned activities within a diagram were not considered. The work packages and the business process flows expressed by work packages are shown in Tables 4 and 5, respectively. The right arrows indicate the execution order of the work packages. To estimate the execution volume of the diagrams, the diagrams were compared with the business process flow types of BPM-E. The top-40 flow types (218755 flow instances in total) were categorized using the activities in the diagrams. The results are shown in Table 5. From Table 5, flow type A is the most frequently executed business process. Further analysis reveals two main types of variation of flow type A in the 'not categorized' flows; one type contained Γ14, and the other contained Γ12. It is assumed that the person in charge did not describe the diagrams exhaustively.
Analysis of Work Packages. From Table 5, the work packages that are common to flow types A–D are {Γ1, Γ4} = {e1, e4, e5, e6}.
Table 4. Manually extracted work packages and corresponding events for System A

WP    Corresponding events
Γ1    e4, e5, e6
Γ2    e14, e15
Γ3    e2, e7, e8, e9, e10, e12, e13, e17, e18
Γ4    e1
Γ5    e11
Γ6    e26, e29
Γ7    e20, e21, e23, e28, e32, e33, e36
Γ8    e27
Γ9    e34
Γ10   e3, e25, e30, e31, e35
Γ11   e24
Γ12   e19
Γ13   e22
Γ14   e16
Table 5. Flow type, work packages, and number of executions

Flow type          Flow using WP                                         Number of executions
Flow type A        Γ1 → Γ2 → Γ3 → Γ4 → Γ5                                79697
Flow type B        Γ1 → Γ6 → Γ7 → Γ8 → Γ9 → Γ10 → Γ2 → Γ3 → Γ4 → Γ5      23564
Flow type C        Γ1 → Γ9 → Γ11 → Γ10 → Γ4 → Γ6 → Γ12 → Γ8              11599
Flow type D        Γ13 → Γ1 → Γ7 → Γ3 → Γ4 → Γ12                         11522
(Not categorized)                                                        92373
Compared with the above results, the common events (G0) in Table 2 have the extra events {e2, e3}. As for e2, since it is included in Γ3, which is a part of flow type A and of the non-categorized flow types, this result is considered to be valid. As for e3, however, it is not common among the top-40 BPM-E flows: only 13 of those flows contained e3. However, 5541 flows contained e3 when all flows are considered. BPM-E treats flows as different flow types when the order of the events differs and/or the number of repetitions of an event differs (by default). If there are business processes which produce such variations, the above phenomenon may occur. Table 6 shows the results of applying Table 4 to Table 2. From Table 6, we consider that the work packages G0–G3 represent the portion of the business process flows that was accurately drawn by the person in charge, though it is hard to comment on G4–G8. 6.2
Evaluation of Event–Method Relationships
We evaluated the method from two viewpoints: the potential cover rate (the rate of methods that are related to events over total methods), and the accuracy of the event–method relationships. Cover Rate. The ideal cover rate is 1, i.e. all methods are related to the events. A lower cover rate means that fewer methods can be estimated for the execution
Table 6. Work packages expressed by manually extracted work packages

G0   Γ1 + Γ4 + (e2) + (e3)
G1   Γ2 + (Γ3 − e2) + Γ5 + Γ14 ≈ a portion of types A and B.
G2   Γ12 + Γ13 + (e20 + e21 + e23) ≈ a portion of type D.
G3   Γ6 + Γ8 + Γ11 + (e25 + e28 + e30 + e31) ≈ a portion of type C.
G4   (no relevant matching)
G5   (no relevant matching)
G6   Γ9
G7   (no relevant matching)
G8   (no relevant matching)
volume. The actual cover rate was 0.88 for System A and 0.21 for System B, showing a large difference between the systems. According to the naming of the packages, the categories of methods that are not related are as follows:
– System A: batches, beans (which are not related to events), database, servlet, utilities, and CORBA. Concentration on specific packages or classes was not observed.
– System B: master maintenance, business logic, menus, and utilities.
The numbers of methods for each category in System B, classified into Apache Struts (Action and Form) and data processing, are as follows:
– Struts Action + Form [master maintenance: 89, business logic: 1648, menus: 520]
– Data processing (including DB access) [master maintenance: 95, business logic: 159, utilities: 23]
There is some concentration on the methods corresponding to Struts Action and Form. Some ways to improve the cover rate for these might be:
– Analyzing the Struts configuration files to extract the relation between the methods for accessing web pages.
– Analyzing SELECT statements that read the data written by INSERT/UPDATE.
There are also some issues in extracting SQL statements. In both systems, there were SQL statements that are generated dynamically, such as SQL statements built using for-loops and parameters changed using if-statements. To overcome these issues, we will need techniques to execute partial Java code using an interpreter or virtual code execution, such as given in reference [4].
Evaluation of Events–Methods Relationships. Multiple events may be extracted for one method. The following two situations cause this phenomenon:
– There are multiple INSERT/UPDATE statements in one method.
– One INSERT/UPDATE statement updates multiple time-stamp columns.
In the former case, the number of method executions is estimated as the total number of executions of the events which are related to the INSERT/UPDATE statements. The latter case has issues, and System A includes this category. For a table structure that has multiple time-stamp columns which correspond to events, the usual update processing may be that a record for a certain business ID is inserted first, and then the time-stamp column corresponding to an event is updated when that event occurs. However, especially in object-oriented languages like Java, a bean object may be used to store the record data corresponding to a business ID, and all time-stamps may be updated simultaneously when an event occurs. This problem may be solved if we can pinpoint each method which sets the time-stamp in the bean, though some methods could not be identified in System A for the following reasons:
– The time-stamp value for each business activity was written into a temporary table each time an activity occurred, and then the record in the temporary table was copied to the table that BPM-E used for event extraction.
– The implementation could not relate a method that set the time-stamp value into a bean to the SQL execution method when the bean was passed via a Vector which contained multiple types of beans.
7
Related Work
One of the methods for estimating the execution volume and extracting portions of source code directly is to use trace logs or profiling information. In [9], a method called ‘phase detection’ is proposed to divide the trace logs, and the divided trace logs are digested into groups and visualized. The phases extracted by phase detection are similar to the work packages in this paper. If we apply this kind of technology to real business systems, we will obtain phases and their number of executions accurately. However, it is not easy to acquire the trace logs (or to retrieve profile information) from a business system that is in operation for a long time, because doing so places a non-negligible load on the system. Meanwhile, the testing environment and test patterns for the business system development are not sufficient for obtaining trace logs or profiling, because the number of executions for each test pattern is quite different from the actual business environment, even though the test patterns are exhaustive. Therefore, process mining such as BPM-E is practical for obtaining the statistics of business systems in operation rather than using trace logs, even though the approach is not as accurate as using trace logs. Several studies [6] and [7] extracted groupings such as work packages using process mining. [6] proposed a method for describing events at a different level of abstraction, while [7] constructed a hierarchy of events clustering using the AHC algorithm.
8
Conclusions
We proposed a method for extracting work packages, their execution volume, and their relevant source code using process mining. This information can be used to develop a more cost-effective maintenance strategy for each work package. The method was applied to two in-house business systems, and it successfully extracted the common events and the work packages. However, there remain accuracy issues in associating events with source code, which arise from the extraction of dynamically generated SQL statements and from acquiring the values that are set via beans. To overcome such issues in a future study, we intend to investigate virtual code execution technology (such as [4]) and apply it to code that generates SQL statements dynamically.
References
1. http://www.fujitsu.com/global/services/software/interstage/solutions/bpm/apd.html
2. International Standard ISO/IEC 14764, IEEE Std 14764-2006: Software engineering — software life cycle processes — maintenance. ISO/IEC 14764:2006 (E), IEEE Std 14764-2006 (Revision of IEEE Std 1219-1998) (2006)
3. van der Aalst, W.M.P., Reijers, H.A., Weijters, A.J.M.M., van Dongen, B.F., de Medeiros, A.K.A., Song, M., Verbeek, H.M.W.E.: Business process mining: An industrial application. Inf. Syst. 32(5), 713–732 (2007)
4. Brat, G., Havelund, K., Park, S., Visser, W.: Java PathFinder - second generation of a Java model checker. In: Proc. of the Workshop on Advances in Verification (2000)
5. Carriere, J., Kazman, R., Ozkaya, I.: A cost-benefit framework for making architectural decisions in a business context. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, ICSE 2010, vol. 2, pp. 149–157. ACM, New York (2010)
6. van Dongen, B.F., Adriansyah, A.: Process mining: Fuzzy clustering and performance visualization. In: BPM Workshops, pp. 158–169 (2009)
7. Günther, C.W., Rozinat, A., van der Aalst, W.M.P.: Activity mining by global trace segmentation. In: BPM Workshops, pp. 128–139 (2009)
8. Jones, C.: Geriatric issues of aging software. CrossTalk 20(12), 4–8 (2007)
9. Watanabe, Y., Ishio, T., Inoue, K.: Feature-level phase detection for execution trace using object cache. In: Proceedings of the 2008 International Workshop on Dynamic Analysis, WODA 2008, pp. 8–14. ACM, New York (2008)
Next Best Step and Expert Recommendation for Collaborative Processes in IT Service Management
Hamid Reza Motahari-Nezhad and Claudio Bartolini
Hewlett Packard Laboratory, 1501 Page Mill Rd, Palo Alto, CA, USA
{hamid.motahari,claudio.bartolini}@hp.com
Abstract. IT service management processes are people intensive and collaborative by nature. There is an emerging trend in IT service management applications, moving away from rigid process orchestration to the leveraging of collaboration technologies. An interesting consequence is that staff can collaboratively define customized and ad-hoc step flows, consisting of the sequence of activities necessary to handle each particular case. Capturing and sharing the knowledge of how previous similar cases have been resolved becomes useful in recommending what steps to take and what experts to consult to handle a new case effectively. We present an approach and a tool that analyzes previous IT case resolutions in order to recommend the best next steps to handle a new case, including recommendations on the experts to invite to help with resolution of the case. Our early evaluation results indicate that this approach shows significant improvement for making recommendations over using only process models discovered from log traces. Keywords: Best practice processes, Collaborative and Conversational Process Definition and Enactment, Step Recommendation, Expert Recommendation.
1 Introduction
IT incident management is one of the key ITIL (IT Infrastructure Library, www.itilofficialsite.com) processes [1]. ITIL includes a number of core processes that provide guidelines on managing the IT ecosystem of an organization. IT service management processes are people intensive and collaborative by nature. Existing IT management tools, including HP Service Manager [14], support ITIL processes by implementing a specific interpretation of those processes. However, they do not support people collaboration. A new generation of IT management tools, including our earlier work, the IT Support Conversation Manager (ITSCM) [1], has arisen that relies on collaborative techniques to manage IT support cases. These tools take a collaborative and semi-structured approach to supporting ITIL processes. In particular, ITSCM enables people collaboration and flexible and adaptive process definition and enactment for IT case management. This enables IT staff to customize and define case management processes, while taking into consideration the specific organization and case information, within the guidelines of the ITIL framework.
While collaboration-based IT process management systems advocate flexibility in defining the flow of activities, one issue is that staff may need to manually define the same or similar step flows for similar cases. IT staff may benefit from reusing the wealth of knowledge that is latent in the support organization about the best sequence of activities to follow to resolve a given case, based on learning from past similar case resolutions. One of the key desired functionalities of this new generation of IT case management tools is the ability – for a new case – to provide automated assistance to IT staff involved in case management on how to advance the process more efficiently. The problem that we focus on in this paper is providing decision support for guiding the resolution of new cases, by providing automated recommendations of the best steps and experts to resolve an open case, based on the knowledge of past similar cases, in the new generation of IT support systems that support collaborative and conversational case management. According to statistics from HP Technical Services (TS), HP handles more than 100,000 IT calls per day on behalf of the customers that it represents. Some of these calls are about similar issues. Even if only a small percentage of such cases benefit from automated guidance on activities that lead to quicker and effective resolution, large efficiency gains would result in the overall case management process. In Section 2, we provide an overview of the collaborative case management advocated by ITSCM, and discuss the challenges of guiding IT staff in such settings. Section 3 presents the solution architecture and algorithmic details. In Section 4, we discuss the implementation and evaluation results. Finally, in Section 5 we discuss related work, and we conclude with a perspective on future work in Section 6.
2 Automated Guidance in Collaborative Processes: Background and Challenges
In the IT Support Conversation Manager (ITSCM) [1], we introduce the concept of a conversation as a container for the interaction and collaboration of IT staff following a series of process-oriented actions to handle an open IT case. A conversation includes a set of participants, an informal thread of conversation among participants (textual chat data), and a structured part of the conversation including a step sequence (or step flow). Each step defines an action. A subset of conversation participants may get involved in fulfilling an action following the RACI (Responsible, Accountable, Consulted, and Informed) model. Fig. 1 shows the data model of a conversation in ITSCM (referred to as a case in IT case management). In the following, we describe the most important attributes that are used in the analysis of conversation data and the automated guidance to participants:
• title is the title chosen for the IT case by the submitter;
• tags which are a set of keywords that are attached to the case by the support staff handling the case;
• step represents an individual action;
• step flow represents the sequence of steps performed during the course of handling the open case;
• participant refers to a member of the support staff involved in the case
handling or the case submitter;
• type refers to a particular case type, from a pre-defined classification of types allocated by the support staff;
• priority is a category attribute that specifies the level of severity of the issue. The value for this attribute is first identified by the submitter and may be changed later by the support staff;
• status which specifies the status of the case, e.g., draft, open, closed, etc.;
• other text attributes which include participant comments and a case description, that are treated as a bag of keywords in the analysis.
Fig. 1. The case data model in the collaborative IT case management system (ITSCM).
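A minimal Python sketch of the parts of this data model that the analysis uses is given below; field names and types are assumptions for illustration, and the actual ITSCM schema in Fig. 1 contains further attributes.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    step_id: str
    title: str
    status: str = "draft"
    participants: List[str] = field(default_factory=list)
    comments: str = ""

@dataclass
class Case:
    case_id: str
    case_type: str
    title: str
    priority: int = 3                       # assumed 1-5 severity scale
    tags: List[str] = field(default_factory=list)
    description: str = ""
    status: str = "draft"                   # e.g., draft, open, closed
    participants: List[str] = field(default_factory=list)
    step_flow: List[Step] = field(default_factory=list)   # ordered steps
    resolution: Optional[str] = None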
In ITSCM, depending on the type of a case, IT staff from the helpdesk may choose to use a pre-existing step flow template for the standard initial tasks, such as capturing case data and classifying it. Then, ITSCM empowers users to define new steps as the informal discussions among the participants progress; steps may be assigned to participants, and new people may be invited to participate in steps or in the conversation, until the case is resolved. ITSCM keeps a model of the task flow in the backend, to support the enactment and tracking of steps, to send notifications, and to watch due dates/times for escalation purposes.
Challenges. In a collaborative and conversational approach to case management, the IT staff define custom step flows. One issue is that the same step may be defined or named differently in two different cases. The challenges of automated assistance for recommending next steps therefore include capturing the similarity between the custom-defined step flows of different cases. Another challenge is considering the contextual information of the informal conversations before and after a step. A more general challenge is identifying the similarity of the step flows of two cases, which
requires taking into consideration the unstructured, textual information that is exchanged among the conversation participants. Finally, there is the need to find matching steps for a given open case and for measures to rank the steps based on an interpretation of the “best” next steps.
3 Next Step and Expert Recommendation
We have developed a solution that recommends the best next steps and experts for an open case at hand by: 1) building a model of the step flows of past cases, annotated with metadata from the threads of conversations from informal interactions, and 2) matching the current flow of activities in the open case with the annotated step flow model to recommend the next best step(s) and experts.
The solution architecture. Fig. 2 shows the architecture of our solution. The repository contains the historical case information. The solution has two main components. The annotated step-flow-model discovery component analyzes the previous cases present in a repository and builds a step flow model in which each step is annotated with the metadata information of previous cases that took that particular step. In order to make models more manageable in terms of size and easier to understand, we have decided to discover a separate step flow model for each case type. The step recommendation component uses a discovered step flow model corresponding to the type of the open case and takes into account the information of the open case to find the best match(es) of paths in the model to that of the open case, and it returns to the user an ordered list of the recommended step(s). Along with each step recommendation, the solution also recommends experts: IT support staff who have performed the recommended step during the resolution of a past case.
Fig. 2. The component architecture of the Next Best Steps/Expert Recommender solution
3.1 The Model Discovery Component
This component analyzes the existing cases in the repository and creates an annotated process model in two phases: (i) discovering the step flow model and (ii) annotating the process model.
Step flow model discovery. We have designed an algorithm that reads the step flows for the cases of a given type from the repository and discovers a model. The first decision we needed to make about the algorithm was the language (with formal and visual definitions) to represent the discovered step flow model. We chose finite state machines (FSM) for representing step flow models in this work. FSMs are a well-known formalism, easy to understand and visualize, and suitable for modeling reactive behaviors. There are also a number of methods and tools that enable the analysis and management of step flow models specified as FSMs. Each transition in the model is labeled with a step name. By traversing this model from its start state, we can reproduce the individual step sequences of the past cases that are used in the discovery phase. We build on top of our previous work [2] for discovering the process model from process event logs for this purpose. The flow discovery algorithm is a hybrid of grammar inference and probabilistic approaches. The algorithm uses a Markov model to identify step sequences based on statistics about steps that frequently follow each other, and to build an FSM from these sequences. We build the step flow graph as follows: one vertex is created for each distinct step in any step flow in the repository. For a given order n (Markov order), and for each sequence of length n + 1, uniquely labeled edges will connect each step (which is now a vertex) in the sequence to the immediately following step (vertex) in the graph. The resulting graph is converted to an FSM [2, 3]. The FSM generated by the previous step may contain equivalent states that could be reduced. To reduce these states, we merge them. The following criteria are applied to find candidate states for merging: i) states with the same outgoing transitions, i.e., transitions with the same steps to the same target states; ii) final states with transitions labeled with the same incoming step name; and iii) states with the same outgoing
Open
Assign to FLS
Classify
...
ReferToPMGroup
...
Close Incident
Check Log
GetSpecID
...
...
Fig. 3. An example step flow model for an incident management case repository
Next Best Step and Expert Recommendation for Collaborative Processes
55
transitions, excluding the transition(s) that goes to the other state(s) to be merged with it. In this case, preexisting transitions between merged states are represented as selftransitions on the newly created state. The resulting model is an FSM in which each transition is annotated with statistical metadata. Each number on a transition is normalized into the range of zero to one denoting the probability that the current transition is taken among the outgoing transitions of the source state. Annotating the step flow model with case information. The discovered process model in the previous phase is an FSM graph that shows all possible steps taken after a particular step, annotated only with statistical metadata but not contextual information on the cases. In a second phase, the algorithm annotates each transition in the process model with information about the conditions under which a transition representing a step is taken in previous cases. For each transition in the FSM, which is labeled with a step name ‘a’, we build an index of the cases that contain ‘a’ in the same order that it is observed in the FSM. For instance, in Fig. 3 all the cases that take the “Escalate” step right after “Classify” are identified. We extract information from case title, tags, priority, description, status and the step title, description and participants. We then summarize information of each attribute from the set of cases that include a given step in the model. For categoryattributes such as priority and case status, we record all values seen in the cases and their frequencies. For text attributes, we employ an information retrieval technique to extract keywords that uniquely identify the attribute by focusing on proper names and domain-specific terms (after removing all the stop words). In this approach, besides regular keywords, we extract semantic associations between specific words and identify words as names of companies, people or locations. In IT cases, names of locations and companies (such as HP and Microsoft) are useful for unique identification of a step. The set of keywords are kept along with their frequency as part of the step metadata. Finally, we also keep a list of the support staff who have carried out the step represented by a transition. We use this information to annotate the transition representing the step. The annotation is in the form of a set of key value pairs, where the keys are the attributes and the values specify the frequent values (e.g., keywords, category values) observed for the attribute, along with statistical data. Note that the discovery and annotation of the step flow models is performed offline. The frequency of the update of the model is configurable, with a default setting of 12 hours for updating the models with the step flow of new cases that have not previously been considered in the creation of the model. This approach incrementally updates the model rather than re-building it every time. The system is configurable to take cases between a specified date range, e.g. past 3 days only. 3.2 The Steps/Experts Recommendation Component This component uses the step flow information of an open case and the step flow model of the same type as that of the open case to recommend the next steps. Next steps recommendations are then presented to the user, who may either accept or ignore them. In order to recommend the next steps, we find the step in the flow model that corresponds to the last step in the open case. In some situations, there may
56
H.R. Motahari-Nezhad and C. Bartolini
be more than one step in the model that matches the current step in the open case, based on the similarity between the case information and that of the metadata attached to the steps in the model. Resolution paths
Legend Final States
… History path
Start State
Matching Step
Resolution paths
Candidate Next Steps
Fig. 4. The step recommendation algorithm considers both the path leading to a matching step (history path) and paths leading to a possible resolution in making recommendations
To identify the matching steps, the algorithm considers the path leading to a matching step candidate, looking for a match with the path to the current step in the open case. The recommendation algorithm then considers the resolution paths (the paths from a candidate next step to final states with resolutions) in order to make a recommendation (see Fig. 4 -- the figure shows the history path for one matching step and the resolution paths for its candidate next steps in the model). In this sense, the algorithm employs an Occam’s razor argument (http://en.wikipedia.org/wiki/Occam’s_razor) in giving a higher priority to steps that lead to a resolution in fewer steps and to steps that match the context of the current open case.

Step Matching. The step matching method incorporates approximate sequence matching between the step sequence of the open case and the paths in the process model where the candidate steps are located. Since the step flow in the open case is not complete, we only consider paths of the same length or with at most three more steps in the model, counting from the Start state of the FSM. We call such paths candidate paths. The algorithm first computes the similarity between the case information in the current step and the metadata of the steps in the model (the steps in the candidate paths), without considering their location in the path. The following is the measure of similarity between step i of the case (denoted as Ci) and step j in the step flow model (denoted as Mj):

StepSimilarity(Ci, Mj) = titleSimilarity * titleCE + tagsSimilarity * tagsCE + descriptionSimilarity * descriptionCE + prioritySimilarity * priorityCE

In this function, titleCE, tagsCE, descriptionCE and priorityCE are coefficients that specify the weights of the different attributes; these coefficients sum to one. The priority measure defines a filter so that only steps from cases with at most one priority level
difference, over a 1-to-5 scale, are considered. For example, if the priority of the ongoing case is 2, cases with priorities of 1, 2 and 3 are considered. Then, we choose the steps in the model that are most similar to the last step in the case (the top 5 in practice). For each of these steps, we compute the similarity of the path leading to the step in the model with the path leading to the last step in the step flow of the open case. The step flow similarity is computed as follows:

flowSimilarity(lastStep_Ci, matchingStep_Mj) = StepSimilarity(lastStep_Ci, matchingStep_Mj) + pathSimilarity(1..i, 1..j)

in which

pathSimilarity(1..i, 1..j) = Σ_{1 ≤ l ≤ k} StepSimilarity(C_l, matchingStep_M_l), with k = i − 1 if i ≤ j, and k = j − 1 if j ≤ i.
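For illustration, the following is a minimal Java sketch of these similarity measures. The Step representation, the Jaccard-based attribute similarities, and the concrete coefficient values are assumptions of the sketch and are not taken from the implementation.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical step representation; the text only names the attributes involved.
record Step(Set<String> titleWords, Set<String> tags, Set<String> descWords, int priority) {}

final class StepSimilarityCalculator {
    // Assumed attribute weights; the text only states that the coefficients sum to one.
    static final double TITLE_CE = 0.4, TAGS_CE = 0.2, DESC_CE = 0.3, PRIO_CE = 0.1;

    // Simple keyword-overlap similarity used here as a stand-in for the attribute similarities.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return (double) inter.size() / union.size();
    }

    // StepSimilarity(Ci, Mj): weighted sum of the attribute similarities.
    static double stepSimilarity(Step c, Step m) {
        double prioritySim = Math.max(0.0, 1.0 - Math.abs(c.priority() - m.priority()) / 4.0); // 1-to-5 scale
        return jaccard(c.titleWords(), m.titleWords()) * TITLE_CE
             + jaccard(c.tags(), m.tags()) * TAGS_CE
             + jaccard(c.descWords(), m.descWords()) * DESC_CE
             + prioritySim * PRIO_CE;
    }

    // pathSimilarity(1..i, 1..j): pair-wise similarity of the steps before the last/matching step.
    static double pathSimilarity(List<Step> casePath, List<Step> modelPath) {
        int k = Math.min(casePath.size(), modelPath.size()) - 1;
        double sum = 0.0;
        for (int l = 0; l < k; l++) sum += stepSimilarity(casePath.get(l), modelPath.get(l));
        return sum;
    }

    // flowSimilarity: similarity of the last case step to the matching model step plus the history.
    static double flowSimilarity(List<Step> casePath, List<Step> modelPath) {
        return stepSimilarity(casePath.get(casePath.size() - 1), modelPath.get(modelPath.size() - 1))
             + pathSimilarity(casePath, modelPath);
    }
}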
The pathSimilarity equation computes the similarity of the pair-wise steps in the paths immediately before lastStep in the case and matchingStep in the model. It considers both the position of the steps in the paths and whether the steps are similar. As a result, when compared to the last step in the open case, the steps in the model with a more similar history of actions are ranked higher. Note that for recommending next steps, we give preference to steps that lead to a resolution in a smaller number of steps, as discussed below.

Step Recommendation. The step recommendation builds on and extends the approach for finding matching steps in the model. The algorithm finds the next steps immediately after the matching steps and ranks them according to their potential match with the context information of the open case and the likelihood that they lead to a faster resolution, based on information from past cases. We define the matching measure of a step Ci as a next-step recommendation as follows:

nextStepMatch(Ci) = flowSimilarity(lastStep_Ci, matchingStep_Mj) + stepCaseMatch(Ci) × resolutionPath(Mj)

in which

stepCaseMatch(Ci) = tagsSimilarity * tagsCE + descriptionSimilarity * descriptionCE + prioritySimilarity * priorityCE

and

resolutionPath(Mj) = 1, if resPathLen = 0;
resolutionPath(Mj) = Max{ (1/resPathLen) × Σ_{1 ≤ k ≤ resPathLen} stepSimilarity(C, M_k) : path ∈ resPath }, if resPathLen ≠ 0.
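Building on the StepSimilarityCalculator sketch above, the ranking of candidate next steps could be expressed as follows. The representation of resolution paths as lists of model steps, and the use of the last case step as the case context in resolutionPath and stepCaseMatch, are assumptions of this sketch.

import java.util.List;

final class NextStepRanker {

    // resolutionPath(Mj): favors candidate next steps with short, well-matching resolution paths.
    // Returns 1 if a final state is reached immediately (resPathLen = 0); otherwise the best
    // average step similarity over all resolution paths of the candidate.
    static double resolutionPath(List<List<Step>> resPaths, Step caseContext) {
        double best = 0.0;
        for (List<Step> path : resPaths) {
            if (path.isEmpty()) return 1.0;
            double sum = 0.0;
            for (Step m : path) sum += StepSimilarityCalculator.stepSimilarity(caseContext, m);
            best = Math.max(best, sum / path.size()); // (1 / resPathLen) * sum of step similarities
        }
        return best;
    }

    // stepCaseMatch(Ci): like StepSimilarity but without the title attribute.
    static double stepCaseMatch(Step caseContext, Step candidate) {
        double prioritySim = Math.max(0.0, 1.0 - Math.abs(caseContext.priority() - candidate.priority()) / 4.0);
        return StepSimilarityCalculator.jaccard(caseContext.tags(), candidate.tags()) * StepSimilarityCalculator.TAGS_CE
             + StepSimilarityCalculator.jaccard(caseContext.descWords(), candidate.descWords()) * StepSimilarityCalculator.DESC_CE
             + prioritySim * StepSimilarityCalculator.PRIO_CE;
    }

    // nextStepMatch(Ci) = flowSimilarity + stepCaseMatch * resolutionPath.
    static double nextStepMatch(List<Step> casePath, List<Step> modelPath,
                                Step candidateNextStep, List<List<Step>> resPathsOfCandidate) {
        Step lastCaseStep = casePath.get(casePath.size() - 1);
        return StepSimilarityCalculator.flowSimilarity(casePath, modelPath)
             + stepCaseMatch(lastCaseStep, candidateNextStep)
               * resolutionPath(resPathsOfCandidate, lastCaseStep);
    }
}

Candidate next steps would then be sorted by nextStepMatch and the top-ranked entries passed to the user interface.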
The resolutionPath function favors a candidate next step with a resolution path of smaller length and higher similarity. The stepSimilarity in this function computes the sum of the similarities of individual steps in the path to the metadata of the case. For each candidate next step, it returns the similarity of the best match to be multiplied by the stepCaseMatch similarity value. We then rank the next steps based on the value of nextStepMatch and return them to the user interface to be presented to the user. Along with each next best step, the recommender also suggests experts – IT staff who previously carried out the
recommended step on similar cases. The list of IT staff is sorted based on the frequency with which they have handled this particular type of case and on the relevance of their profile to the open case, where available. The textual similarity is calculated with a cosine similarity function, a well-known technique for comparing vectors represented by keyword sets [13].

Step and model standardization. Using a standardized approach for naming steps improves the quality of step recommendation. However, often no standardized names are adopted. To compensate for this, we support storing previously used names for steps and recommending them for reuse when naming new tasks. In the IT Support Conversation Manager [1] we add all newly added steps of a conversation to a step library for that conversation type. When a conversation participant adds a new step, we provide a search box over past names and an auto-complete feature. Nevertheless, the step flow model of a conversation type may still contain steps that are similar to each other but use different step names. We analyze the model for each conversation type to identify similar steps and present them to a case domain expert, who decides whether they represent the same step, enabling them to be merged to build a more concise and compact model.
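As the text does not spell out the cosine similarity, the following is a standard formulation over keyword-frequency vectors, as one way the relevance of an expert profile to an open case could be computed; the Map-based vector representation is an assumption.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Cosine similarity between two keyword-frequency vectors, e.g., an expert profile and an open case.
final class CosineSimilarity {
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> shared = new HashSet<>(a.keySet());
        shared.retainAll(b.keySet());
        double dot = 0.0;
        for (String k : shared) dot += a.get(k) * (double) b.get(k);
        double normA = 0.0, normB = 0.0;
        for (int v : a.values()) normA += (double) v * v;
        for (int v : b.values()) normB += (double) v * v;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}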
4 Implementation and Experiments

We have created a proof-of-concept prototype of the next best step recommender solution, in connection with the 48Upper.com project, within the context of an HP R&D project on next-generation IT Service Management solutions that adopt a collaborative approach for ITIL processes. Our prototype is implemented in Java, on top of Eclipse. Fig. 5 shows a screenshot of the recommendation solution integrated with the IT Conversation Manager [1]. To evaluate the approach, we used a set of case data from simulation and from early trials of the tool (more than 250 cases, all related to issues in a human resource software application and to incident management). We evaluated the next best step recommender solution in two modes: using information on the sequence of activities alone, and using the information on the sequence of activities combined with the case’s informal thread of interactions among participants. The goal of this experiment is to validate the significance of the case contextual information, as well as that of the approximate matching of step flows, with respect to the quality of recommendations made purely based on a process model that includes prescribed action flows but is not augmented with contextual case information. The results of the evaluation show that using only process model information leads to a lower quality of recommendations in terms of ranking the steps, as the algorithm can only rank the steps based on the frequency of step usage. However, when case information such as case type, title and tags, and contextual information on the people interactions are considered, the recommended steps are more likely to be relevant to the context of the case. Based on the 250 cases in the dataset (4 types of cases), we used the recommendation approach on 20 cases. Users found at least one of the recommendations useful in 7 of these cases (35%) when using only a process model (structural information) augmented with statistics, whereas recommendations made using the presented approach were considered useful in 16 cases (80%). This
initial evaluation is an indicator that the step recommendation approach is effective in finding relevant recommendations for IT cases with adaptive and collaboratively defined processes for case resolution. In terms of efficiency and performance, this approach has an important advantage: the model of the step flows of past cases is pre-built in offline mode. At runtime, we only need to perform in-memory operations to find the recommended steps. In our experiments, the results are computed on the order of seconds.
Fig. 5. A screenshot of IT Conversation Manager, where the next best step recommender is integrated
5 Related Work

To the best of our knowledge, no other tools exist today that address the problem of recommending activity steps in collaborative IT support systems. The work presented in this paper advances our collaborative approach for process definition and enactment introduced in the IT Support Conversation Manager [1]. Process mining techniques [2, 3] enable the discovery of a process model from a set of process instances of ill-documented business processes and help to understand changes in process execution. These techniques discover the structural model of the process. In this paper, we have introduced a method for the discovery of a process model annotated with contextual process information. Our model discovery component leverages text analytics and process mining techniques to discover the annotated process model and associated experts. We use the discovered annotated process model to make next step and expert recommendations as an open case progresses. Step recommendation for business processes has been investigated in [4, 5]. In particular, [4] proposes a step recommendation approach for an existing process, based on past process execution data. The authors in [5] propose an approach for learning the process models used among office workers and using that knowledge to recommend process tasks to other colleagues. However, these approaches only consider the structure of the process model as a basis for recommendation. Our experiments in the
IT case management domain show that using an annotated model significantly improves the quality of recommendations. The authors in [15, 16] propose an approach complementary to our work by constructing decision trees based on the structured data attributes available in a process log for each decision point of the process. [16] uses this information to make predictions on the future route of a process. In this paper, we use both structured and unstructured information to annotate steps in the model, and use a similarity-based approach for making step recommendations for collaborative, ad-hoc processes in which no standardized naming scheme for steps is used. The problem of routing incident tickets to the right team or individual is investigated in [6, 7, 8]. In this line of work, the process model captures the routes along which tickets circulate among experts on the help desk and in other supporting offices. These approaches consider the structure of the routing graph [6, 7] as well as the textual information of the ticket description [8] in making group/expert recommendations. However, they do not investigate the problem of recommending the next process steps for handling a case. Prior work on workflow adaptation [10, 11, 12] proposes techniques that use observations on past workflow variants to inform a given new workflow execution case. These approaches find prior change cases that are similar to a new workflow change request, to enable the reuse of previous solutions in similar new cases. However, they assume that there is a pre-defined workflow model whose execution may vary depending on the situation.
6 Conclusions and Future Work

To the best of our knowledge, this is the first approach that addresses the problem of recommending next steps and experts for IT case management systems. The main components of the system – the automated discovery of annotated models and the recommender – introduce novel techniques that advance the state of the art in next step and expert recommendation by employing information on conversational and social interactions among people. In particular, IT incident cases are richer in their attribute sets (consisting of text, category attributes, and structured step flows) compared to workflows (which consider metadata such as frequencies, time, etc.). In case step flows, the same or similar steps may be defined differently across different cases, and the case step flow for an open case is built gradually (no predefined workflows exist). The combination of information retrieval techniques and a novel process mining technique in our approach addresses the next step and expert recommendation problem given the conversational and collaborative nature of step flow definition. As future work, we plan to incorporate users’ feedback on recommended next steps to improve recommendation results. We also plan to expand the scope of this solution in a Software-as-a-Service (SaaS) environment. In particular, we want to exploit the fact that the multi-tenant implementation of case management tools offers opportunities for analyzing activities across different tenants, thereby improving our suggestions.
References
1. Motahari-Nezhad, H.R., Bartolini, C., Graupner, S., Singhal, S., Spence, S.: IT Support Conversation Manager: A Conversation-Centered Approach and Tool for Managing Best Practice IT Processes. In: Proc. of the 14th IEEE EDOC (EDOC 2010) (October 25-29, 2010)
2. Motahari-Nezhad, H.R., Saint-Paul, R., Benatallah, B., Casati, F.: Protocol Discovery from Imperfect Service Interaction Logs. In: ICDE 2007, pp. 1405–1409 (2007)
3. van der Aalst, W.M.P., van Dongen, B.F., Herbst, J., Maruster, L., Schimm, G., Weijters, A.J.: Workflow mining: a survey of issues and approaches. Data Knowl. Eng. 47(2), 237–267 (2003)
4. Schonenberg, H., Weber, B., van Dongen, B.F., van der Aalst, W.: Supporting Flexible Processes through Recommendations Based on History. In: Dumas, M., Reichert, M., Shan, M.-C. (eds.) BPM 2008. LNCS, vol. 5240, pp. 51–66. Springer, Heidelberg (2008)
5. Dorn, C., Burkhart, T., Werth, D., Dustdar, S.: Self-adjusting recommendations for people-driven ad-hoc processes. In: Hull, R., Mendling, J., Tai, S. (eds.) BPM 2010. LNCS, vol. 6336, pp. 327–342. Springer, Heidelberg (2010)
6. Shao, Q., Chen, Y., Tao, S., Yan, X., Anerousis, N.: Efficient ticket routing by resolution sequence mining. In: KDD 2008, pp. 605–613 (2008)
7. Miao, G., Moser, L., Yan, X., Tao, S., Chen, Y., Anerousis, N.: Generative Models for Ticket Resolution in Expert Networks. In: Proceedings of KDD 2010 (2010)
8. Sun, P., Tao, S., Yan, X., Anerousis, N., Chen, Y.: Content-Aware Resolution Sequence Mining for Ticket Routing. In: Hull, R., Mendling, J., Tai, S. (eds.) BPM 2010. LNCS, vol. 6336, pp. 243–259. Springer, Heidelberg (2010)
9. Kang, Y.-B., Zaslavsky, A.B., Krishnaswamy, S., Bartolini, C.: A knowledge-rich similarity measure for improving IT incident resolution process. In: SAC 2010, pp. 1781–1788 (2010)
10. Weber, B., Reichert, M., Rinderle-Ma, S., Wild, W.: Providing Integrated Life Cycle Support in Process-Aware Information Systems. Int. J. Cooperative Inf. Syst. 18(1), 115–165 (2009)
11. Lu, R., Sadiq, S.W., Governatori, G.: On managing business processes variants. Data Knowl. Eng. 68(7), 642–664 (2009)
12. Madhusudan, T., Leon Zhao, J., Marshall, B.: A Case-based Reasoning Framework for Workflow Model Management. Data and Knowledge Engineering: Special Issue on Business Process Management 50(1), 87–115 (2004)
13. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31–88 (2001)
14. HP BTO Software, HP Service Manager, http://www.managementsoftware.hp.com
15. Lakshmanan, G.T., Duan, S., Keyser, P.T., Curbera, F., Khalaf, R.: A Heuristic Approach for Making Predictions for Semi-Structured Case Oriented Business Processes. In: Business Process Management Workshops, pp. 640–651 (2010)
16. Rozinat, A., van der Aalst, W.M.P.: Decision Mining in ProM. In: Dustdar, S., Fiadeiro, J.L., Sheth, A.P. (eds.) BPM 2006. LNCS, vol. 4102, pp. 420–425. Springer, Heidelberg (2006)
Process Variation Analysis Using Empirical Methods: A Case Study
Heiko Ludwig, Yolanda Rankin, Robert Enyedi, and Laura C. Anderson
IBM Almaden Research Center, San Jose, CA 95120
{hludwig,yarankin,enyedir,lca}@us.ibm.com
Abstract. Large organizations often weigh the trade-offs of standardization versus customization of business processes. Standardization of processes results in cost reduction due to the focus on one process management system, one set of applications supporting it, and one set of process specifications and instructions to maintain and support. On the other hand, specific requirements for different business units, e.g., for a particular country or customer, often require several business process variants to be implemented. When introducing a standardized process, an organization has to identify how processes have been conducted in the past, identify variations, and adjudicate which variations are necessary and which can be eliminated. This paper outlines a method of identifying process variations and demonstrates its application in a case study.

Keywords: Business Process, Variations, Qualitative Methods.
1 Introduction
In the drive to reduce costs, organizations look at process standardization to reduce duplication of process management systems, process specifications, applications used within the process, and the maintenance infrastructure needed to sustain the business process through its life cycle. Large companies have often developed independent business process implementations to address the needs of different business units and countries. Service providers, e.g., IT service outsourcers, often take over existing processes from clients or develop client-specific business processes, also leading to duplicate business processes. Homogenizing business processes is typically a large effort requiring careful planning. An important issue to be addressed in business process standardization is the identification of current process variations and their drivers. Why is a process different from one business unit to the next? Are there similarities in the variations that can be abstracted and leveraged for standardization? Answers to these questions suggest which process variants are necessary to address a business need and which ones can be homogenized to one standard variant. Dealing with variants in the context of business process standardization proceeds along the following trajectory: 1. scoping, 2. variant identification, 3. variant adjudication and 4. change implementation. The scope of the business process to be standardized defines the context in which variants can occur.
Fig. 1. Variant management in the context of business process standardization
In the next step, variants are identified. This includes understanding why these variants have been created independently. Subsequently, each variant is adjudicated to decide whether there is a good reason to keep it. If the decision is to implement the variant as a standardized change in the process, then change implementation occurs. This approach extends typical business process standardization [1] to include the variant management perspective. The notion of variation in this case study goes beyond variations of process specifications in a workflow management system, such as control flow, data specification, and resources. Variations here also include the implementation of the process management system, the implementation of each step, the assignment of roles to activities, and other fundamental differences in technical platform and organizational context. There are significant contributions to variation management in existing work. From a business process modeling perspective, most representations such as the Business Process Execution Language (BPEL) [3] or the Business Process Modeling Notation (BPMN) [4] allow users to express alternative control flow. Some work also includes the representation of variants of other model components, e.g., [2] and [3]. Software product line modeling also addresses variations, e.g., the COVAMOF Software Variability Method (COSVAM) [4], extensions to PLUSS [5], and the choice calculus representation for software variation [6]. Although these methods provide a formalized foundation, they do not adequately accommodate the range, flexibility, and dynamic quality of process variations needed in our services context. A review article [3] emphasized the importance of integrating non-technical issues into variability management techniques to derive holistic solutions. This paper addresses the problem of identifying business process variations and their underlying rationale using an empirical, user-centered method of elicitation. We report on a case in which this method was employed for identifying variations of a pricing process in a large enterprise.
2 Work Practice Design Case Study
The empirical method for variant elicitation used in this case study is Work Practice Design (WPD) [7, 8]. WPD examines people, technology, and processes to understand how we can better integrate all three to ultimately transform standardized processes to align with the needs of the business. Similar to practice-based and user-centered design approaches, WPD employs data collection methods, including observational studies, semi-structured one-on-one interviews with users, survey analysis, etc., to uncover the implicit practices that employees utilize to perform their responsibilities and achieve business goals. WPD facilitates the identification and adjudication of process variants that correlate with the actual work practices of employees.
Case: the case to which our variation elicitation has been applied pertains to part of the sales process of a large, global IT service company. The process involves the design of the customer solution, the calculation of cost, and the negotiation with the customer over the price. This process is preceded by opportunity qualification, finding out about a customer need, and succeeded, if successful, by contract fulfillment. The process typically is highly iterative and involves several financial solution roles, artifacts and inputs into the process of generating a financial solution or price for the client. The process results in a price structure and finally a signed contract to be handed to fulfillment. At the point of the study there were high-level, normative global process definitions (not actionable) in place. In addition, there were detailed rules about who must be involved in designing a customer solution and who must approve bids of a certain size. There was no actionable global process definition, no standard process management environment, no standard process definitions, and no standard applications involved. The long-term objective of the project was to standardize the business process as far as possible while taking necessary variations into account. The WPD study was used to elicit the variations of business processes.

Method: having gathered some basic first data on the process and organizations, we conducted focus groups of process participants during three site visits. We chose focus groups for the following reasons: 1) the sample size of end users willing to engage in the WPD process; 2) the organizational configurations of process participants, which translated to different perspectives on the financial solution process; 3) the allotted timeframe for transformational change within the business (e.g., the deployment deadline of new IT financial solution software); and 4) the need to promote transparency to accommodate knowledge sharing by creating an open forum in which end users could learn about what their peers were doing in the context of the financial solution process. Each site visit was recorded and began with an introduction to the WPD process and background information on each participant (e.g., designated role, tenure in the designated role, line of business within the organization, and experience using the IT financial software application). Process participants engaged in cognitive walk-throughs of the solution process, utilizing process diagrams, additional tools necessary for creating the cost/price structure, and other supporting materials (e.g., a price release letter). The group session encouraged participants to discuss the different ways in which they perform the process and the reasons why.

Participants: a total of 44 geographically dispersed end users had been identified prior to each site visit. The population of end users represented several roles in the process, including account representatives, pricers, risk managers, pricing line managers, technical solution managers, and roles involved in managing a bid process at different levels of seniority.
3 Results and Conclusion
The focus groups and the subsequent analysis of scripts and recordings enabled us to establish a dominant process, the major variations, and their drivers.
We highlight examples of important variations identified and their rationale. Variations mostly evolved along geographic and line-of-business lines. We discovered 4 different implementations of process management systems, often shared across multiple countries. Most support a case management paradigm as opposed to a production process, as could be expected. Another source of variation was the upstream and downstream process integration. Some countries and business lines have an automated way of passing a contract to fulfillment, others have a standard format to pass on, and others do this entirely manually. The same holds for the integration with opportunity qualification. The reason is that those upstream and downstream processes are also not standardized. A major difference entailed the use of an outsourcing center for pricing. Where this was the case, processes followed the dominant case more closely. Involvement of only in-country participants led to variants. Role scope varied between the countries and LOBs. This was the major source of large process model variations. In some markets some participant roles exceeded their typical scope, for example encompassing service design and modeling the cost/price case. This led to structurally different process models and very idiosyncratic local business process implementations. While this is only a high-level overview of the results, it demonstrates that WPD as a bottom-up method of variation elicitation, here based on focus groups, is a very suitable way of complementing a traditional business process standardization project with variation management. In a subsequent step, variations must be adjudicated and accommodated in conjunction with the standardized new business process.
References
1. Hammer, M., Stanton, S.: How process enterprises really work. Harvard Business Review (November-December 1999)
2. Lazovik, A., Ludwig, H.: Managing Process Customizability and Customization: Model, Language and Process. In: Benatallah, B., Casati, F., Georgakopoulos, D., Bartolini, C., Sadiq, W., Godart, C. (eds.) WISE 2007. LNCS, vol. 4831, pp. 373–384. Springer, Heidelberg (2007)
3. Chang, S.H., Kim, S.D.: A Variability Modeling Method for Adaptable Services in Service-Oriented Computing. In: 11th International Software Product Line Conference, SPLC 2007 (2007)
4. Decker, G., Puhlmann, F.: Extending BPMN for modeling complex choreographies. In: CoopIS, DOA, ODBASE, GADA, and IS (2010)
5. Dijkman, R., Dumas, M., van Dongen, B., Kaarik, R., Mendling, J.: Similarity of business process models: Metrics and evaluation. Information Systems (2010)
6. Eriksson, M., Borg, K., Börstler, J.: Managing Requirements Specifications for Product Lines. Journal of Systems and Software (2009)
7. Bailey, J., Kieliszewski, C., Blomberg, J.: Work in organizational context and implications for technology interventions. In: Proceedings of Human Factors in Organizational Design and Management, ODAM 2008 (2008)
8. Kieliszewski, C., Bailey, J., Blomberg, J.: A Service Practice Approach: People, Activities and Information in Highly Collaborative Knowledge-based Service Systems. In: Maglio, P., et al. (eds.) Handbook of Service Science. Springer, Heidelberg (2010)
Stimulating Skill Evolution in Market-Based Crowdsourcing
Benjamin Satzger, Harald Psaier, Daniel Schall, and Schahram Dustdar
Distributed Systems Group, Vienna University of Technology
Argentinierstrasse 8/184-1, A-1040 Vienna, Austria
{satzger,psaier,schall,dustdar}@infosys.tuwien.ac.at
http://www.infosys.tuwien.ac.at
Abstract. Crowdsourcing has emerged as an important paradigm in human problem-solving techniques on the Web. One application of crowdsourcing is to outsource certain tasks to the crowd that are difficult to implement in software. Another potential benefit of crowdsourcing is the on-demand allocation of a flexible workforce. Businesses may outsource tasks to the crowd based on temporary workload variations. A major challenge in crowdsourcing is to guarantee high-quality processing of tasks. We present a novel crowdsourcing marketplace that matches tasks to suitable workers based on auctions. The key to ensuring high quality lies in skilled members whose capabilities can be estimated correctly. We present a novel auction mechanism for skill evolution that helps to correctly estimate workers' capabilities and to evolve the skills that are needed. Evaluations show that this leads to improved crowdsourcing.

Keywords: Human-centric BPM, Crowdsourcing, Online communities, Task markets, Auctions, Skill evolution.
1 Introduction
Today, ever-changing requirements force in-house business processes to rapidly adapt to changing situations in order to stay competitive. Changes not only require process adaptation, but also the inclusion of new capabilities and knowledge previously unavailable to the company. Thus, outsourcing of parts of business processes became an attractive model. This work, in particular, focuses on a distinctive recent type of outsourcing called crowdsourcing. The term crowdsourcing describes a new web-based business model that harnesses the creative solutions of a distributed network of individuals [4], [18]. This network of humans is typically an open Internet-based platform that follows the open world assumption and tries to attract members with different knowledge and interests. Large IT companies such as Amazon, Google, or Yahoo! have recognized the opportunities behind such mass collaboration systems [8], both for improving their own services and as a business case. In particular, Amazon focuses on a task-based marketplace that requires explicit collaboration. The most prominent platform they currently offer is Amazon Mechanical Turk (AMT) [3].
Requesters are invited to issue human-intelligence tasks (HITs) requiring a certain qualification to AMT. Registered customers mostly post tasks requiring minor effort that nevertheless demand human capabilities (e.g., transcription, classification, or categorization tasks [11]). Reasons for companies to outsource tasks to a crowd platform are flexibility, high availability, and low costs. Workers from all over the world and from different timezones are allowed to register with the platform. With 90% of tasks processed at a cost of $0.10 or less, the same task is usually also offered to multiple AMT workers. However, crowdsourcing platforms are loosely coupled networks with members of different interests, working styles, and cultural backgrounds. The flexible crowd structure allows workers to join and leave at any time. This makes it hard to predict or guarantee the quality of a task’s result. There are crowd providers such as CrowdFlower [6] that broker crowd resources to customers to overcome quality and reliability issues. However, managing and adapting the crowd’s skills and resources in an automated manner remains challenging. Crowd customers prefer fully automated deployment of their tasks to a crowd, just as in common business process models. In this paper, we base our solution on the service-oriented architecture (SOA) paradigm. SOAs are an ideal grounding for distributed environments. With their notion of participants as services and registries, resources can be easily and even automatically discovered for composing whole business processes. A plethora of standards supports seamless integration and registration of new services, and provides protocols for communication, interaction and control of the components. Altogether, we believe SOAs provide an intuitive and convenient technical grounding for automating large-scale crowdsourcing environments. Importantly, SOA today includes not only software-based services, but also Human-Provided Services [16] and BPEL4People [2] for human interactions in business processes, and allows hosting mass collaboration environments. The main challenges addressed in this work relate to building and managing an automated crowd platform. It is not only important to find suitable workers for a task and to provide the customer with satisfactory quality, but also to maintain a motivated base of crowd members and provide a stimulus for learning required skills. Only a recurring, satisfied crowd staff is able to ensure high quality and high output. As in any crowd, fluctuations must be compensated for, and a skill evolution model must support new and existing crowd workers in developing their capabilities and knowledge. Finally, the standard processes on such a platform should be automated and free from manual intervention to handle the vast amount of tasks and to make the platform compatible with a SOA approach. On top of that, the model should increase the benefit of all participants. In detail, we provide the following contributions in this paper:
– Automated matching and auctions. To provide a beneficial distribution of the tasks to the available resources we organize auctions according to novel mechanisms.
– Stimulating skill evolution. In order to bootstrap new skills and inexperienced workers we provide skill evolution by integrating assessment tasks into our auction model.
– Evaluation. Experiments quantify the advantages of a skill-evolution-based approach in comparison to traditional auctions.

The remainder of this paper is structured as follows: Section 2 discusses related work. Section 3 describes the design of our crowdsourcing system, including its actors and their interaction. Section 4 then details our adaptive auction mechanisms, and Section 5 presents the conducted experiments and discusses their results. Section 6 concludes the paper and points to future work.
2 Related Work
In this work we position crowdsourcing in a service-oriented business setting by providing automation. In crowdsourcing environments, people offer their skills and capabilities in a service-oriented manner. Major industry players have been working towards standardized protocols and languages for interfacing with people in SOA. Specifications such as WS-HumanTask [10] and BPEL4People [2] have been defined to address the lack of human interactions in service-oriented businesses [14]. These standards, however, have been designed to model interactions in closed enterprise environments where people have predefined, mostly static, roles and responsibilities. Here we address the service-oriented integration of human capabilities situated in a much more dynamic environment where the availability of people is under constant flux and change [5]. The recent trend towards collective intelligence and crowdsourcing can be observed by looking at the success of various Web-based platforms that have attracted a huge number of users. Well-known representatives of crowdsourcing platforms include Yahoo! Answers [20] (YA) and the aforementioned AMT [3]. The difference between these platforms lies in how the labor of the crowd is used. YA, for instance, is mainly based on interactions between members. Questions are asked and answered by humans, so there is no ability to automatically control the execution of tasks. The aspect of YA relevant for our crowdsourcing approach is the role of two-sided markets [13]. In YA, users get 100 points by signing up to the platform [19]. For each answer provided, users get additional points (more points if the answer is selected as the best answer). However, users lose points if they ask questions, thereby encouraging members to provide answers. Because of this rewarding scheme, YA users tend to adopt one role – answerer or asker – instead of taking on both roles. In the context of YA and human-reviewed data, [17] provided an analysis of data quality, throughput and user behavior. In contrast, AMT offers access to the largest number of crowdsourcing workers. With its notion of HITs that can be created using a Web-service-based interface, it is closely related to our aim of mediating the capabilities of crowds to service-oriented business environments. According to one of the latest analyses of AMT [11], HIT topics include, first
of all, transcription, classification, and categorization tasks for documents and images. Furthermore, there are also tasks for collecting data, image tagging, and feedback or advice on different subjects. The major challenges are to find, on request, skilled workers that are able to provide high-quality results for a particular topic (e.g., see [1]), to avoid spamming, and to recognize low performers. To the best of our knowledge, these problems are still not addressed by AMT. In this work we focus on those issues. The shortcoming of most existing real platforms is the lack of detailed skill information. Most platforms have only a simple measure to prevent workers from claiming tasks (in AMT, a threshold on the task success rate can be defined). In [15], the automated calculation of expertise profiles and skills based on interactions in collaborative networks was discussed. In this work, we introduce a novel auction mechanism to promote skill evolution in crowdsourcing environments. Compared to the open bidding systems used by popular online auction markets such as eBay [9], our auction mechanism is a closed bidding system similar to public procurement. In [7], a model of crowdsourcing is analyzed in which strategic players select among, and subsequently compete in, contests. In contrast, the approach presented in this paper supports the evolution of worker skills through auction mechanisms.
3 Design of Marketplaces in Crowdsourcing
The core activity in task-based crowd environments is members providing their labor by processing tasks. In this section, we explain our idea of task-based crowdsourcing on a market-oriented platform. The aim is to organize and manage the platform to the satisfaction and benefit of all participants: crowd members and the platform provider. We will now introduce the basic design of the proposed crowdsourcing environment.
3.1 Skill-Based Task Markets
In task markets different stakeholders can be identified. Generally, there are requesters and workers, representing the registered members of a crowd marketplace. The task of the third stakeholder, the crowd operator in between, is to manage the crowd task auctions. To satisfy all stakeholders the operator must ensure that requesters obtain a result of high quality in a timely manner. The workers, on the other hand, would like to have tasks available whenever they are motivated to work and are interested in a high reward for processing a task. The operator itself works towards a long-term profit. To bootstrap the skill-based system, each member interested in processing tasks is required to create a profile containing information about her/his interests and skills. The basic interactions and an external view on the proposed crowdsourcing environment are depicted in Fig. 1. The crowdsourcing environment consists of members who can participate in transactions (see Fig. 1(a)). Within a particular transaction a member can either adopt the role of a requester R, who initiates a transaction by announcing tasks (see Fig. 1(b)), or the role of a worker
Fig. 1. Crowd environment building blocks and interaction of stakeholders: (a) crowd, (b) tasks, (c) requester, (d) worker
W, who processes a task. We propose a crowdsourcing marketplace that handles transactions transparently for its members; requesters and workers do not communicate directly, but only with the crowdsourcing marketplace in between. We argue that this standardized style of interaction is less prone to misconceptions and more efficient because it allows members to get used to the system. Tasks (Fig. 1(b)) are created by requesters based on their current needs. Requesters initiate a transaction by submitting a task to the marketplace, together with information about the amount of money they are willing to pay for the processing of the task and additional requirements (Fig. 1(c)). It is the responsibility of the marketplace operator to find a suitable worker, to submit the task to the worker, to collect the result, and to transmit it to the requester. The interaction of a worker with the market platform is initiated by the latter, which asks a member whether s/he is interested in processing a task (Fig. 1(d)). This interest can be expressed by bidding for the task. Workers have skill profiles denoted by the symbol P. These profiles are not statically defined, but are updated based on the level of delivered task quality. This procedure ensures an up-to-date view on workers’ capabilities and skills. Based on the bids and background information about the bidders, the system selects one or multiple workers, who are then asked to process the task.
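As an illustration, a worker's skill profile as described above can be represented as a per-skill pair of observed performance and confidence. The update rule shown here (an incremental mean for the performance and a confidence that grows with the number of ratings) is an assumption of this sketch; the text only states that ratings update both values.

import java.util.HashMap;
import java.util.Map;

final class WorkerProfile {
    private static final class SkillRecord {
        double performance = 0.0;  // observed performance in [0,1], derived from requester ratings
        double confidence = 0.0;   // confidence in that value, grows with the number of ratings
        int ratings = 0;
    }

    private final Map<String, SkillRecord> skills = new HashMap<>();

    // Incorporate a requester rating in [0,1] for the given skill.
    void addRating(String skill, double rating) {
        SkillRecord r = skills.computeIfAbsent(skill, s -> new SkillRecord());
        r.ratings++;
        r.performance += (rating - r.performance) / r.ratings;  // incremental mean (assumed update rule)
        r.confidence = 1.0 - 1.0 / (1.0 + r.ratings);           // approaches one with many ratings (assumed)
    }

    double performance(String skill) {
        SkillRecord r = skills.get(skill);
        return r == null ? 0.0 : r.performance;
    }

    double confidence(String skill) {
        SkillRecord r = skills.get(skill);
        return r == null ? 0.0 : r.confidence;
    }
}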
3.2 Towards Auction-Based Crowdsourcing
Auctions are a very old idea, already used by the Babylonians, but still an active area of research. The rise of e-commerce has drastically increased the number and diversity of goods traded via auctions. Many recently installed markets, such as energy or pollution permit markets, are based on auctions [12]. There are many different flavors of auctions, differing in the number of items considered (single/multi-item), the number of buyers and sellers (demand/supply/double auction), the bidding procedure (open/closed bids and ascending/descending), and how the price is determined (e.g., first/second price); however, four standard types are widely used [12]. They all assume a single seller and multiple buyers (demand auction) and, in their simplest forms, a single item to sell (single-item auction). The so-called English auction is an ascending open-bid auction, and the Dutch auction is a descending open-bid auction. The other two standard auction types are closed-bid auctions, i.e., each bidder submits a single bid which is hidden from the other buyers. We use an adapted version of a closed-bid auction; a single auction deals with the matching of one task to one or many crowd
workers (single-item demand auction). We will introduce our crowdsourcing auction mechanisms in the following sections.
4 Auction-Based Task Assignment
In the following we discuss the steps involved in transaction processing and outline the novel idea of skill evolution.
4.1 Processing of Transactions
Figure 2 illustrates the steps involved in the internal processing of a transaction. In the qualification step the marketplace identifies all members capable of processing the task (Fig. 2(a)), based on the task description and the members’ profiles. The preselection chooses a limited number of the most suitable workers (Fig. 2(b)) in order to have a reasonable number of participants for the auction. The preselection step helps to avoid a flooding of auction announcements. In this way members are only exposed to tasks/auctions for which they actually have a realistic chance of being accepted as the worker. Due to the voluntary, open nature of crowdsourcing environments, not all preselected workers may decide to follow the invitation to compete in an auction.
Fig. 2. Internal processing of a transaction: (a) auction, (b) preselection, (c) biddings, (d) feedback
This fact is depicted by the transition phase between Fig. 2(b) and Fig. 2(c), where only a subset of the preselected workers decides to participate. The auction phase (Fig. 2(c)) allows each participant to submit an offer and, finally, a winner is determined who is supposed to process the task. In the case of successful processing the marketplace returns the final result to the requester, handles the payment, and allows the requester to give feedback about the transaction in the form of a rating (Fig. 2(d)). As mentioned before, tasks come with a description, a maximum amount of money the requester is willing to pay, and further requirements, i.e., time requirements and quality requirements. The former are typically given in the form of a deadline; the latter could range from a simple categorization (e.g., low, high) to sophisticated quality requirement models. Each worker has a self-provided profile describing her/his skills and, additionally, the marketplace operator keeps track of the actual performance. We propose to maintain a performance
value per user and skill, encoded as a tuple consisting of the observed performance and the confidence in that value. The input used to generate these performance values comes from the ratings of the requesters and a check of whether the deadline was met, which can be performed by the system without feedback from the requester. The qualification phase is based on matching the task description to the skills of the members, considering their performance and confidence values. Higher requirements impose higher standards on the performance of the member. The result of this matching is a boolean value indicating whether a member meets the minimum requirements. In the next step, the preselection, the qualified members are ranked based on skill, performance, and the confidence in the performance; only the top-k members are chosen to participate in the auction. This helps to reduce the number of auction requests to members in order to avoid spamming members and to spare members the frustration caused by not winning the auction. The marketplace operator, as the auctioneer, hides parts of the task’s data. Workers only see the task description and the time and quality requirements, but not the associated price determined by the requester. The auction is performed as a closed-bid auction, where each participant is only allowed one bid. At the end of the auction a winner is determined based on the amounts of the bids and the performance-confidence combination of the bidders’ skills. If all bids are higher than the amount the requester is willing to pay, the auctioneer would typically reject all bids and inform the requester that the task cannot be assigned under the current conditions. In this case the requester could change the task by increasing the earnings, lowering the quality requirements or extending the deadline, and resubmit the task. With a selection strategy outlined in more detail in the next section the marketplace assigns the task to the worker for processing. After the processing of the task by the worker and the receipt of rating information, the performance of the worker is adjusted and the confidence value is increased. Technically, an aptitude function estimates how well workers are suited for handling a task. It is used as the basis for qualification and preselection and can be formally defined as

aptitude : W × T → [0, 1],    (1)

where W is the set of workers and T represents tasks. aptitude(w, t) = 1 would mean that worker w ∈ W is perfectly qualified for handling task t ∈ T. A mapping to zero would represent total inaptness. Similarly, a ranking function is used to rank workers’ bids:

rank : W × T × B → [0, 1],    (2)

where B is the set of bids. In addition to the aptitude, the rank function also takes monetary aspects, contained in bid b ∈ B, into account. A property of a sound ranking function is that if two workers have the same aptitude for a task then the one with the lower bid will have a higher rank. The aptitude function is used for performing qualification and preselection. As the auction admittance strategy, one can either admit all workers with an aptitude higher than a certain threshold, the top-k workers according to aptitude, or a combination of the two
strategies. The ranking function is used to determine the winner of an auction: the highest ranked worker.
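The following sketch illustrates how the aptitude and rank functions could drive qualification, top-k preselection, and winner determination. The concrete formulas used here (aptitude as performance weighted by confidence, rank discounting higher bids) are assumptions; the text leaves them abstract and only fixes the signatures and the property that, for equal aptitude, a lower bid yields a higher rank.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class AuctionPipeline {
    record Worker(String id, double performance, double confidence) {}
    record Task(String description, double qualityRequirement, double maxPrice) {}
    record Bid(Worker worker, double amount) {}

    // aptitude : W x T -> [0,1]; here simply performance weighted by confidence (assumed).
    static double aptitude(Worker w, Task t) {
        return w.performance() * w.confidence();
    }

    // rank : W x T x B -> [0,1]; for equal aptitude, a lower bid yields a higher rank.
    static double rank(Worker w, Task t, Bid b) {
        return aptitude(w, t) * (1.0 - 0.5 * Math.min(b.amount(), 1.0));
    }

    // Qualification (aptitude threshold) combined with top-k preselection.
    static List<Worker> preselect(List<Worker> members, Task t, double threshold, int k) {
        return members.stream()
                .filter(w -> aptitude(w, t) >= threshold)
                .sorted(Comparator.comparingDouble((Worker w) -> aptitude(w, t)).reversed())
                .limit(k)
                .toList();
    }

    // Winner determination: the highest-ranked bid that does not exceed the requester's maximum price.
    static Optional<Bid> determineWinner(List<Bid> bids, Task t) {
        return bids.stream()
                .filter(b -> b.amount() <= t.maxPrice())
                .max(Comparator.comparingDouble(b -> rank(b.worker(), t, b)));
    }
}

If determineWinner returns an empty result, all bids exceeded the requester's price and the requester would be asked to adjust the task, as described above.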
4.2 Skill Evolution
The concepts discussed so far provide the fundamentals for automated matching of tasks to workers. As outlined previously, a further major challenge hampering the establishment of a new service-oriented computing paradigm spanning enterprise and open crowdsourcing environments is quality issues. In our scenario this is strongly connected to correctly estimating the skills of workers. One approach for increasing the confidence in worker skills is qualification tasks, with the shortcoming that these tasks would need to be created (manually) by the requesters who have the necessary knowledge. This implies a huge overhead for the testing requester; s/he is also the only one who benefits from the gathered insights. Here, we take a different approach by integrating the capability of confidence management into the crowdsourcing platform itself. Instead of having point-to-point tests, we propose the automated assessment of workers to unburden requesters from inspecting workers’ skills. We believe that this approach offers great potential for the (semi-)automatic inclusion of crowd capabilities in business environments. The first challenge one needs to address is to cope with the “hostile” environment in which computing is performed. Workers may cheat on results (e.g., by copying and pasting existing results available in the platform), spam the platform with unusable task results, or even provide false information. A well-known principle in open, Web-based communities is the notion of authoritative sources that act as points of reference. For example, this principle has been applied on the Web to propagate trust based on good seeds. Our idea of skill evolution works in a similar manner. We propose the automatic assessment of workers where confidence values are low. For example, newcomers who recently signed up to the platform may be high or low performers. To unveil the tendency of a worker, we create a hidden ‘tandem’ task assignment comprising a worker whose skills are known with high confidence (a high performer) and a worker about whose skills the crowdsourcing platform has limited knowledge (i.e., low confidence). Both workers then process the same task in the context of a requester’s (real) task. However, only the result of the high-confidence worker is returned to the requester, whereas the result of the low-confidence worker is compared against the delivered reference. This approach has advantages and drawbacks. First, skill evolution through tandem assignments provides an elegant solution to avoid training tasks (assessments are created automatically and managed by the platform) and also implicitly stimulates a learning effect. Of course, the crowdsourcing platform cannot charge the requester for tandem task assignments since they mainly help the platform to better understand the true skill (confidence) of a worker. Thus, the platform must pay for worker assessments. As we shall show later in our evaluation, performing assessments has the positive effect that the overall quality of provided results, and thus requester satisfaction, increases due to a better understanding of worker skills. We embed skill evolution in our crowdsourcing platform as follows.
After the winner of an auction has been determined, it is evaluated whether an assessment task is issued to further workers. The function assess outputs 1 if an assessment task is to be assigned to a worker and 0 otherwise:

assess : W × T × B × W → {0, 1}    (3)
An input tuple (w, t, b, wr) checks whether task t ∈ T is to be assigned to worker w ∈ W who offered bid b ∈ B. Worker wr ∈ W is the reference worker, in our case the worker who has won the corresponding auction and who will thus process the same task.
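A possible realization of assess, building on the types of the previous sketch, is shown below. The decision rule (issue a hidden tandem assignment when the candidate's confidence is low, the reference worker has high confidence, and a per-task assessment budget is not exhausted) is an illustrative assumption; the text defines only the signature of the function.

final class SkillEvolution {
    private final double lowConfidenceThreshold;   // e.g., 0.3 (assumed)
    private final double referenceConfidence;      // e.g., 0.8 (assumed)
    private final int maxAssessmentsPerTask;       // platform-paid budget per task (assumed)
    private int issuedForCurrentTask = 0;

    SkillEvolution(double lowConfidenceThreshold, double referenceConfidence, int maxAssessmentsPerTask) {
        this.lowConfidenceThreshold = lowConfidenceThreshold;
        this.referenceConfidence = referenceConfidence;
        this.maxAssessmentsPerTask = maxAssessmentsPerTask;
    }

    // assess : W x T x B x W -> {0,1}; wr is the auction winner acting as reference worker.
    // The task t and bid b could also be taken into account, e.g., to skip very demanding tasks.
    int assess(AuctionPipeline.Worker w, AuctionPipeline.Task t,
               AuctionPipeline.Bid b, AuctionPipeline.Worker wr) {
        boolean lowConfidence = w.confidence() < lowConfidenceThreshold;
        boolean reliableReference = wr.confidence() >= referenceConfidence;
        boolean notTheWinner = !w.id().equals(wr.id());
        boolean budgetLeft = issuedForCurrentTask < maxAssessmentsPerTask;
        if (lowConfidence && reliableReference && notTheWinner && budgetLeft) {
            issuedForCurrentTask++;  // the platform, not the requester, pays for this tandem assignment
            return 1;
        }
        return 0;
    }
}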
5 Implementation and Evaluation
In this section we discuss implementation aspects, introduce the detailed design of our experiments, and present results.
5.1 Simulation Environment
We have implemented a Java-based simulation framework that supports all previously introduced concepts and interactions between requesters, the platform, and workers. All of the functions (1)-(3) introduced above have been implemented in our framework. Due to space limitations, we do not present a detailed discussion of implementation aspects. The interested reader can find details regarding the prototype as well as a Web-based demo online (http://www.infosys.tuwien.ac.at/prototyp/Crowds/Markets_index.html).
5.2 Experiment Design and Results
An evaluation scenario consists of a set of workers W and a set of requesters R. In every round of the simulation each requester usually announces a task. An auction is conducted for each announced task t, which consists of a description of the skills needed for its processing, an expected duration, a deadline, and the expected quality. High quality requirements indicate highly sophisticated and demanding tasks. For each worker w and skill s the platform maintains a performance value pfmc(w, s) and a confidence in that value cnfd(w, s). This observed performance value is derived from requester ratings; if it is based on many ratings the confidence is close to one, if there are only a few ratings available the confidence is close to zero. Based on task t’s skill requirements and a worker w’s performance/confidence values for these skills it is possible to calculate the expected performance pfmc(w, t) and confidence cnfd(w, t) for that task. For the evaluation we assume that each worker w has a certain performance pfmc_real(w, s) for a skill s which is hidden but affects the quality of the results. Requesters rate the workers based on the results, which in turn is the basis for the observed performance and confidence values. We assume that the processing of tasks demanding a certain skill causes a training effect for that skill,
i.e., pfmc_real(w, s) increases. For the sake of simplicity, in the evaluation we assume that there is only one skill. This does not change fundamental system properties, and one skill allows us to extrapolate the behavior of multiple skills. Whether a single or multiple skills are considered indeed affects qualification, preselection, bidding, and rating, but all the mechanisms are naturally extensible from one to multiple skills by performing the same computations for each skill and a final combination step. Hence, pfmc : W × S → [0, 1] and pfmc : W × T → [0, 1] are reduced to pfmc : W → [0, 1]. The same holds for cnfd and pfmc_real. For the simulation we have created 500 workers with random values for pfmc_real(w) and cnfd(w) according to a normal distribution N(μ, σ²). The initial performance value pfmc(w) is set according to the formula pfmc_real(w) + N(0, 1 − cnfd(w)), which ensures that for high confidence the expected deviation of pfmc(w) from pfmc_real(w) is small and for low confidence values it is high, respectively. All values are restricted to the range [0, 1]. The following figures illustrate the simulation setup in detail. Scenario 1 assumes that there are three requesters (i.e., typically three tasks are issued in every round). It is an environment in which skilled workers are rare and the confidence in the workers’ performance is relatively low, i.e., there are many workers who have few ratings. The real performance pfmc_real(w) for a worker w is drawn according to N(0.3, 0.25), and the confidence value cnfd(w) is randomly generated by N(0.2, 0.25). Given the two generated values pfmc_real(w) and cnfd(w), the observed performance is randomly drawn according to N(pfmc_real(w), 1 − cnfd(w)). Hence, a low confidence in the performance leads to highly distorted values for pfmc(w); higher confidence values decrease the variance. Figure 3 gives a detailed view on the statistical distributions of the workers’ performance and confidence in our experiments.
Fig. 3. Generated worker population according to Scenario 1 (histograms: (a) real pfmc, (b) obs pfmc, (c) diff real/obs, (d) cnfd)
The histograms count the number of workers in the buckets [0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8), and [0.8, 1] according to the real performance pfmc_real in Fig. 3(a) and according to the observed performance pfmc in Fig. 3(b). Fig. 3(c) represents the difference between real and observed performance; the confidence values cnfd are shown in Fig. 3(d). Scenario 2 contains 20 requesters, which results in a much more "loaded" system. The workers are relatively skilled; their real performance pfmc_real(w) is generated according to N(0.7, 0.25), i.e., the mean performance value is 0.7 compared to 0.3 in the previous scenario. We further assume that there is a
Fig. 4. Generated worker population according to Scenario 2 (histograms: (a) real pfmc, (b) obs pfmc, (c) diff obs/real, (d) cnfd)
higher amount of ratings already available, and cnfd(w) is generated according to N(0.8, 0.25). Performance values are again generated as N(pfmc_real(w), 1 − cnfd(w)). The plots in Fig. 4 illustrate the generated worker population for Scenario 2 in the same way as for the previous one. In both scenarios tasks are issued in the first 500 rounds and the simulation ends when all workers have finished all accepted tasks. Tasks are generated randomly with the following methodology. The real hidden result of a task is set to a uniformly distributed random value U(0, 1) from the range [0, 1]. An expected duration is randomly drawn from the set {1, 2, . . . , 24}. A randomly assigned quality requirement U(0.4, 1) states whether the task poses high requirements on its processing. This means that requesters never state that the quality of their tasks' results is allowed to be lower than 0.4. Finally, a random deadline is generated, which is guaranteed to lie after the expected duration.
Requester Behavior. In every round of the simulation each requester is asked to submit a task. After processing, requesters receive the result for the task. If they receive the result after the deadline they rate the transaction with a value of zero and suspend for 20 rounds, i.e., they refuse to issue a new task due to the negative experience. If the result is transmitted on time requesters rate the quality of the received result. Computationally this is done by comparing the task's real result with the received result. It is assumed that task requesters are able to estimate whether the received result is close to what was expected. The best possible rating is one, zero is the worst, and all values in between are possible. Ratings for a worker w, be they negative or positive, increase the confidence cnfd(w) and update the observed performance pfmc(w). If the rating is below a threshold of 0.3 the requester suspends for ten rounds, similar to a deadline violation. Hence, requesters with negative experiences tend to make less use of the crowdsourcing marketplace. In addition to the pure task description requesters announce the maximum price they are willing to pay for the processing of the task. Prices are also represented by random values within the range [0, 1]. Tasks with high quality requirements and high expected duration are more likely to have costs close to the maximum value.
Worker Behavior. When asked for a bid during an auction, a worker first checks whether it is realistic to finish the task before the deadline considering all tasks the worker is working on. Each task has an expected duration t and each
worker would only submit a bid if s/he has at least 1.5 · t of time to work on the task before the end of the deadline, considering already accepted tasks. The actual processing time is set to a random value within [t, 2t]. For workers with high real performance the processing time is more likely to be close to t. This value is set by the simulation environment and the workers do not know the exact processing time in advance. For the evaluation we consider workers with two different bidding behaviors: conservative and aggressive. Conservative workers determine the price according to a linear combination of a task's effort (i.e., a normalized value of the expected duration), the worker's real performance, and her/his workload. The rationale is that workers want more money for work-intensive tasks; workers with a high real performance are aware of their capabilities, which influences the price as well. Finally, the higher a worker's current workload, the higher the requested payment.

bid(w, t)_conservative = 0.4 · effort(t) + 0.4 · pfmc_real(w) + 0.2 · load(w)

Aggressive workers, in contrast to conservative workers, do not increase the bid's price based on the workload but are more strongly driven by their own real performance.

bid(w, t)_aggressive = 0.4 · effort(t) + 0.6 · pfmc_real(w)

Whether a worker is conservative or aggressive is chosen randomly with equal probability. The processing of a task has a positive influence on the real performance of the worker w, i.e., pfmc_real,new(w) = pfmc_real(w) + 0.1 · (1 − pfmc_real(w)). This modeling of a training effect results in a high learning rate for workers with low real performance and a slowed-down learning effect for workers who are already very good.
Auction Processing. Auctions are conducted for the purpose of matching a task to a worker. As described in Section 4 there is a qualification and preselection stage before the actual auction in order to avoid spamming a huge worker base with auction requests for which many workers may not have the necessary skills. Since we only consider one skill and have a limited number of 500 workers, it is reasonable to admit all of them to the auctions. To achieve that, the aptitude function (see Eq. 1) is set as follows: aptitude: w → 1. After receiving the workers' bids they are ranked by a ranking function as defined in Eq. 2. Since there is only one skill the function is slightly adjusted: rank: (w, b) → 0.6 · pfmc(w) + 0.3 · cnfd(w) + 0.1 · (1 − price(b)). Workers may either return a bid or refuse to submit one. From the received bids all bids are removed whose price is higher than the price the requester is willing to pay. The remaining valid bids are ranked such that a high observed performance, high confidence, and a low price of the bid positively influence the rank.
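The bid formulas, the single-skill ranking function, and the training update translate directly into code; the following sketch uses hypothetical method names and assumes all inputs are already normalized to [0, 1].

```java
// Hypothetical sketch of the bidding, ranking, and training formulas above.
public class Bidding {
    static double conservativeBid(double effort, double pfmcReal, double load) {
        return 0.4 * effort + 0.4 * pfmcReal + 0.2 * load;
    }

    static double aggressiveBid(double effort, double pfmcReal) {
        return 0.4 * effort + 0.6 * pfmcReal;
    }

    // Single-skill ranking of worker w's bid b (Eq. 2 as instantiated above).
    static double rank(double pfmc, double cnfd, double price) {
        return 0.6 * pfmc + 0.3 * cnfd + 0.1 * (1.0 - price);
    }

    // Training effect after processing a task: large gains for weak workers.
    static double trainingUpdate(double pfmcReal) {
        return pfmcReal + 0.1 * (1.0 - pfmcReal);
    }

    public static void main(String[] args) {
        System.out.printf("%.2f%n", conservativeBid(0.5, 0.7, 0.2)); // 0.52
        System.out.printf("%.2f%n", rank(0.7, 0.5, 0.3));            // 0.64
        System.out.printf("%.2f%n", trainingUpdate(0.6));            // 0.64
    }
}
```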
The emphasis at that stage clearly is on the performance and not on the price. It may happen that there is no valid bid; in that case the requester is informed that the task could not be processed.
Skill Evolution. In this work we want to investigate how crowdsourcing can benefit from skill evolution, which is achieved by assigning assessment tasks to workers. This is especially useful for workers with a low confidence value, for whom only few or no ratings are available. An assessment task is a task that is assigned to a worker although another worker has won the auction and was assigned to the task as well. The workers are not aware of the fact that other workers are processing the very same task, and neither are the requesters. The crowdsourcing provider is responsible for paying for the training tasks. As usual, the result of the highest-ranked worker is returned to the requester, but it is additionally used as a reference for the training task. This enables the marketplace to generate a rating for the assessed worker by comparing her/his result to the reference. A further positive effect is the training of the assessed worker. The assignment of training tasks is based on the received list of valid bids. For controlling the skill evolution, Equation 3 needs to be set accordingly. The following definition of the assessment function, which results in disabling skill evolution and leads to purely profit-driven auction decisions, maps each combination of workers, bids, and reference workers to 0:

assess_profit: (w, b, w_r) → 0.

In the evaluation we have used the following setting for the skill-evolution-enabled auctions:

assess_skill: (w, b, w_r) → 0, if the working queue of w is not empty, or pfmc(w_r) < 0.8, or cnfd(w_r) < 0.8; and select(w, b) otherwise.

The function assess_skill guarantees that only workers with an empty working queue are assessed and that reference workers have high performance and high confidence. This is crucial because the worker w is rated according to the result of the reference worker w_r. If a worker with a performance lower than 0.8 or a confidence lower than 0.8 wins an auction, training task assignments are prohibited. If all prerequisites are met, the select function determines the workers who are assigned a training task. It is possible that multiple training tasks are assigned.

select: (w, b) → 1 with probability (1 − cnfd(w)) · urg, and 0 otherwise.

The select function assigns a training task based on the confidence of the considered worker. A low confidence increases the likelihood of a training task. The constant urg can be used to fine-tune the training task assignment procedure.
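Put as code, the assessment decision reduces to a few guards and one random draw; the sketch below mirrors assess_skill and select with hypothetical names and flattened arguments.

```java
import java.util.Random;

// Hypothetical sketch of the skill-evolution assessment decision above.
public class Assessment {
    static final Random RNG = new Random();
    static final double URG = 0.01; // assessment-frequency constant urg

    // Only idle workers are assessed, and only against a reference worker
    // with both high performance and high confidence.
    static boolean assess(boolean workerQueueEmpty,
                          double pfmcRef, double cnfdRef, double cnfdWorker) {
        if (!workerQueueEmpty || pfmcRef < 0.8 || cnfdRef < 0.8) {
            return false;
        }
        return select(cnfdWorker);
    }

    // A training task is assigned with probability (1 - cnfd(w)) * urg.
    static boolean select(double cnfdWorker) {
        return RNG.nextDouble() < (1.0 - cnfdWorker) * URG;
    }

    public static void main(String[] args) {
        // with cnfd = 0.1 the assessment probability is (1 - 0.1) * 0.01 = 0.009
        System.out.println(assess(true, 0.9, 0.85, 0.1));
    }
}
```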
Table 1. Tasks in Scenario 1 and Scenario 2 (entries have the form skill/no skill)

Tasks        Issued       No bid   Timeout   Training
Scenario 1   527/315      2/5      57/73     757/–
Scenario 2   1687/1641    19/26    556/579   1310/–
In our experiments urg is set to a value of 0.01; a higher value raises the probability of assessment tasks.
Discussions. Based on the introduced scenarios and simulation parameters, we test the benefit of skill evolution (skill) compared to regular auction processing (no skill). In the simulations, requesters issue a number of tasks to be processed by the crowd. However, requesters suspend their activity if the task quality is low (observed through low ratings) or task deadlines are violated. We hypothesize that a higher quality of task processing, and thus of the received ratings, also has positive effects on the profit of workers and the crowdsourcing platform. Table 1 gives an overview of the task statistics in each scenario. The number of issued tasks is influenced by the requesters' satisfaction. No bid means that a task could not be assigned to any matching worker. The column timeout counts the number of tasks that were not delivered on time. Finally, the number of training tasks is given in the last column. All entries have the form skill/no skill. In Fig. 5, we compare the results of Scenarios 1 and 2 on the basis of the total rating scores given by requesters and the average difference between the evolved real performance of the workers and the observed performance.
Fig. 5. Rating and skill misjudgement ((a) Rating, (b) Misjudgement; bars compare skill and no skill for Scenario 1 and Scenario 2)
For the rating results (Fig. 5(a)), the difference in Scenario 1 is more significant than in Scenario 2: 19% compared to 2%. The advantage of skill evolution support is most apparent in Scenario 1, where without it only an average rating score of slightly above 60% is given by the requesters. In this scenario with low load, the system can optimally exploit the better accuracy of worker skill estimation, resulting in an average rating of 80%. Interestingly, the ratings are far lower in Scenario 2, although the average true performance of
the workers is higher. The reason is that the high number of requesters causes a heavily loaded system and increases the probability of deadline violations. Also, low-performing workers are more likely to win an auction. The reaction of requesters to deadline violations and low quality is to give bad ratings, in this case an average rating of 60%. Due to the heavy load the benefit of skill evolution is small, because for determining an auction's winning bid, performance and confidence become less decisive while free working capacity becomes more important. For the misjudgement of the workers in Fig. 5(b) (lower values are better), the results indicate clear benefits of skill evolution. Misjudgement is based on the average difference between the real and the observed performance. With more tasks, and in particular assessment tasks, being issued, worker capabilities can be estimated more correctly. For Scenario 1, the difference is below 20% with and below 30% without skill evolution support. For Scenario 2 with more load, the results considering misjudgement are evidently better. With more transactions, the average difference of the performance values is around 7% in the skill evolution support model and around 19% otherwise. Thus, assessments provide remarkably good results for reducing misjudgement.
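One plausible reading of this misjudgement measure, assuming the per-worker difference is taken as an absolute value before averaging, is the following sketch.

```java
// Sketch of the misjudgement measure of Fig. 5(b). Taking the absolute value
// of each worker's difference is an assumption; the text only speaks of the
// average difference between real and observed performance.
public class Misjudgement {
    static double misjudgement(double[] pfmcReal, double[] pfmcObs) {
        double sum = 0.0;
        for (int i = 0; i < pfmcReal.length; i++) {
            sum += Math.abs(pfmcReal[i] - pfmcObs[i]);
        }
        return sum / pfmcReal.length;
    }

    public static void main(String[] args) {
        double[] real = {0.3, 0.6, 0.8};
        double[] obs  = {0.5, 0.5, 0.7};
        System.out.printf("%.3f%n", misjudgement(real, obs)); // 0.133
    }
}
```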
Fig. 6. Payments from requesters to workers and crowdsourcing platform ((a) Scenario 1, (b) Scenario 2; bars compare skill and no skill for Platform, Requester, and Worker)
In our simulation requesters try to minimize expenses; workers and the crowdsourcing marketplace try to maximize earnings. Currently, we use a simple model in which the marketplace collects for each transaction the difference between the maximum expenses, as specified by the requester, and the minimum salary, as specified by the worker bid. In a real setting, the crowdsourcing platform could decide to charge less for its service. The quality of the results, and thus the requesters' satisfaction, directly influences their tendency to offer tasks. Again, with skill evolution applied, more tasks are processed since good ratings encourage requesters to offer tasks at a constant rate (see Fig. 6(a)). Altogether, the requesters spend almost twice as much money with skill evolution. This is only true for Scenario 1; similar to the previous results, the difference is far smaller in overload situations. With more than six times as many requesters, their expenses remain well below six times the amount spent in Scenario 1 with skill evolution support. In total, the expenses in Scenario 2 are almost the same with and without skill evolution. The ratios are similar for the benefits of the
workers and the platform. In summary, skill evolution generally performs better; however, it does not counteract overload scenarios. While with moderate task offering frequencies the model performs much better in all measurements, the differences even out when the load increases and assessment tasks further overload the platform. The results show that it is the responsibility of the platform to balance the task load and to trade with only a manageable number of requesters.
6 Conclusions and Future Work
In this paper we present a novel crowdsourcing marketplace based on auctions. The design emphasizes automation and low overhead for users and members of the crowdsourcing system. We introduce two novel auctioning variants and show by experiments that it may be beneficial to employ assessment tasks in order to estimate members' capabilities and to train skills. Stimulating the demand for certain skills in such a way leads to skill evolution. As part of our ongoing research we plan to investigate how to allow for complex tasks and collaboration. Workers may, for instance, decompose tasks into subtasks, "crowdsource" the subtasks, and finally assemble the partial results into the final one. In such a setting the crowd would contribute the knowledge of how to compose and assemble complex tasks. Furthermore, crowd members could provide higher-level services such as quality control and insurance. Apart from auction-based mechanisms, specifications such as WS-HumanTask [10] and BPEL4People [2] for modeling human interactions in service-oriented business environments need to be extended to cope with the dynamics inherent to open crowdsourcing platforms. One example is to provide skill and quality models, based on previously negotiated service-level agreements (SLAs), that augment WS-HumanTask's people assignment model.
Acknowledgement. This work received funding from the EU FP7 program under the agreements 215483 (SCube) and 257483 (Indenica).
References
1. Agichtein, E., Castillo, C., Donato, D., Gionis, A., Mishne, G.: Finding high-quality content in social media. In: WSDM 2008, pp. 183–194. ACM, New York (2008)
2. Agrawal, A., et al.: WS-BPEL Extension for People (BPEL4People) (2007)
3. Amazon Mechanical Turk, http://www.mturk.com (last access March 2011)
4. Brabham, D.: Crowdsourcing as a model for problem solving: An introduction and cases. Convergence 14(1), 75 (2008)
5. Castellano, C., Fortunato, S., Loreto, V.: Statistical physics of social dynamics. Reviews of Modern Physics 81(2), 591–646 (2009)
6. CrowdFlower, http://crowdflower.com/ (last access March 2011)
7. DiPalantino, D., Vojnovic, M.: Crowdsourcing and all-pay auctions. In: EC 2009, pp. 119–128. ACM, New York (2009)
8. Doan, A., Ramakrishnan, R., Halevy, A.Y.: Crowdsourcing systems on the world-wide web. Commun. ACM 54(4), 86–96 (2011)
9. eBay, http://tinyurl.com/5w6zgfg (last access March 2011)
10. Ford, M., et al.: Web Services Human Task (WS-HumanTask), Version 1.0 (2007)
11. Ipeirotis, P.G.: Analyzing the Amazon Mechanical Turk marketplace. SSRN eLibrary 17(2), 16–21 (2010)
12. Klemperer, P.: Auctions: Theory and Practice. Princeton University Press, Princeton (2004)
13. Kumar, R., Lifshits, Y., Tomkins, A.: Evolution of two-sided markets. In: WSDM 2010, pp. 311–320. ACM, New York (2010)
14. Leymann, F.: Workflow-based coordination and cooperation in a service world. In: CoopIS 2006, pp. 2–16 (2006)
15. Schall, D., Dustdar, S.: Dynamic context-sensitive PageRank for expertise mining. In: Bolc, L., Makowski, M., Wierzbicki, A. (eds.) SocInfo 2010. LNCS, vol. 6430, pp. 160–175. Springer, Heidelberg (2010)
16. Schall, D., Truong, H.L., Dustdar, S.: Unifying human and software services in web-scale collaborations. IEEE Internet Computing 12(3), 62–68 (2008)
17. Su, Q., Pavlov, D., Chow, J.H., Baker, W.C.: Internet-scale collection of human-reviewed data. In: WWW 2007, pp. 231–240. ACM, New York (2007)
18. Vukovic, M.: Crowdsourcing for Enterprises. In: Proceedings of the 2009 Congress on Services, pp. 686–692. IEEE Computer Society, Los Alamitos (2009)
19. Yahoo: Answers scoring system, http://answers.yahoo.com/info/scoring_system (last access March 2011)
20. Yahoo! Answers, http://answers.yahoo.com/ (last access March 2011)
Better Algorithms for Analyzing and Enacting Declarative Workflow Languages Using LTL

Michael Westergaard

Department of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands
[email protected]

Abstract. Declarative workflow languages are easy for humans to understand and use for specifications, but difficult for computers to check for consistency and use for enactment. Therefore, declarative languages need to be translated to something a computer can handle. One approach is to translate the declarative language to linear temporal logic (LTL), which can be translated to finite automata. While computers are very good at handling finite automata, the translation itself is often a road block as it may take time exponential in the size of the input. Here, we present algorithms for doing this translation much more efficiently (around a factor of 10,000 times faster and handling 10 times larger systems on a standard computer), making declarative specifications scale to realistic settings.
1 Introduction
Workflow languages provide an efficient means of describing complex workflows, allowing analysis and assistance of users. Traditional workflow languages like, e.g., BPMN [12] require that the modeler explicitly specifies all possibilities from each state of the system, making it difficult to model more abstract relations between tasks when the user has many choices in each state. Descriptions such as "you are only young once; when you are young you should get an education; you can only get one master's degree; and only with a master's degree in business information systems will you truly be a master of business process modeling" can only with difficulty be implemented using a traditional workflow language, and the complexity grows as the freedom increases. For this reason declarative workflow languages, such as Declare (also referred to as ConDec or DecSerFlow) [15], are becoming popular. Declarative languages do not explicitly state the allowed choices, but instead focus on constraints between tasks, and hence on disallowed behaviour rather than allowed behaviour. They are therefore well-suited to describing systems like the aforementioned example.
This research is supported by the Technology Foundation STW, applied science division of NWO and the technology programme of the Dutch Ministry of Economic Affairs.
Specifications in Declare consist of tasks, represented as rectangles, and constraints, represented as (hyper-)arcs between tasks. An example Declare specification is shown in Fig. 1. Here we have five tasks (e.g., Young) and four constraints. The constraint response from Young to MSc, BIS, MSc, ES, and BSc states that if Young is executed, at least one of the others has to be executed as well. The not co-existence constraint between the two MSc tasks states that at most one of these can be executed. The precedence from MSc, BIS to Master of BPM states that Master of BPM can only be executed if MSc, BIS has been executed previously. Finally, Young has a constraint stating that it can be executed at most once (the shape with 0..1 above the task). Thus, this is a Declare implementation of the example mentioned earlier. We shall not go into further details about the available constraints and their graphical representation here, but refer the interested reader to [15].
Fig. 1. A process described using Declare
To provide analysis for and enact declarative languages, we translate specifications to a form that can be used by computers. For Declare, this has been done by translating specifications to the event calculus [2] and by translating specifications to linear temporal logic (LTL) [15]. Here, we are concerned with the translation to LTL, as this allows more advanced analysis prior to execution: LTL formulae can be translated to finite automata accepting the traces accepted by the LTL formula and original model [6, 7]. Analysis made available by this is checking whether the specification is inconsistent, which manifests itself as an empty acceptance language and is easily checked on a finite automaton. We can also identify tasks that can never be executed in a valid execution by inspecting the labels of the automaton, and identify redundant constraints by checking language inclusion using automata. Furthermore, an automaton is used for enactment of the workflow specification, by following states in the automaton and providing the user choices based on what the automaton allows. The classical approach to translating LTL to automata [6] results in a Büchi automaton accepting all infinite execution traces satisfying the formula. A slight modification of the algorithm [7] instead constructs a standard finite automaton accepting all finite traces accepted by the formula. This algorithm is intended for general model checking, and thus allows requiring multiple atomic propositions to be satisfied at the same time. A problem with this algorithm is that the construction may be exponential in time and space. As our atomic propositions are events that cannot happen at the same time, [14] further improves on the original algorithm by removing all transitions that require more than one event occurring at the same time, resulting in smaller automata and shorter execution time. The model-checking community has provisions for dealing with the size of automata by minimizing the formula prior to translation [4] or minimizing the
generated automata on-the-fly or afterwards [13, 4]. While these techniques alleviate the explosion to a certain degree, the model-checking community typically writes formulae by hand, which results in smaller formulae, and thus their problem is not so much the difficulty of generating the automaton in the first place, but rather the subsequent analysis. We, however, generate formulae automatically from easier-to-understand descriptions, so our formulae become much larger, making reduction during the initial generation important. Here, we improve on the state of the art by exploiting characteristics of LTL formulae derived from declarative workflow specifications, namely that they are in essence huge conjunctions of relatively simple formulae. Traditional LTL translations handle conjunctions quite poorly, so we instead translate both sides of a conjunction individually and subsequently construct the synchronized product of the two automata, resulting in better performance. Having sub-divided the problem allows further improvement. For example, we can use automaton minimization algorithms prior to synchronizing, and we can order the product using heuristics to reduce intermediate products and achieve better performance. In this paper we describe several algorithms for constructing the automaton for a Declare model more efficiently. Most of our algorithms have a theoretical worst-case execution time no better than the standard algorithms (i.e., exponential in the size of the input), but behave much better in practice. One presented algorithm only uses linear time to construct an automaton which is useful but not the same as the one constructed by the traditional algorithm (it is a non-deterministic automaton for the negation of the model). We have implemented our algorithms in the Declare tool [3], and the improvements yield speed-ups of a factor of 1,000–10,000 or more for randomly generated Declare models, and allow us to generate automata for descriptions consisting of on the order of 50 tasks and constraints, where previous algorithms only allowed us to generate automata for descriptions consisting of around 5–10 tasks and constraints. This is, in our view, a game-changer, as it makes it possible to represent real-life systems and not just toy examples. The rest of this paper is structured as follows: In Sect. 2, we introduce the concepts useful for understanding the details of the rest of the paper; this may be skipped on a first reading by the reader not interested in the discussions about theoretical run-time. In Sect. 3, we present algorithms for translating Declare specifications to finite automata using LTL, and in Sect. 4, we provide experimental results from our implementation of the improved algorithms. Finally, in Sect. 5 we sum up our conclusions and provide directions for further work.
2 Background
In this section we briefly introduce linear temporal logic (LTL) interpreted over finite traces, the specific kind of finite automata we are dealing with in this paper, the classical algorithm for translating LTL to finite automata including the improvements from [14], and the complexity of common operations on automata. This section can be skipped by a reader not interested in the asymptotic analysis
86
M. Westergaard
provided. Most of the material presented in this section is known; the contribution here is the formalization of the non-standard kind of finite automaton we use (Def. 2) and of single-event automata (Def. 3), and the introduction of a notion of a deterministic automaton for our kind of finite automaton (Def. 4). In the following, we interpret formulae and automata over finite sequences σ = e0 e1 e2 . . . en−1 with ei ∈ E. We call the set of all such sequences E∗ and use the notation σ(i) = ei, σi = ei ei+1 . . . en−1, and |σ| = |e0 e1 . . . en−1| = n. The concatenation of two sequences σ1 = e1,0 e1,1 . . . e1,n−1 and σ2 = e2,0 e2,1 . . . e2,m−1 is σ1σ2 = e1,0 e1,1 . . . e1,n−1 e2,0 e2,1 . . . e2,m−1.
Definition 1 (LTL Syntax). Given a set AP of atomic propositions, a formula ϕ of linear temporal logic (LTL) over AP has the form ϕ ::= p | ¬ϕ | ϕ ∨ ψ | ϕ ∧ ψ | Xϕ | ϕ U ψ, where p ∈ AP and ϕ and ψ are LTL formulae. We allow the abbreviations ϕ =⇒ ψ = ¬ϕ ∨ ψ, Fϕ = true U ϕ, and Gϕ = ¬F(¬ϕ).
We interpret LTL over finite sequences by assuming a labeling function λ : E → 2^AP. The standard logical connectives behave as usual, i.e., σ |= p iff p ∈ λ(σ(0)), σ |= ¬ϕ iff σ ⊭ ϕ, σ |= ϕ ∨ ψ iff σ |= ϕ or σ |= ψ, and σ |= ϕ ∧ ψ iff σ |= ϕ and σ |= ψ. Xϕ means that ϕ must hold in the next state, i.e., σ |= Xϕ iff σ1 |= ϕ,¹ and ϕ U ψ means that ϕ holds until ψ holds and ψ holds eventually, i.e., σ |= ϕ U ψ iff there exists k with 0 ≤ k < |σ| such that σk |= ψ and for all i with 0 ≤ i < k it holds that σi |= ϕ. We call the set of all sequences satisfying a formula ϕ the language of ϕ, denoted L(ϕ) = {σ | σ |= ϕ}. Our definition of automata is very similar to the traditional one, except that we add a little more structure to the labels. Labels are a set of atomic propositions or negated atomic propositions. Intuitively, in order to transition from one state to another, one must satisfy all positive atomic propositions and none of the negative atomic propositions of a label. An empty set of labels indicates a transition that can be followed regardless of the set of atomic propositions satisfied.
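For illustration only (this is not the Declare tool's code), the finite-trace semantics just given can be turned into a small recursive evaluator; the sketch identifies each event with the single atomic proposition it satisfies, i.e., λ(e) = {e}, and uses the strong reading of X.

```java
import java.util.*;

// Sketch of the finite-trace LTL semantics above, with λ(e) = {e}.
public class FiniteLtl {
    interface Formula { boolean holds(List<String> trace, int i); }

    static Formula atom(String p)            { return (t, i) -> t.get(i).equals(p); }
    static Formula not(Formula f)            { return (t, i) -> !f.holds(t, i); }
    static Formula or(Formula f, Formula g)  { return (t, i) -> f.holds(t, i) || g.holds(t, i); }
    static Formula and(Formula f, Formula g) { return (t, i) -> f.holds(t, i) && g.holds(t, i); }
    // strong next: there is a next position and it satisfies f
    static Formula next(Formula f)           { return (t, i) -> i + 1 < t.size() && f.holds(t, i + 1); }
    // f U g: g holds at some k >= i and f holds at every position before k
    static Formula until(Formula f, Formula g) {
        return (t, i) -> {
            for (int k = i; k < t.size(); k++) {
                if (g.holds(t, k)) return true;
                if (!f.holds(t, k)) return false;
            }
            return false;
        };
    }
    static Formula eventually(Formula f) { return until((t, i) -> true, f); }
    static Formula always(Formula f)     { return not(eventually(not(f))); }

    public static void main(String[] args) {
        // "whenever Young occurs, some degree occurs afterwards"
        Formula response = always(or(not(atom("Young")),
                eventually(or(atom("MScBIS"), or(atom("MScES"), atom("BSc"))))));
        System.out.println(response.holds(Arrays.asList("Young", "BSc"), 0)); // true
        System.out.println(response.holds(Arrays.asList("Young"), 0));        // false
    }
}
```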
Definition 2 (Finite Automaton). A finite automaton FA is a tuple FA = (L, S, δ, sI, A), where L is a finite set of labels, S is a finite set of states, δ ⊆ S × 2^(L ∪ ¬L) × S is the transition relation, where ¬L = {¬p | p ∈ L}, sI ∈ S is the initial state, and A ⊆ S is the set of accepting states.
We say a sequence σ = e0 e1 . . . en−1 ∈ E∗ with a labeling function λ : E → 2^AP is accepted by a finite automaton FA = (L, S, δ, sI, A) iff there are states s0, s1, . . . , sn−1 with s0 = sI such that for all i with 0 < i < n there exists a transition t leading from si−1 to si, i.e., (si−1, t, si) ∈ δ, whose labels are consistent with ei−1, i.e., t ∩ L ⊆ λ(ei−1) and ¬λ(ei−1) ∩ t = ∅, and the trace ends up in an accepting state, i.e., s|σ|−1 ∈ A.
¹ There are some intricacies at the end of the trace leading to the introduction of two kinds of next operators [11]; they are not material in this paper and are discussed in a bit more detail when we talk about accept states for generated automata.
We call the set of all sequences accepted by a finite automaton the language of FA, denoted by L(FA) = {σ ∈ E∗ | σ is accepted by FA}. Note that we do not assume that L = AP. The standard algorithm for translating LTL to finite automata defines and builds the automaton inductively. Each state has a proof obligation, a set of formulae that must be satisfied in order for a trace starting in that state to be accepted. The algorithm starts by bringing the formula to negative normal form, i.e., a form where negations only appear directly in front of atomic propositions. This is done by using De Morgan's laws, i.e., ¬(A ∨ B) = ¬A ∧ ¬B and ¬(A ∧ B) = ¬A ∨ ¬B, and introducing the release operator ϕ V ψ, defined by ¬(ϕ U ψ) = ¬ϕ V ¬ψ. Given a formula ϕ in negative normal form, the algorithm starts with a single state with proof obligation ϕ. This state is furthermore marked as initial. Now, proof obligations are split up according to the structure of ϕ. If ϕ = ψ ∧ ρ, we add ψ and ρ as new proof obligations and mark ϕ as satisfied. If ϕ = ψ ∨ ρ, we duplicate the state (including any transitions), add ψ to one copy and ρ to the other, and mark ϕ as satisfied. For ϕ = Xψ, we add a new state with the proof obligation ψ and add a transition from the current state to the new state. The intuition is that we consume a single event and in the next state we must satisfy ψ. ϕ = ψ U ρ and ϕ = ψ V ρ are both handled using their fix-point characterisations, namely ψ U ρ = ρ ∨ (ψ ∧ X(ψ U ρ)) and ψ V ρ = ρ ∧ (ψ ∨ X(ψ V ρ)), and using the other rules. We make sure only to create one state for each set of proof obligations, reusing previously created states if possible. We do not need to handle the case ϕ = ¬ψ, as this only happens when ψ = p for some p ∈ AP. To define a finite automaton FAϕ = (L, S, δ, sI, A), the set of states S and the initial state sI are as defined by the inductive algorithm. L coincides with the atomic propositions AP of ϕ. A state can transition to another if we have created a transition as explained in the inductive algorithm and the labels of the transition are the atomic propositions and negated atomic propositions of the source, i.e., (s, t, s′) ∈ δ iff s and s′ are related as described above and t = s ∩ (AP ∪ ¬AP). The accepting states are all states with no outstanding temporal obligations. For V this does not have any impact, and for U it just means that a state cannot have any U proof obligation (an obligation ψ U ρ has not yet satisfied ρ, and hence cannot be accepted). For X the problem is a bit harder, and one has to split X up into a weak and a strong variant [9]. The weak variant basically means "if there is a next state, it must satisfy this formula" and the strong variant means "there is a next state and it satisfies this formula", and is hence an outstanding proof obligation. Except at the end of the trace, the two versions behave the same. We have that L(ϕ) = L(FAϕ). To improve the algorithm one typically checks each state for inconsistencies before expanding it. A state is inconsistent if it contains both a formula ϕ and its negation ¬ϕ. All inconsistent states are removed. In order to improve the algorithm for workflow specifications, [14] extends this requirement by the additional condition that a state is inconsistent if it contains two distinct atomic propositions p ≠ q (as a sequence from a workflow always produces exactly
one event so two atomic propositions can never be satisfied at the same time). Furthermore, whenever a state contains an un-negated atomic proposition, p, we can remove all negated atomic propositions (except ¬p). This may allow us to merge the state with others as it contains fewer proof obligations and thus may match another state. Thus, an automaton constructed using this approach will only have transitions that are either a single positive label or a (possibly empty) set of negative labels. We call such an automaton a single-event finite automaton as formalised in the following definition. Definition 3 (Single-event Finite Automaton). A finite automaton FA = (L, S, δ, sI , A) is said to be a single-event finite automaton iff for all t such that (si−1 , t, si ) ∈ δ it holds that either t = {p} for some p ∈ L or t ⊆ ¬L (note that this includes ∅).
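A single-event label as in Def. 3 can be represented with two fields, of which only one is used at a time; the sketch below is an illustrative encoding, not the Declare tool's internal data structure.

```java
import java.util.*;

// Sketch of a single-event transition label: either exactly one positive
// event, or a (possibly empty) set of forbidden events.
public class SingleEventLabel {
    final String positive;        // non-null: the transition fires only on this event
    final Set<String> forbidden;  // used when positive == null: fires on any other event

    SingleEventLabel(String positive, Set<String> forbidden) {
        this.positive = positive;
        this.forbidden = forbidden;
    }

    boolean matches(String event) {
        return positive != null ? positive.equals(event) : !forbidden.contains(event);
    }

    public static void main(String[] args) {
        SingleEventLabel onYoung = new SingleEventLabel("Young", null);
        SingleEventLabel notAnMsc = new SingleEventLabel(null,
                new HashSet<>(Arrays.asList("MScBIS", "MScES")));
        System.out.println(onYoung.matches("Young"));    // true
        System.out.println(notAnMsc.matches("BSc"));     // true
        System.out.println(notAnMsc.matches("MScBIS"));  // false
    }
}
```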
All automata we have used until now are non-deterministic, i.e., given a state s and a transition label t, there is not necessarily a unique successor state s′ such that (s, t, s′) ∈ δ and (s, t, s′′) ∈ δ =⇒ s′′ = s′. We define a deterministic single-event finite automaton as an automaton where this holds:
Given two finite automata sharing labels, FA1 and FA2 , we can construct an automaton FA1 × FA2 accepting the intersection of the languages accepted by the original automata, i.e., such that L(FA1 ) ∩ L(FA2 ) = L(FA1 × FA2 ). Due to the semantics of LTL, we have L(FAϕ∧ψ ) = L(ϕ ∧ ψ) = L(ϕ) ∩ L(ψ) = L(FAϕ ) ∩ L(FAψ ) = L(FAϕ × FAψ ). We can construct this product using the product construction, which carries deterministic automata over to deterministic automata and uses time and space O(|S1 ||S2 | · 22|L| ) if the automata are non-deterministic (a non-deterministic automaton can have up to 22|L| different transition labels) and O(|S1 ||S2 ||L|) for deterministic automata (they have a restricted set of labels). We can construct an automaton FA1 + FA2 accepting the union of the original languages, i.e., L(FA1 ) ∪ L(FA2 ) = L(FA1 + FA2 ), using the similar sum construction. If the resulting automaton does not have to be deterministic, we can construct an automaton accepting the same language by setting the states as the disjoint union of the original sets of states, make the transition relation respect the original transition relations, and make accepting states of the sum be the disjoint union of the original accepting states.
The initial state is a new state whose transitions are the union of the transitions from the original initial states. This construction only uses time and space O(|S1| + |S2| + 2^(2|L|)). It is possible to minimise a finite automaton, deterministic or not, i.e., given a finite automaton FA, to find another automaton FA′ such that L(FA) = L(FA′) but such that FA′ contains the minimum number of states possible to accept L(FA). Basically, one can try all automata smaller than a given automaton and pick one with the fewest states. For non-deterministic automata, this is the best possible algorithm in the worst case, but for deterministic automata it is possible to do this more efficiently and obtain a (unique up to renaming) minimal automaton in time O(|L| · |S| log |S|) [8]. For every finite automaton FA = (L, S, δ, sI, A), there exists a deterministic finite automaton DFA_FA = (L, S′, δ′, s′I, A′) such that they accept the same language, i.e., L(FA) = L(DFA_FA). The automaton can be constructed using the subset construction, which uses time and space O(2^|S| · 2^(2|L|)) (as we can have up to 4^(2|L|) different transitions). Given a deterministic automaton FA, we can construct an automaton FA^c accepting the complement, i.e., L(FA)^c = L(FA^c), interpreted over some domain. Due to the semantics of LTL, we have L(¬ϕ) = L(ϕ)^c = L(FAϕ)^c = L(FAϕ^c). Finding the complement of a non-deterministic automaton is as hard as constructing a corresponding deterministic automaton and negating that; it takes time and space O(|S| + |L|) for deterministic automata. For any finite automaton, we can construct an automaton accepting the set of all prefixes of L(FA), prefix(L(FA)) = {σ | στ ∈ L(FA) for some τ}, for non-deterministic automata in time O(|S| + 2^(2|L|)) and for deterministic automata in time O(|S| + |L|).
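The product construction is the workhorse of the algorithms in the next section. The sketch below builds it for deterministic automata over an explicit, shared event alphabet by exploring reachable state pairs; the "all events except ..." labels of single-event automata are deliberately left out, and all names are hypothetical.

```java
import java.util.*;

// Simplified product construction: the result accepts the intersection of the
// languages of two (partial) deterministic automata over the same alphabet.
public class Product {
    static class Dfa {
        int initial;
        Set<Integer> accepting = new HashSet<>();
        Map<Integer, Map<String, Integer>> delta = new HashMap<>();

        void add(int from, String event, int to) {
            delta.computeIfAbsent(from, k -> new HashMap<>()).put(event, to);
        }
        Integer step(int s, String e) {
            Map<String, Integer> out = delta.get(s);
            return out == null ? null : out.get(e);
        }
    }

    static Dfa product(Dfa a, Dfa b, Set<String> alphabet) {
        Dfa p = new Dfa();
        Map<List<Integer>, Integer> id = new HashMap<>();
        Deque<List<Integer>> work = new ArrayDeque<>();
        List<Integer> init = Arrays.asList(a.initial, b.initial);
        id.put(init, 0);
        p.initial = 0;
        work.add(init);
        while (!work.isEmpty()) {
            List<Integer> pair = work.poll();
            int s = id.get(pair);
            if (a.accepting.contains(pair.get(0)) && b.accepting.contains(pair.get(1)))
                p.accepting.add(s);
            for (String e : alphabet) {
                Integer ta = a.step(pair.get(0), e), tb = b.step(pair.get(1), e);
                if (ta == null || tb == null) continue;   // no joint transition on e
                List<Integer> succ = Arrays.asList(ta, tb);
                Integer t = id.get(succ);
                if (t == null) { t = id.size(); id.put(succ, t); work.add(succ); }
                p.add(s, e, t);
            }
        }
        return p;
    }

    public static void main(String[] args) {
        Set<String> alphabet = new HashSet<>(Arrays.asList("Young", "BSc"));
        Dfa atMostOneYoung = new Dfa();                 // "0..1 on Young"
        atMostOneYoung.accepting.addAll(Arrays.asList(0, 1));
        atMostOneYoung.add(0, "Young", 1); atMostOneYoung.add(0, "BSc", 0); atMostOneYoung.add(1, "BSc", 1);
        Dfa eventuallyBsc = new Dfa();                  // "BSc must occur"
        eventuallyBsc.accepting.add(1);
        eventuallyBsc.add(0, "Young", 0); eventuallyBsc.add(0, "BSc", 1);
        eventuallyBsc.add(1, "Young", 1); eventuallyBsc.add(1, "BSc", 1);
        Dfa both = product(atMostOneYoung, eventuallyBsc, alphabet);
        System.out.println(both.accepting.size() + " accepting state(s)"); // 2
    }
}
```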
3 Algorithm
Here we describe various ways to construct an automaton accepting the language of a Declare workflow specification or its complement. Our implementation assumes that single-event automata are used, but the algorithms can be generalized to handle arbitrary automata. Declare allows users to specify workflows by describing a set of tasks, T = {T_i}_{i=1..m}, and a set of constraints, C = {C_i}_{i=1..n}, which are LTL formulae with atomic propositions AP = T. For the example in Fig. 1, we have T = {Young, MScBIS, MScES, BSc, MasterOfBPM} with m = 5 and C = {¬(F(Young ∧ X(F(Young)))), G(Young =⇒ F(MScBIS ∨ MScES ∨ BSc)), ¬(F MScBIS ∧ F MScES), ((¬MasterOfBPM) U MScBIS) ∨ G(¬MasterOfBPM)} with n = 4 (the constraints are, in order: "you are only young once", "when you are young you should get an education", "you can only get one master's degree", and "only with a master's degree in business information systems will you truly be a master of business process modeling"). The allowed behaviour of the specification is the language accepted by the conjunction ϕ = C_1 ∧ · · · ∧ C_n. In Declare, constraints are picked from a set of templates, which are instantiated to concrete formulae. Thus, the set of possible formulae for constraints is fixed and finite. It is easy to see that, given a formula, ϕ, if r is a bijective renaming of atomic propositions, the language of the formula obtained by
renaming the atomic propositions of the original formula, ϕ/r, is the same as the one obtained by taking an automaton accepting the language of the original formula, FAϕ, and renaming all labels of the automaton according to the renaming function, FAϕ/r, i.e., L(ϕ/r) = L(FA_{ϕ/r}) = L(FAϕ/r). Thus, we can pre-compute the automata for all constraint templates and construct an automaton accepting L(C_i), FA_{C_i}, in constant time. Furthermore, as the templates are known beforehand, the size of the entire specification |ϕ| = |C_1 ∧ · · · ∧ C_n| ≤ n · max_i |C_i| ∈ O(n), as max_i |C_i| ∈ O(1). We only use this for arguing about the speed; our implementation computes an automaton for each constraint on-the-fly. This is linear in the size of the model as long as we stick to a pre-defined set of constraints. A summary of the different algorithms developed in this paper can be seen in Table 1. For each algorithm, we give the worst-case execution time; here |ϕ| is the length of the formula ϕ = C_1 ∧ · · · ∧ C_n, n ∈ Θ(|ϕ|) is the number of constraints in the input, m is the number of tasks, |S| is the maximum number of states of an automaton for any of the possible constraints, and |S′| is the maximum number of states in a deterministic such automaton. After renaming labels in the template automata, we can consider them to range over the same set of labels, which corresponds to the tasks, so |L| = m. The memory requirement is the same as the time requirement in all cases. We see that the improved algorithms rarely provide better bounds than the base algorithm, except for algorithm 4, which provides a less valuable result. In the next section we turn to an experimental evaluation of the algorithms showing significant improvement in practice.
Base Algorithm. The base algorithm uses the standard translation with the single-event optimisation on ϕ. The time for this algorithm is O(2^|ϕ|).
Algorithm 1: Construct an automaton for the negation. The first idea is, instead of constructing an automaton for the specification ϕ = C_1 ∧ · · · ∧ C_n, to construct one for the negation ¬ϕ. The idea is that when translating ¬ϕ to negative normal form, we get ¬ϕ = ¬(C_1 ∧ · · · ∧ C_n) = (¬C_1) ∨ · · · ∨ (¬C_n) using De Morgan's laws. While LTL translators are bad at handling conjunctions, as they just add proof obligations, they are good at handling disjunctions, as these sub-divide the problem. This algorithm also runs in time O(2^|ϕ|).
Table 1. Comparison of the algorithms

Algorithm   Execution time (non-deterministic)            Execution time (deterministic)
Base        O(2^|ϕ|)                                       –
1           O(2^|ϕ|)                                       –
2           O(|S|^n · 2^(2m))                              O(|S′|^n · m)
3           O(2^|S| · 2^(2m) · n)                          O(|S′|^n · m · n^2 · log |S′|)
4           O(n)                                           same as algorithm 2 or 3
5, 6, 8     same as algorithm 2 or 3                       same as algorithm 2 or 3
7           same as algorithm 2 or 3, for each partition   same as algorithm 2 or 3, for each partition
Algorithm 2: Use automaton operations to handle conjunctions. The results of using algorithm 1 show a significant speed improvement compared to the base algorithm, and suggest that the runtime is dominated by the difficulty of efficiently coping with conjunctions. Unfortunately, we often want to construct the prefix automaton for a Declare specification, and this cannot be done efficiently from a non-deterministic automaton for the negated model. We thus seek an automaton accepting L(ϕ) with as many of the nice traits of algorithm 1 as possible. One way to achieve our goal is to instead construct the automaton using L(ϕ) = L(C_1 ∧ · · · ∧ C_n) = L(FA_{C_1} × FA_{C_2} × · · · × FA_{C_n}). As we have pre-computed the FA_{C_i}, the time spent in this algorithm is just the time for constructing the product, which is O(|S|^n · 2^(2m)) for non-deterministic automata and O(|S′|^n · m) for deterministic automata. While |S′| ∈ O(2^|S|), the number of potential labels is smaller (m vs. 2^(2m)), and in practice they are rarely larger. Though n ∈ Θ(|ϕ|), it is numerically smaller.
Algorithm 3: Use minimisation. One way to make the base of the runtime of algorithm 2 smaller in practice is to minimise the sub-automata used in the computation of FA_{C_1} × · · · × FA_{C_n}. For non-deterministic automata this takes O(2^|S| · 2^(2m) · n) (n minimisations of automata with at most |S|^n states, each dominating the multiplication producing the automaton). For deterministic automata, this takes O(|S′|^n · m · n^2 · log |S′|) (for the same reason as in the non-deterministic case, but with a different complexity for minimisation). While this is theoretically worse, the hope is that using smaller automata improves performance in practice.
Algorithm 4: Using automata to compute the negation. Even though the negation cannot be used to compute the prefix automaton directly, we can still use it to find definite violations, and it is thus interesting to find efficient algorithms for it. Using the results from algorithm 2 combined with algorithm 1, we can obtain a non-deterministic automaton in time O(n · |S| · 2^(2m)): we just have to modify initial states n times, each time taking time proportional to the number of outgoing transitions. We note that |S| is a small constant only depending on the constraints, and 2^(2m) is an over-approximation, as the total number of edges is also independent of the model and known from the constraint automata, in effect yielding a running time of O(n). For deterministic automata, we have to use a product construction, leading to the same time as for algorithm 2 or 3.
Algorithm 5: Using balanced trees. The commutative law applies to conjunctions of LTL formulae, so we may re-arrange the conjunction prior to translation. This allows us to arrange the conjunctions in a balanced tree as shown in Fig. 2 (bottom) instead of sequentially as shown in Fig. 2 (top). Arranging the automata in a tree can be done in linear time, and the time for computing the product is theoretically the same as for algorithm 3 (as this is just a special case). We do not expect it to perform significantly better or worse than the sequential implementation used for algorithm 3.
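The difference between the sequential product of algorithms 2 and 3 and the balanced tree of algorithm 5 lies only in how the pairwise combinations are arranged. The generic sketch below makes both arrangements explicit; the combine operation stands in for the (minimised) automaton product, and the element type is left abstract.

```java
import java.util.*;
import java.util.function.BinaryOperator;

// Sketch of the two conjunction arrangements of Fig. 2: a sequential fold
// versus a balanced pairwise reduction.
public class ConjunctionTree {
    static <A> A sequential(List<A> parts, BinaryOperator<A> combine) {
        A acc = parts.get(0);
        for (int i = 1; i < parts.size(); i++) acc = combine.apply(acc, parts.get(i));
        return acc;
    }

    static <A> A balanced(List<A> parts, BinaryOperator<A> combine) {
        List<A> level = new ArrayList<>(parts);
        while (level.size() > 1) {
            List<A> next = new ArrayList<>();
            for (int i = 0; i + 1 < level.size(); i += 2)
                next.add(combine.apply(level.get(i), level.get(i + 1)));
            if (level.size() % 2 == 1) next.add(level.get(level.size() - 1));
            level = next;
        }
        return level.get(0);
    }

    public static void main(String[] args) {
        List<String> cs = Arrays.asList("C1", "C2", "C3", "C4", "C5");
        BinaryOperator<String> and = (a, b) -> "(" + a + " & " + b + ")";
        System.out.println(sequential(cs, and)); // ((((C1 & C2) & C3) & C4) & C5)
        System.out.println(balanced(cs, and));   // (((C1 & C2) & (C3 & C4)) & C5)
    }
}
```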
Algorithm 6: Using automaton size. Looking at the tree arrangement in Fig. 2 (bottom), we have more freedom when it comes to the size of the intermediate products (the diamonds). The reason is that we can compute half the intermediate products independently of other computations. We want to keep the size of intermediate products as small as possible to reduce the time spent computing them (the time spent is proportional to the size of the result). One heuristic to do that is to ensure we never compute the product of two large automata. The idea of algorithm 6 is to construct a tree like in algorithm 5 and order the leaves so that large automata are paired with small ones. On the next levels we do the same.
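One way to realize this heuristic (an illustrative choice, not necessarily the exact rule used in the implementation) is to sort the constraint automata by size and pair the smallest with the largest, as sketched below.

```java
import java.util.*;

// Illustrative leaf ordering for algorithm 6: adjacent entries of the returned
// array form the leaf pairs of the tree, pairing small automata with large ones.
public class SizeOrdering {
    static int[] pairSmallWithLarge(int[] sizes) {
        int[] sorted = sizes.clone();
        Arrays.sort(sorted);
        int[] order = new int[sorted.length];
        int lo = 0, hi = sorted.length - 1, k = 0;
        while (lo <= hi) {
            order[k++] = sorted[lo++];
            if (lo <= hi) order[k++] = sorted[hi--];
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(pairSmallWithLarge(new int[]{8, 3, 12, 5, 2})));
        // [2, 12, 3, 8, 5] -> pairs (2,12), (3,8), and 5 promoted to the next level
    }
}
```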
Fig. 2. Conjunction arrangements (top: sequential nesting of the conjunctions C1 ∧ C2 ∧ · · · ∧ Cn; bottom: balanced tree of pairwise conjunctions)
Algorithm 7: Computing partial products only. In order to detect inconsistencies in a Declare model, we check whether the language of the derived LTL formula is empty, i.e., whether L(ϕ) = ∅. If we can find CA and CB with ϕ = CA ∧ CB such that L(ϕ) = ∅ ⇐⇒ L(CA) = ∅ ∨ L(CB) = ∅, we can just check the (possibly) smaller automata for CA and CB for emptiness. As we have ϕ = C_1 ∧ · · · ∧ C_n, we seek P ⊆ {1, 2, . . . , n} so that CA = ∧_{i∈P} C_i and CB = ∧_{j∈{1,2,...,n}\P} C_j. Even though automata for CA and CB may not be smaller than the automaton for ϕ, at least we can construct them more easily (as they consist of fewer conjunctions). In terms of languages we seek CA and CB such that L(CA) ∩ L(CB) = L(CA ∧ CB) = ∅ ⇐⇒ L(CA) = ∅ ∨ L(CB) = ∅. Let us consider the contraposition of the direction we are interested in, left to right, namely ¬(L(CA) = ∅ ∨ L(CB) = ∅) ⇐⇒ L(CA) ≠ ∅ ∧ L(CB) ≠ ∅ =⇒ L(CA) ∩ L(CB) ≠ ∅. We must construct CA and CB such that if we have an a ∈ L(CA) and a b ∈ L(CB), then we have a c ∈ L(CA) ∧ c ∈ L(CB). We construct CA and CB such that c = ab can be used. If CA is closed under adding suffixes, i.e., if a ∈ L(CA) then aw ∈ L(CA) for any word w ∈ Σ∗, we clearly have that c = ab ∈ L(CA). If furthermore CB is closed under adding prefixes, i.e., if b ∈ L(CB) then wb ∈ L(CB) for all words w ∈ Σ∗, we also clearly have that c = ab ∈ L(CB). Imposing this notion of suffix/prefix closedness is too strong a condition, however, as we would be able to do anything in the suffix/prefix. Instead, we seek a weaker notion that preserves the property. The idea is that constraints from Declare typically have some part that activates them and some part that subsequently satisfies them. The constraints do not care what happens when they are not activated. In the example in Fig. 1, the constraint response is activated by the execution of Young and satisfied by the execution of one of the MScs or BSc. Whether the constraint is satisfied is not affected by any other tasks. Some constraints are activated in the initial state, e.g., the 0..1 constraint on Young.
For a partitioning into CA and CB we are satisfied if we can pick a suffix/prefix to add to a string of the language accepted by one so that it lies in the language of the other. Assume that CA is an automaton derived from a constraint with formula ϕ. We then denote by AP(CA) the set of all atomic propositions occurring in ϕ. By construction, this coincides with the labels occurring in CA. If a ∈ L(CA) =⇒ av ∈ L(CA) for any v ∉ AP(CA), we conclude that for any string w ∈ (AP(CA)^c)∗, a ∈ L(CA) =⇒ aw ∈ L(CA). For the same reason, if b ∈ L(CB) =⇒ vb ∈ L(CB) for any v ∉ AP(CB), then b ∈ L(CB) =⇒ wb ∈ L(CB) for any string w ∈ (AP(CB)^c)∗. If we have two automata, CA and CB, with AP(CA) ∩ AP(CB) = ∅ and an a ∈ L(CA), we can find another a′ ∈ L(CA) ∩ (AP(CB)^c)∗ which does not use any labels from AP(CB), as labels of AP(CB) do not occur in CA. This is possible as CA cannot distinguish characters not in AP(CA). Similarly, for a b ∈ L(CB) we can find a b′ ∈ L(CB) ∩ (AP(CA)^c)∗. Thus, a′b′ ∈ L(CA) and a′b′ ∈ L(CB), i.e., a′b′ ∈ L(CA) ∩ L(CB) = L(CA × CB), as wanted. We thus need two automata CA and CB not sharing labels; one must satisfy a ∈ L(CA) =⇒ av ∈ L(CA) for any v ∉ AP(CA), and the other must satisfy b ∈ L(CB) =⇒ vb ∈ L(CB) for any v ∉ AP(CB). The first is easily checked by inspecting the labels of the two automata, and the latter can be identified by looking for self-loops on accepting and initial states, respectively. Both of these properties are preserved by the automaton product, and any automaton derived from a constraint in Declare satisfies the requirement of accepting any suffix not in its own labels (when not enabled, constraints do not care what happens to unconnected tasks). Most additionally satisfy the requirement of accepting any prefix not in their alphabet (all constraints not initially enabled). This gives us the following partitioning algorithm: partition the graph with constraints as nodes and tasks as edges into connected components, treating "does not accept arbitrary prefixes" as an additional constraint. This makes sure that no two automata from different partitions share labels and hence that the product of all automata from one partition does not share labels with the product of all automata from another. Furthermore, the automata derived from all partitions accept any suffix using different labels, and, except for at most one, all accept any prefix. Thus, the language of the product of any two products of all automata from two partitions is empty if and only if one of the factors is, and hence the product of all automata in all partitions is empty if and only if the product of all automata of at least one of the partitions is. We can construct the partitioning in linear time in the size of the original model and compute the product for each partition in the same time as for the other tree-based algorithms. This is an improvement as, if we have more than one partition, the partitions are smaller than the original system, thus reducing the exponent n in the running time. Automata for partitions can be calculated using algorithms 2, 3, 5, 6, and 8.
Algorithm 8: Taking atomic propositions into account. The idea of this algorithm is that synchronized products become smaller the more synchronisation takes place. In the previous section we have argued that partitioning according to connected components of the graph induced by constraints as nodes
and tasks as edges is a good idea. This graph typically contains few connected components (few tasks of the same specification are not related by a chain of constraints), but it often contains islands that are only connected by a few tasks. The idea of algorithm 8 is to use this graph for constructing the conjunction tree. We want to partition the graph into two parts, representing the left and right subtree, so that as few atomic propositions as possible are shared between the two. This is equivalent to performing a minimum bisection or sparsest cut [5] and is thus NP-hard. As we just use this as a heuristic for partitioning the constraints to construct a conjunction tree, we do not need a guaranteed best solution, so we use a simple hill-climbing implementation working from random partitions to approximate the problem. Aside from the improvement arising from constructing smaller intermediate products, we also get another benefit, as inconsistencies manifest themselves as incompatible requirements. Thus, by grouping constraints that are more likely to violate each other, we expect to find inconsistencies faster.
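The grouping that algorithms 7 and 8 rely on is a connected-components computation on the constraint/task graph. The following union-find sketch uses the constraints of Fig. 1 plus one invented, unrelated constraint ("absence2(Audit)") purely to show a second component; names and string encodings are hypothetical.

```java
import java.util.*;

// Sketch of the partitioning step: constraints that (transitively) share a
// task end up in the same group, computed with union-find over task names.
public class Partitioning {
    static Map<String, String> parent = new HashMap<>();

    static String find(String x) {
        parent.putIfAbsent(x, x);
        String p = parent.get(x);
        if (!p.equals(x)) { p = find(p); parent.put(x, p); }  // path compression
        return p;
    }

    static void union(String a, String b) { parent.put(find(a), find(b)); }

    public static void main(String[] args) {
        // each constraint lists the tasks it mentions
        Map<String, List<String>> constraints = new LinkedHashMap<>();
        constraints.put("response(Young, degrees)", Arrays.asList("Young", "MScBIS", "MScES", "BSc"));
        constraints.put("notCoexistence(MScBIS, MScES)", Arrays.asList("MScBIS", "MScES"));
        constraints.put("precedence(MScBIS, MasterOfBPM)", Arrays.asList("MScBIS", "MasterOfBPM"));
        constraints.put("absence2(Audit)", Arrays.asList("Audit")); // unrelated island

        for (List<String> tasks : constraints.values())
            for (String t : tasks) union(tasks.get(0), t);

        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> c : constraints.entrySet())
            groups.computeIfAbsent(find(c.getValue().get(0)), k -> new ArrayList<>()).add(c.getKey());

        groups.values().forEach(System.out::println);
        // -> one group with the three related constraints, one with absence2(Audit)
    }
}
```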
4 Experimental Validation
In this section we present results from experiments with the algorithms from the previous section. Unfortunately, we do not have real-life models large enough for our optimizations to really matter – for the most part because the base algorithm was unable to handle such models – making it impossible to conduct interesting case studies beyond toy examples. For this reason, we have experimented with randomly generated models, generated by adding constraints with equal probability and assigning tasks randomly. We make sure not to add constraints that are obviously in conflict (like forcing two tasks to be initial). Completely randomly generated models are not interesting when they grow large. This is mostly because a large model is difficult to make satisfiable. We expect real-life models to be satisfiable or nearly satisfiable (i.e., by removing a few constraints, they become satisfiable). For this reason, we have decided to focus our evaluation on satisfiable models. We have obtained such models by randomly generating models and testing them using all our algorithms, marking a model as satisfiable if any of the algorithms showed it to be. Naturally, this way of generating models is not optimal, as it ensures that we only test our algorithms on models that can be analyzed, but we still find this better than using random models. We also check completely random models to ensure that this method of testing does not impose too much bias towards our algorithms. We note that while the number of tasks, m, appears in the complexity of most of the algorithms in the previous section, it does not manifest itself in the same way in reality. The reason is that what is really interesting is the number of transitions in the resulting automaton, and adding more tasks that are not connected to any constraint does not add any more transitions. Experiments show that the complexity peaks when the number of tasks is the same as the number of constraints or slightly larger. As most of our toy examples have slightly more tasks than constraints (case in point, Fig. 1), we have chosen to only show such configurations.
In Table 2, we see the performance of each algorithm on a set of randomly generated models. All experiments are run with known satisfiable models, known unsatisfiable models, and random models. The first column indicates the number of tasks (T) and constraints (C) for each model. For each algorithm, we show the time spent for each model in milliseconds and the percentage of test cases that could complete successfully (in parentheses; omitted if 100). We have indicated whether numbers result from running the non-deterministic or deterministic version of each algorithm by adding an n or a d after the algorithm number. We have run the analysis for 120 random models in each case, allowing the algorithms to use up to 256 MiB of memory. To see the performance when granting algorithms more memory, we have also included numbers for the base algorithm and algorithms 7 and 8, allowing them to use 4 GiB of memory on a slightly slower computer. These are indicated by adding a prime after the name (e.g., Base′). For some executions we have indicated that the average execution time is a minimum. This is because we have chosen to limit the execution for each model to 5 hours; when a series of experiments has one or more instances terminated due to running out of time instead of running out of memory, we have included it in the time average but not in the completion percentage, and we have indicated that the time is a lower bound. For satisfiable models, the base algorithm only copes with the smallest examples. Negating the property (algorithm 1) makes it handle all cases. We see the same when computing the non-deterministic automaton for the negation (algorithm 4n). Using automaton operations (algorithm 2) does not improve on the base algorithm in the non-deterministic case, as we end up computing products with a large number of transitions. For deterministic automata, this approach fares much better, however, especially when we combine it with minimisation of intermediate automata (algorithm 3). Organizing the products in a tree (algorithm 5) yields faster results, as the intermediate products are smaller, but handles slightly fewer cases than just computing the product (algorithm 3), as it requires slightly more memory because we have to store more of them. Being intelligent about the organization of the tree yields an improvement, as shown by ordering by size (algorithm 6) and especially when ordering by shared atomic propositions (algorithm 8). The largest contribution comes from computing partial products only (algorithm 7). We have used algorithm 7 with the trees in each factor grouped using algorithm 8. While most algorithms impose a small penalty for adding more tasks when keeping the number of constraints constant, algorithm 7 actually does the opposite. The reason is that more tasks make it possible to have more factors in randomly generated models. We note that most algorithms fare better for unsatisfiable models than for satisfiable models, as they all have a notion of early termination as soon as a formula is recognized as unsatisfiable. This case is not expected to occur commonly in practice, as human-constructed models are constructed to be consistent. The algorithms grouping related constraints (7 and 8) fare particularly well, supporting our expectation that they find contradictions earlier.
Table 2. Experimental results. For each configuration of tasks and constraints (T,C, ranging from 10,5 to 70,50), the table lists, for the base algorithm and algorithms 1–8 in their non-deterministic (n) and deterministic (d) variants – as well as the primed 4 GiB runs Base’, 7d’, and 8d’ – the average time in milliseconds and, in parentheses, the completion percentage on satisfiable, unsatisfiable, and random models.
Random models are harder to satisfy as the number of tasks and constraints increases, so running the algorithms on completely random models, without checking a priori whether they are satisfiable or not, shares characteristics with the satisfiable models for small numbers of tasks and constraints and with the unsatisfiable models for large numbers. We notice that we only see a small bias towards our algorithms, as the times are not significantly larger and the percentages not significantly lower than for the satisfiable and unsatisfiable models. Unfortunately, the best algorithm (7) seems to be the one preferred the most by our testing method, reinforcing that we should do experiments with real-life models.
5 Conclusion and Future Work
We have presented algorithms for translating declarative workflow models specified using Declare to finite automata using LTL for analysis and enactment of
the model. The best of our algorithms are several orders of magnitude faster than the previous best algorithm. We have obtained the speed-up by exploiting characteristics of LTL formulae originating from a Declare specification, especially that they are conjunctions of simpler formulae defined by the individual constraints of the models. We use automaton operations (automaton product) instead of computing the automaton directly for the conjunction. By structuring the product in balanced trees, grouping automata sharing atomic propositions together, and partitioning the formulae according to the tasks constrained by the originating constraints, we obtain speed-ups of a factor of 1,000–10,000 and more, allowing us to handle models with 50 constraints in seconds, improving on the previous state of the art of handling models with 10–15 constraints in minutes or hours. This shows that simple generic algorithms are beaten by domain-specific algorithm engineering and heuristics exploiting known structure. We believe our improvement is significant, as it allows us to analyse and enact models of a realistic size, rather than just toy examples. Our next step is to try just that. Our experiments show that we have a slight bias towards the best of our algorithms, and it is interesting to see the improvement obtained by the algorithm on a real-life model.

Only little work has been done on improving the very specific kind of automata needed for Declare. Aside from the generic LTL translation algorithm of [6,7] and the improvement using single-event automata described in [14], work has been done on parallel algorithms for computing binary decision diagrams (which are a specific kind of finite automaton, see [1]) using automaton operations. [10] deals with arbitrary propositional formulae and uses the structure of the formula to build a tree of automaton operations (and/or/not) and compute the automaton for the final product using a strategy similar to ours. As they focus on general formulae, they are not free to restructure the computation tree, which yields the greatest gain after taking the step from the standard LTL translation algorithm to automaton operations.

We have already made experiments with further improvements of our algorithms. While the ideas have not paid off yet, we believe that improvements can be obtained. We have made some experiments with parallelized computation of the products organized in trees. Unfortunately, little is gained, as the last computation (computing the root of the tree) dominates the computation. Using a parallel algorithm for product or automaton minimization [16] may improve on this. We would like to experiment with improved algorithms for computing products and representing states, such as using binary decision diagrams [1] or a sharing tree [10]. We have investigated dynamic update of the automaton, so a user can add or remove constraints from the system and interactively obtain an updated automaton for interactive model construction and enactment/analysis or for dynamically reconfigurable models. The idea is that when we add an automaton to or remove an automaton from a tree, we only need to update the intermediate products on the path from the new/removed factor to the root. We can improve on the time spent for adding a constraint by adding it next to the old root, introducing a new root. Unfortunately, this cannot compete with the algorithms grouping
constraints according to tasks (7/8) as we either void the grouping by multiple additions/removals or make the tree unbalanced by inserting according to the grouping. It would be interesting to investigate a way to rebalance the tree respecting a dynamically computed grouping without having to rebuild it from scratch. It would also be interesting to investigate uses of the non-deterministic automaton for a negated model, as we can compute it very quickly (in milliseconds for models with even hundreds of constraints). One idea is to make an approximation of the automaton for the model for quick analysis and guidance, using the negated automaton to weed out spurious errors.
References
1. Bryant, R.E.: Graph Based Algorithms for Boolean Function Manipulation. IEEE Transactions on Computers C-35(8), 677–691 (1986)
2. Chesani, F., Mello, P., Montali, M., Torroni, P.: Verification of Choreographies During Execution Using the Reactive Event Calculus. In: Bruni, R., Wolf, K. (eds.) WS-FM 2008. LNCS, vol. 5387, pp. 55–72. Springer, Heidelberg (2009)
3. Declare webpage, http://declare.sf.net
4. Etessami, K., Holzmann, G.J.: Optimizing Büchi Automata. In: Palamidessi, C. (ed.) CONCUR 2000. LNCS, vol. 1877, pp. 153–168. Springer, Heidelberg (2000)
5. Garey, M.R., Johnson, D.S., Stockmeyer, L.: Some simplified NP-complete graph problems. Theoretical Computer Science 1, 237–267 (1976)
6. Gerth, R., Peled, D., Vardi, M.Y., Wolper, P.: Simple On-the-fly Automatic Verification of Linear Temporal Logic. In: Proc. of Protocol Specification, Testing and Verification, pp. 3–18 (1995)
7. Giannakopoulou, D., Havelund, K.: Automata-Based Verification of Temporal Properties on Running Programs. In: Proc. of ASE 2001, pp. 412–416. IEEE Computer Society, Los Alamitos (2001)
8. Hopcroft, J.E.: An n log n algorithm for minimizing states in a finite automaton. Technical report, Stanford University (1971)
9. Kamp, H.W.: Tense Logic and the Theory of Linear Order. PhD thesis, University of California (1968)
10. Kimura, S., Clarke, E.M.: A parallel algorithm for constructing binary decision diagrams. In: Proc. of ICCD 1990, pp. 220–223 (1990)
11. Manna, Z., Pnueli, A.: Temporal Verification of Reactive Systems: Safety. Springer, Heidelberg (1995)
12. Object Management Group (OMG): Business Process Modeling Notation (BPMN), Version 2.0. OMG Available Specification
13. Paige, R., Tarjan, R.E.: Three Partition Refinement Algorithms. SIAM Journal on Computing 16(6), 973–989 (1987)
14. Pešić, M., Bošnački, D., van der Aalst, W.M.P.: Enacting Declarative Languages Using LTL: Avoiding Errors and Improving Performance. In: van de Pol, J., Weber, M. (eds.) Model Checking Software. LNCS, vol. 6349, pp. 146–161. Springer, Heidelberg (2010)
15. Pešić, M., Schonenberg, M.H., Sidorova, N., van der Aalst, W.M.P.: Constraint-Based Workflow Models: Change Made Easy. In: Chung, S. (ed.) OTM 2007, Part I. LNCS, vol. 4803, pp. 77–94. Springer, Heidelberg (2007)
16. Ravikumar, B., Xiong, X.: A Parallel Algorithm for Minimization of Finite Automata. In: Proc. of IPPS 1996, pp. 187–191. IEEE Computer Society, Los Alamitos (1996)
Compliance by Design for Artifact-Centric Business Processes

Niels Lohmann

Universität Rostock, Institut für Informatik, 18051 Rostock, Germany
[email protected]

Abstract. Compliance with legal regulations, internal policies, or best practices is becoming a more and more important aspect of business process management. Compliance requirements are usually formulated in a set of rules that can be checked during or after the execution of the business process, called compliance by detection. If noncompliant behavior is detected, the business process needs to be redesigned. Alternatively, the rules can already be taken into account while modeling the business process, resulting in a business process that is compliant by design. This technique has the advantage that a subsequent verification of compliance is not required. This paper focuses on compliance by design and employs an artifact-centric approach. In this school of thought, business processes are not described as a sequence of tasks to be performed (i. e., imperatively), but from the point of view of the artifacts that are manipulated during the process (i. e., declaratively). We extend the artifact-centric approach to model compliance rules and show how compliant business processes can be synthesized automatically.
1 Introduction
Business processes are the main asset of companies as they describe their value chain and fundamentally define the “way business is done”. Besides fundamental correctness criteria such as soundness (i. e., every started case is eventually finished successfully), nonfunctional requirements also have to be met. Such requirements are often collected under the umbrella term compliance. They include legal regulations such as the often cited Sarbanes-Oxley Act to fight accounting fraud, internal policies to streamline the in-house processes, or industrial best practices to reduce complexity and costs as well as to facilitate collaborations. Finally, compliance can be seen as a means to validate business processes [5]. Compliance requirements are usually defined without a concrete business process in mind, for instance in legal texts such as “The Commission shall [...] certify [...] that the signing officers have designed such internal controls to ensure that material information [...] is made known [...] during the period in which the periodic reports are being prepared”.1 Such informal descriptions then must
Excerpt of Title 15 of the United States Code, § 7241(a)(4)(B).
be translated by domain experts into precise rules that unambiguously capture the essence of the requirement in a shape in which it can be checked against a concrete business process. The example above could yield a rule such as “Information on financial reports must be sent to the press team by the signing officers at most two weeks after signing”. The rules clarify who has to take which actions and when. That is, they specifically deal with the execution order of actions of the business process or the reachability of data values. This formalization of domain-specific knowledge is far from trivial and out of the scope of this paper. In the remainder, we assume – similar to other approaches [18] – that such rules are already present.

Compliance rules are often declarative and describe what should be achieved rather than how to achieve it. Temporal logics such as CTL [4], LTL [21] or PLTL [20] are common ways to formalize such declarative rules. To make these logics approachable for nonexperts, graphical notations have also been proposed [1,2]. Given such rules, compliance of a business process model can be verified using model checking techniques [6]. These checks can be classified as compliance by detection, also called after the fact or retrospective checking [25]. Their main goal is to provide a rigorous proof of compliance. In case of noncompliance, diagnosis information may help to fix the business process toward compliance. This step can be very complicated, because the rules may affect various parts and agents of the business process (e. g., financial staff and the press team). Furthermore, the declarative nature of the rules does not provide recipes on how to fix the business process. To meet the previous example rule, an action “send information to press team” needs to be added to the process and must be executed at most two weeks after the execution of an action “sign financial report”. Compliance can eventually be reached after iteratively adjusting the business process model, cf. Fig. 1(a).

An alternative approach takes a business process model and the compliance rules as input and automatically generates a business process model that is compliant by design [25], cf. Fig. 1(b). This has several advantages: First, a subsequent proof and potential corrections are not required. This may speed up the modeling process. Second, the approach is flexible as the generation can be repeated when rules are added, removed, or changed. Third, the approach is complete in the sense that an unsuccessful model generation can be interpreted
Fig. 1. Approaches to achieve compliance: (a) compliance by detection; (b) compliance by design
as “the business process cannot be made compliant” rather than “the current model is not compliant”. Fourth, compliance is not only detected, but actually enforced. That is, noncompliant behavior becomes technically impossible.

This paper investigates the latter compliance-by-design approach. We employ a recent framework for artifact-centric business processes [17]. In this framework, a business process is specified by a description of the life cycles of its data objects (artifacts). From this declarative specification, which also specifies agents and locations of artifacts, a sound, operational, and interorganizational business process can be automatically generated.

Contribution. This paper makes two contributions: First, we extend the artifact-centric framework [17] to model a large family of compliance rules. Second, we use existing tools and techniques to achieve not only soundness, but also compliance by design. We also sketch the diagnosis of noncompliant models.

Organization. The next section introduces a small example we use throughout the paper to exemplify our approach and later extensions. Section 3 sets the stage for our later contributions. There, we introduce artifact-centric business processes and correctness by design. In Sect. 4, we demonstrate how a large family of compliance rules can be expressed within our approach. Section 5 presents how compliance by design can be achieved. We also discuss the diagnosis of unrealizable compliance rules. Section 6 places our approach in the context of related work, before Sect. 7 concludes the paper.
2 Running Example: Insurance Claim Handling
We use a simple insurance claim handling process (based on [24]) as the running example for this paper. In this process, a customer submits a claim to an insurer who then prepares a fraud detection check offered by an external service. Based on the result of this check, the claim is either (1) assessed and the settlement estimated, (2) detected as fraudulent and reported, or (3) deemed incomplete. In the last case, further information is requested from the customer before the claim is resubmitted to the fraud detection service. In this situation, the customer can alternatively decide to withdraw the claim. On successful assessment, a settlement case is processed by a financial clerk. The settlement is paid in several installments or all at once. A single complete payment further requires an authorization by the controlling officer. When the settlement is finally paid, the claim is archived.

One way to model this process is to explicitly order the actions to be taken and to give an operational business process model. An alternative to this verb-centric approach is offered by an artifact-centric framework, which starts by identifying what is acted on (noun-centric) and derives a business process from the life cycles of the involved artifacts. We shall discuss artifact-centric business processes in the next section.
3 Artifact-Centric Business Processes
In this section, we shall introduce artifact-centric business processes [17]. We first give an informal overview of all concepts involved. Then we present a Petri net formalization and discuss it on the basis of the running example. Admittedly, this section takes up a large part of this paper, but it is required to discuss the contributions to compliance.

3.1 Informal Overview
Artifact-centric modeling promotes the data objects of a business process (called artifacts) and their life cycles to first-class citizens. In the running example, we consider two artifacts: an insurance claim file and a settlement case. Each life cycle describes how the state of an artifact may evolve over time. The actions that change states are executed by agents, for instance the customer, the insurer, the financial clerk, and the controlling officer. As multiple agents can participate in a business process, the artifact-centric approach is particularly suited to model interorganizational business processes. Each artifact has at least one final state which models a successful processing of the artifact, for instance “claim archived” or “settlement paid”. Artifact-centric business processes are inherently declarative: the control flow of the business process is not explicitly modeled, but follows from the life cycles of the artifacts. That is, any execution of actions that brings all artifacts to a final state can be seen as sound.

However, not every sound execution makes sense. For instance, semantically ordered actions of different and independently modeled artifacts (e. g., “assess claim” and “create settlement”) may be executed in any order. In addition, not every combination of final states may be desirable, for instance “claim withdrawn” in combination with “settlement paid”. Therefore, the executions have to be constrained using policies and goal states. A policy is a way of expressing constraints between artifacts. For instance, a policy may constrain the order of state changes in different artifacts (e. g., always executing “assess claim” before “pay settlement”). Finally, goal states restrict final states by reducing those combinations of artifacts’ final states that should be considered successful.

In recent work [17], we presented an approach that takes artifacts, policies, and goal states as input and automatically synthesizes an interorganizational business process. This business process has two important properties: First, it is operational. It explicitly models which agent may perform which action in which state. In addition to the control flow, we can also derive the data flow and even the message flow, because artifacts may be sent between agents. Operational models can be translated into languages such as BPMN [19] and can be easily refined toward execution. Second, the business process is weakly terminating. Weak termination is a correctness criterion that ensures that a goal state is always reachable from every reachable state. Any actions that would lead to deadlocks or livelocks are removed. This means that the approach is correct by design. To summarize, the artifact-centric approach allows us to model
Fig. 2. Artifact-centric business processes in a nutshell: artifacts (A1, A2, A3), policies (P1, P2, P3), and goal states Γ are composed, and controller synthesis yields a weakly terminating business process model M
artifacts and to restrict their manipulation by additional domain knowledge such as policies and goal states. From this declarative model we can then automatically generate a weakly terminating operational model. Figure 2 illustrates the overall approach.
3.2 Formalization
We model artifact-centric business processes with Petri nets [23]. Petri nets combine a simple graphical representation with a rigorous mathematical foundation. They can naturally express locality of actions in distributed systems. This allows us to model the life cycles of several artifacts independently, yielding a more compact model compared to explicit state machines.

Definition 1 (Petri net). A Petri net N = [P, T, F, m0] consists of two finite and disjoint sets P of places and T of transitions, a flow relation F ⊆ (P × T) ∪ (T × P), and an initial marking m0. A marking m : P → ℕ represents a state of the Petri net and is visualized as a distribution of tokens on the places. Transition t is enabled in marking m iff, for all [p, t] ∈ F, m(p) > 0. An enabled transition t can fire, transforming m into the new state m′ with m′(p) = m(p) − W([p, t]) + W([t, p]), where W([x, y]) = 1 if [x, y] ∈ F, and W([x, y]) = 0 otherwise.

A Petri net shall describe the life cycle of an artifact. For our purposes, we have to extend this model with several concepts: Each transition is associated with an action from a fixed set L = Lc ∪ Lu of action labels. This set is partitioned into a set Lc of controllable actions that are executed by agents and a set Lu of uncontrollable actions that are not controllable by any agent, but are under the influence of the environment. Such uncontrollable actions are suitable to model choices that are external to the business process model, such as the outcome of a service call, for instance to a fraud detection agency.

Definition 2 (Artifact [17]). An artifact A = [N, ℓ, Ω] consists of (1) a Petri net N = [P, T, F, m0], (2) a transition labeling ℓ : T → L associating actions with Petri net transitions, and (3) a set Ω of final markings of N representing endpoints in the life cycle of the artifact.

Running example (cont.). Figure 3 depicts the claim and the settlement artifacts. Each transition is labeled by the agent that executes it (insurer, customer,
Fig. 3. Artifacts of the running example: the claim artifact, with final markings Ω = {[withdrawn, @insurer], [archived, @insurer], [reported, @insurer]}, and the settlement artifact, with final markings Ω = {[empty], [unchecked, paid], [authorized, paid]}
controller, and financial clerk) or is shaded gray in case of uncontrollable actions. Two additional places (“@insurer” and “@controller”) model the location of the insurance claim file. The artifact-centric approach allows us to distinguish physical objects (e. g., documents or goods) that need to be transferred between agents to execute certain actions and virtual objects (e. g., databases or electronic documents). By taking the shape of the artifacts and their location into account, we can later derive explicit message transfer among the agents. In our example, we assume that after submitting the claim, a physical file is created by the insurer which can be sent to the controlling officer by executing the respective action “send to controller”. We further assume that the settlement case is a database entry that can be remotely accessed by the insurer, the financial clerk, and the controlling officer.

We can now model each artifact of our business process independently with a Petri net model. Together, the artifacts implicitly specify a process of actions that may be performed by the agents or the environment according to the life cycles of the artifacts. This global model is formalized as the union of the artifact models, which is again an artifact.

Definition 3 (Artifact union [17]). Let A1, . . . , An be artifacts with pairwise disjoint Petri nets N1, . . . , Nn. Define the artifact union A1 ∪ · · · ∪ An = [N, ℓ, Ω] to be the artifact consisting of (1) N = [P1 ∪ · · · ∪ Pn, T1 ∪ · · · ∪ Tn, F1 ∪ · · · ∪ Fn, m01 ⊕ · · · ⊕ m0n], (2) ℓ(t) = ℓi(t) iff t ∈ Ti (i ∈ {1, . . . , n}), and (3) Ω = {m1 ⊕ · · · ⊕ mn | mi ∈ Ωi ∧ 1 ≤ i ≤ n}. Thereby, ⊕ denotes the composition of markings: (m1 ⊕ · · · ⊕ mn)(p) = mi(p) iff p ∈ Pi.
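The following sketch (an illustration only, not the tool chain used later) shows how the markings, enabledness, and firing rule of Definition 1, together with the labels and final markings of Definition 2, can be represented directly. The small net fragment at the end is a simplified, assumed excerpt of the claim artifact, not an exact copy of Fig. 3.

```python
class Artifact:
    """A labeled Petri net with final markings, as in Definitions 1 and 2.
    Arcs are 0/1, i.e. exactly the flow relation F of Definition 1."""
    def __init__(self, places, transitions, arcs, initial, labels, finals):
        self.places = set(places)
        self.transitions = set(transitions)
        self.arcs = set(arcs)              # pairs (place, transition) and (transition, place)
        self.initial = dict(initial)       # marking: place -> number of tokens
        self.labels = dict(labels)         # transition -> action label (the labeling of Def. 2)
        self.finals = list(finals)         # the set Omega of final markings

    def preset(self, t):
        return {p for (p, tt) in self.arcs if tt == t and p in self.places}

    def postset(self, t):
        return {p for (tt, p) in self.arcs if tt == t and p in self.places}

    def enabled(self, marking, t):
        return all(marking.get(p, 0) > 0 for p in self.preset(t))

    def fire(self, marking, t):
        assert self.enabled(marking, t), f"{t} is not enabled"
        m = dict(marking)
        for p in self.preset(t):           # consume a token from each preset place
            m[p] -= 1
        for p in self.postset(t):          # produce a token on each postset place
            m[p] = m.get(p, 0) + 1
        return m

# Simplified, assumed fragment of the claim artifact: the customer's "submit"
# action moves the claim from "void" to "submitted".
claim = Artifact(
    places={"void", "submitted"},
    transitions={"t1"},
    arcs={("void", "t1"), ("t1", "submitted")},
    initial={"void": 1},
    labels={"t1": "submit"},               # executed by the customer (cus) in the example
    finals=[{"submitted": 1}],
)
m1 = claim.fire(claim.initial, "t1")       # {'void': 0, 'submitted': 1}
```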
The previous definition is of a rather technical nature. The only noteworthy property is that the set of final markings of the union consists of all combinations of final markings of the respective artifacts. Conceptually, the union of the artifacts has several downsides, as discussed in Sect. 3.1. First, it may contain sequences of actions that reach deadlocks or livelocks. That is, the model is not necessarily weakly terminating. Second, the artifacts may evolve independently, which may result in implausible execution orders or undesired final states. As stated earlier, the latter problems can be ruled out by defining policies, which restrict interartifact behavior, and goal states, which restrict the final states of the union. Before discussing this, we shall first cope with the first problem.

The transitions of the artifacts are labeled with actions that can be executed by agents. Hence, the agents have control over the evolution of the overall business process. To avoid undesired situations such as deadlocks or livelocks, their behavior needs to be coordinated. That is, it must be constrained such that every execution can be continued to a final state. This coordination can be seen as a controller synthesis problem [22]: given an artifact A, we are interested in a controller C (in fact, also modeled as an artifact) such that their interplay is weakly terminating. The interplay of two artifacts is formalized by their composition, cf. Fig. 4.

Definition 4 (Artifact composition [17]). Let A1 and A2 be artifacts. Define their shared labels as S = {l | ∃t1 ∈ T1, ∃t2 ∈ T2 : ℓ1(t1) = ℓ2(t2) = l}. The composition of A1 and A2 is the artifact A1 ⊕ A2 = [N, ℓ, Ω] consisting of:
– N = [P, T, F, m01 ⊕ m02] with
  • P = P1 ∪ P2,
  • T = (T1 ∪ T2 ∪ {[t1, t2] ∈ T1 × T2 | ℓ1(t1) = ℓ2(t2)}) \ ({t ∈ T1 | ℓ1(t) ∈ S} ∪ {t ∈ T2 | ℓ2(t) ∈ S}),
  • F = ((F1 ∪ F2) ∩ ((P × T) ∪ (T × P))) ∪ {[[t1, t2], p] | [t1, p] ∈ F1 ∨ [t2, p] ∈ F2} ∪ {[p, [t1, t2]] | [p, t1] ∈ F1 ∨ [p, t2] ∈ F2},
– for all t ∈ T ∩ T1: ℓ(t) = ℓ1(t), for all t ∈ T ∩ T2: ℓ(t) = ℓ2(t), and for all [t1, t2] ∈ T ∩ (T1 × T2): ℓ([t1, t2]) = ℓ1(t1), and
– Ω = {m1 ⊕ m2 | m1 ∈ Ω1 ∧ m2 ∈ Ω2}.
The composition A1 ⊕ A2 is complete if for all t ∈ Ti it holds that: if ℓi(t) ∉ S, then ℓi(t) ∈ Lu (i ∈ {1, 2}).
Fig. 4. Composition of two artifacts
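The core of Definition 4 is the fusion of equally labeled transitions. The sketch below shows only this transition construction, mirroring the spirit of Fig. 4; the transition names x, y, z are hypothetical, and places, arcs, labeling, and final markings would be composed analogously.

```python
def composed_transitions(T1, T2, l1, l2):
    """The transition set of Definition 4: transitions with a shared label are
    replaced by fused pairs; all other transitions are kept unchanged."""
    shared = set(l1.values()) & set(l2.values())
    fused = [(t1, t2) for t1 in T1 for t2 in T2 if l1[t1] == l2[t2]]
    kept = [t for t in T1 if l1[t] not in shared] + [t for t in T2 if l2[t] not in shared]
    return kept + fused

# A has an a-labeled transition x and a b-labeled transition y; B has a single
# b-labeled transition z (hypothetical names).
print(composed_transitions(["x", "y"], ["z"], {"x": "a", "y": "b"}, {"z": "b"}))
# ['x', ('y', 'z')]  -- the b-transitions are fused, the a-transition is kept
```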
Given an artifact A, we call another artifact C a controller for A iff (1) their composition A ⊕ C is complete and (2) from each reachable marking of the composition, a final marking m ⊕ m′ of A ⊕ C is reachable. The existence of controllers (also called controllability [27]) is a fundamental correctness criterion for communicating systems such as services. It can be decided constructively [27]: if a controller for an artifact exists, it can be constructed automatically [16]. Note that the requirement of a complete composition makes sure that the controller does not constrain the execution of uncontrollable actions.

Finally, we can define goal states and policies to constrain the interartifact behavior. Goal states are a set of markings of the artifact union and are used as final markings during controller synthesis. To model interdependencies between artifacts, we employ policies. We also model policies with artifacts (similar to behavioral constraints [15]); that is, labeled Petri nets with a set of final markings. These artifacts have no counterpart in reality and are only used to model dependencies between actions of different artifacts. The application of policies then boils down to the composition of the artifacts with these policies.

Running example (cont.). To rule out implausible behavior, we further define the following policies to constrain interartifact behavior and the location's impact on actions:
P1. The claim may be archived only if it resides at the insurer and the settlement is paid.
P2. A settlement may only be created after the claim has been estimated.
P3. To authorize the complete payment of the settlement, the claim artifact must be at hand to the controlling officer.
P4. The claim artifact may only be sent to the controller if it has been estimated and the settlement has not been checked.
P5. The claim artifact may only be sent back to the insurer if the settlement has been authorized.
The policies address different aspects of the artifacts such as location (P1 and P3), execution order (P2), or data constraints (P4 and P5). The modeling of the policies as artifacts is straightforward and depicted in Fig. 5. Note that in policy P2, we use an unlabeled place to express the causality between the transitions.
Fig. 5. Policies for the running example (P1–P5)
This place has no counterpart in any artifact and is added to the composition. As goal states, we specify the set Γ = {[withdrawn, @insurer, empty], [reported, @insurer, empty], [archived, @insurer, paid, unchecked], [archived, @insurer, paid, authorized]}
which models four cases: withdrawal of the claim, detection of a fraud, and settlement with or without authorization. Taking the artifacts, policies, and goal states as input, we can automatically synthesize the weakly terminating and operational business process depicted in Fig. 6.
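Weak termination – every reachable state can still reach a goal state – can be checked on an explicit, finite reachability graph by a forward and a backward search, as in the following sketch. This is a brute-force check for illustration, not the synthesis algorithm of [16,27]; the helper names in the usage comment are hypothetical.

```python
from collections import deque

def weakly_terminating(successors, initial, is_goal):
    """Check that from every state reachable from `initial` some goal state remains
    reachable. `successors` maps a state to an iterable of successor states; the
    reachable state space is assumed to be finite and states to be hashable."""
    # forward search: collect all reachable states and the reversed edges between them
    reachable, todo, reverse = {initial}, deque([initial]), {}
    while todo:
        s = todo.popleft()
        for s2 in successors(s):
            reverse.setdefault(s2, set()).add(s)
            if s2 not in reachable:
                reachable.add(s2)
                todo.append(s2)
    # backward search from the reachable goal states
    alive = {s for s in reachable if is_goal(s)}
    todo = deque(alive)
    while todo:
        s = todo.popleft()
        for s2 in reverse.get(s, ()):
            if s2 not in alive:
                alive.add(s2)
                todo.append(s2)
    return reachable <= alive      # every reachable state can still reach a goal state

# Usage sketch (hypothetical helpers; markings encoded, e.g., as frozensets of
# (place, tokens) pairs):
# weakly_terminating(lambda m: successor_markings(net, m), initial_marking, is_goal)
```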
Fig. 6. The running example as weakly terminating operational business process
4 Modeling Compliance Rules
This section investigates to what extent compliance rules can be integrated into the artifact-centric approach. Before we present different shapes of compliance rules and their formalization with Petri nets, we first discuss the difference between a policy and a compliance rule.
4.1 Enforcing Policies vs. Monitoring Compliance Rules
As described in the previous section, we use policies to express interdependencies between artifacts and to explicitly restrict behavior by making the firing of transitions impossible. Policies thereby express domain knowledge about the business process and its artifacts and are suitable to inhibit implausible or undesired behavior. This in turn affects the subsequent controller synthesis. In contrast, a compliance rule specifies behavior that is not under the direct control of the business process designer. Consequently, a compliance rule must not restrict the behavior of the process, but only monitor it to detect noncompliance. For instance, a compliance rule must not disable external choices within the business process, as they cannot be controlled by any agent. If such a choice were disabled to achieve compliance, the resulting business process model
would be spurious, as the respective choice could not be disabled in reality. Therefore, compliance rules must not restrict the behavior of the artifacts, but only restrict the final states of the model. This may classify behavior as undesired (viz. noncompliant), but this behavior remains reachable. Only if this behavior can be circumvented by the controller synthesis have we faithfully found a compliant business process that can actually be implemented. We formalize this nonrestricting nature as the monitor property [15,27]. Intuitively, this property requires that in every reachable marking of an artifact, for each action label of that artifact, a transition with that label is activated. This rules out situations in which the firing of a transition in a composition is inhibited by a compliance rule.
4.2 Expressiveness of Compliance Rules
Conceptually, we model compliance rules by artifacts with the monitor property. Again, adding a compliance rule to an artifact-centric model boils down to composition. The monitor property ensures that the compliance rule’s transitions are synchronized with the other artifacts, but without restricting (i. e., disabling) actions. That is, the life cycle of a compliance rule model evolves together with the artifacts’ life cycles, but may affect the final states of the composed model.

In a finite-state composition of artifacts, the set of runs reaching a final state forms a regular language. The terminating runs of a compliance rule (i. e., sequences of transitions that reach a final marking) describe compliant runs. This set again forms a regular language. In the composition of the artifacts and the compliance rules, these regular languages are synchronized – viz. intersected – yielding a subset of terminating runs. Regular languages allow us to express a variety of relevant scenarios. In fact, we can express all patterns listed by Dwyer et al. [8], including:
– enforcement and existence of actions (e. g., “Every compliant run must contain an action ‘archive claim’.”),
– absence/exclusion of actions (e. g., “The action ‘withdraw claim’ must not be executed.”),
– ordering (precedence and response) of actions (e. g., “The action ‘create settlement’ must be executed after ‘submit claim’, but before ‘archive claim’.”), and
– numbering constraints/bounded existence of actions (e. g., “The action ‘partially pay settlement’ must not be executed more than three times.”).
The explicit model of the data states of the artifacts further allows us to express rules concerning data flow, such as:
– enforcement/exclusion of data states (e. g., “The claim’s state ‘fraud reported’ and the settlement’s state ‘paid’ must never coincide.”), or
– data and control flow concurrence (e. g., “The action ‘publish review’ may only be executed if the review artifact is in state ‘reviewers blinded’.”).
On top of that, any combinations are possible, allowing us to express complex compliance rules.
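Because the rule classes above are regular, each of them can be checked on a finite run by a small acceptor. The sketch below encodes two of the listed patterns over an assumed, illustrative run of the insurance example; it is an illustration of the idea, not the Petri net formalization given in Sect. 4.3.

```python
def response_holds(run, trigger, reaction):
    """Ordering (response): every `trigger` is eventually followed by a `reaction`."""
    pending = False
    for event in run:
        if event == trigger:
            pending = True
        elif event == reaction:
            pending = False
    return not pending             # a terminating run must leave no obligation open

def bounded_existence_holds(run, action, bound):
    """Bounded existence: `action` occurs at most `bound` times."""
    return run.count(action) <= bound

# An assumed, illustrative run of the insurance claim process:
run = ["submit", "prepare", "assess", "estimateLow", "create",
       "payPart", "payPart", "payRest", "archive"]
print(response_holds(run, "estimateHigh", "authorize"))   # True: no high estimate, nothing to authorize
print(bounded_existence_holds(run, "payPart", 3))         # True: only two partial payments
```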
The presented approach is, however, not applicable to nonregular languages. For instance, a rule requiring that a compliant run must have an arbitrarily large, but equal, number of a and b actions, or that a and b actions must be properly balanced (Dyck languages), cannot be expressed with finite-state models. Similarly, rules that affect infinite runs (e. g., certain LTL formulae [21]) cannot be expressed. Infinite runs are predominantly used to reason about reactive systems. A business process, however, is usually designed to eventually reach a final state – this basically is the essence of the soundness property. Therefore, we shall focus on an interpretation of LTL which only considers finite runs, similar to a semantics described by Havelund and Roşu [11]. Just like Awad et al. [3], we also do not consider the X (next state) operator of CTL∗, because we typically discuss distributed systems in which states are partially ordered. Finally, we do not use timed Petri nets and hence can make no statements on temporal properties of business processes. However, we can abstract the variation of time by events such as “time passes” or data states such as “expired”, as in [10,18].
4.3 Formalization
As mentioned earlier, we again use artifacts (i. e., Petri nets with final markings and action labels) that satisfy the monitor property to model compliance rules. As an example, we consider the following compliance rules for our example insurance claim process:
R1. All insurance claims with an estimated high settlement must be authorized.
R2. Customers must not be allowed to withdraw insurance claims.
R3. Settlements should be paid in at most three parts.
Figure 7 shows the Petri net formalizations of these compliance rules. In rule R1, we exploited the fact that the actions “authorize” and “estimateHigh” are executed at most once. In rules R2 and R3, the monitor property is achieved by allowing “withdraw” and “payPart” to fire in any reachable state. Without
Fig. 7. Compliance rules modeled as Petri nets
restriction of the behavior, the final markings classify executions as compliant or not. For instance, executing “estimateHigh” in rule R1 without eventually executing “authorize” does not reach the final marking [final2].
4.4 Discussion
We conclude this section with a discussion of the implications of using Petri nets to formalize compliance rules.
– Single formalism. We can model artifacts, policies, and compliance rules with the same formalism. Though we do not claim that Petri nets should be used by domain experts to model compliance regulations, using a single formalism still facilitates the modeling and verification process. Furthermore, each rule implicitly models compliant behavior which can be simulated. This is not possible if, for instance, arbitrary LTL formulae are considered.
– Level of abstraction. Rules can be expressed using minimal overhead. Each rule contains only those places and transitions that are affected by the rule, plus some additional places to model further causalities. In particular, no placeholder elements (e. g., anonymous activities in BPMN-Q [2]) are required.
– Independent design. The rules can be formulated independently of the artifact and policy models. That is, the modeler does not need to be confronted with the composite model. This modular approach is more likely to scale, because the rules can also be validated independently of the other rules.
– Reusability. The composition is defined in terms of action labels. Therefore, rules may be reused in different business process models as long as the labels match. This can be enforced using standard naming schemes or ontologies.
– Runtime monitoring. The monitor property ensures that the detection of noncompliant behavior is transparent to the process, as no behavior is restricted. Therefore, the models of the compliance rules can also be used to check compliance during or after runtime, for instance by inspecting execution logs.
– Rule generation. Finally, the structure of the Petri nets modeling compliance rules is very generic. Therefore, it should be possible to automatically generate Petri nets for standard scenarios or to provide templates in which only the names of the constrained actions need to be filled in. Also, the monitor property can be automatically enforced.
5 Compliance by Design
This section presents the second contribution of this paper: the construction of business process models that are compliant by design. Besides the construction, we also discuss the diagnosis of noncompliant business process models.
5.1 Constructing Compliant Models
None of the compliance rules discussed in the previous section hold in the example process depicted in Fig. 6. This noncompliance can be detected by standard
model checking tools. They usually provide a counterexample which describes how a noncompliant situation can be reached. For instance, the action sequence “1. submit, 2. prepare, 3. requestInfo, 4. withdraw” is a witness that the process does not comply with rule R2. To satisfy this requirement, the transition “withdraw” can simply be removed. However, implementing the other rules is more complicated, and each modification would require another compliance check.

We propose to synthesize a compliant model instead of verifying compliance. By composing the Petri net models of the artifacts (cf. Fig. 3), the policies (cf. Fig. 5), and the compliance rules (cf. Fig. 7) and by taking the goal states into account, we derive a Petri net that models the artifacts’ life cycles, restricted by the policies and with final states constrained by the goal states and the compliance rules. Compliant behavior is now reduced to weak termination, and we can apply the same algorithm [27] and tool [16] to synthesize a controller. If such a controller exists, it provides an operational model that specifies the order in which the agents need to perform their actions. This model is compliant by design – a subsequent verification is not required. Besides weak termination (and hence, compliance), the synthesis algorithm further guarantees that the resulting model is most permissive [27]. That is, exactly the behavior that would violate weak termination has been removed. Another important aspect of the approach is its flexibility to add further compliance rules. That is, we do not need to edit the existing model, but can simply repeat the synthesis for the new rule set.

Running example (cont.). Figure 8 depicts the resulting business process model. It obviously contains no transition labeled with “withdraw”, but the implementation of the other rules yielded a quite different structure of the part modeling the settlement processing. It is important to stress that the depicted business process model has been synthesized completely automatically using the partner synthesis tool Wendy [16] and the Petri net synthesis tool Petrify [7]. Admittedly, it is a rather complicated model, but any valid implementation of the compliance
Fig. 8. Operational business process satisfying the compliance rules R1–R3
rules would yield the same behavior or a subset. Though our running example is clearly a toy example, experimental results [16] show that controller synthesis can be effectively applied to models with millions of states.
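The overall pipeline of Sect. 5.1 can be summarized by the following sketch. The three callables passed as parameters (compose, restrict_final_markings, synthesize_controller) are hypothetical placeholders for the composition of Definition 4 and the partner synthesis of [16,27]; they do not correspond to the actual interfaces of Wendy or Petrify.

```python
def compliant_by_design(artifacts, policies, rules, goal_states,
                        compose, restrict_final_markings, synthesize_controller):
    """Conceptual compliance-by-design pipeline (hedged sketch, not the tool chain)."""
    # 1. compose artifacts, policies, and compliance rules into one labeled net
    model = artifacts[0]
    for part in artifacts[1:] + policies + rules:
        model = compose(model, part)
    # 2. only combinations of final markings that agree with a goal state count as success
    model = restrict_final_markings(model, goal_states)
    # 3. a controller exists iff a compliant, weakly terminating operational
    #    process exists; otherwise the model is diagnosed (Sect. 5.2)
    controller = synthesize_controller(model)
    if controller is None:
        raise ValueError("no compliant operational business process exists")
    return compose(model, controller)
```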
5.2 Diagnosing Noncompliant Models
So far, we have only considered the case that the business process can be constructed such that it satisfies the compliance rules. Then, we can find a controller that guarantees weak termination. In case a compliance rule is not met by the business process, no such controller exists. That is, the business process cannot be made compliant. Intuitively, the intersection between the behavior of the business process and the compliance rule is empty. Just as a controller would be a witness for compliant behavior, such a witness does not exist in case of noncompliance.

In previous work [14], we studied uncontrollable models and presented an algorithm that generates diagnosis information. This diagnosis information is presented as a graph that overapproximates the behavior of any controller. As no controller exists that can avoid states that violate weak termination, this graph contains paths to such deadlocks or livelocking states. These paths, and the reason they cannot be avoided by a controller, serve as a counterexample similar to those of standard model checking. Note that such a counterexample not only describes noncompliance of a concrete business process model, but of a whole family of operational models that can be derived from a set of artifacts.

Running example (cont.). To exemplify this diagnosis information, consider the following compliance rule:
R4. The customer should not be asked for further information more than once.
This rule is not realizable in the business process, because the outcome of the fraud detection service is not under the control of any agent. Furthermore, rule R2 excluded the possibility for the customer to withdraw the claim. Hence, we cannot exclude a noncompliant run in which the transition “requestInfo” is fired more than once. The diagnosis algorithm (which is also implemented in the tool Wendy [16]) generates a graph similar to that depicted in Fig. 9. From this graph, we pruned those paths that eventually reach a final state (represented as check marks) and replaced them by dashed arrows. As we can see, a final state cannot be reached after the second “requestInfo” action is executed. However,
Fig. 9. Counterexample for an unrealizable compliance rule
we cannot avoid this situation, because in the gray state, the external service decides whether or not the claim is fraudulent or whether further information is required. For the business process, this choice is external and uncontrollable, and hence it must be correct for any outcome.
6 Related Work
Compliance has received a lot of attention in the business process management community. Contributions related to our approach can be classified as follows.

Compliance by detection. Awad et al. [3] investigate a pattern-based compliance check based on BPMN-Q [2]. They also cover the compliance rule classes defined by Dwyer et al. [8] and give a CTL formalization as well as an antipattern for each rule. These antipatterns are used to highlight the compliance violations in a BPMN model. Such a visualization is very valuable for the process designer, and it would be interesting to see whether such antipatterns are also applicable to the artifact-centric approach. Sadiq et al. [26] use a declarative specification of compliance rules from which they derive compliance checks. These checks are then annotated to a business process and monitored during its execution. These checks are similar to the nonblocking compliance rule models that only monitor behavior rather than constraining it. Lu et al. [18] compare business processes with compliance rules and derive a compliance degree. This is an interesting approach, because it replaces yes/no answers by numeric values which could help to more easily diagnose noncompliance. Knuplesch et al. [12] analyze data aspects of operational business process models. Similar to the artifact-centric approach, data values are abstracted into compact life cycles.

Compliance by design. Goedertier and Vanthienen [10] introduce the declarative language PENELOPE to specify compliance rules. From these rules, a state space and a BPMN model are generated which are compliant by design. This approach is limited to acyclic process models. Furthermore, the purpose of the generated model is rather the validation of the specified rules than the execution. Küster et al. [13] study the interplay between control flow models and object life cycles. The authors present an algorithm to automatically derive a sound process model from given object life cycles. The framework is, however, not designed to express dependencies between life cycles and therefore cannot specify complex policies or compliance rules.

To the best of our knowledge, this paper presents the first approach that generates compliant and operational business process models from declarative specifications of artifact life cycles, policies, and compliance rules.
7 Conclusion
Summary. We presented an approach to automatically construct business process models that are compliant by design. The approach follows an artifact-centric modeling style in which business processes are specified from the point
of view of the involved data objects. We showed how artifact-centric business processes can be canonically extended to also take compliance rules into account. These rules can express constraints on the execution of actions, but can also take data and location information into account. By composing compliance rules with the artifact-centric model, we could reduce the check for compliant behavior to the reachability of final states. Consequently, we could use existing synthesis algorithms and tools to automatically generate compliant business process models. In case of noncompliance, we further sketched diagnosis information that can be used to visualize the reasons that make compliance rules unrealizable.

Lessons learnt. With Petri nets, we can use a single formalism to model artifact life cycles, interartifact dependencies, and compliance rules. Only this unified way of modeling enabled us to approach compliance by design with only a few concepts (namely artifacts, composition, and partner synthesis). In addition, it is notable that the compliance rule models can also be used to check operational business process models for compliance: by composing compliance rules with existing Petri net models, we reduce the compliance check to a check for weak termination and can thus use standard verification tools [9].

Future work. We see numerous directions to continue the work in the area of compliance by design. We are currently working on an extension for BPMN to provide a graphical notation that is more accessible for domain experts to model artifacts, policies, and compliance rules. A canonical second step would then be the integration of the approach into a modeling tool and an empirical evaluation thereof. Another aspect that needs to be addressed is the expressiveness of the artifacts and compliance rules. Of great interest are instances and agent roles. To model more involved scenarios, it is crucial to distinguish several instances of an artifact. We currently assume that for each artifact only a single instance exists. Furthermore, agent roles would allow a fine-grained description of access controls or of concepts such as the four-eye principle.
References
1. van der Aalst, W.M.P., Pesic, M.: DecSerFlow: Towards a truly declarative service flow language. In: Bravetti, M., Núñez, M., Tennenholtz, M. (eds.) WS-FM 2006. LNCS, vol. 4184, pp. 1–23. Springer, Heidelberg (2006)
2. Awad, A.: BPMN-Q: a language to query business processes. In: EMISA 2007. LNI P-119, pp. 115–128. GI (2007)
3. Awad, A., Weidlich, M., Weske, M.: Visually specifying compliance rules and explaining their violations for business processes. J. Vis. Lang. Comput. 22(1), 30–55 (2011)
4. Ben-Ari, M., Manna, Z., Pnueli, A.: The temporal logic of branching time. In: POPL 1981, pp. 164–176. ACM, New York (1981)
5. Cannon, J.C., Byers, M.: Compliance deconstructed. ACM Queue 4(7), 30–37 (2006)
6. Clarke, E.M., Grumberg, O., Peled, D.A.: Model Checking. MIT Press, Cambridge (1999)
7. Cortadella, J., Kishinevsky, M., Kondratyev, A., Lavagno, L., Yakovlev, A.: Petrify: A tool for manipulating concurrent specifications and synthesis of asynchronous controllers. Trans. Inf. and Syst. E80-D(3), 315–325 (1997)
8. Dwyer, M.B., Avrunin, G.S., Corbett, J.C.: Patterns in property specifications for finite-state verification. In: ICSE 1999, pp. 411–420. IEEE, Los Alamitos (1999)
9. Fahland, D., Favre, C., Jobstmann, B., Koehler, J., Lohmann, N., Völzer, H., Wolf, K.: Instantaneous soundness checking of industrial business process models. In: Dayal, U., Eder, J., Koehler, J., Reijers, H.A. (eds.) BPM 2009. LNCS, vol. 5701, pp. 278–293. Springer, Heidelberg (2009)
10. Goedertier, S., Vanthienen, J.: Designing compliant business processes with obligations and permissions. In: Eder, J., Dustdar, S. (eds.) BPM Workshops 2006. LNCS, vol. 4103, pp. 5–14. Springer, Heidelberg (2006)
11. Havelund, K., Roşu, G.: Testing linear temporal logic formulae on finite execution traces. Technical Report 01.08, RIACS (2001)
12. Knuplesch, D., Ly, L.T., Rinderle-Ma, S., Pfeifer, H., Dadam, P.: On enabling data-aware compliance checking of business process models. In: Parsons, J., Saeki, M., Shoval, P., Woo, C., Wand, Y. (eds.) ER 2010. LNCS, vol. 6412, pp. 332–346. Springer, Heidelberg (2010)
13. Küster, J.M., Ryndina, K., Gall, H.: Generation of business process models for object life cycle compliance. In: Alonso, G., Dadam, P., Rosemann, M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 165–181. Springer, Heidelberg (2007)
14. Lohmann, N.: Why does my service have no partners? In: Bruni, R., Wolf, K. (eds.) WS-FM 2008. LNCS, vol. 5387, pp. 191–206. Springer, Heidelberg (2009)
15. Lohmann, N., Massuthe, P., Wolf, K.: Behavioral constraints for services. In: Alonso, G., Dadam, P., Rosemann, M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 271–287. Springer, Heidelberg (2007)
16. Lohmann, N., Weinberg, D.: Wendy: A tool to synthesize partners for services. In: Lilius, J., Penczek, W. (eds.) PETRI NETS 2010. LNCS, vol. 6128, pp. 297–307. Springer, Heidelberg (2010), http://service-technology.org/wendy
17. Lohmann, N., Wolf, K.: Artifact-centric choreographies. In: Maglio, P.P., Weske, M., Yang, J., Fantinato, M. (eds.) ICSOC 2010. LNCS, vol. 6470, pp. 32–46. Springer, Heidelberg (2010)
18. Lu, R., Sadiq, S.W., Governatori, G.: Compliance aware business process design. In: ter Hofstede, A.H.M., Benatallah, B., Paik, H.-Y. (eds.) BPM Workshops 2007. LNCS, vol. 4928, pp. 120–131. Springer, Heidelberg (2008)
19. OMG: Business Process Model and Notation (BPMN). Version 2.0, Object Management Group (2011), http://www.omg.org/spec/BPMN/2.0
20. Pnueli, A.: In transition from global to modular temporal reasoning about programs. In: Logics and Models of Concurrent Systems. NATO Advanced Summer Institutes, vol. F-13, pp. 123–144. Springer, Heidelberg (1985)
21. Pnueli, A.: The temporal logic of programs. In: FOCS 1977, pp. 46–57. IEEE, Los Alamitos (1977)
22. Ramadge, P., Wonham, W.: Supervisory control of a class of discrete event processes. SIAM J. Control Optim. 25(1), 206–230 (1987)
23. Reisig, W.: Petri Nets. EATCS Monographs on Theoretical Computer Science. Springer, Heidelberg (1985)
24. Ryndina, K., Küster, J.M., Gall, H.: Consistency of business process models and object life cycles. In: Kühne, T. (ed.) MoDELS 2006. LNCS, vol. 4364, pp. 80–90. Springer, Heidelberg (2007)
25. Sackmann, S., Kähmer, M., Gilliot, M., Lowis, L.: A classification model for automating compliance. In: CEC/EEE 2008, pp. 79–86. IEEE, Los Alamitos (2008)
26. Sadiq, S.W., Governatori, G., Namiri, K.: Modeling control objectives for business process compliance. In: Alonso, G., Dadam, P., Rosemann, M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 149–164. Springer, Heidelberg (2007)
27. Wolf, K.: Does my service have partners? In: Jensen, K., van der Aalst, W.M.P. (eds.) Transactions on Petri Nets and Other Models of Concurrency II. LNCS, vol. 5460, pp. 152–171. Springer, Heidelberg (2009)
Refining Process Models through the Analysis of Informal Work Practice

Simon Brander², Knut Hinkelmann², Bo Hu¹, Andreas Martin², Uwe V. Riss¹, Barbara Thönssen², and Hans Friedrich Witschel²

¹ SAP AG, Walldorf, Germany
{Bo01.Hu,Uwe.Riss}@sap.com
² University of Applied Sciences Northwestern Switzerland (FHNW), Riggenbachstr. 16, 4600 Olten, Switzerland
{Barbara.Thoenssen,Simon.Brander,Andreas.Martin,Knut.Hinkelmann,HansFriedrich.Witschel}@fhnw.ch
Abstract. The work presented in this paper explores the potential of leveraging the traces of informal work and collaboration in order to improve business processes over time. As process executions often differ from the original design due to individual preferences, skills or competencies and exceptions, we propose methods to analyse personal work preferences, such as email communication and personal task execution in a task management application. The outcome of these methods is, on the one hand, the detection of internal substructures (subtasks or branches) of activities and, on the other hand, the recommendation of resources to be used in activities, both leading to the improvement of business process models. Our first results show that, even though human intervention is still required to operationalise these insights, it is indeed possible to derive interesting and new insights about business processes from traces of informal work and to infer suggestions for process model changes.
1 Introduction
Modelling business processes is a time-consuming, costly and error-prone task. Even with the greatest effort, it is often impossible to foresee all the situations that may occur during process execution. It is not even desirable to include all the possibilities due to a very practical reason: a complete model normally has high complexity and thus can be prohibitively expensive to manage and visualise. One compromise is to define only high-level structures (e.g. critical branches of business processes) either manually or semi-automatically with intervention from human experts. In order to reduce the workload of and dependency on process experts, approaches based on process mining have been put forward. They analyse the event logs of enterprise information systems so as to automatically suggest the model structures or to check the conformance of a model against the actual sequence of events reflected in the log.
⋆ Contact author.
Log-based mining relies on the quality of event documentation, which renders it less useful in situations where processes are handled through informal means. In everyday work practice, it is not unusual for people to keep their own records (such as personal notes and draft documents) and information about process activities outside the enterprise information system and hence invisible in event logs. Meanwhile, in modern organisations, it is also common for employees to conduct process-related communication via email—in addition to or as a replacement of communication through a workflow system. Emails, even though closely work-related, demonstrate strong informal and personal characteristics. Although artefacts such as personal records and emails are not deemed the same as formally modelled knowledge, they carry important information about day-to-day business processes, a good understanding of which provides valuable input to the formal process models. The work presented in this paper explores the potential of leveraging the traces of informal work and collaboration in order to improve process models over time. We look at two types of traces, namely emails and personal task instances gathered in a task management application. We propose methods that analyse the internal structure of emails and personal tasks in order to derive information about the business process, e.g. for adding activities or branches or recommending resources. It is important to understand that the suggested methods do not treat emails or tasks as atomic units – as is often done with events in system logs – but analyse their internal structure.
2 Related Work
In this section, we review the two research threads from which our approach is drawn. We first familiarise readers with the basic notions of agile business processes that will then be used throughout the paper when discussing the details and impacts of our algorithms. We then elaborate on how our approach differs from the apparently similar process mining approaches that also rely on traces of work practice.

2.1 Agile Process Modelling and Execution
Agility is one of the major challenges that modern enterprises confront nowadays and a distinguishing characteristic of successful companies. Organisational agility can be defined as an organisation's ability to "[. . .] sense opportunity or threat, prioritise its potential responses, and act efficiently and effectively." [14] In terms of business processes, organisational agility implies a more permeable separation between build-time modelling and run-time execution, allowing a fluid transition between them [13]. The Knowledge-Intensive Service Support (KISS) approach [10] facilitates the shift from rigid process modelling approaches to flexible and agile processes. KISS makes it possible to tackle exceptional situations, unforeseeable events, unpredictable situations, high variability, and highly complex tasks without having to change the process model very often. This is done
by combining conventional business process modelling with (business) rules—the process model can be seen as a basic action flow and guidance, while the business rules define the constraints. In order to provide this flexibility, KISS introduces a special modelling construct called Variable Activity [1]. [18] has further developed the Variable Activity concept into the Knowledge-Intensive Activity (KIA). A knowledge-intensive activity is an activity that requires expertise or skills to be executed. The execution and results of KIAs depend on the actual context wherein the process is enacted. Context data include application data, process data, functional data, or further information about needed resources [5]. A knowledge-intensive process (KIP) is a special form of process where (some of) the activities are executed optionally—depending on information specific to the particular process instance—and in any order and with any frequency. Thus a KIP is comparable to the ad-hoc processes known from BPMN1. The KISS approach is combined with task patterns into a collaborative process knowledge management and maturing system called KISSmir [13,18].

2.2 Process Mining
Process mining discovers formal process models from usage data, e.g. event logs containing the recorded actions of real process executions (cf. [2,8,17]). Enterprise information systems, such as enterprise resource planning systems, are well suited to process mining. These systems use a structured approach, which means that they have a "predefined way of dealing with things" [16]. Unstructured systems, such as groupware systems, on the other hand, provide more liberties to the user and the exchange of information. Such freedom makes the use of event logs from unstructured systems more challenging. Let us take emails as an example. While an email header contains structured elements (e.g. sender, recipient, date/time, subject), its main content is stored in the unstructured body and is not readily available for process analysis. There are approaches and tools that focus on process discovery from the event data and logs produced by unstructured systems. For instance, Dustdar et al. propose mining techniques for ad-hoc processes that are supported by a process-aware collaboration system named Caramba [9]. Van der Aalst and Nikolov describe a process mining approach and a tool that uses email logs as input and creates a process model out of them [15]. They rely heavily on the assumption that the emails are already appropriately tagged (e.g. with task name and status in the subject), which is not necessarily true. Tagging emails is time-consuming and error-prone. Di Ciccio et al. go one step further and use information extraction to identify task names from the unstructured email body [7]. As opposed to the approaches mentioned in this section, we do not treat traces of people's work (such as events in event logs, tasks or emails) as atomic units, which essentially need to be put into the right order. Instead, we are interested in investigating the potential of discovering the internal structure of such events by analysing the contents of traces (emails, tasks) and relating them to the
1 http://www.omg.org/spec/BPMN/2.0/PDF
overall business process. We claim that process models may not be fine-grained enough in many cases and that therefore the treatment of events as atomic units will fail to reveal many possible refinements of such models. We also claim that traces of personal work (as opposed to events in information systems) may contain information of a different quality, although this information may be buried in noise, hence making human intervention necessary.
3 Discovering Informal Work Practice
As indicated by the number and diversity of ISO 9001 certifications in Europe [11], Business Process Management is becoming increasingly important not only for large enterprises but also for small- and medium-sized ones. Thus far, business processes are primarily modelled by process experts and delivered as one integrated package to the end users. Recently, an evident trend in businesses is the demand for adaptivity, in the face of both volatile markets and rapidly changing customer requirements. The conventional business process modelling approach is therefore under increasing pressure. On the one hand, it might take tremendous effort to reach agreement on a standard business process model. On the other hand, in order to ensure generality, exceptions and special cases are sacrificed. A business process modelled on the basis of expert knowledge alone is not likely to reflect actual work practices accurately. The situation is not alleviated by process automation. For instance, a Workflow Management System (WfMS) often leaves out details of the modelled activities so as to provide certain flexibility to the end users. Such "gaps" are then filled by personal task execution preferences and particulars. In practice, such information normally hides in the traces of informal communication (e.g. emails) and personal task "diaries" (e.g. personal notes and logs), and in many cases gets lost in the vast amount of available information. In the rest of this paper, we will assume the existence of a coarsely crafted process model which is treated as the "seed" for further refinement. Due to their knowledge-intensive nature, business processes can normally be refined in two general directions: further development of process knowledge (in terms of activity order, granularity, etc.) and further development of process-related knowledge (in terms of supporting information of business processes). In this section, we present two scenarios in which the mining of informal work can contribute to the improvement of business process models.

3.1 Resource Recommendation
When performing a task, a person often consults resources, the selection of which is based on her personal skills, experiences or preferences. A good understanding of such information would allow us to refine the corresponding process models or help improving execution of process instances. Hence we wish to discover the set of local resources (documents as well as task collaborators) that people use to accomplish their tasks. When working on an assigned task with the help of task management tools such as KISSmir [18], a person can add resources to that task. In order to
enable automatic resource provision, completed ('historical') task instances are analysed. Figure 1 shows how this might work: assuming that the current task of checking a certain certification already has an association with a resource R1, we can recommend further resources such as R4 by analysing historical data, e.g. the co-occurrence of R1 with other resources in past task executions. In addition, we propose to consider the set of emails associated with a certain activity across all process instances. This is done by extracting from emails the attachments (document resources) and embedded links (web pages), as well as senders and receivers (people). We then simply count the frequency with which resources occur in these emails and recommend the most frequent ones whenever a user works on the given activity again. An analysis of the co-occurrence of email recipients can additionally be leveraged for recipient recommendation during email composition (cf. [6]).
Fig. 1. Resource recommendation based on historical data
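As a rough illustration of the frequency-based recommendation described above, the following sketch counts how often candidate resources co-occurred with the already attached resources in completed task instances of the same activity. It is only a minimal, self-contained approximation of the idea; the resource identifiers, the toy history and the use of Python are illustrative and not part of the original system.

```python
from collections import Counter

def recommend_resources(history, current_resources, top_k=3):
    # history: iterable of sets, each holding the resources used by one
    # completed task instance of the same activity.
    # current_resources: resources already attached to the running instance.
    scores = Counter()
    for past in history:
        if past & current_resources:          # co-occurrence with known resources
            for r in past - current_resources:
                scores[r] += 1                # count each candidate resource
    return [r for r, _ in scores.most_common(top_k)]

history = [{"R1", "R3", "R4"}, {"R1", "R4"}, {"R2", "R4"}]
print(recommend_resources(history, {"R1"}))   # -> ['R4', 'R3'] (ties may vary)
```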
3.2 Task Refinement
Email/task diary data contains valuable information about potential substructures of the business process that may not be reflected in the current process model. Such structures can be either implicit subtasks or context-based branches. Subtasks. Sometimes a process model is not fine-grained enough, i.e. its process activity might implicitly encapsulate various sub-activities that need to be performed (with or without a certain order) in order to accomplish a corresponding task. It might become necessary to differentiate and explicate such sub-structures when one has to collaborate with and/or delegate certain sub-tasks to others, or lift up the sub-structures to reflect them in the workflow, for better ICT support. Branches. Often, the need to perform a sub-task of a given task is not fixed and depends on the context wherein the task is carried out. For instance, the task of requesting an intern contract for a student might go along different routes depending on the student’s nationality: contracts for foreign students might be more difficult and complicated to acquire than those for local students. This is not always foreseeable until a concrete instance task calls for attention. These conditional branches should be reflected in the process model by decision points that check context attributes in order to select the most appropriate branches.
In both cases, we can expect to find corresponding evidence in email/task data, provided that the “seed” process model is sufficiently used, accumulating a large amount of data covering all possible alternatives. When using a task management system, users have the possibility of making subtasks explicit (creation of subtask entities, delegation), which can be exploited. However, often that is done in an implicit way, e.g. delegating work via email. Data mining methods can be employed to discover those sub-structures. Our assumption is that tasks/emails that belong to different branches and/or sub-activities will have different characteristics whereas tasks/emails that belong to the same subtask/branch will be substantially similar. By clustering all items that are associated with an activity (again across all process instances), we can expect that the resulting clusters will demonstrate branches and/or subtasks of the process. Figure 2 shows how tasks have been clustered based on attached resources, yielding three task clusters. We can now imagine that, as a result of inspecting the clusters, the two bigger ones form the basis of refining the task “Check documents” into one branch that uses resource R2 (which might be a document used for EU students) and another using resources R3 and R4 (for non-European students). Of course, automatic clustering is rarely fully reliable. Human analysis is necessary to quality-check the outcomes. Refinement based on task delegation. When executing KIAs, a participant can delegate parts of the task to colleagues. If this happens in a number of cases, it might be an indicator that the process model is not detailed enough and the knowledge-intensive task could be divided into several subtasks. We assume that a participant executing the task will document the delegation in the task description. We can analyse records of previous task executions to identify in which situations a task delegation happened. The criteria can be included in the process model as a decision point. By identifying subtasks, a KIA becomes better structured and evolves towards a knowledge-intensive process. Refinement based on task resources. Resource-based task refinement aims at refining the existing process model by using the stored information about a given task. With clustering, patterns of used resources are discovered. It is assumed that one cluster, i.e. one pattern of used resources equals a particular (sub-)task or branch. The existence of multiple patterns of resources associated with a single task, therefore, indicates the possibility of the process model refinement. The identified process model adaptations are then implemented to reflect
Fig. 2. Task refinement based on clustering by resource usage
such resource patterns / clusters. In order to determine whether a model refinement is appropriate, human intervention is necessary. Variables that define an identified pattern are presented to a human expert. The expert can also take a closer look at the individual resources (e.g. the contents of a text document) in a resource cluster to understand the rationale. This ensures that dependencies are discovered even if the system were not initially able to find them. Refinement based on emails. This kind of refinement resembles the previous one - all emails belonging to an activity (and across process instances) are clustered and the cluster representations are inspected by human experts to identify if they correspond to subtasks and/or branches within the activity. We consider two kinds of feature vectors to represent emails for clustering. The first kind is based on communication partners, i.e. senders and recipients of emails; here, we assume that different subtasks and/or branches of an activity require the interaction with different kinds of people. On the other hand, even when communicating with the same people within a task, different aspects (i.e. subtasks) may be addressed, which may be detected by analysing the email’s unstructured parts. Therefore, the second kind of feature vectors is built from the text of the email’s subject and body. After automatic analysis is finished, the human expert is provided with a meaningful description of the clusters. For instance, an expert will need to know the number of emails in a cluster, the categories (i.e. roles) of communication partners together with their frequency, the most prominent keywords, derived via term extraction from the emails’ subjects and bodies and a selection of example email subjects. This should allow her to see what the emails within a cluster have in common.
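To make the resource-based refinement step above concrete, the sketch below clusters historical task instances by their binary resource-usage vectors; each resulting group is a candidate resource pattern to be shown to a human expert. scikit-learn's KMeans is used here purely as a convenient stand-in—the paper does not prescribe a particular clustering implementation—and the task data are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

resources = ["R1", "R2", "R3", "R4"]
tasks = [{"R1", "R2"}, {"R2"}, {"R1", "R3", "R4"}, {"R3", "R4"}]   # toy history

# one binary resource-usage vector per completed task instance
X = np.array([[1 if r in t else 0 for r in resources] for t in tasks])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for c in sorted(set(labels)):
    members = [sorted(tasks[i]) for i in range(len(tasks)) if labels[i] == c]
    print(f"resource pattern {c}: {members}")   # candidate subtask/branch
```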
4 Experiments
We employed well-established data mining and data clustering algorithms when implementing the informal-work-based process model refinement approach. The system is still under development. A preliminary evaluation, however, demonstrated the practical value of our approach.

4.1 Implementation
In this section, we focus more on the emails which present a real challenge due to their unstructured nature. Task diaries, on the other hand, can be processed in a more straightforward way. Mapping to activity. Collecting traces of informal work related to the activities of a business process requires mapping both task instances and emails to these activities. Mapping task instances to activities is straightforward with the help of the KISSmir system [18]—the task instances are assigned by the workflow engine and thus by definition belong to a certain activity of the underlying business process. The real challenge for assigning emails to activities is due to a lack of direct association. As suggested in [15], mapping can be achieved in two steps
a) mapping emails to a process instance and b) mapping emails to activities or tasks within that process instance. Van der Aalst and Nikolov [15] assume the existence of annotations of emails which are added either manually by end users or by a process-aware information system. They acknowledge that email annotation regarding the activity will not always be available and propose that this classification can alternatively be obtained automatically by matching the name of the task against the email subjects, where a partial match is acceptable. This approach presumes that the subject of each email that belongs to a certain activity of a business process will always contain the name of that activity. This is, however, only guaranteed if the email is written manually with significant attention or generated automatically by a task management system’s email functionality. It may become less useful in other cases. We propose to utilise a classical machine learning approach that can (semi-)automatically assign emails to activities. This is done as follows: 1. Acquiring training data. The users are asked to annotate a certain number (say 100) of emails manually as training data. For the first step, i.e. the identification of the process instance to which an email belongs, we follow the approach in [15] and let the user choose among a number of criteria, e.g. a contact linked to an email, to select all emails belonging to a process instance. 2. Feature extraction. From each email in the training data and the set of remaining emails, we extract the email addresses of the sender and the recipients, the time-stamp, attachments and the text of the subject and body. We use these fields as features - where the text parts (subject and body) have to be broken into words that will constitute the features. 3. Classification. We use standard classifiers from the machine learning literature to learn a model from the training data and apply it to unclassified emails. Since the text-based features will result in a high-dimensional feature space and will thus outweigh the other features when combined into one feature vector, we propose to proceed with fusion methods that have been explored in the area of multi-media content, where low-dimensional image features have to be combined with high-dimensional textual features, cf. [3]. These methods either define and combine separate kernels (i.e. similarity functions) for the different features or build different classifiers for the different feature sets and combine these with a meta-classifier. In this work, we focus on the actual extraction of process-relevant information from tasks and emails and therefore have not implemented/evaluated this automated mapping. A closer analysis will be the subject of future work. Communication partner categories. We classify communication partners in email exchange according to their job function or role in the business process. Such kind of information can be found in the employee directory. Representing this information formally, for example in an enterprise ontology allows inferring the required information of job function, role or project etc. automatically [12]. In our approach,
end users can define categories of communication partners by specifying the category names followed by keywords indicating what method should be used to filter the list of email contacts. Currently, three methods are implemented (a minimal matching routine is sketched at the end of this subsection):
– LIST: enumerating a list of email addresses that together make up the category, e.g. Process Owner = LIST([email protected])
– CONTAINS: specifying a string (e.g. the domain) that all member email addresses must contain, e.g. SAP = CONTAINS(@sap.com)
– NOMATCH: reserved for all email addresses that are not qualified in the previous categories, e.g. External = NOMATCH
In order to classify a given email address E using such a category definition, the email processing engine will start to match E against categories defined via LIST, then – if no match was found – against the CONTAINS definitions, and finally assign it to the NOMATCH category.
Extracted features. After the assignment of task instances and emails to process and activity instances and the categorisation of contacts in email communications, further information will be generated automatically, resulting in the following list of features associated to each email:
– Process instance ("Student Hiring of Charly Brown")
– Activity instance ("Prepare Interview with Charly Brown" in our example process)
– Type (simple email vs. meeting request)
– Subject
– Body
– Sender (name, email address and communication partner category)
– Recipients (same as sender)
– Time-stamp indicating when the email was sent
– The list of attachments of the email.
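The category matching order just described (LIST first, then CONTAINS, then NOMATCH) can be summarised in a few lines of code. The sketch below is a hypothetical re-implementation for illustration only; the test addresses are invented, and the placeholder address is kept exactly as it appears above.

```python
CATEGORIES = [
    ("Process Owner", "LIST",     {"[email protected]"}),   # placeholder kept as in the text
    ("SAP",           "CONTAINS", "@sap.com"),
    ("External",      "NOMATCH",  None),
]

def classify(address):
    # LIST definitions are checked first, then CONTAINS, NOMATCH catches the rest.
    for name, method, arg in CATEGORIES:
        if method == "LIST" and address in arg:
            return name
    for name, method, arg in CATEGORIES:
        if method == "CONTAINS" and arg in address:
            return name
    return next(name for name, method, _ in CATEGORIES if method == "NOMATCH")

print(classify("[email protected]"))        # -> SAP (invented example address)
print(classify("[email protected]"))            # -> Process Owner
print(classify("[email protected]"))    # -> External (invented example address)
```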
4.2 Experimental Setup
In order to evaluate the methods outlined above, we performed some experiments with email data corresponding to a student recruitment process that researchers at SAP Research perform when they want to hire intern or bachelor/master thesis students to work in research labs. Figure 3 shows the most relevant parts of this process (as the seed process) from a researcher’s perspective. The data we used for the experiments was extracted from the Inbox and calendar of one of the authors by means of a script that outputs all emails, appointments and meeting requests that contain a user-specified string (e.g. part of the student’s name) up to a certain date - namely one week after the date when the student started working at SAP Research or the date when his/her application was rejected. It represents 9 cases of student hiring undertaken by
Fig. 3. Student hiring process at SAP Research
the owner of the Inbox (i.e. 9 process instances) and consists of approximately 350 emails after some manual cleaning. The emails were processed as discussed in the previous sections. They were manually associated with the activities of the process in Figure 3. Then, the process owner (i.e. the owner of the email Inbox) defined 8 categories of communication partners, namely process owner, teammates, team assistants (responsible for collecting contract requests), HR (the human resources department of SAP), manager, applicant, SAP (all SAP employees) and external parties, which represent the primary categories of communication partners in this process. Overall, there are about 60 emails representing the activity of screening and selecting applicants, 20 emails about coordinating with colleagues, 30 emails preparing the interview, 10 emails representing the actual interview, 40 emails that follow up on an interview, 90 emails about contract requests and about 100 emails representing supervision of the student.

4.3 Results
In our experiment, we concentrated on the task refinement part since our data set was not large enough to enable resource recommendation. For task refinement, the emails were divided into 7 subsets, each corresponding to one of the seven activities “screen and select applicants”, “coordinate”, “settle and prepare interview”, “conduct interview”, “interview follow-up”, “request contract” and “supervise student” of the student hiring process. As a next step, each email was represented as a feature vector. As outlined in Section 4.1, we chose two kinds of feature vectors, one based on communication partner categories, the other based on the text from subject and body of the emails. This means that emails attachments and embedded links were not considered in this set of experiments. For the first kind of feature vectors, if c1 , ...cn is the set of communication partner categories, then an email is formally represented by a vector e = (rc1 , ..., rcn , sc1 , ..., scn ) where rci is the number of recipients of the email belonging to category ci and sci is the number of senders (0 or 1) of the email belonging to category ci .
For the second kind of vectors, the text of the emails was tokenised into one-word units (terms), which were weighted using term frequency–inverse document frequency (tf.idf). Thus, if t1, ..., tn is the set of distinct one-word units that occur in any of the emails of the entire corpus, then an email is represented by a vector e = (w1, ..., wn) where wi denotes the tf.idf weight of term ti for the current email. Using the Weka machine learning library2, the vectors were clustered using the expectation maximisation algorithm, employing cross-validation to determine the number of clusters automatically. Table 1 displays the clustering results for the activity "screen and select applicants", using feature vectors based on communication partner categories.
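A compact sketch of the two email representations and of EM-based clustering is given below. The paper itself uses Weka's EM implementation; scikit-learn's TfidfVectorizer and GaussianMixture are used here only as rough analogues, and the emails, categories and counts are toy data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

categories = ["process owner", "teammates", "TAs"]
emails = [  # toy stand-ins for the real inbox data
    {"text": "list of students linear algebra", "senders": {"teammates": 1},
     "recipients": {"process owner": 1, "teammates": 2}},
    {"text": "new student applications deadline", "senders": {"TAs": 1},
     "recipients": {"teammates": 5}},
    {"text": "student applications feedback", "senders": {"process owner": 1},
     "recipients": {"TAs": 1}},
]

# (1) communication-partner vectors e = (r_c1, ..., r_cn, s_c1, ..., s_cn)
X_partner = np.array([[m["recipients"].get(c, 0) for c in categories] +
                      [m["senders"].get(c, 0) for c in categories] for m in emails])

# (2) tf.idf vectors over the emails' text
X_text = TfidfVectorizer().fit_transform(m["text"] for m in emails).toarray()

# EM clustering (GaussianMixture) on either representation
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X_partner)
print(labels)
```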
Table 1. Clusters based on communication partner categories for activity “Screen and select applicants” Nr Size Keywords (Weight) 1
13
2
9
3
7
4
7
5
7
6
4
7
3
2
Senders (f ) look (13.19), well (12.96), algebra (10.26), thing teammates (10.26), winner (10.26), get (10.26), linear (13) (10.26), list (10.03), seem (7.07), diploma thesis (5.0), linear algebra (4.0), extern referenzcode (4.0), favorite subject (2.0), thesis student (2.0), reference code (2.0), mix impression (2.0) experience (13.86), look (13.19), think (11.23), relevant (9.32), skill (8.25), decide (7.9), topic (7.9), send (7.49), suggestion (7.03), work (6.59), student application (5.0), programming skill (2.0) diploma thesis (24.0), diploma (20.43), internship (17.33), application (14.29), karlsruhe (13.86), extern (12.43), thesis (12.32), referenzcode (11.79), please (11.68), extern referenzcode (10.0), mature (8.8), deadline (8.52), application process (7.0), whole application (7.0), thesis student (7.0), reference code (7.0) diploma thesis (12.0), diploma (8.67), guy (7.69), sure (6.52), talk (6.52), anybody (6.52), wait (6.52), tomorrow (6.52), extern referenzcode (4.0), application process (2.0), whole application (2.0), mature team (2.0) diploma thesis (27.0), diploma (20.43), extern (14.92), referenzcode (14.14), internship (12.48), extern referenzcode (12.0), thesis (11.47), please (10.51), application (10.21), karlsruhe (9.9), deadline (7.21), application process (6.0), whole application (6.0), thesis student (6.0) experience (30.5), grade (26.07), little (22.82), program (20.19), claim (19.55), mark (15.8), background (14.26), maybe (14.26), work (12.45), program skill (3.0), background somewhat (2.0), grade check (2.0), basic knowledge (2.0) thesise (9.89), internship (8.32), external (7.9), hourly (5.71), please (5.26), thesis (5.1), application (5.1), above (4.95), diploma (4.33), deadline (3.93), application process (3.0), whole application (3.0), thesis student (3.0), reference code (3.0), new application (3.0)
http://www.cs.waikato.ac.nz/ml/weka/
process owner (9)
process owner (7)
Recipients (f ) process owner (13), teammates (13)
Example subjects (f )
AW: Students (4), RE: Students, suggestion (2), AW: Our list of students (2), WG: Student Applications (2), RE: Students (2), ... teammates Students (2), Students, (16) suggestion (2), ...
TAs (7), RE: Student Applicateammates tions (5), RE: Stu(7) dent applications CW 27 (1), RE: Student applications CW 47 (1)
TAs (5), process N AME and N AME teammates owner (7), (2), AW: Student Ap(2) SAP (2) plications (2), ... TAs (7)
SAP (21), Student process (6), ... owner (1)
Applications
process owner (4)
empty (4) [BLOCK] check out students (2), [BLOCK] check student applications (1), [BLOCK] catch up (1)
TAs (3)
SAP (9), Student applications TAs (3) CW 27 (1), Student applications (1), Student applications CW 47 (1)
Refining Process Models through the Analysis of Informal Work Practice
127
(where f gives the frequency values). The keywords and phrases were extracted using the TE (term extraction) module of the ASV toolbox [4]. They carry weights as computed by this module; the other characteristics of the clusters (categories of senders and recipients, subjects) are represented together with their frequency of occurrence in the cluster. For instance, the first line of Table 1 describes a cluster consisting of 13 emails. All of these emails were sent by one of the teammates of the process owner; and they were all received by the process owner and other teammates (for recipients, this table does not distinguish between To and CC fields of an email). Some subjects of the emails in this cluster are given in the last column of the table (e.g. the subject “AW: Students” occurred 4 times) and the most important keywords extracted from the emails of the cluster (“look”, “well”, “algebra”, . . .) are given in the third column. Table 2. Clusters based on email text for activity “Request contract” Nr Size Keywords (Weight)
Senders (f )
Recipients (f ) Example jects (f )
1
45
proces owner (25), teammates (13), TAs (5), ...
2
9
process owner (19), external (6), HR (5), teammates (5), TAs (4), applicant (4),... process owner (5), applicant (4)
3
8
4
6
5
6
6
3
permit (34.81), german (23.28), salary (22.09), regard (20.78), obtain (20.53), student (18.5), internship (18.38), restrict (17.68), hour (17.68), number (17.31), contract request (14.0), start date (10.0), phone number (5.0), student contract (5.0), work duration (3.0), full time (2.0) description (173.29), next (112.16), task (91.71), call (88.46), phone (85.39), phone call (60.0), e mail (54.0), project (48.9), process (46.96), e (46.63), period (42.44), need (41.07), start date (18.0), contract period (12.0), current performance (12.0), performance record (12.0), subject description (11.0), contract status (10.0), mature project (9.0) name (141.54), internship (46.78), family (46.65), use (46.65), offer (45.54), first (44.48), last (36.68), father (31.1), successful (28.09), submission (25.28), last name (24.0), first name (23.0), family name (21.0), father name (14.0), paper work (8.0), request contract (8.0), start date (8.0), mature project (8.0) integration (106.64), thesis (106.49), supervisor (70.18), schema (66.65), need (60.32), case (56.08), social (53.32), month (51.76), position (50.79), e mail (36.0), thesis contract (19.0), service integration (18.0), business schema (18.0), thesis position (18.0), month thesis (12.0), shipment address (12.0) e mail (24.0), studiengang (22.35), betreuung (15.76), pdf (14.9), inhaltliche (14.9), hier (13.28), professor (13.28) integration (53.32), thesis (33.69), schema (33.32), social (26.66), position (25.4), internship (18.38), service (18.19), case (17.41), business (16.02), service integration (9.0), business schema (9.0), thesis position (9.0), thesis contract (8.0), month thesis (5.0), shipment address (5.0)
sub-
[BLOCK] catch up (3), Re: Your master thesis (3), RE: Praktikumsplan N AME (3),...
applicant (5), RE: contract staprocess owner tus (3), ... (4)
process owner teammates (6), applicant (9), applicant (2) (5), process owner (2)
RE: Your internship (5), Re: Your internship (2), ...
process owner teammates (4), external (4), TAs (2), (2) process owner (2), external (2)
Re: PS: Internship (2), FW: PS: Internship (2), RE: PS: Internship (2)
process owner TAs (2), pro- RE: Details ber (4), HR (1), cess owner Diplomarbeit applicant (1) (2), HR (2),... (1), ... process owner teammates RE: PS: Intern(2), external (3), external ship (2), Re: PS: (1) (2), process Internship (1) owner (1)
Table 2 presents clusters for the activity "Request contract", derived with feature vectors based on the text from the emails' subject and body.

4.4 Discussion
Interpretation of example clusters. For the clusters in Table 1, a human can easily derive an interpretation of clusters in terms of sub-activities by looking at communication partners and keywords: – Clusters 5 and 7 are very similar and show communication from the team assistants towards all colleagues (including the process owner). They announce new student applications and ask for feedback. – Cluster 6 contains private appointments only, where - as the keywords suggest - the process owner has taken notes about the applicants’ qualifications. – Clusters 1 and 2 consist of communication between the process owner and his teammates; keywords such as “linear algebra” and “programming skill” signal that the goal of this communication is the discussion of the applicants’ qualification and hence to reach an agreement of whom to invite for an interview. – In Cluster 3, the process owner communicates with team assistants; as can be seen from the subjects of emails in the last column, these emails contain feedback about applicants. – Cluster 4 is less clear in its interpretation; it consists mainly of emails sent by the team assistants to the process owner. A drilldown helped to understand that most of these serve to clarify questions around further procedure, management approval and deadlines. From this analysis, we can quickly suggest to divide the activity “screen and select applicants” into sub-activities “receive notification of new applications”, “screen applications”, “discuss applications with teammates”, “give feedback to team assistants” and “if in doubt, clarify further procedure”. Note that the labeling of clusters and the final suggestion of how to adapt the process model is still a manual effort. As Table 2 shows for the activity “request contract”, the analysis is not always that easy. An interpretation could be as follows. It is evident that Cluster 3 shows the communication between the process owner and the applicant, trying to obtain the necessary data (“start date”, “last name”) to fill in the contract request form. Cluster 2 is less well defined, but seems to be mainly about communicating the status of the contract request. Clusters 4, 5 and 6 seem to be a mixture of planning work together with the student and administrative issues (e.g. answering the applicant’s questions about university supervisors). Cluster 1 is by far the largest and consists of a mixture of topics, an important one being the exact modalities of the contract (“salary”, “hour”, “work duration”, etc). This analysis could suggest a subdivision of the activity into “get student data for filling request form”, “clarify contract modalities”, “answer student questions” and “check and communicate contract status”.
Analysis of the approach. Although we cannot show all the cluster results here for space reasons, we would like to summarise the insights that we gained through qualitative analysis of the clustering data. In future work, it will also be interesting to apply quantitative measures (such as precision and recall) by comparing clustering results to previously defined gold standards. However, in this work the data set was too small to allow for meaningful quantitative results. We found that in general, clustering with feature vectors based on communication partner categories resulted in more easily interpretable and meaningful clusters than when using text-based features. In addition, for some activities, such as “request contract” (see above), clusters are more imbalanced in size, more difficult to interpret and less revealing. This is the case regardless of whether we use communication partner categories or text-based feature vectors. However, for 5 of the 7 activities, the clusters are very clear and easy to interpret. From the authors’ experience, we could identify cases where the clustering fails to reveal interesting branches that could help to subdivide the activity, e.g. the differentiation between intern and thesis contracts and European and non-European students in the “request contract” activity. The reasons for such failures were twofold: – Imbalance and sparseness problem: some interesting branches (e.g. nonEuropean students) are not represented by enough process instances in the data in order to be detected. – Noise problem: interesting differentiations such as between internship or thesis are buried under more obvious (and hence less interesting) ones. Of course, some of these problems may be resolved by using a bigger data set; the qualitative results presented here can thus only show a general direction and highlight some interesting aspects that still need to be quantified – which is our goal in future research.
5 Conclusions
Conventionally, business processes are modeled manually by process experts. With the growing demand on flexibility and agility, process mining and collaborative approaches start to gain attention as low-cost and low-overhead alternatives. In this paper, we proposed a framework that takes advantage of informal work practice in refining manually crafted process models. This is based on our observation that i) gaps between predefined models and day-to-day work practice is inevitable and thus local adaptation is necessary; ii) in many cases local adaptation carries informal characteristics and is not transparent to an enterprise information system and thus most of such valuable knowledge gets lost within the vast amount of information available to an employee; and iii) local adaptation can be leveraged to reflect work practice at the model level. The proposed approach focuses on two types of informal data of work practice, i.e. emails and personal task diaries. Different from apparently similar approaches, we reduce the overhead of manual annotation with the help of standard
data mining / clustering algorithms. Hidden process patterns discovered from the informal work data are then used to fine-tune a “seed” process model, improving the model to reflect real-life work practice. We experimented the idea in a proof-of-concept implementation. The preliminary evaluation with emails from one of the co-authors has confirmed our intuition and the practical value of our approach. Further evaluation in a larger scale and with high diversity is forthcoming. Apart from optimising the learning algorithms, we aim to conceive other application scenarios that leverage the hidden “treasure” of informal work. Acknowledgements. This work is supported by the European Union IST fund through the EU FP7 MATURE Integrating Project (Grant No. 216356).
References
1. Abecker, A., Bernardi, A., Hinkelmann, K., Kühn, O., Sintek, M.: Toward a technology for organizational memories. IEEE Intelligent Systems 13(3), 40–48 (1998)
2. Agrawal, R., Gunopulos, D., Leymann, F.: Mining process models from workflow logs. In: Schek, H.-J., Alonso, G., Saltor, F., Ramos, I. (eds.) EDBT 1998. LNCS, vol. 1377, pp. 469–483. Springer, Heidelberg (1998)
3. Ayache, S., Quénot, G., Gensel, J.: Classifier fusion for SVM-based multimedia semantic indexing. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 494–504. Springer, Heidelberg (2007)
4. Biemann, C., Quasthoff, U., Heyer, G., Holz, F.: ASV Toolbox – A Modular Collection of Language Exploration Tools. In: 6th Language Resources and Evaluation Conference, LREC (2008)
5. Brander, S., Hinkelmann, K., Martin, A., Thönssen, B.: Mining of agile business processes. In: Proceedings of the AAAI Spring Symposium on AI for Business Agility (2011)
6. Carvalho, V.R., Cohen, W.W.: Recommending Recipients in the Enron Email Corpus. Technical Report CMU-LTI-07-005, Carnegie Mellon University, Language Technologies Institute (2007)
7. Di Ciccio, C., Mecella, M., Scannapieco, M., Zardetto, D., Catarci, T.: Groupware Mail Messages Analysis for Mining Collaborative Processes. Technical report, Sapienza, Università di Roma (2011)
8. Cook, J.E., Wolf, A.L.: Discovering models of software processes from event-based data. ACM Trans. Softw. Eng. Methodol. 7(3), 215–249 (1998)
9. Dustdar, S., Hoffmann, T., van der Aalst, W.M.P.: Mining of ad-hoc business processes with TeamLog. Data and Knowledge Engineering 55(2), 129–158 (2005)
10. Feldkamp, D., Hinkelmann, K., Thönssen, B.: KISS: Knowledge-Intensive Service Support: An Approach for Agile Process Management, pp. 25–38 (2007)
11. International Organization for Standardization: ISO Survey 2009, http://www.iso.org/iso/survey2009.pdf (accessed in March 2011)
12. Hinkelmann, K., Merelli, E., Thönssen, B.: The role of content and context in enterprise repositories. In: Proceedings of the 2nd International Workshop on Advanced Enterprise Architecture and Repositories (AER) (2010)
13. Martin, A., Brun, R.: Agile Process Execution with KISSmir. In: 5th International Workshop on Semantic Business Process Management (2010)
14. McDauley, D.: In the face of increasing global competition and rapid changes in technology, legislation, and knowledge, organizations need to overcome inertia and become agile enough to respond quickly. In: Swenson, K.D. (ed.) Mastering the Unpredictable: How Adaptive Case Management Will Revolutionize the Way That Knowledge Workers Get Things Done, pp. 257–275. Meghan-Kiffer Press (2010)
15. Nikolov, A., van der Aalst, W.M.P.: EMailAnalyzer: An E-Mail Mining Plug-in for the ProM Framework (2007)
16. van der Aalst, W.M.P.: Exploring the CSCW spectrum using process mining. Advanced Engineering Informatics 21(2), 191–199 (2007)
17. van der Aalst, W.M.P., Weijters, A.J.M.M.: Process mining: A research agenda. Comput. Ind. 53(3), 231–244 (2004)
18. Witschel, H.F., Hu, B., Riss, U.V., Thönssen, B., Brun, R., Martin, A., Hinkelmann, K.: A Collaborative Approach to Maturing Process-Related Knowledge. In: Hull, R., Mendling, J., Tai, S. (eds.) BPM 2010. LNCS, vol. 6336, pp. 343–358. Springer, Heidelberg (2010)
Monitoring Business Constraints with Linear Temporal Logic: An Approach Based on Colored Automata

Fabrizio Maria Maggi1, Marco Montali2, Michael Westergaard1, and Wil M.P. van der Aalst1

1 Eindhoven University of Technology, The Netherlands
{f.m.maggi,m.westergaard,w.m.p.v.d.aalst}@tue.nl
2 KRDB Research Centre, Free University of Bozen-Bolzano, Italy
[email protected]

Abstract. Today's information systems record real-time information about business processes. This enables the monitoring of business constraints at runtime. In this paper, we present a novel runtime verification framework based on linear temporal logic and colored automata. The framework continuously verifies compliance with respect to a predefined constraint model. Our approach is able to provide meaningful diagnostics even after a constraint is violated. This is important as in reality people and organizations will deviate and in many situations it is not desirable or even possible to circumvent constraint violations. As demonstrated in this paper, there are several approaches to recover after the first constraint violation. Traditional approaches that simply check constraints are unable to recover after the first violation and still foresee (inevitable) future violations. The framework has been implemented in the process mining tool ProM.

Keywords: Runtime Verification, Monitoring, Linear Temporal Logic, Declare, Automata.
1 Introduction
Entities within an organization are supposed to operate within boundaries set by internal policies, norms, best practices, regulations, and laws. For example,
This research has been carried out as a part of the Poseidon project at Thales under the responsibilities of the Embedded Systems Institute (ESI). The project is partially supported by the Dutch Ministry of Economic Affairs under the BSIK program. This research has been partially supported by the NWO “Visitor Travel Grant” initiative, and by the EU Project FP7-ICT ACSI (257593). This research is supported by the Technology Foundation STW, applied science division of NWO and the technology program of the Dutch Ministry of Economic Affairs.
requests of a particular type need to be followed by a decision. Also in a crossorganizational setting, people and organizations need respect certain rules, e.g., a bill should be paid within 28 days. We use the generic term business constraint to refer a requirement imposed on the execution of an intra- or interorganizational process [12,10]. Business constraints separate compliant behavior from non-compliant behavior. Compliance has become an important topic in many organizations. Nevertheless, it is still difficult to operationalize compliance notions. Several authors developed techniques to ensure compliance by designing process models that enforce a set of business constraints [1,7]. Given a process model and a set of constraints, e.g., expressed in some temporal logic, one can use model checking [4] to see whether the model satisfies the constraints. However, static verification techniques are not sufficient to tackle compliance problems in a comprehensive way. First of all, some aspects cannot be verified a priori as compliance may depend on the particular context and its participants. Second, it cannot be assumed that the behavior of all actors is known or can be controlled. Most processes involve autonomous actors, e.g., a specialist in a hospital may deviate to save lives and another organization may not be very responsive because of different objectives. Third, process designs are typically the outcome of a collaborative process where only some constraints are taken into account (to reduce complexity and increase flexibility). Due to the procedural nature of most process modeling languages [12], incorporating all constraints is unreasonable (especially in environments with a lot of variability): the model would become unreadable and difficult to maintain. Last but not least, violations do not always correspond to undesirable behavior. Often people deviate for good reasons. In unpredictable and dynamic settings, breaking the rules is sometimes justified by the inadequacy or incompleteness of rules. All these issues call for runtime verification facilities, able to monitor the running cases of a process and to assess whether they comply with the business constraints of interest. Such facilities should provide meaningful information to the stakeholders involved. Roughly speaking, the purpose of monitoring is to check whether a running case satisfies or violates a correctness property, which in our setting is constituted by a set of business constraints. Since runtime verification and monitoring focus on the dynamics of an evolving system as time flows, LTL (Linear Temporal Logic) have been extensively proposed as a suitable framework for formalizing the properties to be monitored, providing at the same time effective verification procedures. However, the majority of current LTL runtime verification techniques limit themselves to produce as output only a truth value representing whether the current trace complies with the monitored property or not. Typically, a three or four-valued semantics is chosen for the LTL entailment, in order to reflect the fact that it is not always possible to produce a definitive answer about compliance (e.g., [2]). Furthermore, runtime verification usually halts as soon as a (permanent) violation is encountered: from now on, every possible extension of the current trace would still lead to a permanent violation.
In this paper, we present a novel framework for compliance evaluation at runtime. The framework offers the following: 1. intuitive diagnostics, to give fine-grained feedback to the end users (which constraints are violated and why); 2. continuous support, to provide verification capabilities even after a violation has taken place; 3. recovery capabilities, to realize different strategies for continuous support and accommodate sophisticated recovery mechanisms. Our proposed approach is based on colored automata, i.e., automata whose states are associated to multiple relevant information (“colors”). Moreover, we adopt Declare [11] as a constraint language. Declare constraints have an intuitive graphical notation and LTL-based semantics. Concerning the feedback returned to the monitor system, our approach does not only communicate if a running case is currently complying with the constraint model, but also computes the state of each constraint. We consider three possible states for constraints: satisfied, possibly violated and permanently violated. The first state attests that the monitored case is currently compliant with the constraint. The second state indicates that the constraint is currently violated, but it is possible to bring it back to a satisfied state by executing a sequence of events. The last state models the situation where it has become impossible to satisfy the constraint. At runtime, two possible permanent violations may occur: (a) a forbidden event is executed, or (b) a state is reached such that two or more constraints become conflicting. The presence of a conflict means that there is no possible future continuation of the case such that all the involved constraints become satisfied. Furthermore, when the case is terminated all the possibly violated constraints become permanently violated, because no further event will be executed to satisfy them. The approach has been implemented using ProM and Declare. Declare [11] is a flexible workflow system based on the Declare language. ProM1 is a pluggable framework for process mining providing a wide variety of analysis techniques (discovery, conformance, verification, performance analysis, etc.). In the context of ProM, we have developed a generic Operational Support (OS) environment [14,17] that allows ProM to interact with systems like Declare at runtime. Our monitoring framework has been implemented as an OS provider. The remainder of this paper is organized as follows. Section 2 presents some preliminaries: Declare as a specification language, RV-LTL as finite trace LTL semantics, and a translation of LTL into automata to build the monitors. Section 3 explains how colored automata can be used to check compliance at runtime and provide meaningful diagnostics. Section 4 presents three strategies for dealing with violations. Section 5 shows that the Declare model can also be modified at runtime, e.g., frequent violations may trigger an update of the model. Related work is discussed in Sect. 6. Section 7 concludes the paper. 1
ProM and the runtime verification facilities described in this paper can be downloaded from www.processmining.org.
2 Background
In this section, we introduce some background material illustrating the basic components of our framework. Using a running example, we introduce Declare (Sect. 2.1). In Sect. 2.2, we present RV-LTL, an LTL semantics for finite traces. In Sect. 2.3, we introduce an approach to translate a Declare constraint model to a set of automata for runtime verification.

2.1 Running Example Using Declare
Declare is both a language and a system based on constraints [11]. The language is grounded in LTL but is equipped with a graphical and intuitive language. The Declare system is a full-fledged workflow system offering much more flexibility than traditional workflow systems. Figure 1 shows a fictive Declare model that we will use as a running example throughout this paper. This example models two different strategies of investment: bonds and stocks. When an investor receives an amount of money, she becomes in charge of eventually investing it in bonds or in stocks and she cannot receive money anymore before the investment (alternate response). If the investor chooses for a low risk investment, she must buy bonds afterwards (response). Moreover, the investor can receive a high yield only if she has bought stocks before (precedence). Finally, the investor cannot receive a high yield and buy bonds in the same trace (not coexistence). The figure shows four constraints. Each constraint is automatically translated into LTL. In particular, the response constraint can be modeled as ϕr = (Low Risk ⇒ ♦Bonds); the not coexistence corresponds to ϕn = ♦Bonds ⇒ ¬♦High Y ield; the alternate response is translated into ϕa = (M oney ⇒ (¬M oney (Bonds ∨ Stocks))); and the precedence constraint corresponds to ϕp = ♦High Y ield ⇒ (¬High Y ield Stocks). Unlike procedural languages, a Declare model allows for everything that is not explicitly forbidden. Removing constraints yields more behavior. The not coexistence constraint in Fig. 1 is difficult or even impossible to model in procedural languages. Mapping this constraint onto a procedural language forces the modeler to introduce a choice between Bonds and High Y ield (or both). Who makes this choice? When is this choice made? How many times will this choice be made? In a procedural language all these questions need to be answered, resulting in a complex or over-restrictive model. Low_Risk
Fig. 1. Reference Model (the four Declare constraints over the activities Money, Low_Risk, Bonds, Stocks, and High_Yield)
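To make the constraint semantics concrete before moving to automata, the following sketch checks the four constraints of Fig. 1 directly on a completed trace. It is only an illustration of the LTL meaning given above, not part of the monitoring framework (which compiles the formulas into automata, see Sect. 2.3); the event strings are the activity names of the running example.

```python
# Illustrative, definition-based checks of the four Declare constraints of Fig. 1
# on a *completed* trace (a list of event names). Not the automaton-based monitor.

def response(trace, a="Low_Risk", b="Bonds"):
    # every occurrence of a is eventually followed by b
    return all(b in trace[i + 1:] for i, e in enumerate(trace) if e == a)

def not_coexistence(trace, a="Bonds", b="High_Yield"):
    # a and b never occur in the same trace
    return not (a in trace and b in trace)

def alternate_response(trace, a="Money", targets=("Bonds", "Stocks")):
    # after a, one of the targets occurs before the next a (and eventually occurs)
    for i, e in enumerate(trace):
        if e == a:
            rest = trace[i + 1:]
            next_a = rest.index(a) if a in rest else len(rest)
            if not any(t in rest[:next_a] for t in targets):
                return False
    return True

def precedence(trace, pre="Stocks", b="High_Yield"):
    # b may only occur after pre has occurred
    return all(pre in trace[:i] for i, e in enumerate(trace) if e == b)

if __name__ == "__main__":
    t = ["Money", "Bonds", "High_Yield", "Money"]
    for check in (response, not_coexistence, alternate_response, precedence):
        print(check.__name__, check(t))
```

On the trace ⟨Money, Bonds, High_Yield, Money⟩ used in Example 1 below, only the response check returns True, matching the verdicts at case completion discussed in Sect. 2.2.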
2.2 LTL Semantics for Runtime Verification
Classically, LTL is defined for infinite traces. However, when focusing on runtime verification, reasoning is carried out on partial, ongoing traces, which describe a finite portion of the system’s execution. Therefore, several authors defined alternative finite-trace semantics for LTL. Here, we use the four-valued semantics proposed in [2], called Runtime Verification Linear Temporal Logic (RV-LTL). Differently from the original RV-LTL semantics, which focuses on trace suffixes of infinite length, we limit ourselves to possible finite continuations. This choice is motivated by the fact that we consider process instances that need to complete eventually. This has considerable impact on the corresponding verification technique: reasoning on Declare models is tackled with finite state automata. Let us denote with u |= ϕ the truth value of an LTL formula ϕ in a partial finite trace u (according to FLTL [9], a variant of the standard semantics for dealing with finite traces). The semantics of [u |= ϕ]RV is defined as follows:
– [u |= ϕ]RV = ⊤ if for each possible continuation σ of u: uσ |= ϕ (in this case ϕ is permanently satisfied by u);
– [u |= ϕ]RV = ⊥ if for each possible continuation σ of u: uσ ⊭ ϕ (in this case ϕ is permanently violated by u);
– [u |= ϕ]RV = ⊤p if u |= ϕ but there is a possible continuation σ of u such that uσ ⊭ ϕ (in this case ϕ is possibly satisfied by u);
– [u |= ϕ]RV = ⊥p if u ⊭ ϕ but there is a possible continuation σ of u such that uσ |= ϕ (in this case ϕ is possibly violated by u).
Note that when monitoring a business process using LTL, it rarely happens that a constraint is permanently satisfied. For the most part, business constraints are possibly satisfied but can be violated in the future. For this reason, in this paper, we make no distinction between permanently satisfied and possibly satisfied constraints but refer to both of them as satisfied. The following example explains how the above semantics can be used in practice to monitor a running process case.
Example 1. Let us consider the Declare model represented in Fig. 1. Figure 2 shows a graphical representation of the constraints’ evolution: events are displayed on the horizontal axis. The vertical axis shows the four constraints. Initially, all four constraints are satisfied. Let u0 = ε denote the initial (empty) trace:
[u0 |= ϕr]RV = ⊤p    [u0 |= ϕn]RV = ⊤p    [u0 |= ϕa]RV = ⊤p    [u0 |= ϕp]RV = ⊤p
Event Money is executed next (u1 = ⟨Money⟩), and we obtain:
[u1 |= ϕr]RV = ⊤p    [u1 |= ϕn]RV = ⊤p    [u1 |= ϕa]RV = ⊥p    [u1 |= ϕp]RV = ⊤p
Note that [u1 |= ϕa ]RV = ⊥p because the alternate response constraint becomes possibly violated after the occurrence of Money. The constraint is waiting for the occurrence of another event (execution of Bonds or Stocks) to become satisfied
Fig. 2. One of the views provided by our monitoring system. The colors show the state of each of the four constraints while the process instance evolves; red refers to ⊥, yellow refers to ⊥p, and green refers to ⊤ or ⊤p.
again. Then, Bonds is executed (u2 = ⟨Money, Bonds⟩), leading to a situation in which constraint ϕa is satisfied again:
[u2 |= ϕr]RV = ⊤p    [u2 |= ϕn]RV = ⊤p    [u2 |= ϕa]RV = ⊤p    [u2 |= ϕp]RV = ⊤p
The next event is High Yield (u3 = ⟨Money, Bonds, High Yield⟩), resulting in:
[u3 |= ϕr]RV = ⊤p    [u3 |= ϕn]RV = ⊥    [u3 |= ϕa]RV = ⊤p    [u3 |= ϕp]RV = ⊥
ϕn and ϕp become permanently violated because Bonds and High Yield cannot coexist in the same trace. Moreover, the precedence constraint requires that High Yield is always preceded by Stocks and this is not the case for trace u3. After reporting the violation, the monitoring system should continue to monitor the process. Suppose that the framework is able to provide continuous support and uses the strategy to simply ignore the violated constraints. If Money is executed again, i.e., u4 = ⟨Money, Bonds, High Yield, Money⟩, ϕa becomes possibly violated again:
[u4 |= ϕr]RV = ⊤p    [u4 |= ϕn]RV = ⊥    [u4 |= ϕa]RV = ⊥p    [u4 |= ϕp]RV = ⊥
However, this time the case completes its execution. We suppose that this is communicated to the runtime verifier by means of a special complete event. Using uf to denote the resulting total trace, we obtain:
[uf |= ϕr]RV = ⊤    [uf |= ϕn]RV = ⊥    [uf |= ϕa]RV = ⊥    [uf |= ϕp]RV = ⊥
Note that constraint ϕa, which is possibly violated when the case completes, becomes permanently violated (because it cannot become satisfied anymore).
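The RV-LTL verdicts of Example 1 can also be approximated operationally: for a prefix u and a finite-trace checker, one can search over bounded continuations to see whether some continuation can still satisfy or still violate the constraint. The sketch below does this for the response constraint; it is a bounded-depth approximation of the definition above (depth, alphabet and checker are illustrative assumptions), not the automaton-based technique the paper actually uses.

```python
from itertools import product

# Bounded-depth approximation of the RV-LTL classification of Sect. 2.2:
# test continuations of length <= depth over an assumed finite alphabet.
ALPHABET = ["Money", "Low_Risk", "Bonds", "Stocks", "High_Yield"]

def response_ok(trace):  # finite-trace check of response(Low_Risk, Bonds)
    return all("Bonds" in trace[i + 1:] for i, e in enumerate(trace) if e == "Low_Risk")

def rv_state(prefix, check, depth=3):
    continuations = (list(c) for n in range(depth + 1)
                     for c in product(ALPHABET, repeat=n))
    can_satisfy = can_violate = False
    for cont in continuations:
        if check(prefix + cont):
            can_satisfy = True
        else:
            can_violate = True
        if can_satisfy and can_violate:
            break
    if not can_satisfy:
        return "permanently violated"
    # as in the paper, permanently and possibly satisfied are both reported as satisfied
    return "satisfied" if check(prefix) else "possibly violated"

if __name__ == "__main__":
    print(rv_state(["Low_Risk"], response_ok))           # possibly violated
    print(rv_state(["Low_Risk", "Bonds"], response_ok))  # satisfied
```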
2.3 Translation of a Declare Constraint Model to Automata
To automatically determine the state of each constraint of a Declare model during runtime, we construct a deterministic finite state automaton (we will
simply refer to such an automaton as “automaton”). The automaton accepts a trace if and only if it does not (permanently or possibly) violate the modeled constraint. We assume that constraints are specified in LTL (with a finite trace semantics). We use the translation in [6] for constructing the automaton. For the constraints in the model in Fig. 1, we obtain the automata depicted in Fig. 3. In all cases, state 0 is the initial state and accepting states are indicated using a double outline. A transition is labeled with the initial letter of the event triggering it (e.g., we use the label L to indicate that the Low Risk event occurs). For example, the response constraint automaton starts in state 0, which is accepting. Seeing an L (Low Risk) we transition to state 1, which is not accepting. Only upon seeing a B (Bonds) do we transition back to state 0. This models our understanding of the constraint: when we execute Low Risk we have to subsequently execute Bonds. As well as transitions labeled with a single letter, we also have transitions labeled with one or more negated letters (e.g., !L for state 0 of the response constraint automaton and !H&!B for state 0 of the not coexistence automaton). This indicates that we can follow the transition for any event not mentioned (e.g., we can execute the event High Yield from state 0 of the response automaton and remain in the same state). This allows us to use the same automaton regardless of the input language. When we replay a trace on the automaton, we know that if we are in an accepting state, the constraint is satisfied, and when we are in a non-accepting state, it is possibly violated. To also support the case where the constraint is permanently violated, we extend the original automaton (in fact, Fig. 3 already shows the extended version of the automaton) by connecting all the “illegal” transitions of the original automaton to a new state represented using a dashed outline (e.g., state 3 in the not coexistence constraint automaton). When we reach a state with a dashed outline during the execution of the automaton, we know that the constraint is permanently violated. We can use these local automata directly to monitor each constraint, but we can also construct a single automaton for monitoring the entire system. We call such an automaton a global automaton. The global automaton is needed to deal with conflicting constraints. Conflicting constraints are constraints for which there is no possible continuation that satisfies them all. Note that even when all individual local automata indicate that the corresponding constraint is not permanently violated, there can still be conflicting constraints.
Fig. 3. Finite automata accepting traces satisfying (a) ϕr (response), (b) ϕn (not coexistence), (c) ϕa (alternate response), and (d) ϕp (precedence)
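A local monitor of the kind shown in Fig. 3 amounts to replaying events on a finite automaton and mapping its current state to one of the three verdicts. The sketch below hand-codes an automaton for the not coexistence constraint, reconstructed from its semantics; state names and the dictionary encoding are assumptions and need not match Fig. 3(b) exactly.

```python
# Illustrative local constraint monitor in the style of Fig. 3: a deterministic
# automaton for "not coexistence(Bonds, High_Yield)" with an explicit dead state.
ACCEPTING = {"none", "bonds", "yield"}   # accepting states: constraint satisfied
DEAD = "violated"                        # the "dashed" state: permanently violated

def step(state, event):
    if state == "none":
        return {"Bonds": "bonds", "High_Yield": "yield"}.get(event, "none")
    if state == "bonds":
        return DEAD if event == "High_Yield" else "bonds"
    if state == "yield":
        return DEAD if event == "Bonds" else "yield"
    return DEAD  # once permanently violated, stay there

def verdict(state):
    if state == DEAD:
        return "permanently violated"
    return "satisfied" if state in ACCEPTING else "possibly violated"

if __name__ == "__main__":
    state = "none"
    for event in ["Money", "Bonds", "High_Yield", "Money"]:
        state = step(state, event)
        print(event, "->", verdict(state))
```

For a constraint such as response, the same verdict function would report possibly violated whenever the automaton sits in a non-accepting, non-dashed state.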
The global automaton can be constructed in different ways. The simplest way just constructs it as the automaton product of the local automata (or, equivalently, as the automaton modeling the conjunction of the individual constraints). [16] describes how to construct it efficiently. The global automaton for the system under study is shown in Fig. 4. We use as state names the state numbers from each of the automata from Fig. 3, so state 1020 corresponds to constraint response being in state 1, constraint not coexistence being in state 0, and so on. These names are for readability only and do not indicate we can infer the states of local automata from the global states. To not clutter the diagram, we do not show self loops. These can be derived: every state also has a self-loop transition for any transition not otherwise explicitly listed. State fail corresponds to all situations where it is no longer possible to satisfy all constraints. Note that state 1202 is not present in Fig. 4 even though none of the local automata is in a permanently violated state and it is in principle reachable from state 0202 via an L. The reason is that from this state it is not possible to ever reach a state where both response and not coexistence are satisfied, i.e., the two constraints are conflicting (in order to satisfy the first, we have to execute B, which would move the latter to state 3). Executing the trace from Example 1 (⟨Money, Bonds, High Yield, Money, complete⟩), we obtain the trace 0000 →M 0020 →B 0100 →H fail →M fail. Hence, we correctly identify that after the first two events all constraints are satisfied, but after executing High Yield we permanently violate a constraint.
Fig. 4. Global automaton for the system in Fig. 1
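The simplest construction mentioned above, i.e. the global automaton as the product of the local automata, can be sketched as follows. Local automata are passed as step functions with their dead states; the two toy automata are illustrative assumptions, and the efficient construction of [16] is not reproduced.

```python
# Naive product ("global") automaton over a set of local constraint automata.
# Each local automaton is (initial state, step function, set of dead states);
# a global state is the tuple of local states and collapses to "fail" as soon
# as one local automaton is permanently violated. Illustrative sketch only.

def make_global(automata):
    def global_step(gstate, event):
        if gstate == "fail":
            return "fail"
        nxt = tuple(step(s, event) for (_, step, _), s in zip(automata, gstate))
        if any(s in dead for (_, _, dead), s in zip(automata, nxt)):
            return "fail"
        return nxt
    initial = tuple(init for init, _, _ in automata)
    return initial, global_step

def resp_step(s, e):   # hypothetical response(Low_Risk, Bonds): 0 accepting, 1 pending
    return 1 if e == "Low_Risk" else (0 if e == "Bonds" else s)

def nc_step(s, e):     # hypothetical not coexistence(Bonds, High_Yield): 3 is dead
    table = {0: {"Bonds": 1, "High_Yield": 2}, 1: {"High_Yield": 3}, 2: {"Bonds": 3}}
    return table.get(s, {}).get(e, s)

if __name__ == "__main__":
    init, gstep = make_global([(0, resp_step, set()), (0, nc_step, {3})])
    state = init
    for event in ["Money", "Bonds", "High_Yield"]:
        state = gstep(state, event)
        print(event, "->", state)
```

With only two of the four constraints included, the run ends in fail after High_Yield, analogous to the global run shown above.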
3 Colored Automata for Runtime Verification
The global automaton in Fig. 4 allows us to detect the state of the entire system, but not the state of individual constraints. In order to combine local and global information in one single automaton, we use a colored automaton. This automaton is also the product of the individual local automata, but now we include information about the acceptance state for each individual constraint. In effect, we color the states with a unique color for each constraint, assigning the color to the state if the constraint is satisfied. Figure 5 shows the colored automaton for our running example. We have indicated that a constraint is satisfied by writing the first letter of its name in upper case (e.g., in state 0000 we have colors RNAP and all constraints are satisfied) and that a constraint can be eventually satisfied by writing the first letter of its name in lower case (e.g., in state 1202 we have colors rNAP where constraint response is not satisfied, but it can be satisfied by executing Bonds and transitioning to state 0302). If the first letter of the name of a constraint does not appear at all, the constraint is permanently violated. Figure 2 already used such a coloring, i.e., red refers to ⊥, yellow refers to ⊥p, and green refers to ⊤ or ⊤p. Comparing Figures 4 and 5 shows that we now have many states with a dashed outline, i.e., states from which we cannot reach a state where all constraints are satisfied. This reflects our desire to continue processing after permanently
Fig. 5. Colored global automaton for the system in Fig. 1
violating a constraint. In fact, by folding all states with a dashed outline in Fig. 5, we obtain the original global automaton of Fig. 4. Note that states 1202 and 1222 have a dashed outline even though all constraints are satisfied or at least can be satisfied in successor states. This is because it is not possible to reach a state where all constraints are satisfied at the same time (we have basically asked for low risk, requiring investment in bonds, as well as asked for high yield, which requires investment in stocks only). Executing the trace from Example 1 (⟨Money, Bonds, High Yield, Money, complete⟩), we obtain 0000 →M 0020 →B 0100 →H 0301 →M 0321. The last state is colored with Ra, indicating that the response constraint is satisfied, the alternate response constraint is possibly violated, and the two remaining constraints are permanently violated. Once the colored global automaton is constructed, runtime monitoring can be supported in an efficient manner. The state of an instance can be monitored in constant time, independent of the number of constraints and their complexity. According to [16], the time to construct the global or colored automaton is 5-10 seconds for random models with 30-50 constraints. For models larger than this, automata can no longer routinely be constructed due to lack of memory, even on machines with 4-8 GiB RAM.
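Conceptually, the coloring adds to every product state the verdict of each local automaton. This bookkeeping is sketched below; precomputing the full colored automaton and its dashed states (which additionally requires a reachability check for conflicting constraints, as in [16]) is omitted, and the accepting/dead state sets are reconstructions, not the tool’s data structures.

```python
# Illustrative coloring of a product state: each constraint contributes an
# upper-case letter if its local automaton is accepting, a lower-case letter if
# it is non-accepting but recoverable, and no letter if it is permanently
# violated, mirroring the RNAP / rNAP / Ra labels used for Fig. 5.

def color(gstate, automata):
    """gstate: tuple of local states; automata: list of (name, accepting, dead)."""
    letters = []
    for s, (name, accepting, dead) in zip(gstate, automata):
        if s in dead:
            continue                      # permanently violated: letter dropped
        letters.append(name[0].upper() if s in accepting else name[0].lower())
    return "".join(letters)

if __name__ == "__main__":
    automata = [("response", {0}, set()),
                ("not coexistence", {0, 1, 2}, {3}),
                ("alternate response", {0}, {2}),
                ("precedence", {0, 2}, {1})]     # assumed state sets
    print(color((0, 0, 0, 0), automata))   # RNAP: all constraints satisfied
    print(color((1, 3, 0, 2), automata))   # rAP: response pending, not coexistence violated
```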
4 Strategies for Continuous Support
As discussed in Sect. 1, the monitoring system should continue to provide meaningful diagnostics after a violation takes place. This section presents three ways of recovering from a violation. These have been implemented in ProM.
4.1 Recovery by Ignoring Violated Constraints
The first recovery strategy simply ignores a constraint after it is permanently violated. This was the approach we used for the trace discussed in Example 1. Figure 2 illustrates the desired behavior when this strategy is used for ⟨Money, Bonds, High Yield, Money, complete⟩. The colored automaton directly supports this strategy. In Fig. 5, there are multiple states with a dashed outline to be able to monitor the state of all constraints after a violation.
4.2 Recovery by Resetting Violated Constraints
The second recovery strategy resets a constraint when it is permanently violated. Constraints that are not violated progress as before. Consider, for instance, trace ⟨Money, Bonds, Money, Stocks, High Yield, Money, Stocks, High Yield, Money, Stocks, High Yield, complete⟩. The first four events can be executed without encountering any problem: 0000 →M 0020 →B 0100 →M 0120 →S 0102. Executing High Yield results in a failed state in the colored automaton: 0102 →H 0302. The automaton modeling the not coexistence constraint is in state 3. Resetting the constraint results in global state 0002. The remaining events can be executed without any problem: 0002 →M 0022 →S 0002 →H 0202 →M 0222 →S 0202 →H 0202.
Figure 6 shows the monitor in ProM using the reset strategy for this trace. After the first violation (corresponding to the first occurrence of High Yield), the not coexistence constraint is reset and no violations are detected anymore even when High Yield occurs again (by resetting the constraint we do not preserve information about the previous occurrence of Bonds). We can provide support for the reset strategy in two different ways: (a) by retaining the local automata for the system and translating back and forth between the global and local automata when an error is encountered, or (b) by making an automaton specifically tailored to handle this strategy. The first approach requires a mapping from states of the colored automaton to states of each of the local automata and vice versa. We can do this using a hash mapping, which provides constant lookup for this table. When we encounter a transition that would lead us to a state from which we can no longer reach a state where all constraints are satisfied (a dashed state in Fig. 5), we translate the state to states of the local automata. For instance, transition 0100 →H 0301 during the trace ⟨Money, Bonds, High Yield, Money⟩ from Example 1 results in a dashed state 0301. Two of the four local automata are in a permanently violated state. These automata are reset, resulting in state 0000. The second approach creates a dedicated recovery automaton. Figure 7 shows the recovery automaton for our running example. In this automaton, we take the colored automaton and replace any transition to an error (dashed) state with the state we would recover to, effectively precomputing recovery. We do this by translating any transition to a dashed state to a transition to the correct recovery state. In Fig. 7, we removed all dashed states, and introduced new states not previously reachable (here 0200, 1200, 0220, and 1220). We have handled recovery in states 1202 and 1222 by retaining both of the two conflicting (but not yet violated) constraints response and not coexistence and handling the conflict when the violation occurs. From a performance point of view, a dedicated recovery automaton is preferable. Each step takes constant time regardless of the size of the original model. A possible disadvantage is its rigidity: the recovery strategy needs to be determined beforehand and the only way to switch to another recovery strategy is to generate a new recovery automaton.
Fig. 6. Recovery by resetting violated constraints
Fig. 7. Recovery automaton for the system in Fig. 1 using recovery strategy reset and retaining states for conflicting constraints
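Approach (a) above — keeping the local automata, resetting exactly the permanently violated ones and continuing — can be sketched as follows; the dedicated recovery automaton of Fig. 7, which precomputes the same behavior, is not reproduced, and the automaton encoding is the same illustrative one used earlier.

```python
# Illustrative "reset" recovery: replay events on all local automata and,
# whenever one of them enters its dead state, put it back to its initial state
# and count a violation. Encodings are assumptions, not the ProM implementation.

def monitor_with_reset(trace, automata):
    """automata: list of (name, initial_state, step_function, dead_states)."""
    states = {name: init for name, init, _, _ in automata}
    violations = []
    for event in trace:
        for name, init, step, dead in automata:
            nxt = step(states[name], event)
            if nxt in dead:
                violations.append((event, name))
                nxt = init            # reset: forget the history of this constraint
            states[name] = nxt
    return states, violations

def nc_step(s, e):  # not coexistence(Bonds, High_Yield); 3 = dead
    table = {0: {"Bonds": 1, "High_Yield": 2}, 1: {"High_Yield": 3}, 2: {"Bonds": 3}}
    return table.get(s, {}).get(e, s)

if __name__ == "__main__":
    trace = ["Money", "Bonds", "Money", "Stocks", "High_Yield",
             "Money", "Stocks", "High_Yield", "Money", "Stocks", "High_Yield"]
    print(monitor_with_reset(trace, [("not coexistence", 0, nc_step, {3})]))
```

On the trace of Sect. 4.2 this reports a single violation of the not coexistence constraint, as in Fig. 6.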
4.3 Recovery by Skipping Events for Violated Constraints
The third recovery strategy skips events for permanently violated constraints (but still executes them for non-violated constraints). Consider trace ⟨Money, Bonds, Money, Stocks, High Yield, Money, Stocks, High Yield, Money, Stocks, High Yield, complete⟩. When High Yield occurs for the first time, the not coexistence constraint is permanently violated. Under the skip strategy, this constraint is made active again by bringing it back to the last state before the violation, i.e., the not coexistence constraint effectively ignores the occurrence of High Yield. In this way, when High Yield occurs for the second time, the constraint is violated again, and we have a third violation when High Yield occurs for the third time. Figure 8 shows the monitor in ProM using the skip strategy for this trace. Figures 6 and 8 illustrate that the reset and skip strategies may produce different results; the former detects one violation whereas the latter detects three violations. Similar to recovery by resetting violated constraints, it is possible to construct a dedicated recovery automaton for the strategy of skipping events for violated constraints. As a result, monitoring can be done in an efficient manner.
Fig. 8. Recovery by skipping events for violated constraints
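The skip strategy differs from reset in a single line: the offending event is simply not applied to the violated constraint, so the pre-violation state is retained and later occurrences are flagged again. Below is a minimal variant of the previous sketch, under the same assumed encoding.

```python
# Illustrative "skip" recovery: an event that would permanently violate a
# constraint is ignored by that constraint only; its automaton stays in the
# last state before the violation. Same assumed encoding as the reset sketch.

def monitor_with_skip(trace, automata):
    states = {name: init for name, init, _, _ in automata}
    violations = []
    for event in trace:
        for name, init, step, dead in automata:
            nxt = step(states[name], event)
            if nxt in dead:
                violations.append((event, name))
                continue              # skip: keep the previous state
            states[name] = nxt
    return states, violations

def nc_step(s, e):  # not coexistence(Bonds, High_Yield); 3 = dead
    table = {0: {"Bonds": 1, "High_Yield": 2}, 1: {"High_Yield": 3}, 2: {"Bonds": 3}}
    return table.get(s, {}).get(e, s)

if __name__ == "__main__":
    trace = ["Money", "Bonds", "Money", "Stocks", "High_Yield",
             "Money", "Stocks", "High_Yield", "Money", "Stocks", "High_Yield"]
    _, violations = monitor_with_skip(trace, [("not coexistence", 0, nc_step, {3})])
    print(len(violations), "violations")   # 3, as in Fig. 8
```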
In this section, we described three recovery strategies. There is no “best strategy” for continuous support. The choice strongly depends on the application domain and other contextual factors. Therefore, we have implemented all three approaches in ProM.
5 Runtime Modification
Thus far we assumed that the model does not change during monitoring. In the workflow literature one can find many approaches supporting flexibility by change [15,13,5]. The basic idea is that the process model can be changed at runtime. This generates all kinds of complications (see for example the well-known “dynamic change bug” [13]). Models may change due to a variety of reasons, e.g., the implementation of new legislation or the need to reduce response times. This type of flexibility can easily be supported by Declare while avoiding problems such as the “dynamic change bug”. Moreover, frequent violations of existing constraints may trigger model changes such as removing a constraint. Consider for example a trace containing both Bonds and High Yield, thus violating the not coexistence constraint, e.g., ⟨Money, Bonds, Money, Stocks, High Yield, Money, Low Risk, Bonds, Money, Stocks, Bonds, complete⟩. Instead of applying one of the aforementioned recovery strategies, we could leverage the possibility of modifying the model at runtime. In particular, we can perform runtime modification using the algorithms for dynamic modifications presented in [16]. The algorithms are able to update an automaton with changes (such as adding and removing constraints). This approach is much faster than regenerating the automaton from scratch. In our example, we could remove the not coexistence constraint from the reference model, adding at the same time a new not succession constraint, obtaining the Declare model shown in Fig. 9. After this modification, events Bonds and High Yield can coexist, but when Low Risk occurs, Stocks cannot occur anymore. After the modification the trace is monitored w.r.t. the new model, leading to the result reported in Fig. 10. Note that for the second violation (involving the not succession constraint), a strategy for continuous support is applied and the model does not change.
Fig. 9. Dynamic change of the model shown in Fig. 1
Fig. 10. Recovery by runtime change; at runtime the model shown in Fig. 1 is replaced by the model shown in Fig. 9
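Operationally, a runtime model change boils down to replacing one local monitor by another while the case is running. The sketch below swaps the not coexistence monitor for a not succession(Low_Risk, Stocks) monitor mid-trace; it restarts the new monitor from its initial state and does not reproduce the incremental automaton-update algorithms of [16] — both the change point and the encodings are illustrative assumptions.

```python
# Illustrative runtime modification: drop "not coexistence" and add
# "not succession(Low_Risk, Stocks)" while a case is being monitored.

def nc_step(s, e):      # not coexistence(Bonds, High_Yield); 3 = violated
    table = {0: {"Bonds": 1, "High_Yield": 2}, 1: {"High_Yield": 3}, 2: {"Bonds": 3}}
    return table.get(s, {}).get(e, s)

def ns_step(s, e):      # not succession(Low_Risk, Stocks): Stocks forbidden after Low_Risk
    if s == 0:
        return 1 if e == "Low_Risk" else 0
    return 2 if e == "Stocks" else s    # 2 = violated

monitors = {"not coexistence": (0, nc_step, {3})}
trace = ["Money", "Bonds", "Money", "Stocks", "High_Yield",   # violates not coexistence
         "Money", "Low_Risk", "Bonds", "Money", "Stocks"]     # violates not succession
states = {name: init for name, (init, _, _) in monitors.items()}

for i, event in enumerate(trace):
    if i == 5:  # hypothetical change point: after the first violation was observed
        monitors.pop("not coexistence")
        monitors["not succession"] = (0, ns_step, {2})
        states = {"not succession": 0}
    for name, (init, step, dead) in monitors.items():
        states[name] = step(states[name], event)
        if states[name] in dead:
            print("violation of", name, "at", event)
```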
6 Related Work
Several BPM researchers have investigated compliance at the model level [1,7]. In this paper, we focus on runtime verification and monitoring based on the observed behavior rather than the modeled behavior. This has been a topic of ongoing research, not only in the BPM domain, but also in the context of software engineering and service-oriented computing. Most authors propose to use temporal logics (e.g., LTL) and model checking [4]. We refer to the survey paper by Bauer et al. [2] for an overview of existing approaches. To implement runtime verification, classical automata-based model checking techniques must be adapted to reason upon partial traces. The monitored traces are finite, and also subject to extensions as new events happen, making it not always possible to draw a definitive conclusion about the property’s satisfaction or violation. The approach presented in this paper is based on the RV-LTL semantics [2], which is a version of LTL on finite strings tailored for runtime verification. Our verification technique is inspired by [6], where the use of (a finite-trace version of) LTL is also considered to tackle runtime verification. In [6], a translation from arbitrary (next-free) LTL formulas is used to monitor any running system. The main difference with our approach is that we consider the monitor to be composed of several constraints, each of which can be violated, and we report and recover based on individual automata instead of the entire system.
Other logic-based approaches have been proposed to deal with runtime verification of running traces. The work closest to our approach is [3], where DecSerFlow (one of the constraint-based languages supported by Declare) is used to model service choreographies. Here a reactive version of the Event Calculus [8] is employed to provide the underlying formalization and monitoring capabilities. Unlike our approach, in [3] the interplay between constraints is not considered. The approach presented in this paper has been implemented as an Operational Support (OS) provider in ProM. The OS framework in ProM can be used to detect, predict, and recommend at runtime. For example, [14] describes OS providers related to time. Based on a partial trace, the remaining flow time is predicted and the action most likely to minimize the flow time is recommended.
7 Conclusion
Compliance has become an important topic in organizations that need to ensure the correct execution of their processes. Despite the desire to monitor and control processes, there are events that cannot be controlled. For example, it is impossible and also undesirable to control the actions of customers and professionals. Therefore, we propose a comprehensive set of techniques to monitor business constraints at runtime. These techniques are based on colored automata, i.e., automata containing both global information about the state of the entire system and local information about the states of all individual constraints. This automaton can be precomputed, thus making monitoring very efficient. Since constraints may be permanently violated, it is important to recover after a violation (to still give meaningful diagnostics). We proposed and implemented three recovery approaches (ignore, reset, and skip). Moreover, we showed that it is possible to efficiently modify the constraint model while monitoring. The approach has been applied in the Poseidon project where it has been used to monitor “system health” in the domain of maritime safety and security. Future work aims at more applications in typical BPM domains (banking, insurance, etc.). Moreover, we would like to further improve our diagnostics and include other perspectives (data, time, resources).
References
1. Awad, A., Decker, G., Weske, M.: Efficient Compliance Checking Using BPMN-Q and Temporal Logic. In: Dumas, M., Reichert, M., Shan, M.-C. (eds.) BPM 2008. LNCS, vol. 5240, pp. 326–341. Springer, Heidelberg (2008)
2. Bauer, A., Leucker, M., Schallhart, C.: Comparing LTL Semantics for Runtime Verification. Logic and Computation 20(3), 651–674 (2010)
3. Chesani, F., Mello, P., Montali, M., Torroni, P.: Verification of Choreographies During Execution Using the Reactive Event Calculus. In: Bruni, R., Wolf, K. (eds.) WS-FM 2008. LNCS, vol. 5387, pp. 55–72. Springer, Heidelberg (2009)
4. Clarke, E.M., Grumberg, O., Peled, D.A.: Model Checking. The MIT Press, Cambridge (1999)
5. Dadam, P., Reichert, M.: The ADEPT project: a decade of research and development for robust and flexible process support. Computer Science - R&D 23(2), 81–97 (2009)
6. Giannakopoulou, D., Havelund, K.: Automata-Based Verification of Temporal Properties on Running Programs. In: Proceedings of the 16th IEEE International Conference on Automated Software Engineering (ASE 2001), pp. 412–416. IEEE Computer Society, Los Alamitos (2001)
7. Governatori, G., Milosevic, Z., Sadiq, S.W.: Compliance Checking Between Business Processes and Business Contracts. In: Proceedings of the 10th IEEE International Enterprise Distributed Object Computing Conference (EDOC 2006), pp. 221–232. IEEE Computer Society, Los Alamitos (2006)
8. Kowalski, R.A., Sergot, M.J.: A Logic-Based Calculus of Events. New Generation Computing 4(1), 67–95 (1986)
9. Lichtenstein, O., Pnueli, A., Zuck, L.D.: The glory of the past. In: Proceedings of the Conference on Logic of Programs, London, UK, pp. 196–218. Springer, Heidelberg (1985)
10. Montali, M.: Specification and Verification of Declarative Open Interaction Models. LNBIP, vol. 56. Springer, Heidelberg (2010)
11. Pesic, M., Schonenberg, H., van der Aalst, W.M.P.: DECLARE: Full Support for Loosely-Structured Processes. In: Proceedings of the 11th IEEE International Enterprise Distributed Object Computing Conference (EDOC 2007), pp. 287–300. IEEE Computer Society, Los Alamitos (2007)
12. Pesic, M., van der Aalst, W.M.P.: A Declarative Approach for Flexible Business Processes Management. In: Eder, J., Dustdar, S. (eds.) BPM Workshops 2006. LNCS, vol. 4103, pp. 169–180. Springer, Heidelberg (2006)
13. Schonenberg, H., Mans, R., Russell, N., Mulyar, N., van der Aalst, W.M.P.: Towards a Taxonomy of Process Flexibility. In: Bellahsene, Z., Woo, C., Hunt, E., Franch, X., Coletta, R. (eds.) Proceedings of the Forum at the CAiSE 2008 Conference. CEUR Workshop Proceedings, vol. 344, pp. 81–84 (2008)
14. van der Aalst, W.M.P., Pesic, M., Song, M.: Beyond process mining: From the past to present and future. In: Pernici, B. (ed.) CAiSE 2010. LNCS, vol. 6051, pp. 38–52. Springer, Heidelberg (2010)
15. Weber, B., Rinderle, S., Reichert, M.: Change Patterns and Change Support Features in Process-Aware Information Systems. In: Krogstie, J., Opdahl, A.L., Sindre, G. (eds.) CAiSE 2007 and WES 2007. LNCS, vol. 4495, pp. 574–588. Springer, Heidelberg (2007)
16. Westergaard, M.: Better Algorithms for Analyzing and Enacting Declarative Workflow Languages Using LTL. In: Rinderle, S., Toumani, F., Wolf, K. (eds.) BPM 2011. LNCS, vol. 6896. Springer, Heidelberg (2011)
17. Westergaard, M., Maggi, F.M.: Modelling and Verification of a Protocol for Operational Support using Coloured Petri Nets. In: Proc. of ATPN 2011 (2011)
Automated Error Correction of Business Process Models
Mauro Gambini1, Marcello La Rosa2,3, Sara Migliorini1, and Arthur H.M. Ter Hofstede2,3,4
1 University of Verona, Italy {mauro.gambini,sara.migliorini}@univr.it
2 Queensland University of Technology, Australia {m.larosa,a.terhofstede}@qut.edu.au
3 NICTA Queensland Lab, Australia
4 Eindhoven University of Technology, The Netherlands
Abstract. As order dependencies between process tasks can get complex, it is easy to make mistakes in process model design, especially behavioral ones such as deadlocks. Notions such as soundness formalize behavioral errors and tools exist that can identify such errors. However these tools do not provide assistance with the correction of the process models. Error correction can be very challenging as the intentions of the process modeler are not known and there may be many ways in which an error can be corrected. We present a novel technique for automatic error correction in process models based on simulated annealing. Via this technique a number of process model alternatives are identified that resolve one or more errors in the original model. The technique is implemented and validated on a sample of industrial process models. The tests show that at least one sound solution can be found for each input model within a reasonable response time.
1 Introduction and Background
Business process models document organizational procedures and as such are often invaluable to both business and IT stakeholders. They are used to communicate and agree on requirements among business analysts, or used by solution architects and developers as a blueprint for process automation [15]. In all cases, it is of utmost importance that these models are correct. Incorrect process models can lead to ambiguities and misinterpretations, and may not be directly automated [16]. There are different types of errors. A process model can violate simple syntactical requirements (e.g. some nodes are disconnected or used improperly), or suffer from behavioral anomalies (e.g. deadlocks or incorrect completion). Behavioral errors only arise when process models are executed and thus they are typically much more difficult to spot. For these reasons, they are quite common in practice. For example, a recent analysis of more than 1,350 industrial process models reports that 54% of these models are unsound [6]. Formal correctness notions such as soundness [1] define behavioral anomalies for process models, and advanced process modeling tools implement verification methods based on these notions to automatically detect these anomalies in process models [6,16]. Process modelers can take advantage of the information provided by these verification
features to fix design flaws. However, generally it is up to the user to understand the (often very technical) output produced by these tools and to figure out how to fix these errors. Correcting behavioral errors in process models is not trivial: apparently independent errors can have a common cause and correcting one error may introduce new errors in other parts of the model. This problem is amplified by the inherent complexity of process models, which tends to grow as organizations reach higher levels of Business Process Management (BPM) maturity [12]. This paper presents a novel technique called Petri Nets Simulated Annealing (PNSA) for automatically fixing unsound process models. The core of this technique is a heuristic optimization algorithm based on dominance-based Multi-Objective Simulated Annealing (MOSA) [14,13]. Given an unsound process model and the output of its soundness check, at each run, the algorithm generates a small set of solutions (i.e. alternative models) similar to the original model but potentially containing fewer or no behavioral errors, until a maximum number of desired solutions is found or a given timeframe elapses. These solutions are produced by applying a number of perturbations (i.e. small changes) on the current solution, which in turn is derived from the original model. The similarity of a solution to the original model is determined by its structural similarity and (to remain efficient) by an approximation of its behavioral similarity to the original model. Since the intentions of the process modeler are not known and there are usually many ways in which an error can be corrected, the algorithm returns a small set of non-redundant solutions (i.e. no solution is worse than any of the others). The differences between these solutions and the original model can then be presented to a process modeler as suggestions to rectify the behavioral errors in the original model. This technique is implemented in a prototype tool and validated on a sample of industrial process models. The results indicate that multiple errors can be fixed in a short time and at least one sound solution was found for each model. The problem of automatically correcting errors has already been explored for software bug fixing (see e.g. [2]). Given a program, a set of positive tests and at least one failed test proving evidence for a bug, these algorithms produce a patch that fixes the error in question, provided that this error is localized. In the BPM field, [3] describes a technique for automatically fixing certain types of data anomalies that can occur in process models, while [10] presents an approach to compute the edit operations required to correct a faulty service in order to interact in a choreography without deadlocks. However, to the best of our knowledge, the problem of automatically fixing unsound process models has not been addressed yet. In this paper, we represent process models as Workflow nets—a class of Petri nets that has been extensively applied to the formal verification of business process models [16]. In addition, mappings exist between process modeling languages used in practice (e.g. EPCs, BPMN, BPEL) and Petri nets [11]. This provides a basis to extend the results of this paper to concrete process modeling notations. Based on the above, the rest of this paper is organized as follows: Sec. 2 provides the basic definitions of Workflow nets and soundness. Sec. 3 defines the problem in question, while Sec. 4 describes the PNSA technique in detail. 
The evaluation of the proposed technique is treated in Sec. 5 before Sec. 6 concludes the paper.
2 Preliminaries
Petri nets are graphs composed of two types of nodes, namely transitions and places, connected by directed arcs. Transitions represent tasks while places are used for routing purposes. Labels are assigned to transitions to indicate the business action they perform, i.e. the observable behavior. A special label τ is used to represent invisible actions, i.e. actions that are only used for routing purposes and do not represent any task from a business perspective.
Definition 1 (Labeled Petri net). A labeled Petri net is a tuple N = (P, T, F, L, ℓ) where P and T (P ∩ T = ∅) are finite sets of places, resp., transitions, F ⊆ (P × T) ∪ (T × P) is a flow relation, L is a finite set of labels representing business actions, τ ∉ L is a label representing an invisible action, and ℓ : T → L ∪ {τ} is a labeling function which assigns a label to each transition.
For n ∈ P ∪ T, we use •n and n• to denote the set of inputs to n (preset) and the set of outputs of n (postset). We are interested in Petri nets with a unique source place and a unique sink place, and such that all other nodes are on a directed path between the input and the output places. A Petri net satisfying these conditions represents a process model and is known as a Workflow net [1].
Definition 2 (Workflow net). Let N = (P, T, F, L, ℓ) be a labeled Petri net and F* be the reflexive transitive closure of F. N is a Workflow net (WF-net) if and only if (iff):
– there exists exactly one input place, i.e. ∃!pI∈P : •pI = ∅, and
– there exists exactly one output place, i.e. ∃!pO∈P : pO• = ∅, and
– each node is on a directed path from the input to the output place, i.e. ∀n∈P∪T ((pI, n) ∈ F* ∧ (n, pO) ∈ F*).
Fig. 1 shows four example Petri nets, where actions are depicted within transitions, e.g. ℓ(t1) = a. All these nets have single start and end places, and any transition lies on a path from the start to the end place. Hence all these nets are WF-nets. Behavioral correctness of a WF-net is defined w.r.t. the states that a process instance can be in during its execution. A state of a WF-net is captured by the marking of its places with tokens. In a given state, each place is either empty or it contains one or more tokens (i.e. it is marked). A transition is enabled in a given marking if all the places in the transition’s preset are marked. Once enabled, the transition can fire (i.e. can be executed) by removing a token from each place in the preset and putting a token into each place of the transition’s postset. This leads to a new state.
Definition 3 (Marking notation). Let N = (P, T, F, L, ℓ) be a WF-net. Then M : P → ℕ is a marking and Q is the set of all markings. Moreover:
– for any two markings M, M′ ∈ Q, M ≥ M′ iff ∀p∈P M(p) ≥ M′(p),
– for any two markings M, M′ ∈ Q, M > M′ iff M ≥ M′ and M ≠ M′,
– M(N) is the set of all markings of N,
– M_I^N is the initial marking of N with one token in place pI, i.e. M_I^N = [pI],
– M_O^N is the final marking of N with one token in place pO, i.e. M_O^N = [pO],
Fig. 1. Four example WF-nets
– for any transition t ∈ T and any marking M ∈ M(N), t is enabled at M, denoted as M[t⟩, iff ∀p∈•t M(p) ≥ 1. Marking M′ is reached from M by firing t, and M′ = M − •t + t•,
– for any two markings M, M′ ∈ M(N), M′ is reachable from M in N, denoted as M′ ∈ N[M⟩, iff there exists a firing sequence σ = t1.t2...tn (n ≥ 0) leading from M to M′, and we write M →^σ_N M′,
– Tr(N) = {σ ∈ T* : ∃M∈M(N) M_I →^σ_N M} is its set of traces, i.e. firing sequences that start from the initial marking,
– CTr(N) = {σ ∈ T* : M_I →^σ_N M_O} is its set of correct traces, i.e. traces that lead to the final marking,
– ℓ̂ : T* → L* returns the string of visible actions of a firing sequence, where ℓ̂(σ) = ε if σ = ε (the empty string), ℓ̂(σ) = ℓ(t).ℓ̂(σ′) if σ = t.σ′ and ℓ(t) ≠ τ, and ℓ̂(σ) = ℓ̂(σ′) if σ = t.σ′ and ℓ(t) = τ.
In the remainder, we omit N as superscript or subscript when it is clear from the context. All the nets in Fig. 1 are marked with their initial marking, i.e. they have one token in their input place, depicted as a dot inside the place. The execution of a process instance starts with the initial marking and should then progress through transition firings until a proper completion state. This intuition is captured by three requirements [1]. First, every process instance should always have an option to complete. If a WF-net satisfies this requirement, it will never run into deadlocks or livelocks. Second, every process instance should eventually reach the final marking, i.e. the state in which there is one token in the output place, and no tokens are left behind in any other place, since this would signal that there is still work to be done. Third, for every transition, there should be at least one correct trace that includes at least one firing of this transition. A WF-net fulfilling these requirements is sound [1].
Definition 4 (Sound WF-net). Let N = (P, T, F) be a WF-net and MI, MO be the initial and end markings. N is sound iff:
– option to complete: for every marking M reachable from MI, there exists a firing sequence leading from M to a marking M′ ≥ MO, i.e. ∀M∈N[MI⟩ ∃M′∈N[M⟩ M′ ≥ MO, and
– proper completion: the marking MO is the only marking reachable from MI with one token in place pO, i.e. ∀M∈N[MI⟩ M ≥ MO ⇒ M = MO, and
– no dead transitions: every transition can be enabled in some marking reachable from the initial marking, i.e. ∀t∈T ∃M∈N[MI⟩ M[t⟩.
A behavioral error is a violation of one of the three soundness properties. For example, the first WF-net in Fig. 1 is not sound: it can only complete successfully if transition t5 fires only once before t3 fires. In fact, if t3 fires before t5, t7 will deadlock in marking [p7] waiting for a token in p8 which will never arrive (no option to complete error). Also, if t5 fires more than once before t3, it will put more than one token in p8 and, when t3 and t7 fire, the net will complete with the marking [pO + kp8] with k > 0 (no proper completion error). Nets (b) and (c) also suffer from behavioral problems: (b) has no proper completion in markings [pO + p5 + p8] and [pO + p6 + kp8] with k ≥ 0, while (c) has a deadlock in [p7] and a dead transition t7. Net (d) is sound. In the next section we focus on the problem of automatically fixing unsound process models.
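To make the firing rule of Definition 3 and the first soundness condition concrete, the sketch below represents markings as multisets, fires enabled transitions and reports a dead end when a non-final marking enables nothing. The two-transition net is a hypothetical toy example, not one of the nets of Fig. 1.

```python
from collections import Counter

# Minimal WF-net machinery: markings as multisets, the firing rule of
# Definition 3, and a check for the dead-end situation behind the
# "no option to complete" error. The example net is hypothetical.

def enabled(marking, preset):
    return all(marking[p] >= 1 for p in preset)

def fire(marking, preset, postset):
    m = Counter(marking)
    m.subtract(Counter(preset))
    m.update(Counter(postset))
    return +m                      # drop places with zero tokens

# net: pI --t1--> p1, and t2 needs {p1, p2}; p2 is never marked -> deadlock in [p1]
NET = {"t1": (["pI"], ["p1"]), "t2": (["p1", "p2"], ["pO"])}

def replay(trace, initial, final):
    m = Counter(initial)
    for t in trace:
        pre, post = NET[t]
        assert enabled(m, pre), f"{t} not enabled at {dict(m)}"
        m = fire(m, pre, post)
    if m != Counter(final) and not any(enabled(m, pre) for pre, _ in NET.values()):
        print("no option to complete: dead end at", dict(m))
    return m

if __name__ == "__main__":
    replay(["t1"], initial=["pI"], final=["pO"])   # ends in [p1], nothing enabled
```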
3 Automatic Process Model Correction
We name the problem of automatically fixing process model errors the Automatic Process Model Correction (APMC) problem. Intuitively, given an unsound WF-net N and the output of its verification method, we want to find a set of remedial suggestions, each one presented as a minimal set of changes, that transforms N into a sufficiently similar WF-net N′ with fewer or no behavioral errors. Since there are usually different ways to solve a behavioral error in an unsound model, we should return different alternative solutions N′, such that none of these solutions is strictly worse than the others. However, since the set of solutions can be potentially large, we should limit it to a maximum number of solutions and a maximum timeframe that the user is willing to wait to find such solutions. The user can then evaluate the solutions found to see if some of these are consistent with their initial intentions, and apply the changes accordingly. A solution should be sufficiently similar to the original model in order to preserve the modeler’s intentions. Otherwise we could obtain a sound variant simply by reducing the original model to a trivial sequence of actions. The notion of process model similarity can be approached from a structural and from a behavioral perspective [5]. From a structural perspective, we should create models whose structure is similar to that of the original model, i.e. we should minimize the changes that we apply to the original model in terms of insertion and deletion of nodes and arcs, and control the type of these changes. From a behavioral perspective, we should preserve the observable behavior of a process model as far as possible and, at the same time, try to reduce the number of behavioral errors without introducing new ones. Finally, a solution should not be inferior to the others in terms of the number and type of errors being fixed. For example, the final set of solutions should not contain a model that fixes one error if there also exists another model in the same set that fixes that error and a second error. Structural similarity alone does not guarantee behavioral similarity. Also, two different process structures can have the same observable behavior, thus a solution that is
structurally different from the original model may actually suffer from the same behavioral errors. While behavioral similarity may be sufficient to characterize a solution, a solution that is behaviorally similar to the original model but structurally very different may not be recognized as a good solution by the user. Moreover, structural similarity enhances confidence about not introducing new errors, and limits the search space. For these reasons, we consider both structural and behavioral similarities in our study. In the following section, we formalize these notions.
4 Petri Nets Simulated Annealing
A well-established measure for structural similarity of process models is based on graph-edit distance [4]. A challenge in this technique is finding the best mapping between the nodes of the two graphs to compare, which can be computationally expensive. However, we avoid this problem because we measure the graph-edit distance by computing the costs of the operations that we perform to change the original model into a given solution. Unfortunately, behavioral similarity cannot be computed efficiently. As an example, the Transition Adjacency Relation technique [17] requires the exploration of the entire state space of the two models to determine their behavioral similarity. This would be infeasible in our case, as we would need to check the state space of each proposed solution against that of the original model, and we build a large number of candidate solutions. Thus, we opt for an approximation of behavioral similarity based on two components. First, we introduce a notion of behavioral distance to measure the ability of a solution to simulate a finite set of correct traces of the original model. Second, we introduce a notion of badness to measure how many errors of the original model are still present in the solution, and how many new errors have been introduced. In the light of this, the APMC problem becomes a multi-objective optimization problem which can be solved by trying to simultaneously minimize three objective functions: the structural distance, the behavioral distance and the badness of a solution w.r.t. the original model. At each iteration of our algorithm, we search for candidate solutions with lower structural distance, behavioral distance and badness w.r.t. previous solutions. We compute the behavioral distance and the badness over a sample set of traces of the original model, which we select at each iteration. If we find a solution that increases these objective functions, we discard it. Otherwise we maintain the solution so that it can be tested with further sample traces in subsequent iterations. The more tests a solution passes, the more the confidence increases that this is a good solution. The procedure concludes when a maximum number of non-redundant solutions with high confidence is reached (i.e. each solution is not worse than any of the others), or a given timeframe elapses. In the following subsections, we formalize the ingredients of this technique and describe its algorithm in detail.
4.1 Structural Distance
Each candidate solution is obtained by applying a minimal sequence of edit operations, i.e. an edit sequence, in order to insert or remove a single arc or node. Inserting or removing one arc is a single atomic operation and does not violate the properties of a WF-net, i.e. the solution will always be a WF-net. Inserting a node implies adding
the node itself and one incoming and one outgoing arc to connect this new node to the rest of the net (a total of three edit operations are needed). Deleting a node implies removing the node itself and all its incoming and outgoing arcs, and can only be done if the elements in the node’s preset and postset remain on a path from pI to pO after removing the node. This requires at least three edit operations. Given that each edit sequence is minimal, i.e. it corresponds to the insertion or removal of one arc or node only, more complex perturbations on the original model can be obtained by applying multiple edit sequences through a number of intermediate solutions. Coming back to the example in Fig. 1, nets (b)-(d) are all solutions of (a) which can be obtained via one or more edit sequences. For example, net (b) can be obtained directly from (a) with one edit sequence consisting of one edit operation (removal of arc p5−t3), while net (c) can be obtained with two edit sequences consisting of four edit operations in total (addition of arc p8−t6 and removal of place p6 with its two arcs). While it is safe to add or remove places and arcs, we need special considerations for transitions. Since the modeler’s intentions are not known, we assume that a business action was introduced with a specific business purpose, so its associated transition should not be removed from the model. On the other hand, an invisible action (i.e. a τ transition) is only used for routing purposes, so we assume it can be safely removed or inserted. Thus, we only allow insertion and removal of τ transitions. The cost of each type of edit operation can be controlled by the user. For example, one may rate removal operations as more expensive than addition operations, based on the assumption that it is less likely that modelers would introduce something erroneous than that they forgot to introduce something essential. Similarly, one may rate operations on transitions as more expensive than operations on places, and the latter as more expensive than operations on arcs. The structural distance between two WF-nets is obtained by the total cost of the edit sequence between the two nets.
Definition 5 (Structural distance). Let N and N′ be two WF-nets, E the set of all edit operations, e(N, N′) = ⟨ei ∈ E⟩_{i=1..k} the edit sequence between N and N′, and c : E → R a function that assigns a cost to each edit operation. The structural distance λ : N × N → R between N and N′ is λ(N, N′) = Σ_{i=1..k} c(ei).
Let us assume a cost of 1 for each type of edit operation. Then the structural distance between nets (b) and (a) is 1, between (c) and (a) is 4, and between (d) and (a) is 5.
4.2 Behavioral Distance
Given a sample set of traces R from the original model N, we compute the behavioral distance of a solution N′ from N as its ability to simulate the correct traces in R, i.e. those traces that start from marking MI and complete in MO. Hence, this function can only be used if the original model has at least one correct trace. A model N′ can simulate a trace σ if its string of visible actions ℓ̂(σ) can be replayed in N′. For example, given the trace σ1 = t1.t2.t4.t5.t6.t4.t3.t7 of net (a) in Fig. 1, we want to check whether ℓ̂(σ1) = a.b.c.d.c.e.f can be replayed in a solution of (a). Precisely, we want to know to what extent ℓ̂(σ1)
can be simulated by its best simulation trace, i.e. by the longest prefix of ℓ̂(σ1) that can be replayed in N′. For example, net (b) can fully simulate σ1 while (c) cannot simulate action f because t7 is dead in (c). The extent of a simulation is called simulation ratio and is given by |ℓ̂(σ′)| / |ℓ̂(σ)|, where σ′ is the best simulation of σ in a given solution. The simulation ratio of σ1 in net (b) is 1 while in (c) it is 6/7 ≈ 0.86. We limit the number of silent actions that can be used to simulate a trace via a parameter m for efficiency reasons. In fact, the higher m is, the more the solution model needs to be explored to see if a given trace can be simulated (and each silent action may open new paths to explore).
Definition 6 (Trace simulation, Best simulation). Let N and N′ be two WF-nets and σ ∈ Tr(N) be a trace of N. Then Sim(N, N′, σ, m) = {σ′ ∈ Tr(N′) : |{t ∈ α(σ′) : ℓ(t) = τ}| ≤ m ∧ ℓ̂(σ′) ∈ Pref(ℓ̂(σ))} is the set of traces that can simulate σ in N′ with at most m silent actions, where α(σ) returns the alphabet of σ and Pref(s) returns the set of all prefixes of string s. We say that σ′ is a best simulation of σ in N′ with at most m silent actions, denoted as simb(N, N′, σ, m), iff σ′ ∈ Sim(N, N′, σ, m) and |ℓ̂(σ′)| = max_{θ∈Sim(N,N′,σ,m)} |ℓ̂(θ)|. Let σ′ = simb(N, N′, σ, m) such that M_I →^σ′_N′ M. Then M_e^σ′ = M is the final marking of σ′ and M^σ′ = {M′ ∈ M(N′) : ∃θ∈Pref(σ′) M_I →^θ_N′ M′} is the set of all markings traversed by σ′.
The behavioral distance of a solution w.r.t. a set of traces R is the inverse of the cumulative simulation ratio of all the best simulation traces, normalized to the number of correct traces in R.
Definition 7 (Behavioral distance to N w.r.t. R and m). Let N be a WF-net such that its set of correct traces CTr(N) ≠ ∅ and let N′ be another WF-net. The behavioral distance δ : N × N × 2^Tr(N) × N → [0, 1] of N′ to N w.r.t. a representative set of traces R of N and a simulation parameter m is
δ(N, N′, R, m) = 1 − (Σ_{σ∈R∩CTr(N)} ψ(N, N′, σ, m)) / |R ∩ CTr(N)|,
where the simulation ratio ψ(N, N′, σ, m) = |ℓ̂(σ′)| / |ℓ̂(σ)| with σ′ = simb(N, N′, σ, m) if such a best simulation exists, and 0 otherwise.
With reference to Fig. 1, let us consider R1 = {⟨t1.t2.t4.t5.t6.t4.t3.t7⟩, ⟨t1.t4.t5.t6.t4.t2.t3.t7⟩, ⟨t1.t2.t4.t3⟩, ⟨t1.t2.t4.t5.t6.t4.t5.t6.t4.t3.t7⟩}, where the first two traces are correct. The behavioral distance of nets (b) and (d) to (a) over the correct traces in R1 is 0, since all such traces in R1 can be fully simulated in (b) and (d), while the behavioral distance of (c) to (a) is 1 − (0.86 + 0.86)/2 = 0.14 (using at most m = 5 silent actions).
4.3 Badness
Broadly speaking, a model is behaviorally better than another if it contains fewer behavioral errors, but not all errors are equal and errors of the same type can have a different severity. For example, a no option to complete error may prevent an entire process fragment
from being executed, which is far worse than having a single dead transition in a model. Thus, we need a function to rank errors based on their gravity. More precisely, given a sample set of traces R, we want to find out if and to what extent each of these traces leads to an error. In the WF-net context, a trace is proof of no option to complete if it produces a marking M such that M ≠ MO and no transition can be enabled in M. A trace is proof of improper completion if it produces a marking M such that M > MO, i.e. there are other marked places besides pO. These conditions can be efficiently checked on a trace. Unfortunately, it is not possible to provide evidence for a dead transition unless the whole state space is explored, which we cannot afford to do when generating solutions (exploring the entire state space is notoriously an exponential problem). However, given a trace, we can still provide a “warning” for a potentially-dead transition, i.e. a transition that is partially enabled in the last marking of that trace, due to some places in its preset not being marked (where the maximum number of admissible missing places is a parameter d).
Definition 8 (Potentially-dead transition). Given a WF-net N, a marking M ∈ M(N) and a transition t ∈ T, t is potentially-dead in marking M, i.e. pdt(N, M, t, d), iff it holds that 0 < |{p ∈ •t | M(p) = 0}| ≤ d and ∃p∈•t M(p) > 0, where 0 < d < |•t| denotes the maximum number of admissible missing places.
We can now provide the classification of erroneous traces.
Definition 9 (Erroneous trace classification). Let N be a WF-net, σ ∈ Tr(N) be one of its traces (MI →^σ M), and d be a parameter indicating the maximum number of admissible missing places. Then σ has:
1. a no option to complete error iff M ≠ MO and ∄t∈T M[t⟩,
2. an improper completion error iff M > MO,
3. a potentially-dead transition iff ∃t∈T pdt(N, M, t, d).
We denote the set of all erroneous traces for a net N as ETr(N). While no option to complete and improper completion are mutually exclusive errors in a trace, a trace that suffers from either of these problems can also have a potentially-dead transition. For each σ in R of N, the badness is a function measuring the severity of each error for a best simulation σ′ of σ in N′ (if N′ ≡ N, the best simulation of σ is the trace itself). For example, if σ has an improper completion error while its best simulation σ′ in N′ does not have this error, the badness of σ′ will be lower than that of σ (0 if no errors are found for that trace). Thus, while for the behavioral distance we are only interested in simulating correct traces, when computing the badness we need to make sure R contains both correct and erroneous traces of N. Correct traces in N will have a badness of 0 and can be tested against their simulations in N′ to see if a new error has been introduced (the badness of the simulation trace will be greater than 0); erroneous traces in N will have a badness > 0 and can be tested against their simulations to see if the errors have diminished or disappeared (the badness of the simulation trace will be lower than that of the original trace). The badness, denoted as β, consists of three components, each measuring the severity of one error type: βn for no option to complete, βi for improper completion and βd for potentially-dead transitions. Given a trace σ of N, βn measures the probability of
ending up in a deadlock while executing one of its best simulations σ in N . This is done by counting the number of enabled transitions that are not fired at each state traversed by σ : each such a transition provides an option to diverge from the route of σ, thus potentially avoiding the final deadlock. So the more such options there are while executing σ , the lower the probability is of ending up in the deadlock in question, and the less onerous this error is. βn is also proportional to the complexity of σ , which is estimated by counting the number of different transitions in σ over the total number of transitions in N . The intuition is that an error appearing in a more complex trace is worse than the same error appearing in a simple trace. Thus, if the badness of a complex trace is higher than that of a simpler trace featuring the same error, we will prioritize the fixing of the complex trace as opposed to fixing the simple one, with the hope that as a side-effect of fixing the complex trace, other erroneous traces may be fixed. βi measures the probability of ending up in an improper completion state while executing σ . This is done by counting the number of places that are marked when σ marks pO , over the total number of places that are marked while executing σ . βi is also proportional to the complexity of σ . Finally, βd returns the probability of having dead transitions in the last state of σ . This is done by counting how many places are not marked over the size of the preset of each transition that i) is potentially-dead in the last state of σ and ii) is not fired in any best simulation of any trace in R. Indeed, even if a transition satisfies the potentially-dead condition, we know for sure that it is not dead if it can be fired in a best simulation of a trace in R. This measure is counteracted by the complexity of the trace, under the assumption that the more transitions are fired by σ , the less likely it is that the potentially-dead transitions in the last state of σ need to be fired. Definition 10 (Badness w.r.t. R, m and d). Let N, N be two WF-nets, R a set of traces of N , m the simulationparameter and d the maximum number of admissible missing places. Let also V = σ∈R α(simb (N, N , σ, m)) be the set of all transitions of N fired in a best simulation of a trace in R. The badness of N w.r.t. R, m and d is β : N ×N × 2Tr(N ) × N × N → R: β(N, N , R, m, d) =
β(N, N′, R, m, d) = Σσ∈R ( wn · βn(N, N′, σ, m) + wi · βi(N, N′, σ, m) + wq · βd(N, N′, σ, m, d) )

where wn, wi and wq are the weights of each error type. For each σ ∈ R, given σ′ = simb(N, N′, σ, m) with set of traversed markings Mσ′ and final marking Meσ′, we define (here [x] returns 1 if the boolean formula x is true, and 0 otherwise):

βn(N, N′, σ, m) = (|α(σ′)| / |T′|) · [Meσ′(pO) = 0 ∧ ∄ t ∈ T′ : Meσ′[t⟩] / (1 + (ΣM∈Mσ′, t∈T′ [M[t⟩]) − |σ′|)

βi(N, N′, σ, m) = (|α(σ′)| / |T′|) · [Meσ′(pO) > 0] · (Σp∈P′∖{pO} [Meσ′(p) > 0]) / (ΣM∈Mσ′, p∈P′∖{pO} [M(p) > 0])

βd(N, N′, σ, m, d) = (1 / |α(σ′)|) · Σt∈T′∖V, pdt(N′, Meσ′, t, d) |{p ∈ •t : Meσ′(p) = 0}| / |•t|
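Read as code, the three components could be computed along the following lines; this is only a sketch under assumed data structures (markings as dictionaries of marked places, the trace σ′ given by its traversed markings and fired transitions), with parameter names of our own:

```python
def badness_components(markings, fired, transitions, places, preset, p_out, unfired_candidates, d):
    """Compute beta_n, beta_i and beta_d for one simulated trace sigma'.
    markings: traversed markings (dicts of marked places), ending in M_e;
    fired: transitions fired along sigma'; unfired_candidates: transitions outside V."""
    ind = lambda b: 1 if b else 0                              # the [x] indicator
    enabled = lambda m, t: all(m.get(p, 0) > 0 for p in preset[t])

    m_end = markings[-1]
    complexity = len(set(fired)) / len(transitions)

    dead_end = ind(m_end.get(p_out, 0) == 0 and not any(enabled(m_end, t) for t in transitions))
    options = sum(ind(enabled(m, t)) for m in markings for t in transitions)
    beta_n = complexity * dead_end / (1 + options - len(fired))

    marked_end = sum(ind(m_end.get(p, 0) > 0) for p in places if p != p_out)
    marked_all = sum(ind(m.get(p, 0) > 0) for m in markings for p in places if p != p_out)
    beta_i = complexity * ind(m_end.get(p_out, 0) > 0) * marked_end / max(marked_all, 1)

    beta_d = 0.0
    for t in unfired_candidates:                               # pdt(N', M_e, t, d) check
        missing = [p for p in preset[t] if m_end.get(p, 0) == 0]
        if 0 < len(missing) <= d and len(missing) < len(preset[t]):
            beta_d += len(missing) / len(preset[t])
    return beta_n, beta_i, beta_d / max(len(set(fired)), 1)
```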
Let us consider again set R1 used in Sec. 4.2 for net (a) and let us assume m = 5, d = 2 and wn = wi = wq = 1. We recall that net (a) has a deadlock in state [p7]. This can be obtained by firing σ3 = t1.t2.t4.t3 ∈ R1, producing a badness of 0.29. This error is corrected in net (b). In fact, the best simulation of σ3 in (b) completes in state [p5 + p7] which enables t5. Its badness is thus 0 (there are no potentially-dead transitions in this state with d = 2). So (b) improves (a) w.r.t. σ3. On the other hand, since the deadlock still remains in (c), the badness for σ3 in (c) is 0.41. This badness is higher than that of (a) since t7 can never be executed in (c). So (c) worsens (a) w.r.t. σ3. The improper completion of (a) in [pO + p8], obtained e.g. by firing σ4 = t1.t2.t4.t5.t6.t4.t5.t6.t4.t3.t7 with badness 0.12, is best simulated in (b) by a trace completing in [pO + p5 + p8]. Since this state marks two places besides pO instead of one, it induces a badness of 0.17 which is worse than that of σ4. So (b) worsens (a) w.r.t. σ4. For net (c), σ4 can be best simulated with a trace completing in state [p7] with a badness of 0.28. In fact, while there is no improper completion error (the trace does not even mark pO), there is still a deadlock in [p7] and t7 is dead. So (c) also worsens (a) w.r.t. σ4. Since (d) is sound, its badness for the above traces is 0. Finally, both (b) and (c) worsen (a) w.r.t. σ1 and σ2 (the correct traces of R1 shown in Sec. 4.2), as they introduce new improper completion, resp. no option to complete, errors. The overall badness of (a), (b), (c) and (d) w.r.t. R1 is 0.41, 0.53, 1.19 and 0.

4.4 Dominance-Based Simulated Annealing

Given an erroneous model N ∈ 𝒩, the goal of the PNSA technique is to produce a good set of solutions S ⊆ 𝒩 such that each model Ni ∈ S is similar to N but contains fewer or no errors. S can be considered good if i) its members are good solutions according to the three objective functions (structural distance, behavioral distance and badness); ii) they are non-redundant, i.e. no member is better than the others; and iii) all have high confidence, i.e. they have been tested against a given number of sets R. In order to find good solutions while avoiding redundancy in the final solution set, we need to be able to compare two solutions. To do so, we use the values of their objective functions. First, we need to group these functions into a unique objective function w.r.t. a set R. Since a solution can be tested against multiple sets R, we also need a notion of average unique objective function over the various R.

Definition 11 (Unique objective function). Let N, N′ be two WF-nets. Assuming N and the simulation parameter m are fixed, we define the following objective functions w.r.t. a set of traces R of N: f1(N′, R) = λ(N, N′), f2(N′, R) = δ(N, N′, R, m) and f3(N′, R) = β(N, N′, R, m). These functions are grouped into a unique objective function f̄ : 𝒩 × 2^Tr(N) → ℝ³ such that for each N′ and R, f̄(N′, R) identifies the triple (λ(N, N′), δ(N, N′, R, m), β(N, N′, R, m)). Given i sets Rk with 1 ≤ k ≤ i, we compute the average unique objective function f̄avg({f̄(N′, Rk)}k=1..i) = (avg1≤k≤i(f1(N′, Rk)), avg1≤k≤i(f2(N′, Rk)), avg1≤k≤i(f3(N′, Rk))).

Considering our example set R1, the values of the unique objective functions for nets (a)–(d) are: (0,0,0.41), (1,0,0.53), (4,0.14,1.19) and (5,0,0).
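A minimal illustration of Definition 11, assuming the three distance values have already been computed elsewhere:

```python
def unique_objective(lam, delta, beta):
    """The triple f_bar(N', R): structural distance, behavioral distance and badness."""
    return (lam, delta, beta)

def average_objective(archive):
    """Component-wise mean over the triples a candidate accumulated for the sets R_1..R_i."""
    return tuple(sum(t[j] for t in archive) / len(archive) for j in range(3))

print(average_objective([(1, 0, 0.53), (1, 0.1, 0.47)]))   # (1.0, 0.05, 0.5) for a hypothetical candidate
```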
The (average) unique objective function can be used to compare two solutions via the notion of dominance. A solution N1 dominates a solution N2 iff N1 is better than N2 in at least one objective function and no worse in the remaining ones.

Definition 12 (Dominance). Given a set of traces R, a solution N1 dominates a different solution N2 w.r.t. R, i.e. N1 ≺R N2, iff f̄(N1, R) ≤ f̄(N2, R) and there exists a 1 ≤ j ≤ 3 such that fj(N1, R) < fj(N2, R). If N1 does not dominate N2, we write N1 ⊀R N2. When we have multiple sets R, we compute the average dominance by comparing the average unique objective functions of two solutions, and denote this relation with ≺avg.

The dominance relation establishes a partial order and two solutions N1 and N2 are mutually non-dominating iff neither dominates the other. In our example, net (a) dominates both (b) and (c), (b) dominates (c), while (d) is mutually non-dominating with all other nets: (d) has a higher structural distance than (a)–(c) although its behavioral distance and badness are lower than or equal to those of the other nets. A Pareto-set is the set of all mutually non-dominating solutions w.r.t. the average unique objective functions of the solutions, i.e. a set of non-redundant solutions.

Definition 13 (Pareto-set). Two solutions N1, N2 ∈ 𝒩 are mutually non-dominating w.r.t. a set of traces R iff N1 ⊀R N2 and N2 ⊀R N1. Similarly, N1 and N2 are mutually non-dominating on average w.r.t. a number of sets of traces iff N1 ⊀avg N2 and N2 ⊀avg N1. A Pareto-set is a set S ⊆ 𝒩 such that for all solutions N1, N2 ∈ S, N1 ⊀avg N2 and N2 ⊀avg N1.

A Pareto-optimum is a solution that is not dominated by any other solution. The set of Pareto-optima is called the Pareto-front.

Definition 14 (Pareto-optimum, Pareto-front). A Pareto-optimum N1 ∈ 𝒩 is a solution for which no N2 ∈ 𝒩 exists such that N2 ≺avg N1. A Pareto-front G is the set of all Pareto-optima.

Our PNSA technique is an adaptation of dominance-based MOSA [13,14] to the problem of automatically fixing errors in process models. Dominance-based MOSA is a robust technique for solving multi-objective optimization problems. At the core of the MOSA technique is an optimization procedure called simulated annealing [9]. The term "simulated annealing" derives from the "annealing" process used in metallurgy: the idea is to heat and then slowly cool down a metal so that its atoms reach a low-energy, crystalline state. At high temperatures atoms are free to move around. However, as the temperature lowers, their movements are increasingly limited due to the high energy cost of movement. By analogy with this physical process, each step of the annealing procedure replaces the current solution by a random "nearby" solution with a probability that depends both on the difference between the corresponding objective values and on a global parameter Temp (the temperature), which is decreased during the process. The dependency is such that the current solution changes almost randomly when Temp is high, but increasingly less as Temp goes to zero. Allowing "uphill" moves potentially saves the method from getting stuck at local optima, which is the main drawback of greedy algorithms. The goal is thus to move towards the Pareto-front while encouraging the diversification of the candidate solutions.
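As an illustration of Definitions 12–14 (not the authors' implementation), dominance checking and Pareto filtering over objective triples can be sketched as follows, using the triples of the running example:

```python
def dominates(f1, f2):
    """Definition 12 on (average) objective triples: no worse everywhere, strictly better somewhere."""
    return all(a <= b for a, b in zip(f1, f2)) and any(a < b for a, b in zip(f1, f2))

def pareto_set(candidates):
    """Keep the mutually non-dominating candidates (Definition 13); candidates maps a name
    to its average objective triple (structural distance, behavioral distance, badness)."""
    return {n: f for n, f in candidates.items()
            if not any(dominates(g, f) for m, g in candidates.items() if m != n)}

# The triples of the running example, nets (a)-(d):
nets = {"a": (0, 0, 0.41), "b": (1, 0, 0.53), "c": (4, 0.14, 1.19), "d": (5, 0, 0)}
print(sorted(pareto_set(nets)))   # ['a', 'd'] -- (b) and (c) are dominated by (a)
```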
It has been shown that simulated annealing can be more effective than exhaustive enumeration when the goal is to find an acceptably good solution in a fixed amount of time, rather than the best possible solution [9]. Our technique differs from the classic dominance-based MOSA by the presence of approximate objective functions (behavioral distance and badness), and by the notion of confidence, which is used to compensate for this approximation by testing each solution against multiple sample traces. In order to escape from local optima, and in line with dominance-based MOSA, we do not simply compare two solutions based on their dominance relation. This would in fact exclude a candidate solution N2 that, based on the current set of sample traces R, is dominated by the current solution N1 even if globally N2 may be better than N1. Rather, we use a notion of energy. The energy of a solution measures the portion of the front that dominates that solution. Thus, the lower the energy of a solution is, the better the solution is. Unfortunately, the true Pareto-front G is unavailable during an optimization procedure. To obviate this problem, the energy function is computed based on a finite approximation of the Pareto-front, G′ ⊆ 𝒩, called the estimated Pareto-front. G′ is built incrementally based on the Pareto-set S under construction. Thus, G′ is initially empty and incrementally populated with new values as long as new mutually non-dominating solutions are added to S during the annealing procedure.

Definition 15 (Energy). Let R be a set of traces and G′ ⊆ 𝒩 be a finite estimation of the Pareto-front G. The energy of a WF-net N′ w.r.t. R and G′ is E(N′, R, G′) = |{N″ ∈ G′ : N″ ≺R N′}|.

Having defined the notion of energy, we can use this to compare two solutions in the annealing procedure. A candidate solution N2 is accepted in place of the current solution N1 on the basis of their energy difference ΔE w.r.t. R and an estimated Pareto-front, i.e. the difference in the number of solutions in the estimated Pareto-front that dominate N1 and N2 w.r.t. R.

Definition 16 (Energy difference). Let N1, N2 be two WF-nets, R be a set of traces and G′ be the finite estimation of G. Given the set G̃ = G′ ∪ {N1, N2}, the energy difference between N1 and N2 w.r.t. R and G′ is ΔE(N2, N1, R, G′) = (E(N2, R, G̃) − E(N1, R, G̃)) / |G̃|.

A candidate solution N2 that is dominated by one or more members of the current estimated Pareto-front may still be accepted, with a probability equal to min(1, exp(−ΔE(N2, N1, R, G′)/Temp(i))), where Temp(i) is a monotonically decreasing function indicating the temperature for iteration i of the annealing procedure. In Def. 16, the inclusion of N1 and N2 in set G̃ yields a negative ΔE if N2 ≺R N1. This ensures that candidate solutions that move the estimated front towards the true front are always accepted. The division by |G̃| also ensures that ΔE is always less than 1, and provides some robustness against fluctuations in the number of solutions in G′. A further benefit of ΔE is that, while fostering convergence to the front, it also fosters its wide coverage.
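A compact sketch of Definitions 15 and 16 together with the acceptance probability above; treating the extended front G̃ as a plain list and passing objective triples instead of nets are simplifications of ours:

```python
import math
import random

def dominates(f1, f2):
    """Objective triples computed w.r.t. the same sample set R; smaller is better."""
    return all(a <= b for a, b in zip(f1, f2)) and any(a < b for a, b in zip(f1, f2))

def energy(obj, front):
    """Def. 15: number of front members that dominate the given solution's objectives."""
    return sum(1 for g in front if dominates(g, obj))

def accept_candidate(current_obj, candidate_obj, estimated_front, temperature):
    """Def. 16 plus the acceptance test: energy difference over the extended front,
    accepted with probability min(1, exp(-dE/Temp)); temperature is assumed > 0."""
    extended = list(estimated_front) + [current_obj, candidate_obj]
    d_e = (energy(candidate_obj, extended) - energy(current_obj, extended)) / len(extended)
    return random.random() < min(1.0, math.exp(-d_e / temperature))
```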
For example, let us assume we only have two objective functions f1 and f2. Fig. 2 depicts the objective values of two solutions, N1 and N2 (represented as empty circles), in relation to the values of the Pareto-front and its estimation G′. N1 and N2 are mutually non-dominating (N1 is better than N2 along f2 but N2 is better along f1). However, N1 is dominated by fewer elements of G′ than N2 (2 instead of 4). Thus, N1 has lower energy and would be more likely accepted in place of N2.

Fig. 2. Two solutions N1 and N2 and their energy

Let us assume an estimated Pareto-front of {(0, 0, 0.41), (5, 0.12, 0.23), (9, 0, 0), (11, 0, 0.71)} for our working example. Accordingly, net (b) has a lower energy than (c), so it has a higher probability of being accepted than (c). In turn, although (d) is mutually non-dominating with both (b) and (c), it has a lower energy than these two nets, so it has a higher probability of being accepted in place of them.

At each iteration i of the annealing procedure we test the solutions against a random set of traces Ri of the original model. If at any iteration a candidate solution N2 has the same energy as the current one N1, N2 has a probability of 1 of being chosen (ΔE = 0). However, we prefer to keep the current solution since this has also been tested against some other sets Rj k/2 or ETr(N) ⊂ Ri if |ETr(N)| ≤ k/2.
5. Compute the unique objective functions of N1 and N2 w.r.t. Ri and add their values to the respective archives AN1 and AN2.
6. Compute the energy difference ΔE(N2, N1, Ri, G′) between N2 and N1 using the estimated Pareto-front G′.
7. If ΔE(N2, N1, Ri, G′) ≠ 0, replace the current solution N1 with N2 with a probability equal to min(1, exp(−ΔE(N2, N1, Ri, G′)/Temp(i))). Otherwise discard N2 and increase the confidence of N1 by 1.
8. If N2 is accepted in place of N1, set N2 as the current solution with confidence 1, and add N2 to LS.
9. Repeat from Step 3 while the confidence of the current solution is less than c and the maximum number of iterations o is not reached.
10. Add the current solution to LS. If the first element of LS, ls1, has confidence at least c, compute its average unique objective function f̄avg based on its archive Als1. Add ls1 to S if ls1 is not dominated on average by any element of S, and remove all elements of S that are dominated on average by ls1.
11. Repeat from Step 1 until the timeframe tf elapses or |S| = s.
The complexity of each annealing iteration is dominated by Steps 3, 5 and 6 (the rest is achieved in constant time). Step 3 computes a perturbation, which is linear on the size of the WF-net to be changed (the net is explored depth-first to check if a node can be removed; node/arc insertion and arc removal are achieved in constant time). Step 5 entails the computation of the objective functions. Computing the structural similarity is linear on the number of edit operations used in the perturbation, which in turn is bounded by the size of the WF-net. For the behavioral similarity and badness we use a depth-first search on the WF-net, which is linear on the product of i) the sum of the lengths of the traces in R to be simulated, ii) the maximum number of silent actions we can use to simulate a trace, and iii) the size of the WF-net. Step 6 entails the computation of the energy difference, which is done in logarithmic time on the size of the estimated Pareto-front [13]. So each annealing iteration can be executed efficiently.
5 Experimental Results We implemented the PNSA algorithm in a prototype Java tool (available at www.apromore.org/tools). This tool imports an unsound WF-net in LoLA format, and builds a finite representation of its state-space using Karp-Miller’s acceleration [8]. This is done by exploring as many distinct transitions as possible in the smallest number of states. The result of the soundness check and the user parameters trigger the PNSA algorithm. At each iteration i of the annealing procedure, the tool performs at most k prioritized random walks on the state-space to extract the sample traces of set Ri . More precisely, for each trace, the state-space is traversed backwards starting from
a final state and by prioritizing those transitions that have been visited the least, until the initial state is reached. Correct traces are extracted starting from MO; erroneous traces are extracted starting from a final state that led to an error (available from the soundness check). The output of the tool is a set of solutions in LoLA format, which is limited by the maximum response time or by the maximum number of solutions set by the user. We used the tool to fix a sample of 152 unsound nets drawn from the BIT process library [6]. This library contains 1,386 BPMN models in five collections (A, B1, B2, B3, C), out of which 744 are unsound. We converted these 744 models into WF-nets and filtered out those models that did not result in lawful WF-nets (e.g. those models that had multiple output places). In particular, we only kept models with up to two output places, and when we found two output places we merged them into a single output place. Since this operation may introduce a lack of synchronization, we discarded models with more than two output places to minimize the impact of such artificial errors. As a result, we obtained 152 models, none of which stem from collection C. We set the maximum number of final solutions to 6, the desired confidence to 100, the maximum number of iterations for each annealing run to 1,000, the initial temperature to 144, the maximum size of each set of sample traces to 50, and all the parameters for behavioral distance and badness to 1. After the experiment each solution was checked for soundness. The tests were conducted on a PC with a 3GHz Intel Dual-Core x64, 3GB memory, running Microsoft Windows 7 and JVM v1.5. Each test was run 10 times by using the same random seed (to obtain deterministic results) and the execution times were averaged. The results are reported in Table 1.

Table 1. Experimental results

Collection | Input: No. models | Input: Avg/Max nodes | Input: Avg/Max errors | Solutions: Avg/Max nodes | Solutions: Avg/Max errors | Avg error reduction | Sound models | Avg/Max str. distance [cost] | Avg/Max time [s]
BIT A   | 48 | 46.54 / 129 | 2.21 / 5 | 45.85 / 129 | 0.74 / 20 | 73.6% | 85.7% | 4.25 / 39 | 16.52 / 171.43
BIT B1  | 22 | 29.55 / 87  | 2.55 / 6 | 27.18 / 87  | 1.07 / 29 | 75.6% | 82.9% | 5.02 / 91 | 8.75 / 100.45
BIT B2  | 35 | 22.86 / 117 | 2.54 / 9 | 21.79 / 118 | 0.69 / 9  | 84.9% | 87.3% | 3.47 / 47 | 12.08 / 217.91
BIT B3  | 47 | 20.38 / 73  | 2.77 / 9 | 21.42 / 118 | 1.26 / 24 | 72.4% | 79.6% | 4.62 / 61 | 10.85 / 156.93
The error-reduction rate of a solution is the difference between the number of errors in the initial model and the number of errors in the solution, over the number of errors in the initial model. Table 1 shows the average error-reduction rate for all solutions of each collection, which leads to an average error-reduction of 76.6% over all four collections. This indicates that the algorithm is able to fix the majority of errors in the input models. Moreover, the structural distance of the solutions is very low (4.34 on average using a cost of 1 for each edit operation). Thus, the solutions are very similar to the original model. These solutions are obtained with an average response time of 11s. Despite the large response time in some outlier cases (218s), most solutions are behaviorally better than their input models (83.9% are sound) and at least one sound solution was found for each input model. Very few cases are worse than the input model (e.g. 29 errors instead of 6 errors in the input model). This is due to the fact that we did not fine-tune the annealing parameters based on the characteristics of each input model (e.g. if the number of annealing iterations is too low w.r.t. the number of errors in the input model, a solution could contain more errors than the input model).
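Restated in code (the example values are illustrative only):

```python
def error_reduction_rate(errors_in_input, errors_in_solution):
    """Share of the input model's errors that the solution removes."""
    return (errors_in_input - errors_in_solution) / errors_in_input

print(error_reduction_rate(5, 1))   # an input model with 5 errors fixed down to 1 -> 0.8
```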
6 Conclusion This paper contributes a technique for automatically fixing behavioral errors in process models. Given an unsound WF-net and the output of its soundness check, we generate a set of alternative nets containing fewer or no behavioral errors, until a given number of desired solutions is found or a timeframe elapses. Each solution is i) sufficiently similar to the original model, ii) non-redundant in terms of fixed errors, and iii) optimal with high-confidence, i.e. it must pass a number of tests. Moreover, there is no restriction on the type of unsound WF-net that can be fixed (e.g. acyclic, non-free choice). The core of this technique is a heuristic optimization algorithm based on dominancebased MOSA. The choice of this algorithm over other optimization algorithms is mainly motivated by i) the availability of a single individual (the unsound model) rather than a population; ii) the fact that the modeler’s intentions are unknown, which determines the fuzziness of the result; and iii) by the need to independently optimize different objectives. Also, an important advantage of MOSA over other optimization algorithms is that while converging to optimal solutions, it encourages the diversification of the candidate solutions. In turn, this allows the algorithm to escape from local optima, i.e. solutions that sub-optimally improve the original model. It has been shown that dominance-based MOSA outperforms other multi-objective genetic algorithms [13]. Our adaptation of MOSA to the problem of fixing unsound process models uses three objective functions to drive the selection of candidate solutions. These functions measure the similarity of a solution to the unsound model and the severity of its errors. Moreover, we embed a notion of confidence to measure the reliability of a solution. This compensates for the approximation of our definitions of behavioral distance and badness, which is due to efficiency reasons. Clearly, more sophisticated metrics could be employed in place of our objective functions, e.g. to identify dead transitions. However one has to strike a trade-off between accuracy and computational cost. Indeed, we compute our objective functions in linear time. More generally, our work can be seen as a modular framework for improving business process models. In fact one could plug in other objective functions to serve different purposes, such as fixing non-compliance issues or increasing process performances. We prototyped our technique in a tool and validated it on a sample of industrial process models. While we cannot guarantee that the returned solutions are always sound, we found at least one sound solution for each model used in the tests, and the response times were reasonably short. However, in order to prove the usefulness of this technique, the quality of the proposed changes needs to be empirically evaluated. In future work, we plan to compare our solutions with those produced by a team of modeling experts and use these results to train the algorithm to discard certain types of perturbations. There are other interesting avenues for future work. First, the randomness of the perturbations could be controlled by exploiting crossover techniques from genetic algorithms [7], so as to obtain a new perturbation by combining correct (sub-)traces from each solution in the current Pareto-set. Second, to accelerate the identification of optimal solutions, the energy resolution could be increased by using attainment surface sampling techniques [13]. 
This would compensate for the small size of the estimated Pareto-front at the beginning, which yields coarse-grained comparisons of solutions. Finally, the structural features of the unsound model and the result of its soundness check
could be exploited to estimate the annealing parameters (e.g. the number of annealing iterations could depend on the number of errors found in the unsound model). Acknowledgments. This research is partly funded by the NICTA Queensland Lab.
References
1. van der Aalst, W.M.P., van Hee, K.M., ter Hofstede, A.H.M., Sidorova, N., Verbeek, H.M.W., Voorhoeve, M., Wynn, M.T.: Soundness of Workflow Nets: Classification, Decidability, and Analysis. Formal Aspects of Computing (2011)
2. Arcuri, A.: On the automation of fixing software bugs. In: ICSE (2008)
3. Awad, A., Decker, G., Lohmann, N.: Diagnosing and repairing data anomalies in process models. In: BPM Workshops (2009)
4. Bunke, H., Allermann, G.: Inexact graph matching for structural pattern recognition. Pattern Recognition Letters 1(4), 245–253 (1983)
5. Dijkman, R., Dumas, M., van Dongen, B., Kaarik, R., Mendling, J.: Similarity of business process models: Metrics and evaluation. Information Systems 36(2) (2011)
6. Fahland, D., Favre, C., Jobstmann, B., Koehler, J., Lohmann, N., Völzer, H., Wolf, K.: Instantaneous soundness checking of industrial business process models. In: Dayal, U., Eder, J., Koehler, J., Reijers, H.A. (eds.) BPM 2009. LNCS, vol. 5701, pp. 278–293. Springer, Heidelberg (2009)
7. Holland, J.: Adaptation in Natural and Artificial Systems. University of Michigan, Ann Arbor (1975)
8. Karp, R.M., Miller, R.E.: Parallel program schemata. Journal of Computer and System Sciences 3(2) (1969)
9. Kirkpatrick, S., Gelatt, C., Vecchi, M.: Optimization by simulated annealing. Science 220, 671–680 (1983)
10. Lohmann, N.: Correcting deadlocking service choreographies using a simulation-based graph edit distance. In: Dumas, M., Reichert, M., Shan, M.-C. (eds.) BPM 2008. LNCS, vol. 5240, pp. 132–147. Springer, Heidelberg (2008)
11. Lohmann, N., Verbeek, E., Dijkman, R.M.: Petri net transformations for business processes – a survey. TOPNOC 2, 46–63 (2009)
12. Reijers, H.A., Mans, R.S., van der Toorn, R.A.: Improved Model Management with Aggregated Business Process Models. Data Knowl. Eng. 68(2), 221–243 (2009)
13. Smith, K.I., Everson, R.M., Fieldsend, J.E., Murphy, C., Misra, R.: Dominance-based multiobjective simulated annealing. IEEE Trans. on Evolutionary Computation 12(3) (2008)
14. Suman, B.: Study of simulated annealing based algorithms for multiobjective optimization of a constrained problem. Computers & Chemical Engineering 28(9) (2004)
15. Weske, M.: Business Process Management: Concepts, Languages, Architectures. Springer, Berlin (2007)
16. Wynn, M.T., Verbeek, H.M.W., van der Aalst, W.M.P., ter Hofstede, A.H.M., Edmond, D.: Business process verification: Finally a reality! BPM Journal 15(1), 74–92 (2009)
17. Zha, H., Wang, J., Wen, L., Wang, C., Sun, J.: A workflow net similarity measure based on transition adjacency relations. Computers in Industry 61(5) (2010)
Behavioral Similarity – A Proper Metric Matthias Kunze, Matthias Weidlich, and Mathias Weske Hasso Plattner Institute at the University of Potsdam, Prof.-Dr.-Helmert-Strasse 2-3, 14482 Potsdam {matthias.kunze,matthias.weidlich,mathias.weske}@hpi.uni-potsdam.de
Abstract. With the increasing influence of Business Process Management, large process model repositories emerged in enterprises and public administrations. Their effective utilization requires meaningful and efficient capabilities to search for models that go beyond text based search or folder navigation, e.g., by similarity. Existing measures for process model similarity are often not applicable for efficient similarity search, as they lack metric features. In this paper, we introduce a proper metric to quantify process similarity based on behavioral profiles. It is grounded in the Jaccard coefficient and leverages behavioral relations between pairs of process model activities. The metric is successfully evaluated towards its approximation of human similarity assessment.
1 Introduction
Business Process Management (BPM) found its way into enterprises and public administrations alike, which led to the advent of large process model repositories. Organizations, both public and private, maintain collections of thousands of process models [1,2]: The Dutch government maintains a set of about 600 reference processes1 and the German government strives to establish a platform to access and maintain public administration process models2. The effective use of large model repositories requires means to access and manage models, and fast search methods in particular. Process modelers want to search a repository for similar processes to avoid the creation of duplicates. Also, reuse of existing knowledge, avoiding redundant work, or establishing reference models requires powerful means to search existing information. In practice, these needs are addressed only by simplistic search features, e.g., folder based navigation or text based search. Exact match search of process models is often not desired, due to the high heterogeneity of modeling languages, guidelines, and terminology. Similarity search also aims at addressing this problem. In general, the search problem is constrained by three factors: (1) the type of data that is searched for, (2) the method of comparing individual instances of this data, and (3) the specification of the search query [3]. In process model similarity search, (1) and (3) are expressed alike, i.e., one searches for process models that
1 Documentair StruktuurPlan: http://www.model-dsp.nl/
2 Nationale Prozessbibliothek: http://www.prozessbibliothek.de/
are similar to a given query model. Considering (2), there have been various proposals to assess process model similarity: based on textual information, the structure of process models, and their execution semantics. Still, most of these approaches rely on exhaustive searching, i.e., the query model is compared with each model in the repository. Well-established techniques for indexing cannot be applied in most cases, as the measures lack metric properties. That is, they are missing a distance function that features the triangle inequality and provides transitivity semantics for pairwise distances of process models. In this paper, we propose a behavioral metric that quantifies the similarity of process models. It enables efficient similarity search as it satisfies metric properties, in particular the triangle inequality [3]. The metric builds on behavioral profiles, an abstraction of the behavior of a process model [4]. These profiles capture constraints on the execution order of pairs of activities, such as exclusiveness or strict order. Using this abstraction, we propose five elementary similarity measures. Based on these measures, we construct a metric that quantifies the behavioral similarity of process models. As an evaluation, we conducted experiments using the SAP reference model and manual similarity judgments by BPM experts. Our metric shows a good approximation of human similarity assessment. We sketched the idea of leveraging behavioral profiles for similarity search in [5]. In this paper, we formally introduce a more advanced metric and evaluate it experimentally. The remainder of this paper is structured as follows: The next section gives an overview of similarity search and process model similarity. We introduce the formal preliminaries in Section 3. Our metric to quantify process model similarity is introduced in Section 4. We present the results of an experimental evaluation in Section 5. Related work is discussed in Section 6, before we conclude the paper and give an outlook on future work in Section 7.
2 Background
Search involves comparison of a query with stored data objects to decide whether these objects shall be included in the search result. Its complexity can be estimated by the number of required comparison operations multiplied by the complexity of the comparison operation. Similarity search algorithms make search more efficient by significantly reducing the number of comparison operations. Thus, we review principles thereof in Section 2.1, before we explore existing measures for process model similarity and determine whether they can be applied to the aforementioned principles in Section 2.2.

2.1 Similarity Search
Traditional databases have been tailored to execute searches on structured data efficiently. They use tree or hash based indexes to quickly access data. Contemporary data, e.g., process models, cannot be mapped to such search structures in a meaningful way, because there exists no natural ordering among the data objects and hashing does not expose any expressive classification. Instead, data
transformation and reduction techniques need to be applied, which often result in objects for which only pairwise similarity can be computed [3]. Within the last two decades, data structures and algorithms have been studied that allow for efficient similarity search in such a case, cf. [6,7]. These methods leverage transitivity properties exposed by a distance function—a complement to the similarity of data objects—a metric.

Definition 1 (Metric). A metric is a distance function d : D × D → ℝ between objects of domain D with the following properties:
– Symmetry: ∀ oi, oj ∈ D : d(oi, oj) = d(oj, oi)
– Nonnegativity: ∀ oi, oj ∈ D, oi ≠ oj : d(oi, oj) > 0
– Identity: ∀ oi, oj ∈ D : d(oi, oj) = 0 ⇔ oi = oj
– Triangle inequality: ∀ oi, oj, ok ∈ D : d(oi, ok) ≤ d(oi, oj) + d(oj, ok)
A metric space is a pair S = (D, d).

The triangle inequality enables determining minimum and maximum distances of two objects oi, ok without calculating their distance directly, if their pairwise distances to a third object oj are given. To search efficiently in large data sets, i.e., to avoid comparison with every object, the data set is split into partitions, from which some can be pruned during search. In metric spaces, a partition is established by a pivot p ∈ D and a covering radius r(p) ∈ ℝ that spans a sphere around p. All data objects oi within a distance d(oi, p) ≤ r(p) are contained in that sphere, oi ∈ T(p). Similarity search requires a query q from the same domain as the objects in the data set, i.e., q ∈ D, and a tolerance that expresses how similar the objects of a valid result may be to q—a query radius r(q) ∈ ℝ. Fig. 1(a) shows a pivot p with its covering radius r(p) and all elements of T(p) (solid circle), along with the query model q and its similarity tolerance (dotted circle) spanned by r(q). Search in metric spaces is efficient, as some partitions are excluded without comparing any contained element with the given query, i.e., the number of comparison operations is reduced compared to exhaustive search. For each partition, the distance between the pivot and a query model d(p, q) is calculated by the distance metric. Because the distances between q and each object in T(p), including p, obey the triangle inequality, one can prune partitions from further examination:
– r(q) + r(p) < d(p, q): All objects in T(p) are further away from q than r(q), i.e., they cannot be in the result set, cf. Fig. 1(a). The distance from q to any other element in T(p) does not have to be calculated.
– r(q) − r(p) ≥ d(p, q): T(p) is completely included in the sphere around q, cf. Fig. 1(b). Thus, all elements satisfy the similarity constraint and need not be compared with the query unless a ranking of the search result is desired.
– r(q) + r(p) ≥ d(p, q): The spheres of p and q intersect, cf. Fig. 1(c). Identification of the objects that lie in this intersection requires exhaustive search of T(p).
Pruning complete partitions from search reduces the time complexity of search, i.e., the number of comparison operations. Indexing techniques use these capabilities to implement efficient search [6,7]. Here, the limiting factor is the complexity of the comparison operation, i.e., the distance metric.
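A minimal sketch of this three-way pruning decision, assuming the pivot distance and the two radii are already known (the numbers in the example call are made up):

```python
def partition_action(d_pq, r_p, r_q):
    """Decide how to treat partition T(p) for a query q with radius r(q), given only the
    pivot distance d(p, q); the three cases mirror the pruning rules listed above."""
    if d_pq > r_q + r_p:
        return "exclude"    # no element of T(p) can be within r(q) of q
    if d_pq <= r_q - r_p:
        return "include"    # T(p) lies entirely inside the query sphere
    return "scan"           # the spheres intersect: T(p) must be searched exhaustively

print(partition_action(d_pq=0.9, r_p=0.2, r_q=0.5))   # 'exclude'
```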
Fig. 1. Metric space partition T (p) (solid circle) with pivot p and similarity query (dotted circle) with query model q. (a) Exclusion, (b) Inclusion, (c) Intersection.
2.2 Process Model Similarity
Similarity of process models is evaluated based on three complementary aspects, i.e., the element labels, the graph structure, or the execution semantics [8]. To determine the similarity of labels, techniques from schema matching [9] and ontology matching [10] are used. Labels are compared on the syntactic or semantic level. The string edit distance (SED) [11] is a well-known example for the former. It counts the minimal number of atomic character operations (insert, delete, update) needed to transform one string into another. For instance, the SED for the labels ‘file order’ and ‘filed the order’ is five. Other measures do not compare strings as a whole, but tokenize strings into a bag of terms. Then, distances in a vector space are used to judge the similarity of two strings [12]. Various approaches to label matching rely on Natural Language Processing techniques [13] to compensate for heterogeneous terminology, e.g., term stemming, stop-word elimination, or external knowledge such as WordNet [14]. Difference in the granularity of two models can be handled by automatic approaches only to very limited degree. Therefore we resort to comparing only single node’s labels. Most of the existing approaches to structural or behavioral process model similarity incorporate some kind of label matching to judge the similarity of model elements. Approaches to structural similarity of process models leverage the maximum common sub-graph isomorphism and the graph edit distance (GED). The latter defines the minimal number of atomic graph operations (substitute node, insert/delete node, (un)grouping nodes, substitute edge, insert/delete edge) needed to transform one graph into another [15]. The GED problem is NP-hard [16], so that search algorithms and heuristics are applied to compute the distance. The GED has been used to score the similarity of process models in [17,2]. In the same vein, structural differences between process models may be grouped to change operations to determine the similarity of two process models [18]. Further, classification of process model elements according to the cardinality of their incoming and outgoing flows has been used for similarity assessment [19]. Similarity measures have also been defined on the sets of all traces of process models. The size of the intersection of these sets relative to the overall number of traces would be a straight-forward example for such a behavioral measure [8]. Further, edit distances have been defined for the behavior of a process model
Table 1. Overview of process model similarities

Approach                 | Aspect    | Symmetry | Nonnegativity | Identity | Triangle inequality
Minor et al. [24]        | Structure | yes | yes | yes | no
Li et al. [18]           | Structure | yes | yes | yes | no
Ehrig et al. [25]        | Structure | yes | yes | yes | no
Dijkman et al. [17]      | Structure | yes | yes | yes | yes
Eshuis and Grefen [22]   | Behavior  | yes | yes | yes | yes
Aalst et al. [26]        | Behavior  | no  | yes | yes | no
Wombacher and Rozie [20] | Behavior  | yes | yes | yes | no
Lu and Sadiq [27]        | Structure | yes | yes | yes | yes
Dongen et al. [23]       | Behavior  | yes | yes | yes | no
Nejati et al. [21]       | Behavior  | yes | yes | yes | no
Yan et al. [19]          | Structure | yes | yes | yes | no
based on all traces, an automaton encoding the language, or an n-gram representation of all traces [20]. If the process behavior is defined by a transition system, the degree to which two systems simulate each other can be used as a similarity measure [21]. Other measures exploit behavioral abstractions. Behavioral relations defined over pairs of activities, e.g., order and exclusiveness, provide a means to enrich structural matching [22]. Causal footprints have been proposed to approximate the behavior of process models to assess their similarity [23]. Those footprints capture causal dependencies for activities by sets of causal predecessors and successors for each activity. For each of the discussed similarities, Table 1 lists whether the measures meet the metric properties, cf. Definition 1. For none of the approaches, the authors actually proved metric properties, so that the Table 1 shows our informal evaluation of the proposed measures. Most measures turn out to be semi-metrics, i.e., violate the triangle inequality. Hence, they cannot be applied to search a metric space efficiently. There are few notable exceptions. The measure based on the graph edit distance [17] features the metric properties, see also [28]. Still, this measure is restricted to the model structure. The pairwise set similarity of process model features (e.g., nodes and edges) applied in [27] yields a metric, but neglects the graph structure completely. Although not proven formally, the measure proposed for BPEL processes [22] seems to satisfy the triangle inequality. We later discuss that our metric can be seen as a generalization of this approach. We conclude that there are virtually no measures available that are based on the behavior of a process model and satisfy all metric properties.
3 Preliminaries
For our work, we rely on a notion of a process model that comprises activity nodes and control nodes. It captures the commonalities of many process modeling languages, such as BPMN.
Definition 2 (Process Model). A process model is a tuple P = (A, s, e, C, N, F, T) where:
– A is a finite non-empty set of activity nodes, C is a finite set of control nodes, and N = A ∪ C is a set of nodes with A ∩ C = ∅,
– F ⊆ N × N is the flow relation,
– •n = {n′ ∈ N | (n′, n) ∈ F} and n• = {n′ ∈ N | (n, n′) ∈ F} denote direct predecessors and successors; we require ∀ a ∈ A : |•a| ≤ 1 ∧ |a•| ≤ 1,
– s ∈ A is the only start node, •s = ∅, and e ∈ A is the only end node, e• = ∅,
– (N, F ∪ {(e, s)}) is a strongly connected graph,
– T : C → {and, xor} associates each control node with a type.

Our notion requires dedicated start (s) and end (e) activities. Refactoring techniques may be applied to normalize models that do not match this assumption [29]. For a process model, we assume trace semantics. The behavior of a process model P = (A, s, e, C, N, F, T) is a set of traces TP. A trace is a list σ = ⟨s, a1, a2, . . .⟩, ai ∈ A for all 0 < i. It represents the order of execution of activities, as it follows from common Petri net-based formalizations [30]. Our metric relies on behavioral profiles as a behavioral abstraction. These profiles capture behavioral characteristics of a process model by means of relations between activity pairs. These relations are grounded in weak order, which holds between two activities if both are observed in a trace in a certain order.

Definition 3 (Weak Order). Let P = (A, s, e, C, N, F, T) be a process model and TP its set of traces. The weak order relation ≻P ⊆ (A × A) contains all pairs (x, y), such that there exists a trace σ = ⟨a1, a2, . . .⟩ in TP and there exist two indices j, k ∈ {1, 2, . . .} with j < k for which aj = x and ak = y hold.

Using the notion of weak order, we define the behavioral profile as follows.

Definition 4 (Behavioral Profile). Let P = (A, s, e, C, N, F, T) be a process model. A pair (x, y) ∈ (A × A) is in the following relations:
– The strict order relation ⇝P, iff x ≻P y and y ⊁P x.
– The exclusiveness relation +P, iff x ⊁P y and y ⊁P x.
– The interleaving order relation ||P, iff x ≻P y and y ≻P x.
BP = {⇝P, +P, ||P} is the behavioral profile of P.

For each pair (x, y) in strict order, the reverse strict order relation comprises the inverse pair (y, x), i.e., x ⇝P y ⇔ y ⇝P⁻¹ x. Behavioral profiles show a number of properties that we will exploit for similarity analysis. Together with the reverse strict order relation, the relations of the behavioral profile partition the Cartesian product of activities [4]. Further, behavioral profiles are computed efficiently for the class of process models introduced above under the assumption of soundness. Soundness is a correctness criterion that guarantees the absence of behavioral anomalies, see [31]. Our notion of a process model translates into a free-choice WF-net [30], a dedicated structural sub-class of Petri nets, which may involve adding fresh transitions to the WF-net. Thus, we can apply the soundness criterion to process models directly. We also reuse the computation techniques for behavioral profiles introduced for sound free-choice WF-nets. This allows for
the computation of behavioral profiles for sound process models in cubic time to the size of the model [4].
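Assuming the weak order relation ≻P has already been computed (e.g. structurally for sound free-choice WF-nets, as in [4]), deriving the relations of Definition 4 is a direct case distinction; the following sketch uses plain Python sets:

```python
def behavioral_profile(activities, weak_order):
    """Derive the relations of Definition 4 from a weak order relation given as a set of
    activity pairs; reverse strict order is simply the mirror of strict order."""
    strict, exclusive, interleaving = set(), set(), set()
    for x in activities:
        for y in activities:
            xy, yx = (x, y) in weak_order, (y, x) in weak_order
            if xy and not yx:
                strict.add((x, y))          # strict order
            elif not xy and not yx:
                exclusive.add((x, y))       # exclusiveness (includes (x, x) for non-repeatable x)
            else:
                if xy and yx:
                    interleaving.add((x, y))  # interleaving order
    return {"strict": strict, "exclusive": exclusive, "interleaving": interleaving}
```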
4 A Spectrum of Behavioral Similarity Metrics
This section introduces how behavioral profiles are used to measure similarity between a pair of process models. We propose a set of five elementary similarity measures and explain how we construct a metric from these measures.

4.1 Similarity Based on Behavioral Profiles
Behavioral profiles as introduced in the previous section provide a behavioral abstraction. The notion focuses on the order of potential execution of activities and neglects other behavioral details. Causality between activities and cardinality constraints on their execution are abstracted, cf. [32]. Also, the interleaving order relation does not differentiate whether activities share a loop or are on concurrent paths. Findings from experiments with process variants [4] and empirical work on the perception of consistency between process models by process analysts [33] suggest that behavioral profiles capture a significant share of the behavioral characteristics of a process model. This provides evidence that the information loss implied by the abstraction can be traded for the computational efficiency of the search technique. We later challenge this assumption using an experimental setup. Similarity assessment based on behavioral profiles requires matching activities. Given two process models, correspondences between activities have to be determined. These correspondences are used to quantify the overlap of behavior in both models. In Section 2.2, we provided an overview of techniques to identify corresponding pairs of activities. Our work does not focus on this aspect of similarity, but rather on behavioral properties. Therefore, we assume these correspondences to be given hereafter. To keep the formalization of our metrics concise, we abstract from such correspondences and assume corresponding activities to be identical. In other words, given two process models P and Q with their sets of activities AP and AQ, a correspondence between an activity in P and an activity in Q is manifested as the existence of an activity a ∈ (AP ∩ AQ).

4.2 Elementary Similarity Measures
A process model P resembles another model Q in certain behavioral aspects if they overlap in their behavioral profiles BP and BQ, respectively. The larger this overlap is, the more similar we assume these models to be. We quantify similarity by the well-known Jaccard coefficient for two sets: sim(A, B) = |A ∩ B| / |A ∪ B|. First, this measure can easily be applied to behavioral profiles, since each relation of a behavioral profile is essentially a set. Second, it can be translated into a metric d(A, B) = 1 − sim(A, B) [34] to enable efficient similarity search. Fig. 2 shows two order handling process models (m1, m2), in which an order is received, checked, and fulfilled or rejected. The correspondences between activities are illustrated by equal labels in both models, and are represented through
Fig. 2. Order management process models m1 and m2 with their behavioral profile matrices Bm1 and Bm2
the indices a to g. The behavioral profiles (Bm1, Bm2) are depicted as matrices over the activities, e.g., a and g are in strict order, since in all traces where both activities appear, a happens before g. Exclusiveness is the strictest relation of a behavioral profile, because it enforces the absence of a co-occurrence of two activities within one process instance. The corresponding similarity quantifies how many common pairs exist in two process models that feature the same exclusiveness relation.

Definition 5 (Exclusiveness Similarity). Let P, Q be process models and +P, +Q the exclusiveness relations of their respective behavioral profiles BP, BQ. We define the Exclusiveness Similarity
sim+(BP, BQ) = |+P ∩ +Q| / |+P ∪ +Q|

See, for example, the activities c and d in both process models of Fig. 2. From the scenario, it is prohibitive to reject an order and at the same time file and fulfill it. The only difference between m1 and m2 with regard to the exclusiveness relation is the absence of b in m2, because this activity is exclusive to itself. This yields a high similarity sim+(Bm1, Bm2) = 14/15 ≈ 0.933.

In most cases, the order of tasks of a business process is reflected by the strict order of these activities in the process model. The strict order similarity strives to reward a large overlap of two strict order relations, whereas order violations are penalized.

Definition 6 (Strict Order Similarity). Let P, Q be process models and ⇝P, ⇝Q the strict order relations of their respective behavioral profiles BP, BQ. We define the Strict Order Similarity
sim⇝(BP, BQ) = |⇝P ∩ ⇝Q| / |⇝P ∪ ⇝Q|

Due to x ⇝P y ⇔ y ⇝P⁻¹ x, cf. Section 3, it suffices to incorporate only the strict order relation into the similarity measure. The reverse relation is implicitly covered. This can be seen in the behavioral profile matrix as the strict order
relations are mirrored along the diagonal axis, e.g., in m2 the pair (f,g) is in reverse strict order and (g,f) is in strict order. The exemplary process models m1 and m2 show a significant difference in their strict order relations, because the activities e and f are in interleaving order in m1, whereas in m2 they are in strict order. Additionally, activity b is missing in m2, which yields five additional pairs for the strict order relation in Bm1. The strict order relations between (a,c), (a,d), (a,e), (a,f), and (a,g) are, however, not affected by that. This leads to the following strict order similarity: sim⇝(Bm1, Bm2) = 6/16 = 0.375.

Interleaving order is the weakest imposition on the relations between two activities, since it only states that they may be executed in any order in one process instance. Thus, the interleaving order similarity also rewards matching pairs if they are, e.g., executed in parallel in one process and as part of a control flow cycle in the other.

Definition 7 (Interleaving Order Similarity). Let P, Q be process models and ||P, ||Q the interleaving order relations of their respective behavioral profiles BP, BQ. We define the Interleaving Order Similarity
sim||(BP, BQ) = | ||P ∩ ||Q | / | ||P ∪ ||Q |

The process models of Fig. 2 share interleaving order relations for (d,e), (e,d), (d,f) and (f,d). This yields sim||(Bm1, Bm2) = 4/8 = 0.5.

Not in all cases does the ordering of activities in a model correspond to the actual order of execution of these activities in practice. Their order may simply be one possible sequence sprung from the habit of the process modeler. To address such cases, we extend the elementary similarities above to increase the tolerance of the measures. The strict order similarity only rewards pairs of activities that are executed in the same order. If two activities appear both in each trace of two distinct process models but are executed in inverse order respectively, it is plausible that they do not depend on each other, but rather have simply been modeled in a sequence that seemed suitable to the process modeler. Thus, we propose the extended strict order similarity that will reward these pairs of activities.

Definition 8 (Extended Strict Order Similarity). Let P, Q be process models and ⇝P, ⇝Q the strict order relations, ⇝P⁻¹, ⇝Q⁻¹ the reverse strict order relations of their respective behavioral profiles BP, BQ. We define the Extended Strict Order Similarity
sim⇝′(BP, BQ) = |(⇝P ∪ ⇝P⁻¹) ∩ (⇝Q ∪ ⇝Q⁻¹)| / |(⇝P ∪ ⇝P⁻¹) ∪ (⇝Q ∪ ⇝Q⁻¹)|
The above consideration applies to activities f and g in Fig. 2: They are in strict order in m1 and in reverse strict order in m2. In the given scenario, it seems reasonable to assume that there is no explicit order constraint between them. Thus, these pairs should contribute to the similarity of both models: sim⇝′(Bm1, Bm2) = 14/30 ≈ 0.467.

The interleaving order relation will only identify pairs of activities that are neither in strict order nor exclusive to each other. However, the execution of a
pair of activities in interleaving order in one model also resembles the execution of the same activities in a static sequence in another model. That is, interleaving order supersedes strict order in flexibility. However, this is not supported by the interleaving order similarity. Thus, we suggest the extended interleaving order similarity, which rewards the containment of strict order execution in interleaving order, accounting for strict order in both behavioral profiles of models P and Q in both directions, i.e., in forward and reverse strict order.

Definition 9 (Extended Interleaving Order Similarity). Let P, Q be process models and ⇝P, ⇝Q the strict order relations, ⇝P⁻¹, ⇝Q⁻¹ the reverse strict order relations, ||P, ||Q the interleaving order relations of their respective behavioral profiles BP, BQ. We define the Extended Interleaving Order Similarity
sim||′(BP, BQ) = |(⇝P ∪ ⇝P⁻¹ ∪ ||P) ∩ (⇝Q ∪ ⇝Q⁻¹ ∪ ||Q)| / |(⇝P ∪ ⇝P⁻¹ ∪ ||P) ∪ (⇝Q ∪ ⇝Q⁻¹ ∪ ||Q)|
As an example, refer to the activities e and f in Fig. 2. They are in interleaving order in m1. In an instance of this process they may be executed in the order (e,f), which resembles the strict order relation of these activities in m2. Analogously, this holds for activities d and g, which yields sim||′(Bm1, Bm2) = 20/36 ≈ 0.556. In contrast to pure ordering features, i.e., strict order and interleaving order, we do not relax the exclusiveness similarity, since exclusiveness is a very strong statement about the dependency and correlation of activities.
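All five elementary measures reduce to Jaccard coefficients over relation sets; a minimal sketch, with relations represented as Python sets of activity pairs (the function names are ours):

```python
def jaccard(r1, r2):
    """Jaccard coefficient of two relations given as sets of activity pairs."""
    union = r1 | r2
    return len(r1 & r2) / len(union) if union else 1.0

def reverse(rel):
    return {(y, x) for (x, y) in rel}

def extended_strict_order_sim(so_p, so_q):
    """Definition 8: compare strict order united with its reverse."""
    return jaccard(so_p | reverse(so_p), so_q | reverse(so_q))

def extended_interleaving_sim(so_p, il_p, so_q, il_q):
    """Definition 9: additionally count interleaving order pairs."""
    return jaccard(so_p | reverse(so_p) | il_p, so_q | reverse(so_q) | il_q)
```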
4.3 Aggregated Metric for Behavioral Profiles
Based on the elementary similarity measures defined above, we construct an aggregated similarity metric as follows. Each elementary similarity translates into an elementary metric, dh(BP, BQ) = 1 − simh(BP, BQ) for all h ∈ {+, ⇝, ||, ⇝′, ||′}, as explained in Section 4.2 and [34]. Then, we sum up these values and assign a weight that accounts for the respective metric's impact on the overall metric. We postulate B as the universe of all possible behavioral profiles. In practice this matches the behavioral profiles of all models within a repository.

Definition 10 (Behavioral Profile Metric). B = (B, dB) is a metric space of behavioral profiles B, where the behavioral profile metric dB : B × B → ℝ is a metric,
dB(BP, BQ) = 1 − Σh wh · simh(BP, BQ)
with h ∈ {+, ⇝, ||, ⇝′, ||′} and weighting factors wh ∈ ℝ, 0 < wh < 1, such that Σh wh = 1.
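Spelled out in the same style, the aggregated metric of Definition 10 might be computed as follows; the dictionary keys and weight names are our own, and the weights are assumed to be positive and to sum to one:

```python
def behavioral_profile_distance(bp_p, bp_q, weights):
    """Weighted sum of the elementary distances 1 - sim_h (Definition 10).
    bp_p / bp_q: dicts with the relation sets 'exclusive', 'strict', 'interleaving'."""
    jac = lambda a, b: len(a & b) / len(a | b) if (a | b) else 1.0
    rev = lambda r: {(y, x) for (x, y) in r}
    ext = lambda bp: bp["strict"] | rev(bp["strict"])
    sims = {
        "ex": jac(bp_p["exclusive"], bp_q["exclusive"]),
        "so": jac(bp_p["strict"], bp_q["strict"]),
        "il": jac(bp_p["interleaving"], bp_q["interleaving"]),
        "es": jac(ext(bp_p), ext(bp_q)),
        "ei": jac(ext(bp_p) | bp_p["interleaving"], ext(bp_q) | bp_q["interleaving"]),
    }
    return 1.0 - sum(weights[h] * sims[h] for h in sims)
```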
In order to use this aggregate metric for similarity search, it has to be proven that it is indeed a metric as established in Definition 1.

Theorem 1. The weighted sum D(oi, oj) = Σh wh · dh(oi, oj) of elementary metrics dh is a metric if ∀ h ∈ [1...n] : wh ∈ ℝ ∧ 0 < wh < 1, ∀ oi, oj, ok ∈ D, and dh(oi, oj) ∈ ℝ.
Proof. D(oi , oj ) holds the properties symmetry, nonnegativity, identity, and triangle inequality. Symmetry: From dh (oi , oj ) = dh (oj , oi ) it follows directly from the commutativity of the summation operation that D(oi , oj ) = D(oj , oi ). Nonnegativity: From dh (oi , oj ) ≥ 0 and wh > 0 it follows that wh · dh (oi , oj ) ≥ 0 and thus their sum D(oi , oj ) ≥ 0. Identity: From dh (oi , oi ) = 0 it follows directly that D(oi , oi ) = 0 and if D(oi , oj ) = 0 all dh (oi , oj ) = 0 because wh · dh (oi , oj ) ≥ 0 with wh > 0. From dh (oi , oj ) = 0 however, it follows that oi = oj and thus D(oi , oj ) = 0 ⇔ oi = oj . Triangle Inequality: From dh (oi , ok ) ≤ dh (oi , oj ) + dh (oj , ok ) and wh > 0 it follows wh ·dh (oi , ok ) ≤ wh ·dh (oi , oj )+wh ·dh (oj , ok ) and through summation D(oi , ok ) ≤ D(oi , oj ) + D(oj , ok ).
5 Experimental Evaluation
This section presents an experimental evaluation of the proposed similarity measures and the aggregated metric. We describe the setup to challenge them in Section 5.1. Section 5.2 presents the results of our evaluation.

5.1 Setup
Our experiment relies on a collection of process models that has been used to test similarity measures in [2]. We took the test set for the ‘evaluation with homogeneous labels’ [2] as our focus is on behavioral similarity. The collection comprises 100 randomly selected models from the SAP reference model, called document models. 10 models have been selected as query models. Two of those models were left unchanged, the others underwent slight modifications such as extracting a subgraph or changing types of control flow routing elements, see also [2]. The authors of [2] provide a relevance mapping that assigns to each query model, the document models that were found relevant by humans (this scoring involved process modeling experts). Using this information, similarity measures can be tested for their approximation of the human similarity assessment. Such tests have been conducted for measures based on the graph edit distance and based on causal footprints [2], which we discussed in Section 2.2. For our experiment, 15 out of 100 models had to be sorted out because of ambiguous instantiation semantics. These models have multiple start events that are not merged into a single join connector, which is also referred to as a start join [35]. We manually checked all models without a start join and identified 15 models with ambiguous instantiation semantics. One of these models was a query model, so that our setup comprises 85 document models and nine query models. Behavioral profiles do not discover order constraints in cyclic structures. However, in our collection, we observed only 3 models with control flow cycles. We evaluated the similarity of model elements based on the string edit distance similarity of their labels. If this similarity exceeds a threshold (we used
a threshold of 0.6), two elements are considered to be a matching pair. This proves very effective for the given set of process models, which shows a very homogeneous vocabulary, cf. [2].
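The matching step can be sketched as follows; this is a minimal illustration assuming a plain Levenshtein-based normalisation, since the exact normalisation used in the experiment is not spelled out here, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two label strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def label_similarity(a: str, b: str) -> float:
    """String-edit-distance similarity normalised to [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a.lower(), b.lower()) / max(len(a), len(b))


def matching_pairs(labels_p, labels_q, threshold=0.6):
    """All cross-model label pairs whose similarity exceeds the threshold."""
    return [(p, q) for p in labels_p for q in labels_q
            if label_similarity(p, q) > threshold]
```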
5.2 Evaluation
First, we evaluated measures that are grounded in one of the elementary similarities presented in Section 4.2. Hence, only one aspect, e.g., exclusiveness, is considered for a similarity assessment. For all nine query models, we obtained a ranked list of query results. We computed the average precision, i.e., the average of the precision values obtained after each relevant document model is found [36]. Aggregating the results for all queries using the arithmetic mean yields the following mean average precision values:
– Exclusiveness Similarity (Ex): 0.70
– Strict Order Similarity (So): 0.75
– Interleaving Order Similarity (In): 0.29
– Extended Strict Order Similarity (ES): 0.75
– Extended Interleaving Order Similarity (EI): 0.76
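For reference, average precision and its mean over all queries can be computed as in the following sketch. It uses the standard definition that divides by the number of relevant documents, which coincides with the description above whenever all relevant models appear in the ranking; the function names are hypothetical.

```python
def average_precision(ranking, relevant):
    """Average of the precision values observed at the rank of each relevant model."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, model in enumerate(ranking, start=1):
        if model in relevant:
            hits += 1
            precisions.append(hits / rank)
    # Standard definition: divide by the number of relevant models.
    return sum(precisions) / len(relevant) if relevant else 0.0


def mean_average_precision(results):
    """Arithmetic mean over (ranking, relevant set) pairs, one per query model."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)
```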
We obtained the best results with a similarity assessment based on strict order and exclusiveness. The difference between strict order similarity and interleaving order similarity shows that the good results obtained for extended interleaving order similarity are grounded in the strict order. This is reasonable against the background discussed in Section 4.2: exclusiveness and strict order can be seen as the strictest relations of the behavioral profile, so that they are the most distinguishing.

The mean average precision provides a rather compact view on the quality of the similarity assessment. We therefore also investigated the relation between precision and recall for all metrics. For ten recall values in the interval between zero and one, Fig. 3(a) depicts the precision values obtained with the different metrics. Even though the strict order similarity yields good overall results, it does not achieve the best precision values for all recall levels. This suggests applying an aggregated similarity that combines several elementary metrics in a weighted fashion, as introduced in Section 4.3.

We evaluated aggregated metrics based on behavioral profiles in a second experiment. The result is illustrated in Fig. 3(b) as a precision-recall curve for three metrics. First, as a baseline metric, we used the set similarity applied to the matching activities of two process models (data series Baseline). Second, we applied an aggregated metric that combines all elementary similarities with equal weights (Ex-So-In-ES-EI). Third, based on the results obtained in the previous experiment, we chose the exclusiveness and extended strict order similarities and combined them into an aggregated metric, testing a spectrum of weights for the two similarities. The data series 1Ex-4ES in Fig. 3(b) relates to the variant that assigns four times the weight to the extended strict order similarity.

Fig. 3. (a) Precision-recall curve for metrics based on elementary similarities, (b) precision-recall curve for a baseline metric and two aggregated metrics

Fig. 3(b) illustrates that our metrics perform better than the simple structural assessment using set similarity for matching activities. Still, the increase in precision for high recall levels is modest, which is likely caused by the homogeneity of the used model collection. Also, the precision-recall curves suggest that, for this model collection, the differences between the two aggregated metrics are rather small.

We cannot compare the results obtained in our experiments directly with the measures based on the graph edit distance and on causal footprints presented in [2]. Due to instantiation issues, we had to exclude 15 models from our experiment, one of them being a query model. We aim at a direct comparison in future work.
6 Related Work
We elaborated on similarity measures for process models in Section 2.2. The measure presented in [22] needs further discussion against the background of our metric. That measure is based on behavioral relations that are derived from the tree structure of a BPEL process. The Jaccard coefficient over all these relations is used as a similarity metric. The relations of [22] virtually coincide with the relations of the behavioral profile used in our metrics. Still, our approach is a generalization of the metric presented in [22]: we (1) leverage the generic concept of a behavioral profile that is independent of a process description language, (2) provide more fine-granular metrics that take the interplay of the different relations into account, and (3) prove the metric properties.

One of the foremost applications of process similarity, as stated by the respective authors, is search in process model collections. Numerous approaches to advanced query languages for process models exist, e.g., [37,38,39], but none of them addresses efficient search within large collections of process models. Recent approaches to efficient process model search apply a two-phase approach: an index is used in a first phase to identify candidate models that
could match a given query model, whereas in the second phase the final result set and ranking are established through exhaustive comparison within the candidate set. Jin et al. [40] propose a path-based index for exact search of process subgraphs, i.e., a hash-based index over concatenated labels of process traces, to speed up the first phase. Yan et al. [19] leverage process features, i.e., characteristic structures such as sequences, splits, and joins, to narrow the candidate list in the first phase of similarity search. To our knowledge, we are the first to apply the metric space approach to behavioral process model similarity, which enables efficient search without an exhaustive comparison of the whole data set. We showed earlier that metrics can be successfully used to implement efficient similarity search over process model structure [28].

Finally, behavioral profiles have been used to judge the quality of process model alignments, i.e., the quality of correspondences between the activities of two process models [32,4]. Here, the preservation of the behavioral relations for corresponding pairs of activities in two process models is quantified. Although this measure may also be utilized to assess process model similarity, it would have several drawbacks. It does not quantify the size of the overlap in terms of shared activities of two process models, which is counter-intuitive for process model similarity search. In addition, the existing measures are not metrics.
7 Conclusion
We motivated our work with efficient similarity search for process models, which requires metrics. In this paper, we introduced such a proper metric to quantify the behavioral similarity of process models. This metric is built from five elementary similarity measures that are based on behavioral profiles and the Jaccard coefficient. We evaluated the metric experimentally with respect to its approximation of human similarity assessment. For our evaluation, we focused on the appropriateness of the metric with respect to human similarity perception instead of its application for similarity search. Our results indicate that the metric is well suited for assessing the similarity of process models, even though we apply a behavioral abstraction. Further, we already showed that similarity search based on a metric scales well with process model structures and saves up to 80% of the comparison operations compared to exhaustive search [28]. Similar results can be expected with the metric proposed in this paper.

In future work, we aim at comparing the proposed metric with the one based on the graph edit distance [17,2] in detail. We also want to investigate the combination of both metrics and their application to more heterogeneous model collections. This may give further insights on how to tune the aggregation weights of our metric. Finally, activity matching has been out of the scope of this paper. We established such a mapping based on the syntactic similarity of labels before calculating the behavioral similarity. Other techniques, e.g., the use of thesauri, may lead to better activity alignments and shall be examined in future experiments.
References

1. Rosemann, M.: Potential pitfalls of process modeling: part B. Business Process Management Journal 12(3), 377–384 (2006)
2. Dijkman, R., Dumas, M., van Dongen, B., Käärik, R., Mendling, J.: Similarity of business process models: Metrics and evaluation. Inf. Syst. 36(2), 498–516 (2011)
3. Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search: The Metric Space Approach. Springer-Verlag New York, Inc., Secaucus (2005)
4. Weidlich, M., Mendling, J., Weske, M.: Efficient Consistency Measurement based on Behavioural Profiles of Process Models. IEEE Trans. Softw. Eng. (2011)
5. Kunze, M., Weidlich, M., Weske, M.: m3 - A Behavioral Similarity Metric for Business Processes. In: ZEUS 2011. CEUR-WS, pp. 89–95 (2011)
6. Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Searching in Metric Spaces. ACM Comput. Surv. 33(3), 273–321 (2001)
7. Hjaltason, G.R., Samet, H.: Index-driven similarity search in metric spaces (Survey Article). ACM Trans. Database Syst. 28(4), 517–580 (2003)
8. Dumas, M., García-Bañuelos, L., Dijkman, R.M.: Similarity Search of Business Process Models. IEEE Data Eng. Bull. 32(3), 23–28 (2009)
9. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001)
10. Euzenat, J., Shvaiko, P.: Ontology matching. Springer, Heidelberg (2007)
11. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)
12. Salton, G., Wong, A., Yang, C.S.: A Vector Space Model for Automatic Indexing. Commun. ACM 18(11), 613–620 (1975)
13. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge (1999)
14. Miller, G.A.: WordNet: A Lexical Database for English. Commun. ACM 38(11), 39–41 (1995)
15. Bunke, H., Allermann, G.: Inexact graph matching for structural pattern recognition. Pattern Recognition Letters 1(4), 245–253 (1983)
16. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, New York (1979)
17. Dijkman, R.M., Dumas, M., García-Bañuelos, L.: Graph Matching Algorithms for Business Process Model Similarity Search. In: Dayal, U., Eder, J., Koehler, J., Reijers, H.A. (eds.) BPM 2009. LNCS, vol. 5701, pp. 48–63. Springer, Heidelberg (2009)
18. Li, C., Reichert, M., Wombacher, A.: On Measuring Process Model Similarity Based on High-Level Change Operations. In: Li, Q., Spaccapietra, S., Yu, E., Olivé, A. (eds.) ER 2008. LNCS, vol. 5231, pp. 248–264. Springer, Heidelberg (2008)
19. Yan, Z., Dijkman, R.M., Grefen, P.: Fast Business Process Similarity Search with Feature-Based Similarity Estimation. In: Meersman, R., Dillon, T.S., Herrero, P. (eds.) OTM 2010. LNCS, vol. 6426, pp. 60–77. Springer, Heidelberg (2010)
20. Wombacher, A., Rozie, M.: Evaluation of Workflow Similarity Measures in Service Discovery. In: Service Oriented Electronic Commerce. LNI, vol. 80, pp. 51–71. GI (2006)
21. Nejati, S., Sabetzadeh, M., Chechik, M., Easterbrook, S., Zave, P.: Matching and Merging of Statecharts Specifications. In: ICSE 2007, pp. 54–64. IEEE Computer Society, Washington, DC, USA (2007)
22. Eshuis, R., Grefen, P.: Structural Matching of BPEL Processes. In: ECOWS 2007, pp. 171–180. IEEE Computer Society, Washington, DC, USA (2007)
23. van Dongen, B., Dijkman, R., Mendling, J.: Measuring Similarity between Business Process Models. In: Bellahsène, Z., Léonard, M. (eds.) CAiSE 2008. LNCS, vol. 5074, pp. 450–464. Springer, Heidelberg (2008)
24. Minor, M., Tartakovski, A., Bergmann, R.: Representation and Structure-Based Similarity Assessment for Agile Workflows. In: Weber, R.O., Richter, M.M. (eds.) ICCBR 2007. LNCS (LNAI), vol. 4626, pp. 224–238. Springer, Heidelberg (2007)
25. Ehrig, M., Koschmider, A., Oberweis, A.: Measuring Similarity between Semantic Business Process Models. In: APCCM 2007, pp. 71–80. Australian Computer Society, Inc. (2007)
26. van der Aalst, W., de Medeiros, A.A., Weijters, A.: Process Equivalence: Comparing Two Process Models Based on Observed Behavior. In: Dustdar, S., Fiadeiro, J.L., Sheth, A.P. (eds.) BPM 2006. LNCS, vol. 4102, pp. 129–144. Springer, Heidelberg (2006)
27. Lu, R., Sadiq, S.: On the Discovery of Preferred Work Practice Through Business Process Variants. In: Parent, C., Schewe, K.-D., Storey, V.C., Thalheim, B. (eds.) ER 2007. LNCS, vol. 4801, pp. 165–180. Springer, Heidelberg (2007)
28. Kunze, M., Weske, M.: Metric Trees for Efficient Similarity Search in Process Model Repositories. In: IW-PL 2010, Hoboken, NJ (September 2010)
29. Vanhatalo, J., Völzer, H., Leymann, F., Moser, S.: Automatic Workflow Graph Refactoring and Completion. In: Bouguettaya, A., Krueger, I., Margaria, T. (eds.) ICSOC 2008. LNCS, vol. 5364, pp. 100–115. Springer, Heidelberg (2008)
30. Lohmann, N., Verbeek, E., Dijkman, R.M.: Petri Net Transformations for Business Processes - A Survey. T. Petri Nets and Other Models of Concurrency 2, 46–63 (2009)
31. van der Aalst, W.: Workflow Verification: Finding Control-Flow Errors Using Petri-Net-Based Techniques. In: van der Aalst, W.M.P., Desel, J., Oberweis, A. (eds.) Business Process Management. LNCS, vol. 1806, pp. 161–183. Springer, Heidelberg (2000)
32. Weidlich, M., Polyvyanyy, A., Mendling, J., Weske, M.: Efficient Computation of Causal Behavioural Profiles Using Structural Decomposition. In: Lilius, J., Penczek, W. (eds.) PETRI NETS 2010. LNCS, vol. 6128, pp. 63–83. Springer, Heidelberg (2010)
33. Weidlich, M., Mendling, J.: Perceived Consistency between Process Models. Inf. Syst. (in press, 2011)
34. Lipkus, A.: A Proof of the Triangle Inequality for the Tanimoto Distance. Journal of Mathematical Chemistry 26, 263–265 (1999)
35. Decker, G., Mendling, J.: Process instantiation. Data Knowl. Eng. 68(9), 777–792 (2009)
36. Buckley, C., Voorhees, E.M.: Evaluating evaluation measure stability. In: SIGIR 2000, pp. 33–40 (2000)
37. Awad, A.: BPMN-Q: A Language to Query Business Processes. In: EMISA. LNI, vol. P-119, pp. 115–128. GI (2007)
38. Beeri, C., Eyal, A., Kamenkovich, S., Milo, T.: Querying Business Processes. In: VLDB 2006, VLDB Endowment, pp. 343–354 (2006)
39. Choi, I., Kim, K., Jang, M.: An XML-based process repository and process query language for integrated process management. Knowledge and Process Management 14(4), 303–316 (2007)
40. Jin, T., Wang, J., Wu, N., Rosa, M.L., ter Hofstede, A.H.M.: Efficient and Accurate Retrieval of Business Process Models through Indexing - (Short Paper). In: Meersman, R., Dillon, T.S., Herrero, P. (eds.) OTM 2010. LNCS, vol. 6426, pp. 402–409. Springer, Heidelberg (2010)
Event-Based Monitoring of Process Execution Violations

Matthias Weidlich¹, Holger Ziekow²,*, Jan Mendling³, Oliver Günther³, Mathias Weske¹, and Nirmit Desai⁴

¹ Hasso-Plattner-Institute, University of Potsdam, Germany
{matthias.weidlich,weske}@hpi.uni-potsdam.de
² AGT Germany, Germany
[email protected]
³ Humboldt-Universität zu Berlin, Germany
{jan.mendling,guenther}@wiwi.hu-berlin.de
⁴ IBM India Research Labs, India
[email protected]

* The main part of the work was done while at Humboldt-Universität zu Berlin.

Abstract. Process-aware information systems support business operations as they are typically defined in a normative process model. Often these systems do not directly execute the process model, but provide the flexibility to deviate from the normative model. This paper proposes a method for monitoring control-flow deviations during process execution. Our contribution is a formal technique to derive monitoring queries from a process model, such that they can be directly used in a complex event processing environment. Furthermore, we also introduce an approach to filter and aggregate query results to provide compact feedback on deviations. Our technique is applied in a case study within the IT service industry.
1 Introduction

Process-aware information systems are increasingly used to execute and control business processes [1]. Such systems provide more general support for process execution in comparison to classical workflow systems, as they do not necessarily need to enforce a normative process model. This notion of process-awareness rather relates to guiding the execution of an individual case instead of restricting the behaviour to a narrow set of sequences. In this way, process-aware information systems often provide the flexibility to deviate from the normative process model [2].

While the flexibility provided by process-aware information systems is often a crucial benefit, it also raises the question of to what extent deviations from the normative behaviour can be efficiently identified. For instance, a process analyst may explicitly want to be notified in a timely manner when actions are taken on a particular case that deviate from the standard procedure. Such reactive mechanisms are typically referred to as process monitoring. They build on the identification of specific types of events, which are analysed and processed to yield business-relevant insights. In recent years, techniques for complex event processing have been introduced for identifying non-trivial patterns in event sequences and for triggering appropriate reactions. Such patterns are formulated as queries over event sequences. While certain
non-temporal requirements upon a process can be easily formulated as complex event queries, e.g., a four-eyes principle on a pair of activities, it is a non-trivial problem to encode the control flow of a normative process model as a set of event queries that pinpoint violations. However, such queries are highly relevant, e.g., for monitoring IT incident resolution for which best practice process models exist, but are not enforced by a workflow system. In this paper, we address this problem from two angles. First, we define a technique to generate monitoring queries from a process model. To this end, we leverage the concept of a behavioural profile. Second, we support the presentation of the query results in such a way that root causes of a rule violation are directly visible to the process analyst. To achieve this, we develop an approach to filter and aggregate query results. In this way, we aim to contribute towards a better integration of process monitoring and complex event processing techniques. The paper is structured as follows. Section 2 introduces the formal foundations of our approach including complex event processing and behavioural profiles. Section 3 defines a transformation technique to generate queries from process models. Section 4 identifies three classes of execution violations and how they are handled to provide meaningful feedback to the process analyst. Section 5 applies our techniques to a case study in the IT service domain. Section 6 discusses related work, before Section 7 concludes the paper.
2 Preliminaries

This section presents preliminaries for our investigations. Section 2.1 discusses process models. Then, Section 2.2 gives background information on complex event processing. Finally, Section 2.3 introduces causal behavioural profiles as an abstraction of the behaviour of a process model.

2.1 Process Models

In large organisations, process models are used to describe business operations. As such, they formalise requirements that have to be considered during the development of process-aware information systems [1]. Process models are also directly used to foster process execution when a workflow engine utilises a model to enforce the specified behaviour. In both cases, the process model has a normative role: it precisely defines the intended behaviour of the process. Abstracting from differences in expressiveness and notation of common process description languages, a process model is a graph consisting of nodes and edges. The former represent activities and control flow routing nodes, i.e., split and merge nodes that implement execution logic beyond simple sequencing of activities. The latter implement the control flow structure of the process.

Figure 1 depicts an example process model in BPMN, which we use in this paper to illustrate our concepts. It depicts a Lead-to-Quote process. The process starts with an import of contact data or with the reception of a request for quote. In the former case, the contact details are updated. This step may be repeated if data integrity constraints are not met. Then, the quote is prepared by first entering prospective project details and then conducting an effort estimation, updating the requirements, and preparing the quote template. These steps are done concurrently. Finally, the quote is approved and submitted.
Fig. 1. Example Lead-to-Quote process modelled in BPMN. Activities: Load Contact from CRM (a), Import Contact from Excel (b), Receive Quote Request (c), Update Contact Details (d), Check Contact Data Integrity (e), Check Request Details (f), Enter Project Details (g), Get Project History (h), Update Requirements (i), Prepare Quote Template (j), Estimate Project Effort (k), Approve & Send Quote (l)
2.2 Complex Event Processing

Complex event processing (CEP) refers to technologies that support the processing of real-time data that occur as events. The goal of CEP is to detect situations of interest in real time as well as to initiate corresponding actions. The notion of an event differs slightly throughout the literature. Etzion and Niblett refer to an event as an "occurrence within a particular system or domain" [3]. Examples for such events are manifold. They can be as simple as a sensor reading or as complex as the prediction of a supply bottleneck. Data models for events vary between systems and application contexts. For instance, some systems encode events as tuples while others follow an object-oriented model. Further, events may be modelled explicitly with a duration or only with an occurrence time stamp. We assume events to be represented as tuples with the following structure:

event = (eventID, caseID, activity, timeStamp, <eval>)

Here, eventID is a unique identifier of an event, caseID a unique identifier for the process instance, activity the observed activity in the process, timeStamp the completion time of the activity, and <eval> a collection of attribute-value pairs. Our approach is concerned with monitoring single instances. Still, the unique identifier for process instances is required to correlate events to instances. If this information is not available in a system, techniques to correlate events [4,5] may be applied to discover events related to a single instance.

Intuitively, a CEP system acts like an inverted database. A database stores data persistently and processes queries in an ad-hoc manner. In contrast, a CEP system stores standing queries and evaluates these queries as new event data arrives. This principle is implemented in different systems to support real-time queries over event data, e.g., [6,7,8,9]. Queries are formulated in terms of dedicated query languages. Rule-based languages following an ECA structure and SQL extensions are two main categories of query language styles. A key element in many languages is a pattern, which defines relations between events. Typical operations for pattern definition are conjunction, disjunction, negation, and a sequence operator that defines a specific order of event occurrences. Temporal constraints are supported by constructs for defining time windows. Below we show a query example written in the SQL-like language of ESPER [9]. The query matches if the activity 'Update Contact Details' is followed by the activity 'Enter Project Details' in the same process instance.
select * from pattern [
  every a=ObservationEvent(a.Observation='Update Contact Details')
  -> b=ObservationEvent(b.Observation='Enter Project Details', b.caseID=a.caseID)
]
In this paper, we use a more abstract notation for queries. The notation focuses on the query parts that are relevant to our approach and is intended to support an intuitive understanding. With a small letter, we denote the atomic query for events reflecting the execution of the corresponding activity. A query a evaluates to true if activity a was completed. For simplicity, we say that 'an event a occurred'. With an expression in capital letters we denote sets of atomic event queries. Such an expression evaluates to true if any atomic query in the set evaluates to true. We build more complex queries using the operators and, not, seq, and within. The and operator represents Boolean conjunction. The term and(a, b) evaluates to true if the two atomic queries a and b evaluate to true, i.e., if events a and b occurred. The not operator represents Boolean negation. The seq operator implements sequencing, i.e., seq(a, b) is true if query a is true first and at a later time query b evaluates to true. Constraints on event attributes are not explicitly modelled. However, we implicitly assume that all queries are limited to events with the same caseID. This ensures that all matched events belong to the same process instance. To constrain time, we use the operator within. It takes a query q and a time window t as input and evaluates to true if q is true when constrained to events that are not more than t apart in time. Since the operators and, not, and seq are defined over Boolean values, we can nest them, e.g., and(a, seq(b, c)) with a, b, and c being atomic queries. As CEP engines typically support these operations, our approach is not limited to any specific implementation. For instance, the defined operators can be evaluated by an extended automaton, in which state transitions correspond to the occurrence of events [10].

2.3 Causal Behavioural Profiles

Our approach to monitoring of process instances exploits the behavioural constraints that are imposed by a normative process model. We use the notion of a causal behavioural profile [11] to capture these constraints. A causal behavioural profile provides an abstraction of the behaviour defined by a process model. It captures behavioural characteristics by relations on the level of activity pairs. The order of potential execution of activities is captured by three relations: two activities are either in strict order, exclusive to each other, or in interleaving order. These relations follow from the possible execution sequences, alias traces, of the process model.

◦ The strict order relation, denoted by ⇝, holds between two activities x and y if x may happen before y, but not vice versa. That is, in all traces comprising both activities, x will occur before y.
◦ The exclusiveness relation, denoted by +, holds for two activities if they never occur together in any process trace.
◦ The interleaving order relation, denoted by ||, holds for two activities x and y if x may happen before y and y may also happen before x. Interleaving order can be interpreted as the absence of any specific order constraint for two activities.

The causal behavioural profile also comprises a co-occurrence relation, denoted by ≫, to capture occurrence dependencies between activities. Co-occurrence holds between two activities x and y if any complete trace, from the initial to the final state of the
process, that contains activity x also contains activity y. For the activities of the model in Fig. 1, for instance, activities a and b are exclusive. Activities e and g are in strict order. The concurrent execution of activities h and i yields interleaving order in the behavioural profile. As a self-relation, activities show either exclusiveness (if they are executed at most once, e.g., a + a) or interleaving order (if they may be repeated, e.g., d || d). As an example for co-occurrence, we observe that every trace containing a also contains d, but not vice versa, i.e., a ≫ d but not d ≫ a. Causal behavioural profiles are computed efficiently for process models such as the one shown in Fig. 1, see [11,12]. The causal behavioural profile is a behavioural abstraction. For instance, cardinalities of activity execution are not captured. Still, findings from experiments with process variants [12] suggest that these profiles capture a significant share of the behavioural characteristics of a process model. Although our approach builds on the causal behavioural profile, for brevity, we use the term behavioural profile in the remainder of this paper.
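The paper computes causal behavioural profiles structurally from the process model [11,12]. As a rough, hedged approximation for experimentation, the relations can also be derived from a set of complete traces, as in the following sketch; the function name and the representation are assumptions, and the result only approximates the model-based profile.

```python
def profile_from_traces(traces, activities):
    """Approximate strict order, exclusiveness, interleaving order, and co-occurrence
    from a set of complete traces (each trace is a list of activity labels)."""
    before = set()  # (x, y) such that x occurs before y within some trace
    for trace in traces:
        for i, x in enumerate(trace):
            for y in trace[i + 1:]:
                before.add((x, y))

    strict, exclusive, interleaving, cooccurrence = set(), set(), set(), set()
    for x in activities:
        for y in activities:
            if (x, y) in before and (y, x) in before:
                interleaving.add((x, y))
            elif (x, y) in before:
                strict.add((x, y))        # x may happen before y, but not vice versa
            elif (y, x) not in before:
                exclusive.add((x, y))     # never observed together within one trace
            if all(y in trace for trace in traces if x in trace):
                cooccurrence.add((x, y))  # every complete trace with x also contains y
    return strict, exclusive, interleaving, cooccurrence
```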
3 Monitoring Based on Event Queries

We describe the architecture of our approach to run-time monitoring of process execution in Section 3.1. Then, we focus on the derivation of monitoring queries in Section 3.2.

3.1 Architecture
Our approach facilitates the automatic generation of monitoring queries from process models as well as preprocessing of the query results for presentation. Conceptually, this involves four main steps:

1. Extraction of behavioural profiles from process models
2. Generation of complex event queries
3. Running queries over process events
4. Applying filters and aggregates on detected deviations

Figure 2 provides an overview of the system architecture.

Fig. 2. Architecture overview

We assume that the monitoring system is provided with events that reflect the completion of process activities. This might require some preprocessing to deduce the completion of an activity from available input data, e.g., as in RFID-based process monitoring [13]. However, we make no assumptions on the specific event sources. Instead, events can have multiple origins, such as an enterprise service bus, a manufacturing execution system, or sensor middleware. Further, we assume the availability of process models as input. From these process models we extract behavioural
profiles to obtain behavioural constraints on event occurrences in model-conformant process instances. Given the behavioural profiles, we generate complex event queries that target violations on a fine-grained level and reveal where a process instance deviates from the model. To run the queries, we assume the utilisation of a CEP engine such as those provided in [6,7,8,9]. However, our approach abstracts from specific implementation details. Instead, we describe our solution using general concepts that can be realised in different systems. The CEP engine continuously receives the process events. It then matches the events against the monitoring queries to detect deviations from the model. The system can create an alert for each detected deviation. We enhance this mechanism with filters and aggregations to condense reports and avoid information overload.

3.2 Derivation of Event Queries from Process Models

To leverage the information provided by the behavioural profile of a process model for monitoring, we generate a set of event queries from the profile. These queries follow directly from the relations of the behavioural profile and relate to exclusiveness, order, and co-occurrence constraints. As discussed in Section 2.3, interleaving order between two activities can be seen as the absence of an ordering constraint. Therefore, there is no need to consider this relation when monitoring the accuracy of process execution.

Exclusiveness: Exclusiveness between activities as defined by the relation of the behavioural profile has to be respected in the events signalling activity execution. To identify violations of these constraints, a query matches the joint execution of two exclusive activities. Note that the exclusiveness relation also indicates whether an activity is executed at most once (exclusiveness as a self-relation). Given the behavioural profile B = {⇝, +, ||, ≫} of a process model, we define the exclusiveness query set as follows:

Q_+ = ⋃_{(a1,a2) ∈ +} { and(a1, a2) }
For the model depicted in Fig. 1, monitoring exclusiveness comprises the following queries: Q_+ = {and(a, a), and(a, b), and(b, a), and(a, c), ..., and(e, f)}. Mirrored queries such as and(a, b) and and(b, a) have the same semantics. We assume those to be filtered by optimisation techniques of the query processing. The described queries are sufficient for detecting exclusiveness violations. However, they provide no means to discard partial query results if no violations occur. Such a mechanism is crucial for the applicability of our approach, as an overload of the event processor has to be avoided. To optimise resource utilisation, we present modified versions of the query generation that incorporate termination in correct process instances. We consider two options for terminating non-matching queries. One option is using timeouts. Setting an appropriate timeout t requires background knowledge on the monitored processes and should be defined by a domain expert. Setting the timeout too low may result in missing constraint violations. To avoid this risk, the set of terminating activities END of the process may be leveraged. This option solely relies on the process model. We construct a set of optimised queries as follows:

Q_+^optimised = ⋃_{(a1,a2) ∈ +} { within(and(and(a1, a2), not(END \ {a1, a2})), t) }
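A minimal sketch of how the exclusiveness query set can be generated and evaluated over an already correlated event sequence of a single case (i.e., outside a CEP engine) is given below; the function names are hypothetical, and the self-relation and(a, a) is interpreted as 'a completed more than once'.

```python
def exclusiveness_queries(exclusive):
    """Q+ : one and(a1, a2) query per pair of the exclusiveness relation."""
    return [("and", a1, a2) for (a1, a2) in exclusive]


def exclusiveness_violations(observed, exclusive):
    """Matches of Q+ over one case: both exclusive activities were executed,
    or an activity restricted to a single execution occurred more than once."""
    seen = set(observed)
    return [(a1, a2) for (a1, a2) in exclusive
            if (a1 != a2 and a1 in seen and a2 in seen)
            or (a1 == a2 and observed.count(a1) > 1)]
```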
Here, not(END) evaluates to true if no event in the set END occurred, with END being the set of events representing terminating activities. The operator within(a, t) limits the evaluation of the query a to the relative time frame t. That is, query a is only evaluated for events that are not more than t apart in time.

Order: Violations of the order of activity execution are queried for with an order query set. It matches pairs of activities for which the order of execution is not in line with the behavioural profile. Given the behavioural profile B = {⇝, +, ||, ≫} of a process model, we define this set of queries as

Q_⇝ = ⋃_{(a1,a2) ∈ ⇝} { seq(a2, a1) }
For the model in Fig. 1, the order query set contains the following queries: Q_⇝ = {seq(d, a), seq(e, a), seq(g, a), ..., seq(l, k)}. Similar to the queries for exclusiveness, queries for order violations do not match in correct process instances. For optimisation, we again incorporate constructs for termination based on timeouts t or the terminating activities END of the process:

Q_⇝^optimised = ⋃_{(a1,a2) ∈ ⇝} { within(and(seq(a2, a1), not(END \ {a1, a2})), t) }
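Analogously, the following sketch evaluates the order query set over a recorded, case-correlated event sequence: seq(a2, a1) matches if some occurrence of a2 is followed by a later occurrence of a1. The function name is hypothetical.

```python
def order_violations(observed, strict_order):
    """Matches of the order query set: seq(a2, a1) for (a1, a2) in the strict order
    relation, i.e., some occurrence of a2 is followed by a later occurrence of a1."""
    first, last = {}, {}
    for idx, activity in enumerate(observed):
        first.setdefault(activity, idx)  # earliest position of each activity
        last[activity] = idx             # latest position of each activity
    return [(a1, a2) for (a1, a2) in strict_order
            if a2 in first and a1 in last and first[a2] < last[a1]]
```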
Co-occurrence: For exclusiveness and order of activity execution, the respective queries follow directly from the relations of the behavioural profile. Co-occurrence constraints, in turn, cannot be verified in this way. Co-occurrence constraints are violated not by the presence of a certain activity execution, but by its absence. Therefore, a constraint violation materialises only at the completion of a process instance. Only then does it become visible which activities are missing in the observed execution sequence even though they should have been executed. We construct a corresponding query set as follows:

Q_detection = ⋃_{(a1,a2) ∈ ≫} { and(and(a1, END), not(a2)) }
For a running process instance, the question whether an activity is missing cannot be answered definitely. Nevertheless, the strict order relation can be exploited to identify states of a process instance which may evolve into a co-occurrence constraint violation. Taking up the idea of conformance measurement based on behavioural profiles [14], we query for activities for which we deduced from the observed execution sequence that they should have been executed already. That is, their execution is implied by a co-occurrence constraint for one of the observed activities and they are in strict order with one of the observed activities. As the co-occurrence constraint is not yet violated definitely, we refer to these queries as co-occurrence warning queries. Note that, although the co-occurrence constraint is not violated definitely when the query matches, the respective instance is fraudulent. Even if the co-occurrence constraint is satisfied at a later stage, the instance will show an order violation. In other words, if the co-occurrence warning query matches, a constraint violation is detected, but it is still
unclear whether the order or the co-occurrence constraint is violated. Given the behavioural profile B = {⇝, +, ||, ≫} of a process model, the co-occurrence warning query set is defined as follows:

Q_warning = ⋃_{((a1,a2) ∈ (≫ ∩ ⇝)) ∧ ((a2,a3) ∈ ⇝)} { and(and(a1, a3), not(a2)) }
For the example model depicted in Fig. 1, there exist the following co-occurrence warning queries: Q_warning = {and(and(a, g), not(d)), ..., and(and(h, l), not(k))}. The first term refers to the co-occurrence constraint between activities a and d, i.e., if the contact is loaded from the CRM, the contact details have to be updated. The execution of activity g, entering the project details, is used to judge the progress of the processing. The query matches if we observe the execution of activities a and g, but there has not been any execution of activity d. If this activity is executed at a later stage, the co-occurrence constraint would be satisfied, whereas the order constraint between activities d and g would be violated. Due to the lack of space, we do not provide optimised versions of the queries for co-occurrence violations. However, optimisation can be done along the lines of the optimisations for the exclusiveness and order violation queries.
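The co-occurrence warning queries can be sketched in the same style; the tuple representation (a1, a2, a3) and the function names are assumptions, not the paper's implementation.

```python
def cooccurrence_warning_queries(strict_order, cooccurrence):
    """Q_warning: one and(and(a1, a3), not(a2)) query for every a1, a2, a3 with
    (a1, a2) in both the co-occurrence and strict order relation and (a2, a3) in strict order."""
    strict = set(strict_order)
    index = set(cooccurrence) & strict
    return [(a1, a2, a3)
            for (a1, a2) in index
            for (b, a3) in strict if b == a2]


def cooccurrence_warnings(observed, queries):
    """Warnings for one case: a1 and a3 have been observed, but a2 is still missing."""
    seen = set(observed)
    return [(a1, a2, a3) for (a1, a2, a3) in queries
            if a1 in seen and a3 in seen and a2 not in seen]
```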
4 Feedback on Behavioural Deviations

The queries derived in the previous section allow us to monitor the behaviour of a process instance on a fine-granular level. On the downside, fine-granular queries may result in multiple alerts when an activity is performed at an unexpected stage of the business process. For instance, consider a sequence of activities a, d, e, which are all exclusive to c, as in our example in Fig. 3. When c is now executed after the other activities have already been completed, we obtain three alerts. The idea of the concepts presented in this section is to filter these alerts. First, we identify the root cause for the set of violations. A root cause refers to the responsible event occurrence, which implies that we observe a violation at a later stage. Second, we check whether a certain violation is an implication of an earlier violation. Sections 4.1 to 4.3 define corresponding filtering and reporting strategies for the constraint violations detected by the queries introduced in the previous section.

4.1 Exclusiveness Violations

We first consider violations that stem from the exclusiveness monitoring query set Q_+. Let σ = a1, a2, ..., an be the sequence of recorded events for a process instance and B = {⇝, +, ||, ≫} the behavioural profile of the respective process model. Then, we derive the set of violations V_+^n at the time event an is recorded as follows:

V_+^n = {(ax, ay) ∈ + | ay = an ∧ ax ∈ σ}

Root Cause. A single event may cause multiple exclusiveness violations. Given a set of exclusiveness violations V_+^n, the root cause is the violation that relates to the earliest event in the sequence of recorded events σ = a1, a2, ..., an. In this way, the root cause refers to the earliest event that implies that the latest event would not be allowed anymore.
Fig. 3. Example process revisited
We define a function root to extract the root cause for the most recent violations with respect to an as follows:

root(V_+^n) = (ax, ay) ∈ V_+^n such that ∀ (ak, al) ∈ V_+^n [ x ≤ k ]

We illustrate the introduced concept for the process from Fig. 3. Assume that the activities a and d have been executed when we observe an event that signals the completion of activity c, i.e., we recorded σ = a, d, c. The exclusiveness monitoring query set Q_+ matches and identifies two violations, V_+^3 = {(a, c), (d, c)}. The violation of exclusiveness for a and c is the root cause, since a was the first activity to complete in the recorded event sequence, i.e., root(V_+^3) = (a, c).

Consecutive Violation. The identification of a root cause for a set of violations triggered by a single event is the first step to structure the feedback on violations. Once a violation is identified, subsequent events may result in violations that logically follow from the violations observed already. For a root cause (ax, ay), consecutive violations (ap, aq) are characterised by the fact that (1) either ax with ap and ay with aq are not conflicting, or (2) this non-conflicting property is observed for ax with aq and ay with ap. Further, we have to consider the case that potentially ap = ax or ap = ay. Consecutive violations are recorded but explicitly marked once they are observed. Given a root cause (ax, ay) of exclusiveness violations, we define the set of consecutive violations by a function consec:

consec(ax, ay) = {(ap, aq) ∈ + | ((ax = ap ∨ ax + ap) ∧ (aq = ay ∨ ay + aq)) ∨ ((ay = ap ∨ ay + ap) ∧ (ax = aq ∨ ax + aq))}

Consider the process from Fig. 3 and the recorded event sequence σ = a, d, c. Now assume a subsequent recording of event e. The exclusiveness monitoring query set Q_+ matches and we extract a set of violations V_+^4 = {(c, e)} with the root cause root(V_+^4) = (c, e). Apparently, this violation follows directly from the violations identified when event c was recorded, because e is expected to occur subsequent to d. This is captured by our notion of consecutive violations for the previously identified root cause (a, c). Since c and e are expected to be exclusive and it holds that ay = c = ap and a + e, we observe that (c, e) ∈ consec(a, c). Hence, the exclusiveness violation (c, e) would be reported as a consecutive violation of the previous violations identified by their root cause root(V_+^3) = (a, c).

Further, assume that the next recorded event is b, so that σ = a, d, c, e, b and V_+^5 = {(a, b), (c, b)}. Then, both violations represent a situation that does not follow
logically from violations observed so far, i.e., they are non-consecutive and reported as independent violations to the analyst. Still, the feedback is structured, as we identify a root cause root(V_+^5) = (a, b), since event a occurred before event c.

4.2 Order Violations

Now, we consider violations that stem from the order monitoring query set Q_⇝. Let σ = a1, a2, ..., an be the sequence of recorded events for a process instance and B = {⇝, +, ||, ≫} the behavioural profile of the respective process model. Let ⇝⁻¹ be the inverse of the strict order relation, i.e., (ax ⇝ ay) ⇔ (ay ⇝⁻¹ ax). Then, we derive the set of violations V_⇝^n at the time event an is recorded as follows:

V_⇝^n = {(ax, ay) ∈ ⇝⁻¹ | ay = an ∧ ax ∈ σ}
Root Cause. Also for this set of violations, it may be the case that a single event causes multiple order violations. Given a set of order violations V_⇝^n, the root cause is the violation that relates to the earliest event in the sequence of recorded events σ = a1, a2, ..., an. Again, we define a function root to extract the root cause:

root(V_⇝^n) = (ax, ay) ∈ V_⇝^n such that ∀ (ak, al) ∈ V_⇝^n [ x ≤ k ]
For illustration, consider the example of Fig. 3 and a sequence of recorded events σ = a, d, h, k. Now, e is completed, which points to a violation of the order constraint between e and h as well as between e and k. The idea is to report the earliest event in the execution sequence that was supposed to be executed after e. Then, the violation root(V_⇝^5) = (h, e) is the root cause, since h was the first such event in this case.

Consecutive Violation. As for exclusiveness constraint violations, we also define consecutive violations. These include violations from subsequent events that logically follow from violations observed earlier. For a root cause (ax, ay), consecutive violations (ap, aq) are characterised by the fact that either ax with ap and ay with aq are in strict order, or ax with aq and ay with ap, respectively. Taking into account that it may hold that ap = ax or ap = ay, we lift the function consec to root causes of strict order violations. Given such a root cause (ax, ay), it is defined as follows:

consec(ax, ay) = {(ap, aq) ∈ ⇝⁻¹ | ((ax = ap ∨ ax ⇝ ap) ∧ (aq = ay ∨ ay ⇝ aq)) ∨ ((ay = ap ∨ ay ⇝ ap) ∧ (ax = aq ∨ ax ⇝ aq))}

Consider the example of Fig. 3 and the sequence σ = a, d, h, k, e, which resulted in root(V_⇝^5) = (h, e). Now, g is observed, which violates the order with h and k, as monitored by the order monitoring query set Q_⇝. Apparently, this violation follows from the earlier violations. It is identified as a consecutive violation for the previous root cause (h, e), since h is in both violations and e and g are not conflicting in terms of order, i.e., (h, g) ∈ consec(h, e). Since k ⇝⁻¹ g, h ⇝ k, and e ⇝ g, this also holds for the second violation. Hence, the violations (h, g) and (h, k) are reported as consecutive violations.
4.3 Occurrence Violations

Finally, we consider violations that relate to occurrence dependencies between executions of activities. As discussed in Section 3.2, the absence of a required activity execution materialises only once the process instance has finished execution (query set Q_detection). To indicate problematic processing in terms of potential occurrence violations, we introduced the set of co-occurrence warning queries Q_warning. These queries may match repeatedly during the execution of a process instance, so that we focus on filtering violations that relate to these queries. Once again, note that the queries in Q_warning detect violations; only the type of violation (co-occurrence constraint violation or order violation) cannot be determined definitely. Let σ = a1, a2, ..., an be the sequence of recorded events for a process instance and B = {⇝, +, ||, ≫} the behavioural profile of the respective process model. Further, let A denote the set of events related to activities of the process model. Then, we derive the set of violations V_≫^n at the time event an is recorded as follows:

V_≫^n = {(ax, ay) ∈ (A × A) | ay = an ∧ ax ∈ σ ∧ ∃ a ∈ A [ a ∉ σ ∧ ax ≫ a ∧ ax ⇝ a ∧ a ⇝ ay ]}

Such a violation comprises the event that requests the presence of the missing event along with the event signalling that the missing event should have been observed already. This information on a violation is needed to structure the feedback for the analyst. In contrast to exclusiveness and order violations, the actual violated constraint is not part of the violation captured in V_≫^n. We define an auxiliary function miss to characterise the actual missing events for a violation (ax, ay) ∈ V_≫^n at the time event an is recorded:

miss(ax, ay) = {a ∈ A | a ∉ σ ∧ ax ≫ a ∧ a ⇝ ay}

Root Cause. Also here, a single event may cause multiple violations. Given a set of violations V_≫^n, the root cause is the violation that relates to the latest event in the sequence of recorded events σ = a1, a2, ..., an. By reporting the latest event, we identify the smallest region in which an event is missing. We define root as follows:

root(V_≫^n) = (ax, ay) ∈ V_≫^n such that ∀ (ak, al) ∈ V_≫^n [ x ≥ k ]
For illustration, consider the example of Fig. 3 and assume that a sequence of events σ = a, d has been recorded. Now, h is completed, which points to a missing execution of e and g: V_≫^3 = {(a, h), (d, h)} and miss(a, h) = miss(d, h) = {e, g}. The idea is to report the latest event in the execution sequence that is triggering the violation, in our case event d. Then, the violation for d and h is the root cause, root(V_≫^3) = (d, h).

Consecutive Violation. For a root cause (ax, ay) of occurrence violations, consecutive violations (ap, aq) are characterised by the fact that there exists an intermediate missing event at the time event an is recorded which (1) resulted in the root cause, and (2) led to a violation observed when an event am is added later, m ≥ n:

consec(ax, ay) = {(ap, aq) ∈ (A × A) | ∃ a ∈ miss(ax, ay) [ ap ≫ a ∧ a ⇝ aq ]}
Consider the sequence σ = a, d, h for our example in Fig. 3, which resulted in root(V_≫^3) = (d, h). Now, k is observed, which yields V_≫^4 = {(a, k), (d, k)}. For both violations there are missing intermediate events, namely e and g, that resulted in the root cause (d, h). Hence, (a, k) and (d, k) are reported as consecutive violations.
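The root-cause and consecutive-violation definitions of Section 4.1 translate directly into the following sketch for exclusiveness violations (the order and occurrence variants follow the same pattern); it assumes that events are unique within σ, and all names are hypothetical.

```python
def violations_plus(sigma, n, exclusive):
    """V+^n: exclusive pairs (ax, a_n) whose first element was recorded before a_n."""
    a_n = sigma[n]
    return [(ax, a_n) for ax in sigma[:n] if (ax, a_n) in exclusive]


def root_cause(sigma, violations):
    """root(V+^n): the violation whose first element occurred earliest in sigma."""
    return min(violations, key=lambda v: sigma.index(v[0]))


def consec_plus(root, exclusive):
    """Consecutive exclusiveness violations implied by a root cause (ax, ay)."""
    ax, ay = root

    def rel(u, v):
        # u equals v, or u and v are exclusive according to the profile.
        return u == v or (u, v) in exclusive

    return {(ap, aq) for (ap, aq) in exclusive
            if (rel(ax, ap) and rel(ay, aq)) or (rel(ay, ap) and rel(ax, aq))}
```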
5 Case Study: SIMP

As an evaluation, we applied our approach in a case study. This case study builds on a replay of real-world log data within a prototypical implementation of our approach. In this way, we are able to draw conclusions on the number and types of detected control-flow deviations. We also illustrate how feedback on these deviations is given. First, Section 5.1 gives background information on the investigated business process. Second, Section 5.2 presents the results on detected deviations.

5.1 Background

We applied our techniques to the Security Incident Management Process (SIMP), an issue management process. This process is run in one of IBM's global service delivery centres. This centre provides infrastructure management and technical support to customers. Figure 4 gives an overview of the SIMP as a BPMN diagram. The process is initiated by creating an issue. Then, issue details may be updated and a resolution plan is created. Subsequently, change management activities may be conducted, which may be followed by monitoring of target dates and risk management tasks. In the meantime, a customer extension may be conducted. Finally, a proposal to close is issued. If accepted, the issue is closed. If rejected, either a new proposal to close the issue is created or the whole processing starts over again.

The SIMP is standardised and documented within the respective IBM global service delivery centre, but it is not explicitly enforced by workflow technology. Therefore, employees may deviate from the predefined processing. The latter is tracked by a proprietary tool, in which an employee submits the execution of a certain activity. For the SIMP, we obtained 852 execution sequences, aka cases, that have been recorded within nearly five years. SIMP instances are long-running, up to several months. We analysed the degree of conformance of these cases with the predefined behaviour shown in Fig. 4 in prior work, along with an evaluation of frequencies for particular
Fig. 4. BPMN model of the Security Incident Management Process (SIMP), see also [14]. Activities: Create issue (CRI), Issue details (ID), Resolution plan (RP), Change management (CM), Monitor target dates (MTD), Risk management (RM), Customer extension (CE), Proposal to close (PTC), Close issue (CLI)
Table 1. Identified violations in 852 cases of the SIMP

                          Exclusiveness   Order         Occurrence     All Violations
Overall                   194             173           247            614
Avg per Case (StDev)      0.23 (0.36)     0.20 (0.40)   0.29 (0.49)    0.72 (1.10)
Max per Case              6               74            4              77
Non-Consecutive (%)       179 (92.27%)    14 (8.09%)    134 (54.25%)   327 (53.26%)
# Fraudulent Cases (%)    179 (21.01%)    7 (0.82%)     134 (15.73%)   202 (23.71%)
violations [14]. Here, we apply the techniques introduced in this paper for identifying execution violations and providing aggregated feedback to all SIMP cases. This allows for evaluating the number and type of detected violations in a fine-granular manner. In addition, we evaluate our techniques for providing aggregated feedback in a real-world setting. This analysis offers the basis to judge the applicability of our approach for online monitoring of the SIMP.

5.2 Results
As a first step, we evaluated how often the queries derived from the process model match for the 852 cases of the SIMP. Table 1 gives a summary of the results. We detected 614 constraint violations in all cases. On average, a case violates 0.72 behavioural constraints. Further, we see that the violations are rather homogeneously distributed over the three types of constraints. However, when evaluating how the different types of violations are distributed over the cases, we observe differences. Exclusiveness and occurrence constraint violations typically appear in isolation: at most 6 or 4 of such constraints are violated within a single case, and the violations are spread over 179 or 134 cases, respectively. Order constraint violations, in turn, are observed in only 7 cases. However, those cases violate a large number of constraints, at most 74.

The latter clearly indicates a need to provide the feedback in a structured way. To this end, we also evaluated our techniques for filtering consecutive violations. Table 1 lists how many non-consecutive violations have been detected for all types of violations. We see that exclusiveness constraint violations are virtually always independent, whereas occurrence and particularly order constraint violations are often consecutive. Further, Fig. 5 illustrates the percentage of filtered violations relative to the number of detected violations for a certain case.

Fig. 5. Percentage of filtered violations relative to the number of detected violations for single cases of the SIMP

Clearly, for those
cases that show a large number of violations, the vast majority of violations is reported as consecutive violations. Hence, a process analyst is faced with only a very small number of root causes for each case.

Finally, we turn the focus to the SIMP case with the most violations to illustrate our findings. This case showed 77 violations, 74 of which relate to order constraints. Analysis of the case reveals that after an issue had been processed, its closure was proposed and then conducted, see also Fig. 4. However, the issue seems to have been reopened, which led to the execution of various activities, e.g., customer extensions have been conducted. All these executions violate order constraints. Still, the 74 order violations are traced back to three root causes, so that 95.95% of the violations are filtered. This illustrates that our approach is able to identify significant problems during processing and that feedback is given in a structured and aggregated manner.
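The percentages reported in Table 1 and above follow directly from the raw counts; the short script below (values taken from Table 1) reproduces them.

```python
# Raw counts from Table 1 (columns: Exclusiveness, Order, Occurrence, All Violations).
overall          = [194, 173, 247, 614]
non_consecutive  = [179, 14, 134, 327]
fraudulent_cases = [179, 7, 134, 202]
total_cases      = 852

print([f"{100 * n / o:.2f}%" for n, o in zip(non_consecutive, overall)])
# -> ['92.27%', '8.09%', '54.25%', '53.26%']
print([f"{100 * f / total_cases:.2f}%" for f in fraudulent_cases])
# -> ['21.01%', '0.82%', '15.73%', '23.71%']

# Worst case: 74 order violations traced back to 3 root causes.
print(f"{100 * (74 - 3) / 74:.2f}%")  # -> 95.95%
```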
6 Related Work

The research presented in this paper relates to work on business process monitoring (also called business activity monitoring), complex event processing, and transactional workflows.

The monitoring of workflows based on a common interface was recognized early by the Workflow Management Coalition as an important issue [15], though no standard emerged that was accepted by vendors. Several works discuss requirements on a data format for audit trails [16] and process log data [17], and how these process-related event logs can be analysed a posteriori. Different approaches have been developed for measuring the conformance of logs with process models [18] and for identifying root causes for deviations from normative behaviour [14]. Only recently has this research area progressed towards online auditing [19] and the monitoring of choreographies [20]. This paper complements these works with an approach to immediately feed back cases of non-conformance in a compact and filtered way.

The approach presented in this paper also relates to the research domain of complex event processing. Several research projects addressed different aspects of event processing technologies [21,22,23,24]. Several works in this area provide solutions for query optimisation, e.g., considering characteristics of wireless sensor networks [24,25], availability of devices [26], resource consumption [27], or the size of intermediate result sets [28]. Up until now, research that combines complex event processing and process management is scarce and has focused only on query optimisation [29]. In this regard, our contribution is an approach to integrate process monitoring with complex event processing technology.

In this paper, we consider a normative process model, which has to be monitored for behavioural conformance of event sequences. This problem is closely related to general properties of transactional workflows, which guarantee the adherence to specified behaviour by moving back to a consistent state when undesirable event sequences occur (see [30,31,32,33]). Transactional concepts defined for workflows have been adopted in web services technologies [34,35] and incorporated in industry standards such as WS-AtomicTransaction [36]. In particular, transactional properties have to be considered for service composition [37]. From the perspective of these works, our approach provides immediate feedback on events that need to be rolled back or compensated.
7 Conclusion

In this paper, we have addressed the problem of monitoring conformance of execution sequences with normative process models in a complex event processing environment. Our contribution is a technique to generate queries from a process model based on its behavioural profile. Further, we present query results in a compact way: root causes of a violation are directly visible to the process analyst. Our approach has been implemented and evaluated in a case study in the IT service industry. The results demonstrate that even large sets of violations can be traced back to a small number of root causes. In future research, we aim to enhance our techniques in two directions. Behavioural profiles provide only an abstracted view on constraints within loops. We are currently studying concepts, first, for capturing the fact that certain activities have to be executed before an activity can be repeated. Second, behavioural profiles currently lack a notion of cardinality constraints, needed when, for instance, an activity a has to be repeated exactly as many times as b is executed. Beyond these conceptual challenges, we plan to conduct additional case studies with industry partners to further investigate the efficiency of our filtering techniques for different types of processes. Further, the scalability of our approach using a dedicated CEP environment has to be investigated.
References 1. Dumas, M., ter Hofstede, A., van der Aalst, W. (eds.): Process Aware Information Systems: Bridging People and Software Through Process Technology. Wiley Publishing, Chichester (2005) 2. Pesic, M., van der Aalst, W.: A declarative approach for flexible business processes management. In: Eder, J., Dustdar, S. (eds.) BPM Workshops 2006. LNCS, vol. 4103, pp. 169–180. Springer, Heidelberg (2006) 3. Etzion, O., Niblett, P.: Event Processing in Action. Manning, Stammford (2010) 4. Ferreira, D.R., Gillblad, D.: Discovering process models from unlabelled event logs. In: Dayal, U., Eder, J., Koehler, J., Reijers, H.A. (eds.) BPM 2009. LNCS, vol. 5701, pp. 143– 158. Springer, Heidelberg (2009) 5. Musaraj, K., Yoshida, T., Daniel, F., Hacid, M.S., Casati, F., Benatallah, B.: Message correlation and web service protocol mining from inaccurate logs. In: ICWS, pp. 259–266. IEEE Computer Society, Los Alamitos (2010) 6. Jacobsen, H.A., Muthusamy, V., Li, G.: The padres event processing network: Uniform querying of past and future events. It - Information Technology 51(5), 250–261 (2009) 7. Gyllstrom, D., Wu, E., Chae, H.J., Diao, Y., Stahlberg, P., Anderson, G.: SASE Complex event processing over streams. In: Int. Conf. on Innovative Data Systems Research (2007) 8. Madden, S., Franklin, M.J.: Fjording the stream: An architecture for queries over streaming sensor data. In: International Conference on Data Engineering (2002) 9. EsperTech: Esper - Complex Event Processing (March 2011), http://esper.codehaus.org 10. Brenna, L., Gehrke, J., Hong, M., Johansen, D.: Distributed event stream processing with non-deterministic finite automata. In: DEBS, pp. 1–12. ACM, New York (2009) 11. Weidlich, M., Polyvyanyy, A., Mendling, J., Weske, M.: Efficient computation of causal behavioural profiles using structural decomposition. In: Lilius, J., Penczek, W. (eds.) PETRI NETS 2010. LNCS, vol. 6128, pp. 63–83. Springer, Heidelberg (2010)
12. Weidlich, M., Mendling, J., Weske, M.: Efficient consistency measurement based on behavioral profiles of process models. IEEE Trans. Software Eng. 37(3), 410–429 (2011) 13. Bornh¨ovd, C., Lin, T., Haller, S., Schaper, J.: Integrating automatic data acquisition with business processes experiences with sap’s auto-id infrastructure. In: Int. Conference on Very Large Data Bases, VLDB Endowment, pp. 1182–1188 (2004) 14. Weidlich, M., Polyvyanyy, A., Desai, N., Mendling, J., Weske, M.: Process compliance analysis based on behavioural profiles. Inf. Syst. 36(7), 1009–1025 (2011) 15. Hollingsworth, D.: The Workflow Reference Model. TC00-1003 Issue 1.1, Workflow Management Coalition (November 24, 1994) 16. Muehlen, M.: Workflow-based Process Controlling. In: Foundation, Design, and Implementation of Workflow-driven Process Information Systems, Logos, Berlin (2004) 17. van der Aalst, W., Reijers, H., Weijters, A., van Dongen, B., Alves de Medeiros, A., Song, M., Verbeek, H.: Business process mining: An industrial application. Information Systems 32(5), 713–732 (2007) 18. Rozinat, A., van der Aalst, W.M.P.: Conformance checking of processes based on monitoring real behavior. Inf. Syst. 33(1), 64–95 (2008) 19. van der Aalst, W.M.P., van Hee, K.M., van der Werf, J.M.E.M., Kumar, A., Verdonk, M.: Conceptual model for online auditing. Decision Support Systems 50(3), 636–647 (2011) 20. Wetzstein, B., Karastoyanova, D., Kopp, O., Leymann, F., Zwink, D.: Cross-organizational process monitoring based on service choreographies. In: Shin, S.Y., Ossowski, S., Schumacher, M., Palakal, M.J., Hung, C.C. (eds.) Proceedings of ACM SAC, pp. 2485–2490. ACM, New York (2010) 21. Abadi, D.J., Ahmad, Y., Balazinska, M., Cetintemel, U., Cherniack, M., Hwang, J.H., Lindner, W., Maskey, A., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., Zdonik, S.: The design of the Borealis stream processing engine. In: Int. Conf. on Innovative Data Systems Research, pp. 277–289 (2005) 22. Arasu, A., Babcock, B., Babu, S., Cieslewicz, J., Datar, M., Ito, K., Motwani, R., Srivastava, U., Widom, J.: Stream: The stanford data stream management system. Technical report, Department of Computer Science, Stanford University (2004) 23. Chandrasekaran, S., Cooper, O., Deshpande, A., Franklin, M.J., Hellerstein, J.M., Hong, W., Krishnamurthy, S., Madden, S., Raman, V., Reiss, F., Shah, M.A.: Telegraphcq: Continuous dataflow processing for an uncertain world. In: Int. Conf. on Innovative Data Systems Research (2003) 24. Madden, S.R., Franklin, M.J., Hellerstein, J.M., Hong, W.: TinyDB: An acquisitional query processing system for sensor networks. ACM Transactions on Database Systems (TODS) 30(1), 122–173 (2005) 25. Yao, Y., Gehrke, J.: The cougar approach to in-network query processing in sensor networks. In: Proc. of the Intl. ACM Conf. on Management of Data, SIGMOD (2002) 26. Srivastava, U., Munagala, K., Widom, J.: Operator placement for in-network stream query processing. In: Proc. of PODS. ACM, New York (2005) 27. Schultz-Møller, N.P., Migliavacca, M., Pietzuch, P.: Distributed complex event processing with query rewriting. In: Proceedings of DEBS, pp. 1–12. ACM, New York (2009) 28. Wu, E., Diao, Y., Rizvi, S.: High-performance complex event processing over streams. In: SIGMOD, pp. 407–418. ACM, New York (2006) 29. Weidlich, M., Ziekow, H., Mendling, J.: Optimising Complex Event Queries over Business Processes using Behavioural Profiles. In: Muehlen, M.z., Su, J. (eds.) J.1, H.4, D.2. Lecture Notes in Business Information Processing, vol. 
66, pp. 743–754. Springer, Heidelberg (2011) 30. Garcia-Molina, H., Salem, K.: Sagas. ACM SIGMOD Record 16(3), 249–259 (1987) 31. Georgakopoulos, D., Hornick, M.F., Sheth, A.P.: An overview of workflow management: From process modeling to workflow automation infrastructure. Distributed and Parallel Databases 3(2), 119–153 (1995)
32. Alonso, G., Agrawal, D., Abbadi, A.E., Kamath, M., G¨unth¨or, R., Mohan, C.: Advanced transaction models in workflow contexts. In: Su, S.Y.W. (ed.) ICDE, pp. 574–581. IEEE Computer Society, Los Alamitos (1996) 33. Dayal, U., Hsu, M., Ladin, R.: Business Process Coordination: State of the Art, Trends, and Open Issues. In: Int. Conference on Very Large Databases, pp. 3–13. Morgan Kaufmann, San Francisco (2001) 34. Papazoglou, M.P.: Web services and business transactions. WWW 6(1), 49–91 (2003) 35. Alonso, G., Casati, F., Kuno, H.A., Machiraju, V.: Web Services - Concepts, Architectures and Applications. In: Data-Centric Systems and Applications. Springer, Heidelberg (2004) 36. Cabrera, F., et al.: Web services atomic transaction (ws-atomictransaction). Technical report, IBM, Microsoft, BEA (2005) 37. Zhang, L.J.: Editorial: Context-aware application integration and transactional behaviors. IEEE T. Services Computing 3(1), 1 (2010)
Towards Efficient Business Process Clustering and Retrieval: Combining Language Modeling and Structure Matching

Mu Qiao1, Rama Akkiraju2, and Aubrey J. Rembert2

1 Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
[email protected] 2 IBM T.J. Watson Research Center, Hawthorne, NY 10532, USA {akkiraju,ajrember}@us.ibm.com
Abstract. Large organizations tend to have hundreds of business processes. Discovering and understanding similarities among business processes can be useful to organizations for a number of reasons including better overall process management and maintenance. In this paper we present a novel and efficient approach to cluster and retrieve business processes. A given set of business processes are clustered based on their underlying topic, structure and semantic similarities. In addition, given a query business process, top k most similar processes are retrieved based on clustering results. In this work, we bring together two not wellconnected schools of work: statistical language modeling and structure matching and combine them in a novel way. Our approach takes into account both high-level topic information that can be collected from process description documents and keywords as well as detailed structural features such as process control flows in finding similarities among business processes. This ability to work with processes that may not always have formal control flows is particularly useful in dealing with real-world business processes which are not always described formally. We developed a system to implement our approach and evaluated it on several collections of industry best practice processes and real-world business processes at a large IT service company that are described at varied levels of formalisms. Our experimental results reveal that the combined language modeling and structure matching based retrieval outperforms structure-matching-only techniques in both mean average precision and running time measures.
1 Introduction
Large organizations tend to have hundreds of business processes. Discovering and understanding the similarities among business processes can be useful to organizations for many reasons. First, commonly occurring process activities can be managed more efficiently resulting in business process maintenance efficiencies. Second, common and similar processes and process activities can be reused
in new or changed process implementations. Third, commonalities among processes and process activities can help identify opportunities for standardization and consolidation of business processes. For example, in the context of a merger or acquisition, discovering similar business processes can aid in decisions about how the merging organizations’ processes can be effectively integrated. The majority of the business process similarity and retrieval research community is concerned with computing similarity primarily by using only the structural component of process models. Some of these approaches rely on full structure comparison, which is computationally expensive, while others rely on local structure comparison, i.e. the successors and predecessors of an activity. More often than not, business processes currently running in companies have been developed and deployed sometime ago, and formal models rarely exist and would be hard or too expensive to re-construct formally. What might be available is informal unstructured or semi-structured documentation. Therefore, a business process clustering and retrieval system that can deal with processes that have different levels of partial information represented in varying degrees of formalism is critical to be useful in real world. We address this problem in our work. In this paper, we present a language modeling and structure matching based business process clustering and retrieval method that takes into account both the process structural and textual information. We formally define the problem of business process clustering and retrieval as follows. Let P be a business process model repository, where each process model Pi ∈ P is the tuple (Si , Ti ) such that Si represents structural information, e.g., control-flow and data-flow, and Ti is textual information, e.g., functionality descriptions. Given P, a business process clustering function assigns Pi ’s into subsets (aka clusters) so that processes in the same subset are close to each other while those in different subsets are as distinct as possible. Also, given a query business process model PQ , a retrieval function returns the top k process models ranked in descending relevance order, where similarity is an approximation of relevance. Our approach not only speeds up the retrieval process, but also provides an intuitive organization of the business process repository. Experiments on several collections of industry best practice processes (e.g., SAP, APQC’s process classification framework processes) and real-world business processes at a large IT service company show that our retrieval method has superior mean average precision and execution times than purely structural approaches.
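To make the problem definition above concrete, the repository and the two functions could be typed as in the following sketch; the names, the use of Python type hints, and the representation chosen for S_i and T_i are illustrative assumptions of ours, not part of the paper.

```python
# Illustrative sketch of the problem setting: each process model P_i = (S_i, T_i)
# carries structural information S_i and textual information T_i.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class ProcessModel:
    structure: Dict[str, List[str]]   # S_i, e.g. control-flow successors per activity
    text: str                         # T_i, e.g. step names and documentation

def cluster(repository: List[ProcessModel]) -> List[List[int]]:
    """Partition process indices into clusters of similar processes."""
    raise NotImplementedError

def retrieve(repository: List[ProcessModel], query: ProcessModel,
             k: int) -> List[Tuple[int, float]]:
    """Return the top-k process indices ranked by descending relevance."""
    raise NotImplementedError
```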
2 Related Work
Business process clustering and retrieval are defined with respect to process similarity measures. The most common approach to process similarity measures is computing the graph edit distance between control-flow graphs of processes. Dijkman et al. [4] compared four techniques for control-flow graph comparison: the A-star algorithm, a greedy approach, a heuristic approach, and an exhaustive approach. The A-star algorithm performed the best. Madhusudan et al. [15] combined control-flow comparison with similarity flooding. This approach is
highly sensitive, and only near exact matches pass the stringent comparison method, which is a requirement of their particular application. Graph comparison approaches give compelling precision results, however, they are not practical. Real-world processes tend to be informally described and most of the information about the process is not encoded in its control-flow graph. Many approaches reduce the reliance on full graph comparisons, and attempt to exploit additional information. The work of Ehrig et al. [8] relies on activity labels and activity context to compute process similarity. Van Dongen et al. [6] use causal footprints to measure similarity, which capture the behavior information about processes and are defined in a vector space. A two-level process comparison approach is proposed by Yan et al. [21]. At the first level, activity labels, and local activity structure are used. The result of the first level is a categorization: relevant, potentially relevant, and irrelevant. Potentially relevant process models are compared using a full graph comparison technique. To the best of our knowledge, no previous work has studied the clustering of real-world business processes. Some initial efforts are made in [11], where two types of vectors, i.e., activity and transition vectors, are proposed to calculate the pairwise structural distance between processes. These distances are then used to create a hierarchical clustering tree. This approach was only evaluated on a small scale synthetic data set and the proposed vectors may not be sufficient to capture the structural features of real-world processes. Business process retrieval, mostly under the name process similarity search, has been explored in [4,21,10] using graph comparison. Recent survey [7] discusses the developments of process similarity search techniques. In general, our business process clustering and retrieval method extends the graph comparison approaches by considering text features, such as topic similarities. The process retrieval is based on the trained clustering model and clustering results. Unlike the work of Dong et al. [5], where clustering is applied to parameter names of web-service operations and then leveraged to measure the web-service similarity, we do clustering on process models themselves. Therefore, clustering not only facilitates the retrieval but also organizes processes in a meaningful way.
3 Business Process Clustering and Retrieval
The flow diagram of our proposed business process clustering and retrieval method is shown in Fig. 1. Throughout the paper we use the terms “similarities” and “commonalities”, “activities” and “steps”, interchangeably. For a selected business process repository, we first extract the textual and structural features from each process model. All the text terms from step names and the documentation will form a text feature vector for each process. Note that in our process collections, the documentation of a process is a paragraph of text description, introducing its purposes and specific functionalities. The process control flow graphs are stored as structural features. We propose a two-level process clustering method. Specifically, a topic language modeling approach is
first applied to the input processes represented by text feature vectors to obtain the first level clustering based on their topic similarities. Within each cluster, detailed structure similarity match is then performed between all the processes. A clustering method based on pairwise structure similarities is employed to obtain the second level clusters. In Fig. 1, Cij indicates the jth second level cluster within the ith first level cluster. Results from the first level clustering, such as the topic mixture (probabilities of topics appearing in each process), the topic word distribution (probabilities of text terms associated with each topic), and the process clusters, are kept for future retrieval usage. These high-level and granular clusters can reveal the commonalities among business processes. Once the clusters are created, a retrieval function will use the trained topic language model and clusters to retrieve similar business processes for a given query. A ranking scheme ranks the processes in descending order of relevance/similarity to the query. Details of the proposed process clustering and retrieval method are discussed in the following sections.
Fig. 1. The flow diagram of the language modeling and structure matching based business process clustering and retrieval method
3.1 Two-Level Business Process Clustering
Business process clustering helps reveal the underlying characteristics and commonalities among a large collection of processes. The information extracted by clustering can also facilitate subsequent analysis, for instance, to speed up text document or image indexing and retrieval [13]. Business process models contain compound data. By compound, we mean that a group of data types is needed to represent the object. For instance, process models contain both textual features, which can be represented by the vector-space model [17], and structural features, usually represented by graphs. In this work, we present a two-level business process clustering method which incorporates these two kinds of information. At the first level, processes are clustered based on their topic similarities. For instance, processes with the same topic as
“order” and “management” will be grouped together. Within each cluster, we further take into account the detailed structure information of processes and cluster them based on their structure similarities. For instance, at the second level, processes that are related to “order fulfillment” and “order entry” may be assigned into their own separate clusters. There are several advantages of the two-level method. First, we not only find processes with close structures but also discover processes that describe the same topic but have completely different structures. Second, since the pairwise structure similarity matching is only conducted on processes within each first level cluster, the computation time is significantly reduced as compared to a clustering approach based on all pairwise structure similarities. Additionally, in real practice, some processes may have incomplete structure or no structure at all. We can still cluster those processes based on their text features.

First level clustering: a topic language modeling approach. We employ topic language modeling, specifically Latent Dirichlet Allocation (LDA) [2], to cluster processes based on their topic similarities. LDA models the generative process of a text document corpus, where a document is summarized as a mixture of topics. In our work, the text terms (aka words) of each business process, e.g., from activity labels and the process documentation, are extracted and combined to form a document represented by a text feature vector. All the processes in the repository are thus converted into a collection of documents in terms of their text features. In LDA, each document is modeled as a mixture over a set of latent topics. We assume that the text document created for each business process is represented by a mixture of topics, for instance, management, purchase, order, etc. The topic mixture is drawn from a Dirichlet prior over the entire collection. Suppose a document d is a word vector of length Nd, denoted by w = (w1, w2, ..., wNd), where wn is the nth word in the document d. LDA is represented as a graphical model in Fig. 2.

Fig. 2. Graphical model representation of LDA. K is the number of topics; N is the number of documents; Nd is the number of words in document d.

The generative process for each document in a collection is as follows:
1. For each topic z, choose a multinomial distribution φz from a Dirichlet prior with parameter β;
2. For each document d, choose a multinomial distribution θd over topics from a Dirichlet prior with parameter α;
3. For each of the Nd words wi in document d, choose a topic zi from the multinomial distribution θd;
4. Choose the word wi from the multinomial distribution φzi.

Exact inference in LDA is generally intractable. Gibbs sampling, a special case of the Markov chain Monte Carlo (MCMC) method, has been used to estimate the parameters of the LDA model, and often yields relatively simple algorithms for approximate inference [9]. Due to space limitations, we refer interested readers to [2] and [9] for the details of LDA and Gibbs sampling. In Fig. 2, the parameter θd is the topic distribution (or mixture) in document d and φk is the word distribution in topic k. Intuitively, we can interpret θd as the probabilities of topics appearing in document d and φk as the probabilities
of words associated with topic k. Since each document represents the text features of a process, we finally obtain the topic mixture for each process. The topic with the largest posterior probability is chosen for that process, and processes associated with the same topic form a cluster. Moreover, the topic mixture obtained from LDA can be readily used as a soft clustering, i.e., a process can be assigned to multiple clusters with different probabilities.

Cluster number selection. The perplexity score has been widely used in LDA to determine the number of topics (aka clusters in this work); it is a standard measure to evaluate the predictive power of a probabilistic model. The perplexity of a set of words from an unseen document d ∈ Dtest is defined as the exponential of the negative normalized predictive log-likelihood under the training model. The perplexity is monotonically decreasing in the likelihood of the test data [2]. Therefore, a lower perplexity score over a held-out document set indicates better generalization performance. The value of k which results in the smallest perplexity score over a randomly selected set of test documents is selected as the cluster number.

Second level clustering: a graph partition approach. After the first level clustering, we look into the detailed structural features of the processes within each cluster. Clustering methods based on pairwise distances, such as graph partition, have enjoyed widespread application when a mathematical representation for the object is difficult or intractable to obtain. Since the structure of a business process model cannot be easily represented by a single data type, such as vectors, the graph partition method is applied to obtain the second level clustering results, as illustrated by Fig. 3. Each business process is represented by a node, denoted by Pi. If the structure similarity between two processes is above a certain threshold, there is an edge between the corresponding nodes. In Fig. 3, nodes with the same color belong to the same first level cluster. All the nodes in the same connected component form a second level cluster. In our work, different structure matching algorithms are selected according to the characteristics of the specific business process repository. The detailed structure matching approaches are described in Section 4, after we introduce the business process repositories.
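To make the first-level step concrete, the sketch below clusters process text documents with an off-the-shelf LDA implementation and picks the number of topics by held-out perplexity. It only illustrates the procedure described above: scikit-learn's variational LDA stands in for the Gibbs-sampling implementation used in the paper, and the candidate topic counts and function names are our own assumptions; the priors 0.2 and 0.1 follow the settings reported in Section 4.1.

```python
# Illustrative sketch (not the authors' code): first-level clustering with LDA
# and perplexity-based selection of the number of topics.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

def first_level_clusters(process_texts, candidate_k=(10, 15, 20, 25)):
    # process_texts: one text document per process (step names + documentation)
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(process_texts)
    X_train, X_test = train_test_split(X, test_size=0.1, random_state=0)

    # Choose the topic number yielding the lowest held-out perplexity.
    best_k, best_perplexity, best_model = None, np.inf, None
    for k in candidate_k:
        lda = LatentDirichletAllocation(
            n_components=k, doc_topic_prior=0.2, topic_word_prior=0.1,
            random_state=0).fit(X_train)
        p = lda.perplexity(X_test)
        if p < best_perplexity:
            best_k, best_perplexity, best_model = k, p, lda

    # Topic mixture per process; the hard cluster is the most probable topic.
    theta = best_model.transform(X)
    return theta.argmax(axis=1), theta, best_model
```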
Fig. 3. Graph partition to find the second level clusters
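The second-level step, thresholding pairwise structure similarities and taking connected components as clusters as in Fig. 3, can be sketched with a simple union-find. The default threshold of 0.0 follows the setting mentioned in Section 4.1; the function names are our own.

```python
# Illustrative sketch (not the authors' code): second-level clustering by
# connecting processes whose structure similarity exceeds a threshold and
# returning the connected components as clusters.
def second_level_clusters(similarity, threshold=0.0):
    # similarity: n x n matrix of pairwise structure similarities
    n = len(similarity)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] > threshold:   # edge between Pi and Pj
                parent[find(i)] = find(j)      # union

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```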
3.2 Language Modeling and Structure Matching Based Business Process Retrieval
In our method, language modeling and structure matching are combined for process retrieval. Given a query process, processes are first ranked based on their text features using a language modeling approach. Since the important text features of the query process and of each process in the repository are represented as text documents, we essentially retrieve a subset of documents which are most similar to the query document. An LDA-based retrieval method is employed, which has improved document retrieval results within a language model framework [20]. The basic idea of language model based retrieval is to model the query generation process [16]. A language model D is built for each document in the collection and the documents are ranked based on their probabilities of generating the query Q, i.e., P(Q|D). Assuming the words in the query Q are all independent, we have $P(Q|D) = \prod_{i=1}^{m} P(q_i|D)$, where qi is the ith word in the query Q and m is the total number of words. P(qi|D) is specified by

$$P(q_i|D) = \lambda P_{ml}(q_i|D) + (1-\lambda) P(q_i|Coll) , \qquad (1)$$
where Pml(qi|D) and P(qi|Coll) are the maximum likelihood estimates of word qi in the document model D and in the entire collection, respectively. This model is called document based retrieval. In (1), λ is a general smoothing term, which has different forms for different smoothing methods. A study of different smoothing methods for language models applied to retrieval can be found in [22]. For instance, λ takes the form Nd/(Nd + μ) for Dirichlet smoothing, where μ is a smoothing parameter. An LDA-based model is proposed in [20] for document retrieval within a language model framework. Experimental results showed that it outperforms the cluster based retrieval method [14], which incorporates the clustering information as Dirichlet smoothing. Specifically, P(qi|D) is now modeled as a linear combination of the original document based retrieval model (1) and the LDA model:

$$P(q_i|D) = (1-\gamma)\left(\frac{N_d}{N_d+\mu} P_{ml}(q_i|D) + \left(1-\frac{N_d}{N_d+\mu}\right) P(q_i|Coll)\right) + \gamma P_{lda}(q_i|D) , \qquad (2)$$
where Nd/(Nd + μ) is the Dirichlet smoothing term, Plda(qi|D) is the probability that qi is generated by D in the LDA model, and γ is a weight term. Plda(qi|D) can be directly obtained from the trained LDA model in the process clustering step, i.e., $P_{lda}(q_i|D) = \sum_{z=1}^{K} P(q_i|z, \hat{\phi}) P(z|\hat{\theta}, D)$, where $\hat{\theta}$ and $\hat{\phi}$ are the posterior estimates of θ and φ in LDA. We then calculate P(Q|D) for the language model D built for each document d, which is used to rank the process corresponding to that document. A larger value indicates that the process has a higher probability to generate the query, i.e., it is closer to the query in terms of their text similarities. In this work, we adopt the aforementioned LDA-based retrieval method to first rank all the processes based on their text features and obtain a list of top k processes. In the second step, detailed structure similarities between the query and the processes in the list are computed. These processes are then re-ranked by their structure similarities in descending order. Let ri, ti and si denote the rank, text similarity, and structure similarity of process i in the list, respectively. The ranking scheme can be expressed as follows:

$$r_i \prec r_j \quad \text{if} \quad \begin{cases} s_i > s_j \\ t_i > t_j, \; s_i = s_j \end{cases}$$

which incorporates the perspectives of both textual and structural information. We name this process retrieval method LDA-retrieval. Since we only need to compute the structure similarities between the query process and a small subset of processes (i.e., k processes), the computational complexity is significantly reduced. In addition, if the query process or some process in the repository does not have formal control flows, we can still retrieve processes which match the user's query interest based on their text similarities. For comparison, we propose another cluster-based process retrieval method. Specifically, it first determines which topic (aka cluster) the query process falls under, and then does structural comparisons between the query and the processes in that cluster. Finally, the processes with the top k highest structure similarity scores are retrieved. Clustering therefore serves as a filtering tool. Some early studies have shown that this type of retrieval has advantages in precision-oriented searches for text documents [3]. However, other experimental work also indicated that clustering may not necessarily improve the retrieval performance, depending on the specific domain or collection [19]. Note that the clusters employed in this retrieval method are the first level clustering results from the process clustering step. The reason that we do not use the second level clusters is that the cluster membership of an unseen query process cannot be directly determined on the second level. Since these clusters are also obtained by LDA, we name this process retrieval method LDA-cluster-retrieval. Similarly, the clusters can also be obtained by applying the well known kmeans algorithm to processes represented by text feature vectors. As a comparison, we name the retrieval method based on kmeans kmeans-cluster-retrieval. In summary, we implemented and compared four business process retrieval systems, i.e., structure-matching-retrieval, LDA-retrieval, LDA-cluster-retrieval, and kmeans-cluster-retrieval. The trained LDA model and clusters from the
business process clustering step are reused in LDA-retrieval and LDA-cluster-retrieval. Detailed evaluations of these methods are presented in Sect. 4.
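As an illustration of the scoring in (2), the following sketch computes the query likelihood for one document from simple count statistics and a document-topic/topic-word estimate. All function and variable names are our own, and the snippet is a schematic rendering of the formula rather than the system used in the evaluation; the defaults γ = 0.2 and μ = 400 follow the parameter values reported in Section 4.2.

```python
# Illustrative sketch (not the authors' code): query likelihood for one document
# according to Eq. (2), combining Dirichlet-smoothed ML estimates with the
# LDA term P_lda(q_i|D) = sum_z P(q_i|z) * P(z|D).
import math

def query_log_likelihood(query_terms, doc_counts, coll_counts, coll_size,
                         theta_d, phi, word_index, gamma=0.2, mu=400.0):
    n_d = sum(doc_counts.values())                 # document length N_d
    lam = n_d / (n_d + mu)                         # Dirichlet smoothing weight
    score = 0.0
    for q in query_terms:
        p_ml = doc_counts.get(q, 0) / n_d if n_d else 0.0
        p_coll = coll_counts.get(q, 0) / coll_size
        p_doc = lam * p_ml + (1.0 - lam) * p_coll  # document-based model (1)
        w = word_index.get(q)
        p_lda = (sum(theta_d[z] * phi[z][w] for z in range(len(theta_d)))
                 if w is not None else 0.0)
        p = (1.0 - gamma) * p_doc + gamma * p_lda  # Eq. (2)
        score += math.log(p) if p > 0 else float("-inf")
    return score
```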
4 Experimental Results
In this section, we present the experimental results of the proposed business process clustering and retrieval method on multiple business process repositories. Given a business process repository, we first extract text words from the step names and the documentation (if available) of each process and combine them to form a text document. After removing a standard list of stop words, we get a text corpus for the entire process repository. If the associated text document for a process is too short, the process is excluded. We select processes whose number of text terms is no less than a certain threshold, which is set to 50, unless otherwise noted. Below we explain each data set briefly.

SAP best practice processes: we extract around six hundred business processes from SAP's solution composer tool, which is publicly available1. There are two categories of processes, i.e., industry-specific and cross-industry. 327 processes from the industry-specific domain are obtained. The total number of process steps in these processes is 3291, among which 2425 are unique. Similarly, we obtain 276 cross-industry processes which have 2752 steps in total and 2012 unique steps. The structure information of all these processes is not available.

SAP reference models: this is a collection of 604 business process models supported by the SAP enterprise system [12]. They are represented in the event-driven process chain (EPC) notation, with moderately complicated structures. Since these processes do not have documentation, we extract the text terms from step names to form a document for each process. After weeding out processes with fewer terms than the set threshold, we get 242 processes. The total number of steps of these processes is 6302, among which 2982 are unique.

APQC process classification framework: we obtain the industry-specific processes from the American Productivity & Quality Center's (APQC) process classification framework (PCF)2. These processes do not have documentation. The text terms are extracted from step names. Since the average number of terms of these processes is smaller than in the previous data sets, a smaller threshold of 20 is set to remove processes with short text. We finally obtain 180 processes. The total number of steps for these processes is 1312, among which 1269 are unique. The structures of these processes are linear chains.

Large company real-world processes: this data set contains a collection of real-world processes at a large IT service company. All the processes fall roughly into four categories: finance (FIN), master data management (MDM), order to cash (O2C), and opportunity to order (O2O). After removing processes with short
1 http://www.sap.com/solutions/businessmaps/composer/index.epx
2 http://www.apqc.org/
text, we finally get 117 processes. The total number of steps in these processes is 2220, among which 1862 are unique.

In our business process clustering and retrieval method, structure similarity matching is used to compute the pairwise similarities between processes within each first level cluster, and also the similarities between the query process and a selected small subset of processes for retrieval. Different structure matching algorithms are selected according to the characteristics of the specific business process repository. We consider the process step names in structure matching3. More process attributes, for instance, data flow and textual documentation for specific process steps, can also be taken into account. However, how to combine attributes from different modalities to achieve better structure comparison is not a trivial question. Since it is not the focus of this paper, we will not discuss it in great depth. For the SAP reference model repository, the A-star graph matching algorithm is used since it shows good performance in terms of mean average precision and running time in previous work [4]. In each step, the algorithm selects the existing partial mapping with the largest graph edit similarity, yielding an optimal mapping in the end. For processes in the APQC process classification framework, since they have chain structures and contain standardized steps, we adopt the "string edit" distance to measure their structure similarities, where each process is modeled as a "string" and the process steps are "characters". The SAP best practice processes only have process steps without specifying the control flows. A semantic similarity score [1] is therefore defined to measure the similarity between processes, i.e., S = 2 × ncs/(np + nq), where np and nq are the numbers of steps in processes p and q, and ncs is the number of common steps. Also, the structures of the evaluated real-world processes from a large IT service company are very complicated and rarely share any commonalities. For this repository, we also use the aforementioned semantic similarity score to compute their similarities.
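To illustrate the two simpler measures used above, the sketch below computes the step-overlap score S = 2·ncs/(np + nq) and a string-edit similarity over step sequences. Treating each distinct step name as a "character" and normalising the edit distance by the longer sequence are our own choices for the illustration, not details taken from the paper.

```python
# Illustrative sketch (not the authors' code) of the semantic similarity score
# and a string-edit-based structure similarity over process step sequences.
def semantic_similarity(steps_p, steps_q):
    common = len(set(steps_p) & set(steps_q))            # n_cs
    return 2.0 * common / (len(steps_p) + len(steps_q))  # S = 2*n_cs/(n_p+n_q)

def string_edit_similarity(steps_p, steps_q):
    # Levenshtein distance with whole step names as symbols.
    m, n = len(steps_p), len(steps_q)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if steps_p[i - 1] == steps_q[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return 1.0 - d[m][n] / max(m, n, 1)           # normalised to [0, 1]
```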
4.1 Business Process Clustering Evaluation
We present in this section the two-level business process clustering results. The Dirichlet priors α and β of LDA in the first level clustering are set to 0.2 and 0.1, which are common settings in the literature. The number of iterations of the Markov chain for Gibbs sampling is set to 3000. The larger the number of iterations, the longer it takes to train the model. For the data sets described above, we note that Gibbs sampling usually converges before 3,000 iterations. Before doing the first level clustering with LDA, we first determine the number of topics (aka clusters) using the perplexity score introduced in Sect. 3.1. The LDA model is trained separately on each data set. We calculate the perplexity scores of the trained model over a held-out test data set with different numbers of topics. Specifically, 10% of the samples from each data set are randomly selected as the test
3 For SAP reference models represented by EPC, the process steps considered in structure matching consist of both functions and events.
set. For each data set, the number of topics is set to the value resulting in the smallest perplexity score. Due to space limitations, we do not show how the perplexity score changes when the number of topics varies.

Discovered Topics. We examine the top words associated with each topic to evaluate the quality of the discovered topics. Table 1 shows a subset of the discovered topics from two process repositories. For instance, in Table 1(a), the four discovered topics in the SAP best practice (industry) repository are "contract", "management", "planning", and "payment". If we look at their associated top words, most of them are related to that particular topic. Note that the word in "discovered topics" is the key word we select to represent the topic. Business processes associated with the same topic form a cluster on the first level of the clustering.

Table 1. Top words associated with the selected topics

(a) SAP best practice (industry)
id  discovered topics  top words
0   contract           contract, system, billing, contracts, settlement, data, accounting, rights
1   management         level, process, control, management, controls, data, define, assign, perform
3   planning           planning, orders, project, order, production, process, planned, mrp, plan
4   payment            system, account, payment, business, invoice, credit, check, document, create
(b) SAP reference models
id  discovered topics  top words
0   asset              asset, depreciation, posted, investment, master, fixed, changed, simulation
2   maintenance        order, maintenance, material, created, capacity, service, purchase, deleted
3   quality            inspection, quality, lot, created, defects, certificate, vendor, results
14  registration       business, event, exists, order, appraisal, cancelled, resource, attendance
Two-level Clustering Results. After processes are clustered initially based on their topic similarities, we will do a detailed structure similarity matching among all the processes within the same cluster. Note that, for SAP reference models, we use the process heuristic matching algorithm from [4] with default setting, rather than the A-star algorithm, due to the extremely long running time we have observed using A-star for this particular problem. Graph partition method is then employed to generate the second level clusters. In practice, if the structure similarity score between two processes is above 0.0, there will be an edge between the corresponding nodes. Finally, nodes in the same connected component in the graph form a cluster. Table 2 shows selected two-level clustering results on several process repositories. All the processes in the same table belong to the same first level cluster. For instance, in Table 2(a), most processes are about planning, although they may discuss different kinds of planning, e.g., production planning, supply network planning, distribution planning and relocation planning. The last column shows the probabilities of these processes belonging to this specific cluster. Processes having close structure similarities are further grouped together, indicated by the same cluster id in the first column of each table. Cij indicates the jth second level cluster within the ith first level cluster. In Table 2(a), processes in cluster C31 are about production planning in ERP, while those in cluster C32 are about production planning in SCM, though all of them are about production planning. Processes that do not have any common structure with others
will be stand-alone, indicated by Cin in their cluster id, where i is the first level cluster id. Results in Table 2(b) are based on the SAP reference models. All the processes in this cluster are related with “maintenance and service”, which are from three categories, i.e., “plant maintenance”, “customer service”, and “quality management”. Looking into the detailed steps of these processes, we found that most of them are about maintenance, service, and material order, sharing partial structures, though they are from different categories. Since the process title information of this data set is not available, we put their process labels in the title column instead. Some selected clustering results of real-world business processes from a large IT services company are shown in Table 2(c). Most processes in this cluster are related with “order”, although they may be from different categories. On the second level clustering, processes from the same category tend to be clustered together since there is a better chance that processes in the same category have similar structures. Table 2. Selected Two-level Clustering Results (a) Topic 3 (planning) in SAP best practice (industry) cls id C31 C31 C31 C31 C31 C32 C32 C32 C32 C33 C33 C33 C3n C3n C3n
category process title Mining Production Planning (Process Manufacturing) MTS in ERP Mining Production Planning (Process Manufacturing) MTO in ERP Aero. Def. Manufacturing Production Planning (Discrete Manufacturing) MTO in ERP Aero. Def. Manufacturing Production Planning (Discrete Manufacturing) CTO in ERP Aero. Def. Manufacturing Production Planning (Discrete Manufacturing) MTS in ERP Mining Production Planning (Process Manufacturing) MTO in SCM Mining Production Planning (Process Manufacturing) MTS in SCM Consumer Products-Food Production Planning (Repetitive Manufacturing) CTO in SCM Aero. Def. Manufacturing Production Planning (Discrete Manufacturing) CTO in SCM Mining Supply Network Planning Heuristic Mining Supply Network Optimization Mining Multilevel Demand and Supply Match Pharmaceuticals Distribution Planning Defense Relocation Planning Aero. Def. Manufacturing Pre-Planning of Long Lead-Time Items
prob. 0.94 0.97 0.97 0.96 0.93 0.97 0.93 0.91 0.96 0.95 0.95 0.97 0.73 0.64 0.62
(b) Topic 2 (maintenance and ser- (c) Topic 1 (order) in real-world processes from vice) in SAP reference models a large IT service company cls id category process title C21 Customer Service 1Ku a1xs C21 Customer Service 1Ku 9efq C21 Plant Maintenance 1In at4y C21 Plant Maintenance 1In bip2 C21 Plant Maintenance 1In b8et C21 Plant Maintenance 1In apbf C21 Quality Management 1Qu c117 C21 Quality Management 1Qu c3ie C22 Customer Service 1Ku 903f C22 Customer Service 1Ku 9mgu C22 Plant Maintenance 1In bb6y C22 Plant Maintenance 1In agyu C22 Plant Maintenance 1In b19m C23 Customer Service 1Ku 9rnu C23 Customer Service 1Ku 96oz C23 Plant Maintenance 1In b2z3 C23 Plant Maintenance 1In avon C23 Plant Maintenance 1In ajlf C23 Plant Maintenance 1In bd6i
prob. 0.94 0.94 0.97 0.98 0.97 0.96 0.84 0.86 0.35 0.40 0.50 0.65 0.69 0.93 0.82 0.92 0.91 0.90 0.87
cls id category process title C11 FIN Process Purchase Requisitions C11 FIN Manage Material Master C11 FIN Process Purchase Orders C11 FIN Process Good Receipts C12 O2C Order-Donation Order C12 O2C Order-Front End Order C12 O2C Process Sales Order C12 O2C Order-ESW Order or Bump C12 O2C Order-Credit & Debit Memo Request C12 O2C Order-Trial Order C12 O2C Order-BP Stock Order C12 O2C Order-Multiple Business Models C12 O2C Order-CSC & ROC Order C13 MDM Add a New Location C13 MDM Register New Country Enterprise C14 O2O Reconcile Inventory C14 O2O Manage Sell Out Sell Thru Proc C15 O2C Manage Returns and Disposition C15 O2C Manage Outbound Delivery
prob. 0.79 0.91 0.96 0.80 0.99 0.94 0.98 0.95 0.99 0.99 0.97 0.98 0.99 0.93 0.79 0.62 0.69 0.95 0.95
In summary, we evaluated our two-level business process clustering method qualitatively. The experimental results indicate that our method can effectively discover some underlying information of process repositories and cluster processes with similar characteristics together.
4.2 Business Processes Retrieval Evaluation
In this section, we compare the four business process retrieval methods introduced in Section 3.2, i.e., LDA-retrieval, LDA-cluster-retrieval, kmeans-cluster-retrieval, and structure-matching-retrieval4, which is the baseline retrieval method. Mean average precision (MAP) is used to measure the retrieval performance. Given a query process, the average precision is calculated as $\sum_{j=1}^{n} (precision[j] \times rel[j]) / R$, where R is the number of relevant processes in the repository and $precision[j] = \sum_{k=1}^{j} rel[k] / j$. If the jth process is relevant to the query, rel[j] is one, otherwise zero. The mean average precision is the mean of the average precisions over a set of queries. To calculate MAP, we first need to identify all the relevant processes given a particular query. This is not trivial for a large process repository. A pooling approach used by TREC [18] is employed to get the complete relevance information, where only processes that are returned as possible answers to a query by the participating retrieval systems are evaluated, while all other processes in the repository are assumed to be not relevant. Specifically, for each query, we combine the top 20 returned processes from all the retrieval systems and evaluate whether these processes are relevant or not. All the other processes that are not retrieved are assumed to be not relevant. This manual evaluation was conducted independently by three authors of this paper adopting majority voting5.

Parameters. There are two parameters in LDA-retrieval that need to be determined, i.e., the weight γ for the LDA model and the Dirichlet smoothing parameter μ in (2). All the other calculations or parameters can be directly obtained from the LDA model in the business process clustering step. We use the SAP reference repository as training data to select the parameters γ and μ. The other data sets are used to test whether the selected parameters can be applied consistently. In LDA-retrieval, we perform a "grid search" on γ and μ. Various pairs of (γ, μ) are tried and the one yielding the best MAP is selected, i.e., γ = 0.2 and μ = 400. As the LDA model from the clustering step is used in LDA-retrieval and LDA-cluster-retrieval, the topic number of LDA may also have an influence on their retrieval results. We tried these two LDA-related retrieval methods on the SAP reference repository with different numbers of topics. When the topic (aka cluster) number is equal to 20, the same setting as in the LDA model of the clustering step, these two methods are able to achieve close-to-best or good mean average precisions. Therefore, the parameters and results of the LDA model from the clustering step can be directly used in the retrieval to save computational cost. However, more careful parameter selection can be performed in order to obtain the best retrieval results. For comparison, the number of clusters in kmeans-cluster-retrieval is set to be the same as that in LDA-cluster-retrieval.
4 We abuse structure matching slightly here since some data sets do not have structure information and the semantic similarity score is used.
5 For SAP reference models, the evaluation was conducted by the first author of this paper due to the access constraints for the other authors.
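For reference, average precision over a ranked top-n list as described above can be computed as in the following sketch; the function names are ours and the snippet is only meant to make the formula concrete.

```python
# Illustrative sketch: average precision and MAP for ranked retrieval results.
def average_precision(rel, R):
    # rel: list of 0/1 relevance flags for the ranked top-n results
    hits, ap = 0, 0.0
    for j, r in enumerate(rel, start=1):
        hits += r
        ap += (hits / j) * r        # precision[j] * rel[j]
    return ap / R if R else 0.0

def mean_average_precision(per_query_rel, per_query_R):
    aps = [average_precision(rel, R)
           for rel, R in zip(per_query_rel, per_query_R)]
    return sum(aps) / len(aps)
```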
Retrieval Results. We randomly select a set of query processes from each repository, specifically, 10 processes for SAP reference models and 5 for each of the other four data sets. Table 3 shows the mean average precisions of the top 20 retrieved processes for these systems over each query set. The highest MAP achieved for each repository is in bold font. As we can see, LDA-retrieval has the best retrieval performance over all the data sets except for SAP best practice (cross industry). Different from structure-matching-retrieval, in LDA-cluster-retrieval only a subset of processes is retrieved and compared against the query in terms of their structure similarities. For most repositories, LDA-cluster-retrieval has a higher MAP than structure-matching-retrieval. The reason is that some relevant processes may not share any partial structures with the query, i.e., they are relevant in terms of high-level text similarities. These processes are very likely to be assigned to the same cluster as the query and therefore get retrieved. On the other hand, some processes sharing partial structures with the query may not be relevant. Similarly, kmeans-cluster-retrieval also outperforms structure-matching-retrieval for some data sets, but is inferior to LDA-cluster-retrieval in general. Note that the low MAP of structure-matching-retrieval for APQC is due to the fact that there are not many common steps among processes in this repository. The structure similarities between most processes in APQC and the query are zero, and therefore structure-matching-retrieval can only randomly retrieve some processes to put in the top 20 list. Our LDA-retrieval method significantly outperforms structure-matching-retrieval for all the evaluated data sets.

Table 3. Mean average precisions of four retrieval systems on different process repositories

MAP                       SAP (industry)  SAP (cross industry)  SAP reference  APQC    Real practice
Structure matching        0.1968          0.4396                0.5771         0.0222  0.3125
LDA-retrieval             0.4615          0.4725                0.7940         0.4076  0.5550
LDA-cluster-retrieval     0.2311          0.5034                0.5692         0.0986  0.3636
Kmeans-cluster-retrieval  0.1976          0.3934                0.5019         0.0337  0.3676
Running Time Analysis. Table 4(a) lists the total running time of four retrieval systems over all the queries for each repository. All these systems are evaluated on a laptop with a dual core 2.66 GHz Intel CPU, 4.00 GB RAM, and Java Virtual Machine version 1.6 (with 256MB of heap size). For SAP reference models, structure-matching-retrieval takes up to 833.31 seconds while the other methods use a significantly smaller time. For the other data sets, the running time difference among these methods is quite small due to simple structure or semantic similarity computations. Since these retrieval systems also depend on the first level clustering results, we also show the running time of LDA and kmeans clustering for different process repositories in Table 4(b). The model inference of LDA is much more complicated than kmeans, so LDA clustering takes a longer time. Note that for the SAP reference models, even though we add the running time of LDA clustering to the retrieval for these methods, they still cost a much smaller time than structure-matching-retrieval. In practice, process clustering only needs to be conducted once for each data set. After that, the process clusters and trained LDA model can be readily used for future retrieval.
Table 4. Running time of process retrieval and the first level clustering (seconds)

(a) Running time of four retrieval systems on different process repositories
Running time              SAP (industry)  SAP (cross industry)  SAP reference  APQC   Real practice
Structure matching        0.025           0.018                 833.31         0.013  0.014
LDA-retrieval             0.12            0.15                  69.02          0.042  0.048
LDA-cluster-retrieval     0.005           0.003                 91.72          0.004  0.005
Kmeans-cluster-retrieval  0.008           0.011                 108.93         0.016  0.010
(b) Running time of LDA and kmeans clustering on different process repositories
Running time       SAP (industry)  SAP (cross industry)  SAP reference  APQC  Real practice
LDA-clustering     36.59           28.66                 54.67          5.73  10.51
Kmeans-clustering  2.03            2.71                  1.23           0.31  0.23
5 Conclusions and Future Work
In this paper, we present a business process clustering and retrieval method that combines language modeling and structure matching approaches in a novel way. While clustering of business processes helps reveal the underlying characteristics and commonalities among a given set, language modeling based retrieval helps discover the top k most similar processes to a given query more efficiently than the pure structure matching approach. We leverage both the text terms in process step names and the process documentation. In the absence of any formal models, processes can still be matched based on textual/topical similarities. This ability to work with different kinds of documentation about business processes makes our approach more practical since the quality of documentation for realworld business processes can vary widely. Our current experimental evaluation is “homogeneous” in the sense that the clustered or retrieved processes are from the same repository. In the future, we can apply the proposed method to “heterogeneous” clustering and retrieval within a pooled process repository. Processes with different modeling notations can be efficiently matched based on their text features whereas the pure structure matching involves a lot of conversion efforts in the first place. The compound data type of business process models requires novel clustering and retrieval methods which can incorporate process information with different modalities, for instance, control flow, data flow, detailed design documentation, and execution logs. In addition, work from process mining area which derives process model from execution logs can be combined with other forms of available documentation to enhance process similarity matching. We believe that instead of relying on a single type of information about processes, considering multiple sources of information will yield better insights about business process similarities.
References 1. Akkiraju, R., Ivan, A.: Discovering Business Process Similarities: An Empirical Study with SAP Best Practice Business Processes. In: Maglio, P.P., Weske, M., Yang, J., Fantinato, M. (eds.) ICSOC 2010. LNCS, vol. 6470, pp. 515–526. Springer, Heidelberg (2010)
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993–1022 (2003) 3. Croft, W.B.: A model of cluster searching based on classification. Information Systems 5, 189–195 (1980) 4. Dijkman, R., Dumas, M., Garc´ıa-Ba˜ nuelos, L.: Graph Matching Algorithms for Business Process Model Similarity Search. In: Dayal, U., Eder, J., Koehler, J., Reijers, H.A. (eds.) BPM 2009. LNCS, vol. 5701, pp. 48–63. Springer, Heidelberg (2009) 5. Dong, X., Halevy, A.Y., Madhavan, J., Nemes, E., Zhang, J.: Simlarity Search for Web Services. In: Proc. of VLDB 2004, pp. 372–383 (2004) 6. van Dongen, B.F., Dijkman, R., Mendling, J.: Measuring similarity between business process models. In: Bellahs`ene, Z., L´eonard, M. (eds.) CAiSE 2008. LNCS, vol. 5074, pp. 450–464. Springer, Heidelberg (2008) 7. Dumas, M., Garc´ıa-Ba˜ nuelos, L., Dijkman, R.M.: Similarity Search of Business Process Models. Bulletin of the Technical Committee on Data Engineering 32(3), 23–28 (2009) 8. Ehrig, M., Koschmider, A., Oberweis, A.: Measuring similarity between semantic business process models. In: Proc. of APCCM 2007, pp. 71–80 (2007) 9. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences 101(suppl. 1), 5228–5235 (2004) 10. Grigori, D., Corrales, J.C., Bouzeghoub, M., Gater, A.: Ranking BPEL Processes for Service Discovery. IEEE T. Services Computing 3(3), 178–192 (2010) 11. Jung, J., Bae, J., Liu, L.: Hierarchical clustering of business process models. International Journal of Innovative Computing, Information and Control 5(12), 1349– 4198 (2009) 12. Keller, G., Teufel, T.: SAP(R) R/3 Process Oriented Implementation: Iterative Process Prototyping. Addison-Wesley, Reading (1998) 13. Li, J.: Two-scale image retrieval with significant meta-information feedback. In: Proc. of ACM Multimedia Conference 2005, pp. 499–502 (2005) 14. Liu, X., Croft, W.B.: Cluster-based retrieval using language models. In: Proc. of ACM SIGIR Conference 2004, pp. 186–193 (2004) 15. Madhusudan, T., Zhao, J.L., Marshall, B.: A case-based reasoning framework for workflow model management. Data Knowl. Eng. 50(1), 87–115 (2004) 16. Ponte, J., Croft, W.B.: A language modeling approach to information retrieval. In: Proc. of ACM SIGIR Conference 1998, pp. 275–281 (1998) 17. Salton, G., Wong, A., Yang, C.S.: A Vector Space Model for Automatic Indexing. Communications of the ACM 18(11), 613–620 (1975) 18. Turpin, A., Scholer, F.: User Performance versus Precision Measures for Simple Search Tasks. In: Proc. of ACM SIGIR Conference 2006, pp. 11–18 (2006) 19. Voorhees, E.M.: The cluster hypothesis revisited. In: Proc. of ACM SIGIR Conference 1985, pp. 188–196 (1985) 20. Wei, X., Croft, W.B.: LDA-based document models for ad-hoc retrieval. In: Proc. of ACM SIGIR Conference 2006, pp. 178–185 (2006) 21. Yan, Z., Dijkman, R., Grefen, P.: Fast Business Process Similarity Search with Feature-Based Similarity Estimation. In: Meersman, R., Dillon, T.S., Herrero, P. (eds.) OTM 2010. LNCS, vol. 6426, pp. 60–77. Springer, Heidelberg (2010) 22. Zhai, C., Lafferty, J.: A study of smoothing methods for language models applied to ad hoc information retrieval. In: Proc. of ACM SIGIR Conference 2001, pp. 334-342 (2001)
Self-learning Predictor Aggregation for the Evolution of People-Driven Ad-Hoc Processes

Christoph Dorn (1,3), César A. Marín (2), Nikolay Mehandjiev (2), and Schahram Dustdar (3)

(1) Institute for Software Research, University of California, Irvine, CA 92697-3455
    [email protected]
(2) Centre for Service Research, University of Manchester, Manchester M15 6PB, UK
    [email protected]
(3) Distributed Systems Group, Vienna University of Technology, 1040 Vienna, Austria
    [email protected]

Abstract. Contemporary organisational processes evolve with people's skills and changing business environments. For instance, process documents vary with respect to their structure and occurrence in the process. Supporting users in such settings requires sophisticated learning mechanisms using a range of inputs overlooked by current dynamic process systems. We argue that analysing a document's semantics is of utmost importance to identify the most appropriate activity that should be carried out next. For a system to reliably recommend the next steps suitable for its user, it should consider both the process structure and the semantics of the involved documents. Here we propose a self-learning mechanism which dynamically aggregates a process-based document prediction with a semantic analysis of documents. We present a set of experiments testing the prediction accuracy of the individual approaches and then compare them with the aggregated mechanism, which shows better accuracy.

Keywords: document analysis, process recommendation, people-driven ad-hoc processes, document and process evolution.
1 Introduction
Changing business requirements, dynamic customer needs, and people's growing skills demand support for flexible business processes. Existing attempts to support such dynamic environments can be roughly categorised into ad-hoc processes [19], semi-structured or case-based processes [15], and well-structured processes [14]. Their differences lie mostly in the flexibility at the process modelling level. Generally they allow the user to deviate from the prescribed course of action, adjusting the process according to the current needs. This flexibility comes at the price of users facing more complex situations while receiving little support on how to carry out their activities. Recent efforts partially address this issue by recommending next steps and relevant documents [18,4]. Yet the increase of information exchanged as documents in disparate forms adds to the cognitive load of the user.
People-driven ad-hoc processes prescribe the existence of a process model, from which the user is completely free to deviate, and use the presence of certain documents and their states to determine the next action to recommend to the user. Yet the prevailing assumption that process documents are always well structured, conveying a precise meaning and purpose, is far from realistic. Achieving such a level of process flexibility (and subsequently providing the necessary user support) demands an accurate recognition of imprecise documents, allowing the purpose of process activities to remain the same while the required input and output documents vary. Our approach applies the concept of process stability to determine whether the underlying process model or a document's semantic structure yields better prediction results. The analysis of prediction results feeds back in turn to automatically adjust the process stability and aggregation. To the best of our knowledge, this dynamic prediction combination is the first attempt to combine process information and document alignment to semantically determine the type of a given document. We present experimental results obtained through realistic simulations of changing processes and documents with varying semantic structure. Our results support our claim that a dynamic predictor aggregation outperforms individual predictors, especially during evolving process phases. In the remainder of this paper, Section 2 presents a reference scenario. Section 3 describes preliminaries and our approach. We outline the individual and aggregated predictors in Section 4. Experimental setup and results are discussed in Section 5. Section 6 compares existing approaches with our contribution, followed by a conclusion and outlook on future work in Section 7.
2 Reference Scenario
The example in Figure 1 depicts a flexible people-driven order process. The individual work steps describe a general order of activities to successfully complete a process instance. The outlined flow, however, does not enforce the exact order of activities, which is up to the user, and covers no exceptions or process adaptations that might arise due to a specific customer request, incomplete information, or user-specific expertise. Consequently, the listed types of documents that characterise the exchanged messages specify merely an initial set of expected documents. In this scenario, we encounter various forms of document-centric process evolution:
Fig. 1. Generic order process and associated types of documents
Missing Documents: When the Replenish step (C) is updated to make use of an automatic restocking system, only user confirmation is required and the Quote message (3) no longer occurs.
Delayed Documents: A company potentially decides to delay the Agreement (5a) document until the PrepareShipment (F) activity signals the completion of the production and packaging sub-process.
Premature Documents: For premium customers, shipping becomes independent of billing, thus the DeliveryNote (8) potentially occurs before the Invoice (7).
Shifted Documents: For regular customers, the company might decide to merge the DeliveryNote (8) into the Invoice (7), thereby having the invoice after the PriorityDispatch (H) and dropping the delivery note altogether.
3 Preliminaries
Our approach relies on the following fundamental concepts: (i) there is a set of valid specifications for any document type (Sec. 3.1), (ii) the user can deviate from the process model at any time (Sec. 3.2), (iii) user feedback corrects predictions (Sec. 3.3), (iv) document-task dependencies are malleable and change over time (Sec. 4.1), and (v) document structures change over time (Sec. 4.2).

3.1 Document Alignment
The alignment of documents consists of determining the best match between a document and a set of internal document representations [7], which include the expected document structure but also knowledge regarding what to do with this document. Therefore, this differs from the related activity of classifying documents. We call a business' internal representation of a document a document type. It is possible that more than one document type represents the same document, though with different degrees of similarity; the document alignment process, however, provides the most appropriate one. In order to align a document to a document type, we need to define how documents are represented. Representing and standardising the structural information of documents has been addressed by a number of efforts, cf. [13,20]. In this paper we simply reuse the representation explained in [12], where the hierarchical structure of the semantic concepts is represented by the path from the root concept to each of the leaf concepts. For example, consider the structure of the document in Listing 1.1 (upper part) and the representation of the concepts Name and City (lower part). Notice that the two concepts Name belong to different entities and thus convey a different meaning. Also notice that even though the two concepts City possess the same meaning, they actually represent different concepts because of their position in the structure. We use this document representation for our experiments and leave the conversion of actual documents into XML out of the scope of this paper. We simply assume that others, cf. [10], have done it already.
<Document>
  <ClientOrganisation>
    <Name>TUVUM Solutions</Name>
    <PhysicalAddress>
      <Street>Baker St</Street>
      <City>London</City>
    </PhysicalAddress>
    <PostalAddress>
      <POBox>1234</POBox>
      <City>London</City>
    </PostalAddress>
    <ContactPerson>
      <Name>Christoph Dorn</Name>
    </ContactPerson>
  </ClientOrganisation>
</Document>

Document -> ClientOrganisation -> Name
Document -> ClientOrganisation -> ContactPerson -> Name
Document -> ClientOrganisation -> PhysicalAddress -> City
Document -> ClientOrganisation -> PostalAddress -> City

Listing 1.1. Document representation and semantic concepts
3.2 People-Driven Ad-Hoc Processes
In [4] we introduced a modelling language for describing people-driven ad-hoc processes. Individual process steps are linked to connectors by means of transition arcs, thereby creating a directed acyclic graph. The main difference from previous works on flexible workflow support systems (e.g., [14,2,18]) is a supplementary sequence graph that describes the preferred order in which a user progresses through the activities in the various branches. An incoming transition details all document types that are required as inputs to an activity. An outgoing arc lists all document types that are produced by that activity. See Figure 2 for a process model excerpt describing part of the scenario. The process engine observes all messages as well as user actions to determine which process activities have been successfully completed, have been skipped, or are currently processed. Doing so, it keeps track of upcoming activities and expected incoming and outgoing documents. For each incoming document, the process recommendation algorithm then identifies which activity the user should carry out next, and which follow-up activities are suitable. This approach, however, assumes that incoming documents are well-formed and thus are always mapped (i.e., aligned) to the correct document type.
Fig. 2. Process model excerpt detailing document types on transition arcs between activities
3.3 A Framework for Self-learning Document Predictors
Our approach to self-learning predictor aggregation is visualised in Figure 3. The Document-based Predictor analyses the structure of every document occurring in the process (1) and provides a ranking of likely types to the Dynamic Predictor Aggregation component (2a). Likewise, the Process-based Predictor provides expected document types (2b). The Dynamic Predictor Aggregation combines both rankings and provides the Process Engine (3) with the most likely document type, triggering (4) the Activity Recommender to suggest the next actions to the user (5). The Prediction Feedback component receives explicit user feedback in the form of document re-alignment, or implicit feedback by recommendation acceptance (6). This updates all three predictors to consider the prediction error and improve their individual performance (7). In the case of the dynamic aggregation, this includes an update of the process stability. Details on the process engine and activity recommender are out of scope of this paper as we focus on improving document type predictions, which consequently increase the precision of recommendations.
Fig. 3. Self-learning predictor aggregation through feedback analysis. Components point to their respective sections; remaining components are out of scope of the paper.
4 Document Predictors
A predictor P provides (for a particular process instance and given time) a probability measurement prob(dt) in the interval [0, 1] for each document type dt ∈ DT. The resulting prediction set R is normalised so that max{prob(dt) : dt ∈ R} = 1. Note that multiple document types within a set R can yield the maximum probability of 1. For example, upon completion of activity (E) producing two documents (3a, 3b), the process-based predictor would potentially give them equal probability of 1.

4.1 Predictor Based on Process Progress
Process-based prediction relies on a combination of the previously introduced process model with document type annotations (see Section 3.2) and a probabilistic document state model. A detailed discussion of the state model including the mechanisms for ageing and co-evolution with process changes is provided in [5]. We include a brief description of the state model here for completeness.
Fig. 4. State transition model for a document type. Each model instance provides the transition probabilities for particular combinations of document type and process model.
The document state model (Figure 4) describes the likelihood for transitions between various document states (such as scheduled, expected, arrived) during the process. We track only documents that are defined in the process model, each in a separate state model instance that is valid only for that particular pair of process type and document type. Based on incoming and outgoing documents as well as completed process steps, the process engine triggers the transition events in one or more message state models. In the scenario, for example, the Agreement and JobDescription documents remain in state Scheduled until a CreditNote triggers the activation of the SendAcceptance activity and thus sets both documents to Expected. In case the PrepareShipment activity is completed without a detected JobDescription, that document is set to Missing. Ultimately, the process-based prediction leverages the transition labels between document states. In any of the Waiting super-states the probability is determined based on the transition to the respective received state. From any of the Arrived super-states the probability is determined by the transition weight to the Repeated state. For prediction, all documents are ranked according to their probability to occur in their current state. The top ranked documents constitute the prediction of newly intercepted documents. The process-based predictor's main disadvantage is the lack of knowledge about the document's internal structure. Thus the predictor cannot discriminate between multiple types whenever a process step requires more than one input or output document, or when the process has arrived at a split where the branches are triggered by different document types.
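For illustration, this ranking step can be sketched as follows. The sketch is not the authors' implementation; the concrete state names and the lookup of transition probabilities keyed by (state, target) pairs are assumptions based on the description of Figure 4.

def process_based_prediction(doc_states, transition_prob):
    # doc_states: document type -> current state of its state-model instance
    # transition_prob: (state, target state) -> transition probability
    scores = {}
    for doc_type, state in doc_states.items():
        if state in ("Scheduled", "Expected", "Missing"):      # Waiting super-states
            scores[doc_type] = transition_prob.get((state, "Received"), 0.0)
        else:                                                  # Arrived super-states
            scores[doc_type] = transition_prob.get((state, "Repeated"), 0.0)
    top = max(scores.values(), default=0.0)
    # normalise so that the top-ranked document type(s) receive probability 1
    return {dt: (s / top if top else 0.0) for dt, s in scores.items()}

The returned dictionary is the process-based prediction set RP used in the aggregation below.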
4.2 Predictor Based on Document Alignment
As mentioned in Section 3.1, documents are represented by the concepts contained within, which indicate the semantic structure by representing the paths from the root concept to each of the leaf concepts. Therefore the meaning of the document consists of the set of concepts contained within it. Yet because in reality documents vary, aligning a document to a document type requires an approximation of one to the other. We perform this by calculating the observed "average" document type. We call this calculation the Centre of Mass. We consider DT to be a set of document types already defined and to which we want to align a document. A document type dt ∈ DT is composed of a set of concept paths cp_dt ⊆ CP, where CP represents the universe of known concept paths. Likewise, a document d is composed of a set of concept paths cp_d ⊆ CP. We say a document d is aligned to a document type dt if the condition cp_d ∩ cp_dt ≠ ∅ holds. Normally, the more concept paths are shared between d and dt, the more similar the two are. A document is considered aligned to a document type when the two share most of their concept paths. We use the Centre of Mass cm as the approximate representative of a document type dt; it is defined as follows:

cm_{dt} = \frac{|cp|}{\left|\{\forall cp \in \bigcup_{\forall d \to dt} cp_d\}\right|}    (1)
where cm is a vector in which each element contains the proportion [3] of each concept path cp with respect to all the concept paths included in the union of all documents d already aligned to a document type dt. Such a proportion can also be seen as the weight of the concept path in representing a document type. Once the Centre of Mass is calculated, we simply add up the weights of all the concept paths contained in the document we want to align. Obviously, concept paths in the document but not contained in the Centre of Mass do not convey any weight. The result is a value within the range [0, 1]; the higher the value, the more similar the document is to the approximation of that document type. This differs from the Centroid document classifier [6], where the cosine between two vectors (documents) is calculated to determine their similarity. The Centre of Mass learns by aligning a document to a document type. Whether the algorithm is supervised or not, the Centre of Mass re-calculates the approximation of a document type and considers the newly aligned document. In order to accommodate evolving document structures, a learning window is implemented by including only the X most recent documents per document type to compose its Centre of Mass. That is, we limit the number of documents to include only the most recently aligned documents, thus converging to the new "average" document type more quickly.
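A minimal sketch of how the Centre of Mass weights and the alignment score could be maintained; the window size and the treatment of duplicate concept paths are assumptions, not part of the paper.

from collections import Counter, deque

class CentreOfMass:
    def __init__(self, window=20):
        # learning window: only the X most recent documents shape the "average" document type
        self.recent_docs = deque(maxlen=window)

    def learn(self, concept_paths):
        self.recent_docs.append(set(concept_paths))

    def weights(self):
        # proportion of each concept path among all paths of the recently aligned documents
        counts = Counter(cp for doc in self.recent_docs for cp in doc)
        total = sum(counts.values())
        return {cp: c / total for cp, c in counts.items()} if total else {}

    def score(self, concept_paths):
        # sum of the weights of the concept paths present in the incoming document, in [0, 1]
        w = self.weights()
        return sum(w.get(cp, 0.0) for cp in set(concept_paths))

An incoming document would then be aligned to the document type whose Centre of Mass yields the highest score, which is the ranking provided as the document-based prediction RS.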
4.3 Self-adjusting Predictor Aggregation
The self-adjusting aggregation of multiple predictors consists of two phases. Throughout the process lifetime, the individual document-based and process-based document type ranks are aggregated and subsequently applied in user recommendations on which process step to execute next. Upon process termination, the second phase utilizes explicit user feedback (i.e., correction of misclassified documents) and implicit ranking analysis to improve the classification mechanism. Our mechanism dynamically adjusts the weight of process-based rankings (i.e., when messages occur according to the process model) and correspondingly the weight of the structural similarity (i.e., when the documents no longer follow the process) according to the underlying Process Stability (stability ∈ [0, 1]).
Dynamic Aggregation. The first step in the predictor aggregation algorithm (Algorithm 1) is the intersection of the process-based (RP) and document-based (RS) prediction sets (line 2). The algorithm deals with strongly conflicting predictions in lines 3 to 8, and otherwise focuses on optimal aggregation in lines 9 to 20. Note that prior to aggregation, we remove any document type dt from RP and RS that is not expected, i.e., prob(dt) = 0. An empty intersection (Raggr) indicates a strong disagreement and hints at a deviation from the process or a very distorted document. Hence, we derive all document types that are still expected to occur (or that are likely to occur again) and intersect them with the document-based prediction result. If this yields an empty set again, we apply the stability value to decide whether RP or RS is the more viable predictor. An intersection Raggr with two or more elements needs closer analysis to decide upon the most suitable aggregation preference pref ∈ [0, 1]. Here, process stability is not the sole factor for determining the significance of process-based and document-based predictors. Both predictors potentially encounter situations when they cannot clearly distinguish between multiple document types and thus have to give roughly equal probability to most of them. We apply the widely known Shannon entropy to determine whether a ranking result yields crisp or indecisive probability values. The normalized entropy h(R) for a document type ranking set is calculated as:

h(R) = \frac{-\sum_{i=1}^{|R|} \frac{score(type_i)}{\sum score(type)} \cdot \log\left(\frac{score(type_i)}{\sum score(type)}\right)}{\log(|R|)}    (2)
The normalized entropy yields values close to 1 for rankings of (almost) equal score, and values close to 0 for crisp scores regardless of ranking size. Subsequently we apply the following rules to determine the impact of process-based and document-based rankings: If both are indecisive, we completely rely on the prediction that the process stability currently tends towards. That is, for process stability > 0.5, we apply only process-based rankings, otherwise we apply only document-based rankings. If both are decisive, stability serves directly as the trade-off factor. Otherwise, when only one predictor is decisive, we apply no preference at all (pref = 0.5). An intersection Raggr with a single element represents a good agreement of both predictors and does not need any further processing. Before returning the aggregated set of document type probabilities, the probability values are normalized such that Σi prob(di) = 1, ∀di ∈ R. This allows us to calculate the prediction error if the top ranked document type is incorrect.

Algorithm 1. Predictor Aggregation Algorithm A(RP, RS, stability)
 1: function aggregate(RP, RS, stability)
 2:    Raggr ← RP ∩ RS
 3:    if |Raggr| = 0 then
 4:       E ← getAllExpectedMsgTypes(RP)
 5:       Raggr ← RC ∪ E    ▷ retain only expected message types
 6:       if |RC| = 0 then
 7:          if stability > 0.5 then Raggr ← RP
 8:          else Raggr ← RS
 9:    else
10:       if |Raggr| > 1 then
11:          RP ← RP ∩ Raggr
12:          procEntr ← calcNormalizedEntropy(RP)
13:          RS ← RS ∩ Raggr
14:          contentEntr ← calcNormalizedEntropy(RS)
15:          pref ← 0.5
16:          if procEntr > indecThreshold ∧ contentEntr > indecThreshold then
17:             pref ← round(stability)
18:          if procEntr < indecThreshold ∧ contentEntr < indecThreshold then
19:             pref ← stability
20:          Raggr ← aggregate(RP, RS, pref)
21:    normalize(Raggr); return Raggr
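For illustration, the entropy-based preference selection of lines 10 to 19 can be sketched as follows; the concrete value of the indecisiveness threshold is an assumption, since the paper does not report it.

import math

def normalized_entropy(scores):
    # close to 1 for (almost) equal scores, close to 0 for crisp scores, regardless of ranking size
    total = sum(scores)
    if total == 0 or len(scores) < 2:
        return 0.0
    probs = [s / total for s in scores if s > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(scores))

def aggregation_preference(proc_scores, doc_scores, stability, indec_threshold=0.9):
    proc_entropy = normalized_entropy(proc_scores)
    doc_entropy = normalized_entropy(doc_scores)
    if proc_entropy > indec_threshold and doc_entropy > indec_threshold:
        return float(round(stability))   # both indecisive: follow the stability tendency completely
    if proc_entropy < indec_threshold and doc_entropy < indec_threshold:
        return stability                 # both decisive: stability is the trade-off factor
    return 0.5                           # only one predictor is decisive: no preference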
Prediction Feedback for Stability Adjustment. Upon each successful process termination, the corresponding stability measure is adjusted. The amount and direction of adjustment depend on explicit user feedback (i.e., the overall prediction was incorrect) and implicit feedback (i.e., only one predictor was wrong). Algorithm 2 retrieves the process-based and document-based document type set for each document in the given process (lines 5 and 6) and calculates prediction error rates for the overall and individual predictors (lines 7 to 14). The prediction error function (lines 25 to 31) checks whether the true document type resides on the first position of the sorted document type set (i.e., no error). Otherwise, the function retrieves the position of the right type and calculates the error as the difference between its probability and 1. For two or more document types that yield probabilities prob(dt) = 1 (i.e., the predictor was indifferent), the error rate hence will be zero. Note that for the normalized overall prediction result, any misclassification will result in an error rate greater than zero. Next, we determine the process-wide stability change (lines 15 to 20). We increase the change for every document-based error and process-based correct classification and vice versa, independent of overall prediction success. There is no impact when both predictors are wrong. Any prediction error, however, weighs in heavier when the user corrects the combined prediction result (i.e., error > 0). Thus, as long as the overall prediction is correct, an individual error's impact is proportional to the error amount, and the effect on stability change remains potentially small. Once we determine the overall stability change, we apply the exponentially weighted moving average (EWMA) as the ageing function to update the previous process stability factor (lines 21 to 23). When the stability change is greater than zero, the new stability value will move towards 1. For a stability change value below zero, the new stability is closer to 0. The bootstrapping of the process stability value is straightforward. Empty process models that have no document information yet (or only a list of associated types but no mapping to transitions) start with stability = 0.3. We rely on the document-based predictor in the beginning as the process-based predictor still needs to learn the correct mappings. For new process models with detailed document information, we initiate stability = 0.7 as we assume that a new process is stable and not subject to immediate evolution.

Algorithm 2. Stability Adjustment Algorithm A(WF, oldStability)
 1: function adjust(WF, oldStability)
 2:    δ ← 0    ▷ stability change δ
 3:    for Message m ∈ WF do
 4:       Raggr ← getPrediction(WF, m)
 5:       RP ← getProcPrediction(WF, m)
 6:       RS ← getStructPrediction(WF, m)
 7:       if hasUserCorrection(WF, m) then
 8:          MessageType trueType ← getTrueType(WF, m)
 9:          error ← calcPredictionError(Raggr, trueType)
10:       else
11:          MessageType trueType ← getMessageType(Raggr[0])
12:          error ← 0
13:       procError ← calcPredictionError(RP, trueType)
14:       structError ← calcPredictionError(RS, trueType)
15:       if error > 0 then
16:          if procError > 0 ∧ structError = 0 then δ ← δ − 1
17:          if procError = 0 ∧ structError > 0 then δ ← δ + 1
18:       else
19:          if procError > 0 ∧ structError = 0 then δ ← δ − procError
20:          if procError = 0 ∧ structError > 0 then δ ← δ + structError
21:    if δ < 0 then delta ← max(0, oldStability + δ)
22:    if δ > 0 then delta ← min(1, oldStability + δ)
23:    newStability ← γ ∗ delta + (1 − γ) ∗ oldStability
24:    return newStability
25: function calcPredictionError(R, trueType)
26:    PredictionResult cr ← R[0]
27:    if cr == trueType then return 0
28:    else
29:       for PredictionResult cr ∈ R do
30:          if getMessageType(cr) == trueType then return 1 − getProb(cr)
31:    return 1
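A compact sketch of the stability update of Algorithm 2; the per-document feedback tuples are an illustrative encoding of the error values, and the EWMA coefficient of 0.3 follows the experimental setup in Section 5.1.

def update_stability(old_stability, per_document_feedback, gamma=0.3):
    # each feedback entry: (overall_error, proc_error, struct_error) for one document
    delta = 0.0
    for overall_error, proc_error, struct_error in per_document_feedback:
        if overall_error > 0:                       # user corrected the combined prediction
            if proc_error > 0 and struct_error == 0:
                delta -= 1
            if proc_error == 0 and struct_error > 0:
                delta += 1
        else:                                       # combined prediction was accepted
            if proc_error > 0 and struct_error == 0:
                delta -= proc_error
            if proc_error == 0 and struct_error > 0:
                delta += struct_error
    target = min(1.0, max(0.0, old_stability + delta))
    return gamma * target + (1 - gamma) * old_stability   # EWMA ageing (lines 21-23)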
5 Evaluation
We evaluate our approach based on the reference scenario in Section 2. Specifically, we are interested in (i) the prediction error rate and how well the dynamically aggregated predictor performs against the individual predictors as well as a naive fixed aggregation; (ii) how robust the dynamically aggregated predictor is in the presence of increasingly noisy (varying) documents; and (iii) how
accurately the process stability metric reflects the actual process evolution. Furthermore, we distinguish between two cases: the evolution of a stable process and the evolution of an empty process.

5.1 Experimental Setup
We simulate user behaviour by playing prefabricated log sequences against the prediction system. Each log sequence represents one process execution and is denoted a round. Multiple sequential rounds make up one experiment run. Within a run, any changes to the process model at the end of a round are made available to the subsequent round, thereby enabling process evolution. Process evolution itself is simulated by gradually switching from process-coherent log sequences to log sequences that follow a different process model. Note that we keep the structure of process steps stable as we focus on evaluating the document prediction aspect only. Consequently, the evolution affects only the type, location, and timing of documents within a process but not the sequence of activities. Two log sequences (L1, L2) are sufficient to cover all independent paths in the scenario process model. Two additional log sequences (L3, L4) describe an evolved model. Here, the Quote (3) document is no longer used, and the Agreement (5a) is delayed until completion of the PrepareShipment (F) activity, which also triggers the Authorization Of Invoice document (6). The Invoice (7) is produced when executing Priority Dispatch (J). The Delivery Note (8) only applies to Regular Dispatch (H). In all experiments the EWMA coefficient (for process stability aging) α is set to 0.3, which corresponds to a trade-off between rapid uptake of evolving process patterns and robustness against one-time process deviations. The logs used contain the following sequences of process steps and documents:

L1: 1, A, 2, B, C, 3, D, 4, E, 5a, 5b, F, 6, G, 7, J, 8
L2: 1, A, 2, D, 4, B, C, 3, E, 5a, 5b, F, 6, G, 7, H, 8
L3: 1, A, 2, B, C, D, 4, E, 5b, F, 6, 5a, G, J, 7
L4: 1, A, 2, D, 4, B, C, E, 5b, F, 6, 5a, G, H, 8
We use a pool of valid documents to generate documents for our experiments. Initially we gathered the documents mentioned in the scenario (Section 2) from companies in our project consortium: 6 Orders, 4 Offers, 5 Quotes, 5 Credit Notes, 5 Agreements, 7 Job Descriptions, 5 Authorisations of Invoicing, 8 Invoices, and 3 Delivery Notes. Even when they represent the same document type, they contain slightly different information from each other. Then we manually extracted the concept paths using the Core Components standard [20] as a reference for matching concepts. We randomly selected one document from each type to be considered as the document type specification throughout the experimentation. The rest of the documents then formed the pool of valid documents to choose from. Additionally, we developed a document generator which took a document of a specified type from the pool and introduced noise to it in order to take our experiments closer to practical reality. Such noise consists of randomly introducing, deleting, or replacing concept paths.
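A possible sketch of such a noise step; the number of edits per noise level and the uniform choice among the three operations are assumptions, since the paper does not detail the sampling scheme.

import random

def add_noise(concept_paths, known_paths, noise_level=0.2, rng=random):
    # randomly introduce, delete, or replace concept paths of a generated document
    noisy = list(concept_paths)
    edits = max(1, int(noise_level * len(noisy)))
    for _ in range(edits):
        op = rng.choice(["introduce", "delete", "replace"])
        if op == "introduce":
            noisy.append(rng.choice(known_paths))
        elif op == "delete" and noisy:
            noisy.pop(rng.randrange(len(noisy)))
        elif op == "replace" and noisy:
            noisy[rng.randrange(len(noisy))] = rng.choice(known_paths)
    return noisy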
We conduct each experiment with 5%, 10%, 15%, and 20% noise in each generated document. In addition, every combination of experiment run and document noise is carried out 10x with differently initialised random number generators. In total every experiment run is thus executed 40x. Within these 40 instances we obtain the prediction error for individual predictors, fixed aggregation, and dynamic aggregation, as well as the process stability metric in each round. We evaluate the responsiveness and adaptivity of our aggregation mechanism to process evolution. As success criteria, we measure the absolute prediction error of a predictor each round by simply counting how often it wrongly predicts a document. Thus an average absolute prediction error of 1 signifies that the predictor makes one prediction error per process instance. We compare process-based, document-based, fixed aggregation, and dynamic aggregation predictors. The fixed aggregation predictor simply merges process- and document-based ranks (Rfixed ← RP ∪ RS).
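Both measures can be stated compactly; the merge operator chosen for Rfixed below is an assumption, since the union of two normalised rankings can be realised in several ways.

def absolute_prediction_error(top_ranked_types, true_types):
    # number of documents in one round whose top-ranked predicted type is wrong
    return sum(1 for guess, truth in zip(top_ranked_types, true_types) if guess != truth)

def fixed_aggregation(r_p, r_s):
    # naive merge of the two normalised rankings: keep the larger score per document type
    return {dt: max(r_p.get(dt, 0.0), r_s.get(dt, 0.0)) for dt in set(r_p) | set(r_s)}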
5.2 Experiments and Results
In experiment 1, we first run 16 process instances (determined by alternating L1 and L2) which represent normal process behaviour (Phase 1). Then we evolve the process by having an interleaving of L1 to L4 for 12 rounds (Phase 2), and then continue for another 16 rounds of stable evolved behaviour, alternating L3 and L4 (Phase 3). The initial stability value is 0.7 as we assume a stable process. Figure 5a displays the average absolute prediction error for each of the four predictors with document noise level 20%. The three phases are clearly visible, with low(er) error rates during stable periods (rounds 1 to 16 and 29 to 44). In experiment 2, we evaluate the learning behaviour for an initially empty process (i.e., the process model is unaware of any document types). A run consists of 30x alternating sequences of L3 and L4. The initial stability value is set to 0.3. Here we split the 30 rounds into two phases: learning (1) during rounds 1 to 6 (Figure 5e) and stabilized (2) during rounds 7 to 30 (Figure 5f). In general, we find the dynamic aggregation predictor outperforming the document-based predictor considerably in all situations (absolute error reduction between 0.57 and 2.19). In experiment 1 (Figure 5c+d), the dynamic aggregation predictor fares as well as the fixed aggregation predictor and the process-based predictor during stable process phases (fixed: [-0.10, 0.10] and proc: [-0.08, 0.28]). However, it outperforms both predictors during process evolution (fixed: [0.12, 0.23] and proc: [0.48, 0.52]). In experiment 2 (Figure 5e+f), the dynamic aggregation predictor always outperforms all other predictors; the absolute error difference is, for fixed: [0.07, 0.33], doc: [0.17, 1.31], and proc: [0.63, 2.0]. Document noise has little effect on the dynamic aggregation in experiment 1. The increase in document-based predictor error is between 0.67 and 1.15 when moving from 5% to 20%. At the same time, the increase in dynamic predictor error is only between 0.06 and 0.26. In experiment 2, the effect of noise is more visible as the dynamic predictor relies more on the document-based predictor during the initial learning phase (doc error diff: 1.37, dynamic error diff: 0.93). Once stabilized, the noise has little impact (doc error diff: 0.87, dynamic error diff: 0.15).
Fig. 5. Experiment 1: Comparison of document predictors (a), process stability/change (b) across 44 rounds, and error rates for 5% (c) and 20% noise (d) averaged for stable (1), evolving (2), and evolved (3) process phases. Experiment 2: Learning documents in an empty process model: comparison of predictor error rates for 5% (e) and 20% noise (f) averaged for learning (1) and stabilized (2) process phases.
Finally, as shown in Figure 5b, the feedback mechanism successfully determines stability values that accurately reflect the process dynamics. Process stability drops precisely whenever evolution begins (start of phase 2 in exp1 and start of phase 1 in exp2) and stabilizes once evolution terminates. Stability values for experiment 2 and additional analysis are available as supporting online material (SOM) at http://www.infosys.tuwien.ac.at/staff/dorn/bpm2011SOM.pdf.
6 Discussion and Related Work
Multiple research efforts have focused on supporting ad-hoc workflows (e.g., [19]), semi-structured processes/case-based systems (e.g., [15,2]), and well-structured processes (e.g., [14,21,1]) over the past decade. The increase in flexibility, however, comes at the cost of lacking user guidance. Recently this shortcoming has been addressed by introducing recommender systems [18,4,5]. All these approaches are focused on user activities and assume that the process documents are either rigorously defined (explicitly or implicitly) or that the documents have no impact on adaptation and thus recommendation. This contrasts with existing work in the area of document-driven processes, where, for example, [9] presents an approach for dynamic workflows supported by software agents. Documents trigger events in the system such as document arrival, document updating, or document rejection. A set of rules represents the business logic (workflow process). Agents then exchange documents between each other according to the business logic, thus managing the document life cycle. However, even though this approach considers documents within the processes, it does not semantically analyse their contents, which renders it different from ours. Orthogonal to our approach, the framework for document-driven workflows by [22] models document meta-information rather than the internal semantic structure. Our document type state transition model (see Figure 4), however, can be seen as an additional document meta-model. Yet with our approach processes are allowed to evolve due to the probabilities the transition model manages, whereas [22] does not show how their document meta-model supports such evolution. Object-aware process management is the focus in [8]. Here document dependencies are explicitly modelled and drive the process execution. The object structure as well as the task-object relations, however, are statically defined at design-time and cannot be changed during runtime. Neither do the authors envision user support through recommendations. Additionally, [11] aims at detecting which document content has an effect on the process outcome, ultimately to provide user recommendations. The process itself, however, remains rigid and the document structure is pre-defined. An E-mail can be seen as a form of document being exchanged, especially between small companies. [16] presents Semanta, a speech act theory approach to analyse E-mails in terms of the verbs contained, mapping between collections of verbs and activities in ad-hoc workflows. In [17] Semanta is presented as an E-mail-based system for workflow support, yet the E-mail (document) analysis is exclusively based on the verbs found, in contrast to our semantic analysis. The main conceptual differences of our approach are (i) the relaxation of both a rigid document structure and a predetermined document-task dependency and (ii) the automatic learning of evolving documents and document occurrence in the process. This results in a radically different approach to building a process support system. Rather than executing every step automatically, the process engine needs to propose suitable steps to the user and determine from user actions and subsequently observed documents how the process actually continued. Techniques for document alignment [12] and a recommendation-centric process
engine [4,5] were developed in the scope of the EU project Commius (http://www.commius.eu), which applies email as the main messaging infrastructure [10].
7 Conclusions
In this paper we presented a self-learning mechanism for determining document types in people-driven ad-hoc processes. The mechanism dynamically decides whether the prediction needs to rely more on process structure information or more on document semantics. We introduce a process stability metric which describes whether a process is currently evolving. Implicit and explicit user feedback keep the metric up to date at all times. Simulations derived from real-world documents and processes demonstrate that our mechanism yields the lowest prediction error rates during evolving and stable process phases. In the future we plan to refine the dynamic aggregation mechanism by capturing process stability on a more fine-grained level through detection of activity hot spots. Additional analysis of the prediction success rates for individual document types and the similarity between document types seems promising for reducing error rates even further.

Acknowledgment. This work has been partially supported by the EU STREP project Commius (FP7-213876) and the Austrian Science Fund (FWF) J3068-N23.
References

1. Adams, M., Edmond, D., ter Hofstede, A.H.M.: The application of activity theory to dynamic workflow adaptation issues. In: 7th Pacific Asia Conference on Information Systems, pp. 1836–1852 (2003)
2. Adams, M., Hofstede, A., Edmond, D., van der Aalst, W.: Facilitating flexibility and dynamic exception handling in workflows through worklets. In: Proceedings of the CAiSE 2005 Forum, FEUP, pp. 45–50 (2005)
3. Berry, M.W., Drmac, Z., Jessup, E.R.: Matrices, vector spaces, and information retrieval. Society for Industrial and Applied Mathematics Review 41(2) (1999)
4. Dorn, C., Burkhart, T., Werth, D., Dustdar, S.: Self-adjusting recommendations for people-driven ad-hoc processes. In: Proceedings of the International Conference on Business Process Modelling. Springer, Heidelberg (September 2010)
5. Dorn, C., Dustdar, S.: Supporting dynamic, people-driven processes through self-learning of message flows. In: Mouratidis, H., Rolland, C. (eds.) CAiSE 2011. LNCS, vol. 6741, pp. 657–671. Springer, Heidelberg (2011)
6. Han, E.H., Karypis, G.: Centroid-based document classification: Analysis and experimental results. In: Zighed, D., Komorowski, J., Zytkow, J. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 424–431. Springer, Heidelberg (2000)
7. Joseph, D., Marín, C.A.: A study on aligning documents using the circle of interest technique. In: 5th International Conference on Software and Data Technologies, pp. 374–383. SciTe Press (2010)
8. Künzle, V., Reichert, M.: PHILharmonicFlows: towards a framework for object-aware process management. Journal of Software Maintenance and Evolution: Research and Practice 23(4), 205–244 (2011), http://dx.doi.org/10.1002/smr.524
9. Kuo, J.Y.: A document-driven agent-based approach for business processes management. Information and Software Technology 46(6), 373–382 (2004)
10. Laclavík, M., Dlugolinský, S., Šeleng, M., Kvassay, M., Hluchý, L.: Email analysis and information extraction for enterprise benefit. Computing and Informatics 30(1) (2011)
11. Lakshmanan, G.T., Duan, S., Keyser, P.T., Khalaf, R., Curbera, F.: A heuristic approach for making predictions for semi-structured case oriented business processes. In: First Workshop on Traceability and Compliance of Semi-structured Processes @BPM 2010. Springer, Heidelberg (2010)
12. Marín, C.A., Carpenter, M., Wajid, U., Mehandjiev, N.: Devolved ontology in practice for a seamless semantic alignment within dynamic collaboration networks of SMEs. Computing and Informatics 30(1) (2011)
13. OASIS: ebXML technical architecture specification. Tech. rep., ebXML (2001)
14. Reichert, M., Rinderle, S., Dadam, P.: ADEPT workflow management system: flexible support for enterprise-wide business processes. In: van der Aalst, W.M.P., ter Hofstede, A.H.M., Weske, M. (eds.) BPM 2003. LNCS, vol. 2678, pp. 370–379. Springer, Heidelberg (2003)
15. Reijers, H., Rigter, J., van der Aalst, W.: The case handling case. International Journal of Cooperative Information Systems 12, 365–391 (2003)
16. Scerri, S., Davis, B., Handschuh, S.: Improving Email Conversation Efficiency through Semantically Enhanced Email. In: Proceedings of the 18th International Conference on Database and Expert Systems Applications, pp. 490–494. IEEE Computer Society, Washington (2007)
17. Scerri, S., Davis, B., Handschuh, S.: Semanta - Supporting E-mail Workflows in Business Processes. In: Proceedings of the 2009 IEEE Conference on Commerce and Enterprise Computing, pp. 483–484. IEEE Computer Society, Los Alamitos (2009)
18. Schonenberg, H., Weber, B., van Dongen, B., van der Aalst, W.: Supporting flexible processes through recommendations based on history. In: Dumas, M., Reichert, M., Shan, M.-C. (eds.) BPM 2008. LNCS, vol. 5240, pp. 51–66. Springer, Heidelberg (2008)
19. Stoitsev, T., Scheidl, S., Spahn, M.: A framework for light-weight composition and management of ad-hoc business processes. In: Winckler, M., Johnson, H. (eds.) TAMODIA 2007. LNCS, vol. 4849, pp. 213–226. Springer, Heidelberg (2007)
20. UN/CEFACT: Core components technical specification – part 8 of the ebXML framework. Tech. rep., UN/CEFACT (2003)
21. Vanderfeesten, I.T.P., Reijers, H.A., van der Aalst, W.M.P.: Product based workflow support: Dynamic workflow execution. In: Bellahsène, Z., Léonard, M. (eds.) CAiSE 2008. LNCS, vol. 5074, pp. 571–574. Springer, Heidelberg (2008)
22. Wang, J., Kumar, A.: A framework for document-driven workflow systems. In: van der Aalst, W.M.P., Benatallah, B., Casati, F., Curbera, F. (eds.) BPM 2005. LNCS, vol. 3649, pp. 285–301. Springer, Heidelberg (2005)
Serving Information Needs in Business Process Consulting

Monika Gupta, Debdoot Mukherjee, Senthil Mani, Vibha Singhal Sinha, and Saurabh Sinha

IBM Research – India
{monikgup,debdomuk,sentmani,vibha.sinha,saurabhsinha}@in.ibm.com
Abstract. Business Process Consulting is a knowledge-intensive activity that requires consultants to be aware of all available process variants and best practices to implement the most effective business transformation. Typically, during a business-consulting engagement, a large amount of unstructured documentation is produced that captures not only the processes implemented, but also information such as the requirements addressed, the business benefits realized, and the organizational changes impacted. Consulting organizations archive these documents in internal knowledge repositories to leverage the information in future engagements. Our analysis of one such repository in the IBM Global Services’ SAP practice indicates that although there is tremendous scope for information reuse, the true potential of reuse is not being fully realized using existing search technology. We present a novel search technique to serve the information needs of consultants as they script various pieces of design information. The proposed system auto-formulates fat semantic queries based on the context of the design activity and runs the queries on a semi-structured knowledge repository to return a ranked set of search results. Our empirical studies show that our approach can outperform—often, significantly—a conventional keyword-based search technique in terms of both relevance and precision of search results.
1 Introduction

An enterprise-wide Business Process Management (BPM) initiative is a highly knowledge-intensive activity. During the course of this activity, consultants document the business needs, the as-is business processes, the desired processes, the benefits expected, and so on. Furthermore, based on the technology to be used, such as a standard ERP system (e.g., SAP or Oracle), the consultants also identify how to configure and customize the pre-packaged process implementations of the system to meet the requirements of the desired processes. The requirements-gathering phase of a BPM engagement requires extensive collaboration between the consultants and the client teams, usually takes months to complete, and is one of the costliest phases of the BPM project lifecycle. The success of such a project depends to a large extent on the knowledge that consultants possess about the best practices and challenges pertinent to the business scenarios of the project. Recognizing the significance of this factor, business-consulting organizations endeavor to capture the information produced in BPM projects as assets that can be leveraged in future projects. Repositories of such assets, which exemplify
the translation of individual knowledge into an organizational capability, offer many benefits: for example, they can serve as a safeguard against employee churn and help bring new or less-skilled consultants up to speed. Moreover, reuse of information from such repositories can greatly improve the productivity of, and the quality of solutions delivered by, the consultants. Efficient and effective reuse of information depends on the structure of the information in the repositories and, more importantly, on the ease and accuracy with which relevant information can be searched and retrieved. During requirements gathering in a BPM project, most of the information is authored as unstructured documents (e.g., Word documents, spreadsheets, or Visio or PowerPoint diagrams). Although consulting practices often use document templates that prescribe the information to be collected and that enforce some rudimentary structure on the documented information, these documents are essentially treated as unstructured text in the asset repository. Consequently, free text search (e.g., using keywords) is the only viable search mechanism, which can be highly imprecise and can return a large amount of irrelevant information. Such enterprise search mechanisms are nowhere close in effectiveness to web search because the link-analysis algorithms [2,9], which are instrumental in retrieving high-quality web pages, do not apply to the repository documents that lack any hyper-linked structure. In fact, the development of effective search techniques over enterprise repositories is recognized as an open research problem. Hawking [8] lists several challenges that impede enterprise search, one of them being the need to seamlessly combine structured (e.g., relational) and unstructured information in a document for search. He also points out that efficient exploitation of search context is an open problem, which when sufficiently solved can potentially change the face of enterprise search. In this paper, we present a new approach for organizing and reusing knowledge assets, specifically artifacts related to business processes, produced during requirements gathering in a business-transformation project. Our approach stores information in a semi-structured format. The observation underlying our approach is that most of the content contained in process-related workproducts can be represented in a domain-specific information model. Therefore, we propose to store these workproducts in a semi-structured format, such as XML, that follows a domain-specific schema. Semi-structured data has many advantages; for example, it makes evident different concepts, such as process steps and business benefits, present in a process document. Thus, identifying keywords in the description of a specific concept would make search more efficient and accurate, especially in cases where the process documents are heterogeneous in terms of concepts. Fully structuring the data, e.g., in a relational database, may not be appropriate. Process consultants are mostly interested in broad-topic queries. In general, they wish to read relevant documents to get new ideas on the design task at hand. Historically, keyword-based information retrieval has responded better to such queries, whereas structured querying has been suitable when specific results are desired.
In our setting, both queries and search results can be large chunks of text; therefore, we expect statistical techniques of information retrieval, which are commonly provisioned over semi-structured or unstructured data, to perform well. We also present a novel search strategy that uses the context of the task at hand to retrieve relevant information from the repository. Contextual search has been discussed
in the research literature [3,6,20] as an attractive approach to supplement the user's query with keywords that reflect the user activity. Instead of just gathering a set of keywords as context, we translate our context to fat semantic queries [1], which leverage the semi-structured nature of the data. We implemented our technique and conducted empirical studies using a knowledge repository, which contains process documents created in real business-transformation projects. In the studies, we compared the accuracy of our context-based approach with that of the keyword-based search, and evaluated the effectiveness of context in increasing the accuracy of the retrieved information. Our results illustrate the effectiveness of our approach: for the analyzed repository, our search technique outperformed keyword search in over 85% of the queries, and often quite significantly: for 75% of the queries, our approach retrieved one-third the number of documents retrieved by keyword search. Our results also illustrate that adding more context to a query generally leads to higher-quality search results. The key contributions of our work are:
– Empirical analysis of an existing knowledge repository that illustrates the potential for reuse and motivates the need for advanced search techniques
– The description of a semi-structured organization of information in knowledge repositories and the presentation of a context-based search technique over the semi-structured information
– The implementation of the technique for a knowledge repository containing process-related workproducts
– The presentation of empirical results, which illustrate that context-based search can significantly outperform keyword-based search in retrieving accurate information.
The rest of the paper is organized as follows. Section 2 provides an illustration of the problem. Section 3 presents the empirical study of an existing knowledge repository. Section 4 describes our context-based search approach. The empirical evaluation and results are discussed in Section 5. Section 6 discusses related work, and Section 7 summarizes the paper and lists directions of future work.
2 Background

In this section, we define different artifacts related to business-process modeling, and introduce an example to illustrate the problem. The artifacts that we consider in our work are deliverables prescribed by the SAP Practice in IBM. According to this practice, one of the central documents created in a business-transformation project is the process-definition document. A process-definition document (PDD) contains a detailed description of a business process and, in particular, contains information about different concepts, some of which we enumerate. (1) Process: the name and a brief description of the process; (2) Process steps: one or more steps that constitute the process; (3) Inputs: one or more inputs to the process steps; (4) Outputs: one or more outputs from the process steps; (5) Business benefits: one or more benefits of the process to the enterprise; (6) Organizational change impact (OCI): impact of the process on the organization; (7) Requirements: one or more functional or
non-functional requirements of the process; (8) Regulations: rules and regulations that the process must conform to; (9) Integration considerations: various technical or functional integration points applicable; (10) Key design decisions: design considerations for the process; (11) Triggers: the trigger points for the process; (12) Dependencies: the dependencies for the process. At the conclusion of a project, these documents are cleansed to remove client-specific information and populated into a knowledge repository to be leveraged in future projects. To illustrate the problem of searching for information from such a repository, consider the example shown in Figure 1, which depicts three PDDs (in the rows) and three concepts (in the columns)—the concepts are process name, process steps, and OCI. All the three process names contain the term Purchase Orders. The process steps of PDD1 and PDD2 are very similar, whereas the steps of PDD3 are quite different from the steps of PDD1 and PDD2 (even though the names of the three processes are similar). Consider the scenario where a consultant is creating the steps for a Purchase Order process, and has to understand and document the organizational change impact of the process. To leverage the information captured in the knowledge repository, the consultant searches the repository for similar PDDs to examine the OCI information in those documents. If the search is performed using the keyword Purchase Orders, all three PDDs are retrieved. Suppose that the consultant mentions Create Purchase as another keyword. However, the search still returns all three documents. Essentially, the search lacks the capability to associate keywords with concepts and, thereby, search for a keyword over a restricted space (i.e., a concept) instead of over the entire PDD. Thus, one of the ideas underlying our approach is to extract concepts from PDDs and perform search over concepts. For this scenario, our approach constructs two queries, to search for: (1) Purchase Orders over the process (name) concept, and (2) Create Purchase over the process-step concept. When these queries are executed, only PDD1
Fig. 1. Samples of process-definition documents, with information about process steps and organizational change impacts
and PDD2 are retrieved, which are the PDDs that the consultant would want to examine. In this manner, our approach can improve the accuracy of enterprise search. Next, we formally define some terms that we use to explain our approach.
– Schema: An information schema (or simply schema) is a map of concepts and their inter-relationships. It is represented as a 2-tuple, Σ = {E, R}, where E = {Ci} is the set of concepts and R = {(Ci, Cj) : Ci, Cj ∈ E} is the set of directed relationships between concepts. Unlike in an Entity-Relationship (ER) schema, a relationship in our setting is an ordered pair (Ci, Cj) and signifies that data of Cj depends upon that of Ci. For the example in Figure 1, (process, OCI) and (step, OCI) are the directed relationships, since the creation of OCI depends upon the definitions for process and steps.
– Record: A record is an instance of a concept, containing the data for the concept and links to other records. Formally, a record is a 3-tuple: r = (δ, C, λ), where δ contains the data for the concept C ∈ E, and λ = {rj : (r.C, rj.C) ∈ R} denotes the set of records that are linked to r. Now, if rj ∈ ri.λ, we denote the relation between records by ri → rj. For example, for PDD3 in Figure 1, (Processing Indirect Purchase Order, Process, OCI-3) is a record, where OCI-3 is a reference to the record for OCI in PDD3. Note that the data of a record can be in the form of plain text, rich text, or binary (for images and attachments).
– Similarity in Records: We consider a record ri to be similar to another record rj, i.e., ri ∼ rj, if there is high textual similarity between ri.δ and rj.δ, and ri.C = rj.C. In this paper, we use the cosine similarity, sim, of term vectors for records to measure textual similarity [12]; ri ∼ rj only if sim(ri.δ, rj.δ) is greater than some predefined threshold, κ.
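The similarity test above can be sketched minimally as follows; raw term counts are used here instead of weighted term vectors, and the value of κ is an assumption, since the paper does not fix it.

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # cosine of simple term-count vectors of the two record data fields
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def similar_records(record_i, record_j, kappa=0.5):
    # record = (data, concept, links); r_i ~ r_j only if the concepts match and sim > kappa
    data_i, concept_i, _ = record_i
    data_j, concept_j, _ = record_j
    return concept_i == concept_j and cosine_similarity(data_i, data_j) > kappa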
3 Potential for Reuse: An Empirical Study

In this section, we present the results of two empirical studies that we conducted to assess the potential for information reuse in business process consulting. We analyzed the internal SAP knowledge repository at IBM Global Business Services, which archived over 14000 documents of different types (e.g., PDD, GAP, and Business Process Procedure) from 37 different SAP engagements done at IBM. The repository is built using the Rational Asset Manager (RAM)1. RAM provides full-text keyword search of documents and also facet-based search on pre-defined categories. The documents in the repository are categorized by document type and client industry (e.g., Retail and Healthcare) and span 15 different industries. We considered all 2675 PDDs from the repository for our studies.

First, we developed a comprehensive information schema, ΣPDD = (EPDD, RPDD), that describes all key concepts in a PDD and their inter-relationships, with help from expert SAP consultants. Next, we used an information extractor, AHA [14], to convert content from PDDs into records of the concepts defined in EPDD. We consider this set of records as our data set (DS) for all the experiments presented in this paper. Table 1 shows the distribution of records of different concepts in DS. Not all PDDs contain data for each concept; however, each PDD contains at least the process name and description.

1 http://www-01.ibm.com/software/awdtools/ram
Table 1. Count of different concepts across the analyzed PDDs

  Concept Type                    ID    Number of Instances
  Process (name, description)    C1    2675
  Triggers                       C2     614
  Process Steps                  C3    1039
  Requirements                   C4     912
  Regulations                    C5     685
  Output                         C6    1148
  OCI                            C7    1434
  Key Design Decision            C8     476
  Integration Considerations     C9     896
  Input                          C10   1170
  Dependencies                   C11    546
  Business Benefits              C12    338
3.1 Study 1: Presence of Similar Records in the Repository

We analyzed DS for evidence of similar records, to assess whether reuse of previously created information is feasible in process consulting. We observed that reuse of information in records can happen at multiple levels of abstraction. Therefore, we seek to measure the potential for reuse at three different levels: (1) within a single project, (2) across different projects but within the same industry, and (3) across multiple industries.

For every concept Ci ∈ EPDD, we clustered all records of Ci using CLUTO2. We converted each record into a term vector with TF*IDF weights represented in the vector-space format [12]. Next, we used the cosine-similarity measure to cluster the term vectors. For each run of clustering, we gradually increased the number of clusters until the average intra-cluster similarity, Isim, was satisfactory. We found that when Isim is greater than 0.5, the documents clustered together have sufficiently similar semantics to be considered for reuse.

Figure 2(a) shows the sizes of the resultant clusters across different concepts for Isim = 0.5. For each concept, we show the number of clusters that have: (a) one document, (b) between 2 and 5 documents, and (c) more than five documents. For example, for concept type Business Benefit, 45% of the clusters had only one document, indicating that the content was unique. However, 50% of the clusters had between two and five documents, and 5% of the clusters contained more than five documents. This indicates that 55% of the business benefits were similar to other business benefit records; hence, in 55% of the cases, there is a potential for reuse. Across all concepts, only 28% to 45% of the clusters contained only one document. This indicates that there is a high potential for reuse across all concepts (varying from 55% for business benefit to 72% for requirements).

Figure 2(b) plots the concept-wise spread of clusters (1) within the same project, (2) across different projects in the same industry, and (3) across multiple industries. We noticed that most reuse happens within the scope of a single project (36% across all concepts) or across projects belonging to different industries (61%). Minimal reuse seems to have happened across projects within the same industry (3%). This could be because we did not have data for a significant number of projects per industry (our repository had data for 37 projects across 15 industries).
2 www-users.cs.umn.edu/~karypis/cluto/
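The clustering step can be approximated with a short script. The sketch below (ours; CLUTO's partitional criterion functions are replaced by a naive greedy grouping, and the example records and threshold are illustrative) builds TF*IDF term vectors and groups records whose pairwise cosine similarity stays above 0.5.

import math
import re
from collections import Counter

def tfidf_vectors(texts):
    docs = [Counter(re.findall(r"[a-z]+", t.lower())) for t in texts]
    df = Counter(term for d in docs for term in d)
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in d.items()} for d in docs]

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def greedy_clusters(vectors, isim=0.5):
    clusters = []                               # each cluster is a list of record indices
    for i, v in enumerate(vectors):
        for cluster in clusters:
            if all(cosine(v, vectors[j]) >= isim for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

records = ["create purchase order for a vendor",
           "create purchase order for a preferred vendor",
           "close the relevant invoice received activity"]
print(greedy_clusters(tfidf_vectors(records)))   # [[0, 1], [2]]: the first two records are reuse candidates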
Fig. 2. (a) Percentage of clusters of different sizes for each concept; (b) concept-wise distribution for cross-project, cross-industry information reuse
3.2 Study 2: Measuring Reuse in an Actual Project

Records can also be reused in the context of different business processes defined within a project. For example, the same business requirement may be served by more than one process, as a result of which the record is cited in multiple PDDs. We evaluated the degree of reuse of records of PDD concepts across 185 PDDs authored using a web-based structured authoring environment, called the Consultant's Assistant [13], in an actual IBM SAP client engagement. Table 2 shows how records of different concepts were reused across multiple processes; for each concept, it lists the number of available instances and the number (and percentage) of instances reused in more than one process. Maximum reuse was observed for requirements: 84% of requirements were shared across more than one variant. Overall, the data indicate that reuse of records across PDDs occurs and that the degree of reuse may vary substantially across concepts.

Table 2. Reuse metrics for some concepts in the studied project

  Concept Type                    Number of Instances   Reused Instances
  Business Benefit                 132                  36 (27%)
  Input                            381                  86 (22.5%)
  Process Steps                   1969                  91 (4.6%)
  Output                           253                  41 (16%)
  Organizational Change Impact     134                  65 (48.5%)
  Requirement                       31                  26 (84%)
These results provide evidence that there is potential for reuse to drive efficiency in process-consulting engagements, and motivate the need for more powerful search techniques, such as ours, that can realize the potential.
4 Contextual and Semantic Search

We propose a system that effectively serves the information needs of a process consultant by utilizing the context developed as part of the engagement. Consultants can leverage our technique, during any design activity, as the system automatically constructs search queries from the information captured in the authoring environment and executes these on a knowledge repository to fetch relevant results.
Problem Definition. We consider a process-consulting engagement as a series of design tasks, where each task is concerned with authoring a record for a particular concept in the information schema, Σ = {E, R}. We assume a partial order on the set of concepts, Eo = {C1, C2, . . . , Cn}, such that Cj ≤ Ci indicates that the information authoring for Cj precedes that for Ci. Our aim is to provide relevant search results while the consultant engages in the creation of a record of any concept Ci, by effectively utilizing the context captured in the records of each concept Cj such that Cj ≤ Ci.

4.1 Context Determination and Biasing

An information schema spells out the dependency between different pairs of concepts. It may be expected that similarity in two records of a concept can influence similarity in the corresponding records of a dependent concept. For example, if we find a set of highly similar requirements, then the corresponding processes implemented to meet them should be similar too. Formally,

    ra ∼ rb  =⇒  rx ∼ ry,  if ra → rx and rb → ry

However, different concepts may dictate the contents of a dependent concept to different degrees. For example, the desired business benefit may be a stronger driver for the choice of a process than the requirements. For effective utilization of the available context for search, one must therefore have an idea of how strongly similarity in records of a concept determines similarity in records of a dependent concept. To quantify this, we compute the following probability, Rel, for pairs of concepts Ci and Cj:

    Rel(Ci, Cj) = P(rx ∼ ry | ra ∼ rb, ra → rx, rb → ry)        (1)

where ra ∈ Ci, rb ∈ Ci, rx ∈ Cj and ry ∈ Cj. We empirically calculated the values of Rel for all related PDD concepts from the records in DS. Here, we assumed ri ∼ rj only if Cosine-Similarity(ri.δ, rj.δ) > 0.5. Table 3 shows Rel(X, Process) and Rel(Y, OCI) for some concepts on which the definitions of process and OCI depend. We use these values of Rel for the empirical validation of our approach in Section 5.

Use of Rel in Contextual Search: When a consultant authors a record rx of Cj, an already authored record ra of Ci can be effectively utilized as the search context if ra → rx and Rel(Ci, Cj) is high. Contextual search can then work as follows. First, we resolve a record set Rb = {rb : rb ∼ ra} from the knowledge repository. Then, we trace the record set Ry = {ry : rb → ry}. Now, if Rel(Ci, Cj) is high, it follows from Eqn. (1) that any record ry ∈ Ry is likely to be similar to the record rx being authored; hence ry can be considered a relevant search result. Further, suppose we conduct a similar search with a record ra′ of a concept Ck such that (Ck, Cj) ∈ R, and retrieve a set of search results Ry′. For a record ryc ∈ Ry ∩ Ry′, we then obtain two ranks from the two runs of search, and the final rank of ryc may be calculated by biasing the two ranks in proportion to Rel(Ci, Cj) and Rel(Ck, Cj).
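As a rough illustration (ours, not the system's implementation), Rel(Ci, Cj) can be estimated from Eqn. (1) by relative frequency over the data set: count pairs of similar Ci-records and check whether the Cj-records they link to are similar as well. The sketch assumes record objects shaped like those in the earlier sketch (with .concept and .links attributes) and a user-supplied similarity test.

from itertools import combinations

def rel(ci_records, concept_j, similar):
    # Estimate P(rx ~ ry | ra ~ rb, ra -> rx, rb -> ry) by relative frequency.
    hits = trials = 0
    for ra, rb in combinations(ci_records, 2):
        if not similar(ra, rb):
            continue
        xs = [r for r in ra.links if r.concept == concept_j]   # records rx with ra -> rx
        ys = [r for r in rb.links if r.concept == concept_j]   # records ry with rb -> ry
        for rx in xs:
            for ry in ys:
                trials += 1
                hits += similar(rx, ry)
    return hits / trials if trials else 0.0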
Table 3. Empirically obtained scores for Rel(X, Process) and Rel(Y, OCI)

  Rel(X, Process):   Requirement 0.7   Business Benefit 0.8
  Rel(Y, OCI):       Process 0.3       Process Steps 0.4
Fig. 3. Intuitive illustration of contextual search
Consider the scenario where, in an engagement, a business analyst has already listed the client's requirements (R) and documented the desired business benefits (B), and the consultant is documenting the process steps P to meet R and realize B. Contextual search can help the consultant here to search the repository for process steps. First, we look for similar requirements and business benefits in the repository. In the illustration (Figure 3), we find B ∼ B1, B ∼ B2 and R ∼ R1. It is expected that the process steps P1 and P2, which were created for the records R1, B1 and B2, will be more relevant than P3, because B3 and R3 (related to P3) are not as similar to the current context described by B and R. The relevance rank of P2 may then be calculated as a weighted average of the cosine similarity scores Sim(B, B2) and Sim(R, R2), the weights being Rel(Business-Benefit, Process-Step) and Rel(Requirement, Process-Step).

4.2 Architecture

Figure 4 shows the architecture of the proposed search system. Functionally, the system is decomposed into two parts: an indexing component and a search component. All our contributions can be built on top of a conventional Full-Text Search (FTS) engine (e.g., Apache Lucene) deployed in a knowledge repository.

Indexing Procedure: As mentioned in the background section, a typical work-product in process consulting engagements captures information on multiple concepts. Instead of adding full work-products as documents in a Full-Text Search engine, we build an index over records. For example, we split a PDD work-product into lists of records of requirements, process steps, business benefits, etc., and then index each individual record as a document. The content produced in a structured authoring environment (e.g., Consultant's Assistant [13]) is already in the form of records, hence it is readily indexable. If there is a need to bootstrap the repository with a corpus of legacy documents produced in formats like MS-Word, we can employ an information extractor (e.g., AHA [14]) to shred documents into records and then load them. Each document, prepared from a record ri, lists as fields the name of the concept Ci and links to the documents of records rj such that ri → rj. Documents are parsed using format-specific parsers, their content is stemmed [15], and finally an FTS index is built. As part of the overall indexing procedure, we also compute and store the values of Rel for each pair of related concepts in the schema.
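The indexing idea can be mocked up without a search engine. The following in-memory sketch (ours; a real deployment would sit on a full-text engine such as Apache Lucene, and the Jaccard score below is only a crude stand-in for an FTS relevance score) stores one indexed entry per record, keyed by concept, together with the links between records.

import re
from collections import defaultdict

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

class RecordIndex:
    def __init__(self):
        self.docs = defaultdict(dict)     # concept -> {record_id: token set}
        self.links = defaultdict(set)     # record_id -> ids of linked records

    def add(self, record_id, concept, text, linked_ids=()):
        self.docs[concept][record_id] = tokens(text)
        self.links[record_id].update(linked_ids)

    def search(self, concept, query_text, kappa=0.3):
        # Return (record_id, score) pairs of the given concept whose score exceeds kappa.
        q = tokens(query_text)
        scored = []
        for rid, toks in self.docs[concept].items():
            score = len(q & toks) / len(q | toks) if q | toks else 0.0
            if score > kappa:
                scored.append((rid, score))
        return sorted(scored, key=lambda x: -x[1])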
Fig. 4. The architecture of the search process
Search Procedure: Broadly, the process of searching for relevant records of a concept Ci, by leveraging the context described in records of related concepts, works as follows. First, the terms in every record of every related concept of Ci are extracted to constitute a search query. Such a query runs over the FTS index and returns similar records of the same concept from which the query was created. Next, the records of Ci that are linked to the search results are resolved. Finally, these records of Ci are suitably ranked and returned as the final search results.

Algorithm: Figure 5 details the steps of our algorithm. For a query designed to fetch records of concept Ci, SearchProcess considers the set of available records of related concepts as the search context, βCi, and accepts a set of user-defined keywords, qu, as optional input.

algorithm SearchProcess
input
  1. Concept under design, Ci
  2. Context, βCi = {r : (r.C, Ci) ∈ R, r.C ≤ Ci}
  3. (Optional) User keywords, qu
output
  Ranked set of 2-tuples (r, α) of record (r) and relevance-score (α), ψ
begin
 1. Q = formulateQuerySet(βCi, qu, Ci)
 2. // Q = {(τ, C) : τ is a term-vector, C ∈ E}
 3. Initialize target-set, T = φ
 4. foreach query qi ∈ Q do
 5.   Si = search(qi)
 6.   foreach s ∈ Si do
 7.     if s.α > κ then
 8.       Ts = fetchTargetSet(s.r, Ci)
 9.       T = T ∪ Ts
10. Initialize, ψ = φ
11. foreach record ri : ri = t.r, t ∈ T do
12.   αi = aggregateRank(ri, T, Ci)
13.   Add (ri, αi) to ψ
14. ψ = Sort ψ in descending order of αi
end

function search(Query q)
begin
15. S = {(ri, α)} s.t. ri is a record of q.C returned by an FTS on q.τ with a normalized score α > κ
16. return S
end

function formulateQuerySet(Records βCi, Terms qu, Concept Ci)
begin
17. Initialize, Q = φ
18. foreach record ri ∈ βCi do
19.   Create a new query, qi; qi.C = ri.C
20.   qi.τ = Terms extracted from ri.δ
21.   Add qi to Q
22. Add (qu, Ci) to Q
23. return Q
end

function fetchTargetSet(SearchResult s, Concept Ci)
begin
    // A target is a 3-tuple (r, α, C) containing record, score and query concept
24. T = {(ri, s.α, r.C) : s.r → ri}
25. return T
end

function aggregateRank(Record ri, Targets T, Concept Ci)
begin
26. Initialize, SumRank = 0, SumWt = 0
27. foreach target t ∈ T such that t.r = ri
28.   SumRank = SumRank + t.α × Rel(t.C, Ci)
29.   SumWt = SumWt + Rel(t.C, Ci)
30. return SumRank / SumWt
end
Fig. 5. The search algorithm
formulateQuerySet (lines 17–23) creates a set of queries, Q, from βCi. A query is represented as a 2-tuple (τ, C), where τ is a set of terms and C is the concept whose records shall be searched. In a query q derived from ri ∈ βCi, q.τ contains the terms in ri.δ and q.C is ri.C. For the keywords qu specified by the user, we create a query (line 22) with the terms in qu, and q.C is taken to be Ci. Next, each query q ∈ Q is fired on the FTS index (line 5) created for all records in the repository. Si denotes the set of top search results for q; it lists records of concept q.C along with their relevance scores α, such that α is greater than a threshold κ (line 15).

Now, we introduce the notion of a target. A target t contains a record ri of Ci, which is linked to a record r reported in a search result s ∈ Si, i.e., r → ri. Also, t stores the relevance score α of the search result s and the concept r.C (line 24). Finally, aggregateRank (lines 26–30) re-computes the rank of the target records collected across all search results of all queries q ∈ Q. The relevance score of a record present in a set of targets {ti} is re-adjusted as the weighted mean of the original relevance scores ti.α, the weights being the values of Rel(ti.C, Ci). The set of target records is sorted in descending order of their scores to return the final results.
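The rank aggregation in lines 26–30 amounts to a Rel-weighted mean. The sketch below (ours; the record ids, scores and Rel table are illustrative, with the Rel values borrowed from Table 3) aggregates targets collected from several query runs into a final ranking.

from collections import defaultdict

def aggregate_rank(targets, concept_i, rel):
    # targets: iterable of (record_id, relevance score, query concept) triples.
    sum_rank = defaultdict(float)
    sum_wt = defaultdict(float)
    for record_id, score, query_concept in targets:
        w = rel.get((query_concept, concept_i), 0.0)
        sum_rank[record_id] += score * w
        sum_wt[record_id] += w
    ranked = [(rid, sum_rank[rid] / sum_wt[rid]) for rid in sum_rank if sum_wt[rid] > 0]
    return sorted(ranked, key=lambda x: -x[1])

rel = {("Process", "OCI"): 0.3, ("Process Steps", "OCI"): 0.4}
targets = [("oci-7", 0.8, "Process"), ("oci-7", 0.6, "Process Steps"), ("oci-9", 0.7, "Process")]
print(aggregate_rank(targets, "OCI", rel))
# oci-7 scores (0.8*0.3 + 0.6*0.4) / (0.3 + 0.4); oci-9 scores 0.7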
5 Empirical Evaluation

We implemented our context-based search approach and conducted two studies. In the first study, we investigated the effectiveness of our approach and compared it with keyword-based search. In the second study, we evaluated the effect of context on the accuracy of the search results. After discussing the experimental setup, we present the results of the two studies.

5.1 Experimental Setup, Method, and Measures

We implemented our approach, described in Section 4, and keyword-based search. To perform the search step of both approaches, we leveraged Apache LUCENE3. LUCENE is a high-performance, full-featured indexing and search engine for text documents. We used the corpus of 2675 PDDs for the studies. To implement keyword search (the baseline technique against which to compare our approach), we indexed the corpus using LUCENE. To implement our approach, we first split each PDD into multiple “concept” documents, such that each document contained information about one concept only; then, we indexed each concept document, again using LUCENE.

To perform the evaluation, we required queries over the PDDs. We randomly selected 50 PDDs from our corpus, and formulated a query over each PDD to simulate the scenario where a consultant is documenting the organizational change impact (OCI) implied by the current process definition. Thus, in this scenario, the search task for the consultant is to find OCI information that is pertinent for the specific process variant being implemented. For the baseline approach, we used the variant name as the search criterion, and performed the search using LUCENE over the indexed PDDs. For our approach, we used the variant name along with the names of the process steps for the variant (the step names formed the context for the search) to create a fat semantic query. The variant name was used to search over the indexed variant concept documents, whereas the step

3 http://lucene.apache.org/
names were used to search over the indexed step concept documents, as described in Figure 5. For all cases, we consider a retrieved document to be a search result only if it has a relevance score (normalized on a scale of 1–10) greater than 4. Both LUCENE and our algorithm return a relevance score for every retrieved document.

To measure the quality of the search results and compare the effectiveness of the two approaches, we use a quality measure. For every query, we retrieve the text for the concept in question from the original document that was used to create the query; call it Ro. We denote the same for a keyword search result by Rk. The quality score qs for a keyword search result is then taken to be the cosine similarity of Rk and Ro; the quality score qs for a context-based search result is the cosine similarity of the returned record Rc and Ro. Finally, the average quality score Q for a ranked set of search results is calculated as a weighted average of the quality scores of the top 5 results, the weights being the relevance scores of the search results:

    Q = (Σi=1..5 αi × qsi) / (Σi=1..5 αi),

where αi is the relevance score of the i-th search result.
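A short sketch of the two measures (ours; the relevance and quality values below are made up for illustration): Q is the relevance-weighted mean of the quality scores of the top-5 results, and QN, defined in the next subsection, contrasts the two techniques on a query.

def avg_quality(results):
    # results: (relevance score, quality score) pairs for one query and one technique.
    top5 = sorted(results, key=lambda r: -r[0])[:5]
    den = sum(alpha for alpha, _ in top5)
    return sum(alpha * qs for alpha, qs in top5) / den if den else 0.0

def normalized_quality(q_context, q_keyword):
    m = max(q_context, q_keyword)
    return (q_context - q_keyword) / m if m else 0.0

q_c = avg_quality([(9.1, 0.82), (7.4, 0.61), (6.0, 0.55)])                 # context-based search
q_k = avg_quality([(8.2, 0.20), (7.9, 0.15), (5.5, 0.10), (5.1, 0.02)])    # keyword search
print(normalized_quality(q_c, q_k))   # positive: context-based search wins on this query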
5.2 Study 1: Effectiveness of Our Approach

The goal of the first study was to evaluate the effectiveness of our search technique in retrieving more accurate results than the keyword approach. We designed the study to investigate the following research questions.

– RQ1: What are the quality scores and sizes of the search results for the two search techniques? Does the original document, on which a query is formulated, appear in the search results?
– RQ2: Are the documents retrieved by context-based search a subset of the documents retrieved by keyword search? Are the documents excluded by our technique, but retrieved by keyword search, relevant documents that are missed by our approach?

To answer these questions, we executed the 50 queries using the context-based and keyword-based approaches. For each query, we calculated the normalized quality score QN = (QC − QK) / Max(QC, QK), where QC is the average quality score of the documents retrieved by context-based search, QK is the average quality score of the documents returned by keyword search, and Max(QC, QK) is the maximum of QC and QK. A negative value of QN indicates that the quality score of keyword search is better than the score of context-based search, whereas a positive value indicates that context-based search is superior.

Table 4 presents the distribution of QN into different ranges for the 50 queries. As the data show, for only 6 (12%) of the 50 queries did keyword search achieve a better quality score; for the remaining 44 queries, context-based search outperformed keyword search. Moreover, for 28 (64%) of the 44 queries, the QN value is at least 0.75. For three queries, keyword search returns no search results, whereas our approach returns results.

Figure 6(a) presents two box plots to illustrate the sizes of the search results computed by the two approaches; 50 data points, corresponding to the 50 queries, are plotted in each box. The vertical axis represents the number of retrieved documents. Each box
shows the middle 50% of the queries (in terms of the sizes of the result sets); the remaining 50%, with the exception of some outliers, are contained within the areas between the box and the whiskers. The line inside the box represents the median value; the location of the median line within the box illustrates the skewness of the distribution. From the plots, it is evident that the number of documents returned by our technique is far lower than the number of documents retrieved by keyword search: the minimum and maximum values for context search are zero and 18, respectively, whereas, for keyword search, these values are two and 53, respectively; the median value for our approach is five, whereas the median value for the keyword approach is 17. Result-set sizes of nine or less account for 75% of the queries for context search, whereas, to account for 75% of the queries for keyword search, result-set sizes up to 26 must be considered. On average, our technique retrieves six documents per query, whereas keyword search retrieves 19, more than three times as many documents as our technique.

Figure 6(b) presents the distribution of the average quality scores of the documents retrieved for the 50 queries by both approaches. The vertical axes represent the quality scores (note the difference in the scale of the two axes). Again, it is evident from the plots that the documents returned by our approach have higher quality scores than the documents returned by keyword search. The quality scores for the documents returned by keyword search range from 0 to 0.20, whereas the scores for our technique range from 0 to 7.

We manually verified whether the documents used for formulating the queries were returned in the ranked search results. For all 50 queries, both techniques retrieved the documents. However, our technique always ranked the original document the highest, whereas keyword search ranked it lower for seven queries. Overall, these results provide evidence to answer RQ1: for a large proportion of the queries, our technique outperformed keyword search by returning fewer documents that also had higher quality scores.
Table 4. Distribution of QN for the 50 queries. A positive QN value indicates that our approach outperforms keyword search, whereas a negative QN value indicates that keyword search performs better than our approach

  Normalized Quality Score Range (QN)   Number of Queries
  QN = −1                                1
  −1 < QN < 0                            5
  0 < QN < 0.25                          3
  0.25 ≤ QN < 0.5                        5
  0.5 ≤ QN < 0.75                        8
  0.75 ≤ QN < 1                         25
  QN = 1                                 3
Fig. 6. (a) Distribution of the number of documents retrieved by the two search techniques; (b) distribution of the average quality scores for the two techniques
Table 5. Percentage of search result overlap

  Search Result Overlap (SR)   No. of Queries
  SR = 0                       33
  0 < SR < 25                  10
  25 ≤ SR < 50                  5
  SR = 50                       2
Fig. 7. Distribution of average quality scores for the 50 queries performed using three versions of context, and by applying (a) keyword search and (b) context-based search
Table 5 presents data about RQ2: it shows the overlap among the documents returned by the two approaches. We measure the overlap percentage as SR = (|DC ∩ DK| / |DC|) × 100, where DC and DK are the sets of documents returned by context search and keyword search, respectively. As the data in the table illustrate, for 66% (33) of the queries there is no overlap between the retrieved documents, that is, SR = 0. The maximum value of SR is 50%, which occurs for two queries. Thus, because the maximum value of SR is less than 100%, for none of the queries are our search results a subset of the search results of keyword search.

Next, we investigated whether the documents omitted by our technique, but retrieved by keyword search, are relevant for the search queries (or serve only to make the search results noisy). To do this, we compared the quality scores of the documents in DC,uniq = (DC − DK) and DK,uniq = (DK − DC). Specifically, we looked for queries for which Max(DK,uniq) > Max(DC,uniq), where Max is a function that returns the maximum quality score in a given set of documents. For such a query, DK,uniq contains documents that are likely highly relevant for the query. In all, we found 10 such cases where DK,uniq contained at least one document whose quality score was greater than Max(DC,uniq). For the remaining 40 (80%) of the queries, our technique fetched higher-quality documents. Moreover, even for the 10 cases, Max(DK,uniq) was only marginally greater than Max(DC,uniq): the difference in the quality scores ranged from 0.01 to 0.6, with an average of 0.14. For the 40 queries, however, the difference between Max(DC,uniq) and Max(DK,uniq) was much greater, ranging from 0.01 to 9.64, with an average of 3.22. Thus, overall, the data indicate that the documents returned exclusively by keyword search more often than not contribute to noise, whereas the documents retrieved exclusively by our approach have higher quality scores.

5.3 Study 2: Effect of Context on Accuracy

The goal of the second study was to evaluate the effect of context on the accuracy of the computed results. In general, the addition of pertinent context to a query can improve the accuracy of the result; however, expanding the context beyond a point may decrease
the accuracy. Recall from the discussion in Section 4 that, in our approach, context is provided by concepts related to the concept being queried for.

– RQ3: Does the addition of context to the search query result in the retrieval of higher-quality search results?

We considered three versions of each query: the first using only the process variant as the context, the second using only the process steps as the context, and the final considering both variant and steps as the context. Thus, we executed each query three times (using both approaches) and calculated the quality scores. Figure 7 presents the data. As illustrated by the box plot on the left, the addition of context had barely any effect on the quality scores for keyword search: the three box plots are nearly the same, with only very minor differences (e.g., the median values for Step and Full Context are slightly larger than the median value for Variant). The data for our approach, however, are remarkably different, as shown in Figure 7(b). (Note again the difference in the vertical-axis scales of the two box plots.) By considering the full context, our technique attains significantly higher quality scores than when it considers either the process variant or the process steps alone. The maximum quality score for Full Context is 7, whereas the maximum scores for Variant and Step are 5.2 and 6, respectively (outliers not shown in the box plot). The spread of the third and fourth quartiles also increases significantly. Moreover, the data indicate that, for the queries considered, Step by itself provided better search context (resulting in higher scores) than Variant alone.

To summarize our observations on RQ3, we conclude that, for our queries and corpus of documents, the addition of context resulted in the retrieval of higher-quality results. However, as mentioned earlier, arbitrarily increasing the context of a query can have a negative effect on the quality of the search results. We leave it to future work to investigate this phenomenon empirically.
6 Related Work

Reuse of business process information (content, formal models, and implementation workflows/artifacts) has been an area of widespread interest. Dijkman et al. [19] present a survey of existing business process repositories. Each system prescribes a format to store business process information and, on top of it, provides query and navigation capabilities. RepoX [17] and Fragmento [16] allow storage of business process models as XML and BPMN, respectively, with free-text search and structured (database) search capabilities. ProcessGene [18] allows for storage of process information as a hierarchy of concepts, allowing users to find information of interest by filtering content at each level. In the MIT Process Handbook [11], the authors have developed a comprehensive taxonomy to capture organizational business process knowledge. They maintain a rich on-line library4 of data that follows this taxonomy and provide free-text and other guided search features, including navigation from generic to specialized versions of any business process. However, none of these repositories effectively supports broad topic search queries; the search criteria are manually provided, with no support for automatic extraction of search context and ranking of search results based on concepts.

4 http://ccs.mit.edu/ph/
Contextual search has been discussed in the domain of web search as an attractive way of reducing the burden on the user to come up with the right set of keywords. Watson [3] and IntelliZap [6] deliver relevant search results, accumulated from different web search engines, to users while they edit documents. Kraft et al. [10] contrast three approaches to contextual search—query rewriting, rank biasing, and iterative filtering meta-search. CodeBroker [20] collects context based on the code written by a developer, and uses the context to search for relevant code snippets that can be reused. However, the context in all these systems is simply a set of keywords; therefore, their querying mechanisms, unlike ours, are neither semantic in nature nor precise. X-Rank [7] and X-Search [4] focus on performing keyword search on semi-structured data (XML). Fagin et al. [5] create multiple possible semantic interpretations of a search query and use these to drive the search. These techniques are applicable when the user is not aware of the exact semantics of the underlying model. We can leverage this work to extend our approach for cases when we are unable to determine with certainty the individual concept types from the extracted context. The DebugAdvisor [1] tool enables search over a bug repository using fat semantic queries for which the context is created automatically as a combination of different structured and unstructured text fields in the bug description. The tool assists the developer in resolving a bug by enabling the developer to search the bug repository for similar bugs and their resolutions. A user study showed that in 35% of the cases, keyword search missed useful results returned by DebugAdvisor.
7 Summary and Future Work

We presented a novel context-based search technique for performing accurate searches over enterprise-wide knowledge repositories containing semi-structured data. Our empirical analysis of an existing knowledge repository provides evidence that there is potential for information reuse in real enterprise repositories. We provided a formal specification of context and described an approach for organizing the semi-structured content of process-related documents. Our empirical results suggest that context can increase the efficiency and effectiveness of search by reducing noise and increasing the quality of search results, when compared to existing keyword-based search.

In future work, we will investigate enterprise repositories in other domains, and study downstream documents, such as the RICEFW documents, which are transitively dependent on process information. We also intend to evaluate the right size of context (in terms of concepts) that needs to be used for effective search.

Acknowledgements. We thank Richard Goodwin, Pietro Mazzoleni, Rakesh Mohan, and Sugata Ghosal for their valuable inputs and many useful discussions on the topic.
References

1. Ashok, B., Joy, J., Liang, H., Rajamani, S., Srinivasa, G., Vangala, V.: DebugAdvisor: A Recommender System for Debugging. In: Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. 373–382 (2009)
2. Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems 30(1-7), 107–117 (1998)
3. Budzik, J., Hammond, K.: Watson: Anticipating and Contextualizing Information Needs. In: Proceedings of the Annual Meeting of the American Society for Information Science, vol. 36, pp. 727–740 (1999)
4. Cohen, S., Mamou, J., Kanza, Y., Sagiv, Y.: XSearch: A Semantic Search Engine for XML. In: Proceedings of the 29th International Conference on Very Large Databases, vol. 29, pp. 45–56 (2003)
5. Fagin, R., Kimelfeld, B., Li, Y., Raghavan, S., Vaithyanathan, S.: Understanding Queries in a Search Database System. In: Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 273–284 (2010)
6. Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E.: Placing Search in Context: The Concept Revisited. In: Proceedings of the 10th International Conference on World Wide Web, pp. 406–414 (2001)
7. Guo, L., Shao, F., Botev, C., Shanmugasundaram, J.: XRANK: Ranked Keyword Search Over XML Documents. In: ACM SIGMOD International Conference on Management of Data, pp. 16–27 (2003)
8. Hawking, D.: Challenges in Enterprise Search. In: Proceedings of the 15th Australasian Database Conference, vol. 27, pp. 15–24 (2004)
9. Kleinberg, J.: Authoritative Sources in a Hyperlinked Environment. Journal of the ACM (JACM) 46(5), 604–632 (1999)
10. Kraft, R., Chang, C., Maghoul, F., Kumar, R.: Searching with Context. In: Proceedings of the 15th International Conference on World Wide Web, pp. 477–486 (2006)
11. Malone, T., Crowston, K., Herman, G.: Organizing Business Knowledge: The MIT Process Handbook (2003)
12. Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval (2008)
13. Mazzoleni, P., Goh, S., Goodwin, R., Bhandar, M., Chen, S., Lee, J., Sinha, V., Mani, S., Mukherjee, D., Srivastava, B., et al.: Consultant Assistant: A Tool for Collaborative Requirements Gathering and Business Process Documentation. In: Proceedings of the 24th ACM SIGPLAN Conference Companion on Object Oriented Programming Systems Languages and Applications, pp. 807–808 (2009)
14. Mukherjee, D., Mani, S., Sinha, V., Ananthanarayanan, R., Srivastava, B., Dhoolia, P., Chowdhury, P.: AHA: Asset Harvester Assistant. In: IEEE International Conference on Services Computing, pp. 425–432 (2010)
15. Porter, M.: Porter Stemming Algorithm
16. Schumm, D., Karastoyanova, D., Leymann, F., Strauch, S.: Fragmento: Advanced Process Fragment Library. In: Proceedings of the 19th International Conference on Information Systems Development
17. Song, M., Miller, J., Arpinar, I.: RepoX: An XML Repository for Workflow Designs and Specifications. University of Georgia, USA (2001)
18. Wasser, A., Lincoln, M., Karni, R.: ProcessGene Query: A Tool for Querying the Content Layer of Business Process Models. In: Demo Session of the 4th International Conference on Business Process Management, pp. 1–8
19. Yan, Z., Dijkman, R., Grefen, P.: Business Process Model Repositories: Framework and Survey. Information Sciences (2009)
20. Ye, Y., Fischer, G.: Supporting Reuse by Delivering Task-Relevant and Personalized Information. In: Proceedings of the 24th International Conference on Software Engineering, pp. 513–523 (2002)
Clone Detection in Repositories of Business Process Models

Reina Uba1, Marlon Dumas1, Luciano García-Bañuelos1, and Marcello La Rosa2,3

1 University of Tartu, Estonia
{reinak,marlon.dumas,luciano.garcia}@ut.ee
2 Queensland University of Technology, Australia
[email protected]
3 NICTA Queensland Lab, Australia
Abstract. Over time, process model repositories tend to accumulate duplicate fragments (also called clones) as new process models are created or extended by copying and merging fragments from other models. This phenomenon calls for methods to detect clones in process models, so that these clones can be refactored as separate subprocesses in order to improve maintainability. This paper presents an indexing structure to support the fast detection of clones in large process model repositories. The proposed index is based on a novel combination of a method for process model decomposition (specifically the Refined Process Structure Tree), with established graph canonization and string matching techniques. Experiments show that the algorithm scales to repositories with hundreds of models. The experimental results also show that a significant number of non-trivial clones can be found in process model repositories taken from industrial practice.
1 Introduction

It is nowadays common for organizations engaged in long-term business process management programs to deal with collections of hundreds or even thousands of process models, with sizes ranging from dozens to hundreds of elements per model [15,14]. While highly valuable, such collections of process models raise a maintenance challenge [14]. This challenge is amplified by the fact that process models are generally maintained and used by stakeholders with varying skills, responsibilities and goals, sometimes distributed across independent organizational units.

One problem that arises as repositories grow is that of managing overlap across models. In particular, process model repositories tend to accumulate duplicate fragments over time, as new process models are created by copying and merging fragments from other models. Experiments conducted during this study show that a well-known process model repository contains several hundred clones. This situation is akin to that observed in source code repositories, where significant amounts of duplicate code fragments, known as code clones, are accumulated over time [10].

Cloned fragments in process models raise several issues. Firstly, clones make individual process models larger than they need to be, thus affecting their comprehensibility. Secondly, clones are modified independently, sometimes by different stakeholders, leading to unwanted inconsistencies across models. Finally, process model clones hide
potential efficiency gains. Indeed, by factoring out cloned fragments into separate subprocesses, and exposing these subprocesses as shared services, companies may reap the benefits of larger resource pools.

In this setting, this paper addresses the problem of retrieving all clones in a process model repository that can be refactored into shared subprocesses. Specifically, the contribution of the paper is an index structure, namely the RPSDAG, that provides operations for inserting and deleting models, as well as an operation for retrieving all clones in a repository that meet the following requirements:

– All retrieved clones must be single-entry, single-exit (SESE) fragments, since subprocesses are invoked according to a call-and-return semantics.
– All retrieved clones must be exact clones so that every occurrence can be replaced by an invocation to a single (shared) subprocess. While identifying approximate clones could be useful in some scenarios, approximate clones cannot be directly refactored into shared subprocesses, and thus fall outside the scope of this study.
– Maximality: once we have identified a clone, every SESE fragment strictly contained in this clone is also a clone, but we do not wish to return all such sub-clones.
– Retrieved clones must have at least two nodes (no “trivial” clones).

Identifying clones in a process model repository boils down to identifying fragments of a process model that are isomorphic to other fragments in the same or in another model. Hence, we need a method for decomposing a process model into fragments and a method for testing isomorphism between these fragments. Accordingly, the RPSDAG is built on top of two pillars: (i) a method for decomposing a process model into SESE fragments, namely the Refined Process Structure Tree (RPST) decomposition; and (ii) a method for calculating canonical codes for labeled graphs. These canonical codes reduce the problem of testing for graph isomorphism between a pair of graphs to a string equality check. Section 2 introduces these methods and discusses how they are used to address the problem at hand. Next, Section 3 describes the RPSDAG. Section 4 presents an evaluation of the RPSDAG using several process model repositories. Finally, Section 5 discusses related work while Section 6 draws conclusions.
2 Background

This section introduces the two basic ingredients of the proposed technique: the Refined Process Structure Tree (RPST) and the code-based graph indexing.

2.1 RPST

The RPST [18] is a parsing technique that takes as input a process model and computes a tree representing a hierarchy of SESE fragments. Each fragment corresponds to the subgraph induced by a set of edges. A SESE fragment in the tree contains all fragments at the lower level, but fragments at the same level are disjoint. As the partition is made in terms of edges, a single vertex may be shared by several fragments. Each SESE fragment in an RPST is of one of four types [13]. A trivial (T) fragment consists of a single edge. A polygon (P) fragment is a sequence of fragments. A bond corresponds to a fragment where all child fragments share a common pair of vertices. Any other fragment is a rigid.
Figure 1a presents a sample process model for which the RPST decomposition is shown in the form of dashed boxes. A simplified view of the RPST is shown in Fig. 2.
Fig. 1. Excerpt of two process models of an insurance company
We observe that the root fragment corresponds to polygon P14, which in turn contains rigid R13 and a set of trivial fragments (simple edges) that are not shown in order to simplify the figure. R13 is composed of polygons P12, P5, P4, and so forth. Polygons P4, P5, P7 and P10 have been highlighted to ease the comparison with the similar subgraphs found in the process model shown in Figure 1b. In particular, we note that polygon P4 occurs in both process models. This is an example of a clone that we wish to detect.

Fig. 2. RPST of process model in Fig. 1a

Although the models used as examples are encoded in BPMN, the proposed technique can be applied to other graph-based modeling notations. To achieve this notation-independence, we adopt the following graph-based representation of process models:
Definition 1 (Process Graph). A Process Graph is a labelled connected graph G = (V, E, l), where:
– V is the set of vertices.
– E ⊆ V × V is the set of directed edges (e.g. representing control-flow relations).
– l : V → Σ∗ is a labeling function that maps each vertex to a string over alphabet Σ. We distinguish the following special labels: l(v) = “start” and l(v) = “end” are reserved for start events and end events respectively; l(v) = “xor-split” is used for vertices representing xor-split (exclusive decision) gateways; similarly, l(v) = “xor-join”, l(v) = “and-split” and l(v) = “and-join” represent merge gateways, parallel split gateways and parallel join gateways respectively. For a task node t, l(t) is the label of the task.

This definition can be extended to capture other types of BPMN elements by introducing additional types of nodes (e.g. a type of node for inclusive gateways, data objects etc.). Organizational information such as lanes and pools can be captured by attaching dedicated attributes to each node (e.g. each node could have an attribute indicating the pool and lane to which it belongs). In this paper, we do not consider sub-processes, since each sub-process can be indexed separately for the purpose of clone identification.

2.2 Canonical Labeling of Graphs

Our approach for graph indexing is an adaptation of the approach proposed in [20]. The adaptations we make relate to two specificities of process models that differentiate them from the class of graphs considered in [20]: (i) process models are directed graphs; (ii) process models can be decomposed into an RPST. Following the method in [20], our starting point is a matrix representation of a process graph encoding the vertex adjacency and the vertex labels, as defined below.

Definition 2 (Augmented Adjacency Matrix of a Process Graph). Let G = (V, E, l) be a Process Graph, and v = (v1, . . . , v|V|) a total order over the elements of V. The adjacency matrix of G, denoted A, is a (0, 1)-matrix such that Ai,j = 1 if and only if (vi, vj) ∈ E, where i, j ∈ {1 . . . |V|}. Moreover, let us consider a function h : Σ∗ → N \ {0, 1} that maps each vertex label to a unique natural number greater than 1. The Augmented Adjacency Matrix M of G is defined as:

    M = diag(h(l(v1)), . . . , h(l(v|V|))) + A

Given the augmented adjacency matrix of a process graph (or a SESE fragment therein), we can compute a string (hereby called a code) by concatenating the contents of each cell in the matrix from left to right and from top to bottom. For illustration, consider graph G in Figure 3(a), which is an abstract view of fragment B3 of the running example (cf. Figure 1a). For convenience, next to each vertex we show the unique vertex identifier (e.g. v1), the corresponding label (e.g. l(v1) = “A”), and the numeric value associated with the label (e.g. h(l(v1)) = 2). Assuming the order v = (v1, v2, v3, v4) over the set of vertices, the matrix shown in Figure 3(b) is the adjacency matrix of G. Figure 3(c) is the diagonal matrix built from h(l(v)), whereas Figure 3(d) shows the augmented adjacency matrix M for graph G. It is now clear why 0 and 1 are not part of
the codomain of function h, i.e. to avoid clashes with the representation of vertex adjacency. Figure 3(e) shows a possible permutation of M when considering the alternative order v = (v1 , v4 , v2 , v3 ) over the set of vertices.
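The construction of Definition 2 is small enough to script. The sketch below (ours) builds the augmented adjacency matrix for a chosen vertex order and flattens it into a code; the vertex identifiers, label codes and edges are taken from the running example of Figure 3, so the printed code matches the one quoted in the text.

def augmented_matrix(order, edges, h):
    n = len(order)
    idx = {v: i for i, v in enumerate(order)}
    m = [[0] * n for _ in range(n)]
    for i, v in enumerate(order):
        m[i][i] = h[v]                  # diagonal: numeric label code h(l(v))
    for u, v in edges:
        m[idx[u]][idx[v]] = 1           # off-diagonal: adjacency
    return m

def code(matrix):
    return ".".join(str(c) for row in matrix for c in row)

h = {"v1": 2, "v2": 3, "v3": 4, "v4": 5}
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v3", "v4")]
print(code(augmented_matrix(["v1", "v2", "v3", "v4"], edges, h)))
# 2.1.1.0.0.3.0.1.0.0.4.1.0.0.0.5, the code of the matrix in Figure 3(d)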
Fig. 3. (a) a sample graph, (b) its adjacency matrix, (c) its diagonal matrix with the vertex label codes, (d) its augmented adjacency matrix, and (e) a permutation of the augmented matrix
Next, we transform the augmented adjacency matrix into a code by scanning from left to right and top to bottom. For instance, the matrix in Figure 3(d) can be represented as “2.1.1.0.0.3.0.1.0.0.4.1.0.0.0.5”. This code, however, does not uniquely represent the graph: if we chose a different ordering of the vertices, we would obtain a different code. To obtain a unique code (called a canonical code), we need to pick the code that lexicographically “precedes” all other codes that can be constructed from an augmented adjacency matrix representation of the graph. Conceptually, this means that we have to consider all possible permutations of the set of vertices and compute a code for each permutation, as captured in the following definition.

Definition 3 (Graph Canonical Code). Let G be a process graph and M the augmented adjacency matrix of G. The Graph Canonical Code is the smallest lexicographical string representation of any possible permutation of matrix M, that is:

    code(M) = str(P^T M P)  such that  P ∈ Π  ∧  ∀Q ∈ Π, P ≠ Q : str(P^T M P) ≺ str(Q^T M Q)
where:
– Π is the set of all possible permutations of the identity matrix I|M|;
– str(N) is a function that maps a matrix N into a string representation.

Consider the matrices in Figures 3(d) and 3(e). The code of the matrix in Figure 3(e) is “2.0.1.1.0.5.0.0.0.1.3.0.0.1.0.4”. This code is lexicographically smaller than the code of the matrix in Figure 3(d) (“2.1.1.0.0.3.0.1.0.0.4.1.0.0.0.5”). If we explored all vertex permutations and constructed the corresponding matrices, we would find that “2.0.1.1.0.5.0.0.0.1.3.0.0.1.0.4” is the canonical code of the graph in Figure 3(a).

Enumerating all vertex permutations is unscalable (factorial on the number of vertices). Fortunately, optimizations can be applied by leveraging the characteristics of the graphs at hand. Firstly, by exploiting the nature of the fragments in an RPST, we can apply the following optimizations:
– The code of a polygon is computed in linear time by concatenating the codes of its contained vertices in the (total) order implied by the control flow.
– The code of a bond is computed in linear time by taking the entry gateway as the first vertex, the exit gateway as the last vertex and all vertices in-between are ordered lexicographically based on their labels. Since these vertices in-between do
not have any control-flow dependencies among them, duplicate labels do not pose a problem. In the case of a rigid, we start by partitioning its vertices into two subsets: vertices with distinct labels, and vertices with duplicate labels. Vertices with distinct labels are deterministically ordered in lexicographic order. Hence we do not need to explore any permutations between these vertices. Instead, we can focus on deterministically ordering vertices with duplicate labels. Duplicate labels in process models arise in two cases: (i) there are multiple tasks (or events) with the same label; (ii) there are multiple gateways of the same type (e.g. multiple “xor-splits”) that cannot be distinguished from one another since gateways generally do not have labels. To distinguish between multiple gateways, we pre-process each gateway g by computing the tasks that immediately precede it and the tasks that immediately follow it within the same rigid fragment, and in doing so, we skip other gateways found between gateway g and each of its preceding (or succeeding) tasks. We then concatenate the labels of the preceding tasks (in lexicographic order) and the labels of the succeeding tasks (again in lexicographic order) to derive a new label sg . The sg labels derived in this way are used to order multiple gateways of the same type within the same rigid. Consider for example g1 and g2 in Figure 1b and let s1 = “Determine if invoice has already been paid”, s2=“Determine whether invoice received is for proof of ownership” and s3=“Determine whether Insurer authorized work”. We have that sg1 = “s1.s2” while sg2 = “s1.s3.s2”. Since s3 precedes s2, gateway g2 will always be placed before g1 when constructing an augmented adjacency matrix for R13. In other words, we do not need to explore any permutation where g1 comes before g2 . Even if task “Determine if invoice is duplicate” precedes g1 this is not used to compute sg1 because this task is outside rigid R13. To ensure unicity, vertices located outside a rigid should not be used to compute its canonical code. A similar approach is used to order tasks with duplicate labels within a rigid. This “label propagation” approach allows us to considerably reduce the number of permutations we need to explore. Indeed, we only need to explore permutations between multiple vertices if they have identical labels and they are preceded and followed by vertices that also have the same labels. The worst-case complexity for computing the canonical code is still factorial, but on the size of the largest group of vertices inside a rigid that have identical labels and identical predecessors and successors’ labels.
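A brute-force realization of Definition 3 (our sketch; only viable for tiny fragments, which is precisely why the RPST- and label-based optimizations above matter) enumerates all vertex orders, builds the code of each and keeps the lexicographically smallest one.

from itertools import permutations

def fragment_code(order, edges, h):
    # Flatten the augmented adjacency matrix induced by `order` into a code string.
    n = len(order)
    cells = []
    for i, u in enumerate(order):
        cells.extend(h[u] if j == i else (1 if (u, order[j]) in edges else 0) for j in range(n))
    return ".".join(map(str, cells))

def canonical_code(vertices, edges, h):
    return min(fragment_code(list(p), edges, h) for p in permutations(vertices))

h = {"v1": 2, "v2": 3, "v3": 4, "v4": 5}
edges = {("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v3", "v4")}
print(canonical_code(["v1", "v2", "v3", "v4"], edges, h))
# 2.0.1.1.0.5.0.0.0.1.3.0.0.1.0.4, the canonical code reported for the graph of Figure 3(a)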
3 Clone Detection Method

3.1 Index Structure and Clone Detection

The RPSDAG is a directed acyclic graph representing the union of the RPSTs of a collection of process models. The key feature of the RPSDAG is that each SESE fragment is represented only once. When a new process model is inserted into an RPSDAG, its RPST is computed and traversed bottom-up. For each fragment, it is checked if this fragment already exists in the RPSDAG. If it does, the fragment is not inserted, but instead, the existing fragment is reused. For instance, consider an RPSDAG constructed from the process model shown in Figure 1a. Imagine that the process model in Figure 1b is inserted into this RPSDAG, meaning that the fragments in its RPST are inserted into
the RPSDAG one by one from bottom to top. When the RPST node corresponding to polygon P4 is reached, we detect that this fragment already exists. Rather than inserting the fragment, an edge is created from its parent fragment (R21) to the existing node P4.

The RPSDAG is designed to be implemented on top of standard relational databases. Accordingly, the RPSDAG consists of three tables: Codes(Code, Id, Size, Type), Roots(GraphId, RootId) and RPSDAG(ParentId, ChildId). Table Codes contains the canonical code for each RPST fragment of an indexed graph. Column “Id” assigns a unique identifier to each indexed fragment, column Code gives the canonical code of a fragment, column Size is the number of vertices in the fragment, and column Type denotes the type of fragment (“p” for polygon, “r” for rigid and “b” for bond). The “Id” is auto-generated by incrementing a counter every time a new code is inserted into the Codes table. Strictly speaking, the “Id” is redundant given that the code uniquely identifies each fragment. However, the “Id” gives us a shorter way of uniquely identifying a fragment. For optimization reasons, when the canonical code of a new RPST node is constructed, we use the short identifiers of its child fragments as labels (in the diagonal of the augmented adjacency matrix), instead of using the child fragments' canonical codes, which are longer. For example, consider computing the canonical code for fragment P9 in Figure 1a, and let s1 = “Determine whether tax invoice is valid”. Instead of extracting the fragment B8, we include the fragment identifier in the code, resulting in the canonical code “s1.B8”.

Table Roots contains the Id of each indexed graph and the Id of the root fragment for that graph. Table RPSDAG is the main component of the index and is a combined representation of the RPSTs of all the graphs in the repository. Since multiple RPSTs may share fragments (these are precisely the clones we look for), the RPSDAG table represents a Directed Acyclic Graph (DAG) rather than a tree. Each tuple of this table is a pair consisting of a parent fragment id and a child fragment id. Figure 4 shows an extract of the tables representing the two graphs in Figure 1. Here, the fragment Ids have been derived from the fragment names in the Figure (e.g. the id for P4 is 4) and the codes have been hidden for the sake of readability.

Looking at the RPSDAG, we can immediately observe that the maximal clones are those child fragments that have more than one parent. Accordingly, we can retrieve all clones in an RPSDAG by means of the following SQL query.

SELECT RPSDAG.ChildId, Codes.Size, COUNT(RPSDAG.ParentId)
FROM RPSDAG, Codes
WHERE RPSDAG.ChildId = Codes.Id AND Codes.Size >= 2
GROUP BY RPSDAG.ChildId, Codes.Size
HAVING COUNT(RPSDAG.ParentId) >= 2;
The query retrieves the id, size and number of occurrences of fragments that have at least two parent fragments. Note that if a fragment appears multiple times in the indexed graphs, but always under the same parent fragment, it is not a maximal clone. For example, fragment P2 in Figure 1 always appears under B3, and B3 always appears under P4. Thus, neither P2 nor B3 is a maximal clone, and the above query does not retrieve them. On the other hand, the query identifies P4 as a maximal clone since it has two parents (cf. tuples (13,4) and (21,4) in table RPSDAG in Fig. 4). Also, as per the requirements spelled out in Section 1, the query returns only fragments with at least 2 vertices. For example, even if there are two tuples with ChildId 5 and 10 in the example RPSDAG, the query will not return clones P5 and P10 as they are single nodes.
The correctness and completeness of the RPSDAG for clone detection follow from the fact that the RPST decomposition separates the process graph into SESE subgraphs that are either disjoint or have a containment relationship. Thus, if two models share a common SESE fragment S, then fragment S will be indexed exactly once in the RPSDAG and its parents will be the smallest SESE fragments that contain S.

3.2 Insertion and Deletion

Algorithm 1 describes the procedure for inserting a new process graph into an indexed repository. Given an input graph, the algorithm first computes its RPST with function ComputeRPST(), which returns the RPST's root node. Next, procedure InsertFragment is invoked on the root node to update tables Codes and RPSDAG. This returns the id of the root fragment. Finally, a tuple is added to table Roots with the id of the inserted graph and that of its root node.

Algorithm 1. Insert Graph
procedure InsertGraph(Graph m)
  RPST root ⇐ ComputeRPST(m)
  rid ⇐ InsertFragment(root)
  Roots ⇐ Roots ∪ {(NewId(), rid)}   // NewId() generates a fresh id
Algorithm 2. Insert Fragment
procedure InsertFragment(RPST f) returns RPSDAGNodeId
  {RPST, RPSDAGNodeId} C ⇐ ∅
  foreach RPST child in GetChildren(f) do
    C ⇐ C ∪ {(child, InsertFragment(child))}
  code ⇐ ComputeCode(f, C)
  (id, type) = InsertNode(code, f)
  foreach (cf, cid) in C do
    RPSDAG ⇐ RPSDAG ∪ {(id, cid)}
  if type = "p" then
    InsertSubPolygons(id, code, f)
  return id
Procedure InsertFragment (Algorithm 2) inserts an RPST fragment in table RPSDAG. This algorithm performs a depth-first search traversal of the RPST. Nodes are visited in postorder, i.e. a node is visited after all its children have been visited. When a node is visited, its canonical code is computed – function ComputeCode – based on the topology of the RPST fragment and the codes of its children (except of course for leaf nodes, whose labels are their canonical codes). Next, procedure InsertNode is invoked to insert the node in table Codes, returning its id and type. This procedure (Algorithm 3) first checks if the node already exists in Codes, via function GetIdSizeType(). If it exists, GetIdSizeType() returns the id, size and type associated with that code; otherwise it returns the tuple (0, 0, ""). In this latter case, a fresh id is created for the node at hand, then its size and type are computed via functions ComputeSize and ComputeType, and a new tuple is added to table Codes. Function ComputeSize returns the sum of the sizes of all child nodes, or 1 if the current node is a leaf. Once the node id and type have been retrieved, procedure InsertFragment adds a new tuple to table RPSDAG for each child of the visited node.

Algorithm 3. Insert Node
procedure InsertNode(String code, RPST f) returns (RPSDAGNodeId, String)
  (id, size, type) ⇐ GetIdSizeType(code)
  if (id, size, type) = (0, 0, "") then
    id ⇐ NewId()
    size ⇐ ComputeSize(code)
    type ⇐ ComputeType(f)
    Codes ⇐ Codes ∪ {(code, id, size, type)}
  return (id, type)
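As an illustration of Algorithms 2 and 3, the following sketch replaces the database tables with in-memory dictionaries. The (type, label, children) tuple used for RPST nodes, the simplified bond/polygon codes, and the omission of rigids and of the sub-polygon step are all assumptions made to keep the example short; they are not the authors' implementation.

codes = {}        # canonical code -> (id, size, type)
codes_by_id = {}  # id -> (code, size, type)
rpsdag = set()    # (parent_id, child_id) edges
_next = [0]

def insert_fragment(node):
    """node = (type, label, children); returns the fragment id (Algorithm 2, minus sub-polygons)."""
    ftype, label, children = node
    child_ids = [insert_fragment(c) for c in children]
    if not children:                         # leaf: its label is its code
        code = label
    elif ftype == "b":                       # bond: children are unordered
        code = "b(" + ".".join(map(str, sorted(child_ids))) + ")"
    else:                                    # polygon (rigids are omitted in this sketch)
        code = ftype + "(" + ".".join(map(str, child_ids)) + ")"
    fid = insert_node(code, ftype, child_ids)
    rpsdag.update((fid, cid) for cid in child_ids)
    return fid

def insert_node(code, ftype, child_ids):
    """Algorithm 3: reuse the id of a known code, otherwise create a fresh entry."""
    if code in codes:
        return codes[code][0]
    size = 1 if not child_ids else sum(codes_by_id[c][1] for c in child_ids)
    _next[0] += 1
    fid = _next[0]
    codes[code] = (fid, size, ftype)
    codes_by_id[fid] = (code, size, ftype)
    return fid

Ids that appear as a child under two or more distinct parents in rpsdag then correspond, size filter aside, to the clones retrieved by the SQL query of Section 3.1.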
Algorithm 4. Insert SubPolygons
procedure InsertSubPolygons(RPSDAGNodeId id, String code, RPST f)
  foreach (zcode, zid, zsize, ztype) in Codes such that ztype = "p" and zid ≠ id do
    {String} LCS ⇐ ComputeLCS(code, zcode)
    foreach lcode in LCS do
      (lid, ltype) = InsertNode(lcode, f)
      if lid ≠ id then
        RPSDAG ⇐ RPSDAG ∪ {(id, lid)}
      if lid ≠ zid then
        RPSDAG ⇐ RPSDAG ∪ {(zid, lid)}
      foreach sid in GetChildrenIds(lid) do
        RPSDAG ⇐ RPSDAG \ {(id, sid), (zid, sid)} ∪ {(lid, sid)}
      InsertSubPolygons(lid, lcode)
If the node is a polygon, procedure InsertSubPolygons() is invoked in the last step of InsertFragment. This procedure is used to identify common subpolygons and factor them out as separate nodes. Indeed, two polygons may share one or more subpolygons which should also be identified as clones. To illustrate this scenario, let us consider the two polygons P1 and P2 in Fig. 5, where code(P1) = B2.a.B1.w.z.a.B1.c and code(P2) = a.B1.c.d.a.B1.w.z (for the sake of readability, here we use the node/fragment label as canonical code, instead of the numeric value associated with its label). These two polygons share bond B1 as common
child, while bond B2 is a child of P1 only. However, at a closer look, their canonical codes share three Longest Common Substrings (LCS), namely a.B1.c, a.B1 and w.z². These common substrings represent common subpolygons, and thus clones that may be refactored as separate subprocesses. Assume P1 is already stored in the RPSDAG with children B1 and B2 (first graph in Fig. 5) and we now want to store P2. We invoke procedure InsertFragment(P2) and add a new node to the RPSDAG, with B1 as a child (second graph in Fig. 5). Since P2 is a polygon, we also need to invoke InsertSubPolygons(P2). This procedure (Algorithm 4) retrieves all polygons from Codes that are different from P2 (in our case there is only P1). Then, for each such polygon, it computes all the non-trivial LCSs between its code and the code of the polygon being inserted. This is performed by function ComputeLCS(), which returns a list of LCSs ordered from the longest one (efficient algorithms exist for this task, e.g. using a suffix tree [17] to index the canonical codes of the existing polygons; see [6] for a survey of LCS algorithms). In the example at hand, the longest substring is a.B1.c. This substring can be seen as the canonical code of a "shared" subpolygon between P2 and P1. To capture the fact that this shared subpolygon is a clone, we insert a new node P3 in the RPSDAG with canonical code a.B1.c (unless such a node already exists, in which case we reuse the existing node) and we insert an edge from P2 to P3 and from P1 to P3. In order to avoid redundancy in the RPSDAG, the new node P3 then "adopts" all the common children that it shares with P2 and with P1. These nodes are retrieved by function GetChildrenIds(), which simply returns the ids of all child nodes of P3: these will either be children of P1 or P2 or both. In our example, there is only one common child, namely B1, which is adopted by P3 (see third graph in Fig. 5).
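ComputeLCS itself is not spelled out in the paper. The naive sketch below (a hypothetical helper named common_substrings) only illustrates the notion of non-trivial common substrings over dot-separated codes; a real implementation would use a suffix tree, as suggested via [17], and break ties between equal-length substrings by occurrence counts, as discussed further below.

def common_substrings(code1, code2):
    """All common substrings (length >= 2 tokens) of two dot-separated codes, longest first.
       Substrings contained in a longer reported one are dropped (cf. footnote 2)."""
    a, b = code1.split("."), code2.split(".")
    found = set()
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= 2:
                found.add(".".join(a[i:i + k]))
    maximal = [s for s in found if not any(s != t and s in t for t in found)]
    return sorted(maximal, key=len, reverse=True)

print(common_substrings("a.b.c.d.e", "x.b.c.d.y"))   # -> ['b.c.d'] (made-up codes)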
Fig. 5. Two subpolygons and the corresponding RPSDAG
² An LCS of strings S1 and S2 is a substring of both S1 and S2, which is not contained in any other substring shared by S1 and S2. LCSs of size one are not considered for obvious reasons.
It is possible that the newly inserted node also shares subpolygons with other nodes in the RPSDAG. So, before continuing to process the remaining LCSs between P1 and P2, we invoke procedure InsertSubPolygons recursively over P3. In our example, the code of P3 shares the substring a.B1 with both P2 and P1 (after excluding the code of P3 from the codes of P2 and P1, since P3 is already a child of these two nodes). Thus we create a new node P4 with code a.B1 as a child of P3 and P2, and we make P4 adopt the common child B1 (see fourth graph in Fig. 5). We repeat the same operation between P3 and P1, but since this time P4 already exists in the RPSDAG, we simply remove the edge between P1 and B1, and add an edge between P1 and P4 (fifth graph in Fig. 5). Then, we resume the execution of InsertSubPolygons(P2) and move to the second LCS between P1 and P2, i.e. a.B1. Since this substring has already been inserted into the RPSDAG as node P4, nothing is done. This process of searching for LCSs is repeated until no more non-trivial common substrings can be identified. In the example at hand, we also add subpolygon w.z between P1 and P2 (last graph in Fig. 5). At this point, we have identified and factored out all maximal subpolygons shared by P1 and P2, and we can repeat the above process for other polygons in the RPSDAG whose canonical code shares a common substring with that of P1.
Sometimes there may be multiple overlapping LCSs of the same size. For example, given the codes of two polygons a.b.c.d.e.f.a.b.c and a.b.c.k.b.c.d, if we extract one substring (say a.b.c), we can no longer extract the second one (b.c.d). In these cases we choose locally, based on the number of occurrences of an LCS within the two strings in question. If they have the same number of occurrences, we randomly choose one. In the example above, a.b.c has the same size as b.c.d but occurs 3 times, so we pick a.b.c. We do not explicitly discuss the cases where the polygon to be inserted is a subpolygon or superpolygon of an existing polygon. These are just special cases of shared subpolygon detection, and they are covered by procedure InsertSubPolygons.
Algorithm 5 shows the procedure for deleting a graph from an indexed repository. This procedure relies on another procedure for deleting a fragment, namely DeleteFragment (Algorithm 6). The DeleteFragment procedure performs a depth-first search traversal of the RPSDAG, visiting the nodes in post-order. Nodes with at most one parent are deleted, because they correspond to fragments that appear only in the deleted graph. Deleting a node entails deleting the corresponding tuple in table Codes and deleting all tuples in the RPSDAG table where the deleted node corresponds to the parent id. If a fragment has two or more parents, the traversal stops along that branch since this node and its descendants must remain in the RPSDAG. At the end of the DeleteGraph procedure, the graph itself is deleted from table Roots through its root id. If the node to be removed is a polygon that has only two parents, both of which are polygons, then it is a shared subpolygon. Before deleting such a node, we need to make sure all its children are adopted by the parent polygon that will remain in the RPSDAG. For space reasons, we omit this step in the deletion algorithm. Also, for space reasons, the above algorithms do not take into account the case where a fragment is contained multiple times in the same parent fragment, for example a polygon that contains two tasks with identical labels.
To address this case, we have to introduce an attribute “Weight” in the RPSDAG table, representing the number of times a child node occurs inside a parent node. Under this representation, a node is a clone if the sum of the weights of its incoming edges in the RPSDAG is greater than one.
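Under that weighted representation, the clone query of Section 3.1 changes accordingly. The snippet below continues the SQLite sketch given there; the ALTER TABLE statement and the default weight of 1 are assumptions about how the extension could be realized.

# Weighted variant: a child occurring twice under the same parent gets Weight = 2 on that edge.
conn.execute("ALTER TABLE RPSDAG ADD COLUMN Weight INTEGER DEFAULT 1")
weighted_clone_query = """
SELECT RPSDAG.ChildId, Codes.Size, SUM(RPSDAG.Weight)
FROM RPSDAG, Codes
WHERE RPSDAG.ChildId = Codes.Id AND Codes.Size >= 2
GROUP BY RPSDAG.ChildId, Codes.Size
HAVING SUM(RPSDAG.Weight) >= 2;
"""
print(conn.execute(weighted_clone_query).fetchall())   # same result as before while all weights are 1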
Complexity. The deletion algorithm performs a depth-first search, which is linear on the size of the graph being deleted. Similarly, the insertion algorithm traverses the inserted graph in linear time. Then, for each fragment to be inserted, it computes its code. Computing the code is linear on the size of the fragment for bonds and polygons, while for rigids it is factorial on the largest number of vertices inside the rigid that share identical labels, as discussed in Section 2.2 (a tighter complexity bound for this problem is given in [1]). Finally, if the fragment is a polygon, we compute all LCSs between this polygon and all other polygons in the RPSDAG. Using a suffix tree, this operation is linear on the sum of the lengths of all polygons' canonical codes.

Algorithm 5. Delete Graph
procedure DeleteGraph(GraphId mid)
  RPSDAGNodeId rid ⇐ GetRoot(mid)
  DeleteFragment(rid)
  Roots ⇐ Roots \ {(mid, rid)}
Algorithm 6. Delete Fragment
procedure DeleteFragment(RPSDAGNodeId fid)
  if |{(pid, cid) ∈ RPSDAG : cid = fid}| ≤ 1 then
    foreach (pid, cid) in RPSDAG where pid = fid do
      DeleteFragment(cid)
      RPSDAG ⇐ RPSDAG \ {(pid, cid)}
    Codes ⇐ {(code, id, size, type) ∈ Codes : id ≠ fid}
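Continuing the in-memory sketch of Section 3.2, Algorithm 6 can be rendered roughly as follows; the child-adoption step for shared subpolygons and the Roots bookkeeping of Algorithm 5 are omitted.

def delete_fragment(fid):
    """Remove a fragment only if at most one parent references it; shared fragments stay."""
    parents = [p for (p, c) in rpsdag if c == fid]
    if len(parents) <= 1:
        for (p, c) in [e for e in rpsdag if e[0] == fid]:
            delete_fragment(c)
            rpsdag.discard((p, c))
        if fid in codes_by_id:
            code, size, ftype = codes_by_id.pop(fid)
            codes.pop(code, None)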
4 Evaluation

We evaluated the RPSDAG using four datasets: the collection of SAP R3 reference process models [9], a model repository obtained from an insurance company under condition of anonymity, and two collections from the IBM BIT process library [5], namely collections A and B3. In the BIT process library there are 5 collections (A, B1, B2, B3 and C). We excluded collections B1 and B2 because they are earlier versions of B3, and collection C because it is a mix of models from different sources and as such it does not contain any clones. The SAP repository contains 595 models with sizes ranging from 5 to 119 nodes (average 22.28, median 17). The insurance repository contains 363 models ranging from 4 to 461 nodes (average 27.12, median 19). The BIT collection A contains 269 models ranging from 5 to 47 nodes (average 17.01, median 16), while collection B3 contains 247 models with 5 to 42 nodes (average 12.94, median 11).
Performance evaluation. We first evaluated the insertion times. Obviously, inserting a new model into a nearly-empty RPSDAG is less costly than doing so in an already populated one. To factor out this effect, we randomly split each dataset as follows: one third of the models were used to construct an initial RPSDAG and the other two-thirds were used to measure insertion times. In the SAP repository, 200 models were
used to construct an initial RPSDAG. Constructing the initial RPSDAG took 26.1s. In the insurance company repository, the initial RPSDAG contained 121 models and its construction took 7.4s. For the BIT collections A and B3, 90 and 82 models respectively were used for constructing the initial RPSDAG. Constructing the initial RPSDAGs took 4.2s for collection A and 3.5s for collection B3. All tests were conducted on a PC with a dual core Intel processor, 1.8 GHz, 4 GB memory, running Microsoft Windows 7 and Oracle Java Virtual Machine v1.6. The RPSDAG was implemented as a Java console application on top of MySQL 5.1. Each test was run 10 times and the execution times obtained across the ten runs were averaged.
Table 1 summarizes the insertion time per model for each collection (min, max, avg., std. dev., and 90th percentile). All values are in milliseconds. These statistics are given for two cases: without subpolygon clone detection and with subpolygon clone detection. We observe that the average insertion time per model is about 3-5 times larger when subpolygon clone detection is performed. This overhead comes from the step where we compare the inserted polygon with each already-indexed polygon and compute the longest common substrings of their canonical codes. Still, the average execution times remain in the order of tens of milliseconds across all model collections even with subpolygon detection. The highest average insertion time (125ms) is observed for the Insurance collection. This collection contains some models with large rigid components in their RPST. In particular, one model contained a rigid component in which 2 task labels appeared 9 times each – i.e. 9 tasks had one of these labels and 9 tasks had the other – and these tasks were all preceded by the same gateway. This label repetition affected the computation of the canonical code (4.4 seconds for this particular rigid). Putting aside this extreme case, all insertion times were under one second and in 90% of the cases (l90), the insertion times in this collection were under 50ms without subpolygon detection and 222ms with subpolygon detection. Thus we can conclude that the proposed technique scales up to real-sized model collections.

Table 1. Model insertion times (in ms)

             No subpolygon                       With subpolygon
       SAP   Insurance   BIT A   BIT B3    SAP   Insurance   BIT A   BIT B3
min      4           5       3        5      4          26      18        5
max     85        1722      58      467    482        4402     128      150
avg     20          32      12       14     97         126      41       33
std     14         113       9        9     81         291      20       24
l90     40          49      23       28    202         222      59       69
As explained in Section 3, once the models are inserted, we can find all clones with a SQL query. The query to find all clones with at least 2 nodes and 2 parents from the SAP reference models takes 75ms on average (90ms for the insurance models, 118 and 116ms for BIT collections A and B3). Refactoring gain. One of the main applications of clone detection is to refactor the identified clones as shared subprocesses in order to reduce the size of the model collection and increase its navigability. In order to assess the benefit of refactoring clones into subprocesses, we define the following measure.
Definition 4. The refactoring gain of a clone is the reduction in the number of nodes obtained by encapsulating that clone into a separate subprocess, and replacing every occurrence of the clone with a task that invokes this subprocess. Specifically, let S be the size of a clone, and N the number of occurrences of this clone. Since all occurrences of a clone are replaced by a single occurrence plus N subprocess invocations, the refactoring gain is: S · N − S − N. Given a collection of models, the total refactoring gain is the sum of the refactoring gains of the non-trivial clones (size ≥ 2) in the collection.
Table 2 summarizes the total refactoring gain for each model collection. The first two columns correspond to the total number of clones detected and the total refactoring gain without subpolygon refactoring. The third and fourth columns show the number of clones and the refactoring gain with subpolygon refactoring. The table shows that a significant number of clones can be found in all model collections, and that the size of these model collections could be reduced by 4-14% if clones were factored out into shared subprocesses. The table also shows that subpolygon clone detection adds significant value to the clone detection method. For instance, in the case of the insurance models, we obtain about 3 times more clones and 3 times more refactoring gain when subpolygon clone detection is performed.

Table 2. Total refactoring gain without and with subpolygon refactoring
              No subpolygon                         With subpolygon
              Nr. clones   Refactoring gain         Nr. clones   Refactoring gain
SAP                  204   1359 (10.3%)                    479   1834 (13.8%)
Insurance            107    394 (4%)                       279    883 (9%)
BIT A                 57    195 (4.3%)                     174    384 (8.4%)
BIT B3                19    208 (6.5%)                      49    259 (6.6%)
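A small helper makes Definition 4 concrete (purely illustrative):

def refactoring_gain(size, occurrences):
    """Nodes saved by replacing all occurrences of a clone with one shared subprocess
       plus one invocation task per occurrence (Definition 4)."""
    return size * occurrences - size - occurrences

print(refactoring_gain(5, 3))   # a clone of 5 nodes occurring 3 times saves 5*3 - 5 - 3 = 7 nodes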
Table 3 shows more detailed statistics of the clones found with subpolygon detection. The first three columns give statistics about the sizes of the clones found, the next three columns refer to the frequency (number of occurrences) of the clones, and the last three correspond to refactoring gain. We observe that while the average clone size is relatively small (3-5 nodes), there are some large clones with sizes of 30+ nodes. Table 3. Statistics of detected clones (with subpolygon detection)
                  Size                     # occurrences             Refactoring gain
            avg    max   std. dev.    avg    max   std. dev.     avg    max   std. dev.
SAP        4.79     41        4.05   2.33      8        0.77    3.83     44        5.68
Insurance  3.58     32        3.18   2.76     41        3.18    3.16     79        7.67
BIT A      3.02     16        2.08   2.75      9        1.29    2.21     15        3.12
BIT B3     3.18      9        1.58   3.82     20        3.70    5.29     37        8.87
5 Related Work Clone detection in software repositories has been an active field for several years. According to [3], approaches can be classified into: textual comparison, token comparison, metric comparison, abstract syntax tree (AST) comparison, and program dependence
graphs (PDG) comparison. The latter two categories are close to our problem, as they use a graph-based representation. In [2], the authors describe a method for clone detection based on ASTs. The method applies a hash function to subtrees of the AST in order to distribute subtrees across buckets. Subtrees in the same bucket are compared by testing for tree isomorphism. This work differs from ours in that RPSTs are not perfect trees. Instead, RPSTs contain rigid components that are irreducible and need to be treated as subgraphs—thus tree isomorphism is not directly applicable. [11] describes a technique for code clone detection using PDGs. A subgraph isomorphism algorithm is used for clone detection. In contrast, we employ canonical codes instead of pairwise subgraph isomorphism detection. Another difference is that we take advantage of the RPST in order to decompose the process graph into SESE fragments.
Work on clone detection has also been undertaken in the field of model-driven engineering. [4] describes a method for detecting clones in large repositories of Simulink/TargetLink models from the automotive industry. Models are partitioned into connected components and compared pairwise using a heuristic subgraph matching algorithm. Again, the main difference with our work is that we use canonical codes instead of subgraph isomorphism detection. In [12], the authors describe two methods for exact and approximate matching of clones for Simulink models. In the first method, they apply an incremental, heuristic subgraph matching algorithm. In the second approach, graphs are represented by a set of vectors built from graph features: e.g. path lengths, vertex in/out degrees, etc. An empirical study shows that this feature-based approximate matching approach improves pre-processing and running times, while keeping a high precision. However, this data structure does not support incremental insertions/deletions.
Our work is also related to graph database indexing and is inspired by [20]. Other related indexing techniques for graph databases include GraphGrep [16] – an index designed to retrieve paths in a graph that match a regular expression. This problem, however, is different from that of clone detection, since clones are not paths (except for polygons). Another related index is the closure-tree [7]. Given a graph G, the closure-tree can be used to retrieve all indexed graphs in which G occurs as a subgraph. We could use the closure-tree to index a collection of process graphs so that when a new graph is inserted we can check if any of its SESE regions appears in an already indexed graph. However, the closure-tree does not directly retrieve the exact set of graphs where a given subgraph occurs. Instead, it retrieves a "candidate set" of graphs. An exact subgraph isomorphism test is then performed against each graph in the candidate set. In contrast, by storing the canonical code of each SESE region, the RPSDAG obviates the need for this subgraph isomorphism testing. In [8], we described an index to retrieve process models in a repository that exactly or approximately match a given model fragment. In this approach, paths in the process models are used as index features. Given a collection of models, a B+ tree is used to reduce the search space by discarding those models that do not contain any path of the query model. The remaining models are tested for subgraph isomorphism. In [19], eleven process model refactoring techniques are identified and evaluated.
Extracting process fragments as subprocesses is one of the techniques identified. Our work addresses the problem of identifying opportunities for such “fragment extraction”.
6 Conclusion

We presented a technique to index process models in order to identify duplicate SESE fragments (clones) that can be refactored into shared subprocesses. The proposed index, namely the RPSDAG, combines a method for decomposing process models into SESE fragments (the RPST decomposition) with a method for generating a unique string from a labeled graph (canonical codes). These canonical codes are used to determine whether a SESE fragment in a model appears elsewhere in the same or in another model. The RPSDAG has been implemented and tested using process model repositories from industrial practice. In addition to demonstrating the scalability of the RPSDAG, the experimental results show that a significant number of non-trivial clones can be found in industrial process model repositories. In one repository, ca. 480 non-trivial clones were found. By refactoring these clones, the overall size of the repository is reduced by close to 14%, which arguably would enhance the repository's maintainability. A standalone release of the RPSDAG implementation, together with sample models, is available at: http://apromore.org/tools.
The current RPSDAG implementation can be further optimized in three ways: (i) by using suffix trees for identifying the longest common substrings between the code of an inserted polygon and those of already-indexed polygons; (ii) by storing the canonical codes of each fragment in a hash index in order to speed up retrieval; (iii) by using the Nauty library for computing canonical codes (http://cs.anu.edu.au/~bdm/nauty). Nauty implements several optimizations that could complement those described in Section 2.2. Another avenue for future work is to extend the proposed technique in order to identify approximate clones. This has applications in the context of process standardization, when analysts seek to identify similar but non-identical fragments and to replace them with standardized fragments in order to increase the homogeneity of work practices.
Acknowledgments. This research is partly funded by the Estonian Science Foundation, the European Social Fund via the Estonian ICT Doctoral School and ERDF via the Estonian Centre of Excellence in Computer Science (Uba, Dumas and García-Bañuelos), and by the NICTA Queensland Lab (La Rosa).
References
1. Babai, L.: Monte Carlo algorithms in graph isomorphism testing. Technical Report D.M.S. No. 79-10, Universite de Montreal (1979)
2. Baxter, I.D., Yahin, A., Moura, L., Sant'Anna, M., Bier, L.: Clone Detection Using Abstract Syntax Trees. In: The Int. Conf. on Software Maintenance, pp. 368–377 (1998)
3. Bellon, S., Koschke, R., Antoniol, G., Krinke, J., Merlo, E.: Comparison and Evaluation of Clone Detection Tools. IEEE Trans. on Software Engineering 33(9), 577–591 (2007)
4. Deissenboeck, F., Hummel, B., Jürgens, E., Schätz, B., Wagner, S., Girard, J.-F., Teuchert, S.: Clone Detection in Automotive Model-based Development. In: ICSE (2008)
5. Fahland, D., Favre, C., Jobstmann, B., Koehler, J., Lohmann, N., Völzer, H., Wolf, K.: Instantaneous soundness checking of industrial business process models. In: Dayal, U., Eder, J., Koehler, J., Reijers, H.A. (eds.) BPM 2009. LNCS, vol. 5701, pp. 278–293. Springer, Heidelberg (2009)
6. Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
7. He, H., Singh, A.K.: Closure-Tree: An Index Structure for Graph Queries. In: The 22nd Int. Conf. on Data Engineering. IEEE Computer Society, Los Alamitos (2006)
8. Jin, T., Wang, J., Wu, N., La Rosa, M., ter Hofstede, A.H.M.: Efficient and Accurate Retrieval of Business Process Models through Indexing. In: Meersman, R., Dillon, T.S., Herrero, P. (eds.) OTM 2010. LNCS, vol. 6426, pp. 402–409. Springer, Heidelberg (2010)
9. Keller, G., Teufel, T.: SAP R/3 Process Oriented Implementation: Iterative Process Prototyping. Addison-Wesley, Reading (1998)
10. Koschke, R.: Identifying and Removing Software Clones. In: Mens, T., Demeyer, S. (eds.) Software Evolution. Springer, Heidelberg (2008)
11. Krinke, J.: Identifying Similar Code with Program Dependence Graphs. In: WCRE (2001)
12. Pham, N.H., Nguyen, H.A., Nguyen, T.T., Al-Kofahi, J.M., Nguyen, T.N.: Complete and Accurate Clone Detection in Graph-based Models. In: ICSE, pp. 276–286. IEEE, Los Alamitos (2009)
13. Polyvyanyy, A., Vanhatalo, J., Völzer, H.: Simplified Computation and Generalization of the Refined Process Structure Tree. In: WSFM (2010)
14. Reijers, H.A., Mans, R.S., van der Toorn, R.A.: Improved Model Management with Aggregated Business Process Models. Data Knowl. Eng. 68(2), 221–243 (2009)
15. Rosemann, M.: Potential pitfalls of process modeling: Part a. Business Process Management Journal 12(2), 249–254 (2006)
16. Shasha, D., Wang, J.T.-L., Giugno, R.: Algorithmics and Applications of Tree and Graph Searching. In: PODS, pp. 39–52 (2002)
17. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
18. Vanhatalo, J., Völzer, H., Koehler, J.: The Refined Process Structure Tree. Data Knowl. Eng. 68(9), 793–818 (2009)
19. Weber, B., Reichert, M., Mendling, J., Reijers, H.A.: Refactoring large process model repositories. Computers in Industry 62(5), 467–486 (2011)
20. Williams, D.W., Huan, J., Wang, W.: Graph database indexing using structured graph decomposition. In: ICDE, pp. 976–985 (2007)
Business Artifact-Centric Modeling for Real-Time Performance Monitoring

Rong Liu¹, Roman Vaculín¹, Zhe Shan², Anil Nigam¹, and Frederick Wu¹

¹ IBM T.J. Watson Research Center, Hawthorne, NY 10532, USA
  {rliu,vaculin,anigam,fywu}@us.ibm.com
² Department of Supply Chain and Information Systems, Smeal College of Business, Penn State University, University Park, PA 16802, USA
  [email protected]

Abstract. In the activity-centric process paradigm, developing effective and efficient performance models is a hard and laborious problem with many challenges, mainly because of the fragmented nature of this paradigm. In this paper, we propose a novel approach to performance monitoring based on the business artifact-centric process paradigm. Business artifacts provide an appropriate base for explicit modeling of monitoring contexts. We develop a model-driven two-phase methodology for designing real-time monitoring models. This methodology allows domain experts or business users to focus on defining metric and KPI requirements, while the detailed technical specification of monitoring models can be automatically generated from the requirements and the underlying business artifacts. This approach dramatically simplifies the design of monitoring models and also increases the understandability of monitoring results.
1 Introduction

Globalization and conglomeration of large enterprises have led to greatly increased business process complexity. These trends have exacerbated the problem of achieving clear visibility over business operations, and demand more effective methods of real-time performance monitoring. The vision of real-time (on-demand) enterprises [1] advocates providing all stakeholders with visibility into running business processes. Similarly, real-time business intelligence [2], which is the practice of delivering real-time business operations information with the goal of enabling corrective actions, adjustments, and optimizations to processes, critically relies on performance monitoring. Despite a significant body of work in this area [3,4,5,6], building a real-time performance model and implementation for any business scenario remains a high-skill, labor-intensive exercise [7]. A major reason for this is the difficulty of fully defining the business monitoring requirements and transforming them into executable real-time monitoring models, without exposing the technical details needed for run-time execution.
Most approaches to performance monitoring are based on events (often from multiple sources) that are collected, correlated, and processed based on predefined rules or patterns [8]. To make business decisions based on the monitored events, the events need to be interpreted and analyzed in conjunction with each other and with related information, often known as a monitoring context. Although a monitoring context is frequently
not explicitly modeled, we do choose to model the monitoring context explicitly [9,10] as part of the monitoring model. Conceptually, a monitoring context is a construct that aggregates all data elements that are relevant to real-time computation of business metrics. Defining and identifying a monitoring context is difficult for the following reasons: (1) Business processes may be very complex and distributed, crossing multiple organizational boundaries. (2) Applications employed in a business process often do not exchange information and are developed and maintained by independent vendors. (3) Business processes are typically modeled with activity-centric workflow methods [11], which commonly exhibit excessively fine granularity. As a result of these factors, the design of monitoring contexts and performance measures usually requires technical experts, rather than business domain experts. This leads to misunderstandings and inaccuracies in the models, and long cycle times for modeling and deployment [7]. The fundamental reason for these difficulties is the separation between the design of monitoring models and the design of process models.
To address these problems, we propose a novel real-time performance monitoring model based on the business artifact-centric process modeling paradigm [12,13,14]. The business artifact-centric approach models both information and processes in one unified framework (elevating data to a first-class citizen). The main focus is on identifying and defining the key business artifacts (BA) (for example, a purchase order in an ordering system) of a business process and modeling the business process as the interactions of these key business artifacts. Since the BAs capture key business data in their information models, they are a natural basis for defining monitoring contexts, which serve as the container for performance measures and key performance indicators (KPIs). This approach simplifies the definition of monitoring models and makes it a business modeling activity, and thus more suitable for business domain experts. There is a corresponding improvement in the agility with which the monitoring system can be modified in response to business requirements, since fewer technical skills are required.
The paper makes the following technical contributions. We formally define a real-time monitoring model employing monitoring contexts based on the business artifact-centric paradigm. We propose a top-down model-driven methodology for the definition of monitoring models and monitoring contexts which hides most of the technical details from the designer. Thus, the designer primarily focuses on the definition of KPIs and performance metrics, mostly by means of the elements of business artifacts. Next we propose an algorithm that automatically transforms the monitoring context skeletons defined by a model designer into a complete executable monitoring model. The generated model is optimized for execution in distributed environments by ensuring that only necessary events with minimal payload are communicated between the execution engine and the monitoring infrastructure.
The paper is organized as follows. Section 2 introduces business artifacts and an example with which we illustrate our approach. In Sections 3 and 4, we formally define the monitoring model and propose a methodology for the definition of monitoring contexts. Section 5 describes the implementation and Section 6 presents a discussion and evaluation of our approach. Section 7 reviews the related work, and Section 8 is a summary.
2 Preliminaries: Business Artifacts and the Restaurant Example

The business artifact-centric approach [12] considers information as an essential part of business processes, and as such it defines the business processes in terms of interacting business artifacts. Each business artifact (BA) is characterized by an information model (data) and a lifecycle model (the possible evolutions over time, i.e., activities/tasks acting on the data). Similar to other paradigms, it is useful to distinguish the notions of BA types and BA instances. Intuitively, a BA instance of a particular type is a complex data entity which is identified by a unique identifier, and which can evolve over time, i.e., some values of its attributes may be modified. The particular evolution of the artifact instance over time is governed by the lifecycle of the BA.
Example 1 (Restaurant example). We will illustrate most of the concepts in the paper using a simple restaurant example (adapted from [12]). The example is intended to be illustrative; it models a visit of a guest to a full-service restaurant. During analysis, several business artifact types have been identified. Figure 1 shows the information models of some artifact types and a sample artifact instance. A GuestCheck artifact type (or GC for short) represents the basic information about the customer visit. A KitchenOrder (or KO for short) contains details of the items that need to be prepared. One GC instance can lead to the creation of several KO instances, as a guest may order many times. There are some other artifact types in the scenario, such as a Chef and a Waiter artifact, which are not discussed in this paper. As will be seen in the following paragraphs, the overall sales process can be described by capturing the lifecycle of the introduced business artifacts.
The lifecycle of a business artifact is typically represented as a finite state machine. We distinguish the notions of tasks and activities. A task represents a logical unit of work. A task may be a human task, an invocation task (of an activity or an external service), or an assignment task to/from attributes of the information model. An activity is a named access point that can be invoked to perform some work, and that may specify its operations as flows of tasks (represented as a directed graph).
Example 2 (Lifecycles of restaurant artifacts). Figure 1 shows a fragment of the restaurant process model with lifecycles of GuestCheck (GC) and KitchenOrder (KO)
Fig. 1. Restaurant example process model represented as Business Artifact types with lifecycles
artifact types. States are shown as ovals, activities as rectangles, transitions are associated with triggering activities, and invocations are represented as dashed lines. When a customer arrives, the greeter creates a new instance of the GC artifact by using the CreateGuestCheck activity, and records the table number and the responsible waiter on the GC (activity RequestResource). After the customer has made selections, the waiter records them on the GC and places an order with the kitchen (the OrderItems activity creates a new KO artifact instance by invoking the CreateKitchenOrder activity of KO). The lifecycle of KO continues with assigning the chef to the order and delivering the food. Eventually, when the customer requests the check, the GC is totaled and paid (the CheckOut activity of the GC).
3 Real-Time Performance Monitoring Model

Like many other businesses, restaurants need to monitor time-sensitive metrics to measure how the business delivers value to guests on a minute-by-minute basis [15]. The restaurant in our example aims to improve daily sales through monitoring a collection of performance metrics in real time. First, the restaurant tracks active guest checks to anticipate key moments in its service lifecycle (e.g. drink service, food order, or after-dinner service) and then proactively satisfy customer needs. Second, the restaurant closely monitors current sales for each menu group and the strongest item in each group, so that it is able to adjust menu items daily based on current sales and customer preferences. Table turn-time, or the average time a guest spends in the restaurant, is an important metric to measure whether the restaurant operates at a speed appropriate to its capacity. This metric usually varies by the time of day (i.e. breakfast, lunch, dinner) and is calculated on an hourly basis. Finally, the restaurant analyzes daily sales to evaluate the effectiveness of promotion strategies and detect sales trends. These metrics are displayed in the dashboard shown in Figure 2. For example, the Active Guest Checks table indicates that the first two customers may be about to check out and these two tables will be available soon, since dessert has been ordered. The Table Turn-Time chart compares the current turnover rate (blue line with diamonds) with a target value. Obviously, the table turnover rate around dinner time is higher than the target value, 55 minutes. The restaurant needs to analyze the cause of this variance and detect any potential operational problems.
To create such an insightful dashboard, real-time data, usually in the form of events, needs to be collected and analyzed. Our overall monitoring framework is inspired by the approach of Business Activity Monitoring (BAM) systems [9], according to which a general performance monitoring architecture contains (1) an event absorption layer to collect events and relevant data, (2) an event processing and filtering layer to analyze events using event rules or event patterns [8], to calculate metrics and to generate outgoing events, and (3) an event delivery and display layer to dispatch outgoing events or represent metrics in a dashboard. These three layers are shown in Figure 2. A monitoring context – originally introduced in WebSphere Business Monitor [16] (a BAM tool) – is a data structure for aggregating all information relevant to monitored business metrics, and also a technical mechanism for processing events to retrieve this information. It embraces both the event absorption layer and the event processing and filtering layer. For instance, to create the dashboard in Figure 2, a monitoring context can be centered around GuestCheck with relevant information such as GuestCheck ID, Food
[Figure 2 (dashboard and monitoring architecture): events enter through the event absorption layer and are processed in the event processing and filtering layer within a monitoring context (Guest Check ID, Checkout Time, Waiting Time, Food Sales, Coupon Used) backed by a monitoring database; the dashboard panels show Active Guest Checks, Sales by Menu Group, Table Turn-Time, Strongest Item in Group, and Sales Analysis.]

Fig. 2. Performance Monitoring Architecture
Sales etc. Here we give definitions of the key elements of our monitoring approach, including events, metrics, monitoring contexts, and KPIs.
Definition 1 (Event Type). An event type v is a tuple v = ⟨header, payload⟩, where header and payload are data schemas. We use V to denote the set of all possible event types.
Event occurrences (or simply events) must always conform to some event type. The event header is supposed to carry domain-independent information (such as the identifier of an event, timestamp, event type name, etc.), while the event payload is intended to carry domain/application-specific information (e.g., the name of a restaurant customer).
Definition 2 (Metric). A metric m is a tuple m = ⟨name, value : typem, fm⟩, where name is a metric name, value is a value of the metric of type typem (a simple type), and fm is a function fm : t1 × · · · × tn → typem for computing the value of m, with t1, . . ., tn being simple types.
A metric is a domain-specific function (calculation), designed by domain experts, that measures some relevant aspect of performance. Given a vector of simple values (data attributes, values of other metrics, values carried by events), it calculates one value.
In our model, metrics are defined in the scope of monitoring contexts. We distinguish monitoring context types and instances. Intuitively, a monitoring context type defines a set of metrics and their computation details, as well as inbound/outbound events and rules for processing events that update the metrics of the context.
Definition 3 (Monitoring Context). A monitoring context type mc is a tuple mc = ⟨K, Vin, ffilter, fcr, M, fop, ftrigger, Vout, fout⟩, and a monitoring context instance is an instance of mc with a unique identifier, where
– K is a set of keys (key attributes) that together uniquely identify instances of mc;
– Vin is a set of inbound event types which are used to update metrics;
– ffilter : Vin → {true, false} is an optional filtering function (condition) that determines whether particular events in Vin should be accepted to impact mc;
– fcr : Vin → K is a correlation function that specifies which monitoring context instances of mc should process an accepted inbound event;
– M is a set of metrics; ∀m ∈ M, its value m.value is computed (function fm) from the inbound events Vin of mc and from the other metrics (of mc and possibly of other monitoring contexts);
– fop : Vin → {create, terminate, null} is a function controlling creation and termination of contexts, i.e., if an inbound event is mapped to create, a new instance of mc will be created; if it is mapped to terminate, the monitoring context correlated with this event will be terminated;
– ftrigger : M → P(Vin) is a triggering function that determines which inbound events can update a metric (here P denotes a power set);
– Vout is a set of outbound event types which can be generated by mc to signal situations requiring business attention;
– fout : M → Vout is a mapping that determines when and what outbound events should be raised given some metric.
Notice that the metrics defined in the scope of a monitoring context are called instance-based metrics, by which we mean that their values are measured specifically for a monitoring context instance (an example of such a measure might be the total amount of a guest check). On top of that, measures evaluated across multiple instances of a monitoring context type are often of interest. Such measures are defined as Key Performance Indicators (KPIs). A KPI can also be analyzed in hierarchical dimensions. For instance, the sales of a company may be analyzed by geographic area or line of business. In Figure 2, the KPI average table turn-time is displayed with respect to the hour of the day.
Definition 4 (Key Performance Indicator). A KPI is a tuple kpi = ⟨name, value : typekpi, mc, fkpi, D⟩, where name is a kpi name, value is a value of the kpi of type typekpi, mc is a monitoring context type across which the kpi is evaluated, fkpi : mc → typekpi is an aggregation function for computing the value of kpi operating on the metrics of mc, and D is a list of dimensions, each of which is defined by a metric of mc.
A monitoring model consists of a set of KPIs and the monitoring contexts supporting the calculation of the KPIs.
Definition 5 (Monitoring Model). A monitoring model mm = ⟨KPI, MC⟩, where KPI is a set of target KPIs, and MC is the set of monitoring context types used in the definitions of the target KPIs.
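Definitions 2–5 translate naturally into data structures. The sketch below uses Python dataclasses with illustrative field names; it is only meant to fix the shape of the model, not to reproduce the authors' implementation (which builds on a BAM infrastructure, cf. [16]).

from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set

@dataclass
class Metric:                                      # Definition 2
    name: str
    compute: Callable[..., Any]                    # f_m
    value: Any = None

@dataclass
class MonitoringContextType:                       # Definition 3 (illustrative field names)
    keys: List[str]                                # K
    inbound_events: Set[str]                       # V_in, by event type name
    correlate: Callable[[dict], Any]               # f_cr: event -> context key
    metrics: Dict[str, Metric]                     # M
    triggers: Dict[str, Set[str]]                  # f_trigger: metric name -> event type names
    op: Callable[[dict], str]                      # f_op: "create" | "terminate" | "null"
    accept: Callable[[dict], bool] = lambda e: True                              # f_filter
    outbound: Dict[str, Callable[[Metric], dict]] = field(default_factory=dict)  # f_out

@dataclass
class KPI:                                         # Definition 4
    name: str
    context: MonitoringContextType                 # mc
    aggregate: Callable[[List[Any]], Any]          # f_kpi over metric values of all instances
    dimensions: List[str] = field(default_factory=list)   # D

@dataclass
class MonitoringModel:                             # Definition 5
    kpis: List[KPI]
    contexts: List[MonitoringContextType]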
[Figure 3: a business process application emits events through CEI to the monitor model, where inbound event subscription (Vin), filter (ffilter), correlation (fcr), create/terminate (fop) and update (ftrigger) maintain the metrics (M), which map to KPIs and, via conditions (fout), to outbound events (Vout).]
Fig. 3. Monitoring Model
Fig. 3 shows an architecture describing how the elements of a monitoring model work together to compute target KPIs. In this architecture, a monitoring model subscribes to some event broker, for example, Common Event Infrastructure (CEI) [17], to receive inbound events. By using the sets of inbound events Vin of monitoring context types, a particular inbound event is first directed to a corresponding context type. Next, filtering function ff ilter decides if the event can be accepted. After that the correlation fcr identifies the target monitoring context (instance) for an event, and possibly fop determines whether the event creates or terminates a monitoring context instance. Then ftrigger determines which metric will be refreshed by the event. If fout is satisfied, some outbound events Vout are generated and sent to their destination through an event broker. Finally, if some updated metric contributes to some KPI, the KPI is recalculated.
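That flow can be summarized in a single routine over the structures sketched after Definition 5; process_event is a hypothetical helper, and outbound events as well as KPI recomputation are omitted.

def process_event(mc_type, instances, event):
    """Route one inbound event through filter, correlation, creation/termination and
       metric updates, following the flow described for Fig. 3. `instances` maps
       context keys to per-instance metric value dictionaries."""
    if event["type"] not in mc_type.inbound_events or not mc_type.accept(event):
        return
    key = mc_type.correlate(event)
    action = mc_type.op(event)
    if action == "create":
        instances[key] = {name: None for name in mc_type.metrics}
    if key not in instances:
        return                                    # no correlated instance
    for name, metric in mc_type.metrics.items():
        if event["type"] in mc_type.triggers.get(name, set()):
            instances[key][name] = metric.compute(event, instances[key])
    if action == "terminate":
        instances.pop(key, None)                  # KPI aggregation would read it first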
4 Deriving Monitoring Models from Business Artifacts

In this section we show how to define monitoring contexts in the artifact-centric paradigm in a way that leads to intuitive and easy definitions of monitoring models. First we establish a domain-independent mapping between the business artifact model and the monitoring model. Subsequently, we propose a two-step model-driven methodology for deriving application-specific monitoring models from BA process models.

4.1 Monitoring Models in the Business Artifact-Centric Paradigm

We consider two kinds of goals for a business: strategic goals (aspirations) and operational goals (what actually needs to be done). Strategic goals (e.g. increase sales) necessitate the specification of business-level KPIs (e.g. quarterly increase in same-store sales). Examination of the KPIs usually provides clues regarding the information that needs to be measured (or metrics). The operational goals provide the connection to the business artifacts (BAs) that need to be produced and processed by the business. The connection between strategic and operational goals is established by the method: the basic refrain is "what do you need to do in order to meet the strategic goals". Establishing operational goals is usually a two-step process: first we capture the high-level operational requirements, and then make these more tangible by insisting "what needs to be produced to meet the operational requirements". Moreover, operational and strategic goals can influence each other; e.g., an operational insight may lead to a change in strategic goals.
Fig. 4. Mapping between business artifacts and monitoring contexts
There are two characteristics of business artifacts that make them worthy candidates for monitoring contexts. First, a BA provides the basis for establishing and tracking an operational goal (e.g. to improve sales, we need to process more customer orders, i.e. GuestChecks) [12]. In other words, BAs make operational goals more concrete. Second, a basic premise of the business artifact-centric approach is that all information that is important to a business needs to be recorded on BAs. Consequently, BAs provide a natural base for deriving monitoring contexts. In our approach the monitoring context types are derived directly from BA types and the monitoring context instances correspond to BA instances.
Figure 4 shows the connection between the process (or business operation) model and the monitoring model through BAs. The BA information model provides the basic metrics of the monitoring contexts and the basis for creating aggregate or complex metrics. The BA lifecycle represents key points in the progress towards operational goals. Thus, events can be derived from it to capture these important checkpoints. We distinguish five major event types: (i) a BA event type indicating that a BA instance was created or its information changed, (ii) a state event type signaling that a BA instance enters or leaves a state, (iii) a state transition event type, (iv) an activity event type, and (v) a task event type. Each type has several subtypes. For example, we use vst(s, x) to denote a state event, where s is the state name and x is a subtype (Entered or Exited). For each event, its header contains the name of the BA type, the event type and the subtype. The payload, by default, contains the entire BA information model, but can be optimized as we will discuss.

4.2 Method for Creating Monitoring Context Skeleton

Here we focus on how specific monitoring contexts can be created for BA business processes. The first part of our model-driven methodology is a method for creating monitoring context skeletons from BA definitions and from user inputs. In this method, a business user specifies the targeted KPIs (with the help of an appropriate SW tool), and breaks them down hierarchically into metrics that can be mapped or calculated from attributes of BAs or from events. A high-level description of the method follows.
Input:
  KPI = {kpi1, kpi2, ..., kpin}, a set of key performance indicator names
  E = {e1, e2, ..., em}, a set of business artifacts defining the process model
  V = ∪e∈E Ve, where Ve is the set of event types generated by business artifact e, for e ∈ E
Output:
  M = ∪1≤i≤m Mei, where Mei is a set of metrics that are computed from attributes of ei
  Att = {att1, att2, ...}, a set of business artifact attributes needed to calculate M
  MC = {mc1, mc2, ...}, a set of monitoring contexts with skeleton created
  τ : M → V, a mapping indicating the events updating each metric (optional)
Steps:
1. For each kpi ∈ KPI, let the user decompose it into metrics, each of which can be represented as an expression using (i) the attributes of some business artifact type e, and/or (ii) attributes from event headers of event types generated by e, and/or (iii) the value of some other metrics.
   – For each artifact type e ∈ E:
     • E* = E* ∪ {e}
     • Let m1, m2, ..., mk be the metrics defined for instances of e contributing to kpi (i.e., used in the definition of kpi). Then Me = Me ∪ {m1, m2, ..., mk}.
     • Let atti1, atti2, ..., attip be the attributes of e used in metrics contributing to kpi. Then Att = Att ∪ {atti1, atti2, ..., attip}.
2. Create an empty monitoring context mce for each artifact type e ∈ E*.
3. For each m in Me, let the user identify the events τ(m) = {v1, v2, ..., vt} that update m. If an identified event is not generated by e (i.e. an external event), the correlation between the event and mce needs to be specified.
Example 3. To illustrate the method, we show how to create a monitoring context skeleton for the KPIs and metrics in the dashboard of Figure 2. The table of Active Guest Checks can be considered as a list of monitoring context instances with columns showing metrics. Table Turn-Time averages the difference between CheckinTime and CheckoutTime of GuestCheck instances grouped by the hour of CheckoutTime (i.e. the dimension). Sales by Menu Group lists the total amounts of ordered items of each menu group in KitchenOrder instances. The Sales Analysis chart plots daily Sales by Menu Group against a time axis. Strongest Item in Group shows the item in each menu group that appears in KitchenOrder instances most frequently. We illustrate the first three in detail. Obviously, these KPIs are defined for BA types GuestCheck (egc) and KitchenOrder (eko) with possibly all events of these types, i.e., V = egc.V ∪ eko.V. Accordingly, we define two monitoring context types, mcgc and mcko, from egc and eko respectively.
– Hourly Table Turn-Time (kpiturntime): The KPI averages TurnTime, a metric calculated from the attributes of GuestCheck (egc). Thus, TurnTime is a metric of mcgc. The metric is updated when a guest checks out, i.e. by the event "GuestCheck enters Complete state" (i.e., vst(Complete, Entered)) (see Fig. 1).
    kpiturntime = avg(mcgc.TurnTime)
    mcgc.TurnTime = egc.CheckOutTime − egc.CheckInTime
    mcgc.kpiturntime.D = {hour(egc.CheckoutTime)}   (Dimension)
Fig. 5. Monitoring Context Skeleton
– Sales by Menu Group: This collection of KPIs can be calculated from attributes of KitchenOrder. We take Dinner Sales (kpidinner) as an example. It is computed from an aggregated metric DinnerSales, the total of dinner sales for a KitchenOrder instance. BeverageSales, OmeletteSales, DessertSales, and SandwichSales are defined in a similar fashion. Accordingly, we define FoodSales as the sum of DinnerSales, OmeletteSales, and SandwichSales. These metrics are defined in mcko and updated once a kitchen order is created, i.e. on vst(Created, Entered) (assuming no order cancellation).
    kpidinner = sum(mcko.DinnerSales)
    mcko.DinnerSales = Σ_{∀ i ∈ eko.OrderedItems & i.Group = 'Dinner'} i.Amount
– Active Guest Checks: The table of active guest checks shown in Figure 2 in fact lists active instances of mcgc. The columns of the table are metrics of mcgc. Some of them can be directly mapped to the attributes of egc, such as Guest Check ID and Customer Name. The values of these metrics are populated when a guest check is created (i.e. vst(Created, Entered)). The others are calculated from the metrics of mcko defined above. We take Food Sales (GC_FoodSales) as an example. The total food sales of a guest check should be calculated based on the ordered food items in the associated kitchen orders. Since GC_FoodSales depends on mcko.FoodSales, whenever mcko.FoodSales is updated, GC_FoodSales is recalculated. Therefore, the KitchenOrder event vst(Created, Entered) also updates GC_FoodSales. However, this external event needs to correlate with mcgc on the guest check ID (GChID), an attribute in its payload.
    mcgc.GC_FoodSales = Σ_{∀ mcko | mcko.GChID = mcgc.ID} mcko.FoodSales
Fig. 5 shows the skeletons for mc_gc and mc_ko based on the discussion above.
4.3 Algorithm for Generating the Complete Monitoring Contexts
The second part of our methodology is a fully automated procedure for deriving executable monitoring models from the monitoring context skeletons. The procedure, which follows, takes as input the skeletons defined in the previous section.
Input: metrics M, monitoring contexts MC, a mapping τ: M → V from the method in Sec. 4.2
Output: MC completely defined
Steps: For each mc ∈ MC with corresponding business artifact e,
1. Set the monitoring context key to the business artifact id, i.e., mc.K = {id_e}.
2. Set the metrics as mc.M = M_e.
3. Set inbound events using τ if τ is defined; otherwise, inbound events by default are the event types generated by e, i.e.,
– if τ(mc.M) ≠ ∅ then mc.V_in = ⋃_{m ∈ mc.M} τ(m)
– else mc.V_in = V_e
4. Add v_st(s_init, entered) and v_st(s_final, entered) to mc.V_in to make sure mc gets properly initialized and terminated.
5. Set the trigger mc.f_trigger for every metric in mc.M as follows: ∀m ∈ mc.M, if τ(m) is defined, m is updated by the events in τ(m); otherwise, m is updated by every v ∈ V_e using the payload of v (i.e., business artifact data). In other words,
– if τ(m) ≠ ∅ then mc.f_trigger(m) = τ(m)
– else mc.f_trigger(m) = V_e
6. Set the filter of every inbound event to true, i.e., ∀v ∈ mc.V_in, mc.f_filter(v) = true, since inbound events are limited to those generated by business artifact e or specified by τ.
7. Set the operation mc.f_op: ∀v ∈ mc.V_in, if v indicates that a business artifact enters an initial state, a new monitoring context instance is created; if v indicates that an artifact reaches a final state, the correlated monitoring context instance is terminated. In other words,
– if v = v_st(s_init, entered) then mc.f_op(v) = create
– elseif v = v_st(s_final, entered) then mc.f_op(v) = terminate
– else mc.f_op(v) = null
8. Set mc.f_cr to correlate inbound events with mc. Unless specified otherwise, v is correlated with the mc instance whose key equals the business artifact id in its payload.
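To make the generation procedure concrete, here is a minimal, illustrative Python sketch (not the authors' implementation); the dictionary keys mirror the notation above (K, M, V_in, f_trigger, f_filter, f_op, f_cr), and the artifact fields (id_attribute, metrics, events, initial_state, final_state) are assumptions.

```python
def complete_monitoring_context(mc, artifact, tau):
    """mc: skeleton dict; artifact: business artifact type; tau: metric -> set of event types."""
    mc["K"] = {artifact["id_attribute"]}                        # step 1: key = artifact id
    mc["M"] = set(artifact["metrics"])                          # step 2: metrics M_e
    # step 3: inbound events from tau, or all events of the artifact by default
    mapped = set().union(*(tau.get(m, set()) for m in mc["M"])) if tau else set()
    mc["V_in"] = mapped if mapped else set(artifact["events"])
    # step 4: lifecycle events for initialization/termination
    mc["V_in"] |= {("state", artifact["initial_state"], "entered"),
                   ("state", artifact["final_state"], "entered")}
    # step 5: triggers per metric
    mc["f_trigger"] = {m: (tau.get(m) or set(artifact["events"])) for m in mc["M"]}
    # step 6: accept every inbound event
    mc["f_filter"] = {v: True for v in mc["V_in"]}
    # step 7: create/terminate on initial/final state events
    def op(v):
        if v == ("state", artifact["initial_state"], "entered"):
            return "create"
        if v == ("state", artifact["final_state"], "entered"):
            return "terminate"
        return None
    mc["f_op"] = {v: op(v) for v in mc["V_in"]}
    # step 8: correlate each inbound event on the artifact id carried in its payload
    mc["f_cr"] = lambda event, instance: event["payload"].get("artifact_id") == instance["key"]
    return mc
```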
Example 4. Figure 6 presents the complete monitoring contexts mc_gc and mc_ko generated by the algorithm, given the monitoring context skeletons from Example 3. Compared with the skeletons shown in Figure 5, a fully defined monitoring context contains many technical specifications, for instance triggers and correlations, which often discourage business users from defining monitoring models. Our approach automatically generates these technical details, removing technical roadblocks for business users and significantly reducing the complexity of performance monitoring.
Execution engine configuration. When Att (i.e., the business artifact attributes needed for calculating the targeted KPIs) is defined, a business artifact execution engine can be configured in two respects to improve the efficiency of performance monitoring.
– Event payload: By default, each event carries the entire business artifact information as its payload. The execution engine can reduce the payload to contain only the attributes in Att. When a business artifact contains a huge amount of data (for example, blob data that is typically not needed for monitoring purposes), this reduction may be necessary to improve event transmission efficiency.
– Event emitting: If τ (i.e., the events that bring updates to the monitoring context metrics) is not specified, by default all events created by a business artifact are emitted. The execution engine can reduce the number of emitted events by tracking changes in attribute values: if an event payload contains an attribute with a changed value and this attribute is in the set Att, the event is emitted; otherwise, it is blocked (sketched in code below).
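A minimal sketch of this payload reduction and emit-side filtering, assuming each event exposes its payload and the set of attributes whose values changed; the names are illustrative and are not taken from the Siena or WebSphere APIs.

```python
def reduce_payload(payload, att):
    """Event payload reduction: keep only the attributes in Att needed for the targeted KPIs."""
    return {name: value for name, value in payload.items() if name in att}

def should_emit(event, att):
    """Default emit filter (used when tau is not specified):
    emit only if some attribute whose value changed is in Att."""
    return any(name in att for name in event.get("changed_attributes", []))

# Hypothetical usage with invented attribute names.
att = {"CheckInTime", "CheckOutTime", "OrderedItems"}
event = {"changed_attributes": ["CheckOutTime"],
         "payload": {"CheckOutTime": "12:45", "Notes": "large blob not needed for monitoring"}}
if should_emit(event, att):
    outgoing = dict(event, payload=reduce_payload(event["payload"], att))
```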
Fig. 6. Completed Monitoring Contexts
5 Implementation
We implemented our methodology in Siena, a business artifact-centric process modeling and execution prototype [18]. Fig. 7 shows a few selected screenshots that guide a business user through designing KPIs, as described in Section 4. In screenshot (1), the user begins to define a monitoring model with targeted KPIs. The details of a KPI can be defined as shown in screenshot (2). A Content Assistant is provided to help the user formulate the KPI using metrics. Instead of entering formulas as in Example 3, the user can select predefined functions and operators to formulate the KPI. Each metric can be further defined in screenshot (3), where the metric can be stated as an expression over the attributes of a BA, with the help of the Content Assistant. Moreover, the user can optionally select the events that update the attributes and hence refresh the metric. Finally, when a KPI is defined, the corresponding monitoring contexts are automatically generated and appended to the monitoring model, as shown in screenshot (4). The monitoring model is deployed to WebSphere Business Monitor (WBPM) [16]. WBPM has a monitoring runtime server and an integrated dashboard. At runtime, business artifact-centric processes are executed by Siena. Events are emitted from Siena and sent to the Common Event Infrastructure (CEI) [17] through a web service. WBPM subscribes to CEI to receive inbound events. The received events are processed as shown in Fig. 3. KPIs and metrics are displayed in a dashboard (an example is shown in Fig. 2).
Fig. 7. Performance Monitoring User Interface (screenshots (1)–(4); panel (4): Completed Monitor Model)
6 Discussion and Evaluation
The real-time performance monitoring methodology that uses business artifacts as the basis of monitoring contexts has several desirable features compared to Complex Event Processing (CEP) [8] or activity-centric process modeling approaches.
(A) Explicitly modeling monitoring contexts. Monitoring contexts enable top-down performance monitoring design and improve the understandability of performance measures and monitoring models. In comparison, CEP does not specify monitoring contexts explicitly; it is assumed that context information can be retrieved from other parts of the monitoring model and the monitored applications. However, in a distributed environment such as SOA, this information may be spread across different applications and not centrally controlled, which makes it challenging to retrieve in a timely manner.
(B) Reusable and configurable. Business artifacts with appropriate granularity make the corresponding monitoring contexts flexible and configurable for KPIs from different views. Intuitively, each monitoring context represents an object view related to a business artifact. Monitoring contexts can easily be configured and correlated to provide a process view. For example, in Example 3, we combine the KitchenOrder and GuestCheck monitoring contexts to measure the performance of the entire catering process. Similarly, a resource view can easily be achieved because resources, like Chef and Waiter in Example 1, can also be modeled as business artifacts. In comparison, in the activity-centric approach it is not straightforward to achieve an object view, since the lifecycle of an object is fragmented across different activities (see below). It is also difficult to achieve a resource view because, in most cases, only the role players of activities are modeled and resource management is considered external to process management.
(C) Simplicity and unifying nature of business artifacts. Key elements of monitoring models in our approach are centered around business artifacts: each monitoring context corresponds to a business artifact; metrics are derived from attributes of the business artifact information model; each event or trigger is either related to a state change in a BA lifecycle or caused by an activity triggering such a state change; and the event payload is defined uniformly as the business artifact information model. In contrast, defining a monitoring model in the activity-centric process paradigm is more complex and often requires careful manual adjustment. In [19], White presents a semi-automatic approach to generate a monitoring model from an activity-centric process model. Given a process definition (e.g., a process design in BPMN), this approach automatically generates a process monitoring context (PMC) for the entire process itself, and a number of activity monitoring
contexts (AMC), one for each activity included. Inbound events of a PMC link to process/activity status changes (e.g., process start, paused, etc. – about 20 states in total), while the event payload consists of all application data involved in the process. Inbound events of an AMC are related to activity status changes (around 20 states, with different types of activities having different states), while the event payload consists of the activity's input or output data (different activities have different input and output data). The complexity of this and similar activity-centric approaches arises from several aspects. (1) Number of monitoring contexts: One process may contain a large number of activities, each having its own monitoring context, which is a burden for monitoring solution designers. For comparison, we modeled the restaurant example as an activity-centric model. We needed 6 activities to model the problem. Using White's approach, 7 monitoring contexts are generated (1 for the process and 6 for the activities), compared to 2 monitoring contexts (GuestCheck and KitchenOrder) in the BA approach. (2) Number of distinct event types: Each process or activity has its own set of event types, and their event payloads differ from each other. This diversity makes it difficult for designers to identify the right information needed for KPI definition and calculation. In the activity-centric model of the restaurant example, the total number of event types for the process and the activities is close to 200, each with a different payload. In contrast, there are around 40 events with a uniform payload (either the GuestCheck or the KitchenOrder information model) in the artifact-centric approach. (3) Correlation of monitoring contexts: The possibly large number of monitoring contexts in the activity-centric approach makes their correlation rather complicated. AMCs need to be correlated with the PMC via the process instance ID. If some KPIs are related to business results recorded in activity input/output data, AMCs need to be correlated with each other via data object IDs in event payloads. Note that the restaurant example is a simple process. For large-scale, real-life processes the differences between the two approaches will be much more significant.
7 Related Work
Business performance management is a set of management and analytic processes that enable the management of an organization's performance to achieve one or more preselected goals [20]. The practices of business performance management have been strongly influenced by the rise of the balanced scorecard framework [21]. It is common for managers to use this framework to clarify the goals of an organization, to identify how to track them, and to structure the mechanisms by which interventions will be triggered. As a result, the balanced scorecard is often used as the basis for business performance management activities within organizations. As companies put significant effort into developing business strategies and goals, however, many of them have no processes or tools in place to validate that the strategies are being implemented as planned. Moreover, the need to continuously refine business goals and strategies requires an IT system that absorbs these changes and helps business users optimize business processes to satisfy business objectives. A rich literature on designing performance measures focuses on guidelines or recommendations for choosing performance measures (or KPIs) and for designing performance measurement frameworks [22]. For example, it is recommended that business measures or KPIs should be
derived from strategies and reflect business processes [23]. Our business artifact-centric performance monitoring is well aligned with this principle. Our proposition is that KPIs measure the attainment of operational goals, derived from strategies and encapsulated as business artifacts. Moreover, an in-depth case study of performance measurement practice in [7] highlights several key challenges. Our work addresses the issue of designing performance measures that are useful and understandable to business users. For performance monitoring, most approaches, such as complex event processing (CEP) [8], are event-driven, with events (often from multiple sources) being collected, correlated, and analyzed based on predefined event rules or patterns. Generally, monitoring contexts are not defined explicitly in these approaches. Business activity monitoring (BAM), a term originally coined by Gartner [9], refers to the aggregation, analysis, and presentation of real-time information about activities inside organizations and involving customers and partners. A number of vendors in the business process management area currently offer BAM software; an example is IBM Cognos Now! [24]. Although the concept of monitoring context may not be explicitly used in these commercial products, contextual information and the related events needed for calculating target performance measures are still the key to performance monitoring and, in most cases, are manually configured. There are other real-time performance monitoring systems, for example a semantic approach based on the Web Ontology Language [5], the iWISE architecture with a business rule engine [3], and policy-driven performance management [4]. These approaches use event rules without explicitly modeling monitoring contexts. Contexts for executing event rules need to be built case by case.
8 Conclusion
In this paper, we proposed to formally model monitoring contexts to enable systematic real-time performance monitoring design based on business artifacts. We formalized the concept of monitoring contexts and designed a two-phase, top-down methodology for defining monitoring models. This methodology allows domain experts or business users to focus on defining metric and KPI requirements without the burden of the detailed technical specification of monitoring contexts. Instead, a complete monitoring context can be automatically generated from these requirements and the underlying business artifacts. Our methodology dramatically simplifies the design of real-time performance monitoring, compared with typical approaches in the activity-centric paradigm. Moreover, in our approach, performance monitoring can be integrated with business artifact-centric process modeling, since both rely on the same fundamental element: business artifacts. This integration leads to increased process visibility and to agile and effective monitoring systems. Our future work is to extend our methodology to business artifacts defined in declarative languages, and we plan to apply our methodology in real engagements.
References
1. Zisman, M.: IBM's vision of the on demand enterprise. In: IBM Almaden Conference on Co-Evolution of Business and Technology (2003)
2. White, C.: Now is the right time for real-time BI. Information Management Magazine (2004)
3. Costello, C., Molloy, O.: Building a process performance model for business activity monitoring. Information Systems Development, 237–248 (2009)
4. Jeng, J.J., Chang, H., Bhaskaran, K.: Policy driven business performance management. In: Sahai, A., Wu, F. (eds.) DSOM 2004. LNCS, vol. 3278, pp. 52–63. Springer, Heidelberg (2004)
5. Thomas, M., Redmond, R., Yoon, V., Singh, R.: A semantic approach to monitor business process. Commun. ACM 48, 55–59 (2005)
6. Lakshmanan, G.T., Keyser, P., Slominski, A., Curbera, F., Khalaf, R.: A business-centric end-to-end monitoring approach for service composites. In: 2010 IEEE International Conference on Services Computing (SCC 2010), pp. 409–416. IEEE Computer Society, Los Alamitos (2010)
7. Corea, S., Watters, A.: Challenges in business performance measurement: The case of a corporate IT function. In: Alonso, G., Dadam, P., Rosemann, M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 16–31. Springer, Heidelberg (2007)
8. Luckham, D.C.: The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems. Addison-Wesley, Boston (2002)
9. Gassman, B.: How the pieces in a BAM architecture work. Technical report, Gartner Research (2004)
10. Nesamoney, D.: BAM: Event-driven business intelligence for the real-time enterprise. Information Management Magazine 14, 38–48 (2004)
11. White, S.A.: Introduction to BPMN. IBM Corporation (May 2004)
12. Nigam, A., Caswell, N.S.: Business artifacts: An approach to operational specification. IBM Systems Journal 42, 428–445 (2003)
13. Bhattacharya, K., Caswell, N.S., Kumaran, S., Nigam, A., Wu, F.Y.: Artifact-centered operational modeling: Lessons from customer engagements. IBM Systems Journal 46, 703–721 (2007)
14. Chao, T., Cohn, D., Flatgard, A., Hahn, S., Linehan, M.H., Nandi, P., Nigam, A., Pinel, F., Vergo, J., Wu, F.Y.: Artifact-based transformation of IBM Global Financing. In: Dayal, U., Eder, J., Koehler, J., Reijers, H.A. (eds.) BPM 2009. LNCS, vol. 5701, pp. 261–277. Springer, Heidelberg (2009)
15. Sill, B.: Brand metrics can help restaurant service measure up to customer expectations. Nation's Restaurant News, 24 (May 27, 2002)
16. IBM: IBM Business Monitor (2010), http://www-01.ibm.com/software/integration/business-monitor/
17. IBM: Common Event Infrastructure (2004), http://www-01.ibm.com/software/tivoli/features/cei/
18. Cohn, D., Dhoolia, P., Heath, F., Pinel, F., Vergo, J.: Siena: From PowerPoint to web app in 5 minutes. In: Bouguettaya, A., Krueger, I., Margaria, T. (eds.) ICSOC 2008. LNCS, vol. 5364, pp. 722–723. Springer, Heidelberg (2008)
19. White, S.: Best practices for using WebSphere Business Modeler and Monitor. Technical report, IBM (2006)
20. Ballard, C., White, C., McDonald, S., Myllymaki, J., McDowell, S., Goerlich, O., Neroda, A.: Business performance management... meets business intelligence (2005), http://www.redbooks.ibm.com/abstracts/sg246340.html
21. Kaplan, R.S., Norton, D.P.: The balanced scorecard - measures that drive performance. Harvard Business Review 70, 71–79 (1992)
22. Folan, P., Browne, J.: A review of performance measurement: towards performance management. Computers in Industry 56, 663–680 (2005)
23. Neely, A., Richards, H., Mills, J., Platts, K., Bourne, M.: Designing performance measures: a structured approach. International Journal of Operations & Production Management 17, 1131–1152 (1997)
24. IBM: IBM Cognos Now! v4.6 (2008)
A Query Language for Analyzing Business Processes Execution
Seyed-Mehdi-Reza Beheshti¹, Boualem Benatallah¹, Hamid Reza Motahari-Nezhad², and Sherif Sakr¹
¹ University of New South Wales, Sydney, Australia
{sbeheshti,boualem,ssakr}@cse.unsw.edu.au
² HP Labs Palo Alto, CA, USA
[email protected]
Abstract. The execution of a business process (BP) in today's enterprises may involve a workflow and multiple IT systems and services. Often no complete, up-to-date documentation of the model or of the correlation information of process events exists. Understanding the execution of a BP in terms of its scope and details is challenging, especially as it is subjective: it depends on the perspective of the person looking at the BP execution. We present a framework, simple abstractions, and a language for the explorative querying and understanding of BP execution from various user perspectives. We propose a query language for analyzing event logs of process-related systems based on the two concepts of folders and paths, which enable an analyst to group related events in the logs or find paths among events. Folders and paths can be stored to be used in follow-on analysis. We have implemented the proposed techniques and the language, FPSPARQL, by extending the SPARQL graph query language. We present evaluation results on the performance of the engine and the quality of the results using a number of process event logs.
Keywords: Business Processes, Query Processing, Event Correlation.
1 Introduction
A business process (BP) consists of a set of coordinated tasks and activities employed to achieve a business objective or goal. In modern enterprises, BPs are realized over a mix of workflows, IT systems, Web services, and direct collaborations of people. Understanding business processes and analyzing BP execution data (e.g., logs containing events, interaction messages, and other process artifacts) is difficult due to a lack of documentation, and especially because the process scope, and how process events across these systems are correlated into process instances, are subjective: they depend on the perspective of the process analyst. As an example, one analyst may want to understand the delays in the ordering process (end-to-end, from ordering to delivery) for a specific customer, while another analyst is only concerned with the packaging process for any orders in the shipping department. Certainly, one process model would not serve
the analysis purpose for both situations. Rather, there is a need for a process-aware querying approach that enables analysts to analyze process events from their perspectives, for the specific goal that they have in mind, and in an explorative manner. In this paper, we focus on addressing this problem. The first step of process analysis is the gathering and integration of process execution data into a process event log from various, potentially heterogeneous, systems and services. We assume that execution data are collected from the source systems and transformed into an event log using existing data integration approaches [29], and that we can access the event metadata and the payload content of events in the integrated process log. The next step is providing techniques that enable users to define the relationships between process events. The various ways in which process events may be correlated are characterized in earlier work ([8] and our prior work [25]). In particular, in [25], we introduced the notion of a correlation condition as a binary predicate defined on the attributes of the event payload that allows identifying whether two or more events are potentially related to the same execution instance of a process. We use the concept of correlation condition to formulate the relationships between any pairs of events in the log. Standard (database) querying and exploration techniques over individual system logs are not process-aware, and therefore entail writing complex queries, since at this stage the correlation among process events is often not known. We rather need process-centric abstractions and methods that enable the querying and exploration of process event relationships, and the discovery of potential process models. In this paper, we introduce a data model for process events and their relationships, and a query language to query and explore events, their correlations, and possible aggregations of events into process-centric abstractions. We introduce the two concepts of folders and paths, which help in partitioning the events in logs into groups and paths, in order to simplify the discovery of process-centric relationships (e.g., process instances) and abstractions (e.g., process models). We define a folder node as a placeholder for a group of inter-related events. We use a path node to represent a set of events that are related to each other through transitive relationships. These paths may lead to the discovery of process instances. In summary, we present a novel framework for manipulating, querying, and analyzing event logs. The unique contributions of the paper are as follows:
– We propose a graph data model that supports typed and untyped events, and introduce folder and path nodes as first-class abstractions. A folder node contains a collection of related events, and a path node represents the results of a query that consists of one or more paths in the event relationship graph based on a given correlation condition.
– We present a process event query language and graph-based query processing engine called FPSPARQL, which is a Folder-Path enabled extension of SPARQL [28]. We use FPSPARQL to query and analyze events, folder nodes, and path nodes in order to analyze business process execution data.
– We describe the implementation of FPSPARQL, the results of the evaluation of the performance of the engine, and the quality of the results over large event logs. The evaluation shows that the engine is reasonably fast and that the quality of the query results is high in terms of precision/recall.
Fig. 1. Event log analysis case study: (a) example of SCM service interaction log; (b) a simplified business process in the SCM log for the retailer service [24]
– We provide a front-end tool for the exploration and visualization of results in order to enable users to examine the event relationships and the potential for discovering process instances and process models.
The remainder of this paper is organized as follows: We present a case study on process event logs in Section 2. In Section 3 we give an overview of the query language and data model. We present the FPSPARQL query language in Section 4. In Section 5 we show how we use the query language for analyzing the case study process log. In Section 6 we describe the query engine implementation and evaluation experiments. Finally, we discuss related work in Section 7, before concluding the paper with a prospect on future work in Section 8.
2 Event Log Analysis: Example Scenario
Let us assume a set of web services that interact to realize a number of business processes. In this context, two or more services exchange messages to fulfil a certain functionality, e.g., to order goods and deliver them. The events related to messages exchanged during service conversations may be logged using various infrastructures [25]. A generic log model L is represented by a set of messages L = {m_1, m_2, ..., m_n}, where each message m_i is represented by a tuple m_i ∈ A_1 × A_2 × ... × A_k [23]. The attributes A_1, ..., A_k represent the union of all the attributes contained in all messages. Each single message typically contains only a subset of these attributes, and m_x.A_i denotes the value of attribute A_i in message m_x. Each message m_x has a mandatory attribute τ that denotes the timestamp at which the event (related to the exchange of m_x) was recorded. In particular, we use the interaction log of a set of services in a supply scenario provided by WS-I (the Web Service Interoperability organization), referred to in the following as SCM.
Fig. 2. Event log analysis scenario
Figure 1 illustrates a simplified business process in SCM (Supply Chain Management) and an example of the SCM log. The log of the SCM business service contains 4,050 messages, 14 service operations (e.g., CO, PO, and Inv), and 28 attributes (e.g., sequenceid, custid, and shipid). Figure 1(b) illustrates a simplified business process in the SCM log for the retailer service. We will use this log throughout the paper to demonstrate how various users use the querying framework introduced in this paper for exploring and understanding process event logs. For example, we will show how an analyst can: (i) use correlation conditions to partition the SCM log (e.g., Examples 1 and 2 in Section 5); (ii) explore the existence of transitive relationships between messages in the constructed partitions to identify process instances (e.g., Examples 3 and 4 in Section 5); and (iii) analyze discovered process instances by discovering a process model, to understand the result of the query in terms of process execution visually (e.g., Example 5 in Section 5).
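As a concrete, purely illustrative rendering of the generic log model L described above, the following Python sketch represents each message as a set of attribute values plus the mandatory timestamp τ; the attribute names (custid, sequenceid, operation) come from the SCM description, but the values are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    """One log entry m_x: a subset of the global attributes A_1..A_k plus a timestamp tau."""
    timestamp: float                          # mandatory attribute tau
    attributes: dict = field(default_factory=dict)

    def get(self, name):
        """Value of attribute A_i in this message (None if the message lacks it)."""
        return self.attributes.get(name)

# A toy SCM-style log L = {m_1, ..., m_n}; attribute values are invented.
log = [
    Message(timestamp=1.0, attributes={"operation": "CO",  "custid": "c7", "sequenceid": 1}),
    Message(timestamp=2.5, attributes={"operation": "PO",  "custid": "c7", "sequenceid": 2}),
    Message(timestamp=3.1, attributes={"operation": "Inv", "custid": "c9", "sequenceid": 1}),
]
```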
3 A Query Language for Analyzing Process Logs
3.1 Overview
We introduce a graph-based data model for modeling the process entities (events, artifacts, and people) in process logs and their relationships, in which the relationships among entities can be expressed using regular expressions. In order to enable the explorative querying of process logs represented in this model, we propose the design and development of an interactive query language that operates on this graph-based data model. The query language enables users to find entities of interest and their relationships. The data model includes abstractions that act as higher-level entities over related entities, both to browse the results and to store them for follow-on queries. The process events in these higher-level entities can be used for further process-specific analysis purposes, for instance inspecting entities to find process instances, or applying process mining algorithms on containers holding process instances to discover process models. Figure 2 shows an overview of the steps in analyzing process event logs (i.e., preprocessing, partitioning, and analysis) in our framework, which is described in the following sections.
3.2 Data Model and Abstractions
We propose to model a process log as a graph of typed nodes and edges. We define a graph data model for organizing a set of entities as graph nodes and entity relationships as edges of the graph. This data model supports: (i) entities, each represented as a data object that exists separately and has a unique identity; (ii) folder nodes, which contain entity collections. A folder node represents the results of a query that returns a collection of related entities; and (iii) path nodes, which refer to one or more paths in the graph, which are also the result of a query. A path is the transitive relationship between two entities. Entities and relationships are represented as a directed graph G = (V, E), where V is a set of nodes representing entities, folder nodes, or path nodes, and E is a set of directed edges representing relationships between nodes.
Entities. Entities can be structured or unstructured. Structured entities are instances of entity types. An entity type consists of a set of attributes. Unstructured entities are also described by a set of attributes but may not conform to an entity type. This entity model offers flexibility when types are unknown and takes advantage of structure when types are known. We assume that all unstructured entities are instances of a generic type called ITEM. ITEM is similar to the generic table in [26]. We store entities in the entity store.
Relationships. A relationship is a directed link between a pair of entities, which is associated with a predicate (e.g., a regular expression) defined on the attributes of entities that characterizes the relationship. A relationship can be explicit, such as 'was triggered by' in 'event1 wasTriggeredBy event2' in a BP execution log. A relationship can also be implicit, such as a relationship between an entity and a larger (composite) entity that can be inferred from the nodes.
Folder Nodes. A folder node contains a set of entities that are related to each other, i.e., the set of entities in a folder node is the result of a given query that requires grouping graph entities in a certain way. The folder concept is akin to that of a database view defined on a graph. However, a folder node is part of the graph and creates a higher-level node on which other queries can be executed. Folders can be nested, i.e., a folder can be a member of another folder node, to allow creating and querying folders with relationships at higher levels of abstraction. A folder may have a set of attributes that describe it. A folder node is added to the graph and can be stored in the database to enable reuse of the query results for frequent or recurrent queries.
Path Nodes. A path is a transitive relationship between two entities showing a sequence of edges from the start entity to the end entity. This relationship can be codified using regular expressions [5,11] whose alphabet consists of the nodes and edges of the graph. We define a path node for each query that results in a set of paths. We use existing reachability approaches to verify whether an entity is reachable from another entity in the graph. Some reachability approaches (e.g., all-pairs shortest path [11]) report all possible paths between two entities. We define a path node as a triple (V_start, V_end, RE), in which V_start is the start
node, V_end is the end node, and RE is a regular expression. We store all the paths of a path node in the path store.
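A minimal, illustrative sketch of this data model (not the authors' implementation); the class names and fields are assumptions that mirror the definitions above.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """A node of G = (V, E); unstructured entities use the generic type ITEM."""
    id: str
    type: str = "ITEM"
    attributes: dict = field(default_factory=dict)

@dataclass
class Relationship:
    """A directed edge with a predicate label (e.g. 'wasTriggeredBy' or 'custid')."""
    source: str
    target: str
    label: str

@dataclass
class FolderNode:
    """A first-class node holding a collection of related entities or nested folders."""
    name: str
    members: set = field(default_factory=set)       # entity or folder ids
    attributes: dict = field(default_factory=dict)

@dataclass
class PathNode:
    """A first-class node (V_start, V_end, RE) whose paths are kept in the path store."""
    start: str
    end: str
    regex: str
    paths: list = field(default_factory=list)        # each path is a list of node/edge ids
```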
4 Querying Process Logs
As mentioned earlier, we model process logs as a graph. In order to query this graph, a graph query language is needed. Among languages for querying graphs, SPARQL [28] is an official W3C standard based on a powerful graph-matching mechanism. However, SPARQL does not support the construction and retrieval of subgraphs. Also, paths are not first-class objects in SPARQL [28,19]. In order to analyze BP event logs (see Section 3), we propose a graph processing engine, FPSPARQL [12] (a Folder-Path enabled extension of SPARQL), to manipulate and query entities, folder nodes, and path nodes. We support two levels of queries: (i) entity-level queries: at this level we use SPARQL to query entities in the process logs; and (ii) aggregation-level queries: at this level we use FPSPARQL to construct and query folder and path nodes.
4.1 Entity-Level Queries
At this level, we support the use of SPARQL to query entities and their attributes in the process logs. SPARQL is a declarative and extensible query language that contains capabilities for querying required and optional graph patterns. Each pattern consists of a subject, predicate, and object, and each of these can be either a variable or a literal. We use the '@' symbol to represent attribute edges and distinguish them from the relationship edges between graph nodes. As an example, we may be interested in retrieving the list of messages in the SCM log (Section 2) that have the same value for the 'requestsize' and 'responsesize' attributes and whose timestamps fall between τ1 and τ2. The SPARQL query for this example is:
select ?m
where { ?m @type message. ?m @requestsize ?x. ?m @responsesize ?y. ?m @timestamp ?t.
        FILTER(?x = ?y && ?t > τ1 && ?t < τ2). }
In this query, the variable ?m represents messages in the SCM log. Variables ?x, ?y, and ?t represent the values of the attributes 'requestsize', 'responsesize', and timestamp, respectively. Finally, the FILTER statement restricts the result to those messages for which the filter expression evaluates to true.
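Purely as an illustration of what this query computes, here is an equivalent filter over the in-memory Message list sketched in Section 2; this is not how FPSPARQL evaluates the query, it only mirrors its semantics.

```python
def select_m(log, t1, t2):
    """Messages whose 'requestsize' equals 'responsesize' and whose timestamp lies strictly between t1 and t2."""
    return [m for m in log
            if m.get("requestsize") is not None
            and m.get("requestsize") == m.get("responsesize")
            and t1 < m.timestamp < t2]
```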
4.2 Aggregation-Level Queries
The standard SPARQL querying mechanism is not enough to support the querying needs for analyzing BP execution data based on the data model introduced in Section 3.2. In particular, SPARQL does not natively support folder and path nodes or querying them. In addition, querying the result of a previous query becomes complex and cumbersome, at best. Also, path nodes are not first-class objects in SPARQL [11,19]. We extend SPARQL to support aggregation-level
queries to satisfy the specific querying needs of the proposed data model. Aggregation-level queries in FPSPARQL include two special constructs: (a) construct queries, used for constructing folder and path nodes, and (b) apply queries, used to simplify applying queries on folder and path nodes.
Folder Node Construction. To construct a folder node (i.e., a partition of the event log), we introduce the fconstruct statement. This statement is used to group a set of related entities or folders. A basic folder node construction query looks like this:
fconstruct <FolderNode Name> [ select ?var1 ?var2 ... | (Node1 ID, Node2 ID, ...) ]
where { pattern1. pattern2. ... }
A query can be used to define a new folder node by listing the folder node name and the entity definitions in the fconstruct and select statements, respectively. A folder node can also be defined to group a set of entities, folder nodes, and path nodes (see Example 2 in Section 5). A set of user-defined attributes for the folder can be specified in the where statement. We instrument the folder construction query with the correlate statement in order to apply a correlation condition on entity nodes (e.g., messages in service event logs) and correlate the entity nodes for which the condition evaluates to true. A correlation condition ψ is a predicate over the attributes of events that attests whether two events belong to the same instance. For example, considering the SCM log, one possible correlation condition is ψ(m_x, m_y): m_x.custid = m_y.custid, where ψ(m_x, m_y) is a binary predicate defined over the attribute custid of two messages m_x and m_y in the log. This predicate is true when m_x and m_y have the same custid value, and false otherwise. A basic correlation condition query looks like this:
correlate { (entity1, entity2, edge1, condition) pattern1. pattern2. ... }
As a result, entity1 will be correlated to entity2 through a directed edge edge1 if the condition evaluates to true. Patterns (e.g., pattern1) can be used to specify the edge attributes. Example 1 in Section 5 illustrates such a query.
Path Node Construction. We introduce the pconstruct statement to construct a path node. This statement can be used to: (i) discover transitive relationships between two entities (e.g., by using an existing graph reachability algorithm); or (ii) discover frequent pattern(s) between a set of entities (e.g., by using an existing process mining algorithm). In both cases, the result will be a set of paths that can be stored under a path node name. In general, a basic path node construction query looks like this:
pconstruct <PathNode Name> (Start Node, End Node, Regular Expression)
where { pattern1. pattern2. ... }
A regular expression can be used to define a transitive relationship between two entities, i.e., a starting node and an ending node, or a set of frequent patterns to be discovered. Attributes of the starting node, the ending node, and the regular expression
alphabet (i.e., graph nodes and edges) can be defined in the where statement. The query applied on a folder in Example 3 (Section 5) illustrates such a query.
Folder Node Queries. We introduce the apply statement to retrieve information, i.e., by applying queries, from the underlying folder nodes. These queries can be applied on one folder node or on a composition of several folder nodes. Our model supports the standard set operations (union, intersect, and minus) to apply queries on a composition of several folder nodes. In general, a basic folder node query looks like this:
[Folder Node | (Composition of Folder Nodes)] APPLY ( [ Entity-level Query | Aggregation-level Query | Process Mining Algorithm ] )
An entity- or aggregation-level query can be applied on folder nodes by listing the folder node or the composition of folder nodes before the apply statement, and placing the query in parentheses after the apply statement. We also developed an interface to support applying existing process mining algorithms on folder and path nodes. Examples 3 and 4 in Section 5 illustrate such queries.
Path Analysis Queries. This type of query is used to retrieve information from the underlying path node, i.e., by applying entity-level queries, using the apply statement. Domain experts may use such queries in order to examine the discovered process instances. In general, a basic path node query looks like this:
<Path Node Name> APPLY ( Entity-level Query )
An entity-level query can be applied on a path node by listing the path node name before the apply statement, and placing the query in parentheses after the apply statement. Example 5 in Section 5 illustrates such a query.
5 Using FPSPARQL on SCM Scenario
In this section we show how we use the FPSPARQL query language to analyze process logs. We focus on the case study presented in Section 2.
Preprocessing. The aim of preprocessing the log is to generate a graph by considering the set of messages in the log as nodes of the graph and the correlations between messages as edges between nodes. In order to preprocess the SCM log, we performed the following two steps: (i) generating graph nodes: we extracted the messages and their attributes from the log and formed a graph node for each message, with no relations between nodes; and (ii) generating candidate correlations: we used the correlation condition discovery technique introduced in [25] to generate a set of candidate correlation conditions that could be used for examining the relationships between process events.
Partitioning. We use the candidate conditions from the preprocessing phase to partition the log. Identifying the interestingness of a certain way of partitioning the logs or grouping the process events is "subjective", i.e., it depends on the user's perspective and the particular querying goal. To cater for interestingness, we
enable the user to choose the candidate conditions she is interested in exploring as a basis for relationships among events.
Example 1. Adam, an analyst, is interested in partitioning the SCM log into sets of related messages having the same customer ID. To achieve this, he can construct a folder node ('custID') and apply the correlation condition ψ(m_x, m_y): m_x.custid = m_y.custid (where ψ(m_x, m_y) is a binary predicate defined over the attribute custid of two messages m_x and m_y in the log) on the folder. As a result, related messages are discovered and stored in the folder node 'custID'. Figure 3(a) in Section 6 illustrates how our tool enables users to choose the correlation condition(s) and generate FPSPARQL queries automatically. The FPSPARQL query for this example is:
In this query, the variable ?fn represent the folder node to be constructed, i.e. ’custID’. Variables ?m and ?n represent the messages in SCM log and ?m id and ?n id represent IDs of these messages respectively. Variables ?x and ?y represent the values for m.custid and n.custid attributes. The condition ?x =?y applied on the log to group messages having same value for custid attribute. The condition ?nid >?mid makes sure that only the correlation between each message and the following messages in the log are considered. The correlate statement will connect messages, for which the condition (?x =?y && ?nid > ?mid ) evaluates to true, with a directed labeled edge ?edge. The result will be stored in folder custID and can be used for further queries. Example 2. Consider two folder nodes ’custID’ and ’payID’ each representing correlated messages based on correlation conditions ψ(mx , my ) : mx .custid = my .custid and ψ(mx , my ) : mx .payid = my .payid respectively. It is possible to construct a new folder (e.g. ’custID payID’) on top of these two folders in order to group them. The ’custID payID’ folder contains events related to customer orders that have been paid. Queries applied on ’custID payID’ folder will be applied on all its subfolders. Example query is defined as follows. fconstruct custID_payID as ?fn
(custID,payID) where { ?fn @description ’set of ...’. }
In this example the variable ?fn represent the folder node to be constructed, i.e. custID payID. This folder node contains two folder nodes and has a user defined attribute description. These folder nodes are hierarchically organized by part-of (i.e. an implicit relationship) relationships. Mining. In this phase a query can be applied on previously constructed partitions to discover process models. As mentioned earlier, a folder node, as a result of a correlation condition, partitions a subset of the events in the log into instances of a process. The process model, which the instances inside a folder represent, can be discovered using one of the many existing algorithms for process mining [3,2], including our prior work [24]. It is possible that some folders contain a set of related process events (e.g., the set of orders for a given customer), but not process instances. It is possible
for the analyst to apply a regular-expression-based query on the events in a folder. The regular expression may define a relationship that is not captured by any candidate correlation condition. Applying such queries on a folder node may result in a set of paths, which can then be stored in a path node. The constructed path node can be examined by the analyst and may be considered as a set of process instances. Example queries are defined as follows.
Example 3. Adam is interested in discovering patterns in the 'custID' partition (see Example 1) between specific orders (i.e., messages with IDs '3958' and '4042') that contain a product confirmation. He can codify his knowledge into a regular expression that describes paths through sequences of messages in the partition. The path query can be applied on the partition and the result stored in a path node ('orderDiscovery'). As a result, one path (i.e., a process model) is discovered. Figure 3(b) in Section 6 illustrates the visualized result of this example generated by the front-end tool. The FPSPARQL query for this example is:
(custID) apply( pconstruct OrderDiscovery (?startNode, ?endNode, (?e ?n)* e ?msg e (?n ?e)*)
where { ?startNode @isA entityNode. ?startNode @type message. ?startNode @id '3958'.
        ?endNode @isA entityNode. ?endNode @type message. ?endNode @id '4042'.
        ?n @isA entityNode. ?e @isA edge.
        ?msg @isA entityNode. ?msg @type message. ?msg @operation 'OrderFulfil'. } )
In this example, a path construction query (pconstruct) is applied on the folder 'custID'. Variables ?startNode and ?endNode denote the messages m_id=3958 and m_id=4042, respectively. Variables ?e and ?n denote any edge or node in the transitive relationship between m_id=3958 and m_id=4042. Finally, ?msg denotes a message having the value 'confirmProduction' for the attribute operation. In the regular expression, parentheses are used to define the scope and precedence of the operators, and the asterisk indicates zero or more of the preceding element.
Example 4. Adam is interested in discovering frequent patterns between correlated messages in the 'custID' partition (see Example 1), to analyze the process of producing a product. He can codify his knowledge into a regular expression that describes paths matching the pattern "start with a message having the value produce for the attribute operation, followed by a message having the value 'confirmProduction' for the attribute operation, and end with a message having the value pay for the attribute operation". The path query can be applied on the partition and the result stored in a path node ('productDiscovery'). As the result set of this query, 12 paths are discovered. Unlike Example 3, these paths have different starting and ending nodes. The FPSPARQL query for this example is:
(custID) apply( pconstruct ProductDiscovery (?startNode, ?endNode, e ?msg e)
where { ?startNode @isA entityNode. ?startNode @type message. ?startNode @operation 'Produce'.
        ?endNode @isA entityNode. ?endNode @type message. ?endNode @operation 'Pay'.
        ?e @isA edge.
        ?msg @isA entityNode. ?msg @type message. ?msg @operation 'ConfirmProduction'. })
In this example, a pconstruct query is applied on the folder 'custID', in which ?startNode and ?endNode denote the sets of starting and ending nodes. Variable ?e denotes any edge in the regular expression pattern, and ?msg denotes a message having the value 'ConfirmProduction' for the attribute operation. A frequent sequence mining algorithm (developed based on a process mining method) is used to generate frequent pattern(s) based on the specified regular expression, i.e., (?startNode → e → ?msg → e → ?endNode).
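As a rough, illustrative analogue of these path queries (loosely mirroring Example 3), the sketch below assumes the correlated messages of the 'custID' partition have been loaded into a networkx directed graph whose nodes carry an 'operation' attribute; the node IDs and operation value come from the example, and this is not how FPSPARQL evaluates regular path expressions.

```python
import networkx as nx

def discover_paths(graph, start_id, end_id, required_operation):
    """Paths from start_id to end_id that pass through a node with the given operation,
    loosely mirroring the pattern (?e ?n)* e ?msg e (?n ?e)*."""
    matches = []
    for path in nx.all_simple_paths(graph, source=start_id, target=end_id):
        if any(graph.nodes[n].get("operation") == required_operation for n in path[1:-1]):
            matches.append(path)
    return matches

# Hypothetical usage on a graph g built from the 'custID' partition:
# paths = discover_paths(g, "3958", "4042", "OrderFulfil")
```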
Example 5. Consider the path node 'orderDiscovery' constructed in Example 3. We are interested in finding the messages in this path node that have the keyword Retailer in their binding attribute. The FPSPARQL query for this example is:
(OrderDiscovery) apply ( select ?m_id
where { ?m @isA entityNode. ?m @type message. ?m @binding ?b.
        FILTER regex(?b, "Retailer"). })
In this example, ?m_id denotes the IDs of the messages that fall inside the 'orderDiscovery' path node. The query "retrieve the messages that have the keyword Retailer in their binding attributes" is applied on this path node.
6 Implementation and Experiments
Implementation. The query engine is implemented in Java (J2EE) and uses a relational database system (we utilized IBM DB2 as the back-end database for the experiments). As the FPSPARQL core, we implemented a SPARQL-to-SQL translation algorithm based on the proposed relational algebra for SPARQL [15] and on semantics-preserving SPARQL-to-SQL query translation [14]. This algorithm supports aggregate and keyword-search queries. We implemented the proposed techniques on top of this SPARQL engine. We also applied four optimization techniques proposed in [13,30,14] to increase the performance of the query engine. Implementation details, including a graphical representation of the query engine, can be found in [12]. A front-end tool was prepared to assist users in four steps:
Step 1: Preprocessing. We have developed a workload-independent algorithm for: (i) processing and loading a log file into an RDBMS for manipulating and querying entities, folders, and paths; and (ii) generating powerful indexing mechanisms (see [12] for details). We also provide inverted indexes [33] on the folder store in order to increase the performance of queries applied on folders (a small sketch of this indexing idea follows the list of steps).
Step 2: Partitioning. We provide users with a list of interesting correlation conditions based on the algorithm for discovering interesting conditions in [25]. Users may choose these correlation conditions to partition a log. In order to generate folder construction queries, we provide users with an interface (the FPSPARQL GUI) to choose the correlation condition(s) and generate FPSPARQL queries automatically (see Figure 3(a)).
Step 3: Mining. We provide users with templates to generate regular expressions and use them in path queries. We have developed an interface to support applying existing process mining algorithms on folder nodes. We developed a regular expression processor which supports optional elements (?), loops (+, *), alternation (|), and grouping ((...)) [11]. We also provide the ability to call existing process mining algorithms from path node queries (see [12] for details).
Step 4: Visualizing. We provide users with a graph visualization tool for the exploration of results (see Figure 3(b), i.e., the discovered process model for the query result in Example 3). Users are able to view folders, paths, and the results of queries in list and visualized formats. This way, event relationships and candidate process instances can be examined by the analyst.
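As an illustration of the inverted-index idea mentioned in Step 1 (a sketch under assumed data structures, not the engine's actual implementation), such an index can map each (attribute, value) pair to the set of entity ids stored in a folder.

```python
from collections import defaultdict

def build_inverted_index(folder_entities):
    """folder_entities: iterable of (entity_id, attributes-dict) pairs belonging to one folder."""
    index = defaultdict(set)
    for entity_id, attributes in folder_entities:
        for name, value in attributes.items():
            index[(name, value)].add(entity_id)
    return index

# Hypothetical usage: look up all messages in a folder with custid 'c7'.
# index = build_inverted_index(folder)
# hits = index[("custid", "c7")]
```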
Fig. 3. Screenshots of FPSPARQL GUI: (a) The query generation interface in FPSPARQL, and (b) The discovered process model for the query result in example 3
Datasets. We carried out experiments on three datasets: (i) SCM: the dataset introduced in the case study in Section 2; (ii) Robostrike: the interaction log of a multi-player on-line game service. The log contains 40,000 messages, 32 service operations, and 19 attributes; and (iii) PurchaseNode: a process log produced by a workflow management system supporting a purchase order management service. The log contains 34,803 messages, 26 service operations, and 26 attributes. Details about these datasets can be found in [25].
Evaluation. We evaluated the performance and the quality of the query results using the SCM, Robostrike, and PurchaseNode process logs.
Performance. The performance of FPSPARQL queries was assessed using the query execution time metric. The preprocessing step took 3.8 minutes for the SCM log, 11.2 minutes for the Robostrike log, and 9.7 minutes for the PurchaseNode log. For the partitioning step, we constructed 10 folders for each process log (SCM, Robostrike, and PurchaseNode). The constructed folders, i.e., partitions, were selected according to the list of interesting correlation conditions provided by the tool. Figure 4(a, b, c) shows the average execution time for constructing the selected folders for each log. For the mining step, we applied path node construction queries on each constructed folder. These path queries were generated by domain experts who were familiar with the process models of the proposed process logs. For each folder we applied one path query. As a result, the set of paths for each query was discovered and stored in path nodes. Figure 4(d, e, f) shows the average execution time for applying the constructed path queries on the folders for each log. We ran these experiments for different sizes of process logs.
Quality. The quality of the results was assessed using the classical precision metric, defined as the percentage of discovered results that are actually interesting.
Fig. 4. The performance evaluation results of the approach on three datasets, illustrating: (i) the average execution time for partitioning: (a) SCM log, (b) Robostrike log, and (c) PurchaseNode log; and (ii) the average execution time for mining: (d) SCM log, (e) Robostrike log, and (f) PurchaseNode log.
For evaluating the interestingness of the results, we asked domain experts who have the most accurate knowledge about the dataset and the related process to: (i) codify their knowledge into regular expressions that describe paths through the nodes and edges in the folders; and (ii) analyze the discovered paths and identify what they consider relevant and interesting from a business perspective. The quality evaluation was applied on the SCM log. Five folders were constructed and three path queries were applied on each folder. As a result, 31 paths were discovered and examined by domain experts, and 29 of them (precision = 93%) were considered relevant.
Discussion. We evaluated our approach using different types of process event logs, i.e., PurchaseNode (a single-process log), SCM (a multi-service interaction log), and Robostrike (a log with the complex logic of a real-world service). The evaluation shows that the approach performs well (in [12] we also evaluated the performance of the FPSPARQL query engine compared to HyperGraphDB, one of the well-known graph databases, which shows the good performance of the FPSPARQL query engine). As illustrated in Figure 4, we divided each log into regular numbers of messages and ran the experiments for different sizes of process logs. The evaluation shows a polynomial (nearly linear) increase in the execution time of the queries with respect to the dataset size. Based on the lessons learned, we believe the quality of the discovered paths is highly related to the regular expressions generated to find patterns in the log, i.e., regular expressions generated by domain experts will guarantee the quality of the discovered patterns.
7 Related Work
We discuss related work in two categories: (i) business processes and (ii) graph query languages.
Business Processes. In recent years, querying techniques for BPs have received high interest in the research community. Some of the existing approaches for querying
BPs [7,9,17,18,31] focus on querying the definitions of BP models. They assume the existence of an enterprise repository of business process models. They provide business analysts with a visual interface to search for certain patterns and to analyze and reuse BPs that might have been developed by others. These query languages are based on graph matching techniques. BP-QL [9] and [17] are designed to query business processes expressed in BPEL. BPMN-Q [7,31] and VQL [18] are oriented toward querying generic process modeling concepts. Process mining [1,3] techniques and tools (e.g., ProM [2]) offer a wide range of algorithms for discovering a process model from a set of process instances. These techniques assume that the correlation among process events is known and the process instances are given (we also use our previous work on process discovery [24] in the framework introduced here). Our process querying approach enables querying and exploring the relationships and correlations among process events, and the various ways to form process instances, and is therefore complementary to process mining tools. Querying running instances of business processes represents another flavor of querying BPs [10,21,27]. These approaches can be used to monitor the status of running processes and trace the progress of execution. BP-MON [10] and PQL [21] can be used to discover many problems, such as detecting the occurrence of deadlock situations or recognizing unbalanced load on resources. Pistore et al. [27] proposed an approach, based on the BPEL specification, to monitor the composition and execution of Web services. These approaches assume that the business process models are predefined and available. Moreover, they presume that the execution of the business processes is achieved through a business process management system (e.g., BPEL). In our approach, understanding the processes in the enterprise and their execution, through exploring and querying event logs, is a major goal. The focus is also on scenarios, which are often the case in today's environments, where processes are implemented over IT systems and there is: (i) no up-to-date documentation of the process definition and its execution; and/or (ii) no data on the correlation rules that group process events into process instances. The proposed query language, FPSPARQL, provides an explorative approach that enables users to correlate entities and understand which process instances are interesting. Moreover, we support executing external process mining algorithms on the graph and its partitions (i.e., folders) to facilitate the analysis of BP execution.
Graph Query Languages. Recently, several research efforts have shown a growing interest in graph models and languages. A recent book [4] and survey [6] discuss a number of graph data models and query languages. Some of the existing approaches for querying and modeling graphs [16,6,20] focus on defining constraints on nodes and edges simultaneously on the entire object of interest. These approaches do not support querying nodes at higher levels of abstraction. Some query languages for graphs [16,20] focus on the uniform treatment of nodes and edges and support queries that return subgraphs. BiQL [16] supports a closure property on the result of its queries, meaning that the output of every query can be used for further querying. HyperGraphDB [20] supports
grouping related nodes in a higher-level node through special-purpose APIs. Compared to BiQL and HyperGraphDB, in our work folders and paths are first-class abstractions (graph nodes) that can be defined in a hierarchical manner and over which queries are supported. SPARQL [28] is a declarative query language, a W3C standard, for querying and extracting information from directed labeled RDF graphs. SPARQL supports queries consisting of triple patterns, conjunctions, disjunctions, and other optional patterns. However, there is no support for querying grouped entities, and paths are not first-class objects in SPARQL [28,19]. In FPSPARQL, we support folder and path nodes as first-class entities that can be defined at several levels of abstraction and queried. In addition, we provide an efficient implementation of a query engine that supports such queries.
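To make the comparison concrete, the following is a minimal sketch of our own (it is not taken from the paper and does not show FPSPARQL's extended syntax): it runs a plain SPARQL triple-pattern query over a small, hypothetical RDF event graph using the rdflib Python library, illustrating the baseline style of querying that FPSPARQL extends with folder and path nodes.

```python
# Illustrative only: a plain SPARQL triple-pattern query over a small,
# hypothetical RDF event graph, run with the rdflib library. The folder/path
# constructs of FPSPARQL go beyond this and are not shown here.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/log#")
g = Graph()

# Two hypothetical process events sharing a correlation attribute.
g.add((EX.e1, EX.eventType, Literal("InvoiceCreated")))
g.add((EX.e1, EX.orderId, Literal("42")))
g.add((EX.e2, EX.eventType, Literal("InvoiceApproved")))
g.add((EX.e2, EX.orderId, Literal("42")))

# Find pairs of events that share the same orderId value.
query = """
PREFIX ex: <http://example.org/log#>
SELECT ?a ?b WHERE {
    ?a ex:orderId ?v .
    ?b ex:orderId ?v .
    FILTER (?a != ?b)
}
"""
for row in g.query(query):
    print(row.a, row.b)
```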
8 Conclusion
In this paper, we presented a data model and query language for querying and analyzing business process execution data. The data model supports structured and unstructured entities, and introduces folder and path nodes as first-class abstractions. Folders allow breaking down an event log into smaller, clearer groups. Mining folder nodes may result in discovering process-centric relationships (e.g., process instances) and abstractions (e.g., process models), which can be stored in path nodes for further analysis. The query language, FPSPARQL, is defined as an extension of SPARQL to manipulate and query entities as well as folder and path nodes. We have developed an efficient and scalable implementation of FPSPARQL. To evaluate its viability and efficiency, we have conducted experiments over large event logs. As future work, we plan to design a visual query interface that supports users in expressing their queries over the conceptual representation of the event log graph in an easy way. Moreover, we plan to make use of interactive graph exploration and visualization techniques (e.g., storytelling systems [32]) which can help users quickly identify the interesting parts of event log graphs. We are also interested in the temporal aspects of process graph analysis, as in some cases (e.g., provenance [22]) the structure of the graph may change rapidly over time.
References 1. Aalst, W.M.P.V.D.: Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, Heidelberg (2011) 2. Aalst, W.M.P.V.D., Dongen, B.F.V., G¨ unther, C.W., Rozinat, A., Verbeek, E., Weijters, T.: Prom: The process mining toolkit. In: BPM (2009) 3. Aalst, W.M.P.V.D., Dongen, B.F.V., Herbst, J., Maruster, L., Schimm, G., Weijters, A.J.M.M.: Workflow mining: a survey of issues and approaches. Data Knowl. Eng. 47, 237–267 (2003) 4. Aggarwal, C.C., Wang, H.: Managing and Mining Graph Data. Springer Publishing Company, Incorporated, Heidelberg (2010)
5. Alkhateeb, F., Baget, J., Euzenat, J.: Extending sparql with regular expression patterns. J. Web Sem. 7(2), 57–73 (2009) 6. Angles, R., Gutierrez, C.: Survey of graph database models. ACM Comput. Surv. 40, 1:1–1:39 (2008) 7. Awad, A.: BPMN-Q: A Language to Query Business Processes. In: EMISA (2007) 8. Barros, A.P., Decker, G., Dumas, M., Weber, F.: Correlation patterns in serviceoriented architectures. In: Dwyer, M.B., Lopes, A. (eds.) FASE 2007. LNCS, vol. 4422, pp. 245–259. Springer, Heidelberg (2007) 9. Beeri, C., Eyal, A., Kamenkovich, S., Milo, T.: Querying business processes with BP-QL. Inf. Syst. 33(6) (2008) 10. Beeri, C., Eyal, A., Milo, T., Pilberg, A.: Monitoring Business Processes with Queries. In: VLDB (2007) 11. Begel, A., Khoo, Y.P., Zimmermann, T.: Codebook: discovering and exploiting relationships in software repositories. In: ICSE 2010, pp. 125–134 (2010) 12. Beheshti, S.M.R., Sakr, S., Benatallah, B., Motahari-Nezhad, H.R.: Extending SPARQL to support entity grouping and path queries. Technical report, unswcse-tr-1019, University of New South Wales (2010) 13. Chebotko, A., Lu, S., Fei, X., Fotouhi, F.: Rdfprov: A relational rdf store for querying and managing scientific workflow provenance. Data Knowl. Eng. 69(8), 836–865 (2010) 14. Chebotko, A., Lu, S., Fotouhi, F.: Semantics preserving sparql-to-sql translation. Data Knowl. Eng. 68(10), 973–1000 (2009) 15. Cyganiak, R.: A relational algebra for SPARQL. Hpl-2005-170, HP-Labs (2005) 16. Dries, A., Nijssen, S., De Raedt, L.: A query language for analyzing networks. In: CIKM 2009, pp. 485–494. ACM, NY (2009) 17. Eshuis, R., Grefen, P.: Structural Matching of BPEL. In: ECOWS (2007) 18. Francescomarino, C., Tonella, P.: Crosscutting Concern Documentation by Visual Query of Business Processes. In: BPM Workshops (2008) 19. Holl, D.A., Braun, U., Maclean, D., Muniswamy-reddy, K.: Choosing a data model and query language for provenance. In: IPAW 2008 (2008) 20. Iordanov, B.: HyperGraphDB: A generalized graph database. In: Shen, H.T., Pei, ¨ J., Ozsu, M.T., Zou, L., Lu, J., Ling, T.-W., Yu, G., Zhuang, Y., Shao, J. (eds.) WAIM 2010. LNCS, vol. 6185, pp. 25–36. Springer, Heidelberg (2010) 21. Momotko, M., Subieta, K.: Process Query Language: A Way to Make Workflow Processes More Flexible. In: Bencz´ ur, A.A., Demetrovics, J., Gottlob, G. (eds.) ADBIS 2004. LNCS, vol. 3255, pp. 306–321. Springer, Heidelberg (2004) 22. Moreau, L., Freire, J., Futrelle, J., Mcgrath, R.E., Myers, J., Paulson, P.: The open provenance model: An overview. In: Freire, J., Koop, D., Moreau, L. (eds.) IPAW 2008. LNCS, vol. 5272, pp. 323–326. Springer, Heidelberg (2008) 23. Motahari-Nezhad, H.R., Benatallah, B., Saint-Paul, R., Casati, F., Andritsos, P.: Process spaceship: discovering and exploring process views from event logs in data spaces. PVLDB 1(2), 1412–1415 (2008) 24. Motahari-Nezhad, H.R., Saint-Paul, R., Benatallah, B., Casati, F.: Deriving protocol models from imperfect service conversation logs. IEEE Trans. on Knowl. and Data Eng. 20, 1683–1698 (2008) 25. Motahari-Nezhad, H.R., Saint-Paul, R., Casati, F., Benatallah, B.: Event correlation for process discovery from web service interaction logs. VLDB J 20(3), 417–444 (2011) 26. Ooi, B.C., Yu, B., Li, G.: One table stores all: Enabling painless free-and-easy data publishing and sharing. In: CIDR 2007, pp. 142–153 (2007)
27. Pistore, M., Barbon, F., Bertoli, P., Shaparau, D., Traverso, P.: Planning and Monitoring Web Service Composition. In: Bussler, C.J., Fensel, D. (eds.) AIMSA 2004. LNCS (LNAI), vol. 3192, pp. 106–115. Springer, Heidelberg (2004) 28. Prud’hommeaux, E., Seaborne, A.: Sparql query language for rdf (working draft). Technical report, W3C (March 2007) 29. Rahm, E., Hai Do, H.: Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000) 30. Sakr, S., Al-Naymat, G.: Relational processing of rdf queries: a survey. SIGMOD Rec 38(4), 23–28 (2009) 31. Sakr, S., Awad, A.: A Framework for Querying Graph-Based Business Process Models. In: WWW (2010) 32. Satish, A., Jain, R., Gupta, A.: Tolkien: an event based storytelling system. In: Proc. VLDB Endow., vol. 2, pp. 1630–1633 (August 2009) 33. Zhang, J., Long, X., Suel, T.: Performance of compressed inverted list caching in search engines. In: WWW 2008, USA, pp. 387–396 (2008)
Discovering Characteristics of Stochastic Collections of Process Models
Kees van Hee1, Marcello La Rosa2,3, Zheng Liu1, and Natalia Sidorova1
1 Eindhoven University of Technology, The Netherlands {k.m.v.hee,z.liu3,n.sidorova}@tue.nl
2 Queensland University of Technology, Australia [email protected]
3 NICTA Queensland Lab, Australia
Abstract. Process models in organizational collections are typically created by the same team and using the same conventions. As such, these models share many characteristic features, like size range and the type and frequency of errors. In most cases, merely small samples of these collections are available, due to, e.g., the sensitive information they contain. Because of their size, these samples may not provide an accurate representation of the characteristics of the originating collection. This paper deals with the problem of constructing collections of process models from small samples of a collection, with the purpose of estimating the characteristics of this collection. Given a small sample of process models drawn from a real-life collection, we mine a set of generation parameters that we use to generate arbitrarily large collections featuring the same characteristics as the original collection. In this way we can estimate the characteristics of the original collection on the generated collections. We extensively evaluate the quality of our technique on various sample datasets drawn from both research and industry.
1 Introduction
Today there are millions of business process models around [10]. They are used for various reasons such as documentation of business procedures, performance analysis, and process execution. Organizations have their own process model collections to describe their business procedures. For example, Suncorp, one of the largest Australian insurers, maintains a repository of over 6,000 process models [9]. Also, consultancy firms and software companies like SAP provide collections of “reference models” to develop customer-specific process models. With the proliferation of process models comes a variety of tools to manipulate such models, e.g., process editors, simulation tools and conformance checkers. These tools, developed by both software vendors and academics, often contain sophisticated algorithms that use process models as input. In order to evaluate these algorithms, and in particular to determine their amortized performance, we use benchmarks. In this context, a benchmark is a fixed set of process models that is used to compare different algorithms. These benchmarks either come from practice or are manually constructed to have some extreme properties. For instance, benchmarks can be selected from various existing public process repositories. One of the first process repositories for research
purpose is Petriweb [4], and recently AProMoRe [10] was developed. Although benchmarking is a popular technique, there is a drawback: one can only compare algorithms with respect to the same set of process models and we have no idea how they behave on other models. This is one of the motivations for our research: we would like to be able to derive statistically sound statements over algorithms. In order to do so we have to define distributions over the set of all possible process models. We want such a distribution to reflect the modeling style of a company for which the performance estimation is made. In this paper, we deal with the problem of generating stochastic process model collections out of small samples drawn from an existing (real-life) collection. The generated collections should be arbitrarily large and accurately reproducing the characteristics of the existing collection. We choose workflow nets as the language to model processes, but our method is not restricted to this particular model. For model generation, we use construction (model refinement) rules, applied iteratively, starting with the most primitive workflow net possible, consisting of one place only. The probability distribution for the selection of the next construction rule is one of our generation parameters. Another generation parameter is the number of refinement steps, being a random variable, independent of the rules themselves. We use a Poisson distribution there, but any other discrete probability on the natural numbers will do as well. In this way we can make an arbitrarily large collection from a small dataset. To estimate the generation parameters from a small collection of models, we iteratively apply nine construction rules used in the generation process in the reverse direction—as reduction rules. Assuming that the sample was generated via these rules, we estimate the probability with which each rule was used. These parameters are then used to guide the generation of new process models from the sample. The generated collection can be used for benchmarking, and moreover, it can be used for determining characteristics of the original collection (or the collection the company will obtain if they continue to use their current modeling practices) with probabilistic accuracy. For example, we can determine the distribution of the longest or shortest path, or the deadlock probability. As a result, even a very small dataset, on which a direct accurate estimation of characteristic values is impossible, provides enough information for generating enough process models, allowing one to give precise characteristic values for the underlying stochastic collection. This sounds like magic but is very close to bootstrap estimation in classical statistics [13]. While a collection of, for example, five process models would give us just five numbers showing the length of the shortest path, making it difficult to estimate the distribution of the length of the shortest path over the whole collection, at the same time, these five models would provide enough information to estimate the generation parameters with a high precision (assuming that these five models are of a sufficiently large size). Then we can generate an arbitrarily large collection from the same distribution as the sample’s one. 
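To make the analogy with bootstrap estimation [13] concrete, the following minimal sketch (our own illustration, not code from the paper) resamples a hypothetical five-element sample of shortest-path lengths to approximate the sampling distribution of the mean. The parameter-mining approach plays an analogous role, except that it resamples whole process models from estimated generation parameters rather than resampling observed values directly.

```python
# A minimal, illustrative bootstrap sketch (not the paper's method): resample a
# small, hypothetical sample of shortest-path lengths with replacement and use
# the resamples to approximate the distribution of the sample mean.
import random

observed = [4, 7, 5, 9, 6]   # hypothetical LSP values from five models
means = []
for _ in range(10_000):
    resample = [random.choice(observed) for _ in observed]
    means.append(sum(resample) / len(resample))

means.sort()
low, high = means[249], means[9749]   # empirical 95% interval of the mean
print(f"bootstrap 95% interval for the mean LSP: [{low:.2f}, {high:.2f}]")
```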
To make sure the generation parameters and the characteristics of the generated collections do correspond to those of the originating process model collection, we tested our approach on very large samples generated both by ourselves and extracted from real-life datasets. In the first set of experiments we generated a large collection with some parameters and we took small samples from this collection in order to estimate
the generation parameters. Since we knew the real parameters, we could determine the quality of the estimations, and they turned out to be very good. We also did this for the collections coming from practice. Here we also took small samples from reasonably large collections and we compared the characteristics estimated using our approach with the characteristics of the original collection as a whole. Here our approach also performed quite well. The rest of the paper is organized as follows. Section 2 introduces some basic concepts. Section 3 describes how workflow nets can be generated using our rules. Section 4 shows how the generation parameters can be estimated. Sections 5 and 6 present our empirical results. Section 7 concludes the paper.
2 Preliminaries
Workflow nets are a subclass of Petri nets extensively applied to the formal specification and verification of business processes [1]. In addition, mappings exist between process modeling languages used in industry (e.g., UML ADs, EPC, BPMN, BPEL) and workflow nets. These mappings provide a basis for transferring the results outlined in this paper to concrete process modeling notations. In the following, we formally define Petri nets and workflow nets. For a set S, B(S) denotes the set of all bags over S, i.e., functions S → N, where N is the set of natural numbers. With [a³, b², c] we denote a bag with 3 occurrences of a, 2 occurrences of b, and 1 occurrence of c. A sequence σ of length l ∈ N over S is a function σ : {1, . . . , l} → S, which we denote as σ = ⟨σ(1), σ(2), . . . , σ(l)⟩. The set of all finite sequences over S is denoted as S*. Definition 1 (Petri net). A Petri net is a 3-tuple N = (P, T, F), where P and T are two disjoint sets of places and transitions respectively, and F ⊆ (P × T) ∪ (T × P) is a flow relation. Elements of P ∪ T are the nodes of N, elements of F are the arcs. Given a node n ∈ (P ∪ T), we define its preset •n = {n′ | (n′, n) ∈ F} and its postset n• = {n′ | (n, n′) ∈ F}. Definition 2 (Workflow net). A workflow net is a 5-tuple N = (P, T, F, i, f) where (P, T, F) is a Petri net, i ∈ P is the initial place, such that •i = ∅, f ∈ P is the final place, such that f• = ∅, and each node n ∈ P ∪ T is on a directed path from i to f. Soundness is an important property of workflow nets [1]. Intuitively, this property guarantees that a process always has an option to terminate and that there are no dead transitions, i.e., transitions that can never be executed.
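As a reading aid, here is a minimal sketch of our own showing how Definitions 1 and 2 can be encoded as a data structure with preset/postset helpers; the class and field names are our invention, not the paper's.

```python
# A minimal encoding of Definitions 1 and 2 (our own illustration). A net is
# stored as sets of places, transitions and arcs, and preset/postset are
# computed directly from the flow relation F.
from dataclasses import dataclass, field

@dataclass
class WorkflowNet:
    places: set = field(default_factory=set)
    transitions: set = field(default_factory=set)
    arcs: set = field(default_factory=set)   # pairs (x, y) of node names
    initial: str = "i"
    final: str = "f"

    def preset(self, node):
        return {x for (x, y) in self.arcs if y == node}

    def postset(self, node):
        return {y for (x, y) in self.arcs if x == node}

# The smallest refinement of a single place: initial place, one transition, final place.
net = WorkflowNet(places={"i", "f"}, transitions={"t"}, arcs={("i", "t"), ("t", "f")})
assert net.preset("i") == set() and net.postset("f") == set()
```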
3 Generation of Petri Nets
In order to generate workflow nets, we employ a stepwise refinement approach with a number of construction rules. We first define the construction rules, and then introduce the approach for generating workflow nets. Note that it has been proved in [7] that the construction rules enable us to generate any arbitrary Petri net.
Fig. 1. Construction rules
3.1 Construction Rules
We consider a set of nine construction rules provided in Fig. 1. Rules R1, . . . , R5 were studied by Berthelot in [2] and Murata in [12] as reduction rules that preserve liveness and boundedness properties of Petri nets. The rules are often called Murata rules (in fact, Murata considered one more rule, a loop addition with a (marked) place, similar to R3; we do not use this rule since it would destroy the soundness property). These rules are used in [5] to generate the so-called Jackson nets, which are a class of well-formed workflow nets. Besides the Murata rules, we also propose rules R6, . . . , R9 to generate Petri nets. Let N be the original net and R be one of the construction rules. If we apply R to N then we get a generated net N′. If we use R in the inverse direction, then we can get N again by reducing N′. Consequently, each rule R defines a binary relation ϕR on the set of nets. Let 𝒩 be the set of all Petri nets, ϕRi ⊆ 𝒩 × 𝒩. In our case we have (N, N′) ∈ ϕRi. In a similar way we define the construction rules as follows.
Definition 3 (Place refinement rule R1). Let N, N′ ∈ 𝒩. We say (N, N′) ∈ ϕR1 if and only if there exist places s, r ∈ P′, s ≠ r, and a transition t ∈ T′ such that: •t = {s}, t• = {r}, s• = {t}, •s = ∅, •s ⊆ •r. The net N satisfies: P = P′ \ {s}, T = T′ \ {t}, F = (F′ ∩ ((P × T) ∪ (T × P))) ∪ (•s × t•).
Definition 4 (Transition refinement rule R2). Let N, N′ ∈ 𝒩. We say (N, N′) ∈ ϕR2 if and only if there exist a place s ∈ P′ and transitions t, u ∈ T′, t ≠ u, such that: •s = {u}, s• = {t}, •t = {s}, t• = ∅, u• ⊆ t•. The net N satisfies: P = P′ \ {s}, T = T′ \ {u}, F = (F′ ∩ ((P × T) ∪ (T × P))) ∪ (•u × s•).
Definition 5 (Arc refinement rule R3). Let N, N′ ∈ 𝒩. We say (N, N′) ∈ ϕR3 if and only if there exist two nodes m, n ∈ P′ ∪ T′ such that: |•m| = 1, m• = {n}, |n•| = 1, •n = {m}, (•m × n•) ∩ F′ = ∅. The net N satisfies: P ∪ T = (P′ ∪ T′) \ {m, n}, F = (F′ ∩ ((P × T) ∪ (T × P))) ∪ (•m × n•).
Definition 6 (Place duplication rule R4). Let N, N′ ∈ 𝒩. We say (N, N′) ∈ ϕR4 if and only if there exist two places s, r ∈ P′, s ≠ r, such that: •s = •r, s• = r•. The net N satisfies: P = P′ \ {s}, T = T′, F = F′ ∩ ((P × T) ∪ (T × P)).
Definition 7 (Transition duplication rule R5). Let N, N′ ∈ 𝒩. We say (N, N′) ∈ ϕR5 if and only if there exist two transitions t, u ∈ T′, t ≠ u, such that: •t = •u, t• = u•. The net N satisfies: P = P′, T = T′ \ {u}, F = F′ ∩ ((P × T) ∪ (T × P)).
Definition 8 (Arc refinement rule R6). Let N, N′ ∈ 𝒩. We say (N, N′) ∈ ϕR6 if and only if there exist two nodes m, n ∈ P′ ∪ T′ such that: |•m| = 1, m• = {n}, |n•| = 1, •n = {m}, (•m × n•) ∩ F′ = ∅. The net N satisfies: P ∪ T = (P′ ∪ T′) \ {m, n}, F = (F′ ∩ ((P × T) ∪ (T × P))) ∪ (•m × n•).
Definition 9 (Place bridge rule R7). Let N, N′ ∈ 𝒩. We say (N, N′) ∈ ϕR7 if and only if there exist one place s ∈ P′ and two transitions u, t ∈ T′ such that: •s = {u}, s• = {t}. The net N satisfies: P = P′ \ {s}, T = T′, F = F′ ∩ ((P × T) ∪ (T × P)).
Definition 10 (Transition bridge rule R8). Let N, N′ ∈ 𝒩. We say (N, N′) ∈ ϕR8 if and only if there exist one transition t ∈ T′ and two places s, r ∈ P′ such that: •t = {s}, t• = {r}. The net N satisfies: P = P′, T = T′ \ {t}, F = F′ ∩ ((P × T) ∪ (T × P)).
Definition 11 (Arc bridge rule R9). Let N, N′ ∈ 𝒩. We say (N, N′) ∈ ϕR9 if and only if there exist two nodes s, r ∈ P′ ∪ T′ such that (s, r) ∈ F′. The net N satisfies: P = P′, T = T′, F = F′ \ {(s, r)}.
In Figure 1 we only listed examples of the rules. For instance, in Rule R7, we can also use a place to bridge transitions A and C from C to A, instead of from A to C as shown in Fig. 1. We observe that different rules can generate the same structure. It is clear that the place refinement rule and the transition refinement rule can both generate sequential structures. Another example is shown in Fig. 2. In order to generate N′, from N we can apply either the loop addition rule on the place s, or the transition duplication rule on the transition C. It has been proved in [5] that the rules R1 . . . R5 preserve the soundness property of workflow nets (with respect to a given marking). Therefore, all Jackson nets are sound. Moreover, in [8] the authors proved that Jackson nets are generalized sound, which is a more powerful property and important for refinements. It is easy to show by an example that the rules R7, R8, R9 destroy the soundness property. For instance, in Fig. 3, after applying a place bridge rule from transition B to transition A to a sound net N, a deadlock is introduced in the refined net N′ and N′ is not sound anymore.
Fig. 2. Transition duplication rule and loop addition rule result in the same structure
Fig. 3. Place bridge rule breaks soundness
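To illustrate how a single construction rule acts in the generation direction, the following sketch of our own implements the transition duplication rule R5 on a simple set-based net representation; the function and variable names are hypothetical and not taken from the paper.

```python
# Illustrative sketch (our own code) of one construction rule used in the
# generation direction: transition duplication (R5) adds a transition u that
# copies the preset and postset of an existing transition t.
def duplicate_transition(places, transitions, arcs, t, u):
    """Return a new net in which transition u duplicates t (so that •u = •t, u• = t•)."""
    assert t in transitions and u not in transitions
    preset = {x for (x, y) in arcs if y == t}
    postset = {y for (x, y) in arcs if x == t}
    new_arcs = arcs | {(p, u) for p in preset} | {(u, q) for q in postset}
    return places, transitions | {u}, new_arcs

# A tiny net i -> A -> f, then duplicate A as an alternative branch A2.
places = {"i", "f"}
transitions = {"A"}
arcs = {("i", "A"), ("A", "f")}
print(duplicate_transition(places, transitions, arcs, "A", "A2"))
```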
3.2 Generation of Workflow Nets
In order to generate a workflow net, we adopt a stepwise refinement approach. Given two workflow nets N and N′, N generates N′ if and only if N′ can be obtained from N by applying a construction rule zero or more times, without applying any rules to the initial and the final places of N. Let 𝒩 be the set of all workflow nets, ϕ* ⊆ 𝒩 × 𝒩. Let ℛ be the set of all construction rules. We define (N, N′) ∈ ϕ* ⇔ ∃R ∈ ℛ : (N, N′) ∈ ϕR ∨ ∃N″ ∈ 𝒩, ∃R ∈ ℛ : (N, N″) ∈ ϕR ∧ (N″, N′) ∈ ϕ*. Our approach for generating workflow nets has three determinants, or generation parameters: the construction rules, the probabilities of applying the rules, and the total number of times the construction rules are used per net, which is the stopping criterion and is called the size (of the net) in this paper. We assume a Poisson distribution for the size, and take a random number as the size of a net accordingly. Generation starts with one single place. Initially, the only applicable rule is the place refinement rule, which results in the generation of the workflow net consisting of the initial place and the final place connected by one transition. In every subsequent step, we select a rule from the construction rules whose conditions hold, i.e., the enabled rules. In order to select a rule from all the enabled rules to extend the net, we take a uniform distribution over all the enabled construction rules, i.e., all the enabled rules have the same probability of being chosen. The generation stops when the size has been reached. Thus, such a mechanism makes net generation a random process, and every particular net has a certain probability of being produced. A formal definition of the probability space for a stochastic collection of process models can be found in the technical report [6].
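The generation mechanism just described can be summarized in a short sketch. The code below is our own illustration under simplifying assumptions (the construction rules are passed in abstractly as enabledness-check/apply pairs, and numpy draws the Poisson-distributed size); it follows the uniform selection among enabled rules described above and is not the authors' ProM plug-in. The rule-probability parameter could equally be used to weight the choice instead.

```python
# Illustrative sketch of the generation mechanism of Section 3.2 (our own code):
# draw the net size from a Poisson distribution and, at each step, apply a
# uniformly chosen enabled construction rule.
import random
import numpy as np

def generate_net(rules, size_mean, rng=random):
    """rules: list of (is_enabled(net), apply(net) -> net) pairs, assumed given."""
    size = np.random.poisson(size_mean)          # stopping criterion (the net "size")
    net = {"places": {"p0"}, "transitions": set(), "arcs": set()}  # single place
    for _ in range(size):
        enabled = [(check, apply_) for (check, apply_) in rules if check(net)]
        if not enabled:
            break
        _, apply_rule = rng.choice(enabled)      # uniform choice among enabled rules
        net = apply_rule(net)
    return net
```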
4 Discovery of Generation Parameters
Workflow nets are generated by using a set of generation parameters. Once the generation parameters have been specified, we are able to generate an arbitrarily large set of
Fig. 4. An example of workflow net reduction
workflow nets (i.e., a “collection”) where each net has a certain probability of occurring. We consider a set of workflow nets that we encounter as a sample of its collection. Given a sample of workflow nets, we would like to determine the collection to which the sample belongs by estimating the generation parameters of the collection from the sample nets. We call such an estimation procedure generation parameter mining. To derive the generation parameters from a sample of nets, in our estimation procedure we employ a reduction process according to the construction rules R1, . . . , R9. During reduction, in each step we apply a construction rule (in the inverse direction) as a reduction rule to reduce a net until we reach the initial net, which is a single place (see Section 3). We require that after each reduction the reduced net remains a workflow net. During reduction we apply the rules in the following order:
Step 1. Keep applying the refinement rules (R1, R2) to reduce the net until it cannot be reduced any further by the refinement rules, then go to Step 2.
Step 2. Apply the duplication rules (R4, R5) to reduce the net. After one successful reduction, either by R4 or by R5, go back to Step 1. If neither rule can reduce the net, go to Step 3.
Step 3. Apply the loop addition rule (R3) to reduce the net. If the loop addition rule successfully reduces the net, then go back to Step 1. Otherwise go to Step 4.
Step 4. Apply the bridge rules (R7, R8, R9) to reduce the net. After one successful reduction by any bridge rule, go back to Step 1. If none of the bridge rules can reduce the net, terminate.
Figure 4 displays an example of such a reduction process. During reduction we record the number of times we use each rule. We can see which rules were used at least once and, for each applied rule, we compute the overall probability of its application. In this way we estimate the generation parameters.
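A compact sketch of this mining procedure, written by us and assuming the individual reduction rules are available as a function reduce_rule(net, name), could look as follows; it records rule usage counts and reduction steps and turns them into estimated probabilities and an estimated mean size.

```python
# Illustrative sketch of generation parameter mining (our own code). The concrete
# reduction rules are assumed to be given: reduce_rule(net, name) performs one
# reduction with the named rule and returns True, or returns False if the rule
# is not applicable.
from collections import Counter

def mine_parameters(sample_nets, reduce_rule):
    counts, sizes = Counter(), []
    priority = [["R1", "R2"], ["R4", "R5"], ["R3"], ["R7", "R8", "R9"]]  # Steps 1-4
    for net in sample_nets:
        steps = 0
        while True:
            applied = None
            for group in priority:                 # try rule groups in priority order
                for name in group:
                    if reduce_rule(net, name):
                        applied = name
                        break
                if applied:
                    break                          # go back to Step 1 after a reduction
            if applied is None:                    # no rule applicable: terminate
                break
            counts[applied] += 1
            steps += 1
        sizes.append(steps)
    total = sum(counts.values()) or 1
    probabilities = {name: c / total for name, c in counts.items()}
    size_mean = sum(sizes) / len(sizes)            # estimate of the Poisson mean
    return probabilities, size_mean
```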
Fig. 5. Procedure of estimating characteristics of an unknown collection
Estimation of characteristics. One of the reasons we consider collections of process models is that we want to determine characteristic properties of a collection based on its sample. Of course, this can be done on the sample directly, using available statistical methods. Typically, the larger the sample size, the more accurately the collection characteristics can be estimated. If the sample size is not big enough, the characteristic estimations may not accurately reflect those of the whole collection. One way to boost accuracy is to increase the sample size. In our case, for any unknown collection, we estimate the generation parameters from a sample using the approach introduced in the early part of this section. Once we obtain these parameters, we can generate arbitrarily large samples. Consequently, based on the law of large numbers and the central limit theorem, we can estimate the characteristics of the original collection with any precision. This approach is illustrated in Fig. 5. We tested our approach focusing on the following characteristics of a workflow net: 1) number of nodes, 2) average fan-in and fan-out, 3) length of the longest path, 4) length of the shortest path, 5) soundness probability, 6) deadlock probability, and 7) average number of strongly connected components. We compare the estimations with respect to their 1) mean, 2) standard deviation, and 3) confidence interval of the mean over the collection. The confidence interval of the mean is calculated as [Ȳ − t(α/2, N−1) · s/√N , Ȳ + t(α/2, N−1) · s/√N], where Ȳ is the sample mean, s is the sample standard deviation, N is the sample size, α is the desired significance level, and t(α/2, N−1) is the upper critical value of the t-distribution with N − 1 degrees of freedom.
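For illustration, the confidence interval above can be computed as in the following sketch (our own code, using SciPy's t-distribution; the input values are hypothetical).

```python
# Minimal sketch (our own code) of the t-based confidence interval of the mean,
# computed for a hypothetical list of per-net characteristic values.
import math
from scipy import stats

def mean_confidence_interval(values, alpha=0.05):
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # sample std. dev.
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # upper critical value t(alpha/2, N-1)
    half_width = t_crit * s / math.sqrt(n)
    return mean - half_width, mean + half_width

print(mean_confidence_interval([102, 95, 108, 104, 103]))  # e.g. hypothetical LLP values
```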
5 Evaluation of the Estimation Quality
We tested our approach by performing a series of experiments on collections of workflow nets generated as described in Section 3, using our plug-in to the ProM toolset [3]. To measure the quality of the generated collections, we considered various metrics. For the sake of space, we report the results on two sample metrics: the length of the longest path and the soundness probability. Quality of the estimations of generation parameters. We generated a collection of 100 workflow nets with a set of chosen generation parameters. We treat this collection as our original collection. Note that this original collection itself is a sample of the entire collection determined by the original generation parameters. We divided the original collection into 20 samples, 5 nets per sample. On each sample, we estimated the generation parameters and thus obtained 20 sets of the generation parameters. Note
that in practice we only have one sample, from which we mine the generation parameters. In this test we used 20 samples only for the purposes of testing. Fig. 6 lists the results of the original generation parameters (used to generate the original collection) and the estimated generation parameters (we consider the transition refinement rule and the place refinement rule together as they both generate sequential structures). According to Fig. 6, the original construction rules and their probabilities are 60% for both the transition refinement rule and the place refinement rule, and 20% for the transition duplication rule and the place duplication rule; the mean value of size, the stopping criterion of net generation, is 200. From each sample, we can always mine exactly the same construction rules as the original rules. On average, the probabilities of the construction rules estimated on all 20 samples are: 60% for both the transition refinement rule and the place refinement rule, 20% for the transition duplication rule, and 20% for the place duplication rule. Compared against the probabilities of the original construction rules, we obtained exactly the same probabilities as estimations. The average of the estimated mean value of size is 201, which is very close to the original value 200. Consequently, we succeeded in mining the generation parameters by using our generation parameter mining algorithm.

Fig. 6. Original generation parameters and their estimations on samples

                           | Probability of R1 and R2 | Probability of R5 | Probability of R4 | Size mean
original parameters        | 0.60   | 0.20   | 0.20   | 200
parameters' estimations 1  | 0.58   | 0.21   | 0.22   | 201
parameters' estimations 2  | 0.63   | 0.18   | 0.19   | 194
parameters' estimations 3  | 0.61   | 0.21   | 0.19   | 209
parameters' estimations 4  | 0.61   | 0.20   | 0.20   | 203
parameters' estimations 5  | 0.59   | 0.20   | 0.21   | 206
...
parameters' estimations 20 | 0.60   | 0.20   | 0.20   | 207
Avg.                       | 0.60   | 0.20   | 0.20   | 201
Std.Dev.                   | 0.0162 | 0.0168 | 0.0105 | 8.6601

Distribution of the length of the longest path. In this test we used the same collection of 100 workflow nets generated in Section 5, and treated it as the original collection. Using each set of the generation parameters' estimations (see Section 5), we generated a new collection of 100 workflow nets using our generation approach. Thus we generated 20 new collections. We used the generated collections to represent the original collection. In order to measure the quality of the generated collections, we considered the length of the longest path (LLP), which is a structural property of workflow nets, as our metric in this experiment. In a workflow net the longest path is a path between the initial place and the final place such that the total number of transitions on this path is maximized, discarding loops. The LLP is the number of transitions on the path. Figure 7 displays the histograms of the LLP in the original collection, one of the samples, and a collection generated with the generation parameters computed on this sample. The figure also gives the scatter plot, which shows that the correlation between the number of nodes and the LLP in the original collection is very weak. According to our experiment, it is difficult to find a strong estimator for the LLP, so we cannot compute the LLP based on another characteristic. We take the number of nodes as an example here. From the correlation plot we can observe that there is no strong
(a) Histogram of the LLP distribution in the original collection
(b) Scatter plot relating the number of nodes and LLP in the nets of the original collection
(c) Histogram of the LLP distribution in the generated collection
(d) Histogram of the LLP distribution in the sample
Fig. 7. Evaluation results for the LLP metric
correlation between the number of nodes and the LLP. Hence we cannot build a good model to compute the LLP from the number of nodes. In Fig. 7 we can observe that the histogram of the generated collection produces a distribution of LLP similar to the distribution that can be derived from the histogram of the original collection, while the histogram of the sample indicates a completely different distribution. We further computed the 80th and the 90th percentiles (which are the values below which 80%, resp., 90% of the observations are found) for the LLP in the original collection and in all the collections generated. The results are listed in Fig. 8. The 80th and the 90th percentiles of the original collection are 102.00, resp., 107.90. On average, the 80th and the 90th percentiles of all the generated collections are 101.69, resp., 108.28. The Shapiro-Wilk test (see e.g. [11]) confirmed that we can assume that both the sample of the twenty 80th percentiles and the sample of the twenty 90th percentiles of the generated collections come from a normally distributed population, and taking into account the estimations of the standard deviations of these two distributions (6.88 and 6.51 respectively), we conclude that in 95% of cases the approximation error on the generated collection will not exceed 13.72% for the 80th percentile and 13.02% for the 90th percentile.
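The percentile and normality computations referred to in this subsection can be reproduced with standard libraries, as in the following sketch of our own; the input values are hypothetical and only indicate the shape of the calculation.

```python
# Illustrative sketch (our own code) of the percentile and normality checks
# discussed above, applied to hypothetical LLP data.
import numpy as np
from scipy import stats

llp_values = [95, 101, 99, 104, 108, 102, 97, 100, 103, 106]  # hypothetical per-net LLPs
p80, p90 = np.percentile(llp_values, [80, 90])

percentiles_80 = [95.0, 103.8, 108.8, 104.8, 103.4, 100.8]    # hypothetical per-collection 80th percentiles
stat, p_value = stats.shapiro(percentiles_80)                  # Shapiro-Wilk normality test
if p_value > 0.05:
    # Under approximate normality, about 95% of values fall within two standard
    # deviations of the mean, which is how the error bound above is argued.
    bound = 2 * np.std(percentiles_80, ddof=1)
    print(f"80th percentile: {p80:.1f}, approx. 95% error bound: {bound:.2f}")
```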
Fig. 8. 80th and 90th percentiles of LLP in the original collection and in the generated collections

                        | 80% percentile | 90% percentile
original collection     | 102.00 | 107.90
generated collection 1  | 95.00  | 101.00
generated collection 2  | 103.80 | 112.00
generated collection 3  | 108.80 | 112.00
generated collection 4  | 104.80 | 111.90
generated collection 5  | 103.40 | 107.00
...
generated collection 20 | 100.80 | 108.00
Avg.                    | 101.69 | 108.28
Std.Dev.                | 6.88   | 6.51

Soundness probability. In this experiment we considered soundness. Using a set of original generation parameters (different from the ones used in Section 5), we generated a collection of 50 workflow nets. We included the place bridge rule in the original generation parameters to introduce unsoundness (recall that bridge rules break soundness, as illustrated in Section 3.1). Otherwise the nets would all be sound by construction. We regarded this collection as the original collection, and divided it into 10 samples, 5 nets per sample. From each sample we estimated the generation parameters, from which we generated a new collection of 50 workflow nets. In order to measure the quality of the collections generated, we tested the probability of soundness, which is a behavioral property of workflow nets. Figure 9 presents the results of the soundness probability of the original collection, all samples and all generated collections. Accordingly, in the original collection 46% of nets are sound. On average, the soundness probability of the samples is 46%, and the soundness probability of the generated collections is 45.8%. Again, the Shapiro-Wilk test confirmed that we can assume that the sample of the soundness probability estimated on the generated collections comes from a normally distributed population with a mean value of 45.8% and a standard deviation of 4.57%, implying that in 95% of cases we can expect that the estimation error of the soundness probability on a generated collection will be at most 9.14%. Note that the standard deviation of the samples is 18.97%, implying a 4 times greater error factor than in the generated collections. We repeated the same experiments along other metrics (i.e., average fan-in and fan-out, length of the shortest path, deadlock probability, and average number of strongly connected components), obtaining similar results. Hence we conclude that the generated model collections closely resemble the characteristics of the original collection, of which only a small, casually extracted sample was available.

Fig. 9. Soundness probability

                                  | Soundness Probability
original collection               | 46%
sample 1                          | 20%
generated collection 1            | 50%
sample 2                          | 60%
generated collection 2            | 44%
sample 3                          | 20%
generated collection 3            | 42%
...
sample 10                         | 60%
generated collection 10           | 46%
Avg. of samples                   | 46%
Std.Dev. of samples               | 18.97
Avg. of generated collections     | 45.8%
Std.Dev. of generated collections | 4.57
6 Evaluation with Industry Models
In this series of tests, we tested our approach with process models from practice. We randomly extracted 50 process models from a collection of 200 models of a
manufacturing company, which we acquired from the AProMoRe process model repository [10]. We assumed that these models could be generated by the generation mechanism in Section 3. We treated this collection as our original collection, and divided it into ten samples, five nets per sample. For each sample, we derived an estimation of the generation parameters and generated a collection of 200 nets based on these parameters. Thus we had ten generated collections, each having 200 nets. In order to test the quality of the generated collections, we considered various characteristics of workflow nets as metrics. Below, we report the results on the estimations of the generation parameters and two metrics: length of the shortest path (structural metric) and soundness probability (behavioral metric). Estimation of generation parameters. Table 1 lists the generation parameters estimated on each sample. Here we consider the transition refinement rule and the place refinement rule together as they both generate sequential structures. In this experiment, models were obtained from practice, so we did not know the original generation parameters (the generation parameters used to generate the original collection). In Table 1, the standard deviation of each generation parameter (bottom row in Table 1) is actually large. This is because models in different samples vary with respect to size (e.g. number of nodes) and structure. For instance, the means of size of sample 1 and sample 2 are 32 and 14, respectively, which means models in sample 1 are generally larger (e.g. in terms of number of nodes) than models in sample 2. There is no parallel structure in models in (e.g.) sample 5, as the place duplication rule was not used in generation (the probability of the place duplication rule is estimated as 0).

Table 1. Estimations of generation parameters from 10 samples of process models from practice

Sample   | Probability of R1 and R2 | Probability of R5 | Probability of R4 | Probability of R3 | Probability of R8 | Size mean
1        | 0.69 | 0.22 | 0.01 | 0.03 | 0.06 | 32
2        | 0.73 | 0.20 | 0.01 | 0.06 | 0    | 14
3        | 0.84 | 0.10 | 0.05 | 0.02 | 0    | 13
4        | 0.85 | 0.08 | 0.06 | 0.02 | 0    | 13
5        | 0.85 | 0.12 | 0    | 0.03 | 0    | 15
6        | 0.72 | 0.04 | 0.06 | 0.09 | 0.09 | 14
7        | 0.72 | 0.11 | 0    | 0.07 | 0.1  | 12
8        | 0.74 | 0.18 | 0    | 0.02 | 0.06 | 17
9        | 0.76 | 0.02 | 0    | 0.08 | 0.15 | 12
10       | 0.73 | 0.12 | 0.07 | 0.03 | 0.05 | 23
Avg.     | 0.76 | 0.12 | 0.03 | 0.05 | 0.05 | 17
Std.Dev. | 0.06 | 0.07 | 0.03 | 0.03 | 0.05 | 6.35
Fig. 10. Average LSP

                        | LSP
original collection     | 7.14
generated collection 1  | 7.34
generated collection 2  | 6.60
generated collection 3  | 6.31
...
generated collection 10 | 7.98
Avg.                    | 7.29
Std.Dev.                | 1.46
Distribution of the length of the shortest path. In this experiment we considered the length of the shortest path (LSP). In a workflow net the shortest path is a path between
the initial place and the final place such that the total number of transitions on this path is minimized. The LSP is the number of transitions on the path. Fig. 10 lists the results of the LSP in the original collection and in all the generated collections. According to Fig. 10, the average LSP in the original collection is 7.14. On average, the LSP in all the generated collections is 7.29. The Shapiro-Wilk test was performed to show that we can assume a normal distribution with a mean value of 7.29 and a standard deviation of 1.46 for the average LSP of the generated collections, implying that with a high probability the generated collections will approximate the original collection well with respect to the average LSP value. Let us zoom in on one sample to measure the quality of our approach. Figure 11 displays the mean and the 95% confidence interval for the mean of the LSP for the original collection, sample 1 and the generated collection 1. The mean of sample 1 is 6. It falls outside the 95% confidence interval for the mean of the original collection, which ranges from 6.42 to 7.86. This confidence interval means that we are 95% sure that the real mean of the original collection falls between 6.42 and 7.86. Thus it is clear that we can only poorly estimate this characteristic of the original collection from this sample. On the other hand, the mean of the generated collection 1 is 7.34. This value falls inside the 95% confidence interval for the mean of the original collection, and is also very close to the mean of the original collection, which is 7.14. The 95% confidence intervals for the mean of the original collection and that of the generated collection 1 almost fully overlap. Consequently, the generated collection 1 represents the original collection very well along the LSP.

Fig. 11. Mean and 95% confidence interval of the LSP for original collection, sample 1 and generated collection 1

Soundness probability. In this experiment we computed the probability of soundness for models in the 10 collections that we generated. Figure 12 lists the results of the soundness probability in the original collection and in all the generated collections. The results show that all nets in the original collection are sound, as the soundness probability of the original collection was 100% (this implies that all samples from this collection are 100% sound too). The soundness probabilities of the generated collections 1, 6, and 10 are 96%, 89%, and 85%, respectively (note that our net generation technique does not preserve soundness in general), and the rest of the generated collections are all 100% sound. On average, the soundness probability of all generated collections is 97%. The Shapiro-Wilk test showed that we cannot assume a normal distribution for the data in this case; however, due to the low value of the standard deviation in the distribution of soundness probability estimated on the generated collections, we conclude that the generated collections will allow us to estimate the soundness probability well enough. As shown in Table 1, we estimated the generation parameters (probability distribution for Murata's rules) on samples 2, 3, 4, 5. The generated collections 2, 3, 4, 5 consist of Jackson nets only, and these nets are (generalized) sound (see Section 3). To generate samples 7, 8, 9, both the transition bridge rule and Murata's rules without the place
duplication rule are needed. These rules allow generating state machine workflow nets only (i.e., workflow nets that do not allow for parallelism), as none of the rules can introduce a parallel structure. Because state machine workflow nets are always sound, the generated collections 7, 8, 9 are 100% sound. To reduce nets from samples 1, 6, 10, both transition bridge rules and Murata's rules are necessary. Since the soundness property can be destroyed by the bridge rules (see Section 3), soundness is not guaranteed for the generated collections 1, 6, 10. Remarkably, in spite of the source of unsoundness in the form of the bridge rules, the soundness probability on the generated collections 1, 6 and 10 is still very high. The results conducted with real-life process models confirm the results we obtained with the dataset that we generated artificially. From these results we can already state with some confidence that our generated model collections can closely reproduce the characteristics of a given collection, when only a small sample is available. Thus, the generated models can be used as benchmarks to test algorithms that operate on process models, and produce statistically valid results.

Fig. 12. Soundness probability of original collection (from practice) and generated collections

collection              | probability of soundness
original collection     | 1
generated collection 1  | 0.96
generated collection 2  | 1
generated collection 3  | 1
generated collection 4  | 1
generated collection 5  | 1
generated collection 6  | 0.89
generated collection 7  | 1
generated collection 8  | 1
generated collection 9  | 1
generated collection 10 | 0.85
Avg.                    | 0.97
Std.Dev.                | 0.05
7 Conclusion
We proposed an approach called generation parameter mining to discover a collection of process models based on a sample from that collection. We assume that nets can be generated by certain generation parameters by stepwise refinement. Such generation parameters consist of a set of defined construction rules, probabilities of applying the rules, and net size. Thus, we mine these generation parameters by reducing the sample nets using the construction rules in the inverse direction. Once we obtain the generation parameters, we can use them to determine the whole collection of the given sample. To the best of our knowledge, there is no similar research to our work in this field yet. There are a number of good reasons for discovering whole collections from samples. First, it is very difficult to estimate the characteristics of a collection accurately with a small sample size. Hence if we can identify the entire population, we are able to estimate the characteristics of that population at any level of accuracy by generating as many models as necessary from the population. Second, given that we can build arbitrarily large collections which faithfully reproduce the features of the original collection, our generated models can be used for benchmarking purposes. We extensively tested our approach using both process models generated by our software tool and process models from practice. The results indicate that the generation parameters of an original collection can be successfully estimated on a small sample of the collection. In order to test the quality of the generated collections, we considered
various process model characteristics. The results of these latter tests show that the generated collections accurately represent the original collection, whereas the original collection is poorly represented by the samples. There are many interesting questions to explore in the future. We will further test our approach on more model collections from practice, and we will consider other characteristics, e.g., deadlock probability, number of strongly connected components, and maximal number of concurrent transitions. As we can obtain the same net by reducing a net using different rules, this may lead to incorrect estimates of the generation parameters. This issue becomes more problematic if bridge rules are involved. For instance, bridge rules break soundness: if bridge rules and their probabilities are estimated wrongly, the soundness probability of the population will be incorrectly estimated. Therefore, we intend to improve our approach by refining the rule mining phase. Acknowledgments. This research is partly funded by the NICTA Queensland Lab.
References 1. van der Aalst, W.M.P.: The Application of Petri Nets to Workflow Management. The Journal of Circuits, Systems and Computers 8(1), 21–66 (2001) 2. Berthelot, G.: Transformations and Decompositions of Nets. In: Advances in Petri Nets, vol. 254, pp. 360–376. Springer, Heidelberg (1987) 3. van Dongen, B.F., Alves der Medeiros, A.K., Verbeek, H.M.W., Weijters, A.J.M.M., van der Aalst, W.M.P.: The ProM Framework: A New Era in Process Mining Tool Support. In: Ciardo, G., Darondeau, P. (eds.) ICATPN 2005. LNCS, vol. 3536, pp. 444–454. Springer, Heidelberg (2005) 4. Goud, R., van Hee, K.M., Post, R.D.J., van der Werf, J.M.E.M.: Petriweb: A Repository for Petri Nets. In: Donatelli, S., Thiagarajan, P.S. (eds.) ICATPN 2006. LNCS, vol. 4024, pp. 411–420. Springer, Heidelberg (2006) 5. van Hee, K.M., Hidders, J., Houben, G., Paredaens, J., Thiran, P.: On the Relationship between Workflow Models and Document Types. Information Systems 34(1), 178–208 (2008) 6. van Hee, K.M., La Rosa, M., Liu, Z., Sidorova, N.: Discovering Characteristics of Stochastic Collections of Process Models. BPM Center Report BPM-11-06, BPMcenter.org (2011) 7. van Hee, K.M., Liu, Z.: Generating Benchmarks by Random Stepwise Refinement of Petri Nets. In: Proc. of APNOC, pp. 30–44 (2010) 8. van Hee, K.M., Sidorova, N., Voorhoeve, M.: Generalised Soundness of Workflow Nets Is Decidable. In: Cortadella, J., Reisig, W. (eds.) ICATPN 2004. LNCS, vol. 3099, pp. 197–215. Springer, Heidelberg (2004) 9. La Rosa, M., Dumas, M., Uba, R., Dijkman, R.: Merging business process models. In: Proc. of CoopIS, pp. 96–113. Springer, Heidelberg (2010) 10. La Rosa, M., Reijers, H.A., van der Aalst, W.M.P., Dijkman, R.M., Mendling, J., Dumas, M., Garcia-Banuelos, L.: AProMoRe: An Advanced Process Model Repository. Expert Systems with Applications 38(6) (2011) 11. Montgomery, D.C., Runger, G.C.: Applied Statistics and Probability for Engineers, 5th edn. Wiley & Sons, Chichester (2011) 12. Suzuki, I., Murata, T.: A Method for Stepwise Refinement and Abstraction of Petri Nets. Journal of Computer and System Sciences 27(1), 51–76 (1983) 13. Wu, C.F.J.: Jackknife, Bootstrap and Other Resampling Methods in Regression Analysis. Annals of Statistics 14(4), 1261–1295 (1986)
Wiki-Based Maturing of Process Descriptions
Frank Dengler and Denny Vrandečić
Karlsruhe Institute of Technology (KIT), Institute AIFB, 76128 Karlsruhe, Germany {frank.dengler,denny.vrandecic}@kit.edu
Abstract. Traditional process elicitation methods are expensive and time consuming. Recently, a trend toward collaborative, user-centric, online business process modeling can be observed. Current social software approaches, satisfying such a collaborative modeling, mostly focus on the graphical development of processes and do not consider existing textual process description like HowTos or guidelines. We address this issue by combining graphical process modeling techniques with a wiki-based light-weight knowledge capturing approach and a background semantic knowledge base. Our approach enables the collaborative maturing of process descriptions with a graphical representation, formal semantic annotations, and natural language. Existing textual process descriptions can be translated into graphical descriptions and formal semantic annotations. Thus, the textual and graphical process descriptions are made explicit and can be further processed. As a result, we provide a holistic approach for collaborative process development that is designed to foster knowledge reuse and maturing within the system.
1 Introduction
Enterprises are trying to describe their business processes in order to better understand, share, and optimize them. But still, most of the process knowledge remains either in people's heads, or as textual and graphical descriptions in the Intranet as how-tos, guidelines, or methodology descriptions. The cost of an upfront, complete formalization of all business processes is prohibitive, and the benefits often seem elusive, especially under the stress of daily work. Traditionally, interviews of the domain experts performed by process modelers have been used to develop the process descriptions. An alternative to interviews is the group storytelling approach, which transforms stories told by individual process performers into process descriptions [23]. In this paper we present a process and a wiki-based platform that allows capturing process descriptions through several channels, with different speeds and goals. It supports the capturing of stories and natural language process descriptions, the rendering and editing of graphical representations, and formal models, which can be exported with a well-defined semantics and used for further processing and validation. The platform covers a wide array of externalization approaches, and provides a unifying system that can gradually interconnect the explicated knowledge, enabling the enterprise to follow a continuous knowledge
maturing effort [25,24] instead of requiring a steep and complete formalization step. We have implemented a fully functional prototype1 where users can start to model their processes by formulating a first idea with natural language or with an incomplete model. These models can be further refined and consolidated by others. Existing processes within the wiki platform either formatted as a textual how-to or expressed via predefined semantic properties can be reused by transforming them into a newly developed graphical representation. We have evaluated our approach by conducting functionality tests and a usability study to check if users can graphically render and edit processes in an intuitive manner. The paper is structured as follows. Section 2 describes the requirements for collaborative process development derived from previous literature and an analysis of existing processes in industry. As a result of our requirement analysis, we present our approach to support collaborative process development (Section 3) and its implementation in Section 4. After describing our holistic approach for collaborative proposal development, we elaborate the transformation of existing how-tos into graphical process representations in Section 5. We evaluate our approach by modeling and transforming existing processes in Section 6. Section 7 relates the elaborations of this paper to other research. Eventually, Section 8 concludes the paper by mapping the requirements on the features of the implementation, and gives an outlook on future research.
2 Requirement Analysis
In order to collect the requirements for enabling both novice users of our system and process modeling experts to gather process descriptions and cooperate, we analyzed previous literature as well as further requirements derived from our work with case study partners. We present and summarize these requirements in this section.
2.1 Requirements from Literature
In order to reuse, share, and collaboratively improve processes, they have to be externalized. Often, process mining techniques and tools [1] are used to capture and analyze activities performed by knowledge workers. But many tasks of knowledge workers are outside the grasp of mining algorithms: more than 14% of the work time of knowledge workers is spent in phone calls or meetings (as shown by a study of task switching and interruptions of knowledge workers [5]), which today is not sufficiently monitored to be a viable source for process mining. Knowledge required for process activities is often exchanged informally between knowledge workers, e.g. during a discussion at the water cooler or over lunch, and thus also out of the grasp of automatic approaches.
1 The prototype can be accessed via http://bpmexample.wikiing.de (User name: ProcessTester – Password: active!).
Complementary to mining the activities of the knowledge worker, process modeling tools can be used to explicitly capture the processes. Knowledge workers can model their own processes, or simply tell their own stories. The aggregated process descriptions and stories can be used to arrive at a high-quality description of the actually performed processes. Thus, collaboration support is required. People with different levels of expertise are describing processes. Therefore, it is equally important to provide means for the individual knowledge worker to externalize their process knowledge, as well as for process modeling experts to efficiently work on the abstraction level they are used to. The integration of stakeholders into the creation, adaptation, and revision of process models has been shown to be beneficial, especially for model accuracy and verification [15]. This requires the tool to bridge the gap between the modeling expertise levels of the knowledge workers and the modeling experts. Recker et al. [21] have investigated how novices model business processes in the absence of tool support. They have found that the design representation forms chosen to conceptualize business processes range from predominantly textual over hybrid to predominantly graphical types. The hybrid process descriptions, combining graphical and textual types, achieve a higher quality. A survey analyzing the used modeling constructs of the Business Process Modeling Notation2 (BPMN) shows that in most BPMN diagrams less than 20% of the BPMN vocabulary constructs are regularly used, and the most frequently occurring subset is the combination of tasks and sequence flows [19]. A simpler subset of the language can be chosen based on these results.
2.2 Requirements from Use Cases
For our analysis of processes used in industry we looked at twenty different processes from a large consultancy company. The processes are methodologies and reusable assets describing procedures to guide consultants, i.e. knowledge workers, in their daily work. The processes are defined by experts, but junior consultants with only little experience in process modeling have to work with them. The older process descriptions are stored in MS Word documents accessible from the company intranet. The newer ones are stored as Flash and HTML files on the intranet. A process can have subprocesses, which are stored in separate files and interlinked. Each process document contains a short description, inputs and outputs, a flow diagram, and extensive textual descriptions of each process step. The process flow is expressed by a process picture created with graphics software (i.e., the semantics of the picture is not formally captured and thus not accessible to the machine for further processing). The textual descriptions are composed of detailed action instructions, links to other resources, and the roles responsible for each step. All (sub)processes contain fewer than 10 steps. Only a few modeling constructs are used within the flow models, namely activities, sequence flows, conditions, and pools.
http://www.bpmn.org/
Instead of using BPMN to capture the more intricate details of the processes, the expressivity of the flow model is complemented by textual descriptions, e.g., exception handling is described in detail within the action instructions. Due to this setup, process search is limited to a simple textual search over the descriptions, as the process knowledge is not machine-processable and cannot be exploited for the search. Browsing is also confined to the few links provided explicitly by the authors of the process descriptions. To improve search and navigation, support for structured process documentation is required.
2.3
Summary of Requirements
Based on the previous two sections, we derive the following requirements for collaborative process development:

Natural Language Support for Novice Users. Novices in process modeling need to be able to use natural language instead of formal modeling constructs, so that they can create and extend the processes without the assistance of an expert. If the user does not know the graphical representation of a process element, natural language can be used to describe it.

Intuitive Graphical Rendering and Editing of Processes. A rich user interface can provide the user with means for interacting with processes in a highly intuitive manner. But with too many elements, the usability of the tool may suffer. This leads to a trade-off between the expressivity offered to develop the formal process models and the simple accessibility of the tool.

Collaboration Support. Users must be able to discuss process models asynchronously. Changes of the process model have to be tracked and users should be enabled to access the version history and to revert to previous versions. In addition, design rationales should be documented.

Structured Process Documentation Support. The process models must be stored in a machine-processable documentation format, including additional properties linking to external resources. Users must be able to interlink between process descriptions and external resources to enable more sophisticated retrieval, browsing, and navigation.
3
Approach
To address these requirements for collaborative process development, we have combined wiki-based light-weight knowledge capturing with graphical process modeling functionality and a semantic knowledge base. Hence, users can develop process descriptions by using graphical representations, natural language, and formal semantic annotations. Our approach enables domain experts without modeling expertise to actively participate in the process modeling task. It provides users with means to intuitively model processes graphically with basic (but widely used) process elements, namely task, sequence flow, parallel gateway, and data-based exclusive gateway. This makes it possible to integrate stakeholders into the creation, adaptation, and revision of the process descriptions.
A single person typically does not have a complete overview of a process. The required organizational and semantic integration is also addressed by our approach: all users can contribute to process modeling, and the definition of terms is done in a multitude of small steps, giving each contributor the ability to adapt easily to a common usage of terms [4]. Standard wiki features, like the discussion functionality, also support the definition of terms in a collaborative manner. Thus, users can discuss and evolve the meaning of terms. By using the watch list functionality, users can be notified about changes in the process descriptions or new contributions to the discussion. The process models are stored in a machine-processable format. These are created on the one hand by automatically translating the graphical process representation into formal semantic annotations, and on the other hand by using the annotations from the semantic wiki. During the translation of the graphical representation a wiki page is created for each process element with corresponding semantic properties. A predefined property is introduced for the sequence flow. In addition, the user can enter annotated links to external resources that are required for a given step or task (e.g., assigned roles and actors). The formal process descriptions are the combined results from the graphics and the annotated text. Using the formal descriptions, a semantic search system is offered that can improve the navigation and retrieval of process activities. In addition, semantic queries can be used to detect errors or constraint violations in process model descriptions, e.g. a search for the executors of all process activities using a specific confidential document. An application of our approach can be a collaborative proposal development scenario. Since many people with different expertise and roles are involved, everybody can contribute to acquiring the proposal development processes. Team members can start modeling from scratch or by translating existing textual process descriptions. Depending on their expertise and the available time, proposal team members can thereby alter the process to take advantage of it (e.g., contributing only to parts of the proposal document for which they have the necessary knowledge). Additionally, with our approach previous proposal documents can be linked to specific process activities using semantic properties. Thus, relevant proposals can be filtered by querying for related, previously created proposals. Also less experienced proposal team members can profit from the process wiki, because they can look up and follow the developed processes. The formalized processes can also be used as a basis for the input to a process execution engine, e.g., by accessing the RDF export interface of our approach from the process execution tool.
4
Implementation
We extended our implementation presented in [8] and integrated a graphical process editor with Semantic MediaWiki (SMW). SMW extends the MediaWiki [2] software that runs, e.g., the popular Wikipedia site. SMW combines Semantic Web technology with the collaborative aspects of wikis [17] to enable large-scale and inter-departmental collaboration on knowledge structures. Knowledge can be expressed by using natural language in combination with formal annotations
allowing machines to process this knowledge. An extension to the MediaWiki syntax is provided by SMW to define class and property hierarchies, and semantic properties related to wiki pages. For instance, a category Process can be added to a wiki page by adding [[Category:Process]] on the wiki page. For a property has Successor, expressing a successor relation between two wiki pages, the following syntax is used: [[has Successor::Name of the successor page]]. To access the formalized knowledge within wiki pages, SMW offers an inline query language (ASK syntax). The syntax for a query asking for all instances belonging to the category Process and their property Short Description is {{#ask: [[Category:Process]] |?Short Description}}. If no specific output format is stated, the query result is displayed by default as a table on the corresponding wiki page. To make the formalized knowledge also available for other applications, SMW provides several export functionalities, e.g. in RDF [3].
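To make the annotation format tangible, the following minimal sketch extracts category and property annotations from a page's wiki text; the sample page content is hypothetical, and only the [[Category:...]] and [[Property::value]] syntax described above is taken from the text.

```python
import re

# Minimal sketch (not part of SMW itself): extract the category and property
# annotations described above from a page's wiki text. The sample text and
# page names are hypothetical.
SAMPLE = """Receive the ordered products and confirm delivery.
[[Category:Process]]
[[has Successor::Check quality]]
[[Short Description::Goods receipt step]]"""

CATEGORY_RE = re.compile(r"\[\[Category:([^\]]+)\]\]")
PROPERTY_RE = re.compile(r"\[\[([^:\]]+)::([^\]]+)\]\]")

def annotations(wikitext: str):
    categories = CATEGORY_RE.findall(wikitext)
    properties = PROPERTY_RE.findall(wikitext)  # list of (property, value) pairs
    return categories, properties

if __name__ == "__main__":
    cats, props = annotations(SAMPLE)
    print("Categories:", cats)   # ['Process']
    print("Properties:", props)  # [('has Successor', 'Check quality'), ('Short Description', 'Goods receipt step')]
```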
Fig. 1. SMW Process Editor Screen Shot
For our implementation we selected the Oryx Process Editor [6], an open source process editor, as the graphical process editor component. It currently supports various modeling languages such as BPMN, Event-driven Process Chain (EPC), Petri Nets, Workflow Nets, as well as Unified Modeling Language (UML), and can easily be extended to handle further process modeling languages. SMW was extended to be compatible with the Oryx graphical editor, so that data can be exchanged between both. In addition, the Oryx graphical editor was extended to display and edit wiki pages seamlessly from within its interface; as a consequence, users can directly access the corresponding wiki page within the process editor. The entered wiki text is rendered by using the parse method provided by SMW. Thus the whole SMW syntax can be used including categories and properties. SMW ASK queries are executed and the results are displayed as well. Both the originally entered text as well as the parsed wiki text are temporarily stored within the data model of the process editor as additional hidden properties.
Fig. 2. Example Process in SMW (process summary page with fact box)
As can be seen in Figure 1, the process editor interface consists of different regions. The corresponding wiki page is displayed at the bottom of the editor. For our approach we only use a small subset of BPMN constructs. The available process elements are presented in the left region of the editor, namely task, sequence flow, parallel gateway, and data-based exclusive gateway. Users can easily add process elements to the process by dragging a process element from the left region and dropping it on the process diagram in the middle. Once the process is saved (by clicking on the save button in the process editor), the process data and the wiki pages belonging to the process are created or updated in SMW. The process elements are saved as subpages of the process summary page within the wiki (see Figure 3 for an example). The process element wiki pages contain the textual descriptions and a fact box with all the stored properties. On the process summary page, the process diagram in SVG and a fact box are displayed (see Figure 2 for an example). We support most of the Basic Control-Flow Patterns introduced in [22]. Every single process step (activity) is represented as a wiki page belonging to the category Process Element and linked via the properties has Type to the corresponding type (task) and Belongs to Process to the corresponding process, represented as wiki pages themselves (process summary pages). An activity is the basic element of our process. Depending on the granularity level of the process this can vary from atomic activities, such as opening a web page, to activities describing a whole subprocess. To express the control flow of the process, we use edges in the diagram and special predefined process elements (gateways). If an element has a successor, we draw an edge from the activity to the successor activity in the diagram and store this with the additional property has Successor on the corresponding wiki page in SMW. For multiple successors executed in parallel (parallel-split pattern), a parallel gateway is inserted between the activities. An activity can have several successors, but only one has to be selected and executed (multi-choice pattern). Therefore we use the data-based exclusive gateway
Fig. 3. Example Task in SMW (element wiki page with fact box)
without conditions. The data-based exclusive gateway with conditions is used to split based on a condition (exclusive-choice pattern). A condition is stored as a many-valued property3. The distinction between the synchronization pattern and the simple-merge pattern is realized by using the parallel gateway and the data-based exclusive gateway the other way round, i.e., to merge different branches of a process. All properties of the process elements are also available in SMW. They are stored as SMW properties with their corresponding value. Thus all the process properties can be accessed within SMW, queried, and exported. The basic vocabulary used in the wiki is illustrated in Figure 4. For example, these properties can be displayed in a fact box on the corresponding wiki page as shown in Figure 2. Links to the corresponding wiki pages are automatically added to the SVG graphic, which enables the user to navigate through the process in the wiki. A new tab, edit with editor, has been added to the process wiki for editing an existing process. The tab automatically appears on pages belonging to the categories Process and Process Element. The tab contains the graphical process editor with the process model. Even though technically the process editor runs on a separate Tomcat server, the SMW authentication is also used to control access to the process model within the process editor, providing a seamless experience between the two different components.
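The encoding just described can be illustrated with a small sketch that turns a hypothetical process graph into the per-element annotations (has Type, Belongs to Process, has Successor) and inserts a gateway element for splits; the element names and the page layout are invented for the example and do not reproduce the prototype's actual output.

```python
# Sketch only: translate a tiny process graph into per-page SMW annotations,
# following the encoding described in the text (has Type, Belongs to Process,
# has Successor, gateways for splits). Element names are hypothetical.

PROCESS = "Procure products"

# task -> list of successor tasks; parallel=True marks a parallel split
GRAPH = {
    "Receive products": {"succ": ["Check quality"], "parallel": False},
    "Check quality":    {"succ": ["Forward", "Store locally"], "parallel": False},
    "Forward":          {"succ": [], "parallel": False},
    "Store locally":    {"succ": [], "parallel": False},
}

def element_page(name: str, element_type: str, successors: list[str]) -> str:
    lines = [f"[[has Type::{element_type}]]", f"[[Belongs to Process::{PROCESS}]]"]
    lines += [f"[[has Successor::{s}]]" for s in successors]
    return f"== {name} ==\n" + "\n".join(lines)

def translate(graph: dict) -> list[str]:
    pages = []
    gateway_no = 0
    for task, info in graph.items():
        succ = info["succ"]
        if len(succ) <= 1:
            pages.append(element_page(task, "task", succ))
        else:
            # multiple successors: route them through a gateway element
            gateway_no += 1
            gw_type = "parallel gateway" if info["parallel"] else "data-based exclusive gateway"
            gw_name = f"Gateway {gateway_no}"
            pages.append(element_page(task, "task", [gw_name]))
            pages.append(element_page(gw_name, gw_type, succ))
    return pages

if __name__ == "__main__":
    for page in translate(GRAPH):
        print(page, end="\n\n")
```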
5
Transformation of Existing Textual Processes
In traditional (semantic) wikis, processes are stored either as a textual description on a single wiki page or they can also be expressed by connecting various wiki pages.
Many-valued properties in SMW are implemented as records, see http://semantic-mediawiki.org/wiki/Type:Record
Fig. 4. Basic Wiki Process Vocabulary
The single page process descriptions are usually simple how-tos, structured as a numbered list. Examples of such how-tos can be found at WikiHow4, a wiki platform where users can collaboratively create, edit, and share how-tos. The interlinking of multiple wiki pages can also define a single process (by describing single tasks or subprocesses on their own pages, and then interlinking these). A way to specify such a process, using predefined semantic properties to express flow relations, is described in [8]. With our wiki-based approach, we support both ways of transforming wiki pages into graphically editable process descriptions – single and multi wiki page transformations. As shown in Figure 5, a Create Process tab is displayed on each wiki page that does not already belong to a graphical process description. By clicking on this tab, the graphical editor interface is displayed and the described process is transformed as described in the following subsections.
5.1
Single Wiki Page Transforming
Simple how-tos, like those on WikiHow, are usually formatted as numbered lists. Text within MediaWiki can be formatted as enumerations, or ordered lists. The MediaWiki syntax allows building ordered lists by using the hash character (#) at the beginning of subsequent lines5. For the transformation of such a single-page how-to, our extension parses the text for the hash character at the beginning of a line, similar to our previous approach in [26]. If a text line matches this pattern, we create a task element and display the text either as the name of the process step or, if it is longer than 100 characters, on the corresponding wiki page. Bullet lists and definition items included in the numbered list are directly displayed as part of the corresponding wiki page.
http://www.wikihow.com http://www.mediawiki.org/wiki/Help:Formatting
Fig. 5. Example How-to in SMW
A how-to can look like the numbered list shown in Figure 5. The graphical transformation of this textual process description is illustrated in Figure 1. Upon saving the page, the enumeration is replaced with the process summary page as shown in Figure 2.
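The parsing rule of this subsection can be sketched as follows; this is not the prototype's code, and the sample how-to text as well as the "Step n" fallback name for overly long lines are assumptions made for illustration.

```python
# Sketch of the single-page transformation rule described above: lines starting
# with '#' become task elements; text longer than 100 characters is shown on the
# element's wiki page instead of being used as the step name (the "Step n"
# fallback name is an assumption). Sample text is hypothetical.
SAMPLE_HOWTO = """How to hand in a travel reimbursement:
# Collect all receipts
# Fill in the reimbursement form and attach the receipts, making sure every single position is justified with a short explanation
# Hand the form to the secretary
* Hint: keep copies of all documents"""

def transform_howto(wikitext: str) -> list[dict]:
    tasks = []
    for line in wikitext.splitlines():
        if line.startswith("#"):
            text = line.lstrip("#").strip()
            if len(text) <= 100:
                tasks.append({"name": text, "page_text": ""})
            else:
                tasks.append({"name": f"Step {len(tasks) + 1}", "page_text": text})
        elif tasks:
            # bullet lists and definition items are kept on the wiki page of
            # the preceding element
            tasks[-1]["page_text"] += ("\n" + line if tasks[-1]["page_text"] else line)
    return tasks

if __name__ == "__main__":
    for task in transform_howto(SAMPLE_HOWTO):
        print(task["name"], "|", task["page_text"][:40])
```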
5.2
Multi Wiki Page Transforming
For transforming multiple wiki pages describing a process, where the pages are connected with flow relations, we use the predefined properties specified in [8] to express the process flow. A task element is created in the process diagram for each wiki page retrieved with the appropriate SMW query6. For each property has Successor an edge (sequence flow element) is created between the subject and object of the property. If a task has multiple successor properties, a parallel gateway element is included, from where the edges go to the multiple successor tasks. The data-based exclusive gateway is used for each has OrSuccessor property and its condition properties. The content of the wiki page defining the process is also included in the corresponding process steps, thus providing an overview. When the user saves the transformed graphical representation of the process description, the process in the new structure is stored in the wiki. A redirect link is inserted on each previous element of the process description to the corresponding new process step wiki page. With these redirects in place we ensure minimal negative impact from the transformation and allow all existing links to continue working.
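A sketch of the multi-page transformation, under the simplifying assumption that the has Successor and has OrSuccessor properties have already been read from the wiki into plain dictionaries; page names and data are illustrative.

```python
# Sketch: build diagram elements from pages connected via has Successor /
# has OrSuccessor properties, as described above. Input data is hypothetical.

PAGES = {
    # page -> flow properties read from its annotations
    "Receive products": {"has Successor": ["Check quality"], "has OrSuccessor": []},
    "Check quality":    {"has Successor": [], "has OrSuccessor": ["Forward", "Store locally"]},
    "Forward":          {"has Successor": [], "has OrSuccessor": []},
    "Store locally":    {"has Successor": [], "has OrSuccessor": []},
}

def transform(pages: dict) -> list[tuple]:
    """Return diagram elements as (kind, source, target) tuples."""
    elements = [("task", page, None) for page in pages]
    gateways = 0
    for page, props in pages.items():
        succ, or_succ = props["has Successor"], props["has OrSuccessor"]
        if len(succ) > 1:                       # parallel split
            gateways += 1
            gw = f"parallel gateway {gateways}"
            elements.append(("gateway", gw, None))
            elements.append(("sequence flow", page, gw))
            elements += [("sequence flow", gw, t) for t in succ]
        else:
            elements += [("sequence flow", page, t) for t in succ]
        if or_succ:                             # exclusive choice
            gateways += 1
            gw = f"exclusive gateway {gateways}"
            elements.append(("gateway", gw, None))
            elements.append(("sequence flow", page, gw))
            elements += [("sequence flow", gw, t) for t in or_succ]
    return elements

if __name__ == "__main__":
    for element in transform(PAGES):
        print(element)
```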
6
Evaluation
We evaluated our holistic approach for the collaborative maturing of process descriptions in three steps that build upon each other.
[[Belongs to Process::]].
1. We applied SMW with the collaborative process development functionality presented in [8] to industry use cases within the ACTIVE project7. This allowed for the creation of process descriptions using merely the textual, wiki-based syntax. No specialized or graphical user interface was available for editing the processes. After the trialists started to use the process visualization functionality (as query results, i.e. an uneditable graph displaying the process), they suggested that the process-editing functionality should be easier to use [9], e.g. similar to Microsoft Visio.
2. Therefore, our new approach presented in this paper supports the graphical editing of processes. The usability of the wiki-based graphical process editor was evaluated with ten students modeling textual process descriptions. We used four textual process descriptions specifying internal how-tos of the university institute AIFB and eight service process descriptions (GR01, GR02, GR03, GR04, GR05, IT01, IT02, and IT04) from the COCKPIT Project [16]. The processes were modeled by students with different levels of experience in using Semantic MediaWiki and in modeling processes. Only half of the students had ever used a process modeling tool. After a brief introduction of the basic functionality of our tool, each student was asked to model three assigned processes with it without any further hints. The results can be found in our evaluation wiki8. The students modeled the processes in different ways. While all students used tasks and sequence flows, only four students used the gateway elements. Conditions were sometimes expressed in the textual description, in the graphical representation, or in both representations. Additional semantic properties were introduced by half of the students. At the end, each student had to fill out a Computer System Usability Questionnaire (CSUQ) [18], rating 19 statements from 1 to 7 with respect to our tool, where 7 means strongly agree and 1 strongly disagree. In addition, the questionnaire was extended with questions about previous process modeling experience and free-text questions about the most positive and negative aspects. The results of the CSUQ can be found in Table 1. The overall assessment of the students about the usability of the tool was very positive. Only the quality of the error messages was rated negatively on average. As a consequence, we will improve them in future releases. The students criticized that all information entered as wiki text was lost when they clicked on another task without first closing the wiki text editor via the End Edit Mode button. In addition, they recommended that the system should provide duplication of pages within the modeler, in order to enable the faster modeling of similar processes. Some students needed more time in the beginning to get familiar with the wiki syntax. As positive aspects, most of the students explicitly mentioned the intuitive usability and stated that the tool was easy to handle.
http://www.active-project.eu The evaluation wiki can be accessed via http://oryx.f-dengler.de (Username: ProcessTester – Password: active!).
Table 1. Evaluation Results

Question                                                                      N   AVG   DEV
Overall, I am satisfied with how easy it is to use this system                10  5,30  1,35
It was simple to use this system                                              10  5,50  1,43
I can effectively complete my work using this system                          10  5,60  1,28
I am able to complete my work quickly using this system                        9  4,67  1,49
I am able to efficiently complete my work using this system                   10  5,50  0,81
I feel comfortable using this system                                          10  5,20  1,47
It was easy to learn to use this system                                       10  5,40  1,74
I believe I became productive quickly using this system                        9  5,56  1,34
The system gives error messages that clearly tell me how to fix problems       6  3,00  1,91
Whenever I make a mistake using the system, I recover easily and quickly      10  5,50  1,36
The information (such as online help, on-screen messages, and other
documentation) provided with this system is clear                              8  4,63  1,87
It is easy to find the information I needed                                     9  5,11  1,66
The information provided for the system is easy to understand                   9  5,44  1,95
The information is effective in helping me complete the tasks and scenarios     9  5,33  1,63
The organization of information on the system screens is clear                10  5,20  1,60
The interface of this system is pleasant                                      10  5,90  1,58
I like using the interface of this system                                     10  5,90  0,94
This system has all the functions and capabilities I expect it to have         6  5,50  1,38
Overall, I am satisfied with this system                                      10  5,40  1,02
The user interface was perceived as simple and clear and was positively mentioned for not being overloaded with unnecessary features. The results are favorable and show that the tool can be used intuitively without any previous experience in process modeling. The tool provided the functions needed to complete the tasks. The integration of natural language and graphical elements was widely used. A small number of students used additional features provided by the graphical interface, like coloring graphical elements. The use of such features was not expected by the evaluators, but shows that users can creatively extend the system to cover their needs.
3. To complement our approach, we also implemented support for the automatic transformation of textual process knowledge articulated in SMW into a graphical process representation displayed in the Oryx component (described in Section 5). In order to evaluate this step, we performed a functionality test by transforming twenty existing textual process descriptions into graphical process representations. We took ten how-to descriptions from WikiHow9 as examples for processes described on single wiki pages. For processes expressed via predefined semantic properties (i.e. the multi wiki page approach), we used existing processes developed within the ACTIVE project. The transformation of multiple wiki pages to a single process worked as expected. During the transformation of simple how-tos represented as single wiki pages, we detected multiple numbered lists on some how-to pages. As a result, we got two separate process sequences in the graphical representation.
http://www.wikihow.org
We have to check in a further evaluation whether we can automatically decide if such graphical representations represent separate processes or not. The complete results and input data can be found in our demo wiki10.
7
Related Work
The work presented in this paper is related to the following streams of research: (1) social software for process management, and (2) collaborative process modeling. There exist other approaches to manage processes with social software [10,20], but none of them supports process knowledge capturing with graphical representations, formal semantic annotations, and natural language. To support workflows, Dello et al. [7] have extended the Makna Semantic Wiki by integrating the workflow engine jBPM. It enables the coordination of interactions within a wiki system, but does not support the collaborative creation of the workflow. The MoKi enterprise modeling wiki [14], by contrast, allows the collaborative development of processes. However, the tool does not translate the collaboratively created graphical process flow descriptions into machine-readable formal semantics, but merely stores a graphical representation. Therefore, SMW queries concerning process semantics, such as "show me all process activities resulting in an approval activity", are not possible. A further extension of the MoKi enterprise modeling wiki, named BP-MoKi [12], supplements the graphical flow descriptions with formal semantics. However, existing textual process descriptions, previously stored within the wiki, cannot be reused and automatically transformed into graphical process descriptions. Also, the semantics of the processes and of the other entities in the wiki are kept in separate stores and thus cannot be connected natively. Furthermore, the graphical interfaces of both tools do not allow the user to read and edit the wiki pages within the graphical edit mode. This requires users to switch between graphical and textual mode. Semantic wikis can also be used for the management of process model relations. The knowledge about the model relations is captured in a semantic wiki and thus can be searched and reused across different projects and stakeholders [11]. Another approach to using groupware tools for process elicitation is described in [13]. The tool supports collaborative modeling, but does not provide the collaboration functionality of a wiki engine. In addition, the process models have no formal semantics and thus are not machine-processable. Other BPM tools and Web communities allow collaborative business process modeling, like Activiti11 and BPMN-Community12, both based on Oryx [6], or processWave13 based on Google Wave. There are also commercial tools, like IBM BPM Blueprint14 and ARISalign15.
The demo wiki can be accessed via http://bpmexample.wikiing.de (User name: ProcessTester – Password: active!). http://www.activiti.org/ http://www.bpmn-community.org/ http://www.processwave.org/ http://www.lombardisoftware.com/ http://www.arisalign.com/
However, these tools only focus on the collaborative development of the process model and require modeling expertise. Novice users cannot model unknown constructs with natural language. Only predefined process properties can be used to further formalize processes. If a property is not included in the process modeling language, it cannot be used. Since the flow structure is only stored in the process diagram and not as semantic descriptions, the search is rather limited compared to our approach.
8
Conclusion
We presented a multi-faceted approach towards the gradual elicitation of knowledge about processes, following the idea of knowledge maturing. We provide a wiki-based implementation of our approach, which combines natural language descriptions, graphical representations, and formal, machine-readable semantics. The mature wiki platform we build upon provides us with numerous basic functionalities which we carefully exploit. This includes user identity management, a complete history of the wiki pages, and a well-tested discussion facility. In addition, interlinking between process descriptions and external resources with annotations is supported. In combination with the provided query language, this enables more sophisticated retrieval, browsing, and navigation. By storing processes in a machine-processable format (Requirement: Structured Process Documentation Support), using a standard exchange format (RDF), our approach can easily be integrated into existing semi-automatic process acquisition approaches and enhance their functionality. The tool is available on the Web and can be tested, including the results of the functional evaluation. We evaluated our tool against the requirements that were extracted from current literature and the enterprise case studies that we examined. The functionality was evaluated against a number of real processes taken from case study partners. A usability evaluation has further shown that the tool is superior to alternative approaches for knowledge elicitation from novice users. Natural language or incomplete models can be used to formulate a first idea (Requirement: Natural Language Support for Novice Users), which can be further refined and consolidated by other novice users or experts (Requirement: Collaboration Support). The rich user interface provides users with means to intuitively model processes graphically with basic (but widely used) process elements, namely tasks, sequence flows, parallel and exclusive split gateways. We only used this small subset of process elements because, with too many elements, the usability of the tool may suffer (Requirement: Intuitive Graphical Rendering and Editing of Processes). In summary, we can reasonably expect that the tool implements our process knowledge maturing approach and that this will enable a more holistic approach towards the effective, and efficient, elicitation of reusable process descriptions in the enterprise. For future work, we will further validate our approach by involving various users collaboratively constructing processes. We will also investigate the appropriate expressivity of the process language for different modeling skills. But
due to the open, wiki-based approach, users can already today extend the system with their own elements, creating further expressivity which is exported in RDF, and can then be further processed. Our user evaluation has shown that a small number of power users are able to creatively extend the system to cover their needs, and thus provide a path for the wider usage base to most efficiently explicate the collective process knowledge of a given community. Acknowledgements. This work was supported by the European Commission through the IST project ACTIVE (ICT-FP7-215040) and by the German Federal Ministry of Education and Research (InterLogGrid, Grant BMBF 01IG09010F).
References
1. van der Aalst, W.M.P., Weijters, A.J.M.M.: Process mining: a research agenda. Computers in Industry 53(3), 231–244 (2004)
2. Barrett, D.J.: MediaWiki, 1st edn. O’Reilly Media, Inc., Sebastopol (2008)
3. Beckett, D.: RDF/XML syntax specification (Revised). Tech. rep., W3C (2004), http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/
4. Bruno, G., Dengler, F., Jennings, B., Khalaf, R., Nurcan, S., Prilla, M., Sarini, M., Schmidt, R., Silva, R.: Key challenges for enabling agile BPM with social software. J. of Software Maintenance and Evolution: Research and Practice 23(4), 297–326 (2011)
5. Czerwinski, M., Horvitz, E., Wilhite, S.: A diary study of task switching and interruptions. In: CHI 2004: Proc. of the SIGCHI Conf. on Human Factors in Computing Systems, pp. 175–182. ACM, New York (2004)
6. Decker, G., Overdick, H., Weske, M.: Oryx – An Open Modeling Platform for the BPM Community. In: Dumas, M., Reichert, M., Shan, M.-C. (eds.) BPM 2008. LNCS, vol. 5240, pp. 382–385. Springer, Heidelberg (2008)
7. Dello, K., Nixon, L.J.B., Tolksdorf, R.: Extending the Makna semantic wiki to support workflows. In: SemWiki, CEUR Workshop Proc., vol. 360. CEUR-WS.org (2008)
8. Dengler, F., Lamparter, S., Hefke, M., Abecker, A.: Collaborative Process Development using Semantic MediaWiki. In: Proc. of the 5th Conf. on Professional Knowledge Management, pp. 97–107 (2009)
9. Djordjevic, D., Ghani, R., Fullarton, D.: Process-centric enterprise workspace based on semantic wiki. In: Proc. of the Int. Conf. on Knowledge Management and Information Sharing (KMIS/IC3K) (2010)
10. Erol, S., Granitzer, M., Happ, S., Jantunen, S., Jennings, B., Johannesson, P., Koschmider, A., Nurcan, S., Rossi, D., Schmidt, R.: Combining BPM and social software: contradiction or chance? J. of Software Maintenance and Evolution: Research and Practice 22, 449–476 (2010)
11. Fellmann, M., Thomas, O., Dollmann, T.: Management of model relations using semantic wikis. In: Proc. of the 2010 43rd Hawaii Int. Conf. on System Sciences, HICSS 2010, pp. 1–10. IEEE Computer Society, Washington, DC, USA (2010)
12. Francescomarino, C.D., Ghidini, C., Rospocher, M., Serafini, L., Tonella, P.: A framework for the collaborative specification of semantically annotated business processes. J. of Software Maintenance and Evolution: Research and Practice 23(4), 261–295 (2011)
13. de Freitas, R.M., Borges, M.R., Santoro, F.M., Pino, J.A.: Groupware support for cooperative process elicitation. In: Favela, J., Decouchant, D. (eds.) CRIWG 2003. LNCS, vol. 2806, pp. 232–246. Springer, Heidelberg (2003)
14. Ghidini, C., Rospocher, M., Serafini, L.: MoKi: a wiki-based conceptual modeling tool. In: ISWC 2010 Posters & Demonstrations Track: Collected Abstracts, vol. 658, pp. 77–80 (2010)
15. den Hengst, M.: Collaborative modeling of processes: What facilitation support does a group need? In: Proc. of AMCIS 2005, Paper 15 (2005)
16. Koutras, C., Koussouris, S., Kokkinakos, P., Panopoulos, D.: Cockpit public services scenarios. Tech. Rep. D1.1, COCKPIT Project (June 2010)
17. Leuf, B., Cunningham, W.: The Wiki Way: Collaboration and Sharing on the Internet. Addison-Wesley, Reading (2001)
18. Lewis, J.R.: IBM computer usability satisfaction questionnaires: Psychometric evaluation and instructions for use. Int. Journal of Human-Computer Interaction 7, 57–78 (1995)
19. Muehlen, M.z., Recker, J.: How Much Language Is Enough? Theoretical and Practical Use of the Business Process Modeling Notation. In: Bellahsène, Z., Léonard, M. (eds.) CAiSE 2008. LNCS, vol. 5074, pp. 465–479. Springer, Heidelberg (2008)
20. Nurcan, S., Schmidt, R.: The Third Workshop on Business Process Management and Social Software (BPMS 2010) as part of BPM 2010 (2010)
21. Recker, J., Safrudin, N., Rosemann, M.: How Novices Model Business Processes. In: Hull, R., Mendling, J., Tai, S. (eds.) BPM 2010. LNCS, vol. 6336, pp. 29–44. Springer, Heidelberg (2010)
22. Russell, N., Hofstede, A.H.M.T., van der Aalst, W., Mulyar, N.: Workflow Control-Flow Patterns: A Revised View. Tech. Rep. BPM Center Report BPM-06-22, BPMcenter.org (2006)
23. Santoro, F., Borges, M., Pino, J.: Tell us your process: A group storytelling approach to cooperative process modeling. In: 12th Int. Conf. on Computer Supported Cooperative Work in Design, CSCWD 2008, pp. 29–34. IEEE, Los Alamitos (2008)
24. Schmidt, A., Hinkelmann, K., Ley, T., Lindstaedt, S., Maier, R., Riss, U.: Conceptual Foundations for a Service-oriented Knowledge and Learning Architecture: Supporting Content, Process and Ontology Maturing. In: Pellegrini, T., Auer, S., Tochtermann, K., Schaffert, S. (eds.) Networked Knowledge – Networked Media. SCI, vol. 221, pp. 79–94. Springer, Heidelberg (2009)
25. Schmidt, A., Hinkelmann, K., Lindstaedt, S., Ley, T., Maier, R., Riss, U.: Conceptual foundations for a service-oriented knowledge and learning architecture: Supporting content, process, and ontology maturing. In: 8th Int. Conf. on Knowledge Management (I-KNOW 2008) (2008)
26. Vrandečić, D., Gil, Y., Ratnakar, V.: Want world domination? Win at Risk!: matching to-do items with how-tos from the Web. In: Proc. of the 2011 Int. Conf. on Intelligent User Interfaces, Palo Alto, CA, USA, February 13–16, pp. 323–326 (2011)
A-Posteriori Detection of Sensor Infrastructure Errors in Correlated Sensor Data and Business Workflows
Andreas Wombacher
Database Group, University of Twente, The Netherlands
[email protected]
Abstract. Some physical objects are influenced by business workflows and are observed by sensors. Since both sensor infrastructures and business workflows must deal with imprecise information, the correlation of sensor data and business workflow data related to physical objects might be used a-posteriori to determine the source of the imprecision. In this paper, an information-theory-based approach is presented to distinguish sensor infrastructure errors from inhomogeneous business workflows. The approach can be applied to detecting imprecision in the sensor infrastructure, e.g., sensor errors or changes in the sensor infrastructure deployment.
1
Introduction
Many processes in everyday life describe the handling of physical objects. One example is logistics processes facilitating fleet management and package tracking [1]. Another example is processes influenced by the presence of physical objects, such as context-aware workflows supporting the execution of tasks in a physical environment [2]. Integrating the physical and the digital world requires digitizing the physical world, i.e., resolving the media break. Digitizing physical objects is done using sensors observing certain aspects of a physical object. For example, a scale measures the weight of a physical object without further identifying the object, while a Universal Product Code (UPC) reader can identify a physical object as an instance of a class of goods. However, it remains unclear whether the weight associated with a product code in an information system corresponds to the actual weight of the physical object. The combination of several sensors and the knowledge of their dependencies and capabilities makes it possible to establish a digital model of a physical object. A digital model of a physical object is always an abstraction of the physical object, which by definition focuses on specific aspects of the physical object and neglects other aspects. Further, the digital model of a physical object may contain imprecise sensor information, since sensors are potentially imprecise [3]. For example, a physical object may pass an RFID reader too fast for the reading to be completed, with the consequence that the moving object is not recorded.
Besides imprecision resulting from the usage of sensors, imprecision can also be observed in workflows dealing with physical objects. In [4] the authors argue that information in a workflow system can be imprecise. With regard to physical objects this means that the location or properties of a physical object in the physical world may deviate from the expected location and properties as contained in the digital model of the object in the workflow system. These discrepancies can be either exceptional, i.e., ad hoc changes in the physical world on request which are not reflected in the workflow system, or structural, i.e., the deviation/evolution of the implemented workflow in the physical world from the workflow in the workflow system [5]. As discussed above, sensor data and workflow systems describe properties and location of physical objects at specific points in time with potential imprecision. Thus, business decisions made based on imprecise data are potentially incorrect, with potentially severe consequences for the business. The idea proposed in this paper is to combine sensor data and workflow data to identify types of imprecisions as well as their origin. Based on the identified imprecisions, recommendations can be made on
– how to improve the digital models of physical objects,
– which sensors to upgrade to improve data quality,
– how to extend the sensor infrastructure to verify more properties of the digital model of physical objects,
– how to adjust the current workflow in the workflow system.
Based on these recommendations, imprecision can be reduced, which supports managers in making business decisions. In this paper, the focus is on correlating sensor data and workflow data to identify errors in the sensor infrastructure. The analysis is performed a-posteriori, i.e., after the workflow execution has been completed and all sensor and workflow execution data is available. The contribution of this paper is the representation of the correlation of sensor and workflow data as a coding problem in information theory. Further, sensor infrastructure error types and workflow execution errors are distinguished, and criteria for identifying these errors are presented based on the introduced information theory measures.
2
Scenario
As a running example the following procurement workflow is used (see Fig 2). The workflow is denoted as a finite state automaton, where states are represented as circles and activities, or labeled state transitions, are represented as arcs. The example workflow starts with centrally ordering a set of products (activity a0) electronically, which are later received in activity a1. Next, the quality of the received products is checked (activity a2). Depending on the requester of a product, products are either forwarded to another branch of the company (activity a3) or they remain at the receiving branch, where the products have to be stored in the local warehouse (activity a4).
This procurement workflow involves moving physical objects in activities a1 to a4. While physical objects in activities a1 to a3 move along the same path, the storage of physical objects in the warehouse (also called sorting) distributes the different physical objects potentially over the complete warehouse; thus, the objects move along different paths. The upper part of Fig 1 depicts three instances of the workflow in Fig 2. Activity a0 is not depicted for any instance since it does not have an effect on physical objects. Instance in1 contains six physical objects o1 to o6, which are forwarded to another branch, while instances in2 and in3 contain three and two physical objects respectively, all sorted into the local warehouse. In Fig 1 each instance is represented along a time line. Further, each activity is depicted as a rectangle, where the left and right hand side of the rectangle indicate the start and completion time of the activity respectively. For instance, activity a1 of instance in1 starts at time one and stops at time three. Therefore, all sensor readings observed in this time interval and related to physical objects o1 to o6 can be correlated to activity a1 of instance in1. The lower part of Fig 1 depicts six sensors and the observed physical objects per time step. For example, sensor x1 does not have any observation at time one, but observes objects o3, o4, o1, and o8 at time two. To illustrate the capabilities of the proposed approach, examples of imprecision in the sensor infrastructure and the workflow system have been included:
– an error in sensor x1 at the beginning of the processing: objects o1 and o2 of activity a1 of instance in1 are not observed due to a sensor malfunction;
– equal object identifiers (o1 and o2) are used by sensor x1 belonging to different instances and activities (for instances in1 and in2 and activities a1 and a2);
– equal object identifiers (o1 and o2) observed by sensor x2 belonging to different instances and activities (for instances in1 and in2 and activity a2);
– objects (o1, o2, and o7) of instance in2 are sorted in activity a4: the objects are observed by sensors x4, x5, and x6;
– object o9 in instance in3 is treated outside the workflow and therefore is not observed by any sensor.
3
Correlation of Sensor Data and Workflow Systems
Executing a workflow means executing activities, which may result in physical objects changing their location or some of their properties. These changes can potentially be detected by an available sensor infrastructure (see Fig 3). Therefore, sensor data and workflow systems can be correlated if there are sensors observing a physical object which is manipulated by executing a workflow activity. Observing many executions of the same workflow makes it possible to determine probabilities of expected sensor readings if an activity is executed. Thus, if an object is affected by a specific activity, it is expected to be sensed by a subset of sensors.
Fig. 1. Scenario (upper part: the three workflow instances in1–in3 with their activities, affected objects, and execution times; lower part: the objects observed by sensors x1–x6 at each time step)
Deviations from the expected sensor readings for a specific activity may occur because
– objects are moved differently than described in the workflow, or
– objects are not properly observed due to issues in the sensor infrastructure.
In this paper the focus is on detecting issues in the sensor infrastructure. Changes of physical objects not initiated by a workflow instance, like e.g. maintenance operations in the warehouse, are excluded.
3.1
Scope
The correlation of sensor data and workflow systems is based on physical objects. As a simplification, no dependencies between physical objects are considered in this paper. For example, sensing a pallet object will not imply sensing several product objects currently standing on this pallet, if the product objects are not observed themselves. Further, observed physical objects must be distinguishable, i.e., identifiable.
Fig. 2. Example Procurement Workflow (a0: order products, a1: receive products, a2: check quality, measures etc., a3: forward to other location, a4: store in local warehouse)
Fig. 3. Sensor Data and Workflow Correlation (a physical object may remain unchanged, change its location, or change a property; sensor data and the workflow both refer to the physical object)
However, the physical object IDs do not have to be universally unique (as, e.g., RFID tags are); several physical objects may have the same ID, like e.g. bar codes representing a universal product code. Please be aware that this paper is not addressing syntactic and semantic integration problems [6], although we acknowledge the problem. Further, although sensor data fusion is an important topic, the focus in this paper is on relating sensor data and workflow systems. Therefore, the difficulties of sensor data fusion are not addressed. To further explain the approach, a workflow model and a sensor data model are introduced next. A detailed formal definition of the various probabilities is available as a technical report [7], but is not included in this paper due to a lack of space.
3.2
Workflow Model
The workflow schema depicted in Fig 2 is specified as a finite state automaton [8]. States are represented as circles and transitions are represented as labeled arrows, where the label represents the activity to be executed in this transition. States with thick lines are called final states, i.e., an execution is successful if it ends in a final state. Each execution starts in the initial state, represented by the state with the little arrow pointing at it. An execution of a workflow is called a workflow instance or process, which adheres to the workflow schema. A process execution is a finite sequence of activities connecting the initial state with a final state, also known as a trace. With regard to the example, there are only two traces for the workflow schema in Fig 2, i.e., ⟨a0, a1, a2, a3⟩ and ⟨a0, a1, a2, a4⟩. To correlate sensor data and processes, several pieces of information about the process state are required. In particular, ψ provides the trace of a workflow instance/process ink. Further, the start and completion times of an activity within a trace are denoted as τs(ai) and τc(ai) for activity ai in workflow instance/process ink, i.e., ai ∈ ψ(ink). The IDs of physical objects affected by an activity are denoted as θ(ai), providing a multiset1 of object IDs. With regard to the example, no objects are affected by activity a0 for any instance ink, thus θ(a0) = ∅. Therefore, activity a0 is also not depicted in the upper part of Fig 1. The trace of instance in1 is ψ(in1) = ⟨a0, a1, a2, a3⟩.
A multiset is a set supporting multiple instances of the same element.
For activity a1 of instance in1, the start time of the activity is τs(a1) = 1, the completion time is τc(a1) = 3, and the set of affected objects is θ(a1) = {o1, o2, o3, o4, o5, o6}.
3.3
Sensor Data Model
A sensor data stream is a discrete-time signal, which can be modeled as a sequence x of sensor readings typically encoded as numbers, where x[n] is the n-th element in the sequence x = {x[n]} with −∞ < n < ∞ [9]. In this paper, the observations of physical objects made by a sensor at time t result in x[t] being the multiset1 of observed object IDs. If no objects are observed, the set is empty. Let X be the set of all available sensor data streams. With regard to the example, there are six sensors available, thus X = {x1, x2, x3, x4, x5, x6}. The sensor stream x1 contains no observations at time t = 1, i.e., x1[1] = ∅, while at time t = 4 there are four observations, i.e., x1[4] = {o1, o2, o3, o1}.
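To make the two models concrete, the following sketch encodes them as plain data structures, using only values stated in the text (the trace, times, and affected objects of activity a1 in instance in1, and the two readings of sensor stream x1 mentioned above); it is an illustration, not the author's implementation.

```python
from collections import Counter  # Counter serves as a multiset of object IDs

# Workflow model: trace psi, activity start/completion times tau_s/tau_c,
# and affected objects theta for instance in1 (values as stated in the text).
psi = {"in1": ["a0", "a1", "a2", "a3"]}            # trace of instance in1
tau_s = {("in1", "a1"): 1}                          # start time of a1 in in1
tau_c = {("in1", "a1"): 3}                          # completion time of a1 in in1
theta = {("in1", "a0"): Counter(),                  # a0 affects no objects
         ("in1", "a1"): Counter(["o1", "o2", "o3", "o4", "o5", "o6"])}

# Sensor data model: x[t] is the multiset of object IDs observed at time t.
x1 = {1: Counter(), 4: Counter(["o1", "o2", "o3", "o1"])}

# Readings of a sensor that fall into an activity's execution interval can be
# correlated with that activity.
def readings_during(stream: dict, start: int, end: int) -> Counter:
    observed = Counter()
    for t, objs in stream.items():
        if start <= t <= end:
            observed += objs
    return observed

if __name__ == "__main__":
    print(readings_during(x1, tau_s[("in1", "a1")], tau_c[("in1", "a1")]))
```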
3.4
Approach
The approach is based on physical objects affected by the execution of an activity and observed by sensors. In particular, the homogeneity of the correlation of activities and sensor observations based on the affected objects indicates whether the sensor infrastructure operates properly. This problem specification is similar to noisy channel coding theory on a discrete memoryless channel (DMC) as discussed in information theory [10]. The more homogeneous the encoding and decoding of messages are, the less noisy the communication channel is. In information theory the information entropy is a measure of the uncertainty of a stochastic variable. In the context of this paper, the stochastic variables are A, representing activities and instances, and S, representing sets of sensors. Thus, the activity entropy H(A) is based on activities and instances, and the sensor cluster entropy H(S) is based on sensor observations. The mutual information I(A; S) (or transinformation) represents the dependency of the stochastic variables for activities and sensor data. Mutual information I(A; S) is based on the notion of entropy (here H(A) and H(S)) describing the uncertainty associated with activities and sensor data respectively. The conditional entropies H(A|S) and H(S|A) give an indication of the uncertainty of the activities given the sensor data, or the other way around. In Fig 4 the dependencies between the different entropies and the mutual information are visualized. In the following subsections the entropy and mutual information are defined based on the workflow and sensor data models discussed in Sect 3.2 and 3.3.
Probability of Activity. The probability function p(a) for an activity a of an instance ink is the ratio of the number of object IDs associated with activity a, i.e., |θ(a)|, and the number of object IDs of all activities involved in the traces ψ(ink) of all instances in1, ..., inm.
Fig. 4. Channel Entropy (taken from [10])
Fig. 5. Possible partitions for sensor x1 for activity a1 for instance in1 and in2, while instance in3 does not have a degree of freedom
The activities in different instances are treated uniquely, although they may be named the same; e.g., a1 in the example is used in the traces of all three instances in1, in2 and in3. The sum of all activity probabilities is one. The corresponding activity entropy for all instances in1, ..., inm with A := {a ∈ ψ(ink) | 1 ≤ k ≤ m} is then defined as2

H(A) = -\sum_{a_i \in A} p(a_i) \log p(a_i)    (1)
Sensor Partition. The objects affected by processes are observed by sensors. Thus, several activities of several instances may contribute to the objects observed by a single sensor. For example, sensor x1 in the example observes objects from activities a1 and a2 of instances in1, in2, and in3. To correlate sensor observations and activities of instances, the observations of a sensor must be partitioned, i.e., each observation of a sensor is associated with an activity and instance. The union of all partitions is the set of sensor readings. A necessary condition is that an observation can only be considered for the partition of an activity if the observation has been made while the activity was being executed. The aim is to find partitions for all sensors which minimize the errors in the correlation (see Sect 4). With regard to the example, four possible options for partitioning the observations of sensor x1 for activity a1 of instances in1 and in2 at times 2 and 3 are depicted in Fig 5. The options represent all permutations of associating objects o1 and o2 with instance in1 or in2. Sensor x2 and activity a2 for instances in1 and in2 also result in four options, which are symmetric, since they all contain the same set of observations. Based on these examples, it is easy to see that the partitioning may create a huge number of possible options.
Sensor Clustering and Sensor Errors. An object affected by an activity may be observed by several sensors. Thus, the observations made by a set of sensor data streams xj must be clustered into one observation made by a set of sensors.
According to entropy definition in [10].
Therefore, the set of sensor clusters S := 2^X is defined as the powerset of all available sensor data streams X. Based on the partitions of the sensors, a cluster s for an activity a of instance in observes the set of objects contained in the partitions of all sensors in the cluster. With regard to the example, sensors x1 and x2 are clustered for activity a2 for all three instances; thus, the partition for the sensor cluster {x1, x2} for option 1 in Fig 5 is {x1, x2}_{a2,in1} = {o3, o4, o5, o6}, {x1, x2}_{a2,in2} = {o1, o2, o7} and {x1, x2}_{a2,in3} = {o8}. As can be seen for activity a2 of instance in3, only object o8 is observed, while object o9 is not observed. Sensor observation errors are available as an error set ε(a) ⊆ E, which contains the set of object IDs for an activity of an instance which are not observed by any sensor. The set E contains the set of all possible errors, i.e., the set of all object IDs of all activities and instances. With regard to the example, the set E of possible error object IDs is E = {o1, ..., o9}. Further, the set of not observed object IDs ε(a1, in3) for activity a1 of instance in3 is {o9}, since there is no sensor which observed the object during the execution time of the activity. For activity a1 of instance in1 and option 1 in Fig 5, the set of not observed object IDs is ε(a1, in1) = {o1, o2}.
Probability of Sensor Clusters and Errors. The probability for a sensor cluster sj ∈ S is the ratio of the number of objects observed by all sensors in the cluster sj and the number of all potentially observable objects. Since the sensor infrastructure is considered to be erroneous, the number of observable objects is taken from the workflow system, as it has been for the activity probabilities. The corresponding sensor cluster entropy for all sensor clusters sj ∈ S is defined as3

H(S) = -\sum_{s_j \in S} p(s_j) \log p(s_j)    (2)
Due to the errors in sensor readings, the sum of all sensor cluster probabilities may be less than 1. The error probability is defined as the ratio of the number of not observed object IDs over all activities and the number of all potentially observable objects. Due to the construction of the error probability, the sum of error probabilities over all objects indicates the error of the sensor readings, such that

1 - \sum_{s_j \in S} p(s_j) = \sum_{o \in E} p(o)
To make the sensor cluster entropies comparable, the error probabilities should be included. Therefore, a sensor cluster entropy with error H^e(S) is defined based on the sensor cluster entropy (see Eq 2) as follows4

H^e(S) = H(S) - \sum_{o \in E} p(o) \log p(o)    (3)
According to entropy definition in [10] According to entropy and mutual information definition in [10]
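A minimal numeric sketch of Eq 2 and Eq 3 is given below; the base-2 logarithm, the dictionary-based probability representation, and all names are my own assumptions, not part of the paper.

```python
from math import log2  # log base assumed; the paper does not fix one

def cluster_entropy(p_cluster):
    """H(S) of Eq 2: entropy over the sensor cluster probabilities p(s_j)."""
    return -sum(p * log2(p) for p in p_cluster.values() if p > 0)

def cluster_entropy_with_error(p_cluster, p_error):
    """H^e(S) of Eq 3: H(S) plus the entropy contribution of the error set E."""
    return cluster_entropy(p_cluster) - sum(p * log2(p) for p in p_error.values() if p > 0)

# toy numbers: the cluster probabilities sum to less than 1, the rest is error mass
p_cluster = {frozenset({"x1"}): 0.4, frozenset({"x1", "x2"}): 0.5}
p_error = {"o9": 0.1}
print(cluster_entropy(p_cluster), cluster_entropy_with_error(p_cluster, p_error))
```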
Mutual Information. Based on the above definitions, conditional probabilities for sensor readings dependent on an activity can be defined by combining the sensor cluster probability and the activity probability. That is, instead of calculating the probability of an observation in a cluster over all possible observations, the conditional probability specifies the probability of an observation in a cluster for objects of a specific activity. The sum of the conditional probabilities, weighted by the activity probabilities, equals the sensor cluster probability:

p(s_j) = \sum_{a_i \in A} p(s_j | a_i) p(a_i)
Similar to the sensor cluster probability, the error probability can be expressed via conditional probabilities dependent on the activity probability. Again, the weighted sum of the conditional probabilities equals the error probability:

p(o) = \sum_{a_i \in A} p(o | a_i) p(a_i)
Finally, the mutual information can be defined. The mutual information represents the dependency between activities and sensor data and, following [10], is defined as

I(A; S) = \sum_{s_j \in S} \sum_{a_i \in A} p(s_j | a_i) p(a_i) \log\left(\frac{p(s_j | a_i)}{p(s_j)}\right)    (4)
Similar to the sensor cluster entropy, the mutual information can be extended by considering errors to ensure comparability. The mutual information with errors I^e(A; S) is based on the mutual information as follows:

I^e(A; S) = I(A; S) + \sum_{o \in E} \sum_{a_i \in A} p(o | a_i) p(a_i) \log\left(\frac{p(o | a_i)}{p(o)}\right)    (5)
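Analogously, Eq 4 and Eq 5 can be sketched in Python. This is an illustrative implementation under the same assumptions as the entropy sketch above; the tuple-keyed dictionaries for the conditional probabilities are my own choice.

```python
from math import log2  # log base assumed; the paper does not fix one

def mutual_information(p_cond, p_act, p_cluster):
    """I(A;S) of Eq 4. p_cond[(s, a)] = p(s|a), p_act[a] = p(a), p_cluster[s] = p(s)."""
    return sum(p_sa * p_act[a] * log2(p_sa / p_cluster[s])
               for (s, a), p_sa in p_cond.items() if p_sa > 0)

def mutual_information_with_error(p_cond, p_act, p_cluster, p_err_cond, p_err):
    """I^e(A;S) of Eq 5: adds the analogous term over the error set E,
    with p_err_cond[(o, a)] = p(o|a) and p_err[o] = p(o)."""
    return mutual_information(p_cond, p_act, p_cluster) + sum(
        p_oa * p_act[a] * log2(p_oa / p_err[o])
        for (o, a), p_oa in p_err_cond.items() if p_oa > 0)
```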
4 Partition Optimization
In Sect 3.4, a partitioning of sensor observations has been formally introduced, i.e., a mapping of sensor observations to activities. In the scenario used in this paper (see Sect 2), the assignment of objects o1 and o2 in sensors x1 and x2 determines the number of possible partitions as discussed in Sect 3.4. The observations o1 ∈ x1[2] and o2 ∈ x1[3] can be associated with activity a1 in instance in1 or instance in2, resulting in four possible options as depicted in Fig 5. Thus, there are four non-symmetric partitions varying the assignment of o1 and o2 in sensor x1 to activity a1 in instances in1 and in2. The idea is to select a partition with minimal conditional entropy H(S|A), indicating the uncertainty of the sensor observations given the activities. To ensure comparability of results, the sensor cluster entropy and the mutual information with errors are used. Since H^e(S|A) = H^e(S) - I^e(A; S), based on Eq 3 and 5 the optimal partition is the partition P with

\arg\min_P \; H^e(S) - I^e(A; S)
Table 1. Non-symmetric partitions of objects o1 and o2 mapped to instance 1 and 2

Mapping                          H^e(S)   I^e(A;S)   H^e(S) - I^e(A;S)
option 1 (o1, o2 → in2)          1.8270   1.4434     0.3836
option 2 (o1 → in1, o2 → in2)    1.8270   1.4613     0.3657
option 3 (o1 → in2, o2 → in1)    1.8270   1.4613     0.3657
option 4 (o1, o2 → in1)          1.8270   1.5012     0.3258
For the scenario, the different measurements are listed in Tab 1. The optimal partition is the last one in the table, where the objects o1 and o2 observed at sensor x1 are mapped to activity a1 of instance in1. In future research, it will be investigated how to determine the optimal partition without enumerating and inspecting all possible partitions, since the enumeration may result in a combinatorial explosion.
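For the small example, the optimal partition can be found by brute-force enumeration. The sketch below only shows the enumeration; it assumes a hypothetical score function mapping a concrete assignment of the ambiguous objects to the value H^e(S) − I^e(A;S), e.g., built from the entropy and mutual-information helpers sketched in Sect 3.4.

```python
from itertools import product

# The ambiguous observations of sensor x1 (objects o1 and o2) may each be assigned
# to activity a1 of instance in1 or in2; enumerate all assignments and keep the best.
ambiguous = ["o1", "o2"]
candidates = ["in1", "in2"]

def best_partition(score):
    """Return the assignment minimising H^e(S) - I^e(A;S).
    `score` is a hypothetical function mapping an assignment such as
    {'o1': 'in1', 'o2': 'in2'} to that value."""
    options = [dict(zip(ambiguous, choice))
               for choice in product(candidates, repeat=len(ambiguous))]
    return min(options, key=score)
```

The enumeration grows exponentially with the number of ambiguous observations, which is exactly the combinatorial explosion mentioned above.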
5 Sensor Infrastructure Errors
Based on the introduced entropy and mutual information measures (see Sect 3.4), criteria for assessing a correlation of workflow and sensor data are provided. The sensor infrastructure is working without error if the activity entropy and the mutual information have the same value. Conversely, if there is an object of any activity which is not observed by any sensor, then the mutual information is smaller than the activity entropy. In case of a correctly working sensor infrastructure, sorting and picking can be distinguished from a homogeneous execution. The execution is homogeneous if the sensor cluster entropy equals the mutual information. Intuitively, this means that all objects of an activity are observed by the same sensor cluster, thus, all objects follow the same path. In case of sorting and picking at an activity, objects are observed by varying sensor clusters. In case of a not correctly working sensor infrastructure, the occurrence of ignored objects can be distinguished from sensing errors. Ignored objects means that there are objects belonging to activities which are not observed by any sensor cluster; therefore the mutual information with and without error is the same, i.e., I^e(A; S) = I(A; S). However, if objects belonging to activities are sometimes observed and sometimes not, then a sensing error is detected and the mutual information with error is greater than the mutual information, i.e., I^e(A; S) > I(A; S). The proofs for the statements made in this section are elaborated in the following subsections. A decision tree of the above statements is depicted in Fig 6.
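The criteria above, summarized by the decision tree in Fig 6, translate directly into a small classification routine. The sketch below is illustrative only; the tolerance parameter for comparing floating-point entropy values is my addition.

```python
def classify(h_a, i_as, h_s, i_e_as, tol=1e-6):
    """Classify a (sub)log according to the criteria of Sect 5 / Fig 6."""
    if abs(h_a - i_as) <= tol:          # all objects of all activities are observed
        if abs(h_s - i_as) <= tol:
            return "homogeneous"        # each activity is observed by one sensor cluster
        return "sorting/picking"        # objects of an activity hit varying clusters
    if abs(i_e_as - i_as) <= tol:
        return "ignored object"         # some objects are never observed at all
    return "sensing error"              # objects are observed only sometimes
```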
5.1 Complete Observation of Objects
All objects associated to activities are observed in the sensor data if the activity entropy equals the mutual information.
[Fig. 6. Decision tree for sensor infrastructure errors: the root tests H(A) = I(A;S). If true, the test H(S) = I(A;S) separates a homogeneous execution (true) from sorting (false); if false, the test I^e(A;S) = I(A;S) separates ignored objects (true) from a sensing error (false).]
Lemma 1. Iff all objects of all activities a_i with p(a_i) > 0 are observed by a sensor s_j, then H(A) = I(A; S).

Proof. Based on the lemma, -H(A) + I(A; S) = 0. With Eq 1 and the re-written Eq 4, the difference can be written as

\sum_{a_i \in A} p(a_i) \log(p(a_i)) + \sum_{s_j \in S} \sum_{a_i \in A} p(a_i | s_j) p(s_j) \log\left(\frac{p(a_i | s_j)}{p(a_i)}\right) = 0

which can be re-written into

\sum_{a_i \in A} \left[ p(a_i) \log(p(a_i)) + \sum_{s_j \in S} p(a_i | s_j) p(s_j) \log(p(a_i | s_j)) - \log(p(a_i)) \sum_{s_j \in S} p(a_i | s_j) p(s_j) \right] = 0

by combining the sums over the activities and separating the division in the logarithm into two terms. Since \sum_{s_j \in S} p(a_i | s_j) p(s_j) = p(a_i) based on Bayes, the first and third term compensate each other, thus

\sum_{a_i \in A} \sum_{s_j \in S} p(a_i | s_j) p(s_j) \log(p(a_i | s_j)) = 0
If p(s_j) = 0 then the sensor is not recording any objects and can therefore be neglected. If p(s_j) > 0 then an object is recorded, which can be associated with an activity a_i. Based on the condition of the lemma, p(a_i) > 0, the conditional probability is also greater than zero, i.e., p(a_i|s_j) > 0. Thus, the above expression can only be zero if p(a_i|s_j) = 1. The conditional probability is one, i.e., p(a_i|s_j) = 1, iff all objects observed by sensor s_j belong to activity a_i. This proves the lemma.

5.2 Homogeneity of Observations
All objects of an activity are observed by the same sensor cluster if the sensor cluster entropy equals the mutual information.

Lemma 2. Iff all objects of an activity a_i with p(a_i) > 0 are observed by the same sensor cluster s_j, then H(S) = I(A; S).
The proof goes along the lines of the proof in the previous section.

Proof. Based on the lemma, -H(S) + I(A; S) = 0. With Eq 2 and Eq 4, the difference can be written as

\sum_{s_j \in S} p(s_j) \log(p(s_j)) + \sum_{s_j \in S} \sum_{a_i \in A} p(s_j | a_i) p(a_i) \log\left(\frac{p(s_j | a_i)}{p(s_j)}\right) = 0

which can be re-written into

\sum_{s_j \in S} \left[ p(s_j) \log(p(s_j)) + \sum_{a_i \in A} p(s_j | a_i) p(a_i) \log(p(s_j | a_i)) - \log(p(s_j)) \sum_{a_i \in A} p(s_j | a_i) p(a_i) \right] = 0

by combining the sums over the sensor clusters and separating the division in the logarithm into two terms. Since \sum_{a_i \in A} p(s_j | a_i) p(a_i) = p(s_j) based on Bayes, the first and third term compensate each other, thus

\sum_{s_j \in S} \sum_{a_i \in A} p(s_j | a_i) p(a_i) \log(p(s_j | a_i)) = 0
If p(a_i) = 0 then the activity does not contain any recorded objects and can therefore be neglected. If p(a_i) > 0 then the activity does contain a recorded object, which can be associated with a sensor cluster s_j. Thus, the above expression can only be zero if p(s_j|a_i) = 1. The conditional probability is one, i.e., p(s_j|a_i) = 1, iff all objects of activity a_i are observed by sensor cluster s_j. This proves the lemma.

5.3 Sensing Error
There is no sensing error if the mutual information with errors equals the mutual information.

Lemma 3. Iff there are no erroneous observations, or a set of objects is not observed for all activities a_i with p(a_i) > 0, then I^e(A; S) = I(A; S).

Proof. Based on the lemma, I^e(A; S) - I(A; S) = 0. With Eq 5 and Eq 4, the equation can be re-written as

\sum_{o \in E} \sum_{a_i \in A} p(o | a_i) p(a_i) \log\left(\frac{p(o | a_i)}{p(o)}\right) = 0.

This equation is fulfilled if either the conditional probability of an erroneous reading is zero, i.e., p(o|a_i) = 0, or the conditional probability of an erroneous reading equals the probability of an erroneous reading, i.e., p(o|a_i) = p(o). The case p(o|a_i) = 0 means that all objects are observed per activity, thus there are no erroneous objects. For the second part, it has to be shown that p(o|a) = p(o) for all a ∈ A where p(o|a) > 0. Since

p(o) = \sum_{a_i \in A} p(o | a_i) p(a_i),

the condition can be rephrased as

p(o|a) = \sum_{a_i \in A} p(o | a_i) p(a_i).

Since p(o|a) > 0, the equation can be divided by this term, resulting in

\sum_{a_i \in A \setminus \{a\}} \frac{p(o | a_i)}{p(o|a)} p(a_i) + p(a) = 1.

With \sum_{a_i \in A} p(a_i) = 1 and since the equation must hold for all a, the equation can only be fulfilled if

\frac{p(o | a_i)}{p(o|a)} = 1

for all a_i. This means that the probabilities of erroneous readings are equal for all activities. As a consequence of the second part, it can be seen that all p(o|a_i) must have the same value, which is either zero in case of no errors or a specific probability in case of errors stemming from ignored objects.
6 Granularity
The scenario contains several issues, as discussed in Sect 2. Based on the current values of entropy and mutual information, the various sensing errors overlap. This makes it impossible to decide whether an object has been ignored, whether there is sorting or picking, or whether there is indeed a sensor infrastructure error. As a consequence, the information of the workflow system and the sensor infrastructure must be investigated in smaller portions to separate the different issues. Obvious ways to partition the available information are to investigate instances separately, or to use partitions of sensors. By investigating the data set at a finer granularity, on the one hand, issues potentially get separated and therefore become identifiable. On the other hand, if the granularity gets too fine-grained, e.g., single sensors or a single activity of an instance, then there is not sufficient mutual information to conclude anything. With regard to the running scenario, investigating the data on the instance level separates the issues of the scenario. The formulas for deriving the individual measures on the instance level are the same as presented in Sect 3.4, except that there is no aggregation over instances anymore. Due to the different granularity and the various effects on the entropy and mutual information, the partition optimization and the sensor infrastructure error identification have to be performed on the instance-based data.

6.1 Partition Optimization
Partition optimization as introduced in Sect 4 aims at minimizing the uncertainty of sensor observations based on activities. It can be calculated that the sum of
differences per instance between sensor cluster entropy and mutual information is minimal for option 1 in Fig 5, i.e., the assignment of objects o1 and o2 to activity a1 of instance in2. This differs from the initial assessment in Sect 4. In the following, this partition is used.

6.2 Sensor Infrastructure Errors
The entropy and mutual information measurements per instance are documented in Tab 2. The criteria introduced in Sect 5 are used for identifying the cases. The first three rows in Tab 2 are the measures for the three instances; no instance is working homogeneously. Instance in2 fulfills the conditions for sorting and picking, i.e., H(A) = I(A; S) and H(S) > I(A; S). This is indeed the case as described in Sect 2: in the scenario the objects o1, o2, and o7 affected by activity a4 of instance in2 are sensed by sensors x4[7], x6[9] and x5[12], respectively (see Tab 2, second row). Instance in3 fulfills the condition for ignored objects, i.e., H(A) > I(A; S) and I^e(A; S) = I(A; S). Checking the scenario shows that object o9 is never sensed, although it occurs in all activities of instance in3. Therefore, it seems the object is treated outside the workflow (see Tab 2, third row). If we exclude object o9 from the process information of instance in3, then the entropy and mutual information values depicted in the last row of Tab 2 can be calculated. This row fulfills the criteria for a homogeneously working sensor infrastructure for this instance, i.e., H(A) = I(A; S) and H(S) = I(A; S). Finally, instance in1 indicates a sensing error, i.e., H(A) > I(A; S) and I^e(A; S) > I(A; S). Due to the mapping of objects o1 and o2 to instance in2 rather than instance in1, the objects o1 and o2 used in activity a1 of instance in1 do not have matching sensor observations; thus, it is an issue with the sensing infrastructure (see Tab 2, first row).

Table 2. Entropy and mutual information per instance
Instance      H(A)     I(A;S)   H(S)     I^e(A;S)   H^e(S)
in1           1.0986   0.9765   1.0666   1.0986     1.3878
in2           1.0986   1.0986   1.4648   1.0986     1.4648
in3           1.0986   0.5493   0.8959   0.5493     1.2425
in3 w/o o9    1.0986   1.0986   1.0986   1.0986     1.0986
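As a cross-check, feeding the per-instance values of Tab 2 into the classification sketch from Sect 5 reproduces the diagnosis above. The classify helper is the illustrative one sketched earlier and assumed to be in scope; it is not part of the paper.

```python
# Values copied from Tab 2; `classify` is the illustrative helper sketched in Sect 5.
rows = {
    "in1":        (1.0986, 0.9765, 1.0666, 1.0986),
    "in2":        (1.0986, 1.0986, 1.4648, 1.0986),
    "in3":        (1.0986, 0.5493, 0.8959, 0.5493),
    "in3 w/o o9": (1.0986, 1.0986, 1.0986, 1.0986),
}
for name, (h_a, i_as, h_s, i_e_as) in rows.items():
    print(name, "->", classify(h_a, i_as, h_s, i_e_as))
# expected output: sensing error, sorting/picking, ignored object, homogeneous
```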
7 Related Work
To my knowledge, the idea of correlating sensor data and workflow data as a means to classify deviations has not been discussed in the literature before. There is, however, quite some literature on correlating different models based on their overlap in describing a system. Examples are correlating business process models and coordination models at design time [11] or at run time [12], or the
monitoring of Web Service compositions using logging information [13]. However, sensor data and workflow systems are indirectly correlated via the state of the physical world, while the aforementioned approaches perform the correlation directly. The mapping function between sensor data and workflow state has some similarities with data integration approaches such as [6] for homogeneous models. In particular, the approach of probabilistic data integration could be a good source of inspiration [14]. Note that, although they also deal with sensor data, scientific workflows are very different from business workflows. Scientific workflows focus on processing data, often coordinated by data [15,16], while business workflows have many instances of the same workflow schema. In [17] the authors propose a model of physical objects to identify cardinality, frequency and duration properties of physical objects related to a business workflow. Such a model could be beneficial in the context of this work. However, in the current paper, this additional explication of the physical objects in terms of a model is not required. Mutual information and information entropy have been utilized in several application domains. An application close to the one in this paper is the assessment or selection of alternative models. This is done, e.g., in [18] by optimizing mutual information related to a Bayesian learning algorithm. Model selection based on changes in entropy is proposed, e.g., in [19]. Both cases deviate from the work presented here, since in both cases selections among model alternatives are made, while in this paper models of different types, namely sensor data and workflow data, are correlated.
8 Conclusion
The correlation of sensor data and business workflow data based on mutual information and information entropy seems to be a good approach to classify deviations in the correlation. This classification can be used to identify changes in the sensor infrastructure or errors in observing physical objects. The measures actually provide a metric for the health of the sensor infrastructure and the business workflow, which can be used to improve either the sensor infrastructure or the business workflow. Future work will address the handling of abstraction problems and incomplete sensor infrastructures. Further, an approach for selecting the appropriate granularity level will be investigated. Finally, the current results will be applied in a case study.
References

1. Jedermann, R., Behrens, C., Westphal, D., Lang, W.: Applying autonomous sensor systems in logistics – combining sensor networks, RFIDs and software agents. Sensors and Actuators A: Physical, 370–375 (2006)
2. Wieland, M., Kopp, O., Nicklas, D., Leymann, F.: Towards context-aware workflows. In: CAiSE 2007 WS Proc., vol. 2 (2007)
3. Wieland, M., Käppeler, U.P., Levi, P., Leymann, F., Nicklas, D.: Towards integration of uncertain sensor data into context-aware workflows. In: GI Jahrestagung, pp. 2029–2040 (2009)
4. Soffer, P.: Mirror, mirror on the wall, can I count on you at all? Exploring data inaccuracy in business processes. In: Bider, I., Halpin, T., Krogstie, J., Nurcan, S., Proper, E., Schmidt, R., Ukor, R. (eds.) BPMDS 2010 and EMMSAD 2010. Lecture Notes in Business Information Processing, vol. 50, pp. 14–25. Springer, Heidelberg (2010)
5. Lenz, R., Reichert, M.: IT support for healthcare processes - premises, challenges, perspectives. Data Knowl. Eng. 61, 39–58 (2007)
6. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10, 334–350 (2001)
7. Wombacher, A.: A-posteriori detection of sensor infrastructure errors in correlated sensor data and business workflows. Technical report (2011)
8. Hopcroft, J.E., Motwani, R., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation. Addison Wesley, Reading (2001)
9. Oppenheim, A.V., Schafer, R.W.: Discrete-Time Signal Processing. Prentice Hall Signal Processing Series. Prentice Hall, Englewood Cliffs (1989)
10. Tzschach, H., Haßlinger, G.: Codes für den störungssicheren Datentransfer. Oldenburg (1993)
11. Zlatev, Z., Wombacher, A.: Consistency between e3-value models and activity diagrams in a multi-perspective development method. In: Chung, S. (ed.) OTM 2005. LNCS, vol. 3760, pp. 520–538. Springer, Heidelberg (2005)
12. Bodenstaff, L., Wombacher, A., Reichert, M.U., Wieringa, R.: Monitoring Collaboration from a Value Perspective. In: Intl. Conf. on Digital Ecosystems and Technologies (2007)
13. Bodenstaff, L., Wombacher, A., Reichert, M., Jaeger, M.C.: Monitoring dependencies for SLAs: The mode4SLA approach. In: IEEE SCC, pp. 21–29 (2008)
14. van Keulen, M., de Keijzer, A., Alink, W.: A probabilistic XML approach to data integration. In: ICDE, pp. 459–470. IEEE Computer Society, Los Alamitos (2005)
15. Tan, W., Missier, P., Madduri, R.K., Foster, I.T.: Building scientific workflow with Taverna and BPEL: A comparative study in caGrid. In: Feuerlicht, G., Lamersdorf, W. (eds.) ICSOC 2008. LNCS, vol. 5472, pp. 118–129. Springer, Heidelberg (2009)
16. Ludäscher, B., Weske, M., McPhillips, T.M., Bowers, S.: Scientific workflows: Business as usual? In: Dayal, U., Eder, J., Koehler, J., Reijers, H.A. (eds.) BPM 2009. LNCS, vol. 5701, pp. 31–47. Springer, Heidelberg (2009)
17. Wieringa, R., Pijpers, V., Bodenstaff, L., Gordijn, J.: Value-driven coordination process design using physical delivery models. In: Li, Q., Spaccapietra, S., Yu, E.S.K., Olivé, A. (eds.) ER 2008. LNCS, vol. 5231, pp. 216–231. Springer, Heidelberg (2008)
18. Zeng, Y., Doshi, P.: Model identification in interactive influence diagrams using mutual information. Web Intelli. and Agent Sys. 8, 313–327 (2010)
19. Vatcheva, I., de Jong, H., Bernard, O., Mars, N.J.I.: Experiment selection for the discrimination of semi-quantitative models of dynamical systems. Artificial Intelligence 170, 472–506 (2006)
Conformance Checking of Interacting Processes with Overlapping Instances

Dirk Fahland, Massimiliano de Leoni, Boudewijn F. van Dongen, and Wil M.P. van der Aalst

Eindhoven University of Technology, The Netherlands
{d.fahland,m.d.leoni,b.f.v.dongen,w.m.p.v.d.aalst}@tue.nl
Abstract. The usefulness of process models (e.g., for analysis, improvement, or execution) strongly depends on their ability to describe reality. Conformance checking is a technique to validate how well a given process model describes recorded executions of the actual process. Recently, artifacts have been proposed as a paradigm to capture dynamic and inter-organizational processes in a more natural way. Artifact-centric processes drop several restrictions and assumptions of classical processes. In particular, process instances cannot be considered in isolation as instances in artifact-centric processes may overlap and interact with each other. This significantly complicates conformance checking; the entanglement of different instances complicates the quantification and diagnosis of misalignments. This paper is the first to address this problem. We show how conformance checking of artifact-centric processes can be decomposed into a set of smaller problems that can be analyzed using conventional techniques.

Keywords: artifacts, process models, conformance, overlapping process instances.
1 Introduction

Business process models have become an integral part of modern information systems where they are used to document, execute, monitor, and optimize business processes. However, many studies show that models often deviate from reality (see [1]). To avoid building on quicksand it is vital to know in advance to what extent a model conforms to reality. Conformance checking is the problem of determining how well a given process model M describes process executions that can be observed in a running system S in reality. Several conformance metrics and techniques are available [2,3,4,5,6,7]. The most basic metric is fitness, telling whether M can replay every observed execution of S. In case M cannot replay some (or all) of these executions, the model M needs to be changed to match the reality recorded by S (or the system and/or its underlying processes are changed to align both). Existing conformance checking techniques assume rather simple models where process instances can be considered in isolation. However, when looking at the data models of ERP products such as SAP Business Suite, Microsoft Dynamics AX, Oracle E-Business Suite, Exact Globe, Infor ERP, and Oracle JD Edwards EnterpriseOne, one can easily see that this assumption is not valid for real-life processes. There are one-to-many
and many-to-many relationships between data objects, such as customers, orderlines, orders, deliveries, payments, etc. For example, an online shop may split its customers’ quotes into several orders, one per supplier of the quoted items, s.t. each order contains items for several customers. Consequently, several customer cases synchronize on the same order at a supplier, and several supplier cases synchronize on the same quote of a customer. In consequence, we will not be able to identify a unique notion of a process instance by which we can trace and isolate executions of such a process, and classical modeling languages are no longer applicable [8, 9, 10]. The fabric of real-life processes cannot be straightjacketed into monolithic processes. Therefore, we need to address two problems: (1) Find a modeling language L to express process executions where several cases of different objects overlap and synchronize. (2) Determine whether a process model M expressed in L adequately describes actual executions of a process in reality — despite the absence of process instances. The first problem is well-known [8, 9, 10] and several modeling languages have been proposed to solve it culminating in the stream of artifact-centric process modeling that emerged in recent years [8, 9, 10, 11, 12]. In short, an artifact instance is an object that participates in the process. It is equipped with a life-cycle that describes the states and possible transitions of the object. An artifact describes a class of similar objects, e.g., all orders. A process model then describes how artifacts interact with each other, e.g., by exchanging messages [11, 12]. Note that several instances of one artifact may interact with several instances of another artifact, e.g., when placing two orders consisting of multiple items with an electronic bookstore items from both orders may end up in the same delivery while items in the same order may be split over multiple deliveries. In this paper we use proclets [8] as a modeling language for artifacts to study and solve the second problem. A proclet describes one artifact, i.e., a class of objects with their own life cycle, together with an interface to other proclets. A proclet system connects the interfaces of its proclets via unidirectional channels, allowing the life-cycles of instances of the connected proclets to interact with each other by exchanging messages; one instance may send messages to multiple other instances, or an instance may receive messages from multiple instances. After selecting proclets as a representation, we can focus on the second problem; determine whether a given proclet system P allows for the behavior recorded by the actual information system S, and if not, to which degree P deviates from S and where. The problem is difficult because S does not structure its executions into isolated process instances. For this reason we develop the notion of an instance-aware log. The system S records executed life-cycle cases of its objects in separate logs L1 , . . . , Ln — one log per class of objects. Each log consists of several cases, and each event in a case is associated to a specific object. For each event, it is recorded with which other objects (having a case in another log) the event interacted by sending or receiving messages. The artifact conformance problem then reads as follows: given a proclet system P and instance-aware logs L1 , . . . , Ln , can the proclets of P be instantiated s.t. 
the life-cycles of all proclets and their interactions "replay" L1, . . . , Ln? Depending on how objects in S interact and overlap, a single execution of S can be long, possibly spanning the entire lifetime of S, which results in having to replay all
cases of all logs at once. Depending on the number of objects and cases, this may turn out infeasible for conformance checking with existing techniques. Proclets may also be intertwined in various ways. This makes conformance checking a computationally challenging problem. Analysis becomes intractable when actual instance identifiers are taken into account. Existing techniques simply abstract from the identities of instances and their interactions. Therefore, we have developed an approach to decompose the problem into a set of smaller problems: we minimally enrich each case in each log to an interaction case, describing how one object evolves through the process and synchronizes with other objects, according to other cases in other logs. We then show how to abstract a given proclet system P to an abstract proclet system P|P for each proclet P s.t. P can replay L1 , . . . , Ln iff the abstract proclet system P|P can replay each interaction case of P , for each proclet P of P. As an interaction case focuses on a single instance of a single proclet at a time (while taking its interactions into account), existing conformance checkers [6, 7] can be used to check conformance. This paper is structured as follows. Section 2 recalls how artifacts describe processes where cases of different objects overlap and interact. There, we also introduce the notion of an instance-aware event log that contains just enough information to reconstruct executions of such processes. Further, proclets are introduced as a formal language to describe such processes. Section 3 then formally states the conformance checking problem in this setting, and Section 4 presents our technique of decomposing proclet systems and logs for conformance checking. The entire approach is implemented in the process mining toolkit ProM; Section 5 presents the tool’s functionality and shows how it can discover deviations in artifact-centric processes. Section 6 concludes the paper.
2 Artifacts

This section recalls how the artifact-centric approach allows one to describe processes where cases of different objects overlap and interact.

2.1 Artifacts and Proclets: An Example

To motivate all relevant concepts and to establish our terminology, we consider a backend process of a CD online shop that serves as a running example in this paper. The CD online shop offers a large collection of CDs from different suppliers to its customers. Its backend process is triggered by a customer's request for CDs and the shop returns a quote regarding the offered CDs. If the customer accepts, the shop splits the quote into several orders, one for each CD supplier. One order handles all quoted CDs by the same supplier. The order is then executed and the suppliers ship the CDs to the shop, which distributes CDs from different orders according to the original quotes. Some CDs may be unavailable at the supplier; in this case notifications are sent to the CD shop, which forwards the information to the customer. The order closes when all CDs are shipped and all notifications are sent. The quote closes after the customer rejected the quote, or after notifications, CDs, and invoice have been sent. In recent years, the artifact-centric approach emerged as a paradigm to describe processes like in our
[Fig. 1. A proclet system describing the back-end process of a CD online shop: a quote proclet (create from request, send quote, accept, reject, notified unavailability, processed, deliver, generate invoice, close) and an order proclet (add CD, order at supplier, notify unavailable, ship available, close), connected by channels with cardinality annotations (1,+), (+,1), and (+,?). A customer's quote is split into several orders according to the suppliers of the CDs; an order at a supplier handles several quotes from different customers.]
CD shop example, where several quote cases interact with several order cases. Quotes and orders are the artifacts of this process. Figure 1 models the above process in terms of a proclet system [8] consisting of two proclets: one describing the life-cycle of quotes and the other describing the life-cycle of orders. Note that Fig. 1 abstracts from interactions between the CD shop and the customers. Instead, the focus is on interactions between quotes in the CD shop and orders handled by suppliers. The distinctive quality in the interactions between quotes and orders is their cardinality: each quote may interact with several orders, and each order may interact with several quotes. That is, we observe many-to-many relations between quotes and orders. For example, consider a process execution involving two quote instances: one over CDa (q1) and the other over CDa, CDb, and CDc (q2). CDb and CDc have the same supplier; CDa has a different supplier. Hence, the quotes are split into two order instances (o1 and o2). In the execution, CDa and CDb turn out to be available whereas CDc is not. Consequently, CDa is shipped to the first quote, and CDa and CDb are delivered to the second quote. The second quote is also notified regarding the unavailability of CDc. This execution gives rise to the following cases of quote and order, which interact as illustrated in Fig. 2 (note that a CD is not an artifact as it does not follow a life-cycle in this process).
: : : :
create, send, accept, processed, deliver, generate, close create, send, accept, notified, processed, deliver, generate, close add CD, add CD, order, ship, close add CD, add CD, order, notify, ship, close
Conformance Checking of Interacting Processes with Overlapping Instances q1:
create
send
accept
processed
deliver
CDa o1:
generate
add CD
send
accept
CDb o2:
add CD
close
CDa add CD
order
ship
CDa q 2: create
349
close
CDa notified
CDc
processed
CDc
add CD
deliver
generate
close
CDb order
notify
ship
close
Fig. 2. An execution of the CD shop process involving two quote cases and two order cases that interact with each other in a many-to-many fashion
2.2 Instance-Aware Logs

The task of checking whether a given process model accurately describes the processes executed in a running system S requires that S records the relevant events in a log. Classically, each process execution in S corresponds to a case running in isolation. Such a case can be represented by the sequence of events that occurred. In an artifact-centric process like Fig. 1, one cannot abstract from interactions and many-to-many relations between different cases; quotes and orders are interacting in a way that cannot be abstracted away. Relating events of different cases to each other is known as event correlation; see [13] for a survey of correlation patterns. A set of events (of different cases) is said to be correlated by some property P if each event has this property P. The set of all correlated events defines a conversation. For instance in Fig. 2, events accept and processed of q1, and events add CD and ship of o1 form a conversation. Various correlation mechanisms to define and set the correlation property of an event are possible [13]. In this paper, we do not focus on the actual correlation mechanism. We simply assume that such correlations have been derived; these are the connections between the different instances illustrated in Fig. 2. To abstract from a specific correlation mechanism we introduce the notion of an instance-aware log. In the following we assume asynchronous interaction between different instances. Let e be an event. Event correlation defines the instances from which e received a message and the instances to which e sent a message. As e could send/receive several messages to/from the same instance, correlation data are stored as multisets of instance ids. For the sake of simplicity, in this paper we assume that correlation data on receiving and sending messages was defined by a single correlation property each. Hence, e is associated with one multiset of instances from which e received messages and one multiset of instances to which e sent messages. A multiset m ∈ \mathbb{N}^I over a set I of instance ids is technically a mapping m : I → \mathbb{N} defining how often each id ∈ I occurs in m; [] denotes the empty multiset.
Definition 1 (Instance-aware events). Let Σ = {a1 , a2 , . . . , an } be a finite set of event types, and let I = {id 1 , id 2 , . . .} be a set of instance identifiers. An instance-aware event
e is a 4-tuple e = (a, id, SID, RID) where a ∈ Σ is the event type, id is the instance in which e occurred, SID = [sid1, . . . , sidk] ∈ \mathbb{N}^I is the multiset of instances from which e consumed messages, and RID = [rid1, . . . , ridl] ∈ \mathbb{N}^I is the multiset of instances for which e produced messages. Let E(Σ, I) denote the set of all instance-aware events over Σ and I.
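To make Definition 1 concrete, a minimal data-structure sketch in Python is given below; Counter is used to represent multisets, and all field and class names are my own, not part of the paper.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class InstanceAwareEvent:
    """An instance-aware event e = (a, id, SID, RID) as in Definition 1."""
    event_type: str                # a, an element of the set of event types
    instance: str                  # id, the instance in which e occurred
    received_from: Counter = field(default_factory=Counter)  # SID, multiset of instance ids
    sent_to: Counter = field(default_factory=Counter)        # RID, multiset of instance ids

# the third and fifth events of q2 from Fig. 2:
accept = InstanceAwareEvent("accept", "q2", Counter(), Counter({"o1": 1, "o2": 2}))
processed = InstanceAwareEvent("processed", "q2", Counter({"o1": 1, "o2": 1}), Counter())
```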
Consider for example the third event of q2 in Fig. 2. This instance-aware event is denoted as (accept, q2, [], [o1, o2, o2]). The fifth event of q2 is denoted as (processed, q2, [o1, o2], []). Instance-aware events capture the essence of event correlation and abstract from the underlying correlation property, e.g., CDa. All events of one instance of an artifact A define a case; all cases of A define the log of A. An execution of the entire process records the cases of the involved artifact instances in different logs that together constitute an instance-aware log.

Definition 2 (Instance-aware cases and logs). An instance-aware case σ = e1, . . . , er ∈ E(Σ, I)* is a finite sequence of instance-aware events all occurring in the same instance id ∈ I. Let L1, . . . , Ln be sets of finitely many instance-aware cases s.t. no two cases use the same instance id. Further, let < be a total order on all events in all cases s.t. e < e' whenever e occurs before e' in the same case. Then L = ({L1, . . . , Ln},