COMMUNICATING PROCESS ARCHITECTURES 2009
Concurrent Systems Engineering Series
Series Editors: M.R. Jane, J. Hulskamp, P.H. Welch, D. Stiles and T.L. Kunii
Volume 67

Previously published in this series:
Volume 66, Communicating Process Architectures 2008 (WoTUG-31), P.H. Welch, S. Stepney, F.A.C. Polack, F.R.M. Barnes, A.A. McEwan, G.S. Stiles, J.F. Broenink and A.T. Sampson
Volume 65, Communicating Process Architectures 2007 (WoTUG-30), A.A. McEwan, S. Schneider, W. Ifill and P.H. Welch
Volume 64, Communicating Process Architectures 2006 (WoTUG-29), P.H. Welch, J. Kerridge and F.R.M. Barnes
Volume 63, Communicating Process Architectures 2005 (WoTUG-28), J.F. Broenink, H.W. Roebbers, J.P.E. Sunter, P.H. Welch and D.C. Wood
Volume 62, Communicating Process Architectures 2004 (WoTUG-27), I.R. East, J. Martin, P.H. Welch, D. Duce and M. Green
Volume 61, Communicating Process Architectures 2003 (WoTUG-26), J.F. Broenink and G.H. Hilderink
Volume 60, Communicating Process Architectures 2002 (WoTUG-25), J.S. Pascoe, P.H. Welch, R.J. Loader and V.S. Sunderam
Volume 59, Communicating Process Architectures 2001 (WoTUG-24), A. Chalmers, M. Mirmehdi and H. Muller
Volume 58, Communicating Process Architectures 2000 (WoTUG-23), P.H. Welch and A.W.P. Bakkers
Volume 57, Architectures, Languages and Techniques for Concurrent Systems (WoTUG-22), B.M. Cook
Volumes 54–56, Computational Intelligence for Modelling, Control & Automation, M. Mohammadian
Volume 53, Advances in Computer and Information Sciences ’98, U. Güdükbay, T. Dayar, A. Gürsoy and E. Gelenbe
Volume 52, Architectures, Languages and Patterns for Parallel and Distributed Applications (WoTUG-21), P.H. Welch and A.W.P. Bakkers
Volume 51, The Network Designer’s Handbook, A.M. Jones, N.J. Davies, M.A. Firth and C.J. Wright
Volume 50, Parallel Programming and JAVA (WoTUG-20), A. Bakkers

Transputer and OCCAM Engineering Series
Volume 45, Parallel Programming and Applications, P. Fritzson and L. Finmo
Volume 44, Transputer and Occam Developments (WoTUG-18), P. Nixon
Volume 43, Parallel Computing: Technology and Practice (PCAT-94), J.P. Gray and F. Naghdy
Volume 42, Transputer Research and Applications 7 (NATUG-7), H. Arabnia

ISSN 1383-7575
Communicating Process Architectures 2009 WoTUG-32 Edited by
Peter H. Welch University of Kent, UK
Herman W. Roebbers TASS, Eindhoven, the Netherlands
Jan F. Broenink University of Twente, the Netherlands
Frederick R.M. Barnes University of Kent, UK
Carl G. Ritson University of Kent, UK
Adam T. Sampson University of Kent, UK
Gardiner S. Stiles Utah State University, USA
and
Brian Vinter University of Copenhagen, Denmark
Proceedings of the 32nd WoTUG Technical Meeting, 1–4 November 2009, TU Eindhoven, Eindhoven, the Netherlands
Amsterdam • Berlin • Tokyo • Washington, DC
© 2009 The authors and IOS Press. All rights reserved.
No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without prior written permission from the publisher.

ISBN 978-1-60750-065-0
Library of Congress Control Number: 2009937770

Publisher: IOS Press BV, Nieuwe Hemweg 6B, 1013 BG Amsterdam, Netherlands; fax: +31 20 687 0019; e-mail: [email protected]

Distributor in the USA and Canada: IOS Press, Inc., 4502 Rachael Manor Drive, Fairfax, VA 22032, USA; fax: +1 703 323 3668; e-mail: [email protected]

LEGAL NOTICE: The publisher is not responsible for the use which might be made of the following information.

PRINTED IN THE NETHERLANDS
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved.
Preface

This thirty-second Communicating Process Architectures conference, CPA 2009, takes place as part of Formal Methods Week, 1–6 November 2009. Under the auspices of WoTUG, CPA 2009 has been organised by TASS (formerly Philips TASS) in co-operation with the Technische Universiteit Eindhoven (TU/e). This is the second time: we had a very successful conference here in 2005 and are very pleased to have been invited back.

We see growing awareness, and growing adoption, of the ideas characterised by “Communicating Process Architectures”. The complexity of modern computing systems has become so great that no one person – maybe not even a small team – can understand all aspects and all interactions. The only hope of making such systems work is to ensure that all components are correct by design and that the components can be combined to achieve scalability and predictable function. A crucial property is that the cost of making a change to a system depends linearly on the size of that change – not on the size of the system being changed. This must hold whether the change is a matter of maintenance (e.g. to take advantage of increasing multicore capability) or the addition of new functionality.

One key is that system composition (and disassembly) introduces no surprises. A component must behave consistently, no matter the context in which it is used – which means that component interfaces must be explicit, published and free from hidden side-effects. Our view is that concurrency, underpinned by the formal process algebras of Hoare’s Communicating Sequential Processes and Milner’s π-Calculus, provides the strongest basis for the development of technology that can make this happen. Many modern systems cannot be maintained unless they have concurrency working for them, not against them – certainly not once multicores are involved!
We have again received an interesting set of papers covering a wide range of topics: system design and implementation (for both hardware and software), tools (concurrent programming languages, libraries and run-time kernels), formal methods and applications. All have been rigorously refereed and are of high quality. As the papers are presented in a single stream, you will not have to miss out on anything. As always, there will be plenty of space for informal contact, and do not forget the evening Fringe Programme – where we will not have to worry about the bar closing at half past ten!

We are very pleased this year to have Professor Michael Goldsmith of the e-Security Group (WMG Digital Laboratory, University of Warwick) as our keynote speaker. He is one of the creators of the CSP model checker FDR, which is used by many in industry and academia to check concurrent systems for deadlock and livelock, and to verify correct patterns of behaviour through refinement checks – all before any code is written.

We thank the authors for their submissions and the Programme Committee for their hard work in reviewing the papers. We also thank Tijn Borghuis and Erik de Vink for inviting CPA 2009 to join FMweek 2009 and for making the arrangements with the TU/e.

Peter Welch (University of Kent), Herman Roebbers (TASS, Eindhoven), Jan Broenink (University of Twente), Frederick Barnes (University of Kent), Carl Ritson (University of Kent), Adam Sampson (University of Kent), Dyke Stiles (Utah State University), Brian Vinter (University of Copenhagen).
Editorial Board
Dr. Frederick R.M. Barnes, University of Kent, UK
Dr. Jan F. Broenink, University of Twente, The Netherlands
Mr. Carl G. Ritson, University of Kent, UK
Mr. Herman Roebbers, TASS, the Netherlands
Mr. Adam T. Sampson, University of Kent, UK
Prof. Gardiner (Dyke) Stiles, Utah State University, USA
Prof. Brian Vinter, University of Copenhagen, Denmark
Prof. Peter H. Welch, University of Kent, UK (Chair)
Reviewing Committee
Dr. Alastair R. Allen, Aberdeen University, UK
Dr. Paul S. Andrews, University of York, UK
Dr. Bahareh Badban, University of Konstanz, Germany
Dr. Iain Bate, University of York, UK
Dr. John Markus Bjørndalen, University of Tromsø, Norway
Dr. Jim Bown, University of Abertay Dundee, UK
Dr. Phil Brooke, University of Teesside, UK
Mr. Neil C.C. Brown, University of Kent, UK
Dr. Kevin Chalmers, Edinburgh Napier University, UK
Dr. Barry Cook, 4Links Ltd., UK
Dr. Ian East, Oxford Brookes University, UK
Dr. Oliver Faust, Altreonic, Belgium
Prof. Wan Fokkink, Vrije Universiteit Amsterdam, The Netherlands
Dr. Leo Freitas, University of York, UK
Dr. Bill Gardner, University of Guelph, Canada
Mr. Marcel Groothuis, University of Twente, The Netherlands
Dr. Kohei Honda, Queen Mary & Westfield College, UK
Mr. Jason Hurt, University of Nevada, USA
Ms. Ruth Ivimey-Cook, Creative Business Systems Ltd., UK
Dr. Jeremy Jacob, University of York, UK
Mr. Humaira Kamal, University of British Columbia, Canada
Dr. Adrian E. Lawrence, University of Loughborough, UK
Dr. Gavin Lowe, University of Oxford, UK
Dr. Jeremy M.R. Martin, GlaxoSmithKline, UK
Dr. Alistair McEwan, University of Leicester, UK
Dr. MohammadReza Mousavi, Eindhoven University of Technology, The Netherlands
Dr. Jan B. Pedersen, University of Nevada, USA
Mr. Brad Penoff, University of British Columbia, Canada
Dr. Fiona A.C. Polack, University of York, UK
Mr. Jon Simpson, University of Kent, UK
Dr. Marc L. Smith, Vassar College, USA
Mr. Bernhard H.C. Sputh, Altreonic, Belgium
Prof. Susan Stepney, University of York, UK
Prof. Gardiner (Dyke) Stiles, Utah State University, USA
Mr. Bernard Sufrin, University of Oxford, UK
Dr.ir. Johan P.E. Sunter, TASS, The Netherlands
Dr.ir. Jeroen P.M. Voeten, Eindhoven University of Technology, The Netherlands
Prof. Alan Wagner, University of British Columbia, Canada
Mr. Doug N. Warren, University of Kent, UK
Prof. George C. Wells, Rhodes University, South Africa
Contents

Preface · Peter H. Welch, Herman Roebbers, Jan F. Broenink, Frederick R.M. Barnes, Carl G. Ritson, Adam T. Sampson, Gardiner (Dyke) Stiles and Brian Vinter · v
Editorial Board · vi
Reviewing Committee · vii
Beyond Mobility: What Next After CSP/π? · Michael Goldsmith · 1
The SCOOP Concurrency Model in Java-Like Languages · Faraz Torshizi, Jonathan S. Ostroff, Richard F. Paige and Marsha Chechik · 7
Combining Partial Order Reduction with Bounded Model Checking · José Vander Meulen and Charles Pecheur · 29
On Congruence Property of Scope Equivalence for Concurrent Programs with Higher-Order Communication · Masaki Murakami · 49
Analysing gCSP Models Using Runtime and Model Analysis Algorithms · Maarten M. Bezemer, Marcel A. Groothuis and Jan F. Broenink · 67
Relating and Visualising CSP, VCR and Structural Traces · Neil C.C. Brown and Marc L. Smith · 89
Designing a Mathematically Verified I2C Device Driver Using ASD · Arjen Klomp, Herman Roebbers, Ruud Derwig and Leon Bouwmeester · 105
Mobile Escape Analysis for occam-pi · Frederick R.M. Barnes · 117
New ALT for Application Timers and Synchronisation Point Scheduling (Two Excerpts from a Small Channel Based Scheduler) · Øyvind Teig and Per Johan Vannebo · 135
Translating ETC to LLVM Assembly · Carl G. Ritson · 145
Resumable Java Bytecode – Process Mobility for the JVM · Jan Bækgaard Pedersen and Brian Kauke · 159
OpenComRTOS: A Runtime Environment for Interacting Entities · Bernhard H.C. Sputh, Oliver Faust, Eric Verhulst and Vitaliy Mezhuyev · 173
Economics of Cloud Computing: A Statistical Genetics Case Study · Jeremy M.R. Martin, Steven J. Barrett, Simon J. Thornber, Silviu-Alin Bacanu, Dale Dunlap and Steve Weston · 185
An Application of CoSMoS Design Methods to Pedestrian Simulation · Sarah Clayton, Neil Urquhart and Jon Kerridge · 197
An Investigation into Distributed Channel Mobility Support for Communicating Process Architectures · Kevin Chalmers and Jon Kerridge · 205
Auto-Mobiles: Optimised Message-Passing · Neil C.C. Brown · 225
A Denotational Study of Mobility · Joël-Alexis Bialkiewicz and Frédéric Peschanski · 239
PyCSP Revisited · Brian Vinter, John Markus Bjørndalen and Rune Møllegaard Friborg · 263
Three Unique Implementations of Processes for PyCSP · Rune Møllegaard Friborg, John Markus Bjørndalen and Brian Vinter · 277
CSP as a Domain-Specific Language Embedded in Python and Jython · Sarah Mount, Mohammad Hammoudeh, Sam Wilson and Robert Newman · 293
Hydra: A Python Framework for Parallel Computing · Waide B. Tristram and Karen L. Bradshaw · 311
Extending CSP with Tests for Availability · Gavin Lowe · 325
Design Patterns for Communicating Systems with Deadline Propagation · Martin Korsgaard and Sverre Hendseth · 349
JCSP Agents-Based Service Discovery for Pervasive Computing · Anna Kosek, Jon Kerridge, Aly Syed and Alistair Armitage · 363
Toward Process Architectures for Behavioural Robotics · Jonathan Simpson and Carl G. Ritson · 375
HW/SW Design Space Exploration on the Production Cell Setup · Marcel A. Groothuis and Jan F. Broenink · 387
Engineering Emergence: An occam-π Adventure · Peter H. Welch, Kurt Wallnau and Mark Klein · 403
Subject Index · 405
Author Index · 407
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-1
Beyond Mobility: What Next After CSP/π?
Michael GOLDSMITH
WMG Digital Laboratory, University of Warwick, Coventry CV4 7AL, UK
[email protected]

Abstract. Process algebras like CSP and CCS inspired the original occam model of communication and process encapsulation. Later the π-calculus, and various treatments of mobility in CSP, added support for mobility, as realised in practical programming systems such as occam-π, JCSP, CHP and Sufrin’s CSO, which allow a rather abstract notion of motion of processes and channel ends between parents or owners. Milner’s Space and Motion of Communicating Agents, on the other hand, describes the bigraph framework, which makes location more of a first-class citizen of the calculus and evolves through reaction rules that rewrite both the place and link graphs of matching sections of a system state, allowing more dramatic dynamic reconfigurations of a system than simple process spawning or migration. I consider the tractability of the notation, and to what extent the additional flexibility reflects or elicits desirable programming paradigms.

Keywords. bigraphs, formal modelling, refinement, verification
Introduction

The Communicating Process Architecture approach to programming can be traced back to the occam of the 1980s [1], which in turn was the result of grafting together a rather simple (and hence amenable to analysis) imperative language with the interaction model of the two competing process algebras, CCS [2] and CSP [3,4]; occam lies squarely within the sphere where the two agree, although both allow more dynamic process evolution (such as recursion through parallel operators, in the case of CSP), while occam deliberately enforces a rather static process structure, to simplify memory allocation and usage checking. The π-calculus [5,6] extends CCS with the ability dynamically to create and communicate (event/channel) names, which can then be used for further communication by their recipients. These notions have been realised in practical imperative programming systems such as Communicating Sequential Processes for Java (JCSP) [7], occam-π [8] and Communicating Scala Objects [9], while Communicating Haskell Processes (CHP) [10] embeds them into a functional-programming setting.

We have grown accustomed to thinking of the ability to transfer channel ends and (running) processes as “mobility”, but often this is little more than a metaphor: movement is between logical owners and perhaps processors, but (at least within these languages) there is no notion of geographical location or physical proximity, such as would be needed to deal adequately with pervasive systems or location-aware services. For considerations of this kind, something like the Ambient Calculus [11] might be more appropriate, but it concentrates rather exclusively on entry to and exit from places, rather than on communication and calculation per se. Milner’s bigraphs [12] provide a framework within which all these formalisms can be represented and perhaps fruitfully cohabit.
In this paper, I give a brief overview of the notions and some musings on the potential and challenges they bring with them.
M. Goldsmith / Beyond Mobility
1. Bigraphs

A bigraph is a graph with two distinct sets of edges or, if you prefer, a pair of graphs with a (largely) common set of vertices. One is the place graph, a forest of vertices whose leaves (if any – there may equally be nodes with out-degree 0) are sites: placeholders into which another place graph might be grafted. A signature is a set of control types (with which the vertices of the place graph are annotated) and the arity (or number of ports) associated with each. The link graph, on the other hand, is a hypergraph relating zero or more ports belonging to nodes of the place graph, together with (possibly) names constituting the inner and outer faces of the bigraph. Names on the inner face (conventionally drawn below the diagrammatic representation) provide a connection to the link graph of a bigraph plugging into the sites of the place graph; those on the outer face link into any context into which the bigraph itself may be plugged.

Thus a given bigraph can be viewed as an adaptor mediating the fit of a bigraph with as many regions (roots of its place-graph forest) as this one has sites, and with the set of names on its outer face equal to those on the inner face of the one in question, into a context with as many sites as this one has regions and inner name-set equal to our outer. For those so inclined, it can be regarded as an arrow from its inner interface[1] to its outer interface[2] in a suitable category, and composition of such arrows constitutes just such a plug-in operation. Both places and links may be required to obey a sorting discipline, restricting which control types may be immediate children of one another, or for instance restricting each link to joining at most one “driving” port; such well-formedness rules restrict the number of compositions that are possible, in comparison with composition amongst unsorted bigraphs.
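The structure just described (a place forest of control-labelled nodes with ports, plus a link hypergraph over those ports) can be made concrete in a few lines. The following Python sketch is purely illustrative; the names `Node` and `Bigraph` are ours and do not come from any bigraph tool:

```python
class Node:
    """A place-graph vertex: a control type with a fixed arity (number of ports)."""
    def __init__(self, control, arity, children=None):
        self.control = control
        self.arity = arity
        self.children = children if children is not None else []
        # each port is identified by a (node, port-index) pair
        self.ports = [(self, i) for i in range(arity)]

class Bigraph:
    """A place forest plus a link hypergraph over the nodes' ports."""
    def __init__(self, regions, links):
        self.regions = regions   # roots of the place forest
        self.links = links       # link name -> set of (node, port-index) pairs

# A room R containing an agent A, both attached to the same link 'a_r'
# (a link may join any number of ports, not just two):
agent = Node("A", 1)
room = Node("R", 1, children=[agent])
g = Bigraph(regions=[room], links={"a_r": {agent.ports[0], room.ports[0]}})
```

Note that a link here relates a *set* of ports, mirroring the hyperedge (rather than point-to-point) character of link graphs.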
A bigraph provides only a (potentially graphical) representation of a state of a system, like a term in a traditional process algebra; it does not in itself define any notion of evolution such as to give rise to an operational semantics. This is presented in the form of a number of reaction rules, where a bigraphical redex is associated with a reactum into which any subgraph matching the redex may be transformed. The precise meaning of “match” is defined in categorical terms, in that there must exist contexts for both the current bigraph and the redex, such that the two in their respective contexts are equal. The result of the reaction is the reactum placed in the context required for the redex. For any well-behaved reaction system this last composition will be well defined, and one might hope that the context of the source bigraph could be stripped off the result to reflect a self-contained rewrite, but this is not part of the definition.

Place graphs have to play a dual role: they may indeed represent physical location, and so juxtaposition (and parallel execution) of their children; but they are also the only structuring mechanism available, to represent possession of data or the continuation of a process beyond some guarding action. Thus it may be necessary to distinguish active and inactive control types, and to restrict matching of a redex to contexts where all its parents are active, lest the second term of a sequential composition, for example, should be reduced (and so effectively allowed to execute) before the first has successfully terminated.

Similarly, links might be expected to be used to model events or channels, and indeed one can formulate process algebras within this framework in such a way that they do; but they can also be used to represent more abstract relations between nodes, such as joint possession of some “secret” data value which it would be inconvenient to model as a distinguished type of control which two agents would have to “possess”.
Note that links can relate multiple nodes, like events in CSP, and are not restricted to point-to-point connections. Thus they can also represent “barriers” in the occam-π sense.

[1] The pair of the number of sites and the inner name-set.
[2] The pair of the number of regions and the outer name-set.
1.1 Simple Examples

Figure 1 shows perhaps the simplest useful ambient-like reaction rule on place graphs: an atomic agent a who is adjacent to a room r might invoke this rule to move inside the room. Even here there is more than may meet the eye: we are in fact defining a parametric family of reaction rules, parameterised not only by the identities a and r, but also by the existing contents of the room (represented graphically by the pair of dots). There is also some abuse of notation involved, since strictly controls ought to have a fixed number of children (their rank), and here we have changed the number of nodes both inside and immediately outside the room; but there is some punning which lets us get away with such flexibility. Note that the context that matches the redex against (a context of) the source bigraph must also be flexible in this way, since the children of the node where it plugs in reduce in number.
[Figure 1. The enter.a.r reaction: agent A, adjacent to room R, moves inside the room; the room's existing contents are preserved.]
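Operationally, the enter.a.r rule is just a rewrite on the place forest. Here is a toy Python rendering using our own nested-tuple encoding of places as (control, children) pairs; it is not a general bigraph matcher, only this one rule specialised to adjacent siblings:

```python
def enter(siblings, agent_ix, room_ix):
    """Reaction enter.a.r: if the agent and the room are adjacent (children
    of the same parent), move the agent inside the room.  The room's
    existing contents (the parametric 'hole' of the rule) are kept intact."""
    agent = siblings[agent_ix]
    control, contents = siblings[room_ix]
    new_room = (control, contents + [agent])
    rest = [n for i, n in enumerate(siblings) if i not in (agent_ix, room_ix)]
    return rest + [new_room]

# Before: agent a next to room r, which already holds two occupants (the dots).
a = ("A:a", [])
r = ("R:r", [("dot", []), ("dot", [])])
after = enter([a, r], 0, 1)
# after == [("R:r", [("dot", []), ("dot", []), ("A:a", [])])]
```

The parametricity noted in the text shows up as the `contents` list: the rule fires whatever the room already contains.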
Figure 2 shows a similar transformation, but here a secure room x will only admit a user who carries a badge with the correct authorisation. In this version (and so at this level of abstraction) we are probably treating the badge as a passive token, and the U control should probably be defined to be inactive; but in a slightly richer model it might be that it had an expiry date and an appropriate timeout reaction would unlink it from the unlocks.x attribute (though not necessarily from other unlocks permissions), and in this case U should be active to allow this to proceed autonomously.
[Figure 2. The swipe.u.x reaction: secure room X admits user U only when the badge B carried by U is linked to the unlocks.x attribute.]
So far all the reactions have preserved the number and kind of nodes in the graph, and merely restructured the place graph. That certainly need not be the case.

[Figure 3. The spend.u.x reaction: user U holds a single-use token T, which vanishes on entry to room X, leaving an empty pocket.]
For example, in Figure 3, rather than a long-term badge, u holds a single-use token – at least until he uses it in order to enter x, at which point it simply vanishes, leaving him with an empty pocket. As a final example, consider communication in a language like occam: two processes, possibly in widely separated parts of a graph, can interact over a channel to resolve an alternative construct where the input end is guarding one of the branches (Figure 4).

[Figure 4. The c.v reaction: an output !v and an ALT whose guarding input ?x share channel c; the reaction selects that branch, leading to x:=v.]
Note that the other alternatives are discarded from the ALT, that there remains an obligation to complete the data transfer on the right-hand side, and that the channel c remains available for connection of either or both of the components with each other or the other’s (sequential) context.
2. Challenges

2.1 Semantics

The bigraph framework is extremely flexible and powerfully expressive; but with great power comes great responsibility! The category-theoretical semantics is quite rebarbative, at least to a CSP-ist, and much of the theory seems aimed at finding restrictions on a given bigraphical reactive system to ensure that the resulting transition system can be trimmed to make it tractable, while preserving its discriminating power (up to bisimulation). The labelled transition system naturally derived from a set of reaction rules over some set of ground bigraphs has, as its labels, the context into which the source bigraph must be placed in order to be unified with an (unspecified) redex, itself in some (unspecified) context. This results in a system where both the labels and the results of the transitions are infinite in number and arbitrarily complex in structure, which is clearly undesirable for any practical purpose. One set of conditions is designed to ensure that the essentials of this transition system are captured when attention is restricted to the minimal contexts enabling the rule to fire; another allows attention to be restricted further to prime engaged transitions (see [12]).

But even here I find something unsatisfactory: the label describes only what needs to be done to the source bigraph to allow it to engage in some reaction. There is no immediate way to determine by which reaction rule, nor where in that composed process, the rewrite occurs. For instance, any reductions that can take place internally to the current state give rise to a transition labelled with the null (identity) context. In some ways this is quite natural, by analogy with the internal τ-actions of the operational semantics of CCS or CSP, but my instinct is that it will often be of crucial interest which reactions gave rise to those essentially invisible events.
In many cases it can be contrived (by subtle use of link graphs) that the minimal or prime engaged systems do agree with a more intuitive labelling scheme (such as might arise from the reaction names decorating the arrows in the above examples). But it remains far from clear to me whether there is any systematic way of attaching an intuitive scheme to (some subclass of) reactive systems, or of contriving a reactive system whose natural semantics are equivalent to one conceived from a given intuitive labelling. Of course, given my background, it is not really bisimilarity that interests me – I want to be able to encode interesting specifications and check for refinement using FDR [13]. At least the failures pre-order is known to be monotonic over well-behaved bigraphs (and failures equivalence is a congruence), so there is no fundamental problem here; but there remains a certain amount of research needed to attain fluency and to map onto FDR.

2.2 Applications

There are some areas where a tractable bigraphical treatment offers immediate benefit over the sometimes convoluted modelling needed in CSP, say, to achieve the same effect: the interface between physical and electronic security (of which the examples above are simplistic illustrations), pervasive adaptive systems, location-based and context-aware services, vehicle telematics, and so on. The security application has already received welcome attention in the form of Blackwell’s Spygraphs [14,15], which adds cryptographic primitives in the style of the Spi Calculus [16]. What others are there? Do they exhibit particular challenges or opportunities?

2.3 Language Features

The encryption and decryption schemas for symmetric and asymmetric cryptography lend themselves quite naturally to a bigraphical description, with nodes representing data items –
see Figure 5. (Note however that this reaction rule is not nice, in the technical sense, because it duplicates the contents of the site in the redex into the reactum, and so is not affine.)

[Figure 5. The decrypt.k reaction, involving a key node K linked on k and an encrypted package ENC.]
But are there any other data-processing features which could usefully be described in this way, possibly without the adjacency requirement? Perhaps external correlation attacks against putatively anonymised databases? More generally, suppose I describe some pervasive-adaptive system, say, and establish that the design has desirable properties; some of the agents in the system will describe people and other parts of the environment, but others will be components which need to be implemented. What is the language most appropriate for programming such components, so that we can attain a reasonable assurance that they will behave as their models do in all important aspects? Is it something like occam-π, or do we need new language features or whole new paradigms to reflect the potentially quite drastic reconfigurations that a reaction can produce? A challenge for the language designers out there!

References

[1] Inmos Ltd, occam Programming Manual, Prentice Hall, 1984.
[2] R. Milner, A Calculus of Communicating Systems, Springer Lecture Notes in Computer Science 92, 1980.
[3] C.A.R. Hoare, Communicating Sequential Processes, Prentice Hall, 1985.
[4] A.W. Roscoe, The Theory and Practice of Concurrency, Prentice-Hall, 1998.
[5] R. Milner, Communicating and Mobile Systems: the Pi-Calculus, Cambridge University Press, 1999.
[6] D. Sangiorgi and D. Walker, The π-calculus: A Theory of Mobile Processes, Cambridge University Press, 2001.
[7] JCSP home page, http://www.cs.kent.ac.uk/projects/ofa/jcsp/.
[8] P.H. Welch, An occam-pi Quick Reference, 1996–2008, https://www.cs.kent.ac.uk/research/groups/sys/wiki/OccamPiReference.
[9] B. Sufrin, Communicating Scala Objects, Communicating Process Architectures 2008, pp. 35–54, IOS Press, ISBN 978-1-58603-907-3, 2008.
[10] N.C.C. Brown, Communicating Haskell Processes: Composable Explicit Concurrency using Monads, Communicating Process Architectures 2008, pp. 67–83, IOS Press, ISBN 978-1-58603-907-3, 2008.
[11] L. Cardelli and A.D. Gordon, Mobile Ambients, in Foundations of Software Science and Computation Structures, Springer Lecture Notes in Computer Science 1378, pp. 140–155, 1998.
[12] R. Milner, The Space and Motion of Communicating Agents, Cambridge University Press, 2009.
[13] Formal Systems (Europe) Ltd, Failures-Divergence Refinement: the FDR2 Manual, http://www.fsel.com/fdr2_manual.html, 1997–2007.
[14] C. Blackwell, Spygraphs: A Calculus for Security Modelling, in Proceedings of the British Colloquium for Theoretical Computer Science (BCTCS), 2008.
[15] C. Blackwell, A Security Architecture to Model Destructive Insider Attacks, in Proceedings of the 8th European Conference on Information Warfare and Security, 2009.
[16] M. Abadi and A.D. Gordon, A Calculus for Cryptographic Protocols: The Spi Calculus, in Proceedings of the Fourth ACM Conference on Computer and Communications Security, pp. 36–47, 1997.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-7
The SCOOP Concurrency Model in Java-like Languages
Faraz TORSHIZI a,1, Jonathan S. OSTROFF b, Richard F. PAIGE c and Marsha CHECHIK a
a Department of Computer Science, University of Toronto, Canada
b Department of Computer Science and Engineering, York University, Canada
c Department of Computer Science, University of York, UK

Abstract. SCOOP is a minimal extension to the sequential object-oriented programming model for concurrency. The extension consists of one keyword (separate) that avoids explicit thread declarations, synchronized blocks, explicit waits, and eliminates data races and atomicity violations by construction, through a set of compiler rules. SCOOP was originally described for the Eiffel programming language. This paper makes two contributions. Firstly, it presents a design pattern for SCOOP, which makes it feasible to transfer SCOOP’s concepts to different object-oriented programming languages. Secondly, it demonstrates the generality of the SCOOP model by presenting an implementation of the SCOOP design pattern for Java. Additionally, we describe tools that support the SCOOP design pattern, and give a concrete example of its use in Java.

Keywords. Object oriented, concurrency, SCOOP, Java
Introduction

Concurrent programming is challenging, particularly for less experienced programmers who must reconcile their understanding of programming logic with the low-level mechanisms (like threads, monitors and locks) needed to ensure safe, secure and correct code. Brian Goetz (a primary member of the Java Community Process for concurrency) writes:

[...] multicore processors are just now becoming inexpensive enough for midrange desktop systems. Not coincidentally, many development teams are noticing more and more threading-related bug reports in their projects. In a recent post on the NetBeans developer site, one of the core maintainers observed that a single class had been patched over 14 times to fix threading-related problems. Dion Almaer, former editor of TheServerSide, recently blogged (after a painful debugging session that ultimately revealed a threading bug) that most Java programs are so rife with concurrency bugs that they work only “by accident”. [...] One of the challenges of developing concurrent programs in Java is the mismatch between the concurrency features offered by the platform and how developers need to think about concurrency in their programs. The language provides low-level mechanisms such as synchronization and condition waits, but these mechanisms must be used consistently to implement application-level protocols or policies. [1, p.xvii]
Java 6.0 adds richer support for concurrency, but the gap between low-level language mechanisms and programmer cognition persists. The SCOOP (Simple Concurrent Object Oriented Programming) model of concurrency [2,3] contributes towards bridging this gap in the object-oriented context. Instead of providing typical concurrency constructs (e.g., threads, semaphores and monitors), SCOOP operates at a different level of abstraction, providing a single new keyword separate that

1 Corresponding Author: Faraz Torshizi, 10 King's College Road, Toronto, ON, Canada, M5S 3G4. E-mail: [email protected].
8
F. Torshizi et al. / The SCOOP Concurrency Model in Java-like Languages
is used to denote an object running on a different processor. A processor is an abstract notion used to define behavior. Processors can be mapped to virtual machine threads, OS threads, or even physical CPUs (in the case of multi-core systems). Calls within a single processor are synchronous (as in sequential programming), whereas calls to objects on other processors are dispatched asynchronously to those processors for execution while the current execution continues. This is all managed by the runtime, invisibly to the developer. The SCOOP compiler eliminates race conditions and atomicity violations by construction, thus ruling out a large class of concurrency errors via a static compiler check. However, the model is still prone to user-introduced deadlocks. SCOOP also extends the notion of Design by Contract (DbC) to the concurrent setting. Thus, one can specify and reason about concurrent programs with much of the simplicity of reasoning about sequential programs, while preserving concurrent execution [4].
Contributions
The SCOOP model is currently implemented in the Eiffel programming language, making use of Eiffel's built-in DbC support. In this paper, we make two contributions:

1. We provide a design pattern for SCOOP that makes it feasible to apply the SCOOP concurrency model to other object-oriented programming languages, such as Java and C#, that do not have Eiffel's constructs for DbC. The design has a front-end and a back-end. The front-end is aimed at software developers and allows them to write SCOOP code in their favorite language. The back-end (invisible to developers) translates their SCOOP programs into multi-threaded code. The design pattern adds the separate and await keywords using the language's meta-data facility (e.g., Java annotations or C# attributes). The advantage of this approach is that developers can use their favorite tools and compilers for type checking. The translation rules enforced by the design pattern are applied in the back-end.

2. We illustrate the design pattern with a prototype for Java, based on an Eclipse plug-in called JSCOOP. Besides applying the SCOOP design pattern to Java, and allowing SCOOP programs to be written in Java, the JSCOOP plug-in allows programmers to create projects and provides syntax highlighting and compile-time consistency checking for "traitors" that break the SCOOP concurrency model (Section 2.1). The plug-in uses a core library and translation rules for the generic design pattern to translate JSCOOP programs to multi-threaded Java. Compile-time errors are reported at the JSCOOP level using the Eclipse IDE. Currently the Eclipse plug-in supports preconditions (await) but not postconditions. This is sufficient to capture the major effects of SCOOP, but future work will need to address the implementation of full DbC.

In Section 1 we describe the SCOOP model in the Eiffel context. In Section 2 we describe the front-end design decisions for transferring the model to Java and C#.
In Section 3, we describe the back-end design pattern via a core library (see the UML diagram in Fig. 1 on page 8). This is followed by Section 4 which goes over a number of generic translation rules used by the pre-processor to generate multi-threaded code using the core library facilities. In Section 5 we describe the Eclipse plug-in which provides front-end and back-end tool support for JSCOOP (the Java implementation of SCOOP). This shows that the design pattern is workable and allows us to write SCOOP code in Java. In Section 6, we compare our work to other work in the literature.
1. The SCOOP Model of Concurrency

The SCOOP model was described in detail by Meyer [2], and refined (with a prototype implementation in Eiffel) by Nienaltowski [3]. The SCOOP model is built on the basic mechanism of OO computation: the routine call t.r(a), where r is a routine called on a target t with argument a. SCOOP adds the notion of a processor (handler). A processor is an abstract notion used to define behavior — routine r is executed on object t by a processor p. The notion of processor applies equally to sequential and concurrent programs. In a sequential setting, there is only one processor in the system and behaviour is synchronous. In a concurrent setting there is more than one processor. As a result, a routine call may continue without waiting for previous calls to finish (i.e., behaviour is asynchronous). By default, new objects are created on the same processor assigned to handle the current object. To allow the developer to denote that an object is handled by a different processor than the current one, SCOOP introduces the separate keyword. If the separate keyword is used in the declaration of an object t (e.g., an attribute), a new processor p will be created as soon as an instance of t is created. From that point on, all actions on t will be handled by processor p. Assignment of objects to processors does not change over time.

 1  class PHILOSOPHER create
 2     make
 3  feature
 4     left, right: separate FORK
 5     make (l, r: separate FORK)
 6        do
 7           left := l; right := r
 8        end
 9
10     act
11        do
12           from until False loop
13              eat (left, right) -- eating
14              -- thinking
15           end
16        end
17
18     eat (l, r: separate FORK)
19        require not (l.inuse or r.inuse)
20        do
21           l.pickup; r.pickup
22           if l.inuse and r.inuse then
23              l.putdown; r.putdown
24           end
25        end
26  end

Listing 1. PHILOSOPHER class in Eiffel.
The familiar example of the dining philosophers provides a simple illustration of some of the benefits of SCOOP. A SCOOP version of this example is shown in listings 1 and 2. Attributes left and right of type FORK are declared separate, meaning that each fork object is handled by its own processor (thread of control) separate from the current processor (the thread handling the philosopher object). Method calls from a philosopher to a fork object (e.g., pickup) are handled by the fork’s dedicated processor in the order that the calls are received. The system deadlocks when forks are picked up one at a time by competing philosophers. Application programmers may avoid deadlock by recognizing that two forks must
class FORK feature
   inuse: BOOLEAN
   pickup is do inuse := True end
   putdown is do inuse := False end
end

Listing 2. FORK class in Eiffel.
be obtained at the same time for a philosopher to safely eat to completion. We encode this information in an eat method that takes a left and right fork as separate arguments. The eat routine invoked at line 13 waits until both fork processors are allocated to the philosopher and the precondition holds. The precondition (require clause) is treated as a guard that waits for the condition to become true rather than a correctness condition that generates an exception if the condition fails to hold. In our case, the precondition asserts that both forks must not be in use by other philosophers. A scheduling algorithm in the back-end ensures that resources are fairly allocated. Thus, philosophers wait for resources to become available, but are guaranteed that eventually they will become available provided that all reservation methods like eat terminate. Such methods terminate provided they have no infinite loops and have reserved all relevant resources. One of the difficulties of developing multi-threaded applications is the limited ability to re-use sequential libraries. A naive re-use of library classes that have not been designed for concurrency often leads to data races and atomicity violations. The mutual exclusion guarantees offered by SCOOP make it possible to assume a correct synchronization of client calls and focus on solving the problem without worrying about the exact context in which a class will be used. The separate keyword does not occur in class FORK. This shows how classes (such as FORK) written in the sequential context can be re-used without change for concurrent access. Fork methods such as pickup and putdown are invoked atomically from the eat routine (which holds the locks on these forks). Once we enter the body of the eat routine (with a guarantee of locks on the left and right forks and the precondition True), no other processors can send messages to these forks. They are under the control of this routine. 
When a philosopher (the client) calls fork procedures (such as l.pickup), these procedures execute asynchronously (e.g., the pickup routine call is dispatched to the processor handling the fork, and the philosopher continues executing the next instruction). In routine queries such as l.inuse and r.inuse at line 22, however, the client must wait for a result, and thus must also wait for all previous separate calls on the fork to terminate. This is called wait-by-necessity.

2. The Front-end for Java-like Languages

Fair versions of the dining philosophers in pure Java (e.g., see [5, p.137]) are quite complex. The low-level constructs for synchronization add to the complexity of the code. In addition, the software developer must explicitly implement a fair scheduler to manage multiple resources in the system. The scheduling algorithm for a fair Java implementation in [5] requires an extra 50 lines of dense code in comparison to the SCOOP code of listings 1 and 2. Is there a way to obtain the benefits of SCOOP programs for languages such as Java and C#? Our design goal in this endeavor is to allow developers to write SCOOP code using their favorite editors and compilers for type checking.
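For comparison, even a simplified pure-Java sketch of "acquire both forks or neither" (ours, not the fair algorithm of [5]; it ignores fairness and starvation entirely) already forces the developer to juggle explicit locks, time-outs and rollback:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// A hand-rolled (and still unfair) pure-Java version of "pick up both
// forks or neither": the developer must manage locking order, back-off
// and release manually -- everything SCOOP's scheduler does implicitly.
public class PlainJavaPhilosopher {
    static boolean tryEat(ReentrantLock left, ReentrantLock right)
            throws InterruptedException {
        if (left.tryLock(10, TimeUnit.MILLISECONDS)) {
            try {
                if (right.tryLock(10, TimeUnit.MILLISECONDS)) {
                    try {
                        return true; // both forks held: eat
                    } finally { right.unlock(); }
                }
            } finally { left.unlock(); }
        }
        return false; // retry later; fairness is the caller's problem
    }

    public static void main(String[] args) throws InterruptedException {
        ReentrantLock l = new ReentrantLock(), r = new ReentrantLock();
        System.out.println(tryEat(l, r)); // uncontended case succeeds
    }
}
```

Making this fair (so no philosopher starves) is precisely what costs the extra dense code that SCOOP's back-end scheduler absorbs.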
 1  public class Philosopher {
 2     private @separate Fork rightFork;
 3     private @separate Fork leftFork;
 4
 5     public Philosopher (@separate Fork l, @separate Fork r) {
 6        leftFork = l; rightFork = r;
 7     }
 8
 9     public void act() {
10        while (true) {
11           eat(leftFork, rightFork); // non-separate call
12        }
13     }
14
15     @await(pre="!l.isInUse()&&!r.isInUse()")
16     public void eat(@separate Fork l, @separate Fork r) {
17        l.pickUp(); r.pickUp(); // separate calls
18        if (l.isInUse() && r.isInUse()) {
19           l.putDown(); r.putDown();
20        }
21     }
22  }

Listing 3. Example of a SCOOP program written in Java.
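The Fork class used by Listing 3 is not shown in this excerpt; a plain sequential Java counterpart of Eiffel's FORK (Listing 2) might look as follows (our sketch; the method names follow those used in Listing 3):

```java
// Sequential class with no synchronization: under SCOOP, mutual exclusion
// is guaranteed by the locks held on the fork's processor, so the class
// can stay oblivious to concurrency, exactly as FORK does in Eiffel.
public class Fork {
    private boolean inUse;

    public boolean isInUse() { return inUse; }
    public void pickUp()  { inUse = true; }
    public void putDown() { inUse = false; }

    public static void main(String[] args) {
        Fork f = new Fork();
        f.pickUp();
        System.out.println(f.isInUse());  // true
        f.putDown();
        System.out.println(f.isInUse());  // false
    }
}
```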
public class Philosopher {
   [separate] private Fork rightFork;
   [separate] private Fork leftFork;

   public Philosopher ([separate] Fork l, [separate] Fork r) {
      leftFork = l; rightFork = r;
   }

   public void act() {
      while (true) {
         eat(leftFork, rightFork); // non-separate call
      }
   }

   [await("!l.isInUse()&&!r.isInUse()")]
   public void eat([separate] Fork l, [separate] Fork r) {
      l.pickUp(); r.pickUp();
      if (l.isInUse() && r.isInUse()) {
         l.putDown(); r.putDown();
      }
   }
}

Listing 4. Example of a SCOOP program written in C#.
At the front-end, we can satisfy our design goal by using the meta-data facility of modern languages that support concurrency, as shown in listings 3 and 4. The meta-data facility (e.g., annotations in Java and attributes in C#) provides data about the program that is not part of the program itself; it has no direct effect on the operation of the code it annotates. The listings show SCOOP for Java (which we call JSCOOP) and SCOOP for C# (which we call CSCOOP) via the annotation facility of these languages. In Eiffel, the keyword require is standard, and in SCOOP it is re-used for the guard. Thus only one new keyword (separate) is needed. Java and C# do not support contracts.
So we need to add two keywords via annotations: separate and await (for the guard). The front-end uses the Java or C# compiler to do type checking. In addition, the front-end must perform a number of consistency checks to ensure the correct usage of the separate construct (see below).

2.1. Syntax Checking

In SCOOP, method calls are divided into two categories:
• A non-separate call, where the target object of the call is not declared as separate. This type of call is handled by the current processor. Waiting happens only if the parameters of the method contain references to separate objects (i.e., objects handled by different processors) or there is an unsatisfied await condition. Before a non-separate call can execute, two conditions have to hold: (a) locks must have been acquired automatically on all processors handling separate parameters, and (b) the await condition must be true. Locks are passed down further (if needed) and are released at the end of the call. Therefore, atomicity is guaranteed at the level of methods.
• A separate call. In the body of a non-separate method one can safely invoke methods on the separate objects (i.e., objects declared as separate in the parameters). This type of call, where the target of the call is handled by a different processor, is referred to as a separate call. Execution of a separate call is the responsibility of the processor that handles the target object (not the processor invoking the call). Therefore, the processor invoking the call can send a message (call request) to the processor handling the target object and move on to the next instruction without having to wait for the separate call to return. The invoking processor waits for the result only at the point where it is necessary (e.g., the result is needed in an assignment statement); this is referred to as wait-by-necessity.
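Concretely, the two keywords can be declared as ordinary Java annotation types. The sketch below is a minimal declaration consistent with Listing 3; the retention policy and targets are our assumptions, not prescribed by the design pattern:

```java
import java.lang.annotation.*;

public class ScoopAnnotations {
    // Marks a field, parameter, or local variable as handled by its own
    // processor (the JSCOOP counterpart of Eiffel's separate keyword).
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ElementType.FIELD, ElementType.PARAMETER, ElementType.LOCAL_VARIABLE})
    @interface separate {}

    // Carries a method's guard (SCOOP precondition) as a string, as in
    // @await(pre = "!l.isInUse() && !r.isInUse()").
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface await {
        String pre();
    }

    @await(pre = "!l.isInUse() && !r.isInUse()")
    static void eat() {}

    public static void main(String[] args) throws Exception {
        // The back-end pre-processor can read the guard reflectively.
        await a = ScoopAnnotations.class
                .getDeclaredMethod("eat").getAnnotation(await.class);
        System.out.println(a.pre());
    }
}
```

Because annotations are inert, the unmodified Java compiler type-checks this code as usual; only the back-end pre-processor gives the annotations meaning.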
Due to the differences between the semantics of separate and non-separate calls, it is necessary to check that a separate object is never assigned to a variable that is declared as non-separate. Entities declared as non-separate but pointing to separate objects are called traitors. It is important to detect traitors at compile time so that we have a guarantee that remote objects cannot be accessed except through the official locking mechanism, which guarantees freedom from atomicity violations and data races. The checks for traitors must be performed by the front-end. We use the Eclipse plug-in (see Section 5) for JSCOOP to illustrate the front-end, which must check the following separateness consistency (SC) rules [3]:
• SC1: If the source of an attachment (assignment instruction or parameter passing) is separate, its target entity must be separate too. This rule makes sure that the information regarding the processor of a source entity is preserved in the assignment. As an example, line 9 in Listing 5 is an invalid assignment because its source x1 is separate but its target is not. Similarly, the call in line 11 is invalid because the actual argument is separate while the corresponding formal argument is not. There is no rule prohibiting attachments in the opposite direction, from non-separate to separate entities (e.g., line 10 is valid).
• SC2: If an actual argument of a separate call is of a reference type, the corresponding formal argument must be declared as separate. This rule ensures that a non-separate reference passed as an actual argument of a separate call is seen as separate outside the processor boundary. Let's assume that method f of class X (line 22 in Listing 5) takes a non-separate argument and method g takes a separate argument (line 23). The client is not allowed to use its non-separate attribute a as an actual argument of x.f because, from the point of view of x, a is separate. On the other hand, the call x.g(a) is valid.
• SC3: If the source of an attachment is the result of a separate call to a function returning a reference type, the target must be declared as separate. If query q of class X returns a reference, that result should be considered separate with respect to the client object. Therefore, the assignment in line 34 is valid while the assignment in line 35 is invalid.

 1  public class Client {
 2     public static @separate X x1;
 3     public static X x2;
 4     public static A a;
 5
 6     public void sc1() {
 7        @separate X x1 = new X();
 8        X x2 = new X();
 9        x2 = x1;  // invalid: traitor
10        x1 = x2;  // valid
11        r1 (x1);  // invalid
12     }
13
14     public void r1 (X x) {}
15
16     public void sc2() {
17        @separate X x1 = new X();
18        r2 (x1);
19     }
20
21     public void r2 (@separate X x) {
22        x.f(a);  // invalid
23        x.g(a);  // valid
24     }
25
26     public void sc3() {
27        @separate X x1 = new X();
28        s (x1);
29     }
30
31     public void s (@separate X x) {
32        @separate A res1;
33        A res2;
34        res1 = x.q;  // valid
35        res2 = x.q;  // invalid
36     }
37  }

Listing 5. Syntax checking.
A snapshot of the tool illustrating the above compile errors is shown in Fig. 6 on page 16.

3. The Core Library for the Back-end

In this section we focus on the design and implementation of the core library classes shown in Fig. 1; these form the foundation of the SCOOP implementation. The core library provides support for essential mechanisms such as processors, separate and non-separate calls, atomic locking of multiple resources, wait semantics, wait-by-necessity, and fair scheduling. Method calls on an object are executed by its processor. Processors are instances of the Processor class. Every processor has a local call stack and a remote call queue. The local stack is used for storing non-separate calls and the remote call queue is used for storing calls
[Figure 1 (UML class diagram): the Core Library contains the ScoopThread «interface» and the Scheduler, Processor, LockRequest and Call classes. A «Java» package (JSCOOP) «import»s the library, with JScoopThread extending Runnable; a «C#» package (CSCOOP) does likewise, with CScoopThread based on System.Threading.Thread.]
Figure 1. Core library classes associated with the SCOOP design pattern (see Fig. 9 in the appendix for more details).
made by other processors. A processor can add calls to the remote call queue of another processor only when it has a lock on the receiving processor. All calls need to be parsed to extract the method name, return type, parameters, and parameter types. This information is stored in Call objects (acting as wrappers for method calls). The queue and the stack are implemented as lists of Call elements. The ScoopThread interface is implemented by classes whose instances are intended to be executed by a thread, e.g., Processor (whose instances run on a single thread acting as the "processor" executing operations) or the global scheduler Scheduler. This interface allows the rest of the core library to rely on certain methods being present in the translated code. Depending on the language in which SCOOP is implemented, ScoopThread can be redefined to obey the thread creation rules of the host language. For example, the Java-based implementation of this interface (JScoopThread) extends Java's Runnable interface. The Scheduler should be instantiated once for every SCOOP application. This instance acts as the global resource scheduler, and is responsible for checking await conditions and acquiring locks on supplier processors on behalf of client processors. The scheduler manages a global lock request queue where locking requests are stored. The execution of a method by a processor may result in the creation of a call request (an instance of Call) and its addition to the global request queue. It is the job of the SCOOP scheduler to atomically lock all the arguments (i.e., the processors associated with the arguments) of a routine. For example, the execution of eat (Listing 3) blocks until both processors handling the left and right forks have been locked. The maximum number of locks acquired atomically is the same as the maximum number of formal arguments allowed in the language.

Every class that implements the ScoopThread interface must define a run() method. Starting the thread causes the object's run() method to be called in that thread. In its run() method, each Processor repeatedly performs the following actions:
1. If there is an item on the call stack which has a wait condition or a separate argument, the processor sends a lock request (LockRequest) to the global scheduler Scheduler and then blocks until the scheduler sends back the "go-ahead" signal. A lock request maintains a list of processors (corresponding to the separate arguments of the method) that need to be locked, as well as a Semaphore object which allows this processor to block until it is signaled by the scheduler.
2. If the remote call queue is not empty, the processor dequeues an item from the remote call queue and pushes it onto the local call stack.
3. If both the stack and the queue are empty, the processor waits for new requests to be enqueued by other processors.

3.1. Scheduling

When a processor p is about to execute a non-separate routine r that has parameters, it creates a call request and adds this request to the global request queue (handled by the scheduler processor). A call request contains (a) the identity of the requesting processor p, (b) the list of resources (processors) requested to be locked, (c) the await condition, and (d) the routine r itself. When the request is added to the global queue, p waits (blocks) until the request is granted by the SCOOP scheduler. Requests are processed by the SCOOP scheduler in FIFO order. The scheduler first tries to acquire locks on the requested processors on behalf of the requesting processor p.
If all locks are acquired, then the scheduler checks the await condition. According to the original description of the algorithm in [3], if either (a) acquiring the locks on the processors or (b) the wait condition check fails, the scheduler moves on to the next request, leaving the current one intact. If both locking and wait condition checking are successful, the request is granted, giving the client processor permission to execute the body of the associated routine. The client processor releases the locks on the resources (processors) as soon as it finishes executing the routine.

 1  public class ClassA
 2  {
 3     private @separate ClassB b;
 4     private @separate ClassC c;
 5     ...
 6     this.m (b, c);
 7     ...
 8     @await(pre="arg-B.check1()&&arg-C.check2()")
 9     public void m(@separate ClassB arg-B, @separate ClassC arg-C) {
10        arg-B.f(arg-C); // separate call (no waiting)
11        arg-C.g(arg-B); // separate call (no waiting)
12        ...
13     }
14     ...
15  }

Listing 6. Example of a non-separate method m and separate methods f and g.
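The FIFO grant/deny cycle of Section 3.1 can be sketched with java.util.concurrent primitives. All class and method names below are ours, a deliberately simplified stand-in for the core library's Scheduler, LockRequest and lock semaphores:

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.function.BooleanSupplier;

// One lock request: the processors to reserve, the translated guard,
// and the semaphore on which the requesting processor blocks.
class LockRequest {
    final List<Semaphore> locks;      // one lock per requested processor
    final BooleanSupplier awaitCond;  // translated @await condition
    final Semaphore callSemaphore;    // "go-ahead" signal to the requester
    LockRequest(List<Semaphore> l, BooleanSupplier c, Semaphore s) {
        locks = l; awaitCond = c; callSemaphore = s;
    }
}

public class SchedulerSketch {
    final ConcurrentLinkedQueue<LockRequest> queue = new ConcurrentLinkedQueue<>();

    // One pass over the head of the FIFO queue: try to lock every
    // requested processor, then check the guard; on any failure, roll
    // back and leave the request intact, as the algorithm prescribes.
    void schedule() {
        LockRequest req = queue.peek();
        if (req == null) return;
        int got = 0;
        for (Semaphore lock : req.locks) {
            if (lock.tryAcquire()) got++;
            else break;
        }
        if (got == req.locks.size() && req.awaitCond.getAsBoolean()) {
            queue.poll();
            req.callSemaphore.release(); // grant: requester runs the body
        } else {
            for (int i = 0; i < got; i++) req.locks.get(i).release();
        }
    }

    public static void main(String[] args) {
        SchedulerSketch s = new SchedulerSketch();
        Semaphore proc = new Semaphore(1), go = new Semaphore(0);
        s.queue.add(new LockRequest(List.of(proc), () -> true, go));
        s.schedule();
        System.out.println(go.availablePermits()); // 1: request granted
    }
}
```

Note that on a grant the processor locks stay held: as described above, they are released only when the requesting processor finishes the routine body.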
Figure 2. Sequence diagram showing a non-separate call followed by two separate calls involving three processors A, B and C (processor is abbreviated as proc and semaphore as sem).
3.2. Dynamic Behaviour

In order to demonstrate what happens during a separate or non-separate call, let's consider the JSCOOP code in Listing 6. We assume that processor A (handling this of type ClassA) is about to execute line 6. Since method m involves two separate arguments (arg-B and arg-C), A needs to acquire the locks on both processors handling the objects attached to these arguments (i.e., processors B and C). Fig. 2 is a sequence diagram illustrating the actions that happen when executing m. First, A creates a request asking for locks (lock-semaphore-B and lock-semaphore-C) on both processors handling the objects attached to the arguments, and sends this request to the global scheduler. A then blocks on another semaphore (call-semaphore-A) until it receives the "go-ahead" signal from the scheduler. The scheduler acquires both lock-semaphore-B and lock-semaphore-C in a fair manner and safely evaluates the await condition. If the await condition is satisfied, the scheduler releases call-semaphore-A, signaling processor A to continue to the body of m. A can then safely add the separate calls f and g (lines 10 and 11) to the remote call queues of B and C respectively, and continue to the next instructions without waiting for f and g to return. The call semaphore described above is also used for query calls, where wait-by-necessity is needed. After submitting a remote call to the appropriate processor, the caller blocks on the call semaphore. The call semaphore is released by the executing object after the method has terminated and its return value has been stored in the Call object. Once the calling object has been signaled, it can retrieve the return value from the Call object. What happens if the body of method f contains a separate call on the object attached to arg-C (i.e., processor B invoking a call on C)? Since A already has a lock on C, B cannot get that lock and would have to wait until the end of method m.
To avoid such cases, the caller processor "passes" the needed locks to the receiver processor and lets the receiver processor release those locks. Thus, A passes the lock that it holds on C to B, allowing B to continue safely. Locks are passed down further (if needed) and are released at the end of the call.
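The call-semaphore mechanism behind wait-by-necessity can be sketched as follows (the class and method names are ours; the real library stores the result in the Call wrapper, as described above):

```java
import java.util.concurrent.Semaphore;

// Minimal query-call wrapper: the supplier processor stores the result
// and releases the semaphore; the client blocks only when it actually
// needs the value (wait-by-necessity).
public class QueryCall {
    private final Semaphore done = new Semaphore(0);
    private volatile Object result;

    // Executed on the supplier processor's thread when the query returns.
    void complete(Object value) {
        result = value;
        done.release();
    }

    // Executed on the client: blocks until the separate query has finished.
    Object awaitResult() throws InterruptedException {
        done.acquire();
        return result;
    }

    public static void main(String[] args) throws Exception {
        QueryCall call = new QueryCall();
        Thread supplier = new Thread(() -> call.complete(false)); // e.g. l.inuse
        supplier.start();
        System.out.println(call.awaitResult()); // false
        supplier.join();
    }
}
```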
4. Translation Rules

In this section we describe some of the SCOOP translation rules for Java-like languages. These rules take the syntax-checked annotated code (with separate and await) as input and generate pure multi-threaded code (without the annotations) as output. The output uses the core library facilities to create the dynamic behavior required by the SCOOP model. All method calls in SCOOP objects must be examined during translation to determine whether a call is non-separate, separate, requires locks, or has an await condition. If the call is non-separate, requires no locks (i.e., has no separate arguments) and has no await condition, it can be executed as-is. However, if the call is separate, requires locks, or has an await condition, it is flagged for translation. All classes that use separate variables, have await conditions, or are used in a separate context are marked for translation. In addition, the main class (the entry point of the application) is marked for translation. For each marked class A, a corresponding SCOOP class SCOOP_A is created which implements the ScoopThread interface.

4.1. Non-separate Method

Fig. 3 shows the rule for translating non-separate methods taking separate argument(s). To make the rule more generic, we use variables (entities starting with %). The method call (%Method) is invoked on an entity (%VAR2) which is not declared as separate. This call is therefore a non-separate call and should be executed by the current processor. The call takes one or more separate arguments (indicated as %SepArg0 to %SepArgN) and zero or more non-separate arguments (indicated as %Arg0 to %ArgN). The method may have a return value of type %ReturnType. The SCOOP model requires that all formal separate arguments of a method have their processors locked by the scheduler before the method can be executed.
In order to send the request to the scheduler, this processor needs to wrap the call and the required resources in a Call object. The creation of the Call wrapper scoop_call is done in line 20. In order to create such a wrapper we need to (1) create a LockRequest object that contains the list of locks required by this call, i.e., the processors handling the objects attached to the arguments (lines 3–8), (2) capture the arguments and their types (lines 12–18), and (3) capture the return type of this call. We can assume that the classes corresponding to the separate arguments are marked for translation to %SCOOP_Type0 to %SCOOP_TypeN. This allows us to access the remote call queues of the processors associated with these arguments. The new Call object is added to the stack of this processor (line 23) to be processed by the run() method (not shown here). Finally, the current processor blocks on its call semaphore (line 24) until it receives the go-ahead signal (release()) from the scheduler. The thread can safely enter the body of the method (line 26) after the scheduler issues the signal (the body of the method is translated by the separate-call rule). At the end of the method, the lock semaphores on all locked processors are released (lines 28–32).

Annotated code

// Method call on a non-separate object, including method
// calls on 'this'.
Class %ClassName {
   ...
   %ReturnType %VAR1;
   %TargetType %VAR2;
   ...
   [%VAR1 =] %VAR2.%Method(%Type0 %SepArg0, [..., %TypeN %SepArgN,
                           %Type00 %Arg0, ..., %TypeNN %ArgN]);
   ...
}

Translation

 1  Class SCOOP_%ClassName {
 2     ...
 3     List<Processor> locks;
 4     locks.add(%SepArg0.getProcessor());   // loop:
 5     ...                                   // do this for all separate args
 6     locks.add(%SepArgN.getProcessor());   // end loop
 7     lock_request := new LockRequest(%VAR2, locks,
 8        this.getProcessor().getLockSemaphore());
 9
10     List<Object> args_types, args;
11     // loop
12     args_types.add(SCOOP_%Type0);
13     args.add(%SepArg0);
14     ... // do this for all args
15     args_types.add(%TypeNN);
16     args.add(%ArgN);
17     // end loop
18     Semaphore call_semaphore = new Semaphore(0);
19
20     scoop_call := new Call("%Method", args_types, args,
21        %ReturnType, lock_request, %VAR2, call_semaphore);
22
23     getProcessor().addLocalCall(scoop_call);
24     call_semaphore.acquire();
25     // call the translated version of %Method (see Rule 3)
26     [%VAR1 =] %VAR2.translated_%Method(%SepArg0, [..., %SepArgN, %Arg0, ..., %ArgN]);
27
28     // loop
29     locks[0].unlockProcessor();
30     ... // do this for all locked processors
31     locks[N].unlockProcessor();
32     // end loop
33     ...
34  }

Figure 3. Translation rule for invoking a non-separate method taking separate arguments.

4.2. Await Condition

Fig. 4 shows the translation rule that deals with await conditions. The await conditions for all methods of a class are encapsulated in a single boolean method checkPreconditions located in the generated SCOOP_%ClassName. This method is executed only by the scheduler thread (right after the lock semaphores on all requested processors are acquired). The scheduler sets the attribute call of this class and then runs checkPreconditions. In this method, each of the await conditions is translated and adapted to the SCOOP_%ClassName context, i.e., separate variables in the await clause are cast to the corresponding SCOOP types. As an example, if we have the await condition a.m1() == b.m2() where a is a separate variable of type A and b is a non-separate variable of type B, the corresponding await condition translation looks like ((SCOOP_A) a).m1() == b.m2(). checkPreconditions returns true iff the await condition associated with the call evaluates to true.

4.3. Separate Call

This rule describes the translation of a separate method (%SepMethod) invoked on a separate target object %SepArg0 in the body of a non-separate call %Method. Since %Method has at least one argument that is declared to be separate (i.e., %SepArg0), we can assume that the corresponding classes are marked for translation (i.e., %SCOOP_Type0). This allows us to access the remote call queues of the processors associated with those arguments. As in the first rule, we first create a list of processors that need to be locked for %SepMethod (lines 8–12), and create a lock request from that information (line 13). We collect the arguments and their
Annotated code Class % ClassName { ... // method signature % AwaitCondition0 ... % Method0 (% Args0 ...% ArgsA ) { ... } ... % AwaitConditionN ... % MethodN (% Args0 ...% ArgsB ) { ... } }
Translation:

Class SCOOP_%ClassName {
  ...
  public boolean checkPreconditions() {
    Call this_call = call; // set by the scheduler
    String method_name = this_call.getMethodName();
    int num_args = this_call.getNumArgs();
    // "reflection" method to check the wait condition
    if (method_name.equals("%Method0")) {
      if (num_args == A) {
        if (Translated_%AwaitCondition0)
          return true;
        else
          return false;
      }
    }
    ... // this is done for all methods
    else if (method_name.equals("%MethodN")) {
      if (num_args == B) {
        if (Translated_%AwaitConditionN)
          return true;
        else
          return false;
      }
    }
    ...
  }
}
Figure 4. Translation rule for methods with await condition.
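As a concrete, hand-written instance of the rule in Figure 4, the runnable sketch below shows the shape of a generated checkPreconditions method for a class with two methods: eat, guarded by an await condition, and think, unguarded. All names here are hypothetical stand-ins for the %-placeholders; the scheduler-supplied call description is reduced to plain fields.

```java
// Hand-translated sketch of the Figure 4 rule: test the method name and arity
// of the pending call, then evaluate the translated await condition.
class ScoopDinerSketch {
    // stand-ins for the call description normally set by the scheduler
    final String methodName;
    final Object[] args;
    final boolean forkFree; // plays the role of Translated_%AwaitCondition0

    ScoopDinerSketch(String methodName, Object[] args, boolean forkFree) {
        this.methodName = methodName;
        this.args = args;
        this.forkFree = forkFree;
    }

    public boolean checkPreconditions() {
        if (methodName.equals("eat")) {
            if (args.length == 1) {
                return forkFree;  // translated await condition for eat
            }
        } else if (methodName.equals("think")) {
            if (args.length == 0) {
                return true;      // no await condition: always ready
            }
        }
        return false;             // unknown method or wrong arity
    }
}
```

The scheduler can thus re-evaluate the precondition of a queued call without Java reflection proper: the generated dispatch on method name and argument count plays that role.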
types in lines 15–22. The Call object is finally created in line 23. To schedule the call for execution, we add the call object to the end of the remote call queue of the processor responsible for %SepArg0 (line 25). After the request has been sent, the current thread can safely continue to the next instruction without waiting. This rule only deals with void separate calls. If the call returns a value needed by the current processor, the current processor has to wait for the result (wait-by-necessity). This is achieved by waiting on the call semaphore, which the supplier processor releases as soon as it is done with the call. In this section we showed only three of the main translation rules; the rest are available online (http://www.cs.toronto.edu/~faraz/jscoop/rules.pdf). A sample translation of JSCOOP code to Java can be found in the appendix.

5. Tool Support

JSCOOP is supported by an Eclipse plug-in (http://www.cs.toronto.edu/~faraz/jscoop/jscoop.zip) that provides well-formedness checking, as well as syntax highlighting and other typical features of a development environment. Overall, the
Annotated code:

Class %ClassName {
  ...
  // method body containing a separate call
  %ReturnType %Method(%Type0 %SepArg0, ...) {
    ...
    %SepArg0.%SepMethod([%SepArg0, ..., %SepArgN, ...]);
    ...
  }
  ...
}

Class %Type0 {
  ...
  // supplier side
  void %SepMethod(%Type00 %Arg0, ..., %TypeNN %ArgN) { ... }
  ...
}
Translation:

 1  Class SCOOP_%ClassName {
 2    ...
 3    // translated method
 4    %ReturnType %Method(SCOOP_%Type0 %SepArg0, ...) {
 5      ...
 6      Call call;
 7      List<Processor> locks;
 8      // loop
 9      locks.add(%SepArg0.getProcessor());
10      ... // do this for all separate arguments of %SepMethod
11      locks.add(%SepArgN.getProcessor());
12      // end loop
13      lock_request = new LockRequest(%SepArg0, locks,
14          %SepArg0.getProcessor().getLockSemaphore());
15      List<Object> args_types, args;
16      // loop
17      args_types.add(%Type00); // use the SCOOP_ version for separate types
18      args.add(%Arg0);
19      ... // do this for all arguments of %SepMethod
20      args_types.add(%TypeNN);
21      args.add(%ArgN); ...
22      // end loop
23      call = new Call("%SepMethod", args_types, args, void,
24          lock_request, %SepArg0, void);
25      %SepArg0.getProcessor().addRemoteCall(call);
26      ... // move on to the next operation without waiting
27    }
28    ...
29  }
Figure 5. Translation rule for separate methods.
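The queueing and wait-by-necessity machinery that this rule relies on can be sketched with plain java.util.concurrent primitives. This is a minimal illustration, not the JSCOOP runtime: MiniCall and MiniProcessor are hypothetical stand-ins for the generated Call objects and processor threads, and locking and precondition checking are omitted.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Minimal wait-by-necessity sketch: a "processor" thread serves queued calls;
// a client enqueues a call and acquires the call semaphore only when the
// return value is actually needed.
class MiniCall {
    final Supplier<Object> body;               // work to run on the supplier
    final Semaphore done = new Semaphore(0);   // plays the call semaphore role
    volatile Object returnValue;
    MiniCall(Supplier<Object> body) { this.body = body; }
}

class MiniProcessor extends Thread {
    final BlockingQueue<MiniCall> remoteCalls = new LinkedBlockingQueue<>();
    MiniProcessor() { setDaemon(true); }
    void addRemoteCall(MiniCall c) { remoteCalls.add(c); }
    @Override public void run() {
        try {
            while (true) {
                MiniCall c = remoteCalls.take(); // next queued request
                c.returnValue = c.body.get();    // execute on supplier thread
                c.done.release();                // wake a client waiting by necessity
            }
        } catch (InterruptedException e) { /* processor shut down */ }
    }
}
```

A client enqueues a MiniCall and continues immediately; it acquires c.done only at the point where it needs c.returnValue, mirroring the semaphore release by the supplier processor described above.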
plug-in consists of two packages:

1. edu.jscoop: This package is responsible for well-formedness and syntax checking, and for GUI support for JSCOOP project creation. A single visitor class is used to parse JSCOOP files for problem determination.

2. edu.jscoop.translator: This package is responsible for the automated translation of JSCOOP code to Java code. A basic framework for automated translation has been designed, in anticipation of future development that will fully implement this feature.

The plug-in reports errors to the editor on incorrect placement of annotations and checks for violations of the SCOOP consistency rules. We use the Eclipse AST parser to isolate methods decorated with an await annotation. Await conditions, passed to await as strings, are translated into corresponding assert statements which are used to detect problems. As a result,
all valid assert conditions compile successfully when passed to await in the form of a string, while all invalid assert conditions are marked with a problem. For example, on the input @await(pre="x=10"), our tool reports “Type mismatch: cannot convert from Integer to boolean.” All await statements are translated into assert statements and rewritten to a new abstract syntax tree, which is in turn written to a new file. This file is then traversed for any compilation problems associated with the assert statements, which are then reflected back to the original JSCOOP source file for display to the developer. The Eclipse AST parser is also able to isolate separate annotations. We use the Eclipse AST functionality to type-check the parameters passed in a method call by comparing them to the method declaration. Separateness consistency rules SC1 and SC2 are checked when the JSCOOP code is compiled using the plug-in. Fig. 6 illustrates these compile-time checks. In order to process the JSCOOP annotations, we have created a hook into Eclipse's JDT CompilationParticipant. The JSCOOP CompilationParticipant acts on all JSCOOP projects, using the JSCOOP visitor to visit and process all occurrences of separate and await in the source code.

6. Related Work

The Java with Annotated Concurrency (JAC) system [6] is similar in intent and principle to JSCOOP. JAC provides concurrency annotations — specifically, controlled and compatible — that are applicable to sequential program text. JAC is based on an active object model [7]. Unlike JSCOOP, JAC does not provide a wait/precondition construct, arguing instead that both waiting and exceptional behaviour are important for preconditions. Also, via the compatible annotation, JAC provides a means of identifying methods whose execution can safely overlap (without race conditions), i.e., it provides mutual exclusion mechanisms.
JAC also provides annotations for methods to indicate whether calls are synchronous, asynchronous, or autonomous (in the sense of active objects). The former two annotations are subsumed by SCOOP's (and JSCOOP's) separate annotation. Overall, JAC provides additional annotations beyond SCOOP and JSCOOP, thus allowing a greater degree of customisability, while requiring more from the programmer in terms of annotation and guaranteeing less in terms of safety (i.e., through the type safety rules of [3]). Similar annotation-based mechanisms have been proposed for JML [8,9]; the annotation mechanisms for JML are at a different level of abstraction than JSCOOP, focusing on method-level locking (e.g., via a new locks annotation). Of note in the JML extension of [8] is support for model checking via the Bogor toolset. The RCC toolset is an annotation-based extension to Java designed specifically to support race condition detection [10]. Additionally, an extension of Spec# to multi-threaded programs has been developed [11]; the annotation mechanisms in this extension are very strong, in the sense that they provide exclusive access to objects (making an object local to a thread), which may reduce concurrency. The JCSP approach [12] supports a different model of concurrency for Java, based on the process algebra CSP. JCSP defines a Java API and set of library classes for CSP primitives and, unlike JSCOOP and JAC, does not make use of annotations. Polyphonic C# is an annotated version of C# that supports synchronous and asynchronous methods [13]. Polyphonic C# makes use of a set of private messages to underpin its communication mechanisms. It is based on a sound theory (the join calculus), and is now integrated in the Cω toolset from Microsoft Research.
Morales [14] presents the design of a prototype of SCOOP's separate annotation for Java; however, preconditions and general design-by-contract were not considered, nor was support for type safety and well-formedness. More generally, a modern abstract programming framework for concurrent or parallel
Figure 6. Snapshot of the Eclipse plug-in.
programming is Cilk [15]; Cilk works by requiring the programmer to specify the parts of the program that can be executed safely and concurrently; the scheduler then decides how to allocate work to (physical) processors. Cilk is also based, in principle, on the idea of annotation, this time of the C programming language. There are a number of basic annotations, including mechanisms to annotate a procedure call so that it can (but doesn’t have to) operate in parallel with other executing code. Cilk is not yet object-oriented, nor does it provide design-by-contract mechanisms (though recent work has examined extending Cilk to C++). It has a powerful execution scheduler and run-time, and recent work is focusing on minimising and eliminating data race problems. Recent refinements to SCOOP, its semantics, and its supporting tool have been reported.
Ostroff et al. [16] describe how to use contracts for both concurrent programming and rich formal verification in the context of SCOOP for Eiffel, via a virtual machine, thus making it feasible to use model checking for property verification. Nienaltowski [17] presents a refined access control policy for SCOOP. Nienaltowski also presents [4] a proof technique for concurrent SCOOP programs, derived from proof techniques for sequential programs. An interesting semantics for SCOOP using the process algebra CSP is provided in [18]. Brooke [19] presents an alternative concurrency model with similarities to SCOOP that theoretically increases parallelism. Some of the open research questions with regard to SCOOP are addressed in [20]. Our approach allows us to take advantage of these recent SCOOP developments and re-use them in Java and C# settings.

7. Conclusion and Future Work

In this paper, we described a design pattern for SCOOP which enables us to transfer the concepts and semantics of SCOOP from its original instantiation in Eiffel to other object-oriented programming languages such as Java and C#. We have instantiated this pattern in an Eclipse plug-in, called JSCOOP, for handling Java. Our initial experience with JSCOOP has been very positive, allowing us to achieve clean and efficient handling of concurrency independently from the rest of the program. The work reported in this paper should be extended in a variety of directions. First of all, we have not shown the correctness of our translation. While a complete proof would show that the translation is correct for every program, we can get partial validation of correctness by checking the translation on some programs. For example, Chapter 10 of [3] includes many examples of working SCOOP code. We intend to rewrite these programs in JSCOOP and check that they have the same behaviour (i.e., are bisimilar) as programs written directly in Java.
These experiments will also enable us to check the efficiency (e.g., time to translate) and effectiveness (lines of resulting Java code and performance of this code) of our implementation of the Eclipse plug-in. We also need to extend the design pattern and add features to our implementation. Full design-by-contract includes postconditions and class invariants, and we intend to extend our design pattern to allow for these. Further, we need to implement support for inheritance and the full lock passing mechanism of SCOOP [17]. Support for C# also remains future work. SCOOP is free of atomicity violations and race conditions by construction, but it is still prone to user-introduced deadlocks (e.g., an await condition that is never satisfied). Since JSCOOP programs are annotated Java programs, we can re-use existing model checkers and theorem provers to reason about them. For example, we intend to verify the executable code produced by the Eclipse plug-in for the presence of deadlocks using Java PathFinder [21].

Acknowledgements

We would like to thank Kevin J. Doyle, Jenna Lau and Cameron Gorrie for their contributions to the JSCOOP project. This work was conducted under a grant from the Natural Sciences and Engineering Research Council of Canada.

References

[1] Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes, and Doug Lea. Java Concurrency in Practice. Addison-Wesley, May 2006. ISBN-10: 0321349601, ISBN-13: 978-0321349606.
[2] Bertrand Meyer. Object-Oriented Software Construction. Prentice Hall, 1997.
[3] Piotr Nienaltowski. Practical framework for contract-based concurrent object-oriented programming, PhD thesis 17031. PhD thesis, Department of Computer Science, ETH Zurich, 2007.
[4] Piotr Nienaltowski, Bertrand Meyer, and Jonathan Ostroff. Contracts for concurrency. Formal Aspects of Computing, 21(4):305–318, 2009.
[5] Stephen J. Hartley. Concurrent programming: the Java programming language. Oxford University Press, Inc., New York, NY, USA, 1998.
[6] Klaus-Peter Löhr and Max Haustein. The JAC system: Minimizing the differences between concurrent and sequential Java code. Journal of Object Technology, 5(7), 2006.
[7] Oscar Nierstrasz. Regular types for active objects. In OOPSLA, pages 1–15, 1993.
[8] Edwin Rodríguez, Matthew B. Dwyer, Cormac Flanagan, John Hatcliff, Gary T. Leavens, and Robby. Extending JML for modular specification and verification of multi-threaded programs. In ECOOP, pages 551–576, 2005.
[9] Wladimir Araujo, Lionel Briand, and Yvan Labiche. Concurrent contracts for Java in JML. In ISSRE '08: Proceedings of the 2008 19th International Symposium on Software Reliability Engineering, pages 37–46, Washington, DC, USA, 2008. IEEE Computer Society.
[10] Cormac Flanagan and Stephen Freund. Detecting race conditions in large programs. In PASTE '01: Proceedings of the 2001 ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering, pages 90–96, New York, NY, USA, 2001. ACM.
[11] Bart Jacobs, Rustan Leino, and Wolfram Schulte. Verification of multithreaded object-oriented programs with invariants. In Proc. Workshop on Specification and Verification of Component Based Systems. ACM, 2004.
[12] Peter H. Welch, Neil Brown, James Moores, Kevin Chalmers, and Bernhard H. C. Sputh. Integrating and extending JCSP. In CPA, pages 349–370, 2007.
[13] Nick Benton, Luca Cardelli, and Cédric Fournet. Modern concurrency abstractions for C#. ACM Trans. Program.
Lang. Syst., 26(5):769–804, 2004.
[14] Francisco Morales. Eiffel-like separate classes. Java Developer Journal, 2000.
[15] Robert Blumofe, Christopher Joerg, Bradley Kuszmaul, Charles Leiserson, Keith Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. In Proc. Principles and Practice of Parallel Programming. ACM Press, 1995.
[16] Jonathan S. Ostroff, Faraz Ahmadi Torshizi, Hai Feng Huang, and Bernd Schoeller. Beyond contracts for concurrency. Formal Aspects of Computing, 21(4):319–346, 2009.
[17] Piotr Nienaltowski. Flexible access control policy for SCOOP. Formal Aspects of Computing, 21(4):347–362, 2009.
[18] Phillip J. Brooke, Richard F. Paige, and Jeremy L. Jacob. A CSP model of Eiffel's SCOOP. Formal Aspects of Computing, 19(4):487–512, 2007.
[19] Phillip J. Brooke and Richard F. Paige. Cameo: an alternative concurrency model for Eiffel. Formal Aspects of Computing, 21(4):363–391, 2009.
[20] Richard F. Paige and Phillip J. Brooke. A critique of SCOOP. In First International Symposium on Concurrency, Real-Time, and Distribution in Eiffel-like Languages (CORDIE), 2006.
[21] Klaus Havelund and Tom Pressburger. Model checking Java programs using Java PathFinder. Software Tools for Technology Transfer (STTT), 2(4):72–84, 2000.
Appendix
JSCOOP:

...
public void act() {
  ...
  eat(leftFork, rightFork);
  ...
}
Java:

...
public void act() {
  Semaphore j_call_semaphore = new Semaphore(0);
  JSCOOP_Call j_call;
  JSCOOP_Call[] j_wait_calls;
  ...
  // eat(leftFork, rightFork);
  j_req_locks = new LinkedList<JSCOOP_Processor>();
  j_req_locks.add(leftFork.getProcessor());
  j_req_locks.add(rightFork.getProcessor());
  j_lock_request = new JSCOOP_LockRequest(this, j_req_locks,
      this.getProcessor().getLockSemaphore());
  j_arg_types = new Class[2];
  j_arg_types[0] = leftFork.getClass();
  j_arg_types[1] = rightFork.getClass();
  j_args = new Object[2];
  j_args[0] = leftFork;
  j_args[1] = rightFork;
  j_call = new JSCOOP_Call("eat", j_arg_types, j_args, null,
      j_lock_request, this, j_call_semaphore);
  getProcessor().addLocalCall(j_call);
  j_call_semaphore.acquire();
  eat(leftFork, rightFork);
  scheduler.releaseLocks(j_call);
  ...
}
Figure 7. Mapping from Philosopher to JSCOOP Philosopher: calling eat method.
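The handshake in Figure 7 — post a lock request, block on the call semaphore until the scheduler grants it, then execute eat locally — can be distilled into a small runnable sketch. MiniLockRequest and MiniScheduler are hypothetical stand-ins for JSCOOP_LockRequest and JSCOOP_Scheduler; precondition checking and actual processor locking are elided.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;

// Sketch of the local-call handshake: the client posts a lock request and
// blocks on a semaphore; a scheduler thread grants requests in order and
// releases the semaphore, after which the client itself runs the body.
class MiniLockRequest {
    final Semaphore granted = new Semaphore(0);
}

class MiniScheduler extends Thread {
    final BlockingQueue<MiniLockRequest> requests = new LinkedBlockingQueue<>();
    MiniScheduler() { setDaemon(true); }
    void addRequest(MiniLockRequest r) { requests.add(r); }
    @Override public void run() {
        try {
            while (true) {
                // a real scheduler would check preconditions and lock the
                // requested processors before granting
                requests.take().granted.release();
            }
        } catch (InterruptedException e) { /* shut down */ }
    }
}
```

The client's `j_call_semaphore.acquire()` in Figure 7 corresponds to acquiring `granted` here; once it returns, the locks are held and eat runs on the client's own thread.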
JSCOOP:

@await(pre = "!l.isInUse() && !r.isInUse()")
public void eat(@separate Fork l, @separate Fork r) {
  l.pickUp();
  r.pickUp();
  if (l.isInUse() && r.isInUse()) {
    status = 2;
  } ...
} ...
Java:

public boolean checkPreconditions() {
  ...
  if (method_name.equals("eat")) {
    ...
    if ((((JSCOOP_Fork) args[0]).isInUse() == false) &&
        (((JSCOOP_Fork) args[1]).isInUse() == false))
      return true;
    else
      return false;
  } else if (method_name.equals(...)) {
    ...
  }
}

public void eat(JSCOOP_Fork l, JSCOOP_Fork r) {
  JSCOOP_Call[] j_wait_calls;
  Semaphore j_call_semaphore = new Semaphore(0);
  JSCOOP_Call j_call;
  // l.pickUp();
  j_req_locks = new LinkedList<JSCOOP_Processor>();
  j_lock_request = new JSCOOP_LockRequest(l, j_req_locks,
      l.getProcessor().getLockSemaphore());
  j_arg_types = null;
  j_args = null;
  j_call = new JSCOOP_Call("pickUp", j_arg_types, j_args, null,
      j_lock_request, l, null);
  l.getProcessor().addRemoteCall(j_call);
  // r.pickUp();
  ...
  // if (l.isInUse() && r.isInUse())
  j_wait_calls = new JSCOOP_Call[2];
  j_req_locks = new LinkedList<JSCOOP_Processor>();
  j_lock_request = new JSCOOP_LockRequest(l, j_req_locks,
      l.getProcessor().getLockSemaphore());
  j_arg_types = null;
  j_args = null;
  j_wait_calls[0] = new JSCOOP_Call("isInUse", j_arg_types, j_args,
      Boolean.class, j_lock_request, l, j_call_semaphore);
  j_req_locks = new LinkedList<JSCOOP_Processor>();
  j_lock_request = new JSCOOP_LockRequest(r, j_req_locks,
      r.getProcessor().getLockSemaphore());
  j_arg_types = null;
  j_args = null;
  j_wait_calls[1] = new JSCOOP_Call("isInUse", j_arg_types, j_args,
      Boolean.class, j_lock_request, r, j_call_semaphore);
  l.getProcessor().addRemoteCall(j_wait_calls[0]);
  r.getProcessor().addRemoteCall(j_wait_calls[1]);
  // (wait-by-necessity)
  j_call_semaphore.acquire(2);
  // Execute the if statement with the returned values
  if ((Boolean) j_wait_calls[0].getReturnValue()
      && (Boolean) j_wait_calls[1].getReturnValue())
    status = 2;
  ...
}
Figure 8. Mapping from Philosopher to JSCOOP Philosopher.
[Figure 9 is a UML class diagram. It shows the core library — the ScoopThread interface and the Processor, Scheduler, Call and LockRequest classes — and its two instantiations, imported by JSCOOP (JSCOOP_Processor, JSCOOP_Scheduler, JSCOOP_Call, JSCOOP_LockRequest) and CSCOOP (CS_Processor, CS_Scheduler, CS_Call, CS_LockRequest), together with Java's Runnable interface and .NET's System.Threading. The example classes JSCOOP_Philosopher and CS_Philosopher (with eat, run, think and live methods and left/right JSCOOP_Fork references) build on these libraries.]
Figure 9. Core library classes.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-29
Combining Partial Order Reduction with Bounded Model Checking

José VANDER MEULEN and Charles PECHEUR
Université catholique de Louvain, Louvain-la-Neuve, Belgium
{jose.vandermeulen, charles.pecheur}@uclouvain.be

Abstract. Model checking is an efficient technique for verifying properties on reactive systems. Partial-order reduction (POR) and symbolic model checking are two common approaches to deal with the state space explosion problem in model checking. Traditionally, symbolic model checking uses BDDs, which can suffer from space blowup. More recently, bounded model checking (BMC) using SAT-based procedures has been used as a very successful alternative to BDDs. However, this approach gives poor results when it is applied to models with a lot of asynchronism. This paper presents an algorithm which combines partial-order reduction methods and bounded model checking techniques in an original way that allows efficient verification of temporal logic properties (LTL\X) on models featuring asynchronous processes. The encoding to a SAT problem strongly reduces the complexity and non-determinism of each transition step, allowing efficient analysis even with longer execution traces. The starting point of our work is the Two-Phase algorithm (Nalumasu and Gopalakrishnan), which performs partial-order reduction on process-based models. At first, we adapt this algorithm to the bounded model checking method. Then, we describe our approach formally and demonstrate its validity. Finally, we present a prototype implementation and report encouraging experimental results on a small example.
Introduction

Model checking is a technique used to verify concurrent systems such as distributed applications and communication protocols. It has a number of advantages. In particular, model checking is automatic and usually quite fast. Also, if the design contains an error, model checking will produce a counterexample that can be used to locate the source of the error [1]. In the 1980s, several researchers introduced very efficient temporal logic model checking algorithms. McMillan achieved a breakthrough with the use of symbolic representations based on Ordered Binary Decision Diagrams (BDDs) [2]. By using symbolic model checking algorithms, it is possible to verify systems with a very large number of states [3]. Nevertheless, the size of the BDD structures themselves can become unmanageable for large systems. Bounded Model Checking (BMC) uses SAT solvers instead of BDDs to search for errors on bounded execution paths [4]. BMC offers the advantage of polynomial space complexity and has proven to provide competitive execution times in practice. A common approach to verify a concurrent system is to compute the product finite-space description of the processes involved. Unfortunately, the size of this product is frequently prohibitive due, among other causes, to the modelling of concurrency by interleaving. The aim of partial order reduction (POR) techniques is to reduce the number of interleaving sequences that must be considered. When a specification cannot distinguish between two interleaving sequences that differ only by the order in which concurrently executed events are taken, it is sufficient to analyse one of them [5].
This paper presents a technique which combines the BMC and POR methods for verifying linear temporal logic properties. We start from the Two-Phase algorithm (TP) of Nalumasu and Gopalakrishnan [6]. We merge a variant of TP with the BMC procedure. This allows the verification of models featuring asynchronous processes. Intuitively, from a model and a property, the BMC method constructs a propositional formula which represents a finite unfolding of the transition relation and the property. Our method proceeds in the same way, but instead of using the entire transition relation during the unfolding of the model, we use only a safe subset based on POR considerations. This produces a propositional formula which is well suited for most modern SAT solvers. In contrast, our previous work introduced an algorithm which merges POR and BDD-based model checking to verify branching temporal logic properties [7]. To assess the validity of our approach, we start by introducing two methods which can be combined for transforming computation trees. The POR method captures partial-order reduction criteria [8,5,1]. The idle-extension shows how a finite number of transitions can be added while also preserving temporal logic properties. Then, the Stuttering Bounded Two-Phase (SBTP) reduction is introduced as a particular instance of a combination of these two methods, inspired by TP. Finally, we present how a finite unfolding of SBTP is encoded as a propositional formula suitable for BMC. The remainder of the paper is structured as follows. Section 1 recalls background concepts, definitions and notations that are used throughout the paper: bounded model checking, bisimulations and POR. In Section 2, two transformations of computation trees which preserve CTL*\X properties are presented, as well as the SBTP algorithm and its transformation to a BMC problem. Section 3 presents the extension of our prototype implementing the BMC of SBTP method.
In Section 4, we present the results obtained by applying our method to a case study. Section 5 reviews related work. Finally, Section 6 gives conclusions as well as directions for future work.

1. Background

1.1. Transition Systems

A transition system, a particular class of state machine, represents the behavior of a system. A state of the transition system is a snapshot of the system at a particular time; formally, each state is labelled with atomic propositions. The actions performed by the system are modeled by means of transitions between states; formally, each transition carries a label which represents the performed action [1]¹. In the rest of this paper, we assume a set AP of atomic propositions and a set A of transitions. Without loss of generality, the set AP can be restricted to the propositions that appear in the property to be verified on the system.

Definition 1 (Transition System). Given a set of transitions A and a set of atomic propositions AP, a transition system (over A and AP) is a structure M = (S, T, s0, L) where S is a finite set of states, s0 ∈ S is an initial state², T ⊆ S × A × S is a transition relation and L : S → 2^AP is an interpretation function over states.
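Definition 1 maps directly onto a small data structure. The sketch below (our own representation, not from [1]) also implements the derived notions of the transitions enabled in a state and of a transition being deterministic in a state, which are used next.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// A labelled transition system: states and transition labels are strings,
// and the relation T is stored as successor lists (alpha, s') per state.
class TransitionSystemSketch {
    final Map<String, List<String[]>> out = new HashMap<>(); // s -> [(alpha, s')]
    final Map<String, Set<String>> label = new HashMap<>();  // L : S -> 2^AP
    String initial;                                          // s0

    void addTransition(String s, String alpha, String t) {
        out.computeIfAbsent(s, k -> new ArrayList<>()).add(new String[]{alpha, t});
    }

    // enabled(s): labels alpha with some s' such that (s, alpha, s') in T
    Set<String> enabled(String s) {
        Set<String> result = new HashSet<>();
        for (String[] tr : out.getOrDefault(s, new ArrayList<>()))
            result.add(tr[0]);
        return result;
    }

    // alpha is deterministic in s iff it leads to at most one successor
    boolean deterministic(String s, String alpha) {
        int count = 0;
        for (String[] tr : out.getOrDefault(s, new ArrayList<>()))
            if (tr[0].equals(alpha)) count++;
        return count <= 1;
    }
}
```

Totality of T then amounts to requiring `enabled(s)` to be non-empty for every state s.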
We write s −α→ s′ for (s, α, s′) ∈ T. A transition α is enabled in a state s iff there is a state s′ such that s −α→ s′. We write enabled(s, T) for the set of enabled transitions of T in s. When the context is clear, we write enabled(s) instead of enabled(s, T). We assume that T

¹ Our treatment differs slightly from [1], which views T as a set of (unlabelled) transition relations α ⊆ S × S. Using labelled transitions amounts to the same structures and is mathematically cleaner to define.
² For simplicity, s0 is a single state. All the arguments of this paper can be easily generalized to many initial states (i.e., S0 ⊆ S).
is total (i.e., enabled(s) ≠ ∅ for all s ∈ S). A transition α is deterministic in a state s iff there is at most one s′ such that s −α→ s′. A transition α ∈ A is invisible if for each pair of states s, s′ ∈ S such that s −α→ s′, L(s) = L(s′). A transition is visible if it is not invisible. An execution path of M is an infinite sequence of consecutive transition steps s0 −a0→ s1 −a1→ s2 −a2→ ···. A computation tree can be built from a transition system M: s0 ∈ S is the root of a tree that unwinds all the possible executions from that initial state [1]. The computation tree of M (CT(M)) is itself a transition system and is essentially equivalent to M, in a sense that will be made precise below.

1.2. Model Checking

This section briefly introduces model checking; for more details, we refer the reader to [1]. Model checking is an automatic technique to verify that a concurrent system, such as a distributed application or a communication protocol, satisfies a given property. Intuitively, the system is modeled as a finite transition system, and model checking performs an exhaustive exploration of the resulting state graph to fulfill the verification. If the system violates the property, model checking will generate a counterexample which helps to locate the source of the error. A common approach to verify a concurrent system is to compute the combined finite-space description of the processes involved. Unfortunately, the size of this combination can grow exponentially, due to all the different interleavings among the executions of all the processes. Partial Order Reduction (POR) techniques reduce the number of interleaving sequences that must be considered. When a specification cannot distinguish between two interleaving sequences that differ only by the order in which concurrently executed events are taken, it is sufficient to analyse one of them [5]. Temporal logic is used to express the properties to be verified.
In addition to the elements of propositional logic, this logic provides temporal operators for reasoning over different steps of the execution. There are several types of temporal logics, such as linear temporal logic (LTL), computation tree logic (CTL), or CTL*, which subsumes both LTL and CTL. For instance, LTL formulæ are interpreted over each execution path of the model. In LTL, Gϕ (globally ϕ) says that ϕ will hold in all future states, Fϕ (finally ϕ) says that ϕ will hold in some future state, ϕ U ψ (ϕ until ψ) says that ψ will hold in some future state and that ϕ holds in every preceding state, and Xϕ (next ϕ) says that ϕ is true in the next state. In this paper we consider LTL\X, the fragment of LTL without the X operator. Similarly, CTL*\X (resp. CTL\X) is the fragment of CTL* (resp. CTL) without the X operator. By using temporal logic model checking algorithms, we can check automatically whether a given system, modeled as a transition system, satisfies a given temporal logic property. In the 1980s, very efficient temporal logic model checking algorithms were introduced for these logics [9,10,11,12]. For instance, to check whether a system M satisfies an LTL property ϕ, the algorithm presented in [10] constructs an automaton B over infinite words (a Büchi automaton) from the negation of ϕ [13]. Then it searches for violations of ϕ by checking the executions of the state graph which results from the combination of M and B.

1.3. Bounded Model Checking
The idea of BMC is to characterize an error
32 J. Vander Meulen and C. Pecheur / Combining Partial Order Reduction with Bounded Model Checking
execution path of length k as a propositional formula, and to search for solutions to that formula with a SAT solver. This formula is obtained by combining a finite unfolding of the system's transition relation with an unfolding of the negation of the property being verified. The latter is obtained on the basis of expansion equivalences such as p U q ≡ q ∨ (p ∧ X(p U q)), which allow us to propagate across successive states the constraints corresponding to the violation of the LTL property. If no counterexample is found, k is incremented and a new execution path is searched for. This process continues until a counterexample is found or some limit is reached.

BMC allows checking LTL properties of a system. Since BMC works on finite paths, an approximate bounded semantics of LTL is defined. Intuitively, the bounded semantics treats differently paths with a back-loop (cf. Figure 1(a)) and paths without such a back-loop (cf. Figure 1(b)). The former can be seen as an infinite path formed by a finite number of states; in this case the classical semantics of LTL can be applied. In contrast, the latter is a finite prefix of an infinite path. In some cases, such a prefix π is sufficient to show that a path violates a property f. For instance, let f be the property Gp. If π contains a state which does not satisfy p, then all paths which start with the prefix π violate Gp.
[Figure: panel (a) shows a bounded path over states si, sk with a back-loop to a state sl; panel (b) shows a bounded path over states si, sk without a back-loop.]
Figure 1. The two cases for a bounded path [4].
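The bounded-semantics observation above (a loop-free prefix containing a ¬p state refutes Gp on every extension) and the iterative-deepening loop can be illustrated with a brute-force analogue of BMC. The real method of [4] delegates the search over paths to a SAT solver; this sketch (names and structure are ours) simply enumerates prefixes of increasing length:

```python
# Brute-force analogue of the BMC loop for the property G p:
# unfold paths of length k = 0, 1, 2, ... and stop as soon as a
# loop-free prefix contains a state falsifying p.
def bmc_globally(transitions, labels, s0, p, max_k):
    """Return a prefix refuting G p, or None up to bound max_k."""
    frontier = [[s0]]
    for k in range(max_k + 1):
        for path in frontier:
            if not p(labels[path[-1]]):
                return path        # one bad state refutes G p on every extension
        # deepen: extend every prefix by one transition (iterative deepening)
        frontier = [path + [t]
                    for path in frontier
                    for (s, t) in transitions if s == path[-1]]
    return None

trans = {("s0", "s1"), ("s1", "s2"), ("s2", "s0")}
labels = {"s0": {"p"}, "s1": {"p"}, "s2": set()}
assert bmc_globally(trans, labels, "s0", lambda l: "p" in l, 5) == ["s0", "s1", "s2"]
```

In actual BMC, the frontier is never enumerated: the whole set of length-k paths is encoded symbolically and handed to the solver at once.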
The propositional formula [[M, ¬f]]_k which is submitted to the SAT solver is constructed as follows, where f is the LTL property to be verified.

Definition 2 (BMC encoding). Given a transition system M = (S, T, s0, L), an LTL formula f, and a bound k ∈ N:

[[M, ¬f]]_k = [[M]]_k ∧ ( (¬L_k ∧ [[¬f]]) ∨ ⋁_{l=0}^{k} (_lL_k ∧ _l[[¬f]]) )   where
• [[M]]_k is a propositional formula which represents the unfolding of k steps of the transition relation,
• _lL_k is a propositional formula which is true iff there is a transition from s_k to s_l,
• L_k is a propositional formula which is true iff there exists an l such that _lL_k,
• [[¬f]] is a propositional formula which is the translation of ¬f when [[M]]_k does not contain any back-loop, and
• _l[[¬f]] is a propositional formula which is the translation of ¬f when [[M]]_k does contain a back-loop to state s_l.

It is shown in [4] that if M ⊭ f then there is a k ≥ 0 such that [[M, ¬f]]_k is satisfiable. Conversely, if [[M, ¬f]]_k has no solutions for any k then M |= f (actually, [4] shows that it is sufficient to look for bounded solutions of [[M, ¬f]]_k up to a bound k ≤ K which depends on f and M).

Given a propositional formula p produced by the BMC encoding, a SAT solver decides whether p is satisfiable. If it is, a satisfying assignment is returned that describes a path violating the property. Most SAT solvers apply a variant of the Davis-Putnam-Logemann-Loveland (DPLL) algorithm [15]. Intuitively, DPLL alternates between two phases. The first one chooses a value for some variable. The second one propagates the implications of
this decision that are easy to infer. This method is known as unit propagation. The algorithm backtracks when a conflict is reached.

1.4. Bisimulation Relations

A bisimulation is a binary relation between two transition systems M and M′. Intuitively, a bisimulation can be constructed between two systems if each can simulate the other. For instance, bisimulation techniques are used in model checking to reduce the number of states of M while preserving some class of properties (e.g. LTL_X, CTL_X, ...). The literature proposes a large number of variants of bisimulation relations [16,8]. This section describes the two kinds of bisimulation relations used in the sequel.

1.4.1. Bisimulation Equivalence

Bisimulation equivalence is the classical notion [16], here adapted to transition systems by requiring identical state labellings, which ensures that CTL* properties are preserved [1]. Intuitively, bisimulation equivalence groups states that are impossible to distinguish, in the sense that both have the same labelling and offer the same transitions leading to equivalent states.

Definition 3 (Bisimulation Equivalence). Let M = (S, T, s0, L) and M′ = (S′, T′, s′0, L′) be two structures with the same set of atomic propositions AP. A relation B ⊆ S × S′ is a bisimulation relation between M and M′ if and only if for all s ∈ S and s′ ∈ S′, if B(s, s′) then the following conditions hold:

• L(s) = L′(s′).
• For every state s1 ∈ S such that s −α→ s1 there is an s′1 ∈ S′ such that s′ −α→ s′1 and B(s1, s′1).
• For every state s′1 ∈ S′ such that s′ −α→ s′1 there is an s1 ∈ S such that s −α→ s1 and B(s1, s′1).

M and M′ are bisimulation-equivalent iff there exists a bisimulation relation B such that B(s0, s′0).

In [1] it is shown that unwinding a structure results in a bisimulation-equivalent structure. So we conclude that the computation tree generated from a model M is bisimulation-equivalent to M.
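Definition 3 suggests a naive fixpoint check, which we sketch below for illustration (this is not the paper's algorithm; efficient checkers use partition refinement instead): start from the label-respecting relation and discard pairs until both transfer conditions hold.

```python
# Naive bisimulation-equivalence check: greatest fixpoint by pair removal.
def bisimulation(S1, T1, L1, S2, T2, L2):
    """Return the largest bisimulation relation between (S1,T1,L1) and (S2,T2,L2)."""
    rel = {(s, t) for s in S1 for t in S2 if L1[s] == L2[t]}
    changed = True
    while changed:
        changed = False
        for (s, t) in set(rel):
            # every a-move of s must be matched by t, and vice versa
            ok = (all(any((t, a, t1) in T2 and (s1, t1) in rel for t1 in S2)
                      for (p, a, s1) in T1 if p == s) and
                  all(any((s, a, s1) in T1 and (s1, t1) in rel for s1 in S1)
                      for (q, a, t1) in T2 if q == t))
            if not ok:
                rel.discard((s, t))
                changed = True
    return rel

# a two-state loop compared with itself: states relate only to themselves,
# since their labellings differ
S = {"1", "2"}
T = {("1", "a", "2"), ("2", "b", "1")}
L = {"1": {"p"}, "2": set()}
rel = bisimulation(S, T, L, S, T, L)
assert ("1", "1") in rel
assert ("1", "2") not in rel
```

The two structures are bisimulation-equivalent iff the pair of initial states survives in the final relation.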
Furthermore, bisimulation equivalence preserves CTL* properties, as shown in [1]. Figures 2(a) and 2(b) are bisimulation-equivalent: each dashed oval groups a state of Figure 2(b) with the corresponding state of Figure 2(a) (e.g. B(1, 3)). On the other hand, Figures 2(a) and 2(c) are not bisimulation-equivalent, because node 7 in Figure 2(c) does not correspond to any state in Figure 2(a).

1.4.2. The Visible Bisimulation

Visible bisimulation is a weaker equivalence that only preserves CTL*_X properties, and thus also CTL_X and LTL_X properties. Our POR methods preserve visible bisimilarity and therefore those logics. Intuitively, a visible bisimulation associates two states s and t that are impossible to distinguish, in the sense that if a visible action a is reachable in the future of s, then a also belongs to the future of t.

Definition 4 (Visible Bisimulation [8]). A relation B ⊆ S × S′ is a visible simulation between two structures M = (S, T, s0, L) and M′ = (S′, T′, s′0, L′) iff B(s0, s′0) and for every s ∈ S, s′ ∈ S′ such that B(s, s′), the following conditions hold:

1. L(s) = L′(s′).
Figure 2. Bisimilar and nonbisimilar structures.

2. Let s −a→ t. There are two cases:
   • a is invisible and B(t, s′), or
   • there exists a path s′ −c_0→ s′_1 −c_1→ ··· −c_{n−1}→ s′_n −a′→ t′ in M′, such that B(s, s′_i) for 0 < i ≤ n, c_i is invisible for 0 ≤ i < n, and B(t, t′). Furthermore, if a is visible, then a′ = a; otherwise, a′ is invisible.
3. If there is an infinite path s = s0 −a_0→ s1 −a_1→ ··· in M, where all a_i are invisible and B(s_i, s′) for i ≥ 0, then there exists a transition s′ −c→ t′ such that c is invisible and, for some j > 0, B(s_j, t′).
B is a visible bisimulation iff both B and B⁻¹ are visible simulations. M and M′ are visibly-bisimilar iff there is a visible bisimulation B.

Figures 3(a) and 3(b) are visibly-bisimilar. To see this, we construct the relation which groups together the states of Figure 3(a) and the states of Figure 3(b) that are linked by a dashed line. The actions a and b can be executed in either order leading to the same result, from the standpoint of verification. Figures 3(a) and 3(c) are not visibly-bisimilar: node 12 in Figure 3(c) does not correspond to any state in Figure 3(a).
Figure 3. Visibly-bisimilar and not visibly-bisimilar structures.
There also exists a weaker bisimulation, called stuttering bisimulation, and the POR literature is generally based on it. In [8], it is shown that a visible bisimulation is also a stuttering bisimulation, and also preserves CTL*_X properties. In general, it is easier to reason about visible bisimulation than about stuttering bisimulation, because the former involves an argument about states and the latter
involves an argument about infinite paths. Indeed, the POR method applied in the sequel produces, from a model M, a reduced graph which is visibly-bisimilar to M.

1.5. Partial-Order Reduction

The goal of partial-order reduction (POR) methods is to reduce the number of states explored by model checking, by avoiding the exploration of different equivalent interleavings of concurrent events [8,5,1]. Naturally, these methods are best suited to strongly asynchronous programs. The interleavings that must be preserved may depend on the property to be checked.

Partial-order reduction is based on the notions of independence between transitions and invisibility of a transition. Two transitions are independent if they do not disable one another and executing them in either order results in the same state. Intuitively, if two independent transitions α and β are invisible w.r.t. the property f that one wants to verify, then it does not matter whether α is executed before or after β, because they lead to the same state and do not affect the truth of f. Partial-order reduction consists in identifying such situations and restricting the exploration to one of the two alternatives.

In effect, POR amounts to exploring a reduced model M′ = (S′, T′, s0, L) with S′ ⊆ S and T′ ⊆ T. In practice, classical POR algorithms [5,1] execute a modified depth-first search (DFS). At each state s, an adequate subset ample(s) of the transitions enabled in s is explored. To ensure that this reduction is adequate, that is, that verification results on the reduced model hold for the full model, ample(s) has to respect a set of conditions, based on the independence and invisibility notions previously defined. In some cases, all enabled transitions have to be explored. The following conditions are set forth in [1,8]:

C0 ample(s) = ∅ if and only if enable(s) = ∅.
C1 Along every path in the full state graph that starts at s, a transition that is dependent on a transition in ample(s) cannot be executed without a transition in ample(s) occurring first.
C2 If ample(s) ≠ enabled(s), then all transitions in ample(s) are invisible.
C3 A cycle is not allowed if it contains a state in which some transition α is enabled but is never included in ample(s) for any state s on the cycle.

On finite models, conditions C0, C1, C2 and C3 are sufficient to guarantee that the reduced model preserves properties expressed in LTL_X. On infinite models (such as computation trees), condition C3 must be rephrased as the following condition, which intuitively states that all the transitions in enabled(s) will eventually be expanded.

C3b An infinite path is not allowed if it contains a state in which some transition α is enabled but is never included in ample(s) for any state s on the path.

To demonstrate that C0, C1, C2 and C3b preserve LTL_X properties, an argument similar to the one presented in [1] can be used. The only difference is the method applied to demonstrate Lemma 28 of [1], which can be proved using condition C3b instead of condition C3.

Ensuring preservation of branching temporal logics requires an additional constraint which is significantly more restrictive [8]:

C4 If ample(s) ≠ enabled(s), then ample(s) contains only one transition, which is deterministic in s.

When conditions C0 to C4 are satisfied, [8] shows that there is a visible bisimulation between the complete and reduced models, which ensures preservation of CTL*_X properties (and thus CTL_X and LTL_X). The same argument can be used to demonstrate that there is
also a visible bisimulation between the full and the reduced state graphs when conditions C0, C1, C2, C3b and C4 are satisfied.

Conditions C1 and C2 depend on the whole state graph and are not directly exploitable in a verification algorithm. Instead, one uses sufficient conditions, typically derived from the structure of the model description, to safely decide where reduction can be performed.

1.6. Process Model

In the sequel, we assume a process-oriented modeling language. We define a process model as a refinement of a transition system:

Definition 5 (Process Model). Given a transition system M = (S, T, s0, L), a process model consists of a finite set P of m processes p0, p1, ..., p_{m−1}. For each p_i, we define safe deterministic actions A_i ⊆ A and safe deterministic transitions T_i = T ∩ (S × A_i × S) such that for all a ∈ A_i, a is invisible, and for all s ∈ S: ample(s) = enable(s, T_i) = {a} satisfies conditions C1 and C4.

All T_i contain only safe deterministic transitions. Given a state s and a T_i, s is safe deterministic w.r.t. T_i if and only if enable(s, T_i) ≠ ∅.

For instance, consider a concurrent program S composed of a finite number m of threads. Each thread has exclusive access to some local variables, and all threads can read and write some global variables. This program can be translated into a process model M. The translation procedure may translate each thread of S into a process p_i. In particular, a (deterministic) instruction of p_i that affects only a local variable x (e.g. x = 3) will meet the conditions of Definition 5 and can be modelled as a safe deterministic action a_{x=3} ∈ A_i. Indeed, every transition s −a_{x=3}→ t resulting from the execution of that instruction will be safe deterministic. Thus, ample(s) = enable(s, T_i) = {a_{x=3}} is a valid ample set for POR.

1.7. The Two-Phase Approach to Partial Order Reduction

This section presents the Two-Phase algorithm (TP), which was first introduced in [6].
Starting from a model M, it generates a reduced model M′ which is visibly-bisimilar to M. It is a variant of the classical DFS algorithm with POR [5,1]. It alternates between two distinct phases:

• Phase-1 only expands safe deterministic transitions, considering each process in turn, in a fixed order. As long as a process is deterministic, the single transition that is enabled for that process is executed. Otherwise, the algorithm moves on to the next process. After expanding all processes, the last reached state is passed on to Phase-2.
• Phase-2 performs a full expansion of the state resulting from Phase-1, then applies Phase-1 recursively to all reached states.

To avoid postponing a transition indefinitely, at least one state is fully expanded on each cycle in the reduced state space. Such an indefinite postponing can only arise within Phase-1. It is handled by detecting cycles within the current Phase-1 expansion. When such a cycle is detected, the algorithm moves on to the next process or to Phase-2.

As shown in [17], the Two-Phase algorithm produces a reduced state space which is visibly-bisimilar to the whole one and therefore preserves CTL*_X properties. This follows from the fact that TP is a classical DFS algorithm with POR and that ample(s) meets conditions C0 to C4 of Section 1.5.
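The alternation of the two phases can be sketched as follows. This is a simplified illustration of the idea (not the algorithm of [6] in full: we represent each process's safe deterministic step by a hypothetical helper det_succ, and full expansion by full_succ):

```python
# Sketch of Two-Phase exploration: Phase-1 chases safe deterministic
# transitions process by process (with cycle detection), Phase-2 fully
# expands the state reached and recurses.
def two_phase(s0, full_succ, det_succ, num_procs):
    """full_succ(s) -> list of successors of s;
    det_succ(p, s) -> the unique safe deterministic successor of s for
    process p, or None if process p is not deterministic in s."""
    visited = set()

    def phase1(s):
        seen = {s}
        for p in range(num_procs):          # each process in turn, fixed order
            t = det_succ(p, s)
            while t is not None and t not in seen:   # stop on a Phase-1 cycle
                seen.add(t)
                s = t
                t = det_succ(p, s)
        return s

    stack = [phase1(s0)]
    while stack:
        s = stack.pop()
        if s in visited:
            continue
        visited.add(s)
        for t in full_succ(s):              # Phase-2: full expansion
            stack.append(phase1(t))
    return visited

# one process counting 0 -> 1 -> 2 deterministically, then a visible choice
det = {0: 1, 1: 2}
succ = {0: [], 1: [], 2: [0, 3], 3: []}
reached = two_phase(0, lambda s: succ[s], lambda p, s: det.get(s), 1)
assert reached == {2, 3}
```

Note how the intermediate states 0 and 1 never enter the reduced state space: only the end of each deterministic run and the fully expanded states are kept.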
2. The Stuttering Bounded Two-Phase Algorithm

This section presents a variant of the two-phase approach to partial-order reduction, called the Stuttering Bounded Two-Phase method (SBTP). In contrast to the original TP, which performs Phase-1 partial expansions as long as possible, our SBTP method imposes a fixed number n of Phase-1 expansions for each process. If fewer than n successive steps are enabled for some process, invisible idle transitions are performed instead. Figure 4 illustrates the resulting computation tree, for two processes with n = 3 transitions each.
Figure 4. SBTP(M, 3) with two processes and n = 3.
This approach ensures that, at a given depth in the execution, the same (partial or global) transition relation is applied to all states, which greatly simplifies the encoding and resolution of this exploration as a bounded model-checking problem using SAT solvers.

We consider computation trees (CTs) rather than general transition systems. This offers the advantage that states of the original transition system that can be reached through different paths, and thus be expanded in different ways, become different states in the computation tree, each with its unique expansion. This matches naturally the SAT-based bounded model-checking approach, which does not attempt to prevent exploring the same state several times on the same path, as opposed to conventional enumerative model checkers.

To precisely define the SBTP approach, we first characterize a broad class of derived CTs, reduced according to partial-order criteria and extended with (finite chains of) idle transitions, and show that they are visibly-bisimilar to the CT they derive from. Then we define the CT corresponding to the SBTP method we just outlined as a particular instance of this class of derived CTs. Finally, we express a constraint system whose solutions are (finite or infinite periodic) bounded execution paths of the CT produced by SBTP.

2.1. Transforming the Computation Tree

This section presents two classes of derived computation trees that are visibly-bisimilar to a given computation tree CT: partial-order reductions (POR) of CT, which remove states and transitions according to POR criteria, and idle-extensions of CT, which add (finitely many) idle transitions to each state.
Both derivations can be combined: indeed, given an initial CT, we can successively derive CT′ as a POR of CT, then CT″ as an idle-extension of CT′. By transitivity of the equivalence, we know that CT″ is visibly-bisimilar to CT.

Partial-Order Reduction of CTs. The definition of POR on computation trees is a direct application of the criteria quoted in Section 1.5.

Definition 6 (POR). Given CT = (S, T, s0, L) and CT′ = (S′, T′, s0, L) two computation trees such that S′ ⊆ S and T′ ⊆ T, CT′ is a partial-order reduction (POR) of CT if and only if ample(s) respects the conditions C0, C1, C2, C3b and C4 from Section 1.5 over CT, where for all s in S, ample(s) = enabled(s, T′) when s ∈ S′ and ample(s) = enabled(s, T) otherwise⁴.

Theorem 1. If CT′ is a partial-order reduction of CT, then CT′ ≈ CT.

Proof. This can be demonstrated by constructing a visible bisimulation between CT and CT′. The relation ∼ ⊆ S × S is defined such that s ∼ s′ iff there exists a path s = s1 −a_1→ s2 −a_2→ ··· −a_{n−1}→ sn = s′ such that each a_i is invisible and {a_i} satisfies C1 from state s_i, for 1 ≤ i < n. It was shown in [8] that the relation ≈ = ∼ ∩ (S × S′) is a visible bisimulation between CT and CT′.

Idle-Extension of CTs. The idle-extension consists in adding a finite (possibly null) number of idle transitions to states of CT, giving CT′. Intuitively, an idle transition is a transition which does nothing and so does not modify the current state.

Definition 7 (Idle-Extension). Given a computation tree CT = (S, T, s0, L), an idle-extension of CT is a computation tree CT′ = (S′, T′, s0, L) over an extended set of actions A ∪ {idle}, with S′ ⊇ S, and such that for all s ∈ S there is a finite sequence s = s0 −idle→ s1 −idle→ ··· −idle→ sn in CT′ where:
• s1, ..., sn are new states not in S,
• L(s1) = ··· = L(sn) = L(s),
• idle is the only enabled transition in s0, ..., s_{n−1},
• for all s −a→ t in CT we have sn −a→ t in CT′.
We write s −idle*→ s_i when such a sequence exists, and call s_i an idle-successor of s and s the idle-origin of s_i. Note that the idle transition is invisible according to this definition. Since the idle-extension is a tree, idle-successors are never shared between multiple idle-origins.

Theorem 2. If CT′ is an idle-extension of CT, then CT′ ≈ CT.

Proof. Let CT = (S, T, s0, L) and CT′ = (S′, T′, s0, L). We define B ⊆ S × S′ such that B(s, s′) iff s′ is an idle-successor of s (including s itself). We will prove that B is a visible bisimulation between CT and CT′. First, obviously we have B(s0, s0). Next, we consider s, s′ such that B(s, s′) and check that the three conditions of Definition 4 are satisfied both ways. By definition of B, s′ is an idle-successor of s.

1. L(s) = L(s′) by Definition 7.
2. If s −a→ t in CT, then there is s′ −idle*→ s″ −a→ t in CT′, with B(t, t).
   Conversely, if s′ −a→ t′ in CT′ then either a = idle, which is invisible, and t′ is another idle-successor of s, so B(s, t′); or a ≠ idle, in which case s′ is the last idle-successor of s and s −a→ t′ in CT, with B(t′, t′).
⁴ The case where s ∉ S′ is for technical soundness only.
3. Suppose there exists an infinite path s −a_1→ t1 −a_2→ t2 ··· in CT, where all a_i are invisible and B(t_i, s′) for all t_i. Then s′ is a shared idle-successor of all t_i, which is impossible according to Definition 7.
   Conversely, suppose there exists an infinite path s′ −a_1→ t′1 −a_2→ t′2 ··· in CT′, where all a_i are invisible and B(s, t′_i) for all t′_i. Then all t′_i are idle-successors of s, which is again impossible according to Definition 7.
2.2. The Stuttering Bounded Two-Phase Computation Tree

In order to accelerate the SAT procedure, we want to consider a modified computation tree of a model such that the same (possibly partial) transition relation is applied to all states at a given depth across the tree. This result can be obtained by applying Stuttering Bounded Two-Phase (SBTP), a variant of the Two-Phase algorithm (TP). For simplicity of the arguments, the method presented in this section and in Section 2.3 considers only the case of finite traces without back-loops (cf. Section 1.3); Section 2.4 explains how to reason about back-loops.

We consider a process model M = (S, T, s0, L) with m processes p0, p1, ..., p_{m−1} (cf. Section 1.6). SBTP's Phase-1 expands exactly n safe deterministic transitions of p0, then n of p1, ..., then n of p_{m−1} (n ∈ N). If fewer than n safe deterministic transitions are enabled for a process, idle transitions are performed instead. After Phase-1, a Phase-2 expansion occurs even if safe deterministic transitions remain. The computation tree produced by SBTP(M, n) is defined in Listing 1, where BCT(s, t, i) computes the transition relation from state t at depth i using transitions from state s.

SBTP(M, n) = (S′, T′, s0, L) where
    c = m · n + 1,
    T′ = BCT(s0, s0, 0),
    BCT(s, t, i) =            // with p = (i mod c) div n
        if n · p ≤ i mod c < n · (p + 1) ∧ s −a→ s′ ∈ Tp then
            {t −a→ s′} ∪ BCT(s′, s′, i + 1)
        else if n · p ≤ i mod c < n · (p + 1) ∧ enable(s, Tp) = ∅ then
            {t −idle→ t′} ∪ BCT(s, t′, i + 1) where t′ is an idle-successor of s
        else                  // p = m, i.e. a Phase-2 step
            ⋃_{(s,a,s′)∈T} ({t −a→ s′} ∪ BCT(s′, s′, i + 1)), and
    S′ = {s′ | s′ is reachable from s0 using T′}

Listing 1. SBTP.
It is easily seen that the computation tree produced by SBTP is an idle-extension of a partial-order reduction of CT(M), and is therefore visibly-bisimilar to CT(M). We note that when n equals 0, no partial-order reduction is performed and the resulting computation tree is the same as the original one. Figure 5(b) illustrates the result of applying one full cycle of SBTP to the CT of Figure 5(a), with two processes and n = 3. The gray arrows of Figure 5(a) represent transitions which are ignored by Phase-1.
Figure 5. CT(M) vs SBTP(M, 3); if s and s′ are linked by a dashed line then s ≈ s′ and s′ ≈ s.
2.3. Applying Bounded Model Checking to SBTP

This section describes the actual bounded model checking problem used in our approach, which encodes bounded executions of the SBTP computation tree defined in the previous section. Given a process model M with m processes, an LTL_X property f, and n, k ∈ N, our approach uses a variant of the method presented in [4] to create a propositional formula [[M, ¬f]]^SBTP_{k,n}. Contrary to classical bounded model checking methods, which use a single transition relation to carry out the required computation on the state space, we define m + 1 transition relations. One is the full transition relation T, used in Phase-2. The others, used in Phase-1, only contain, for each process p_i, the safe deterministic transitions of p_i and idle transitions on states where no such safe deterministic transitions are enabled. We denote these transition relations by T_i^idle. Given two states s, t and an action a, T_i^idle(s, a, t) holds if and only if either enable(s, T_i) = {a} and s −a→ t, or enable(s, T_i) = ∅ and a = idle.

Given the number of processes m and the parameter n, we know which phase is used at each depth i of the unfolding process. Furthermore, if Phase-1 is expanded at depth i, we know which process is being unfolded (cf. Figure 4). The transition relation T_n(i, s, a, s′) expanded at level i is defined as follows:

Definition 8 (T_n(i, s, a, s′)). Given M = (S, T, s0, L) with m processes p0, p1, ..., p_{m−1}, let c = m · n + 1 be the number of steps of a cycle: n Phase-1 steps for each of the m processes, plus one Phase-2 step. For i ∈ N, s, s′ ∈ S, and a ∈ A:
T_n(i, s, a, s′) :=
    T(s, a, s′)                                        if i mod c = m · n   (Phase-2)
    T_j^idle(s, a, s′) where j = (i mod c) div n       otherwise            (Phase-1)
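The schedule of Definition 8 is purely arithmetic, so it can be computed independently of the state. The following small sketch (ours) returns, for each depth i, either the full relation or the process index whose T_j^idle relation is unfolded:

```python
# Which transition relation is unfolded at depth i (Definition 8 schedule),
# for m processes and n Phase-1 steps per process.
def relation_at(i, m, n):
    c = m * n + 1            # cycle length: m*n Phase-1 steps + 1 Phase-2 step
    r = i % c
    return "full" if r == m * n else ("idle", r // n)

# two processes, n = 3: matches the cycle of Figure 4
m, n = 2, 3
assert [relation_at(i, m, n) for i in range(8)] == [
    ("idle", 0), ("idle", 0), ("idle", 0),
    ("idle", 1), ("idle", 1), ("idle", 1),
    "full",
    ("idle", 0)]             # the cycle restarts after the Phase-2 step
```

Because this schedule depends only on i, the SAT encoding can instantiate the right relation at each unfolding step without any extra decision variables.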
We are able to apply POR to bounded model checking by using the previous definition within Definition 2, which translates a transition system and an LTL property into a propositional formula:
Definition 9 (SBTP encoding). Let M be a process model which contains m processes, f be an LTL_X property, and n, k ∈ N:

[[M, ¬f]]^SBTP_{k,n} := I(s0) ∧ ⋀_{i=0}^{k−1} T_n(i, s_i, a, s_{i+1}) ∧ ¬L_k ∧ [[¬f]]

Once the propositional formula [[M, ¬f]]^SBTP_{k,n} is built, a decision procedure is used to check its satisfiability. An error is found if [[M, ¬f]]^SBTP_{k,n} is satisfiable. The validity of this method stems from the following observations. By comparing the constructions of SBTP(M, n) and [[M, ¬f]]^SBTP_{k,n}, it is clear that the latter is the BMC encoding of the former, i.e. [[M, ¬f]]^SBTP_{k,n} = [[SBTP(M, n), ¬f]]_k (restricted to finite traces). The rest derives from the validity of BMC and SBTP, as follows:
Theorem 3. Let M be a process model with m processes, f be an LTL_X formula, and n ∈ N. There exists k ≥ 0 such that [[M, ¬f]]^SBTP_{k,n} is satisfiable if and only if M ⊭ f.

Proof.

∃k : [[M, ¬f]]^SBTP_{k,n} is satisfiable
⇐⇒ ∃k : [[SBTP(M, n), ¬f]]_k is satisfiable   ([[M, ¬f]]^SBTP_{k,n} = [[SBTP(M, n), ¬f]]_k)
⇐⇒ ∃k : SBTP(M, n) ⊭_k f   (by validity of BMC, cf. Theorem 2 of [4])
⇐⇒ SBTP(M, n) ⊭ f   (by validity of BMC, cf. Theorem 1 of [4])
⇐⇒ M ⊭ f   (SBTP(M, n) ≈ CT(M) ≈ M)
[[M, ¬f]]^SBTP_{k,n} is well suited to the DPLL algorithm, in the sense that the Phase-1 transition relations T_j^idle produce mostly efficient unit propagation with little backtracking. Suppose that we want to find a satisfying assignment for the path s0 −a0→ s1 ··· and that s0 is a deterministic state. Once the values of the variables of s0 are completely decided, the values of the variables of s1 can be completely inferred by the unit propagation phase: because s0 is deterministic, there is exactly one possibility for the values of the variables of s1.
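The effect described above can be illustrated on a toy clause set (our own example, unrelated to the actual Yices encoding): a deterministic step constrains the next-state variables so tightly that unit propagation alone fixes them, with no search.

```python
# Toy unit propagation: repeatedly assign the single unset literal of
# clauses that are not yet satisfied and have exactly one unset literal.
def unit_propagate(clauses, assignment):
    """Clauses are lists of (var, polarity); assignment maps var -> bool."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unset = [(v, p) for (v, p) in clause if v not in assignment]
            satisfied = any(assignment.get(v) == p for (v, p) in clause
                            if v in assignment)
            if not satisfied and len(unset) == 1:
                v, p = unset[0]
                assignment[v] = p
                changed = True
    return assignment

# encode a deterministic step "x1 = not x0" as two clauses:
# (not x0 or not x1) and (x0 or x1)
clauses = [[("x0", False), ("x1", False)], [("x0", True), ("x1", True)]]
result = unit_propagate(clauses, {"x0": True})
assert result["x1"] is False      # next-state bit inferred without branching
```

Once the bits of s0 are decided, every Phase-1 step behaves like this: the solver derives the successor state by propagation and only branches at Phase-2 expansions.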
2.4. BMC with Back-Loops

This section shows why paths with back-loops invalidate the arguments of Section 2.3, and how to extend those arguments to deal with back-loops. Figure 6 represents a path π which contains a back-loop. It is easy to see that this finite loop induces an infinite path which does not belong to SBTP(M, 2). This path belongs to the computation tree presented in Figure 7. All the execution paths start with a prefix of length k1 = 3 of the form s0 −T1^idle→ s1 −T1^idle→ s2 −T2^idle→, followed by an infinite expansion of the
loop of length k2 = 6: s_i −T2^idle→ s_{i+1} −T→ s_{i+2} −T1^idle→ s_{i+3} −T1^idle→ s_{i+4} −T2^idle→ s_{i+5} −T→.

Given a process model M = (S, T, s0, L) with m processes, and lengths k1 and k2, we can build variants of SBTP(M, n) that correspond to the computation tree of Figure 7. These variants are still idle-extensions of partial-order reductions of CT(M), hence visibly-bisimilar to M. We can then construct a complete version of [[M, ¬f]]^SBTP_{k,n} with back-loops, similar to Definition 2, that essentially corresponds to the union of those modified SBTP computation trees. Note that, in order to satisfy condition C3b of Section 1.5, the full transition relation T must be used to check whether there exists a back-loop or not. If a transition relation T_j^idle were used instead, we could have a loop that does not contain a Phase-2 expansion, thus postponing some transitions indefinitely and violating condition C3b.
Figure 6. A finite path with a back-loop.
Figure 7. Variant of SBTP(M, n).
3. Implementation

We extended the model checker presented in [7] to support BMC over SBTP models. It allows us to describe concurrent systems and to verify LTL_X properties. Our prototype has been implemented in the Scala language [18]. We chose Scala because it is a multi-paradigm programming language, fully interoperable with Java, designed to integrate features from object-oriented and functional programming: Scala is a pure object-oriented language in the sense that every value is an object, and it is also a functional language in the sense that every function is a value.

The model checker defines a language for describing transition systems. The design of the language has been influenced on the one hand by process algebras and on the other hand by the NuSMV language [19]. A model of a concurrent system declares a set of global variables, a set of shared actions, and a set of processes. A process p_i declares a set of local variables, a set of local actions, and the set of shared actions on which p_i synchronizes. Each process has a distinguished local program counter variable pc. For each value of pc, the behavior of a process is defined by means of a list of action-labelled guarded commands of the form [a]c → u, where a is an action, c is a condition on variables, and u is an assignment updating some variables. Shared actions are used to define synchronization between the processes. A shared action occurs simultaneously in all the processes that share it, and
only when all of them enable it.

For each process p_i, we use a heuristic based on syntactic information about p_i to compute a safe approximation A_i of the safe deterministic actions; the conditions used are described in [20]. We only allow properties to refer to global variables. Intuitively, the safe deterministic transitions are those which perform a deterministic action (e.g. a deterministic assignment) and do not access any global variables or global labels. A more complex heuristic could take into account the variables occurring in the properties being verified, to improve the quality of the safe approximation A_i.

The model checker takes a model in this language as input. The number of steps per process in Phase-1 (the parameter n) is fixed by the user. To find an error, it applies iterative deepening, producing a SAT problem corresponding to Definition 9 for increasing depths k. The Yices SMT solver is used to check the satisfiability of the generated formula [21]. We chose Yices because it offers built-in support for the arithmetic operators defined in our language. If a counterexample is found, a trace which violates the property is displayed.
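The guarded-command form [a]c → u described above can be sketched as follows. This is a hypothetical illustration in Python (the actual tool is written in Scala, and its concrete syntax and internals are not shown here); all names are ours:

```python
# Toy evaluation of action-labelled guarded commands [a] c -> u:
# a command is enabled in a state when its guard holds, and firing it
# applies its update to the state.
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Command = Tuple[str, Callable[[State], bool], Callable[[State], State]]

def enabled_commands(cmds: List[Command], s: State) -> List[str]:
    return [a for (a, guard, _) in cmds if guard(s)]

def fire(cmds: List[Command], a: str, s: State) -> State:
    for (b, guard, update) in cmds:
        if b == a and guard(s):
            return update(s)
    raise ValueError(f"action {a} not enabled")

cmds: List[Command] = [
    # [inc] x < 2 -> x := x + 1   (touches only the local variable x)
    ("inc", lambda s: s["x"] < 2, lambda s: {**s, "x": s["x"] + 1}),
    # [put] x == 2 -> x := 0      (a shared action would synchronize here)
    ("put", lambda s: s["x"] == 2, lambda s: {**s, "x": 0}),
]
s = {"x": 0}
assert enabled_commands(cmds, s) == ["inc"]
s = fire(cmds, "inc", s)
assert s == {"x": 1}
```

In the spirit of the heuristic above, a command like inc, which is deterministic and touches only local variables, is a candidate safe deterministic action, whereas a command involving shared actions or global variables is not.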
4. Case Study
In order to assess the effectiveness of our method, we applied it to a variant of a producer-consumer system where all producers and consumers contribute to the production of every single item. The model is composed of 2m processes: m producers and m consumers. The producers and consumers communicate via a bounded buffer. Each producer works locally on a piece p, then waits until all producers have finished their task. Then p is added to the bounded buffer, and the producers start processing the next piece. When the consumers remove p from the bounded buffer, they work locally on it. When all the consumers have finished their local work, another piece can be removed from the bounded buffer. Two properties have been analyzed on this model: P1 states that the bounded buffer is always empty, and P2 states that in all cases the buffer will eventually contain more than one piece. Table 1 and Table 2 compare the classical BMC method and the SBTP method when applied to P1 and P2. Notice that BMC proceeds by increasing the depth k until an error is found (cf. iterative deepening). Classical BMC quickly runs out of resources, whereas our method can treat much larger models in a few minutes. With regard to verification time, our method significantly outperforms the BMC method on this example. We also notice that SBTP traces are 3.4 to 6.75 times longer. This difference can come either from the addition of the idle transitions or from the paths considered themselves: contrary to BMC, our method does not consider all possible interleavings, so it is possible that the shortest error traces are not considered. Table 3 analyses the influence of the number of times Phase-1 is executed for each process (i.e. the parameter n). We notice that, for a given number of producers and consumers, n influences in a non-monotonic way the length of the error execution path, the verification time and the memory used during the verification.
n influences both aspects of the transformation of the model. On the one hand, the graph is reduced more as n is increased, due to more partial-order reduction. On the other hand, the number of added idle transitions is also influenced by this parameter. When n is increased, the number of cycles on the discovered error path tends towards the minimum number of unsafe transitions which participate in the violation of the property. We notice that each time the number of cycles is decremented by one (cf. n = 4), the CPU time and the memory needed reach a local minimum. The CPU time and the memory used then increase until the number of cycles is decremented again.
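The synchronization pattern of the case study can be rendered as a toy simulation. Everything below is a hypothetical illustration of the prose description, not the model checker's input language; the barrier between the m producers (and m consumers) is modelled with a simple list of flags:

```python
from collections import deque

def produce(m, buffer, capacity):
    """All m producers finish their local work (a barrier), and only
    then is one piece appended to the bounded buffer."""
    finished = [True for _ in range(m)]      # each producer works locally
    if all(finished) and len(buffer) < capacity:
        buffer.append("piece")

def consume(m, buffer):
    """One piece leaves the buffer; the next piece can only be removed
    after all m consumers have finished their local work on this one."""
    if not buffer:
        return None
    piece = buffer.popleft()
    finished = [True for _ in range(m)]      # each consumer works locally
    return piece if all(finished) else None

buf = deque()
produce(m=2, buffer=buf, capacity=4)
got = consume(m=2, buffer=buf)
```

Property P1 ("the buffer is always empty") is violated as soon as the first producer barrier is passed, which is why BMC finds short counterexamples at small m.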
Table 1. Statistics of property P1 of the producer-consumer model using the BMC approach and the SBTP approach with n = 8. m is the number of producers (resp. consumers), # states is the state space size, k is the smallest bound for which an error is found, TIME is the verification time (in seconds), MEM is the memory used by Yices when the bound equals k (in megabytes), and # cycles is the number of Phase-1/Phase-2 cycles. — indicates that the computation did not end within 8 hours.
m   # states      |  BMC property P1              |  SBTP property P1
                  |  k     TIME (sec)  MEM (MB)   |  k     # cycles  TIME (sec)  MEM (MB)
1   1,059         |  10        10         29      |   34       2          7         30
2   51,859        |  18        44         41      |   66       2          8         49
3   3,807,747     |  26    11,679         65      |   98       2         16         85
4   ≈ 10^8        |  —          —          —      |  130       2         31        122
5   ≈ 10^10       |  —          —          —      |  162       2         43        169
6   ≈ 10^12       |  —          —          —      |  194       2         57        224
7   ≈ 10^14       |  —          —          —      |  226       2         77        288
Table 2. Statistics of property P2 of the producer-consumer model using the BMC approach and the SBTP approach with n = 8. m is the number of producers (resp. consumers), # states is the state space size, k is the smallest bound for which an error is found, TIME is the verification time (in seconds), MEM is the memory used by Yices when the bound equals k (in megabytes), and # cycles is the number of Phase-1/Phase-2 cycles. — indicates that the computation did not end within 8 hours.
m   # states      |  BMC property P2              |  SBTP property P2
                  |  k     TIME (sec)  MEM (MB)   |  k       # cycles  TIME (sec)  MEM (MB)
1   1,059         |  26        73         33      |    153       9        122         96
2   51,859        |  44    29,898        131      |    297       9        211        224
3   3,807,747     |  —          —          —      |    441       9        401        363
4   ≈ 10^8        |  —          —          —      |    585       9      1,238        680
5   ≈ 10^10       |  —          —          —      |    729       9      1,338        983
6   ≈ 10^12       |  —          —          —      |    873       9      1,926      1,438
7   ≈ 10^14       |  —          —          —      |  1,017       9      4,135      1,618
Table 3. Influence of the parameter n when the number of producers (resp. consumers) equals 2. k is the smallest bound for which an error is found, # cycles is the number of Phase-1/Phase-2 cycles, TIME is the verification time (in seconds), and MEM is the memory used by Yices when the bound equals k (in megabytes).
n   |  property P1                          |  property P2
    |  k    # cycles  TIME (sec)  MEM (MB)  |  k     # cycles  TIME (sec)  MEM (MB)
0   |  18      18         44         41     |   44      44       29,898       131
1   |  35       7         12         41     |   95      19          855       159
2   |  45       5         11         40     |  135      15          235       167
3   |  39       4         10         47     |  169      13          305       194
4   |  51       3          8         47     |  187      11          217       192
5   |  63       3         10         50     |  231      11          375       308
6   |  75       3         12         57     |  275      11          381       240
7   |  87       3         13         58     |  319      11          583       318
8   |  66       2          8         49     |  297       9          211       224
9   |  74       2          9         57     |  333       9          240       295
5. Related Work
Different approaches have been developed to apply symbolic model checking to asynchronous systems. In [22], Enders et al. show how to encode a transition relation T(s, a, s′) into BDDs. This paper focuses on the ordering of the variables within BDDs. It is well known that the size of BDDs, and therefore the performance of BDD-based model checking, strongly depends on this ordering. In general, finding the best variable ordering is an NP-complete problem. The paper presents a heuristic which, according to experimental results, produces BDDs that grow linearly in the number of asynchronous components. In [8], Gerth et al. show how to perform partial order reduction in the context of process algebras. They show that conditions C0 to C4 of Section 1.5 can be applied to produce a reduced structure that is branching-bisimilar, and hence preserves Hennessy-Milner logic [23]. Other approaches combine symbolic model checking and POR to verify different classes of properties. In [24], Alur et al. transform an explicit model checking algorithm performing partial order reduction. This algorithm is able to check invariance of local properties. They start from a DFS algorithm and obtain a modified BFS algorithm; both expand an ample set of transitions at each step. In order to detect cycles, they pessimistically assume that each previously expanded state might close a cycle. In [25], Abdulla et al. present a general method for combining POR and symbolic model checking. Their method can check safety properties by either backward or forward reachability analysis. To perform the reduction, they employ the notion of commutativity in one direction, a weakening of the dependency relation that is usually used to perform POR. In [26], Kurshan et al. introduce a partial order reduction algorithm based on static analysis. They notice that each cycle in the state space is composed of some local cycles.
The method performs a static analysis of the checked model so as to discover local cycles and set up all the reductions at compile time. The reduced state space can be handled with symbolic techniques. This paper complements our previous work, which combined symbolic model checking and partial order reduction [7]. That work introduces the FwdUntilPOR algorithm, which combines two existing techniques to provide symbolic model checking of a subset of CTL on asynchronous models. The first technique is the ImProviso algorithm, which efficiently merges POR and symbolic methods [20]; it is a symbolic adaptation of the Two-Phase algorithm. The second technique is the forward symbolic model checking approach applicable to a subset of CTL [27]. Contrary to FwdUntilPOR, which checks CTL\X properties using BDD-based model checking, our method deals with LTL\X properties using a SAT solver. In [28], Jussila presents three improvements for applying bounded model checking to asynchronous systems. Jussila considers reachability properties, whereas our method allows the verification of LTL\X properties. The partial order semantics replaces the standard interleaving execution model with non-standard models allowing the execution of several independent actions simultaneously. The on-the-fly determinization consists in determinizing the different components during their composition. This is done by creating a propositional formula whose models correspond to the executions of the determinized equivalents of the components. We point out that the state automata resulting from the determinization of the components are never constructed. Finally, the merging of local transitions can be seen as introducing additional transitions into the components. These transitions correspond to the execution of a sequence of local actions. When a transition is added, the component has to contain a path between the transition's source and target states.
The partial order semantics addresses the same problem as we do: both methods consider a model which contains fewer execution paths than the original model. On-the-fly determinization can be seen as a method complementary to ours. In general, when asynchronous systems are considered, two causes of non-determinism can be identified: the first one comes
from the components themselves, and the second one comes from the interleaving execution model. The former is handled by on-the-fly determinization, while our method tackles the latter. All three approaches are potentially applicable and open interesting directions for further work. However, none of those methods provides a BMC encoding using only a subset of the transition relation at some steps, which has proven to provide important performance gains in our approach.
6. Conclusion
In this paper, we introduced a technique which applies partial order reduction methods to bounded model checking. It provides an algorithm which is suited to the verification of LTL\X properties on asynchronous models. The formulæ produced by this approach allow for more efficient processing by the DPLL algorithm used in BMC, compared to those produced by the conventional bounded model checking approach. These formulæ are obtained by using only a restricted, safe subset of the transition relation, based on POR considerations, at many steps in the unfolding of the model. In order to assess the correctness of our method, we defined two general procedures for transforming a computation tree CT into a visible-bisimilar one: the partial-order reduction of CT, which reduces CT according to classical POR criteria, and the idle-extension of CT, which adds a finite number of idle transitions to each state. Then, we defined the SBTP algorithm, which is a particular instance of these transformations. Finally, we presented the transformation of SBTP into a bounded model checking problem. We extended a model checker which is currently under development at our university to support our method. We showed on a simple case study that our method achieves an improvement in comparison to the classical bounded model checking algorithm. However, our method needs to be tested on a larger range of case studies and to be compared with other methods and tools such as NuSMV [19] or FDR [29].
Furthermore, one could explore how to apply our method to those tools. Our approach can be extended in the following ways:
• The SBTP algorithm can be extended to handle models featuring variables over infinite domains. This can be achieved by using the capabilities of Satisfiability Modulo Theories solvers such as Yices [21] and MathSAT [30].
• When a partial Tj^idle(si, a, si+1) is applied, only local variables of pj are modified, the other variables y being constrained to remain the same (yi = yi+1). Based on that, we could merge these variables and remove the corresponding constraints. This would amount to parallelizing safe transitions of different processes, approaching the result of Jussila's first method from a different angle.
• The heuristic used to determine the safe deterministic transitions is quite simple, whereas there exists a large body of literature on this subject. Based on that literature, we could explore better approximations that result in detecting more safe deterministic states.
Acknowledgements
This work is supported by project MoVES under the Interuniversity Attraction Poles Programme — Belgian State — Belgian Science Policy. The authors are grateful for the fruitful discussions with Stefano Tonetta, Marco Roveri, and Alessandro Cimatti during a stay at the Fondazione Bruno Kessler. These discussions were the starting point of this work.
References
[1] Edmund M. Clarke, Orna Grumberg, and Doron Peled. Model Checking. MIT Press, 1999.
[2] R. E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers, C-35(8), 1986.
[3] J. R. Burch, E. M. Clarke, K. L. McMillan, D. L. Dill, and J. Hwang. Symbolic model checking: 10^20 states and beyond. Information and Computation, 98(2):142–170, 1992.
[4] Armin Biere, Alessandro Cimatti, Edmund M. Clarke, Ofer Strichman, and Yunshan Zhu. Bounded model checking. Advances in Computers, 58:118–149, 2003.
[5] Patrice Godefroid. Partial-Order Methods for the Verification of Concurrent Systems – An Approach to the State-Explosion Problem, volume 1032 of Lecture Notes in Computer Science. Springer-Verlag, 1996.
[6] Ratan Nalumasu and Ganesh Gopalakrishnan. A new partial order reduction algorithm for concurrent system verification. In CHDL'97: Proceedings of the IFIP TC10 WG10.5 International Conference on Hardware Description Languages and Their Applications, pages 305–314, London, UK, 1997. Chapman & Hall, Ltd.
[7] José Vander Meulen and Charles Pecheur. Efficient symbolic model checking for process algebras. In 13th International Workshop on Formal Methods for Industrial Critical Systems (FMICS 2008), volume 5596 of LNCS, pages 69–84, 2008.
[8] Rob Gerth, Ruurd Kuiper, Doron Peled, and Wojciech Penczek. A partial order approach to branching time logic model checking. Information and Computation, 150(2):132–152, 1999.
[9] Orna Lichtenstein and Amir Pnueli. Checking that finite state concurrent programs satisfy their linear specification. In POPL '85: Proceedings of the 12th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 97–107, New York, NY, USA, 1985. ACM.
[10] Rob Gerth, Doron Peled, Moshe Y. Vardi, and Pierre Wolper. Simple on-the-fly automatic verification of linear temporal logic. In Proc. 15th Workshop on Protocol Specification, Testing, and Verification, Warsaw, June 1995. North-Holland.
[11] E. M. Clarke, E. A. Emerson, and A. P. Sistla. Automatic verification of finite-state concurrent systems using temporal logic specifications. ACM Transactions on Programming Languages and Systems, 8(2):244–263, 1986.
[12] E. Clarke, E. A. Emerson, and A. P. Sistla. Automatic verification of finite-state concurrent systems using temporal logic. In 10th Annual Symposium on Principles of Programming Languages. ACM, 1983.
[13] Julius R. Büchi. On a decision method in restricted second order arithmetic. In Ernest Nagel, Patrick Suppes, and Alfred Tarski, editors, Proceedings of the 1960 International Congress on Logic, Methodology and Philosophy of Science, pages 1–11. Stanford University Press, June 1962.
[14] J. R. Burch, E. M. Clarke, K. L. McMillan, D. L. Dill, and J. Hwang. Symbolic model checking: 10^20 states and beyond. Information and Computation, 98(2):142–170, 1992.
[15] Martin Davis, George Logemann, and Donald Loveland. A machine program for theorem-proving. Communications of the ACM, 5(7):394–397, 1962.
[16] Robin Milner. Communication and Concurrency. Prentice-Hall, 1989.
[17] Ratan Nalumasu and Ganesh Gopalakrishnan. An efficient partial order reduction algorithm with an alternative proviso implementation. Formal Methods in System Design, 20(3):231–247, 2002.
[18] Martin Odersky, Philippe Altherr, Vincent Cremet, Burak Emir, Sebastian Maneth, Stéphane Micheloud, Nikolay Mihaylov, Michel Schinz, Erik Stenman, and Matthias Zenger. An overview of the Scala programming language. Technical Report IC/2004/64, EPFL Lausanne, Switzerland, 2004.
[19] A. Cimatti, E. Clarke, F. Giunchiglia, and M. Roveri. NuSMV: a new symbolic model verifier. In Proc. of International Conference on Computer-Aided Verification, 1999.
[20] Flavio Lerda, Nishant Sinha, and Michael Theobald. Symbolic model checking of software. In Byron Cook, Scott Stoller, and Willem Visser, editors, Electronic Notes in Theoretical Computer Science, volume 89. Elsevier, 2003.
[21] Bruno Dutertre and Leonardo de Moura. The Yices SMT solver. Tool paper at http://yices.csl.sri.com/toolpaper.pdf, August 2006.
[22] Reinhard Enders, Thomas Filkorn, and Dirk Taubner. Generating BDDs for symbolic model checking in CCS. In CAV '91: Proceedings of the 3rd International Workshop on Computer Aided Verification, pages 203–213, London, UK, 1992. Springer-Verlag.
[23] Matthew Hennessy and Robin Milner. Algebraic laws for nondeterminism and concurrency. Journal of the ACM, 32(1):137–161, 1985.
[24] Rajeev Alur, Robert K. Brayton, Thomas A. Henzinger, Shaz Qadeer, and Sriram K. Rajamani. Partial-order reduction in symbolic state space exploration. In Computer Aided Verification, pages 340–351, 1997.
[25] Parosh Aziz Abdulla, Bengt Jonsson, Mats Kindahl, and Doron Peled. A general approach to partial order reductions in symbolic verification (extended abstract). In Computer Aided Verification, pages 379–390, 1998.
[26] Robert P. Kurshan, Vladimir Levin, Marius Minea, Doron Peled, and Hüsnü Yenigün. Static partial order reduction. In TACAS '98: Proceedings of the 4th International Conference on Tools and Algorithms for Construction and Analysis of Systems, pages 345–357, London, UK, 1998. Springer-Verlag.
[27] Hiroaki Iwashita, Tsuneo Nakata, and Fumiyasu Hirose. CTL model checking based on forward state traversal. In ICCAD '96: Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design, pages 82–87, Washington, DC, USA, 1996. IEEE Computer Society.
[28] Toni Jussila. On Bounded Model Checking of Asynchronous Systems. Research Report A97, Helsinki University of Technology, Laboratory for Theoretical Computer Science, Espoo, Finland, October 2005. Doctoral dissertation.
[29] A. W. Roscoe. Model-checking CSP. In A Classical Mind: Essays in Honour of C. A. R. Hoare. Prentice Hall International (UK) Ltd., Hertfordshire, UK, 1994.
[30] Roberto Bruttomesso, Alessandro Cimatti, Anders Franzén, Alberto Griggio, and Roberto Sebastiani. The MathSAT 4 SMT solver. In CAV '08: Proceedings of the 20th International Conference on Computer Aided Verification, pages 299–303, Berlin, Heidelberg, 2008. Springer-Verlag.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-49
On Congruence Property of Scope Equivalence for Concurrent Programs with Higher-Order Communication Masaki MURAKAMI Department of Computer Science, Graduate School of Natural Science and Technology, Okayama University, 3-1-1 Tsushima-Naka, Okayama, 700-0082, Japan
[email protected] Abstract. Representation of scopes of names is important for analysis and verification of concurrent systems. However, it is difficult to represent the scopes of channel names precisely with models based on process algebra. We introduced a model of concurrent systems with higher-order communication based on graph rewriting in our previous work. A bipartite directed acyclic graph represents a concurrent system that consists of a number of processes and messages in that model. The model can represent the scopes of local names precisely. We defined an equivalence relation such that two systems are equivalent not only in their behavior, but also in extrusion of scopes of names. This paper shows that our equivalence relation is a congruence relation w.r.t. τ -prefix, new-name, replication and composition, even when higher-order communication is allowed. We also show our equivalence relation is not congruent w.r.t. input-prefix though it is congruent w.r.t. input-prefix in the first-order case. Keywords. theory of concurrency, π-calculus, bisimilarity, graph rewriting, higherorder communication
Introduction
There are a number of formal models of concurrent systems. In models such as the π-calculus [11], "a name" represents, for example, an IP address, a URL, an e-mail address, a port number and so on. Thus, the scopes of names in formal models are important for the security of concurrent systems. On the other hand, it is difficult to represent the scopes of channel names precisely with models based on process algebra. In many such models, the scope of a name is represented using a binding operation such as the ν-operation, so the scope of a name is a subterm of the expression that represents the system. For example, in the π-calculus term νa2(νa1(b1|b2)|b3), the scope of the name a2 is the subterm (νa1(b1|b2)|b3) and the scope of the name a1 is the subterm (b1|b2). However, this method has several problems. For example, consider a system S consisting of a server and two clients. A client b1 communicates with the server b2 using a channel a1 whose name is known only by b1 and b2, and a client b3 communicates with b2 using a channel a2 that is known only by b2 and b3. In this system, a1 and a2 are private names. As b1 and b2 know the name a1 but b3 does not, the scope of
M. Murakami / On Congruence Property of Scope Equivalence for Concurrent Programs
Figure 1. Scopes of names in S.
a1 includes b1 and b2, and the scope of a2 includes b2 and b3. Thus the scopes of a1 and a2 are not nested, as shown in Figure 1. The method of denoting private names as bound names using the ν-operator cannot represent the scopes of a1 and a2 precisely, because scopes of names are subterms of a term, and subterms are nested (or disjoint) in any π-calculus term. Furthermore, it is sometimes impossible to represent the scope precisely even for a single name with the ν-operator. Consider the example νa(v̄a.P) | v(x).Q, where x does not occur in Q. In this example, a is a private name and its scope is v̄a.P. The scope of a is extruded by communication between the prefixes v̄a and v(x). The result of the action is νa(P|Q), and Q is included in the scope of a. However, as a does not occur in Q, this is equivalent to (νaP)|Q by the rules of structural congruence. We cannot see from the resulting expression (νaP)|Q that a has been 'leaked' to Q. Thus we must keep the trace of communications for the analysis of scope extrusion. This makes it difficult to analyze extrusions of scopes of names. In our previous work, we presented a model based on graph rewriting instead of process algebra as a solution to the problem of representing the scopes of names [6]. We defined an equivalence relation on processes called scope equivalence, which holds if two processes are equivalent not only in their behavior but also in the scopes of channel names. We showed congruence results for weak bisimulation equivalence [7] and for scope equivalence [9] on the graph rewriting model. On the other hand, a number of formal models with higher-order communication have been reported. LHOπ (Local Higher Order π-calculus) [12] is one of the most well-studied models in that area. It is a subcalculus of the higher-order π-calculus with asynchronous communication. However, the problem of scopes of names also arises in LHOπ.
We need a model with higher-order communication that can represent the scopes of names precisely. We extended the graph rewriting model of [6] to systems with higher-order communication [8], and extended the congruence results for the behavioral equivalence to the model with higher-order communication [10]. This paper discusses the congruence property of scope equivalence for the graph rewriting model with higher-order communication introduced in [8]. We show that the scope equivalence relation is a congruence relation w.r.t. τ-prefix, new-name, replication and composition even if higher-order communication is allowed, as presented in Section 4.1. These results are extensions of the results presented in [9]. On the other hand, in Section 4.2, we show that it is not congruent w.r.t. input-prefix, though it is congruent w.r.t. input-prefix in the first-order case [9]. Congruence results on bisimilarity based on graph rewriting models are reported in [2,13]. Those studies adopt a graph transformation approach as a proof technique. In this paper, graph rewriting is introduced to extend the model for the representation of name scopes.
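The difficulty with the system S above can be stated concretely: if the scope of a name is simply the set of behaviors that know it, the scopes of a1 and a2 in Figure 1 overlap without either containing the other, so no nesting of subterms can express them. A small sketch (this set encoding is illustrative only, not the paper's formal model):

```python
# System S of Figure 1: an edge (b, a) means "behavior b knows name a".
edges = {("b1", "a1"), ("b2", "a1"), ("b2", "a2"), ("b3", "a2")}

def scope(name):
    """The set of behaviors lying in the scope of a private name."""
    return {b for (b, a) in edges if a == name}

s1, s2 = scope("a1"), scope("a2")
overlap = s1 & s2                   # the two scopes share b2 ...
nested = s1 <= s2 or s2 <= s1       # ... yet neither contains the other,
                                    # so nu-nesting cannot represent them
```

Any ν-scoped term would force `scope("a1")` and `scope("a2")` to be nested or disjoint; here they are neither.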
Figure 2. A bipartite directed acyclic graph.
Figure 3. A message node.
Figure 4. A behavior node α.P .
Figure 5. Message receiving.
1. Basic Idea
Our model is based on graph rewriting systems such as [2,3,5,4,13]. We represent a concurrent program that consists of a number of processes (and messages on the way) with a bipartite directed acyclic graph. A bipartite graph is a graph whose nodes are decomposed into two disjoint sets, source nodes and sink nodes, such that no two nodes within the same set are adjacent. Every edge is directed from a source node to a sink node. The system of Figure 1, which consists of three processes b1, b2 and b3 and two names ai (i = 1, 2) shared by bi and bi+1, is represented with a graph as in Figure 2. Processes and messages on the way are represented with source nodes. We call source nodes behaviors. In Figure 2, b1, b2 and b3 are behaviors.
message: A behavior node that represents a message is a node labeled with the name of the recipient n (called the subject of the message) and the content of the message o, as in Figure 3. The content of the message is a name or a program code, as we allow higher-order messages. As a program code to be sent is represented with a graph structure, the content of a message may itself have a bipartite graph structure. Thus the message node has a nested structure, with a graph structure inside the node.
message receiving: A message is received by a receiver process that executes an input action and then continues its execution. We denote a receiver process with a node that consists of its
Figure 6. Extrusion of the scope of n.
Figure 7. Message sending.
Figure 8. Receiving a program code.
epidermis, which denotes the first input action, and its content, which denotes the continuation. For example, a receiver that executes an input action α and then becomes a program P (denoted as α.P in CCS terms) is denoted with a node whose epidermis is labeled with α and whose content is P (Figure 4). As the continuation P is a concurrent program, it has a graph structure inside the node. Thus the receiver process also has a nested structure. Message receiving is represented as follows. Consider a message sending an object (a name or an abstraction) n, and a receiver with name m (Figure 5a). The execution of the message input action is represented by "peeling the epidermis of the receiver process node". When the message is received, it vanishes, the epidermis of the receiver is removed and the content is exposed (Figure 5b). Now the continuation P is activated. The received object n is substituted for the name x in the content P. The scope of a name is extruded by message passing. For example, the π-calculus has a τ transition such that νn(ān) | a(y).P → νn(P[n/y]). This extrusion is represented by a graph rewriting as in Figure 6. A local name n occurs in the message node, but there is no edge from the node of the receiver because n is new to the receiver. After receiving the message, as n is a newly imported local name, a new sink node corresponding to n is added to the graph, and new edges are created from each behavior of the continuation to n, as the continuation of the receiver is in the scope of n.
message sending: In the asynchronous π-calculus, message sending is represented in the same way as process activation. We adopt a similar idea. Consider an example that executes an action α and sends a message m (Figure 7, left). When the action α is executed, the epidermis is peeled and the message m is exposed, as in Figure 7, right. Now the message m is transmitted and can move to the receiver, and the execution of Q continues.
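The scope-extrusion rewrite of Figure 6 can be sketched over a (src, snk, edge) triple. The function and its arguments are an illustration of the informal description above, not the paper's formal rewriting rules:

```python
def receive(graph, continuation, imported):
    """Scope extrusion on message receipt: activate the continuation's
    behaviors and, if the carried name is new to the receiver, add a
    sink node for it plus an edge from every behavior of the
    continuation (the continuation now lies in the name's scope)."""
    src, snk, edges = graph
    src = src | continuation                 # continuation P activated
    if imported not in snk:                  # newly imported local name:
        snk = snk | {imported}               # add its sink node n, and
        edges = edges | {(b, imported) for b in continuation}
    return (src, snk, edges)

# The receiver initially knows only "a"; the message carries the
# local name "n", new to the receiver.
g = receive((set(), {"a"}, set()), continuation={"P"}, imported="n")
```

After the rewrite the graph records precisely that P, and only P, has entered the scope of n.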
higher-order communications: Consider the case where the variable x occurs as the subject of a message, like xu, in the content of a receiver (Figure 8a). If the received object n is a program code, then nu becomes a program to be activated. As in LHOπ, a program code to be transferred takes the form of an abstraction in a message. An abstraction, denoted as (y)Q, consists of a graph Q representing a program and its input argument y. When an abstraction (y)Q is sent to the receiver and substituted for x in Figure 8a, the behavior node (y)Qu is exposed and ready to be activated (Figure 8b). To activate (y)Qu, u is substituted for y in Q (Figure 8c). This action corresponds to β-conversion in LHOπ. Then we have the program Q with input value u, and it is activated. Note that new edges are created from each behavior of Q to the sink node which had an edge from xu.
2. Formal Definitions
In this section, we present formal definitions of the model presented informally in the previous section.
2.1. Programs
First, a countably infinite set of names is presupposed, as in other formal models based on process algebra.
Definition 2.1 (program, behavior) Programs and behaviors are defined recursively as follows.
(i) Let a1, . . . , ak be distinct names. A program is a bipartite directed acyclic graph with source nodes b1, . . . , bm and sink nodes a1, . . . , ak such that
• Each source node bi (1 ≤ i ≤ m) is a behavior. Duplicated occurrences of the same behavior are possible.
• Each sink node is a name aj (1 ≤ j ≤ k). All aj's are distinct.
• Each edge is directed from a source node to a sink node. Namely, an edge is an ordered pair (bi, aj) of a source node and a name. For any source node bi and name aj there is at most one edge from bi to aj.
For a program P, we denote the multiset of all source nodes of P as src(P), the set of all sink nodes as snk(P) and the set of all edges as edge(P). Note that the empty graph 0, such that src(0) = snk(0) = edge(0) = ∅, is a program.
(ii) A behavior is an application, a message or a node consisting of an epidermis and a content, defined as follows. In the rest of this definition, we assume that no element of snk(P) and no variable x occurs anywhere else in the program.
1. A tuple of a variable x and a program P is an abstraction, denoted as (x)P. An object is a name or an abstraction.
2. A node labeled with a tuple of a name n (called the subject of the message) and an object o is a message, denoted as no.
3. A node labeled with a tuple of an abstraction and an object is an application. We denote an application as Ao, where A is an abstraction and o is an object.
4. A node whose epidermis is labeled with "!" and whose content is a program P is a replication, denoted as !P.
5. An input prefix is a node (denoted as a(x).P) whose epidermis is labeled with a tuple of a name a and a variable x and whose content is a program P.
6. A τ-prefix is a node (denoted as τ.P) whose epidermis is labeled with the silent action τ and whose content is a program P.
Definition 2.2 (local program) A program P is local if, for any input prefix c(x).Q and any abstraction (x)Q occurring in P, x does not occur in the epidermis of any input prefix in Q. An abstraction (x)P is local if P is local. A local object is a local abstraction or a name. The locality condition says that "no one can use a name received from someone else to receive messages". Though this condition affects the expressive power of the model, we do not consider the damage to the expressive power caused by this restriction to be significant: as transfer of receiving capability is implemented with transfer of sending capability in many practical examples, we consider that local programs have enough expressive power for many important and interesting examples. So in this paper, we consider local programs only. Theoretical motivations for this restriction are discussed in [12].
Definition 2.3 (free/bound name)
1. For a behavior or an object p, the set of free names of p, fn(p), is defined as follows: fn(0) = ∅, fn(a) = {a} for a name a, fn(ao) = fn(o) ∪ {a}, fn((x)P) = fn(P) \ {x}, fn(!P) = fn(P), fn(τ.P) = fn(P), fn(a(x).P) = (fn(P) \ {x}) ∪ {a} and fn(o1 o2) = fn(o1) ∪ fn(o2).
2. For a program P where src(P) = {b1, . . . , bm}, fn(P) = (∪i fn(bi)) \ snk(P). The set of bound names of P (denoted as bn(P)) is the set of all names that occur in P but not in fn(P) (including the elements of snk(P), even if they do not occur in any element of src(P)).
The role of free names in our model is a little different from that in the π-calculus. For example, a free name x occurring in Q is used as a variable in (x)Q or a(x).Q.
A channel name that is used for communication with the environment is an element of snk, so it is not a free name.

Definition 2.4 (normal program) A program P is normal if for any b ∈ src(P) and for any n ∈ fn(b) ∩ snk(P), (b, n) ∈ edge(P), and any program occurring in b is also normal. It is quite natural to assume normality for programs, because someone must know a name to use it. In the rest of this paper we consider normal programs only.

Definition 2.5 (composition) Let P and Q be programs such that src(P) ∩ src(Q) = ∅ and fn(P) ∩ snk(Q) = fn(Q) ∩ snk(P) = ∅. The composition P ∥ Q of P and Q is the program such that src(P ∥ Q) = src(P) ∪ src(Q), snk(P ∥ Q) = snk(P) ∪ snk(Q) and edge(P ∥ Q) = edge(P) ∪ edge(Q). Intuitively, P ∥ Q is the parallel composition of P and Q. Note that we do not assume snk(P) ∩ snk(Q) = ∅. Obviously P ∥ Q = Q ∥ P and ((P ∥ Q) ∥ R) = (P ∥ (Q ∥ R)) for any P, Q and R from the definition. The empty graph 0 is the unit of "∥". Note that src(P) ∪ src(Q) and edge(P) ∪ edge(Q) denote multiset unions, while snk(P) ∪ snk(Q) denotes the set union. It is easy to show that for normal and local programs P and Q, P ∥ Q is normal and local.

Definition 2.6 (N-closure) For a normal program P and a set of names N such that N ∩ bn(P) = ∅, the N-closure νN(P) is the program such that src(νN(P)) = src(P), snk(νN(P)) = snk(P) ∪ N and edge(νN(P)) = edge(P) ∪ {(b, n) | b ∈ src(P), n ∈ N}. We denote νN1(νN2(P)) as νN1νN2(P) for a program P and sets of names N1 and N2.
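To make the graph operations of Definitions 2.5 and 2.6 concrete, here is a small Python sketch (a hypothetical encoding, not the paper's): a program is a dictionary with src (a list, standing for a multiset of behaviour nodes), snk (a set of name nodes) and edge (a list of (behaviour, name) pairs, standing for a multiset).

```python
# Hypothetical dictionary encoding of programs as bipartite graphs.

def program(src, snk, edge):
    return {"src": list(src), "snk": set(snk), "edge": list(edge)}

def compose(p, q):
    # P || Q (Definition 2.5): multiset union on src and edge, set union on
    # snk. The side conditions (src disjointness, fn/snk disjointness) are
    # assumed to hold and are not checked here.
    return program(p["src"] + q["src"],
                   p["snk"] | q["snk"],
                   p["edge"] + q["edge"])

def closure(names, p):
    # N-closure (Definition 2.6): every behaviour of P enters the scope of N.
    return program(p["src"],
                   p["snk"] | set(names),
                   p["edge"] + [(b, n) for b in p["src"] for n in names])

P = program(["b1"], {"n1"}, [("b1", "n1")])
Q = program(["b2"], {"n1"}, [])
PQ = compose(P, Q)      # snk(P) and snk(Q) may overlap: the set union keeps one n1
C = closure({"m"}, PQ)  # edges (b1, m) and (b2, m) are added
```

Note how the representation mirrors the definition's mixed unions: src and edge concatenate (multisets), while snk merges as a set.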
Definition 2.7 (deleting a behavior) For a normal program P and b ∈ src(P), P \ b is the program obtained by deleting the node b and the edges connected with b from P. Namely, src(P \ b) = src(P) \ {b}, snk(P \ b) = snk(P) and edge(P \ b) = edge(P) \ {(b, n) | (b, n) ∈ edge(P)}. Note that src(P) \ {b} and edge(P) \ {(b, n) | (b, n) ∈ edge(P)} denote multiset subtractions.

Definition 2.8 (context) Let P be a program and b ∈ src(P), where b is an input prefix, a τ-prefix or a replication and the content of b is 0. A simple first-order context is a graph P[ ] such that the content 0 of b is replaced with a hole "[ ]". We call a simple context a τ-context if the hole is the content of a τ-prefix, an input context if it is the content of an input prefix, and a replication context if it is the content of a replication. Let P be a program such that b ∈ src(P) and b is an application (x)0 Q. An application context P[ ] is a graph obtained by replacing the behavior b with (x)[ ] Q. A simple context is a simple first-order context or an application context. A context is a simple context or the graph P[Q[ ]] that is obtained by replacing the hole of P[ ] with Q[ ] for a simple context P[ ] and a context Q[ ] (with some renaming of the names occurring in Q if necessary). For a context P[ ] and a program Q, P[Q] is the program obtained by replacing the hole in P[ ] by Q (with some renaming of the names occurring in Q if necessary).

2.2. Operational Semantics

We define the operational semantics with a labeled transition system. The substitution of an object into a program, a behavior or an object is defined recursively as follows.

Definition 2.9 (substitution) Let p be a behavior, an object or a program, and let o be an object. For a name a, we assume that a ∈ fn(p). The mapping [o/a] defined as follows is a substitution.
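Definition 2.7 can be sketched in the same hypothetical dictionary encoding (src and edge as lists standing for multisets, snk as a set). Node identity is by value here, which conflates behaviours with equal labels; this is a simplification of the paper's multiset treatment, for illustration only.

```python
# Hypothetical sketch of P \ b (Definition 2.7): remove one occurrence of b
# from the multiset src and every edge attached to b; snk is unchanged.

def delete(p, b):
    src = list(p["src"])
    src.remove(b)  # multiset subtraction: a single occurrence is removed
    return {"src": src,
            "snk": set(p["snk"]),
            "edge": [(c, n) for (c, n) in p["edge"] if c != b]}

P = {"src": ["b1", "b2"], "snk": {"n"}, "edge": [("b1", "n"), ("b2", "n")]}
P1 = delete(P, "b1")   # b2 and its edge survive; the name node n stays in snk
```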
• for a name c, c[o/a] = o if c = a, and c[o/a] = c otherwise,
• for behaviors, ((x)P)[o/a] = (x)(P[o/a]), (o1 o2)[o/a] = o1[o/a] o2[o/a], (!P)[o/a] = !(P[o/a]), (c(x).P)[o/a] = c(x).(P[o/a]) and (τ.P)[o/a] = τ.(P[o/a]),
• and for a program P with a ∈ fn(P), P[o/a] = P′ where P′ is the program such that src(P′) = {b[o/a] | b ∈ src(P)}, snk(P′) = snk(P) and edge(P′) = {(b[o/a], n) | (b, n) ∈ edge(P)}.

For the cases of abstraction and input prefix, note that we can assume x ≠ a, because a ∈ fn((x)P) or a ∈ fn(c(x).P), without losing generality. (We can rename x if necessary.)

Definition 2.10 Let p be a local program or a local object. A substitution [a/x] is acceptable for p if x ≠ c for any input prefix c(y).Q occurring in p. In the rest of this paper, we consider only acceptable substitutions for a program or an abstraction, because in any execution of a local program, if a substitution is applied by one of the rules of the operational semantics then it is acceptable. Namely, we assume that [o/a] is applied only to objects in which a does not occur as the subject of any input prefix. It is easy to show from the definitions that substitution and N-closure distribute over "∥" and "\".
Definition 2.11 (action) For a name a and an object o, an input action is a tuple of a and o denoted as a(o), and an output action is a tuple denoted as ao. An action is a silent action τ, an output action or an input action.

Definition 2.12 (labeled transition) For an action α, −α→ is the least binary relation on normal programs that satisfies the following rules.

input: If b ∈ src(P) and b = a(x).Q, then P −a(o)→ (P \ b) ∥ ν{n | (b, n) ∈ edge(P)}νM(Q[o/x]) for an object o and a set of names M such that fn(o) ∩ snk(P) ⊆ M ⊆ fn(o) \ fn(P).

β-conversion: If b ∈ src(P) and b = (y)Q o, then P −τ→ (P \ b) ∥ ν{n | (b, n) ∈ edge(P)}(Q[o/y]).

τ-action: If b ∈ src(P) and b = τ.Q, then P −τ→ (P \ b) ∥ ν{n | (b, n) ∈ edge(P)}(Q).

replication 1: P −α→ P′ if !Q = b ∈ src(P) and P ∥ ν{n | (b, n) ∈ edge(P)}Q′ −α→ P′, where Q′ is a program obtained from Q by renaming all names in snk(R) to distinct fresh names that do not occur elsewhere in P nor in programs executed in parallel with P, for all R's where each R is a program that occurs in Q (including Q itself).

replication 2: P −τ→ P′ if !Q = b ∈ src(P) and P ∥ ν{n | (b, n) ∈ edge(P)}(Q1 ∥ Q2) −τ→ P′, where each Qi (i = 1, 2) is a program obtained from Q by renaming all names in snk(R) to distinct fresh names that do not occur elsewhere in P nor in programs executed in parallel with P, for all R's where each R is a program that occurs in Q (including Q itself).

output: If b ∈ src(P) and b = av, then P −av→ P \ b.

communication: If b1, b2 ∈ src(P), b1 = ao and b2 = a(x).Q, then P −τ→ ((P \ b1) \ b2) ∥ ν{n | (b2, n) ∈ edge(P)}ν(fn(o) ∩ snk(P))(Q[o/x]).

In all rules above except replication 1/2, the behavior that triggers an action is removed from src(P), so the edges from the removed behavior no longer exist after the action. The set of names M that occurs in the input rule is the set of local names imported by the input action. Some names in M may be new to P, and others may already be known to P while b is not in their scope.
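As an illustration of the shape these rules take on the graph encoding, here is a hypothetical Python sketch of the τ-action rule: the prefix node b = τ.Q is removed, and its content Q joins the program closed under the names b was connected to. A τ-prefix is represented as the pair ("tau", Q); this representation and the dictionary encoding are introduced here only for illustration.

```python
# Sketch of the tau-action rule of Definition 2.12:
#   P --tau--> (P \ b) || nu{n | (b, n) in edge(P)}(Q)   for b = tau.Q

def tau_step(p, b):
    content = b[1]                                  # b = ("tau", Q)
    scope = {n for (c, n) in p["edge"] if c == b}   # {n | (b, n) in edge(P)}
    src = list(p["src"])
    src.remove(b)                                   # P \ b
    new_src = list(content["src"])
    return {
        "src": src + new_src,                       # composition of the parts
        "snk": p["snk"] | content["snk"] | scope,   # nu(scope) adds the names
        "edge": [(c, n) for (c, n) in p["edge"] if c != b]
                + list(content["edge"])
                + [(c, n) for c in new_src for n in scope],
    }

Q = {"src": ["c"], "snk": set(), "edge": []}
b = ("tau", Q)
P = {"src": [b], "snk": {"n"}, "edge": [(b, "n")]}
P2 = tau_step(P, b)   # c inherits b's scope: the edge ("c", "n") is added
```

The key point the sketch makes visible is the closure step: the content of the fired prefix is placed in the scope of exactly the names its epidermis was connected to.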
We can show that for any programs P and P′ and any action α such that P −α→ P′, if P is local then P′ is local, and if P is normal then P′ is normal.

Proposition 2.1 For any normal programs P, P′ and Q, and any action α, if P −α→ P′ then P ∥ Q −α→ P′ ∥ Q.
proof (outline): By induction on the number of replication 1/2 rules used to derive P −α→ P′.

Proposition 2.2 For any programs P, Q and R and any action α, if P ∥ Q −α→ R is derived immediately by one of input, β-conversion, τ-action or output, then R = P′ ∥ Q for some P −α→ P′ or R = P ∥ Q′ for some Q −α→ Q′.
proof (outline): Straightforward from the definition.

Proposition 2.3 If Q −ao→ Q′ and R −a(o)→ R′, then Q ∥ R −τ→ Q′ ∥ R′ (and R ∥ Q −τ→ R′ ∥ Q′).
proof (outline): By induction on the total number of replication 1/2 rules used to derive Q −ao→ Q′ and R −a(o)→ R′.
2.3. Behavioral Equivalence

The strong bisimulation relation is defined as usual. It is easy to show that ∼ defined as in Definition 2.13 is an equivalence relation.

Definition 2.13 (strong bisimulation equivalence) A binary relation R on normal programs is a strong bisimulation if for any (P, Q) ∈ R (or (Q, P) ∈ R), for any α and P′, if P −α→ P′ then there exists Q′ such that Q −α→ Q′ and (P′, Q′) ∈ R ((Q′, P′) ∈ R), and for any Q −α→ Q′ the similar condition holds. Strong bisimulation equivalence ∼ is defined as the union of {R | R is a strong bisimulation}.

The following proposition is straightforward from the definitions.

Proposition 2.4 If src(P1) = src(P2) then P1 ∼ P2.

We can show the congruence results of strong bisimulation equivalence [10] as Propositions 2.6–2.10 and Theorem 2.1. First we have the congruence result w.r.t. "∥".

Proposition 2.5 For any program R, if P ∼ Q then P ∥ R ∼ Q ∥ R.

Propositions 2.6–2.9 below say that ∼ is a congruence relation w.r.t. τ-prefix, replication, input prefix and application respectively.

Proposition 2.6 For any P and Q such that P ∼ Q and for any τ-context R[ ], R[P] ∼ R[Q].

Proposition 2.7 For any P and Q such that P ∼ Q and for any replication context R[ ], R[P] ∼ R[Q].

Proposition 2.8 For any P and Q such that P ∼ Q and for any input context R[ ], R[P] ∼ R[Q].

Proposition 2.9 For any P and Q such that P ∼ Q and for any application context R[ ], R[P] ∼ R[Q].

From Propositions 2.6–2.9, we have the following result by induction on the definition of context.

Theorem 2.1 For any P and Q such that P ∼ Q and for any context R[ ], R[P] ∼ R[Q].

For the asynchronous π-calculus, the congruence result w.r.t. name restriction, "P ∼ Q implies νxP ∼ νxQ", has also been reported. We can show the corresponding result with an argument similar to the first-order case [7].

Proposition 2.10 For any P and Q and a set of names N such that N ∩ (bn(P) ∪ bn(Q)) = ∅, if P ∼ Q then νN(P) ∼ νN(Q).
3. Scope Equivalence

This section presents an equivalence relation on programs which ensures that two systems are equivalent both in their behavior and in the scopes of names.

Definition 3.1 For a process graph P and a name n such that n ∈ snk(P), P/n is the program defined as follows: src(P/n) = {b | b ∈ src(P), (b, n) ∈ edge(P)}, snk(P/n) = snk(P) \ {n} and edge(P/n) = {(b, a) | b ∈ src(P/n), a ∈ snk(P/n), (b, a) ∈ edge(P)}.
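In the hypothetical dictionary encoding of graphs (src as a list standing for a multiset, snk as a set, edge as a list of pairs), P/n keeps exactly the behaviours connected to n; the example below mirrors the P/a1 illustration discussed around Figure 9.

```python
# Hypothetical sketch of P/n (Definition 3.1): keep the behaviours in the
# scope of n, drop n itself from snk, and keep only edges between survivors.

def restrict(p, n):
    src = [b for b in p["src"] if (b, n) in p["edge"]]
    snk = set(p["snk"]) - {n}
    edge = [(b, a) for (b, a) in p["edge"] if b in src and a in snk]
    return {"src": src, "snk": snk, "edge": edge}

P = {"src": ["b1", "b2", "b3"], "snk": {"a1", "a2"},
     "edge": [("b1", "a1"), ("b2", "a1"), ("b2", "a2"), ("b3", "a2")]}
Pa1 = restrict(P, "a1")   # keeps b1 and b2 and the name node a2; drops b3 and a1
```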
Figure 9. The graph P/a1 .
Intuitively, P/n is the subsystem of P that consists of the behaviors which are in the scope of n. Let P be the example of Figure 2; P/a1 is the subgraph of Figure 2 obtained by removing the node b3 (and the edge from b3 to a2) and a1 (and the edges to a1), as shown in Figure 9. It consists of the process nodes b1 and b2 and the name node a2.

The following propositions are straightforward from the definitions. We will refer to them in the proofs of the congruence results w.r.t. the scope equivalence that will be defined below.

Proposition 3.1 For any P, Q and n ∈ snk(P) ∪ snk(Q), (P ∥ Q)/n = P/n ∥ Q/n.

Proposition 3.2 For a program P, a set of names N such that N ∩ bn(P) = ∅ and n ∈ snk(P), (νN(P))/n = νN(P/n).

Proposition 3.3 Let R[ ] be a context and P be a program. For any name m ∈ snk(R), (R[P])/m = (R/m)[P].

Definition 3.2 (scope bisimulation) A binary relation R on programs is a scope bisimulation if for any (P, Q) ∈ R:
1. P = 0 iff Q = 0,
2. src(P/n) = ∅ iff src(Q/n) = ∅ for any n ∈ snk(P) ∩ snk(Q),
3. P/n ∼ Q/n for any n ∈ snk(P) ∩ snk(Q), and
4. R is a strong bisimulation.
It is easy to show that the union of all scope bisimulations is a scope bisimulation, and that it is the unique largest scope bisimulation.

Definition 3.3 (scope equivalence) The largest scope bisimulation is called scope equivalence and denoted as ⊥.

It is obvious from the definition that ⊥ is an equivalence relation. The motivation and background of the definition of ⊥ are reported in [6,8]. As ⊥ is a strong bisimulation by condition 4 of Definition 3.2, we have the following proposition.

Proposition 3.4 P ⊥ Q implies P ∼ Q.

Definition 3.4 (scope bisimulation up to ⊥) A binary relation R on programs is a scope bisimulation up to ⊥ if for any (P, Q) ∈ R:
1. P = 0 iff Q = 0,
2. src(P/n) = ∅ iff src(Q/n) = ∅ for any n ∈ snk(P) ∩ snk(Q),
3. P/n ∼ Q/n for any n ∈ snk(P) ∩ snk(Q), and
4. R is a strong bisimulation up to ⊥, namely for any P and Q such that (P, Q) ∈ R (or (Q, P) ∈ R) and for any P′ such that P −α→ P′, there exists Q′ such that Q −α→ Q′ and P′ ⊥R⊥ Q′ (Q′ ⊥R⊥ P′).
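Conditions 1–3 of Definitions 3.2/3.4 are directly checkable on the graph encoding; the following hypothetical sketch checks them for one pair (P, Q), taking the behavioural check ∼ of condition 3 as an oracle. Condition 4 (R being a strong bisimulation) is not modelled, and the empty program 0 is approximated by an empty src multiset.

```python
def restrict(p, n):
    # P/n as in Definition 3.1 (dictionary encoding: src list, snk set, edge list)
    src = [b for b in p["src"] if (b, n) in p["edge"]]
    return {"src": src, "snk": set(p["snk"]) - {n},
            "edge": [(b, a) for (b, a) in p["edge"] if b in src and a != n]}

def scope_conditions(p, q, bisimilar):
    if (not p["src"]) != (not q["src"]):           # 1: P = 0 iff Q = 0
        return False
    for n in p["snk"] & q["snk"]:                  # common names in snk
        pn, qn = restrict(p, n), restrict(q, n)
        if (not pn["src"]) != (not qn["src"]):     # 2: emptiness agrees
            return False
        if not bisimilar(pn, qn):                  # 3: P/n ~ Q/n (oracle)
            return False
    return True

P = {"src": ["b1"], "snk": {"n"}, "edge": [("b1", "n")]}
Q = {"src": ["b2"], "snk": {"n"}, "edge": []}
# condition 2 fails: src(P/n) is nonempty while src(Q/n) is empty
print(scope_conditions(P, Q, lambda a, b: True))   # False
```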
The following proposition is straightforward from the definition and the transitivity of “⊥”.
Figure 10. Graph representation of P1 and mo.
Proposition 3.5 If R is a scope bisimulation up to ⊥, then ⊥R⊥ is a scope bisimulation.

Proposition 3.6 If b ∈ src(P) and !Q = b, then P ∥ ν{n | (b, n) ∈ edge(P)}Q′ ⊥ P, where Q′ is a program obtained from Q by renaming the names in bn(Q) to fresh names.
proof (outline): We have the result by showing that the relation {(P ∥ ν{n | (b, n) ∈ edge(P)}Q′, P) | !Q ∈ src(P), Q′ is obtained from Q by fresh renaming of bn(Q)} ∪ ⊥ is a scope bisimulation up to ⊥, and by Proposition 3.5.

Example 3.1 Consider the following (asynchronous) π-calculus processes: P1 = m(x).τ.Q and P2 = νn(m(u).na | n(x).Q). Assume that neither x nor n occurs in Q. P1 and P2 are strongly bisimilar. Consider the case that a message mo is received by Pi (i = 1, 2). In P1, the object o reaches τ.Q by the execution of m(o). On the other hand, o does not reach Q in the case of P2. Assume that o is so confidential that it must not be received by any unauthorized process, and that Q and τ.Q are not authorized. (Here, we consider that even just receiving is not allowed, even if the data is not stored.) Then P1 is an illegal process but P2 is not. Thus P1 and P2 should be distinguished, but they cannot be distinguished with the usual behavioral equivalences of the π-calculus. Furthermore, we cannot see whether o reached an unauthorized process or not just from the resulting processes Q and νnQ. This means that for a system implemented in a programming language based on a model such as the π-calculus, if someone optimizes the system into a behaviorally equivalent one without taking care of the scopes, the security of the original system may be damaged.

One may say that stronger equivalence relations such as syntactic equivalence or structural congruence would work. Of course, syntactic equivalence can distinguish these two cases, but it is not convenient. How about structural congruence? Unfortunately it is not successful. It is easy to give an example such that P2 ̸≡ P3 but both of them are legal (and behaviorally equivalent), for example P3 = νn(m(u).(n1 a1 | n1 a2) | n1(x1).n2(x2).Q). (Furthermore, we can also give an example of two processes which are structurally congruent where one of them is legal but the other is not.)

We can use the bipartite directed acyclic graph model presented here to distinguish P1 and P2. The system consisting of P1 and the message mo is given by the left graph of Figure 10.¹ This graph evolves to the right graph of Figure 10 (in the case that o is a name), which corresponds to the π-calculus process Q. This graph explicitly denotes that Q is in the scope of the newly imported name o. On the other hand, P2 with mo is the left graph of Figure 11. After receiving the message carrying o, the graph evolves into the right graph of Figure 11. This explicitly shows that Q is not in the scope of o. We can see this difference by showing P1 ̸⊥ P2.

One may consider that an equivalence relation similar to ⊥ can be defined on a model based on a process algebra, for example by encoding a graph into an algebraic term. However, it is
¹ The sink nodes corresponding to n are not depicted in the following examples.
Figure 11. Graph representation of P2 and mo.
Figure 12. Graph P and Q.
not easy to define, by a naive encoding of the graph model, an operational semantics on which we can still enjoy the merits of an algebraic model. In particular, it seems difficult to give an orthodox structural operational semantics or a reduction semantics consisting of a set of rules that rewrite subterms locally. We consider that some non-trivial idea is needed for such an encoding.
4. Congruence Results of Scope Equivalence

This section discusses the congruence properties of scope equivalence.

4.1. Congruence Results w.r.t. Composition, τ-prefix and Replication

The next proposition says that ⊥ is a congruence relation w.r.t. "∥".

Proposition 4.1 If P ⊥ Q then P ∥ R ⊥ Q ∥ R.
proof: See Appendix I.

The following proposition is also straightforward from the definitions.

Proposition 4.2 For any programs P and Q, let P′ and Q′ be programs obtained from P and Q respectively by renaming n ∈ snk(P) ∩ snk(Q) to a fresh name n′. If P ⊥ Q then P′ ⊥ Q′.

The following proposition is the congruence result of ⊥ w.r.t. new names.

Proposition 4.3 For any P and Q and a set of names N such that N ∩ (bn(P) ∪ bn(Q)) = ∅, if P ⊥ Q then νN(P) ⊥ νN(Q).
proof (outline): We show that the following relation is a scope bisimulation: {(νN(P), νN(Q)) | P ⊥ Q}. Conditions 1 and 2 of Definition 3.2 are straightforward from the definition. Condition 3 follows from Proposition 3.2 and Proposition 2.10. Condition 4 is shown by induction on the number of applications of replication 1/2.
Proposition 4.4 For any P and Q such that P ⊥ Q and for any τ-context R[ ], R[P] ⊥ R[Q].
proof: See Appendix II.

Proposition 4.5 For any P and Q such that P ⊥ Q and for any replication context R[ ], R[P] ⊥ R[Q].
proof: See Appendix III.

4.2. Input and Application Context

We can show that strong bisimulation equivalence is a congruence w.r.t. input prefix contexts and application contexts [10]. Unfortunately, this is not the case for the scope equivalence of higher-order programs. Our results show that ⊥ is congruent neither w.r.t. input contexts nor w.r.t. application contexts. The essential problem is that ⊥ is not a congruence w.r.t. substitutions of abstractions, as the following counterexample shows.

Example 4.1 (i) Let P be a graph such that src(P) = {b1, b2}, edge(P) = {(b1, n1), (b2, n2)} and snk(P) = {n1, n2}, and let Q be a graph such that src(Q) = {b}, edge(Q) = {(b, n1), (b, n2)} and snk(Q) = {n1, n2}, where both b and bi (i = 1, 2) are !xa, as in Figure 12. Note that nj (j = 1, 2) occurs neither in b nor in bi (i = 1, 2).

Lemma 4.1 Let P and Q be as in Example 4.1 (i). Then we have P ⊥ Q.
proof (outline): We show that the relation {(P, Q)} is a scope bisimulation. Condition 1 of Definition 3.2 is obvious, as neither P nor Q is the empty graph. For nj (j = 1, 2), neither P/nj nor Q/nj is 0, so condition 2 of Definition 3.2 holds. For condition 3, P/nj is the graph such that src(P/nj) = {bj}, and Q/nj is the graph such that src(Q/nj) = {b}. As bi = b = !xa, src(P/nj) = src(Q/nj). From Proposition 2.4, P/nj ∼ Q/nj. For condition 4, it is easy to show that the relation {(P, Q)} is a bisimulation, because P −xa→ P and Q −xa→ Q are the only transitions for P and Q respectively.

Example 4.1 (ii) Let P and Q be as in Example 4.1 (i). Now, let o be the abstraction (y)c(u).d(v).R, where R is a program. P[o/x] is the graph such that src(P[o/x]) = {b1[o/x], b2[o/x]}, snk(P[o/x]) = {n1, n2} and edge(P[o/x]) = {(b1[o/x], n1), (b2[o/x], n2)}, as in Figure 13a, top. And Q[o/x] is the graph such that src(Q[o/x]) = {b[o/x]}, snk(Q[o/x]) = {n1, n2} and edge(Q[o/x]) = {(b[o/x], n1), (b[o/x], n2)}, where b[o/x] and bi[o/x] (i = 1, 2) are !((y)c(u).d(v).R)a, as in Figure 13b, top.

Lemma 4.2 Let P[o/x] and Q[o/x] be as in Example 4.1 (ii). Then, P[o/x] ̸⊥ Q[o/x].
proof: See Appendix IV.

Note that the object o in the counterexample is an abstraction. This incongruence happens only in the case of higher-order substitution. In fact, scope equivalence is a congruence w.r.t. substitution of any first-order term, by an argument similar to that presented in [9]. From Lemmas 4.1 and 4.2, we have the following results.

Proposition 4.6 There exist P and Q such that P ⊥ Q but P[o/x] ̸⊥ Q[o/x] for some object o.

Proposition 4.7 There exist P and Q such that P ⊥ Q but I[P] ̸⊥ I[Q] for some input context I[ ].
a. P [o/x]
b. Q[o/x]
Figure 13. Transitions of P [o/x] and Q[o/x].
proof (outline): Let P and Q be as in Example 4.1 (i), and let I[ ] be an input context with a behavior m(x).[ ]. Consider the transitions I[P] −m(o)→ P[o/x] and I[Q] −m(o)→ Q[o/x] for the o of Example 4.1 (ii).

Proposition 4.8 There exist P and Q such that P ⊥ Q but A[P] ̸⊥ A[Q] for some application context A[ ].
proof (outline): Let P, Q and o be as in Example 4.1 (ii), and let A[ ] be an application context with a behavior (x)[ ] o.
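The component check underlying Lemma 4.1 can be replayed in the hypothetical dictionary encoding of graphs. Nodes below carry an identity besides their label, so the two copies b1, b2 of !xa in P stay distinct. For each nj, both P/nj and Q/nj consist of a single behaviour labelled !xa, which is why conditions 2 and 3 of Definition 3.2 hold; the difference only appears after substituting the abstraction o and firing transitions (Lemma 4.2), which this sketch does not model.

```python
# Replaying Example 4.1 (i): P has two copies of !xa, each knowing one name;
# Q has one copy of !xa knowing both names. Per name, the components agree.

def restrict(p, n):
    # P/n as in Definition 3.1
    src = [b for b in p["src"] if (b, n) in p["edge"]]
    return {"src": src, "snk": set(p["snk"]) - {n},
            "edge": [(b, a) for (b, a) in p["edge"] if b in src and a != n]}

b1, b2, b = ("b1", "!xa"), ("b2", "!xa"), ("b", "!xa")  # (identity, label)
P = {"src": [b1, b2], "snk": {"n1", "n2"}, "edge": [(b1, "n1"), (b2, "n2")]}
Q = {"src": [b],      "snk": {"n1", "n2"}, "edge": [(b, "n1"), (b, "n2")]}

labels = lambda p: sorted(lbl for (_, lbl) in p["src"])
for n in ("n1", "n2"):
    # both components are the single behaviour !xa, matching Lemma 4.1
    assert labels(restrict(P, n)) == labels(restrict(Q, n)) == ["!xa"]
```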
5. Conclusion and Future Work

This paper presented congruence results of scope equivalence w.r.t. new names, composition, τ-prefix and replication for a model with higher-order communication. We also showed that scope equivalence is not a congruence w.r.t. input contexts and application contexts. As we presented in [9], scope equivalence is a congruence w.r.t. input contexts in the first-order case. Thus, the non-congruence problem arises from higher-order substitutions. The lack of substitutivity of the equivalence relation makes analysis and verification of systems difficult. We will study this problem through the following approaches as future work.

The first approach is a revision of the definition of scope equivalence. The definition of ⊥ is based on the idea that two processes are equivalent if, for each name, the components that know the name are equivalent. This idea is implemented as condition 3 of Definition 3.2. Our alternative idea for the third condition is P/N ∼ Q/N for each subset N of common private names
instead of P/n ∼ Q/n. P and Q in Lemma 4.1 are not equivalent under this definition. We should study whether this alternative definition is suitable for the equivalence of processes.

The second approach involves the counterexample. As the counterexample presented in Section 4.2 is an artificial one, we should study whether there are any practical examples.

Finally, we must reconsider our model of higher-order communication. In our model, an output message has the same form as a tuple of a process variable that receives a higher-order term and an argument term. This idea is from LHOπ [12]. One of the main reasons why LHOπ adopts this approach is type-theoretical convenience. As we saw in Lemma 4.2, this identification of output messages and process variables causes the problem with congruence. Thus we should reconsider the model of higher-order communication used.
References
[1] Martin Abadi and Andrew D. Gordon. A Calculus for Cryptographic Protocols: The Spi Calculus. Information and Computation, 148, pp. 1-70, 1999.
[2] Hartmut Ehrig and Barbara König. Deriving Bisimulation Congruences in the DPO Approach to Graph Rewriting with Borrowed Contexts. Mathematical Structures in Computer Science, vol. 16, no. 6, pp. 1133-1163, 2006.
[3] Fabio Gadducci. Term Graph Rewriting for the π-calculus. Proc. of APLAS '03 (Programming Languages and Systems), LNCS 2895, pp. 37-54, 2003.
[4] Barbara König. A Graph Rewriting Semantics for the Polyadic π-Calculus. Proc. of GT-VMT '00 (Workshop on Graph Transformation and Visual Modeling Techniques), pp. 451-458, 2000.
[5] Robin Milner. Bigraphical Reactive Systems. Proc. of CONCUR '01, LNCS 2154, Springer, pp. 16-35, 2001.
[6] Masaki Murakami. A Formal Model of Concurrent Systems Based on Bipartite Directed Acyclic Graph. Science of Computer Programming, Elsevier, 61, pp. 38-47, 2006.
[7] Masaki Murakami. Congruence Results of Behavioral Equivalence for a Graph Rewriting Model of Concurrent Programs. Proc. of ICITA 2008, pp. 636-641, 2008 (to appear in IJTMS, Inderscience).
[8] Masaki Murakami. A Graph Rewriting Model of Concurrent Programs with Higher-Order Communication. Proc. of TMFCS 2008, pp. 80-87, 2008.
[9] Masaki Murakami. Congruence Results of Scope Equivalence for a Graph Rewriting Model of Concurrent Programs. Proc. of ICTAC 2008, LNCS 5160, pp. 243-257, 2008.
[10] Masaki Murakami. Congruence Results of Behavioral Equivalence for a Graph Rewriting Model of Concurrent Programs with Higher-Order Communication. Submitted to FST-TCS 2009.
[11] Davide Sangiorgi and David Walker. The π-calculus: A Theory of Mobile Processes. Cambridge University Press, 2001.
[12] Davide Sangiorgi. Asynchronous Process Calculi: The First- and Higher-order Paradigms. Theoretical Computer Science, 253, pp. 311-350, 2001.
[13] Vladimiro Sassone and Paweł Sobociński. Reactive Systems over Cospans. Proc. of LICS '05, IEEE, pp. 311-320, 2005.
Appendix I: Proof of Proposition 4.1 (outline)

We can show that the following relation R′ is a scope bisimulation:
R′ = {(P ∥ R, Q ∥ R) | P ⊥ Q}.
Condition 1 of Definition 3.2 is straightforward from the definition of "∥". The second condition is also straightforward from Proposition 3.1 and the definition of "∥". Condition 3 follows from Proposition 2.5 and Proposition 3.1.
Condition 4 is shown by induction on the number of replication 1/2 rules used to derive P ∥ R −α→ P′.

If it is derived immediately by one of input, β-conversion, τ-action or output, then there exists Q′ such that Q ∥ R −α→ Q′ by Propositions 2.2 and 2.1, and (P′, Q′) ∈ R′ as ⊥ is a bisimulation. For the case that it is derived immediately from the communication rule, we consider two cases. First, if both b1 and b2 are in one of src(P) or src(R), we can show the existence of Q′ such that Q ∥ R −τ→ Q′ and P′ ⊥ Q′ by an argument similar to the cases of input etc. mentioned above. The second case is that one of b1 and b2 is in src(P) and the other is in src(R). If b1 is in src(P), then P −ao→ P1 by output and R −a(o)→ R1 by input. From P ⊥ Q, Q −ao→ Q1 and P1 ⊥ Q1. From Proposition 2.3, Q ∥ R −τ→ Q1 ∥ R1, and (P1 ∥ R1, Q1 ∥ R1) ∈ R′ from the definition. The case that b2 is in src(P) is similar.

Consider the case that P ∥ R −α→ P′ is derived by applying k + 1 replication 1/2 rules. If the (k + 1)-th rule is replication 1, then b = !S ∈ src(P ∥ R).

First we consider b ∈ src(P). From the premises of replication 1, P ∥ R ∥ ν{n | (b, n) ∈ edge(P ∥ R)}S′ −α→ P′. As b ∈ src(P), ν{n | (b, n) ∈ edge(P)}S′ = ν{n | (b, n) ∈ edge(P ∥ R)}S′. From Proposition 3.6 and the transitivity of ⊥, P ∥ ν{n | (b, n) ∈ edge(P)}S′ ⊥ Q. Thus, (P ∥ ν{n | (b, n) ∈ edge(P)}S′ ∥ R, Q ∥ R) ∈ R′. And P ∥ R ∥ ν{n | (b, n) ∈ edge(P)}S′ −α→ P′ is derived by applying k replication 1/2 rules. From the inductive hypothesis, there exists Q′ such that Q ∥ R −α→ Q′ and P′ ⊥ Q′.

If b ∈ src(R), then ν{n | (b, n) ∈ edge(P ∥ R)}S′ = ν{n | (b, n) ∈ edge(R)}S′. From the premises of replication 1, P ∥ R ∥ ν{n | (b, n) ∈ edge(R)}S′ −α→ P′ with k applications of replication 1/2. As (P ∥ R ∥ ν{n | (b, n) ∈ edge(R)}S′, Q ∥ R ∥ ν{n | (b, n) ∈ edge(R)}S′) ∈ R′, there exists Q′ such that Q ∥ R ∥ ν{n | (b, n) ∈ edge(R)}S′ −α→ Q′ and P′ ⊥ Q′ from the inductive hypothesis. As b = !S ∈ src(Q ∥ R), Q ∥ R −α→ Q′ by replication 1.

The case of replication 2 is similar.
Appendix II: Proof of Proposition 4.4 (outline)

We have the result by showing that the following relation R′ is a scope bisimulation:
R′ = {(R[P1], R[P2]) | P1 ⊥ P2, R[ ] is a τ-context} ∪ ⊥.
Condition 1 of Definition 3.2 is straightforward from the definitions. Condition 2 follows from Proposition 3.3. Condition 3 follows from Propositions 3.3, 3.4 and 2.6.

For condition 4, we can assume that R[ ] has the form τ.[ ] ∥ R1, where τ.[ ] is a context that consists of just one behavior node, a τ-prefix with a hole. Then any transition R[P1] −α→ P1′ of R[P1] is derived either by an application of the τ-action rule to τ.[P1] or by a transition of R1. For the first case, P1′ has the form νN(P1) ∥ R1. Similarly, there exists a transition for R[P2] such that R[P2] −α→ νN(P2) ∥ R1. As P1 ⊥ P2, we have νN(P1) ∥ R1 ⊥ νN(P2) ∥ R1 from Propositions 4.1 and 4.3. If the transition is derived by applying some rule to R1, P1′ has the form τ.[P1] ∥ R1′ where R1 −α→ R1′. Then we have τ.[P2] ∥ R1 −α→ τ.[P2] ∥ R1′ from Proposition 2.1, and (τ.[P1] ∥ R1′, τ.[P2] ∥ R1′) ∈ R′.
Appendix III: Proof of Proposition 4.5 (outline)

We can show the result by showing that the following relation R′ is a scope bisimulation up to ⊥, and by Proposition 3.5:
R′ = {(R[P1], R[P2]) | P1 ⊥ P2, R[ ] is a replication context} ∪ ⊥.
Condition 1 of Definition 3.4 is straightforward from the definitions. Condition 2 follows from Proposition 3.3. Condition 3 follows from Propositions 3.3, 3.4 and 2.7.

Condition 4 is shown by induction on the number of replication rules used to derive R[P1] −α→ R1′. We can assume that R[ ] has the form ![ ] ∥ R1, where ![ ] is a context that consists of just one behavior node, a replication of a hole.

If R[P1] = ![P1] ∥ R1 −α→ R1′ is derived without any application of replication 1/2, it is a transition of R1. For this case, we can show that there exists R2′ such that R[P2] −α→ R2′ and (R1′, R2′) ∈ R′, by an argument similar to the proof of Proposition 4.4. Now we go into the induction step. If replication 1/2 is applied to R1, then we can show the result in the same way as the base case.

We consider the case that R[P1] −α→ R1′ is derived by replication 1 for ![P1]. Then ![P1] ∥ ν{n | (![P1], n) ∈ edge(R[P1])}(P1′) ∥ R1 −α→ R1′, where P1′ is a renaming of P1. By the induction hypothesis, there exists R2″ such that ![P2] ∥ ν{n | (![P1], n) ∈ edge(R[P1])}(P1′) ∥ R1 −α→ R2″ and (R1′, R2″) ∈ R′. From Proposition 4.2, P1 ⊥ P2 implies P1′ ⊥ P2′, and we have ![P2] ∥ ν{n | (![P1], n) ∈ edge(R[P1])}(P1′) ∥ R1 ⊥ ![P2] ∥ ν{n | (![P1], n) ∈ edge(R[P1])}(P2′) ∥ R1 from Propositions 4.3 and 4.1. From this, we have ![P2] ∥ ν{n | (![P1], n) ∈ edge(R[P1])}(P2′) ∥ R1 ⊥ ![P2] ∥ ν{n | (![P2], n) ∈ edge(R[P2])}(P2′) ∥ R1, as {n | (![P1], n) ∈ edge(R[P1])} = {n | (![P2], n) ∈ edge(R[P2])}. Then there exists R2′ such that ![P2] ∥ ν{n | (![P2], n) ∈ edge(R[P2])}(P2′) ∥ R1 −α→ R2′ and R2″ ⊥ R2′. Thus (R1′, R2′) ∈ ⊥R′⊥. The case of replication 2 is similar, and then R′ is a scope bisimulation up to ⊥.

Appendix IV: Proof of Lemma 4.2 (outline)

We show that any relation R′ with (P[o/x], Q[o/x]) ∈ R′ is not a scope bisimulation. If R′ is a scope bisimulation, then R′ is a strong bisimulation from Definition 3.2. Then for any P[o/x]′ such that P[o/x] −α→ P[o/x]′, there exists Q[o/x]′ such that Q[o/x] −α→ Q[o/x]′ and (P[o/x]′, Q[o/x]′) ∈ R′. From replication 1 and β-conversion, for α = τ we have P[o/x]′ such that src(P[o/x]′) = {b′} ∪ src(P[o/x]) where b′ = c(u).d(v).R, snk(P[o/x]′) = snk(P[o/x]) and edge(P[o/x]′) = edge(P[o/x]) ∪ {(b′, n1)} (Figure 13a, middle). On the other hand, the only τ-transition for Q[o/x] is Q[o/x] −τ→ Q[o/x]′, where src(Q[o/x]′) = {b′} ∪ src(Q[o/x]), b′ = c(u).d(v).R, snk(Q[o/x]′) = snk(Q[o/x]) and edge(Q[o/x]′) = edge(Q[o/x]) ∪ {(b′, n1), (b′, n2)} (Figure 13b, middle), by replication 1 and β-conversion.

If R′ is a scope bisimulation, there exists Q[o/x]″ such that Q[o/x]′ −c(m)→ Q[o/x]″ and (P[o/x]″, Q[o/x]″) ∈ R′ for any P[o/x]′ −c(m)→ P[o/x]″. Let P[o/x]″ be the graph such that src(P[o/x]″) = {b″} ∪ src(P[o/x]) where b″ = d(v).R[m/u], snk(P[o/x]″) = snk(P[o/x]) and edge(P[o/x]″) = edge(P[o/x]) ∪ {(b″, n1)}, obtained by applying the input rule (Figure 13a, bottom). The only transition of Q[o/x]′ by c(m) makes src(Q[o/x]″) = {b″} ∪ src(Q[o/x]) where b″ = d(v).R[m/u], snk(Q[o/x]″) = snk(Q[o/x]) and edge(Q[o/x]″) = edge(Q[o/x]) ∪ {(b″, n1), (b″, n2)} (Figure 13b, bottom). Then (P[o/x]″, Q[o/x]″) is in R′ if R′ is a bisimulation. However, (P[o/x]″, Q[o/x]″) does not satisfy condition 3 of Definition 3.2, because P[o/x]″/n2 and Q[o/x]″/n2 (Figure 14) are not strongly bisimilar. Thus R′ cannot be a scope bisimulation.

Figure 14. P[o/x]″/n2 and Q[o/x]″/n2.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-67
Analysing gCSP Models Using Runtime and Model Analysis Algorithms

M. M. BEZEMER, M. A. GROOTHUIS and J. F. BROENINK
Control Engineering, Faculty EEMCS, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
{M.M.Bezemer, M.A.Groothuis, J.F.Broenink}@utwente.nl

Abstract. This paper presents two algorithms for analysing gCSP models in order to improve their execution performance. Designers tend to create many small separate processes for each task, which results in many (resource-intensive) context switches. The research challenge is to convert the model created from a design point of view into models which have better performance during execution, without limiting the designers in their ways of working. The first algorithm analyses the model during run-time execution in order to find static sequential execution traces that allow for optimisation. The second algorithm analyses the gCSP model for multi-core execution. It tries to find a resource-efficient placement on the available cores for the given target systems. Both algorithms are implemented in two tools and are tested. We conclude that both algorithms complement each other and that the analysis results are suitable to create optimised models.

Keywords. CSP, embedded systems, gCSP, process scheduling, traces, process transformation
Introduction

Increasingly, machines and consumer applications contain embedded systems to perform their main operations. Embedded systems are fed by signals from the outside world and use these signals to control the machine. Designing embedded systems becomes increasingly complex as the requirements grow. To aid developers with this complexity, Communicating Sequential Processes (CSP) [1, 2] can be used. The Control Engineering (CE) group has created a tool to graphically design and debug CSP models, called gCSP [3, 4]. The generated code makes use of the Communicating Threads (CT) library [5], which provides a C++ framework for the CSP language. An overview is shown in Figure 1.
Figure 1. gCSP & CT framework.

Figure 2. Locations of the analysers.
gCSP allows designers to create models and submodels from their designer’s point of view. When executing a model, it is desired to have a fast executing program that corresponds
M.M. Bezemer et al. / Analysing gCSP Models Using Runtime and Model Analysis Algorithms
to the modelled behaviour. Transformations are required to flatten the model (submodels are removed) as much as possible, representing the model in its simplest form, so that the processes can be scheduled on the available resources without too much overhead. This paper presents the results of research on the first step of model transformations: it looks for ways to analyse models in order to retrieve information about how to optimise them. The transformations themselves are not included yet.

An example of such resources are multiprocessor chips, as described in [6]. It is problematic to schedule all processes efficiently on one of these processors by hand, keeping in mind communication routes, communication costs and processor workloads. Researching the conversion of processes from a design point of view to an execution point of view is the topic of this paper.

Related Work

This section briefly describes work on algorithms to schedule processes on available resources. It finishes by explaining which algorithm was chosen as a basis for this research.

Boillat and Kropf [7] designed a distributed algorithm to map processes onto the nodes of a network. It starts by placing the processes in a randomised way; next it measures the delays and calculates a quality factor. The processes which have a bad influence on the quality factor are mapped again and the quality is determined again. The target system is a Transputer network in which the communication delays and available paths differ depending on the nodes that need to communicate, so related processes are automatically mapped close to each other by this algorithm.

Magott [8] tried to solve the optimisation evaluation by using Petri nets. First a CSP model is converted to a Petri net, which forms a kind of dependency graph, but still contains a complete description of the model.
His research added time factors to these Petri nets, and using these time factors he is able to optimise the analysed models according to his theorems.

Van Rijn describes in [9] and [10] an algorithm to parallelise model equations. He deduces a task dependency graph from the equations and uses it to schedule the equations on a Transputer network. Equations with many dependencies form heaps and should be kept together to minimise the communication costs. Furthermore, his scheduling algorithm tries to find a balanced load for the Transputers, while keeping the critical path as short as possible.

In this work, van Rijn's algorithms are used as a basis for the model analysis algorithms, mainly because the requirements of both sets of algorithms are quite similar. The CSP processes can be compared to the model equations: both kinds of entities depend on predecessors, and their results are required by other entities. The target systems both consist of a network of nodes with communication costs and other similar properties. However, van Rijn's algorithm is not perfectly suitable for this work, so it will be extended into a more CSP-dedicated algorithm.

The results of the runtime analyser can be presented in different ways. Brown and Smith [11] describe three kinds of traces: CSP, VCR and structural traces, but they indicate that even more ways of visualising traces are available. Because the runtime analyser tries to find a sequential order in the execution of a gCSP model, the results of the tool use the sequential CSP symbol '->' between the processes. This way of showing the sequential process order is quite intuitive, but for bigger models it gets (too) complex; a solution is presented in the recommendations section.

Goals of the Research

Designing and implementing algorithms to analyse gCSP models is the main goal of this research. It must be possible to see from the algorithm results how the performance of the
Figure 3. Architecture of the analyser and the executable.
execution of the model can be optimised in order to save resources. Two methods of analysis are implemented, as shown in Figure 2.

• The runtime analyser looks at the executing model; its results can be used to determine a static order of running processes.
• The model analyser looks at the models at design time and schedules the processes for a given target system.

Both analysers return different results due to their different points of view. These results can be combined and used to make an optimised version of the model, which is currently still manual labour.

Outline

First the algorithms and results of the runtime analyser are described in Section 1. Section 2 contains the algorithms and results of the model analyser. The paper ends (Section 3) with conclusions about both analysers and some recommendations.

1. Runtime Analyser

This section describes the algorithms and the implementation of the gCSP Runtime Analyser, called the runtime analyser from now on. The algorithm tries to find execution patterns by analysing a compiled gCSP program. Hilderink writes in [12] (section 3.9) about design freedom: the designer does not need to specify the process order of the complete node, but can leave it to the underlying system. The runtime analyser sits between the model and the underlying system, the CT library. It suggests static solutions for (some of) the process orders, so the underlying system gets (partly) relieved from this task, making it more efficient.

1.1. Introduction

Figure 3 shows the runtime analyser architecture with respect to the model to be analysed. In order to execute a model, it is compiled together with the CT library into an executable. This can be done by the tool, or manually if compiling requires some extra steps. The main part of the analyser consists of the implementation of the used algorithm. The executable is started after compilation and communication between the tool and the executable is established.
After the executable is started, it initialises all processes and waits for the user to either run or step through the executable. Communication is required to feed the algorithms with runtime information. Therefore two types of communication are used:

• TCP/IP communication, which is sent over a TCP/IP channel between the CT library and the Communication Handler. It was originally used for animation purposes
[4, 13], but the runtime analyser reuses it to send commands to the executable, for example to start the execution, and to receive information about the active processes and their states.
• Command line communication, consisting of the texts normally visible on the command line. It is used for debugging only; the texts are shown to the user in a log view.

The processes in a gCSP model have several states, which are used by the algorithms: new, ready, running, blocked and finished. Upon creation, a process starts in the new state. When the scheduler decides that a process is ready to be started, it is put in the ready queue and its state is changed to ready. After the scheduler decides to start that process, it enters the running state. From the running state a process is able to enter the finished or the blocked state, depending on whether the process finished or was blocked. Blocking mostly occurs when rendezvous communication is required and the other side is not ready to communicate yet. From the finished state a process can be put in the ready state again via a restart or via the next iteration of a loop. When the cause for the blocked state is removed, the process is able to resume its behaviour once it is put in the running state again.

1.2. Algorithms

The runtime analyser uses two algorithms: one to reconstruct the model tree and one to determine the order of the execution of the processes. Both algorithms run in parallel and both use the information from changes in the process states. Before the algorithms can be started, the initialisation information of the processes is required to let the algorithms know which processes are available. This initialisation information is sent upon the creation of all processes, when they enter their new state. This occurs before the execution of the model is started.

1.3. Model Tree Construction Algorithm

Model tree (re)construction is required because the executable does not contain a model tree anymore. Using the model tree from the gCSP model is difficult, because the current version of gCSP has unusable, complex internal data structures. This will be improved in a new version (gCSP2), but until then the model tree construction algorithm is required. An advantage of reconstructing the tree is the certainty that the used gCSP model tree corresponds with the model tree hidden in the code of the executable. If both model trees are not equal, an error has occurred: for example, the code generation might have generated incorrect code, or communication between the executable and the analyser has been corrupted.

Figure 4 shows the reconstructed model tree of the dual ProducerConsumer model shown in Figure 15. The numbers between parentheses indicate the number of times a process finished. This is shown for simple profiling purposes, so the user is able to see how active all processes are. Processes are added to the tree when they are started for the first time. All grouped processes, indicated with the dashed rectangles in Figure 15, appear on the same branch. In Figure 4 the label B shows a branch, which is made black for this example. The two processes on this branch, Consumer1 and Consumer2, are grouped and belong to Seq_C. The grouped processes indicate a scheduling order; in this example they have a sequential relationship, as shown by the CSP notation to the right of the Seq_C tree node.

The rule for model tree creation, but also for process ordering, is based on the internal algorithms which are used by the CT library to activate and execute the processes. When a process becomes active, it creates its underlying processes and adds them to the ready queue of the scheduler. Therefore one rule is sufficient to reconstruct the model tree, as shown in Rule 1.
[Figure 4 tree: Model > Par2 > {REP_P, REP_C}; REP_P > Seq_P > {Producer1 > P1_Seq > {P1_C, P1_Wr}, Producer2 > P2_Seq > {P2_C, P2_Wr}}; REP_C > Seq_C > {Consumer1 > C1_Seq > {C1_Rd, C1_C}, Consumer2 > C2_Seq > {C2_Rd, C2_C}}; each process is annotated with its finish count (61 or 62). The CSP notation at the right reads:]

Par2 = REP_P || REP_C
REP_P = Seq_P; REP_P
Seq_P = Producer1; Producer2
Producer1 = P1_C; P1_Wr
Producer2 = P2_C; P2_Wr
REP_C = Seq_C; REP_C
Seq_C = Consumer1; Consumer2
Consumer1 = C1_Rd; C1_C
Consumer2 = C2_Rd; C2_C
Figure 4. A reconstructed model tree of the dual ProducerConsumer model, with the CSP notation at the right.
If a process enters its ready or running state for the first time, make it a child of the last started process which is still running.
Rule 1
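Rule 1 can be sketched as follows. This is an illustrative reconstruction, not the analyser's actual code, and the event format is a hypothetical stand-in for the state changes received over the animation protocol: each newly seen process is attached to the most recently started process that is still running, as in the walkthrough below.

```python
# Illustrative sketch of Rule 1 (not the analyser's actual code): rebuild the
# model tree from a stream of (process, state) events sent by the executable.
# The (name, state) event format is a hypothetical assumption.

class Node:
    def __init__(self, name, parent=None):
        self.name, self.children = name, []
        if parent is not None:
            parent.children.append(self)

def reconstruct_tree(events):
    """events: iterable of (process_name, state) pairs, state being one of
    'ready', 'running', 'blocked' or 'finished'."""
    root = Node("Model")            # artificial root, treated as always running
    nodes = {"Model": root}
    started = ["Model"]             # processes in running-entry order
    running = {"Model"}
    for name, state in events:
        if state in ("ready", "running") and name not in nodes:
            # Rule 1: attach the new process to the last started process
            # that is still running.
            parent = next(nodes[n] for n in reversed(started) if n in running)
            nodes[name] = Node(name, parent)
        if state == "running":
            started.append(name)
            running.add(name)
        elif state in ("finished", "blocked"):
            running.discard(name)
    return nodes
```

Replaying the walkthrough events (Par2 running; REP_P and REP_C ready; then the Producer1 branch running with P1_C finishing before P1_Wr starts) puts REP_P and REP_C under Par2 and makes P1_Wr a sibling of P1_C under P1_Seq, matching Figure 4.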
In order to make the rule clearer, the creation of the shown model tree, based on the behaviour of the scheduler built into the CT library, will be discussed. First 'Model' is added as the root of the tree and marked as the last started process, even though it is not a real process. On the first step Par2 enters its running state and thus is added to 'Model'. Since Par2 is a parallel construct, both REP_P and REP_C enter their ready state and both are added to Par2, since this was the last process entering its running state. In this example REP_P enters its running state; since it is already added, nothing happens. Now Seq_P enters its running state, so it is added to REP_P, and so on, until P1_C enters its finished state and P1_Wr enters its running state. P1_Wr is not added to P1_C because that process is not running anymore; instead the last running process is P1_Seq. After P1_Wr blocks, because no rendezvous is possible yet, REP_C enters its running state; since it was already added to the tree nothing happens. Seq_C enters its running state and is added to REP_C, since that is the last running process. This keeps on going until all processes are added to the tree and are removed from the list of unused processes. When this list is empty, the reconstruction algorithm is finished. Otherwise, it may be an indication of (process) starvation behaviour or a modelling error.

1.4. Process Ordering

Like the model tree construction algorithm, the process ordering algorithm uses the changes in the process states. This time only the state change to 'finished' is used, since the required process order is the finished order of the processes, not the starting order. The result of the algorithm is a set of chains of processes showing the execution order of the executable. A chain is a sequential order of a part of the processes; it ends with one or more cross-references to other chains.
The algorithm operates in two modes: one for chains which are not complete and therefore do not yet have cross-references to other chains, and one for chains which are finished and have cross-references. A chain is finished when the same process is added to it for a second time; after it is finished, no processes can be added to it. The reason to finish chains is that they are not allowed to contain the same process twice, as this would result in a single chain which gets endlessly long, instead of a set of multiple chains showing the static running order of the processes. Therefore two sets of rules are required, one for each mode.
The combination of the created chains should be equal to the execution trace of the model. So basically the set of chains is another notation of the corresponding trace: by following the chains and their cross-references the trace can be reconstructed.

1.4.1. Used Notation

D->C->F->B->(B,D*)
[start]->A->B->C->D->(B)
Figure 5. Example notation of a set of chains.
The notation of the chains is shown in Figure 5. Each chain has a unique starting process followed by a series of processes, and is ended by one or more cross-references. The cross-references indicate which chain(s) might become active after the current chain is ended, and are shown at the end of a chain between parentheses. Asterisks behind a cross-reference are used to indicate that the chain is looped: one asterisk is added when an ending process refers to the beginning of its chain; two asterisks are added when an ending process points to a process in the middle of its chain.

During the explanation of the algorithm there always is an active chain, indicating that the current execution is somewhere in that chain. This chain is shown in bold face and starts with an asterisk; in this example notation the chain starting with B is the active one. The execution order starts at starting point '[start]'. In the example of the used notation, the chain starting with B is activated after the '[start]' chain is finished, since the cross-reference points to this chain. At the end of chain B two cross-references are available and the outcome depends on the underlying execution engine: either the current chain stays active and continues at process E, or the chain starting with D becomes active, and so on.

1.4.2. Rules for Chains with no Cross-References

These rules are responsible for creating and extending new chains. The trace shown in Figure 6 is used for the explanation; the numbers above the trace indicate positions which are referred to in the explanation.

      1  2           3
      |  |           |
A->B->C->B->D->E->F->B
Figure 6. The used trace.
Figure 7. A new chain.
If the state of process p changes to 'finished', add p to the end of the active chain.
Rule 2
Rule 2 is the most basic rule, used to create new chains. When position 1 is reached, the result is a chain with processes A, B and C added, as shown in Figure 7. At position 2 process B is finished for a second time. Since it is not allowed to add a process twice to a chain, Rules 3 and 4 are required.

If process p is finished, but is already present in the active chain, it becomes a cross-reference of this chain.
Rule 3
If process p becomes the cross-reference of a chain, the chain starting with p is looked up. If that chain is not available, a new chain starting with p will be created. The existing or newly created chain becomes the new active chain.
Rule 4
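In isolation, Rules 2-4 can be sketched as follows. This is an illustrative reconstruction (not the tool's code), and the tuple representation of a chain is our own choice:

```python
# Illustrative sketch of Rules 2-4 only: extend the active chain until a
# process repeats, then turn that process into a cross-reference and switch
# to (or create) the chain starting with it. Representation is our own.

def build_chains(trace):
    chains = {"[start]": (["[start]"], [])}   # start -> (processes, crossrefs)
    active = "[start]"
    for p in trace:
        procs, refs = chains[active]
        if p not in procs:
            procs.append(p)                   # Rule 2: extend the active chain
        else:
            refs.append(p)                    # Rule 3: p ends the chain
            if p not in chains:               # Rule 4: create or reuse chain p
                chains[p] = ([p], [])
            active = p                        # chain p becomes the active chain
    return chains

chains = build_chains("A B C B D E F".split())
```

For the trace of Figure 6 up to process F, the '[start]' chain ends with cross-reference B and a second chain starting with B collects D, E and F, matching the situation of Figures 8 and 9 just before chain B is ended.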
[start]->A->B->C->(B)
[start]->A->B->C->(B)
Figure 8. A second chain is added.
Figure 9. The complete trace converted to chains.
Rule 3 defines when a chain should be ended. When a chain that starts with process B is not found, Rule 4 states that a new chain should be added; the result is shown in Figure 8. When position 3 is reached, process B again triggers Rule 4. This time a chain starting with process B is found: the current active chain. No new chain needs to be added and the current chain is ended, as shown in Figure 9.

1.4.3. Rules for Chains with Cross-References

When the active chain is finished, and thus has one or more cross-references, the algorithm should check whether this chain was correctly created. A chain is correctly created if it is valid for the complete execution of the model; when this is not the case, the chain needs modification. In order to check whether a chain is correct, the position in the currently active chain is required. To explain these rules the trace is extended, as shown in Figure 10.

      1  2           3  4  5
      |  |           |  |  |
A->B->C->B->D->E->F->B->D->G->H
Figure 10. The extended trace.
First of all, it should be checked whether the finished process is at the end of the chain. Position 3 ends the chain starting with B shown in Figure 9. Rule 5 performs this check and finds the referenced chain; in this case it is the same chain.

If the finished process p is at the end of the chain, find the chain starting with process p and make it the active chain.
Rule 5
When the chain is not ended by the finished process, the process should match the next process in the chain; this results in Rule 6. At position 4, process D finishes. Previously process B finished, so process D is expected next. No problems arise, since the expected and the finished processes match.

If the expected process does not match the finished process p, the chain must be split.
Rule 6
At position 5 process G finishes, instead of the expected process (E), which comes after the previous process (D) in the chain. According to the rule, the active chain should be split. Splitting chains is a complicated task, since the resulting chains still need to match the previous part of the trace. For the current trace the active chain should be split after process D, since process E was still expected. Rule 7 defines the steps to be taken in such a situation.

The active chain should be split at process e when process p is unexpected and a chain starting with process e is not yet present. To split a chain:
• end the current chain before process e;
• put the processes from e onwards, together with the cross-references, in a new chain starting with process e;
• add e and p as new cross-references of the ended chain;
• create a new chain starting with process p and make it the active chain.
Rule 7
This results in a new chain starting with E, containing the remainder of the split chain. A new chain is created starting with process G and is made active. The resulting chains are
shown in Figure 11. When following the trace from the start up to position 5, the set of chains in the figure matches the given trace again. The active chain is the chain starting with process G; it is not yet finished, so the rules of the previous section apply to it.

E->F->(B)
B->D->(E,G)
A->B->C->(B)
Figure 11. Chains after splitting chain B.
1.4.4. Rule for Splitting Chains when a Matching Chain Already Exists

The steps of Rule 7 can only be used if a chain starting with process e is not yet present. If the trace is extended again, according to Figure 12, a problem arises at position 7. Until position 6 no problems occur; Figure 13 shows the resulting set of chains up to this position.

      1  2           3  4  5              6  7
      |  |           |  |  |              |  |
A->B->C->B->D->E->F->B->D->G->H->B->D->G->H->I
Figure 12. The final trace.
E->F->(B)
B->D->(E,G)
A->B->C->(B)
Figure 13. The chains up to position 6.
At position 6 process H is finished and expected, so position 7 is reached. A problem occurs at position 7: process I is finished but process B is expected, so according to Rule 6 the chain should be split. The problem is that a chain starting with process B is already present. Having two chains starting with the same process is not allowed, since it would not be deterministic which chain should be activated when a cross-reference to one of these chains is reached. However, it is clear that the processes after process H match the chain starting with B exactly. Rule 8 provides a solution for these situations.

The active chain should be split at process e when process p is unexpected, but a chain starting with process e is already present. Compare the processes from e onwards with the chain starting with process e.
• If both parts are equal: remove the remaining processes in the active chain, starting at process e; add the cross-references to the chain starting with e if they are not present at this chain; create a new chain starting with p and make it the active chain.
• If both parts are unequal, the comparison did not succeed: the static process order cannot be determined and the algorithm should stop.
Rule 8
G->H->(B,I)
E->F->(B)
B->D->(E,G)
A->B->C->(B)
Figure 14. The chains after position 7.
Figure 14 shows the result after applying the rule. Process I is active and can be added to the set of chains using the described rules. If the comparison of the rule fails, it indicates that two unequal chains starting with the same process would have to be created. This situation is not supported by the algorithm, because the execution engine of the CT library in combination with this gCSP model does not result in a static execution order and the model is non-deterministic. A simple example which has a chance to fail the comparison would be a 'one2any channel': one process writes on the channel and multiple processes are candidates to read the value. If the reading order of these candidate processes is not deterministic, Rule 8 fails.
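Taken together, Rules 2-8 can be sketched as follows. This is our own illustrative reconstruction, not the tool's implementation; the class and method names are hypothetical, and the '**' mid-chain loop marker of the notation is omitted. Fed the 'finished' events of the trace in Figure 12, it yields the chains of Figure 14.

```python
# Illustrative sketch of Rules 2-8 (our own reconstruction, not the tool's
# code). Chains are keyed by their unique starting process.

class Chain:
    def __init__(self, start):
        self.procs = [start]        # starting process first
        self.crossrefs = []         # filled in once the chain is finished
        self.finished = False

def render(chain):
    """Format a chain in the paper's notation, e.g. 'G->H->(B,I)'.
    (The '**' marker for mid-chain loops is omitted for brevity.)"""
    refs = [c + "*" if c == chain.procs[0] else c for c in chain.crossrefs]
    return "->".join(chain.procs) + "->(" + ",".join(refs) + ")"

class ChainBuilder:
    def __init__(self):
        self.chains = {"[start]": Chain("[start]")}
        self.active = self.chains["[start]"]
        self.pos = 1                # next expected index when re-walking

    def _activate(self, p):        # Rules 4 and 5: switch to chain p
        if p not in self.chains:
            self.chains[p] = Chain(p)
        self.active, self.pos = self.chains[p], 1

    def on_finished(self, p):      # called when process p enters 'finished'
        ch = self.active
        if not ch.finished:                         # chain-creation mode
            if p not in ch.procs:
                ch.procs.append(p)                  # Rule 2
            else:
                ch.crossrefs.append(p)              # Rule 3: end the chain
                ch.finished = True
                self._activate(p)                   # Rule 4
        elif self.pos >= len(ch.procs):             # at the end of the chain
            self._activate(p)                       # Rule 5
        elif ch.procs[self.pos] == p:               # Rule 6: as expected
            self.pos += 1
        else:                                       # unexpected: split
            e = ch.procs[self.pos]                  # still-expected process
            tail, refs = ch.procs[self.pos:], ch.crossrefs
            ch.procs, ch.crossrefs = ch.procs[:self.pos], [e, p]
            if e not in self.chains:                # Rule 7: move the tail
                new = Chain(e)
                new.procs, new.crossrefs, new.finished = tail, refs, True
                self.chains[e] = new
            else:                                   # Rule 8: compare and merge
                assert tail == self.chains[e].procs[:len(tail)], \
                    "no static order: conflicting chains starting with " + e
                for r in refs:
                    if r not in self.chains[e].crossrefs:
                        self.chains[e].crossrefs.append(r)
            self._activate(p)                       # new active chain p

builder = ChainBuilder()
for proc in "A B C B D E F B D G H B D G H I".split():
    builder.on_finished(proc)
```

With this sketch, render(builder.chains["G"]) gives 'G->H->(B,I)' and render(builder.chains["B"]) gives 'B->D->(E,G)', matching Figure 14.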
1.5. Results

This section describes the results of the analyser. First a functional test is performed; next, a scalability test is performed to see whether the analyser still functions when a bigger real-life model is used.

1.5.1. Functional Test

As a functional test a simple producer consumer model is created, as shown in Figure 15. The Producer and Consumer submodels are shown below the main model.
Figure 15. The dual ProducerConsumer model.
The model is loaded in the analyser, which automatically generates code using gCSP, compiles it and creates the executable. Figure 16 shows the result of the analyser.

P1_C->C1_Rd->C1_C->P1_Wr->P2_C->P2_Wr->C2_Rd->C2_C->(P1_C*)
[start]->(P1_C)
Figure 16. The process chains for the dual ProducerConsumer model.
One big chain is created and the [start] chain is empty, indicating that no startup behaviour is present. Keeping in mind that it shows the finished order of the processes, it can be verified that it corresponds with the model. First P1_C finishes. Next P1_Wr blocks, since no reader is active yet, so C1_Rd becomes active and reads the value from the channel. After C1_C finishes, C2_Rd blocks since P2_Wr is not active yet. P1_Wr finally can finish as well; this can be continued for the other Producer-Consumer pair.

The analysis of the model results in a single chain; in order to compare the results of the optimised model against the results of the original model, at least two chains are required. Therefore the model of Figure 15 is reduced to a single Producer-Consumer pair, by removing P2 and C2. The results of the analysis of this new model are shown in Figure 17.

P1_Wr->P1_C->C1_Rd->C1_C->(C1_Rd)
C1_Rd->C1_C->P1_Wr->P1_C->(P1_Wr)
[start]->(C1_Rd)
Figure 17. The process chains for the single ProducerConsumer model.
The chains are about half the size of the original model's chain, which is as expected, but the extra chain is new. The difference between both chains is that one chain first writes to the
channel and then reads, while the other chain first reads and then writes. When the chain of the original model is closely examined, this behaviour is also visible; it is just less noticeable, since two ProducerConsumer pairs exhibit this behaviour. The analysis result now has two chains, so a simplified version can be created, which is shown in Figure 18. It has two processes, one for each chain, both containing a code block representing a chain. The repetitions are replaced by infinite while-loops in the code blocks, resulting in a model which is as optimal as possible.
Figure 18. The simplified ProducerConsumer model, rebuilt using the runtime analyser results.
The original gCSP model needs quite some remodelling in order to be usable for time measurements, as can be seen in Figure 19. Mainly this is because of the lack of 'init' and 'finish' events for code blocks, so other code blocks are required to simulate these events. Therefore the results will not be exactly accurate, since extra context switches are introduced by the measurement system. Errors also occur due to the differences between the modified models and their originals, but these errors are the same for both models, so for comparison purposes they do not matter.

DoubletoBooleanConversion->Sc_Wr17->Sa_Wr8->Sa_Rd4->MC_Wr4->Sa_Rd7->MC_Wr7
->Sa_Rd6->MC_Wr6->Sa_Rd5->MC_Wr5->Sa_Rd_ESX2_2->Sa_Rd_ESX2_1->Sa_Rd_ESX1_2
->Sa_Rd_ESX1_1->MC_Rd12->MC_Rd13->Safety_X->Sa_Rd_ESY1->Sa_Rd_ESY2->Safety_Y
->Safety_Z->MC_Rd1->MS_Wr1->MC_Rd2->MS_Wr2->Sa_Wr9->Sc_Rd9->MC_Rd3
->LongtoDoubleConversion->Controller->MS_Wr3->Sc_Rd10->Sa_Wr10->(Sc_Rd11)
Sc_Rd11->DoubletoShortConversion->Sc_Wr14->Sc_Wr15->Sc_Wr16
->Sa_Wr11->(Sc_Rd8, HPGLParser)
MC_Rd12->MC_Rd13->Sa_Rd_ESX2_2->Sa_Rd_ESX2_1->Sa_Rd_ESX1_2->Sa_Rd_ESX1_1->MC_Rd1
->MS_Wr1->Safety_X->Sa_Rd_ESY1->Sa_Rd_ESY2->Safety_Y->Safety_Z->MC_Rd2->MS_Wr2
->MC_Rd3->LongtoDoubleConversion->Controller->MS_Wr3->Sa_Wr9->Sc_Rd9->Sc_Rd10
->Sa_Wr10->(HPGLParser)
HPGLParser->(MC_Rd12, Sc_Rd11)
[start]->MC_Rd12->MC_Rd13->HPGLParser->MS_Wr1->MC_Rd1->MC_Rd2->MS_Wr2->MC_Rd3
->LongtoDoubleConversion->Controller->MS_Wr3->MC_Wr5->Sa_Rd5->MC_Wr6->Sa_Rd6->MC_Wr7
->Sa_Rd7->MC_Wr4->Sa_Rd4->(HPGLParser)
Figure 22. Result of the analyser for the plotter controller.
This shows that the analyser also works for big, real controller models. It is practically impossible to validate the results by hand. From '[start]' to 'Sa_Rd4' the order seems reasonable. After that part some optimisation effects start to play a role and 'HPGLParser' finishes for the second time, even before the rest of the model has finished. These optimisation effects are the result of a channel optimisation in the CT library (described in [12], section 5.5.1). It only allows context switches when really required; for example, after reading a result from the channel, the group containing the reading process stays active and continues with the execution, as opposed to the corresponding writer becoming active again, which would make the flow more natural. The reader might try to read data from the channel again at some point, but now the channel will be empty, since the writer did not become active yet. Now the reader blocks and the writing process is activated again. This results in analysis results that are hard to explain. In the long run every process is called as often as the other processes, which is as expected. To see whether the results are indeed valid, the set of chains should be used to create an optimised model, to check whether the model still runs as expected.

Determining which chain is started when is not possible from the analysis results themselves. Using the stepping option of the tool, it becomes possible to manually determine that the order of chains is as shown in Figure 23. The last 'HPGLParser' references back to the first one, completing the loop. This complex order is also a result of the channel optimisations.

[start]->HPGLParser->MC_Rd12->HPGLParser->Sc_Rd11->Sc_Rd8->Sc_Rd11->(HPGLParser*)
Figure 23. Execution order of the chains.
1.6. Discussion

First of all, the runtime analyser seems to work as expected, being able to analyse most (precompiled) models. Two known problems are:

• The use of One2Any or Any2Any channels with three or more processes connected. Models which use these types of channels might fail during analysis, because of the lack of deterministic behaviour of such constructs. In these cases the tool will give a warning that it is not possible to continue analysing.
• Models using recursion blocks or alternative channels. This may lead to processes which are not used during the analysis, because they might depend on (combinations of) external events which might not be available during analysis.
The rules described in section 1.4 are sufficient for deterministic models, and for simple non-deterministic models as well. However, the rules are not proven to be complete, since they are defined by comparing the results of relevant tests with the expected results. When new (non-deterministic) situations are analysed, it is possible that new rules are required.

The bigger the models become, the harder the results are to interpret. In the plotter example, it is clear that certain sets of processes occur multiple times, like:

Safety_X->Sa_Rd_ESY1->Sa_Rd_ESY2->Safety_Y

In order to make the results easier to interpret, such sets could be replaced by a group symbol, so the groups can be used multiple times.

Currently the results of the analysis are based on a single-thread scheduler. When multi-core systems are used and the scheduler of the CT library is able to schedule multiple threads simultaneously, the analysis results are undefined. The processes will be running truly in parallel and the received state changes do not represent one sequential order, but multiple sequential orders. To solve this problem the received state changes should be accompanied by thread information, so the multiple sequential orders can be separated. The analysis algorithms can then construct the chains for each thread separately using this extra information.

Another usage possibility of the current runtime analyser tool would be post-game analysis [16]. The executable is executed on a target system first and the trace is stored in a log. A small tool could be created to replay the logged trace by using the animation protocol to connect to the runtime analyser tool. The analyser would not be able to notice the difference, and it becomes possible to do analysis for models which run on their target system.

Even though the results are likely to become very complex, as became clear with the scalability test, usable information can still be obtained.
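The suggested per-thread separation could be sketched as follows. This is a hypothetical illustration, not part of the tool; the (thread_id, process, state) event format is our own assumption:

```python
# Hypothetical sketch: tag each state change with a thread id and route it to
# a per-thread event list, so every thread yields its own sequential order.
# The event format is an assumption, not the CT library's actual protocol.
from collections import defaultdict

def split_by_thread(events):
    """events: iterable of (thread_id, process, state) tuples."""
    per_thread = defaultdict(list)
    for tid, proc, state in events:
        per_thread[tid].append((proc, state))
    # each per_thread[tid] can now be fed to its own chain-building analyser
    return dict(per_thread)
```

Each resulting per-thread event list can then be analysed with the single-threaded rules of section 1.4.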
Processes which are unused will not become part of the model tree; this information can be used as a coverage test. Of course these processes might become active depending on external inputs: a safety checking process, for example, might incorporate processes that only run when unsafe situations occur, and such a situation might never occur while just analysing a model. It might also indicate that an alternative channel is behaving unexpectedly or that a recursion is looping infinitely. We can however use this information to detect processes which should be reviewed.

A more complex usage of the results is to actually implement them. By using the chains it becomes possible to create big processes containing the processes of each chain. By replacing channels which are used internally in the new processes with variables, the processes will not block and they become even simpler. Most processes are control related; these processes mostly contain no (finite) state machines, or only very small ones, so combining them is not likely to give state-space-explosion related problems. These steps result in an optimised model which requires fewer system resources and runs faster, as shown in section 1.5.1.

2. Model Analyser

This section describes the developed model analyser tool. After a short introduction, an explanation of the algorithm is given. Next, the testing and results are given using the same models as the ones used for the runtime analyser in section 1. This section ends with conclusions about the model analyser and its usability.

2.1. Introduction

The model analyser consists of several algorithm blocks connected to each other, shown in Figure 24. For each block, the corresponding section describing it is shown within the parentheses.
[Figure 24 shows the algorithm pipeline: gCSP model → Model Tree Creator (2.2.1) → Dependency Graph Creator (2.2.2) → Critical Path Creator (2.2.3) → Heap Scheduler (2.2.4) → Core Scheduler (2.2.5) → Analysis results, with the user interface attached to the algorithm steps.]
Figure 24. Algorithm steps of the model analyser.
The analysis results are calculated using the gCSP model information combined with extra data from the user interface. The following user data is used by the algorithm:

• Number of cores: the number of cores available to schedule the processes on.
• Core speed: the relative speed of a core.
• Channel setup time: the time it takes to open a channel from one core to another.
• Communication time: the time it takes to communicate a value over the channel.
• Process weight: the time a process needs to finish after it was started.
The results of the separate algorithm steps are made visible in the user interface. All algorithm steps built into the model analyser are automatic; the user only needs to open a gCSP model and wait for the analysis results. The analyser offers the possibility to step through the algorithm steps, in order to find out why certain decisions are made: for example why a certain heap is placed on a certain core, or why several processes are grouped on the same heap.

2.2. Algorithms

The algorithm created by van Rijn [9] is used, as stated before. The important parts are based on a set of rules that can be extended, in order to add new functionality to the analyser. Originally it was created for model equations running on a Transputer network. Our improved algorithm is usable for scheduling CSP processes on available processor cores or distributed system nodes.

Scheduling the processes on the available cores is too complex for a single algorithm step; therefore van Rijn introduced heaps, which are created by the heap scheduler. Heaps are groups of closely related processes which should be scheduled onto the same core. A heap can be handled as one process, so the heap scheduler reduces the number of elements to be analysed by the core scheduler and lightens its task. Van Rijn noticed a disadvantage in the use of heaps: they might have many dependencies on other heaps, since the process dependencies become heap dependencies after the processes are placed onto the heaps. Heaps might even have circular dependencies with other heaps, which cannot be resolved by the core scheduler. To solve this problem the processes on the heaps are regrouped using index blocks, which start at processes that have an external incoming dependency. The core scheduler is able to schedule these index blocks, since they resolve the circular dependencies.

The algorithm blocks, shown in Figure 24, are independent blocks chained together.
So it is fairly easy to extend these blocks or to add a new one in between, without the need to rewrite the surrounding blocks, assuming of course that the data sent to the existing blocks stays compatible. This data is also used by the user interface to update its views after each step. Unless stated otherwise, the model of Figure 25 is used to create the results shown in the following sections while explaining the behaviour of the blocks. It is the same as Figure 15, but all submodels are exploded to make all dependencies clearer.

2.2.1. Creation of the Model Tree

The model tree is only for informational use and is shown in the user interface of the tool; the algorithm itself does not use it. It is different from the real model trees as used in gCSP: this
Figure 25. Exploded view of the dual ProducerConsumer model.
tree only shows elements which influence the analysis results. This algorithm step is simple:

• it loads the selected model,
• it builds the tree by recursively walking through all submodels to add the available parts,
• and it sends the loaded model to the dependency graph creator.

2.2.2. Dependency Graph Creator

The dependency graph is used extensively by the algorithm blocks to find and make use of the dependencies of the processes. The dependency graph creator walks through the model recursively: a first time to add the vertices, which represent the available processes, and a second time to add the edges, which represent the dependencies. The first step, adding vertices, is done by finding suitable processes: code blocks, readers and writers. For the second step, adding edges, sequential relations are used: rendezvous channels and sequential compositions. Figure 26 shows an example dependency graph. The edges are created using the sequential relationships, except for the dependencies between the writers and readers; those two dependencies are derived from the corresponding two channels. The graphical representation also uses the predecessors and the successors to place the vertices in a partially ordered way: in the direction of time flow (i.e. the vertical direction in the figure).
Figure 26. Dependency graph of the ProducerConsumer model.
Figure 27. Critical path of the ProducerConsumer model.
2.2.3. Critical Path Creator

A critical path is required for the heap scheduler (see Figure 24) to schedule the heaps efficiently. Finding this critical path is easy: first the highest cumulative weight for each vertex is determined, by following paths from the start vertex to the end vertex and storing the highest calculated weight for each vertex. Secondly, the path containing the highest calculated weights is determined to be the critical path. Figure 27 shows the critical path with the thick line; the values in the vertices are the highest cumulative weights of the vertices.

2.2.4. Heap Scheduler

As described earlier, the heap scheduler groups processes onto heaps, which can be seen as groups of closely related processes, and thereby lightens the work of the core scheduler. Each process on a heap has a start time, set to indicate when the heap needs to start the process. Using this start time and the process weight data, a heap end time can be calculated. Since all processes scheduled on heaps have these values, a heap can be seen as a sequentially scheduled group of processes.

The heap scheduling algorithm is complex and comprehensive; the complete algorithm and its rules are therefore explained in [17]. In general, the algorithm puts the vertices of the critical path in heap 1. The remaining vertices are grouped in sequential chains and assigned to their own heaps. The start time assigned to each process placed on a heap is the end time of the previous process. As described before, the end time of a process is calculated by adding the weight of the process to its start time, and can be used as the start time of the next vertex on the heap. Sometimes copying a vertex and its predecessors onto multiple heaps might be cheaper than communicating the result from one heap to another, so several heaps might contain the same vertices, and the start and end times might vary for a process placed on different heaps.
Having processes scheduled in multiple locations does not give any problems in general, because a process receives some values, performs a calculation of some sort, and sends the results to another process. The communication takes place using rendezvous channels; this mechanism makes sure that processes receive the correct information, and it does not matter whether a value is calculated multiple times because that is faster. An exception is formed by processes which have IO-pins: duplicating these kinds of processes is not possible, since in general only one set of IO-pins is available.

The result of the example can be found in Figure 28. At the left the dependency graph is shown; the colours of the vertices correspond with the heaps they have been placed on. The texts inside the vertices represent the heap number, the start time and the end time. At the right, bars are visible which represent the heaps, with the vertices placed on them in the determined chronological order.
(a) Dependency graph  (b) Heap timing
Figure 28. Results of heap scheduler.
The figure shows that ‘P1 C’ and ‘P1 Wr’ are scheduled on both heaps, which is understandable since the communication weight is set by default to 25 and the weight of a vertex to 10. The two vertices have a combined weight of 20, so putting them on both heaps saves a weight value of 5, resulting in less workload for the cores.

2.2.5. Core Scheduler

The last algorithm block is the core scheduler. This block schedules the heaps onto a number of available cores, which can be seen as independent units, like real cores of a processor or nodes in a distributed system. The scheduler tries to find an optimum, depending on the number of available cores and their relative speeds. Like the heap scheduler, the core scheduler is also very complex and comprehensive, so only the basics are explained here.

The algorithm is divided in two parts: an index block creator and an index block scheduler. Index blocks are groups of vertices within heaps which are easier to schedule than the heaps themselves. For example, if two heaps depend on each other they cannot be scheduled according to the scheduling rules, since a heap is scheduled only after all heaps it depends on are scheduled. Therefore index blocks are defined in such a way that this scheduling problem and others do not occur. The actual scheduler consists of many rules to arrive at a single optimal scheduling result for each index block. A distinction is made between heaps which are part of the critical path and the other heaps, to make sure that the critical path is scheduled optimally and the end time of this path does not become unnecessarily long. The overall rules are designed to keep the scheduled length, and thus the ending time, as short as possible.

Figure 29 shows the results of the core scheduler in the same way as the heap results are shown. Three cores, named c1, c2 and c3, were made available. The core names are visible in the dependency graph, with the vertex timing behind them.
The bars view also shows the core names, with the vertices placed in chronological order. The figure shows that core c3 is unused, which is no surprise since only two heaps are available, both placed completely on their own core.
(a) Dependency graph  (b) Core timing
Figure 29. Results of core scheduler.
2.3. Results

This section describes the results of the analyser. The setup is the same as for the runtime analyser, described in section 1.5.

2.3.1. Functional Test

As a functional test the dual producer–consumer model is used again, as shown in Figures 15 and 25, but now the number of cores is reduced to one. The result can be seen in Figure 30.
(a) Dependency graph  (b) Heap timing
Figure 30. Results of the core scheduler with one available core.
Both heaps are now scheduled on the same core. The following constraints should be met in order to pass the functional test:

• ‘P1 C’ and ‘P1 Wr’ are available in both heaps, but of course they should be scheduled only once.
• The timing of all processes should be correct: it is not allowed to have two processes with overlapping timing estimates.
• The dependencies of the processes should be honoured: a process should be scheduled after its predecessors, even when those predecessors were scheduled on different heaps.

All of these constraints are met when looking at the figure. In fact, these constraints remain met when the available settings are varied further, for example the number of cores, the communication weight or the weight of the processes, which is not shown here. In [17] more functional tests are performed, and it concludes that the analyser is functional.

2.3.2. Scalability Test

The plotter model is analysed and its results are shown in Figure 31. It has a long sequential path, which is expected since the model was designed to be sequential: first sensor information is read, then it gets processed, checked for safety, and finally the motors are controlled. The model even has sequential paths for the calculations for the X, Y and Z directions of the plotter. The scalability test shows that the analyser indeed works for bigger models: a long sequential path does not give any problems and the analyser is still able to run the algorithms. When looking at the figure, it becomes clear that the cores are scheduled as optimally as possible, meaning that the end time of the schedule is as low as possible for the current configuration. This is visible from the fact that the gaps in the schedule of the first core are smaller than the combined weight of the processes scheduled on the other cores beneath the gaps: these processes would not fit in the gaps.
In order to make the gaps smaller, or even to get rid of them, the communication delays must be made smaller. From the analysis results it can be concluded that the model should be made more parallel by design when it needs to run on a multi-core target system. Currently it is almost completely scheduled onto a single core, mainly due to the number of dependencies between the processes, resulting in the long critical path. Parallelising the X, Y and Z paths in the model, so that the scheduler schedules these paths onto their own cores, results in a model which is more suitable to run on the target. Another solution is creating a pipeline for the four tasks and putting each step of this pipeline on its own core.
(a) Dependency graph  (b) Core timing  (c) Heap timing
Figure 31. Plotter results.
2.4. Discussion

As seen in the results section, the analyser provides useful information about a model. It allows us to see whether a model is suitable to be scheduled efficiently on multiple cores, how the timing is estimated, and which processes are the bottleneck depending on the target system. However, a couple of things need to be improved:

• The communication costs have only one value, which is too simple. A delay between the predecessor and the process is added, but in a more realistic situation this is not sufficient. In such situations the delay values should be different for each combination of cores, for example depending on the type of network between the cores or the speed of the cores.
• The communication delays are also made too generic. When defining the heaps it is not yet known whether communication is going to take place or not. Therefore the default communication costs should be removed from the heap scheduler and added to the core scheduler. This should result in lower communication costs, since they are then only counted where communication actually takes place.
• Communication between cores always has the same costs. In real situations cores might be on the same processor or on different nodes in a distributed system, so the communication costs should be able to differ depending on which cores are communicating.
• The heap scheduler needs to be improved to produce more balanced heaps. Currently heaps tend to be very long, mainly because the communication costs are too dominant. To prevent these dominant communication costs, some experiments are required in order to get (relative) values for average communication costs and process weights. These values can then be used as default values for the analyser, so the analysis will give better results. Big heaps are hard to schedule optimally, since a heap is placed completely on one core.
• Processes might be node dependent. This is the case, for example, for processes which depend on external I/O, which is probably wired to one node, so the process must be present on that particular node. For these processes, a possibility to force them to be scheduled on a particular node would be useful. More information about process–core affinity is given by Sunter [16], section 3.3.

More improvements and recommendations are available in [17]. Besides these recommendations, the algorithm could be improved with additional functionality: a lot of results are presented for which new applications can be developed. One of these applications is a deadlock analyser, which needs information about the dependencies of processes to find situations which might become a problem. An example of a problematic situation is a circular path in the dependency graph: such a path indicates that the processes keep waiting on each other. Circular paths also have a bad influence on the algorithm and are already handled internally. Adding some extra rules to the internal problem handling code could be enough to detect deadlocks during analysis. When building a model designer application (like gCSP) this method could be used to check for deadlocks during the design of a model.

Another improvement would be a new algorithm which is able to bundle channels into a multiplexed channel. This would result in fewer context switches, since multiple writers and readers are combined into a single multiplex writer and reader. After the heap scheduler has finished, it is known which processes will be grouped together. Parallel channels between these groups could probably be combined into a single channel which transfers the data in a single step.
3. Conclusions and Recommendations

Both algorithms can be used to analyse gCSP models in order to make the execution of gCSP models more efficient, each from a different point of view and with different results. The runtime analyser returns a set of static chains, which can be used to build a new gCSP model. This model implements each chain as one process, which saves context switches and thus system resources. The model analyser presents schedules for gCSP models, keeping a given system architecture in mind. The system load, available resources and communication costs are kept as optimal as possible.

However, some work on refining both algorithms must be done. The runtime analyser results appear hard to interpret, so a better way of representing them is required. As described, grouping sets of processes which occur more than once might decrease the lengths of the chains, making the set of chains more understandable. Automatically creating a new optimised model from the runtime analyser results would be the most appropriate solution. As described in section 1.6, support for analysing multi-core applications is required as well, especially when the CT library gets updated to support multiple parallel threads. That section also describes state-space-explosion related problems as unlikely when combining multiple processes into one; we need to look further into this statement in order to see whether this is indeed the case.

The model analyser lacks a lot of functionality to make it a usable tool. As described in section 2.4, things like improved communication costs, a new channel multiplexing algorithm and deadlock checking could help to mature the tool and make it part of our tool chain.

After both tools become more usable, it is possible to make them part of our tool chain. This can be done by integrating them into gCSP2, which also needs more work first, in order to analyse the models while designing them.
With the possibility of design-time checks, new features can be implemented as well, like multi-node code generation: producing multiple sets of code results in multiple executables, one for each node. For this, the results of the model analyser need to be included with the code generated by gCSP2; implementing this is not trivial and needs more structured work.

From the designer’s point of view, having lots of small processes for each task continues to be a good way of designing models. The results of both algorithms are useful for converting these models to an execution point of view.

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful and constructive feedback. Without this feedback the paper would not have been ready for publication as it is now.

References

[1] C.A.R. Hoare. Communicating Sequential Processes. Prentice Hall International, 1985. ISBN 0-13-153289-8.
[2] A.W. Roscoe, C.A.R. Hoare, and R. Bird. The Theory and Practice of Concurrency. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1997. ISBN 0136744095.
[3] D.S. Jovanovic, B. Orlic, G.K. Liet, and J.F. Broenink. Graphical Tool for Designing CSP Systems. In Communicating Process Architectures 2004, pages 233–252, September 2004. ISBN 1-58603-458-8.
[4] T.T.J. van der Steen, M.A. Groothuis, and J.F. Broenink. Designing Animation Facilities for gCSP. In Communicating Process Architectures 2008, volume 66 of Concurrent Systems Engineering Series, page 447, Amsterdam, September 2008. IOS Press. ISBN 978-1-58603-907-3. doi: 10.3233/978-1-58603-907-3-447.
[5] G.H. Hilderink, J.F. Broenink, and A. Bakkers. Communicating Java Threads. In Parallel Programming and Java, Proceedings of WoTUG 20, volume 50, pages 48–76, University of Twente, Netherlands, 1997. IOS Press, Netherlands.
[6] D. May. Communicating process architecture for multicores. In A.A. McEwan, W. Ifill, and P.H. Welch, editors, Communicating Process Architectures 2007, pages 21–32, July 2007. ISBN 978-1586037673.
[7] J.E. Boillat and P.G. Kropf. A fast distributed mapping algorithm. In CONPAR 90: Proceedings of the joint international conference on Vector and parallel processing, pages 405–416, New York, NY, USA, 1990. Springer-Verlag New York, Inc. ISBN 0-387-53065-7.
[8] J. Magott. Performance evaluation of communicating sequential processes (CSP) using Petri nets. Computers and Digital Techniques, IEE Proceedings E, 139(3):237–241, May 1992. ISSN 0143-7062.
[9] L.C.J. van Rijn. Parallelization of model equations for the modeling and simulation package CAMAS. Master’s thesis, Control Engineering, University of Twente, November 1990.
[10] K.C.J. Wijbrans, L.C.J. van Rijn, and J.F. Broenink. Parallelization of Simulation Models Using the HSDE Method. In E. Mosekilde, editor, Proceedings of the European Simulation Multiconference, June 1991.
[11] N.C. Brown and M.L. Smith. Representation and Implementation of CSP and VCR Traces. In Communicating Process Architectures 2008, volume 66 of Concurrent Systems Engineering Series, pages 329–345, Amsterdam, September 2008. IOS Press. ISBN 978-1-58603-907-3.
[12] G.H. Hilderink. Managing complexity of control software through concurrency. PhD thesis, Control Engineering, University of Twente, May 2005.
[13] T.T.J. van der Steen. Design of animation and debug facilities for gCSP. Master’s thesis, Control Engineering, University of Twente, June 2008. URL http://purl.org/utwente/e58120.
[14] M.A. Groothuis, A.S. Damstra, and J.F. Broenink. Virtual Prototyping through Co-simulation of a Cartesian Plotter. In Emerging Technologies and Factory Automation, 2008. ETFA 2008. IEEE International Conference on, number 08HT8968C, pages 697–700. IEEE Industrial Electronics Society, September 2008. ISBN 978-1-4244-1505-2. doi: 10.1109/etfa.2008.4638472.
[15] Controllab Products. 20-sim website, 2009.
[16] J.P.E. Sunter. Allocation, Scheduling & Interfacing in Real-Time Parallel Control Systems. PhD thesis, Control Engineering, University of Twente, 1994.
[17] M.M. Bezemer. Analysing gCSP models using runtime and model analysis algorithms. Master’s thesis, Control Engineering, University of Twente, November 2008. URL http://purl.org/utwente/e58499.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-89
Relating and Visualising CSP, VCR and Structural Traces

Neil C.C. BROWN (School of Computing, University of Kent, Canterbury, Kent, CT2 7NF, UK, [email protected]) and Marc L. SMITH (Computer Science Department, Vassar College, Poughkeepsie, New York 12604, USA, [email protected])

Abstract. As well as being a useful tool for formal reasoning, a trace can provide insight into a concurrent program’s behaviour, especially for the purposes of run-time analysis and debugging. Long-running programs tend to produce large traces which can be difficult to comprehend and visualise. We examine the relationship between three types of traces (CSP, VCR and Structural), establish an ordering and describe methods for conversion between the trace types. Structural traces preserve the structure of composition and reveal the repetition of individual processes, and are thus well-suited to visualisation. We introduce the Starving Philosophers to motivate the value of structural traces for reasoning about behaviour not easily predicted from a program’s specification. A remaining challenge is to integrate structural traces into a more formal setting, such as the Unifying Theories of Programming – however, structural traces do provide a useful framework for analysing large systems.

Keywords. traces, CSP, VCR, structural traces, visualisation
Introduction Hoare and Roscoe’s Communicating Sequential Processes (CSP) [1,2] is most well-known as a process calculus (or process algebra) for describing and reasoning about concurrent systems. As its name suggests, a concurrent system may be viewed as a composition of individual, communicating sequential processes. One of Hoare and Roscoe’s big ideas was that one may reason about a computation by reasoning about its history of observable events. This history must be recorded by someone, and so CSP introduces the notion of an observer to perform this task. Thus, CSP introduced traces as a means of representing a process’s computational history. The idea of observable events in concurrent computation is an extension of the idea from computability theory that one may observe only the input/output behaviour of processes when reasoning about properties of that process (e.g., will a given process halt on its input?). Communication (between two processes, or between a process and its environment) is another form of input/output behaviour, and therefore observable events are a reasonable basis for recording traces of computations. In mathematics, algebra is the study of equations, and calculus is the study of change. How do these areas of mathematics relate to processes? It follows that a process algebra provides a means for describing processes as equations, and a process calculus provides a means for reasoning about changes in a system as computation proceeds. CSP provides a rich process algebra that is used for process specification, and a process calculus based on possible traces of a process’s specification to help prove properties of a concurrent system. Cast in terms of change, a trace represents a system’s computational progress, or the changes in the state of a system during its computation. Modern CSP [2] considers more than just traces in
its process calculus (i.e., possible failures, divergences, and refinements of a process), though in this paper we are concerned with investigating what further benefits traces can provide, not in the context of specification, but in the context of debugging large systems. The remainder of this introduction discusses the three different types of traces, a motivating example of distributed systems and the use of traces in debugging.

CSP, VCR and Structural Traces

In previous work [3], the authors extended the notion of what it means to record a trace, and described how the shapes of these differently recorded traces extend the original definition of a CSP trace. In this section, we briefly summarize this previous work.

CSP traces are sequentially interleaved representations of concurrent computations. One way to characterize a CSP trace is that it abstracts away time and space from a concurrent computation. Due to interleaving in the trace, one cannot in general see a followed by b in the trace and know from the trace alone whether event a actually occurred long before event b, or whether they were observed at the same time and ordered arbitrarily in the trace. Likewise, if process P and process Q are both capable of engaging in event a, one cannot tell from the trace alone whether it was process P or process Q that engaged in event a, or whether a represents a synchronizing event for processes P and Q.

VCR traces support View-Centric Reasoning [4], provide for multiple observers, and a means for representing multiple views of a computation. Rather than a sequence of events, a VCR trace is a sequence of event multisets, where the multisets represent independent events. Independent events are events that may be nondeterministically interleaved by the CSP observer. When new events are observed for a VCR trace, they are added to a new multiset iff a dependence relationship can be inferred between the new event and the events in the current multiset [3].
If this dependency cannot be inferred, the event is added to the current multiset in the trace. In other words, a VCR trace preserves some of the timing information of events, but still abstracts away space from a concurrent computation (i.e., what process engaged in each of the events). A VCR trace permits reasoning about all the possible interleavings of a particular computation, but does not preserve which process engaged in which event.

Structural traces provide not only for multiple observers, but a means for associating those observers with the processes themselves! As the name suggests, structural traces reflect the structure of a concurrent system's composition. That is, structural traces abstract away neither time nor space. A parallel composition of CSP processes is recorded as a parallel composition of the respective traces in the structural trace. Each process has a local observer who records a corresponding trace, for example the trace of a process P. If P contains a parallel composition of processes Q and R, then after both Q and R have completed, P's observer records the traces of Q and R as a parallel composition in P's trace. Besides the extra structure, the primary difference between structural traces and other types is that structural traces record an event synchronisation multiple times – once for each participant.

The grammar of the three different trace types discussed above is given in Figure 1. As an example, given the CSP system:

(a → SKIP ||| b → SKIP) o9 (c → SKIP ||{c} c → SKIP)
The possible traces are shown below:

CSP: ⟨a, b, c⟩ or ⟨b, a, c⟩
VCR: ⟨{a, b}, {c}⟩
Structural: (a || b) → (c || c)

Note that the synchronising event c occurs twice in the parallel composition in the structural trace.
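The rule above for appending observed events to a VCR trace can be sketched concretely. The following Python sketch is an illustration of ours, not the CHP implementation; the dependence test is supplied as a predicate, since how dependence is inferred is implementation-specific. A new multiset is started iff the new event depends on some event in the current multiset:

```python
def append_event(vcr_trace, event, depends_on):
    """Append `event` to a VCR trace, represented as a list of multisets
    (lists). A new multiset is started iff a dependence on some event in
    the current (last) multiset is inferred; otherwise the event joins
    the current multiset."""
    if not vcr_trace or any(depends_on(event, e) for e in vcr_trace[-1]):
        vcr_trace.append([event])    # first event, or dependence inferred
    else:
        vcr_trace[-1].append(event)  # independent: extend current multiset
    return vcr_trace

# Toy dependence relation: events on the same channel depend on each other.
dep = lambda x, y: x[0] == y[0]

trace = []
for ev in ["a1", "b1", "a2", "c1"]:
    append_event(trace, ev, dep)
print(trace)  # [['a1', 'b1'], ['a2', 'c1']]
```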
CSPTRACE ::= ⟨⟩ | ⟨EVENT (, EVENT)∗⟩   (1)
VCRTRACE ::= ⟨⟩ | ⟨EVENTBAG (, EVENTBAG)∗⟩   (2)
EVENTBAG ::= {EVENT (, EVENT)∗}   (3)
STRUCTURALTRACE ::= () | SEQ   (4)
SEQ ::= ((EVENT | PAR) (→ SEQ)?) | (NATURAL ∗ SEQ)   (5)
PAR ::= (SEQ || SEQ (|| SEQ)∗)   (6)

Figure 1. The grammar for CSP traces (line 1), VCR traces (lines 2–3) and structural traces (lines 4–6). Literals are underlined. A superscript ‘∗’ represents repetition zero or more times, and a superscript ‘?’ represents repetition zero or one time (i.e. an optional item).
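The structural-trace grammar (lines 4–6 of Figure 1) maps naturally onto an algebraic datatype. A minimal Python sketch of such a representation, with a pretty-printer producing the notation used in this paper (our own illustration, not the CHP representation; the NATURAL ∗ SEQ repetition form is omitted for brevity):

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class Ev:            # an EVENT occurrence
    name: str

@dataclass
class Par:           # PAR: two or more SEQs composed in parallel
    branches: List["Seq"]

@dataclass
class Seq:           # SEQ: an event or PAR, optionally followed by more SEQ
    head: Union[Ev, Par]
    rest: Optional["Seq"] = None

def show(t) -> str:
    """Render a trace in the paper's notation, e.g. (a || b) -> (c || c)."""
    if isinstance(t, Ev):
        return t.name
    if isinstance(t, Par):
        return "(" + " || ".join(show(b) for b in t.branches) + ")"
    s = show(t.head)
    return s if t.rest is None else s + " -> " + show(t.rest)

# The example trace from the text: (a || b) -> (c || c)
t = Seq(Par([Seq(Ev("a")), Seq(Ev("b"))]),
        Seq(Par([Seq(Ev("c")), Seq(Ev("c"))])))
print(show(t))  # (a || b) -> (c || c)
```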
Distributed Systems

We can consider the differences between the three trace types with a physical analogy. Physics tells us that light has a finite speed. We can imagine a galaxy of planets, light-years apart, observing each other's visible transmissions. CSP calls for an omniscient, Olympian observer that is able to observe all events in the galaxy and order them correctly (aside from events that appear to have occurred simultaneously) – a strong requirement! VCR's multisets of independent events allow us to capture some of the haziness of observation, by leaving events independent unless a definite dependence can be observed. Structural traces allow for each planet to observe its own events, thus removing any concerns about accuracy or observation delay. The trace of the whole galaxy is formed by assembling together, compositionally, the traces of the different planets.

This analogy may appear somewhat abstract, but it maps directly to the practical problem of recording traces for distributed systems, where each machine has internal concurrency and communication, as well as the parallelism and communication of the multiple machines. Implementing a CSP Olympian observer that could accurately observe all internal machine events (from multiple machines) and networked events is not feasible in a distributed system. The CSP observer must be present on one of the machines, rendering its observation of other machines inaccurate. The system could perhaps be implemented using synchronised high-resolution timers (and a later merging of traces), but what VCR and structural traces capture is that the ordering of some events does not matter. If machines A and B communicate while C and D also communicate, with no interaction between the two pairs, we can deduce that the ordering of the A–B and C–D communications was arbitrary, as they could have occurred in either order. In VCR terms, they are independent events.
Trace Investigation

We have already stated that CSP enables proof using traces (among other models) – should your program already be formally modelled, the need for further analysis is perhaps unclear. Two possible uses for trace investigation are given in this section.

CSP offers both external choice (where the environment decides between several offered events) and internal choice (where the process itself decides which to offer). There is sometimes an implicit intention that the program be roughly fair in the choice. For example, if a server is servicing requests from two clients, neither client should be starved of attention from
the server. This may even be a qualitative judgement; hence human examination of the traces of a system (that capture its behaviour) may be needed. We give an example of starvation in section 2.2.

Another reason is that the program may have capacity for error handling. For example, some events may be offered by a server (to prevent deadlock), but for well-formed messages from a remote client, these events should not occur. A spate of badly-formed messages from a client may be cause for investigation, and a trace can aid in this respect.

Organisation

It turns out that in some senses a structural trace generalizes both VCR and CSP traces, which we discuss in section 1. The generality of structural traces is important for two reasons. First, the CHP library [5] is capable of producing all three types of traces of a Haskell program, but producing structural traces is more efficient than producing the other two trace types. (As previously discussed by the authors [3], these traces differ from merely adding print statements to a program in that recording traces does not change the behaviour of the program.) From a structural trace, one can generate equivalent VCR or CSP traces (we give the algorithms in section 1). Second, structural traces provide a framework for visualisation of concurrent programs, which we discuss in section 2, where we argue that trace visualisation is helpful for debugging. In section 3 we discuss our progress on characterizing structural traces more formally, and the challenges that remain.
1. Conversion

Each trace type is a different representation of an execution of a system. It is possible to convert between some of the representations. One VCR trace can be converted to many CSP traces (see section 1.1). A VCR trace can thus be considered to be a particular equivalence class of CSP traces. One structural trace can be interleaved in different ways to form many VCR traces and (either directly or transitively) many CSP traces (see section 1.2). A structural trace can also be considered to be an equivalence class of VCR (or CSP) traces. This gives us an ordering on our trace representations: one structural trace can form many VCR traces, which can in turn form many CSP traces. These conversions are deterministic and yield finite sets of traces. In general, conversions in the opposite direction (e.g., CSP to Structural) can be non-deterministic and infinite – for example, the CSP trace ⟨a⟩ could have been generated by an infinite set of structural traces: a0, or (a0 || a0), or (a0 || a0 || a0), and so on.

1.1. Converting VCR Traces to CSP Traces

One VCR trace can be transformed to many CSP traces. A VCR trace is a sequence of multisets; by forming all permutations of the different multisets and concatenating them, it is possible to generate the corresponding set of all possible CSP traces. A Haskell function for performing the conversion is given in Figure 2 – an example of this conversion is:

⟨{a, b}, {c}, {d, e}⟩ → {⟨a, b, c, d, e⟩, ⟨a, b, c, e, d⟩, ⟨b, a, c, d, e⟩, ⟨b, a, c, e, d⟩}

Note that the number of resulting CSP traces is easy to calculate: for all non-empty VCR traces tr, the identity length (vcrToCSP tr) == foldl1 (∗) (map (factorial . Set.size) tr) holds. That is, the number of generated CSP traces is equivalent to the product of the factorials of the sizes of each set in the VCR trace.
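The conversion of section 1.1 is easy to express in any language with permutations and Cartesian products; the following Python sketch is a transliteration of the idea, not the paper's Haskell implementation:

```python
from itertools import permutations, product

def vcr_to_csp(vcr_trace):
    """All CSP traces consistent with a VCR trace (a list of event sets):
    choose one permutation of each set, then concatenate in sequence."""
    per_set = [list(permutations(sorted(s))) for s in vcr_trace]
    return [[e for perm in choice for e in perm]
            for choice in product(*per_set)]

traces = vcr_to_csp([{"a", "b"}, {"c"}, {"d", "e"}])
print(len(traces))  # 4
print(traces[0])    # ['a', 'b', 'c', 'd', 'e']
```

The number of traces produced is the product of the factorials of the set sizes, which coincides with the product of the sizes whenever every set has at most two elements, as in the example above.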
type CSPTrace = [Event]
type VCRTrace = [Set Event] -- All sets must be non-empty.

cartesianProduct :: [a] -> [b] -> [(a, b)]
cartesianProduct xs ys = [ (x, y) | x <- xs, y <- ys ]

vcrToCSP :: VCRTrace -> [CSPTrace]
vcrToCSP = foldr (\s ts -> map (uncurry (++))
                    (cartesianProduct (permutations (Set.toList s)) ts)) [[]]

Figure 2. The algorithm for converting VCR traces to CSP traces.

combine :: [Map (Event, SeqId) Integer] -> Map (Event, SeqId) Integer
combine = Map.unionsWith (+)

prSeq :: StructuralSeq -> Map (Event, SeqId) Integer
prSeq (Seq (Left e) s) = combine [Map.singleton e 1, maybe Map.empty prSeq s]
prSeq (Seq (Right (Par sA sB ss)) s)
  = combine (maybe Map.empty prSeq s : prSeq sA : prSeq sB : map prSeq ss)

data Cont = Cont (Map (Event, SeqId) Integer) ((Event, SeqId) -> Cont) | ContDone

convSeq :: StructuralSeq -> Cont
convSeq (Seq (Left e) s) = c
  where c = Cont (Map.singleton e 1)
                 (\e' -> if e /= e' then c else maybe ContDone convSeq s)
convSeq (Seq (Right (Par sA sB ss)) s)
  = merge (convSeq sA : convSeq sB : map convSeq ss) `andThen` maybe ContDone convSeq s

andThen :: Cont -> Cont -> Cont
ContDone `andThen` r = r
Cont m f `andThen` r = Cont m (\e -> f e `andThen` r)

merge :: [Cont] -> Cont
merge cs = case [ m | Cont m _ <- cs ] of
             [] -> ContDone
             ms -> Cont (combine ms) (\e -> merge [ f e | Cont _ f <- cs ])

structuralToCSP :: StructuralTrace -> CSPTrace
structuralToCSP TEmpty = []
structuralToCSP (T s) = iterate (convSeq s)
  where
    participants = prSeq s
    iterate :: Cont -> CSPTrace
    iterate ContDone = []
    iterate (Cont m f) = fst e : iterate (f e)
      where es = Map.filter (\(n, n') -> n == n') (Map.intersectionWith (,) m participants)
            e  = fst (Map.findMin es) -- Arbitrary pick

Figure 3. The algorithm for converting Structural traces to CSP traces.
Our first choice may be a, giving us the partial VCR trace {a}. If our next choice is b, the VCR trace becomes {a}, {b}, but if our next choice is c, the VCR trace becomes {a, c}. We could favour the second choice if we wanted more independence visible in the trace – which may aid understanding. We emphasise again that this is a moot point for generating the set of all traces, but we believe that for single traces, a trace with maximal
obvious independence (i.e. fewer but larger sets in the VCR trace) will be easier to follow.

1.3. Practical Implications

In our previous work [3] we discussed implementing the recording of three different types of traces: CSP, VCR and Structural. A CSP trace is most straightforward to record, by using a mutex-protected sequence of events as a trace for the whole program. A VCR trace uses a mutex-protected data structure with additional process identifiers. Both of these recording mechanisms are inefficient and do not scale well, due to the single mutex-protected trace. With four, eight or more cores, the contention for the mutex may cause the program to slow down or alter its execution behaviour. This deficiency is not present with Structural traces, which are recorded locally without a lock, and thus scale well to more cores.

We could also consider recording the traces of a distributed system. Implementing CSP's Olympian observer on a distributed system (a parallel system with delayed messaging between machines) is a very challenging task. VCR's concepts of imperfect observation and multiple observers would allow for a different style of observation. Structural traces could be recorded on different machines without modification to the recording strategy, and merged afterwards just as they normally are for parallel composition. The structural trace should also be easier to compress, due to regularity that is present in branches of the trace, but that is not present in the arbitrary interleavings of a CSP trace.

Given that the structural trace is more efficient to record in terms of time and space (memory requirements), supports distributed systems well and can be converted to the other trace types post hoc, there is a strong case for recording a structural trace rather than a CSP or VCR trace.

2. Visual Traces

It is possible to visualise the different traces in an attempt to gain more understanding of the execution that generated the trace.
For an example, we will use P:

P = (Q o9 Q) ||{b} ((b → b → SKIP) ||| (c → c → SKIP))
where Q = (a → SKIP) ||| (b → SKIP)

One possible CSP trace of this system is: ⟨a, b, a, c, c, b⟩

This trace is depicted in Figure 4a. It can be seen that for CSP traces, this direct visualisation offers no benefits over the original trace. As an alternative, the trace is also depicted in Figure 4b. This diagram is visually similar to a finite state automaton, with each event occurrence being a state, and sequentially numbered edges showing the paths from event to event – but these edges must be followed in ascending order.

A possible VCR trace of this system is: ⟨{a, b}, {a, c}, {c, b}⟩

This trace is depicted straightforwardly in Figure 4c. The structural trace of this system, with event sequence identifiers, is:

((a0 || b0) → (a1 || b1)) || ((b0 → b1) || (c0 → c1))

A straightforward depiction is given in Figure 4d. An alternative is to merge together the nodes that represent the same event and sequence number, and use edges with thread-identifiers to join them together: this is depicted in Figure 4e.
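Such direct renderings are mechanical to produce. For instance, a simple CSP-trace rendering in the style of Figure 4a can be emitted as Graphviz DOT text (a sketch of our own; the paper's accompanying graph-generating programs are not shown here):

```python
def csp_trace_to_dot(trace):
    """Emit a Graphviz DOT chain for a CSP trace: one node per event
    occurrence, with edges giving the recorded order (cf. Figure 4a)."""
    lines = ["digraph trace {", "  rankdir=LR;"]
    for i, ev in enumerate(trace):
        lines.append(f'  n{i} [label="{ev}"];')
    for i in range(len(trace) - 1):
        lines.append(f"  n{i} -> n{i+1};")
    lines.append("}")
    return "\n".join(lines)

print(csp_trace_to_dot(["a", "b", "a", "c", "c", "b"]))
```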
(a) A simple rendering of the CSP trace: a, b, a, c, c, b.
(b) A more complicated rendering of the CSP trace from (a): event synchronisations are rendered as state nodes, and ascending transition identifiers label the edges.
(c) A simple rendering of the VCR trace: {a, b}, {a, c}, {c, b}
(d) A rendering of the structural trace: ((a0 || b0 ) → (a1 || b1 )) || ((b0 → b1 ) || (c0 → c1 )). Event synchronisations are represented by nodes with the same name, and the edges are labelled threads of execution joining together the event synchronisations.
(e) A rendering of the structural trace: ((a0 || b0) → (a1 || b1)) || ((b0 → b1) || (c0 → c1)). Event synchronisations are represented by a single node, and the edges are labelled threads of execution joining together the event synchronisations.

Figure 4. Various graphical representations of traces that could be produced by the CSP system P = (Q o9 Q) ||{b} ((b → b → SKIP) ||| (c → c → SKIP)) where Q = (a → SKIP) ||| (b → SKIP).
2.1. Discussion of Visual Traces Merely graphing the traces directly adds little power to the traces (Figures 4a, 4c and 4d). The contrast between the CSP, VCR and structural trace diagrams is interesting. The elegance and simplicity of the CSP trace (Figure 4a) can be contrasted with the more information
Figure 5. A Petri net formed from Figure 4e by using the nodes as transitions and the edges as places (unlabelled in this graph). Being derived from a trace, the Petri net is finite and non-looping; it terminates successfully when a token reaches the right-most place.
provided by the VCR trace of independence (Figure 4c) and the more complex structure of the structural trace (Figure 4d). The merged version of the structural trace (Figure 4e) is a simple change from the original (Figure 4d), but one that enables the reader to see where the different threads of execution “meet” as part of a synchronisation. Figure 4e strongly resembles the dual of a graphical representation of a Petri net [6], a model of true concurrency. The nodes in our figure (FORK, JOIN and events) are effectively transitions in a standard Petri net, and the edges are places in a standard Petri net. Thus we can mechanically form a Petri net equivalent, as shown in Figure 5. Figure 4e is also interesting for its relation to VCR traces. Our definition of dependent events [3] is visually evident in this style of graph. An event b is dependent on a iff a path can be formed in the directed graph from a to b. Thus, in Figure 4e, it can be seen that the c events are mutually independent of the a and b events, and that the right-most a event is dependent on the left-most b event, and so on.
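The dependence test described above is ordinary reachability in the directed graph of the trace. A Python sketch (the adjacency representation and node names are our own, standing in for a fragment of Figure 4e):

```python
def dependent(graph, a, b):
    """Event b is dependent on event a iff there is a directed path from
    a to b in the trace graph (adjacency dict: node -> successor list)."""
    stack, seen = list(graph.get(a, [])), set()
    while stack:
        n = stack.pop()
        if n == b:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, []))
    return False

# A fragment of a trace graph: the left-most b leads to the right-most a;
# the c events are connected only to each other.
g = {"b0": ["a1", "b1"], "a0": ["a1"], "c0": ["c1"]}
print(dependent(g, "b0", "a1"))  # True: a1 depends on b0
print(dependent(g, "c0", "a1"))  # False: the c events are independent
```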
2.2. The Starving Philosophers
CSP models (coupled with tools such as FDR) allow formally-specified properties to be proved about programs: for example, freedom from deadlock, or satisfaction of a specification. Some properties of a system can be difficult to capture in this way – we present here an example involving starvation, based on Peter Welch’s “Wot, No Chickens?” example [7]. Our example involves three philosophers. All the philosophers repeatedly attempt to eat chicken (no time for thinking in this example!). They are coupled with a chef, who can serve two portions of chicken at a time. In pseudo-CSP1 , our example is:
1 The ampersand denotes conjunction [8]: the process a & b is an indivisible synchronisation on a and b.
Figure 6. An example structural trace of five iterations of the starving philosophers example. The chef is collapsed, as it could be in an interactive visualisation of the trace. It is clear from this zoomed-out view that one philosopher (shown in the third row) has far fewer synchronisations, revealing that it is being starved.
PHILOSOPHER(n) = chicken.n → PHILOSOPHER(n)

CHEF = ((chicken.0 & chicken.1) □ (chicken.0 & chicken.2) □ (chicken.1 & chicken.2)) o9 CHEF

SYSTEM = (PHILOSOPHER(0) ||| PHILOSOPHER(1) ||| PHILOSOPHER(2)) ||{chicken.0, chicken.1, chicken.2} CHEF
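The starvation this system can exhibit is easy to reproduce in a toy simulation. The sketch below (Python, our own illustration) assumes a biased implementation in which, whenever several options are ready, the chef always resolves the choice in favour of the first listed option:

```python
def run_chef(iterations):
    """Simulate the chef: all three philosophers are always hungry, and
    the choice between ready options is always resolved in favour of the
    first one listed (a biased but CSP-allowed refinement)."""
    options = [(0, 1), (0, 2), (1, 2)]  # pairs of philosophers fed together
    served = [0, 0, 0]
    for _ in range(iterations):
        first_ready = options[0]        # biased choice: always (0, 1)
        for p in first_ready:
            served[p] += 1
    return served

print(run_chef(100))  # [100, 100, 0]: philosopher 2 never eats
```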
Our intention is that the three philosophers should eat with roughly the same frequency. However, the direct implementation of this system using the CHP library [5] leads to starvation of the third philosopher. Adding additional philosophers does not alleviate the problem: the new philosophers would also starve. In the CHP implementation, the chef always chooses the first option (to feed the first two philosophers) when multiple options are ready. This is a behaviour that is allowed by the CSP specification, but one that we would nevertheless like to avoid. Ensuring formally that the three philosophers will eat with approximately the same frequency is difficult, but by performing diagnostics on the system we can see whether it is happening in a given implementation. By recording the trace and visualising it (as shown in the structural trace in Figure 6) we are able to readily spot this behaviour and attempt to remedy it.

2.3. Interaction

The diagrams already presented are static graphs of the traces. One advantage of the structural traces is that they preserve, to some degree, the organisation of the original program. The amenability of process-oriented programs to visualisation and visual interaction has long been noted – Simpson and Jacobsen provide a comprehensive review of previous work in this area [9]. The tools reviewed all focused on the design of programs, but we could use similar ideas to provide tools for interpreting traces of programs. We wish to support examination of one trace of a program, in contrast to the exploration of all traces of a program that tools such as ProBE provide [2]. One example of an interactive user interface is shown in Figure 8. This borrows heavily from existing designs for process-oriented design tools – and indeed, it would be possible to display the code for the program alongside the trace in the lower panel.
This could provide a powerful integrated development environment by merging the code (what can happen) with the trace (what did happen). An alternative example is given in Figure 9. This shows the process hierarchy vertically, rather than using the previous nesting strategy. Synchronising events are shown horizontally, connecting the different parts of parallel compositions. What this view makes explicit is that for any given event, there is an expansion level in the tree such that all uses of that event are solely contained within one process. For a, b and c, this process is numbers; for d it is the root of the tree.
CommsTime = (numbers(d) ||{d} recorder(d)) \ {d}

numbers(out) = ((delta(b, c, out) ||{b} prefix(a, b)) ||{a,c} succ(c, a)) \ {a, b, c}
delta(in, outA, outB) = in?x → (outA!x → SKIP ||| outB!x → SKIP) o9 delta(in, outA, outB)
prefix(in, out) = out!0 → id(in, out)
id(in, out) = in?x → out!x → id(in, out)
succ(in, out) = in?x → out!(x + 1) → succ(in, out)
recorder(in) = in?x → recorder(in)

(a) The CSP specification for the CommsTime network, CommsTime.

(numbers: (delta: 42 ∗ (b? → (c! || d!))) || (prefix: (42 ∗ (b! → a?) → b!)) || (succ: 42 ∗ (c! → a?)) || (recorder: 42 ∗ d?))

(b) The structural trace for the CommsTime network. Note that the labels come from code annotations.

Figure 7. The specification and trace of the CommsTime example network. Note that the hiding of events – important for the formal specification – is disregarded when recording the trace in our implementation, in order to provide maximum information about the system's behaviour.
42*(b? -> (c! || d!)) || (42*(b! -> a?) -> b!) || 42*(c! -> a?)
(a) The selected process is numbers.
42*(b? -> (c! || d!))
(b) The selected process is delta.
Figure 8. An example user interface for exploring the trace (from Figure 7) of a process network. The top panel displays the process network, with its compositionality reflected in the nesting. Processes can be selected, and their trace displayed in the bottom panel. Inner processes could be hidden when a process is not selected (for example, recorder could have internal concurrency that is not displayed while it is not being examined).
3. Relations and Operations on Structural Traces In their Unifying Theories of Programming [10], Hoare and He identified theories of programming by sets of healthiness conditions that they satisfy. Many of these healthiness conditions involve the following relations and operations on traces: prefix (≤), concatenation (ˆ) and quotient (−). For CSP the definitions are obvious. Prefix is the sequence-prefix relation, concatenation joins two sequences, and quotient subtracts the given prefix from a sequence. VCR traces have previously been set within the UTP framework [11], and we should consider the same for Structural traces. However, the definition of these three relations and operations
(a) The recorder process is selected; the bottom panel shows its trace 42 ∗ d?.

(b) The succ process is selected; the bottom panel shows its trace 42 ∗ (c? → a!).
Figure 9. An example user interface for exploring the trace (from Figure 7) of a process network. The top panel shows a tree-like view for the process network. Processes can be selected and their trace displayed in the bottom panel. Notably, the internal concurrency of processes (and the associated events) can be hidden, as in the left-hand example for numbers. On the right-hand side, the numbers process is expanded to reveal the inner concurrency.
Figure 10. A partial structural trace (nodes without identifiers are dummy fork/join nodes). Concatenation and quotient need to be defined such that the whole trace minus the shaded area can later be concatenated with the shaded area to re-form the entire trace.
is difficult for structural traces. Implicitly, traces must obey the following law:

∀ tr, tr′ : (tr ≤ tr′) ⇒ (tr ˆ (tr′ − tr) = tr′)

Intuitively, the problem is that whereas a CSP trace has one end to which events may be appended, a structural trace typically has many different ends to which events could be added: with an open parallel composition, the new events could be added to any branch of the open composition, or after the parallel composition (thus closing it). We must find a suitable representation to show where the new events must be added. This is represented in graphical form in Figure 10 – the problem is how to represent the shaded part of the tree so that it can be concatenated with the non-shaded part to form the whole trace. We leave this for future work, but it shows that meshing structural traces with existing trace theory is difficult, and perhaps that the existing theory is an unsuitable fit. Hence, at this time we place our emphasis on the use of structural traces for practical investigation and understanding, rather than on their use in theoretical and formal reasoning.
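For contrast, the three operations and the law they must obey are trivial for flat CSP traces; a Python sketch, with lists standing in for sequences:

```python
def prefix(tr, tr2):
    """tr <= tr2: tr is an initial segment of tr2."""
    return tr2[:len(tr)] == tr

def concat(tr, tr2):
    """tr ^ tr2: sequence concatenation."""
    return tr + tr2

def quotient(tr2, tr):
    """tr2 - tr: remove the prefix tr from tr2 (defined when tr <= tr2)."""
    assert prefix(tr, tr2), "quotient is only defined for a prefix"
    return tr2[len(tr):]

# The law: tr <= tr2 implies tr ^ (tr2 - tr) == tr2
tr, tr2 = ["a", "b"], ["a", "b", "c", "d"]
print(concat(tr, quotient(tr2, tr)) == tr2)  # True
```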
4. Conclusions

We have examined three trace types: CSP, VCR and Structural. We have made a case for structural traces being the most general form of the three and have shown that a given structural trace can be converted to a set of VCR traces, and that a given VCR trace can be converted to a set of CSP traces, giving us an ordering for the different trace types.

When developing concurrent systems, it might be instructive to examine the traces of a program under development, to verify that the current behaviour is as expected. This ongoing verification can be in addition to having used formal tools such as FDR to prove the program correct. Even after development is complete, it can be beneficial to check the traces of the system running for a long time under load, to check for patterns of behaviour such as starvation of processes. Additionally, it may be necessary to examine the program's behaviour following recovery from a hardware fault (e.g., replacement of a network router) or security intrusion to ensure the system continues to operate correctly – situations that might be beyond the scope of a model checker.

Structural traces are the most general, the most efficient to record, and require the fewest assumptions for their recording strategy. Their efficiency arises from recording traces locally within each process, whereas recording CSP and VCR traces involves contending for access to a single centralised trace. They are also the most amenable to compression, as the repeating behaviour of an individual process is represented in the structural trace. Regardless of which trace type is used for investigation, structural traces can be used for the recording.

4.1. Future Work

Currently, two main areas of future work remain: visualisation and formalisation. We have explored visualising the different trace types, especially structural traces, which provide the most useful information for visualising a trace.
For large traces, structural traces provide a means for implementing a trace browser that permits focusing on particular processes. Such a tool would provide the visualisation capabilities necessary to locate causes of problems recorded in the trace. Figure 8 provides examples of what the user interface of such a trace browser might look like, but this application has yet to be implemented.

While structural traces contain useful structure that permits visualisation and conversion to VCR and CSP traces, this same structure makes them less amenable to formal manipulation. We have explained how operations (concatenation and quotient) that are straightforward on CSP and VCR traces are much more difficult to define on structural traces. Trying to define these operations has given us a greater appreciation of the elegance of CSP traces. Despite the challenge that defining concatenation and quotient presents, we remain motivated to solve this problem and ultimately to draw structural traces into the Unifying Theories of Programming.

4.2. Availability

The CHP library [5] remains the primary implementation of the trace recording discussed here. We hope to soon release an update with the latest changes described in this paper and the accompanying programs for automatically generating the graphs.

Acknowledgements

We remain grateful to past reviewers for helping to clarify our thoughts on VCR and structural traces, especially recasting the notion of parallel events as independent events. We are also grateful to Ian East, for his particularly inspiring "string-and-beads" diagram [12] that paved the way for much of the visualisation in this paper.
References

[1] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, 1985.
[2] A. W. Roscoe. The Theory and Practice of Concurrency. Prentice-Hall, 1997.
[3] Neil C. C. Brown and Marc L. Smith. Representation and Implementation of CSP and VCR Traces. In Communicating Process Architectures 2008, pages 329–345, September 2008.
[4] Marc L. Smith, Rebecca J. Parsons, and Charles E. Hughes. View-Centric Reasoning for Linda and Tuple Space computation. IEE Proceedings–Software, 150(2):71–84, April 2003.
[5] Neil C. C. Brown. Communicating Haskell Processes: Composable explicit concurrency using monads. In Communicating Process Architectures 2008, September 2008.
[6] W. Reisig. Petri Nets—An Introduction, volume 4 of EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, New York, 1985.
[7] Peter H. Welch. Java Threads in the Light of occam/CSP. In Architectures, Languages and Patterns for Parallel and Distributed Applications, volume 52 of Concurrent Systems Engineering Series, pages 259–284. WoTUG, IOS Press, 1998.
[8] Neil C. C. Brown. How to make a process invisible. In Communicating Process Architectures 2008, page 445, 2008. Talk abstract.
[9] Jonathan Simpson and Christian L. Jacobsen. Visual process-oriented programming for robotics. In Communicating Process Architectures 2008, pages 365–380, September 2008.
[10] C. A. R. Hoare and Jifeng He. Unifying Theories of Programming. Prentice-Hall, 1998.
[11] Marc L. Smith. A unifying theory of true concurrency based on CSP and lazy observation. In Communicating Process Architectures 2005, pages 177–188, September 2005.
[12] Ian East. Parallel Processing with Communicating Process Architecture. Routledge, 1995.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-105
Designing a Mathematically Verified I2C Device Driver Using ASD

Arjen Klomp a,1, Herman Roebbers b, Ruud Derwig c and Leon Bouwmeester a
a Verum B.V., Laan v Diepenvoorde 32, 5582 LA, Waalre, The Netherlands
b TASS B.V., P.O. Box 80060, 5600 KA, Eindhoven, The Netherlands
c NXP, High Tech Campus 46, 5656 AE, Eindhoven, The Netherlands

Abstract. This paper describes the application of the Analytical Software Design methodology to the development of a mathematically verified I2C device driver for Linux. A model of an I2C controller from NXP is created, against which the driver component is modelled. From within the ASD tool the composition is checked for deadlock, livelock and other concurrency issues by generating CSP from the models and checking these models with the CSP model checker FDR. Subsequently, C code is automatically generated which, when linked with a suitable Linux kernel runtime, provides a complete defect-free Linux device driver. The performance and footprint are comparable to handwritten code.

Keywords. ASD, CSP, I2C, Device Driver, Formal Methods, Linux kernel
Introduction In this Analytical Software Design [1] [2] (ASD) project, NXP’s Intellectual Property and Architecture group successfully used Verum’s ASD:Suite to model the behaviour of an I2C [3] device driver. It had been demonstrated in other case studies [4][5] to save time and cost for software teams developing and maintaining large and complex systems. But this was the first time the tool had been applied to driver software. NXP is a leading semiconductor company founded by Philips and, working with Verum, undertook a project to evaluate the benefits of using ASD:Suite for upgrading its device driver software. 1. Background Analytical Software Design (ASD) combines the practical application of software engineering mathematics and modelling with specification methods that avoid difficult mathematical notations and remain understandable to all project stakeholders. In addition, it uses advanced code generation techniques. From a single set of design specifications, the necessary mathematical models and program code are generated automatically. ASD uses the Sequence Based Specification Method [6] to specify functional requirements and designs as black box functions. These specifications are traceable to the original (informal) requirements specifications and remain completely accessible to the critical project stakeholders. In turn, this allows the stakeholders and domain experts to play a key role and fully participate in verifying ASD specifications; this is an essential requirement for successfully applying such techniques in practice. 1
1 Corresponding Author: [email protected]
A. Klomp et al. / A Mathematically Verified I2C Device Driver
Figure 1. An overview of Analytical Software Design.
At the same time, these specifications provide the degree of rigour and precision necessary for formal verification. Within ASD, one can apply the Box Structured Development Method (BSDM) [7], following the principles of stepwise refinement, to transform the black box design specifications into state box specifications from which the implementation is typically derived. Figure 1 summarizes the main elements of ASD.

The functional specification is analysed using the Sequence-Based Specification method, extended to enable non-determinism to be captured. This enables the externally visible behaviour of the system to be specified with precision and guarantees completeness. Next, the design is specified using Sequence-Based Specification. This still remains a creative, inventive design activity requiring skill and experience combined with domain knowledge. With ASD, however, the design is typically captured with much more precision than is usual with conventional development methods, raising many issues such as illegal behaviour, deadlocks and race conditions early in the life cycle and resolving them before implementation has started.

The ASD code generator automatically generates mathematical models from the black box and state box specifications and designs. These models are currently generated in CSP [8], which is a process algebra for describing concurrent systems and formally reasoning about their behaviour. This enables the analysis to be done automatically using the ASD model checker (based on FDR [9]). For example, we can use the model checker to verify whether a design satisfies its functional requirements and whether the state box specification (used as a programming specification) is behaviourally equivalent to the corresponding black box design. In most cases, a design cannot be verified in isolation; it depends on its execution environment and the components it uses for its complete behaviour.
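As a toy illustration of a black box function from stimulus sequences to responses (invented for this sketch, and far simpler than a real ASD specification), consider a trivial open/read/close interface. A state machine canonically encodes the history-dependence of responses, and illegal stimulus sequences are made explicit rather than left undefined:

```c
#include <assert.h>

/* Toy sketch (not from the paper): a black box function BB maps every
 * stimulus sequence to a response.  A state machine encodes the
 * history canonically; "illegal" responses make completeness explicit. */
typedef enum { S_OPEN, S_READ, S_CLOSE } Stimulus;
typedef enum { R_OK, R_DATA, R_ILLEGAL } Response;

typedef struct { int opened; } BlackBox;

Response bb_step(BlackBox *b, Stimulus s)
{
    switch (s) {
    case S_OPEN:
        if (b->opened) return R_ILLEGAL;       /* open twice: illegal */
        b->opened = 1;
        return R_OK;
    case S_READ:
        return b->opened ? R_DATA : R_ILLEGAL; /* read before open: illegal */
    case S_CLOSE:
        if (!b->opened) return R_ILLEGAL;      /* close before open: illegal */
        b->opened = 0;
        return R_OK;
    }
    return R_ILLEGAL;
}
```

The point of the exercise in ASD is that every sequence, including the illegal ones, receives an explicit response, which is what gives the model checker a complete specification to verify against.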
In ASD, used component interfaces are captured as under-specified sequence based specifications. The corresponding CSP models are automatically generated, with under-specified behaviour
being modelled by the introduction of non-determinism. These models are then combined with those of the design and the complete system is verified for compliance with the specification. Defects detected during the verification phase are corrected in the design specification, leading to the generation of new CSP models and the verification cycle being repeated (this is typically a very rapid cycle). After the design has been verified, the ASD code generator can be used to generate program source code in C++, C or other similar languages.

Section 2 introduces the case study and section 3 describes the method used. In section 4, we give a more detailed overview of how ASD techniques were applied in practice for this particular case, and section 5 presents the results of the analysis performed.

2. Case-study: a Mathematically Verified I2C Driver

2.1 Introduction

NXP's IP and Architecture group was developing code for an I2C device driver for next-generation systems. I2C drivers had been in the marketplace for a number of years. In theory, the existing software could be reused for new generations of the hardware, but in practice there were timing and concurrency issues. NXP therefore also wanted to know whether the ease of maintenance of the driver could be improved. To learn more about the capabilities of the ASD:Suite and its underlying technology, the group carried out a joint project with Verum in which ASD was applied to model the behaviour of the device driver. The main objectives of the case study were to:

• Verify the benefits of ASD. The most important benefits of interest to NXP were:
  o Delivering a higher quality product with the same or reduced development effort.
  o Reducing the cost of future maintenance.
  o Achieving at least equivalent performance to the existing product.
• Determine if ASD modelling helps in verifying the correctness of device driver software.
• Assess whether the ASD:Suite is a useful and practical tool for developing device drivers.
This paper presents an overview of how the ASD method and tool suite were applied to the development of a Linux version of the IP_3204 I2C driver software, hosted on NXP's Energizer II chip, and draws conclusions based upon these objectives.

2.2 Current Situation

I2C devices are used particularly in larger systems on chips (SoCs). These are typically employed in products such as digital TVs, set-top boxes, automotive radios and multimedia equipment, as well as in more general-purpose controllers for vending machines etc. Quality is a major concern, especially for the automotive market, where end-user products are installed in high-end, expensive vehicles.
The I2C driver software had been ported and updated many times, both to keep pace with the evolving hardware platform and to cater for the requirements of new and upgraded target operating systems. The existing device driver had in the past suffered from timing and concurrency issues that caused problems in development and testing, largely stemming from an incomplete definition of the hardware and software interfaces.

3. Method

ASD:Suite had been demonstrated in other projects to shorten timescales and to reduce costs when used in the development and maintenance of large and complex systems. This was, however, the first time it had been applied to device driver software. The project covered three key activities:

• Modelling of the device driver software.
• Automatic generation of C code.
• Integration into the Linux kernel.

NXP provided the domain knowledge for the project. Verum did the actual modelling of the driver specification and design. TASS provided input for the C code generation process and assisted in the implementation of the Linux kernel Operating System Abstraction Layer (OSAL), including interrupt management.

3.1 Modelling the Hardware and Software Interfaces

The dynamic behaviour of the original driver was largely unspecified and unclear in the original design. The application of ASD clarified the interfaces, which resulted in a better understanding of the behaviour. For example, the software interface, which NXP calls the hardware API (HWAPI), was assumed to be stateless. However, ASD modelling revealed HWAPI state behaviour that had not previously been documented.

3.2 Improving the Design

The design of the original I2C driver software was different: more work was done in the interrupt service routine context instead of in kernel threads. Using ASD it was possible to do the majority of the driver work in a kernel thread and as little as possible in the interrupt service routine.
There is a trade-off here: doing more in the kernel thread will improve overall system response but can reduce the performance of the driver itself. ASD will, however, ensure that the behaviour of the driver is complete and correct.

3.3 ASD Methodology

At all stages of a project, ASD technology is applied to the various parts of the device driver software, and can also be applied to the hardware. In summary the approach, visualized in Figure 2, is:

1. Using the ASD:Suite, gather and document requirements, capturing them in the ASD model.
Figure 2. The triangular relation between ASD Model, CSP and source code.
2. Balance the design for optimal performance (e.g. by introducing asynchronous operation).
3. Generate a formal model of system behaviour and verify the correctness of the design (before any code has been generated).
4. Generate the source code, guaranteed to be defect free and functionally equivalent to the formal model.

3.4 Key Benefits

When a software developer has gained familiarity with the ASD:Suite, it is possible to improve the efficiency of the development process by reducing the dependency on testing, cutting development timescales by typically 30%. Maintenance becomes easier because the generated code is operating system independent, and changes are made to the model, not the code, from which code can then easily be regenerated. ASD:Suite delivers high performance software with a small code footprint and low resource requirements, well suited to deeply embedded systems. Most importantly, the quality of the software improves, because code defects are eliminated.

4. Integrating ASD Generated Code with the Linux Kernel

In general the structure of an ASD application can be described as depicted in Figure 3 below. The ASD clients call ASD components, which communicate with their environment through calls to the ASD Run Time Environment (RTE). This RTE uses an OS Abstraction Layer (OSAL) to make calls to some underlying OS. This should provide easy porting from one OS to another. The OSAL comprises a set of calls implementing thread-based scheduling and condition monitors, as well as time-related functions used by the RTE. In the case of a normal application program, the OSAL is directly mapped onto POSIX threading primitives, mutexes and condition variables in order to provide the required scheduling and resource protection. Currently only the C code generator uses the OSAL interface.

In the case of the I2C driver for Linux there is more to do than just have a standard POSIX implementation for the ASD Run Time Environment, as this now has to operate in
kernel space and implement a device driver. This means that we need to offer a standard driver interface (i.e. open, read, write, ioctl, close) on top of the ASD components, and that the ASD client needs to use the corresponding system calls to communicate with the ASD components from user space. Furthermore, we need to talk to the hardware directly, which means that we need to deal with interrupts in a standard way. This is realized by an ASD foreign component. A foreign component is the ASD term for a component whose implementation is handwritten code derived from the ASD interface model.

4.1 Execution Semantics of ASD Components

In order to understand the requirements of the RTE, one needs to consider the execution semantics of ASD components.

4.1.1 ASD Terminology

We introduce the concept of durative and non-durative actions. A durative action takes place when a client synchronously invokes a method at a server. Until this method invocation returns, the client remains blocked and cannot invoke other methods at the same or other servers. After the method has returned to the client, the server processes the durative part and eventually informs the client asynchronously through a callback. A non-durative action takes place when a client synchronously invokes a method at a server and remains blocked until the server has completely processed the request.

Basically, an ASD component can offer two different kinds of interfaces: synchronous ones and asynchronous ones. The synchronous interfaces are the client interfaces and used interfaces; the asynchronous interfaces are the callback interfaces. A component can offer several client interfaces, use several server interfaces, and offer as well as use several callback interfaces. In order to be able to deal with asynchronous interfaces there is a separate Deferred Procedure Call (DPC) thread per callback interface.
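The durative/non-durative distinction can be sketched in ordinary C with POSIX threads. The names below (`non_durative`, `durative`, `on_complete`) are illustrative, not ASD:Suite API: a non-durative action returns only when fully processed, while a durative one returns at once and reports completion later through a callback raised from a server thread.

```c
#include <pthread.h>

/* Sketch (assumed names, not the ASD:Suite API): contrast a
 * non-durative action (caller blocked until fully processed) with a
 * durative one (call returns at once; completion arrives later via an
 * asynchronous callback from the server). */

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done = PTHREAD_COND_INITIALIZER;
static int callback_result = -1;
static int finished = 0;

/* non-durative: the request is fully processed before returning */
int non_durative(int x) { return x + 1; }

static void on_complete(int result)          /* asynchronous callback */
{
    pthread_mutex_lock(&m);
    callback_result = result;
    finished = 1;
    pthread_cond_signal(&done);
    pthread_mutex_unlock(&m);
}

static void *server(void *arg)
{
    int x = *(int *)arg;
    on_complete(x * 2);                      /* the durative part, done later */
    return NULL;
}

/* durative: returns immediately; the server works on in the background */
void durative(pthread_t *t, int *x) { pthread_create(t, NULL, server, x); }

void await_callback(void)
{
    pthread_mutex_lock(&m);
    while (!finished)
        pthread_cond_wait(&done, &m);
    pthread_mutex_unlock(&m);
}
```

In the real RTE the callback is not delivered directly like this but queued for a DPC thread, as described below.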
Making a call to a callback interface is implemented as posting a callback packet in a DPC queue serviced by a DPC thread, and notifying this DPC thread that there is work to do. The component code executes in the thread of the caller (it is just a method/subroutine call). Having the DPC threads execute as soon as they are notified may create a problem when there is shared data between the component thread and the DPC thread. As there always is shared data (state machine state), this data needs to be protected from concurrent access using some form of resource protection.
Figure 3. General overview of an ASD system.
A. Klomp et al. / A Mathematically Verified I2C Device Driver
111
For ASD this amounts to run-to-completion semantics. This means that it is guaranteed that the stimulus and all responses have been processed completely, in the specified order, and that all predicates are updated before the state transition is made. This implies that only one client may be granted access to the component at any time, and that DPC access can only occur after completion of a synchronous client call. In other words: an ASD component has monitor semantics. Furthermore, DPC threads must execute and empty their callback queues before new client access to the component is granted. This is realized by the client-DPC synchronization.

Finally, special precautions must be taken for the following scenario. A client invokes a non-durative action on the interface of component A, where run-to-completion must be observed. If this invocation results in the invocation of a durative method, and the client can only return when a callback for the durative action is invoked, we would get into trouble if the DPC server thread were to block on the synchronization between client and server thread. To prevent this scenario there is a conditional wait after the release of the client-DPC mutex. When the client leaves the component state machine it needs to pass the conditional wait, to be released by the DPC server thread after the latter has finished processing the expected callback. The following pseudo code, depicted in Figure 4, enforces the client-side sequence of events with the minimum amount of thread switching:

Get ClientMutex
Get ClientDPCMutex
CallProcessing
Release ClientDPCMutex
ConditionalWait (DPC callback)
Release ClientMutex
Figure 4. Internal structure of an ASD component.
The DPC thread, shown inside the dotted area, executes the following pseudo code:

while (true) {
    WaitEvent(wakeup, timeout)
    Get ClientDPCMutex
    CallProcessing
    Release ClientDPCMutex
    Signal DPC callback
}
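The two pseudo-code fragments above can be tried out in user space. The sketch below uses POSIX primitives standing in for whatever the OSAL maps them to (names are illustrative; the generated RTE differs in detail). Note that `pthread_cond_wait` releases the DPC mutex while waiting, which realises the "release, then conditional wait" step of the client sequence.

```c
#include <pthread.h>

/* User-space sketch of the client sequence and the DPC loop, using
 * POSIX primitives.  Illustrative only: the real RTE and OSAL differ. */

static pthread_mutex_t client_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t client_dpc_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wakeup = PTHREAD_COND_INITIALIZER;
static pthread_cond_t dpc_callback = PTHREAD_COND_INITIALIZER;
static int pending = 0, callback_done = 0, processed = 0;

void *dpc_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&client_dpc_mutex);   /* Get ClientDPCMutex */
        while (!pending)                          /* WaitEvent(wakeup, ...) */
            pthread_cond_wait(&wakeup, &client_dpc_mutex);
        pending = 0;
        processed++;                              /* CallProcessing */
        callback_done = 1;
        pthread_cond_signal(&dpc_callback);       /* Signal DPC callback */
        pthread_mutex_unlock(&client_dpc_mutex);  /* Release ClientDPCMutex */
    }
    return NULL;
}

void client_call(void)
{
    pthread_mutex_lock(&client_mutex);            /* Get ClientMutex */
    pthread_mutex_lock(&client_dpc_mutex);        /* Get ClientDPCMutex */
    pending = 1;                                  /* CallProcessing: post a */
    callback_done = 0;                            /* callback packet        */
    pthread_cond_signal(&wakeup);
    /* ConditionalWait (DPC callback): cond_wait atomically releases the
     * DPC mutex while waiting, matching "Release ClientDPCMutex" then
     * the conditional wait in the pseudo code */
    while (!callback_done)
        pthread_cond_wait(&dpc_callback, &client_dpc_mutex);
    pthread_mutex_unlock(&client_dpc_mutex);
    pthread_mutex_unlock(&client_mutex);          /* Release ClientMutex */
}
```

The predicate variables (`pending`, `callback_done`) guard against lost wakeups, so the ordering guarantees of the pseudo code hold even if the DPC thread starts late.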
4.2 Structure of an ASD Linux Device Driver

When using an ASD component as a Linux device driver, some glue code is necessary. The standard driver interface must be offered to the Linux kernel so that the driver functions can be called from elsewhere. Also, the implementation of the ASD execution semantics is not so straightforward, because we now need to consider hardware interrupts from the device we are controlling. As ASD cannot directly deal with interrupts, we need a way to convert hardware interrupts into events that ASD can cope with. In effect we need an OS Abstraction Layer implementation for Linux kernel space. This means that instead of user-level threads we now need kernel-level threads, and we need to decouple interrupts from the rest of the processing. To this end we use a kernel FIFO, into which either the interrupt service routine or an event handler writes requests for the kernel-level event handler. The kernel FIFO ensures that posting messages to and retrieving messages from the FIFO is interrupt safe. The first-level event handler is implemented as a kernel work queue function scheduled by the interrupt handler, which reads messages from the kernel FIFO and then puts messages in a message queue serviced by a kernel DPC thread. This first-level event handler is not allowed to block. The kernel DPC thread, signalled by the first-level event handler, is allowed to block. Figure 5 explains this in more detail.
Interrupt service routine:

i2c_isr() {
    event = deal_with_IRQ();
    kfifo_put(event);
    queue_work(interrupt_wq, &i2c_event_work);
}

Work queue function:

i2c_event_handler() {
    while (msg_in_kfifo()) {
        kfifo_read(…, &intdata, …);
        /* put msg in DPC queue and schedule DPC thread */
        schedule_DPC_callback(&int_data);
    }
}

Figure 5. Connecting interrupts to ASD callbacks.
Figure 6. Structure of the ASD device driver.
Figure 6 depicts the relation between the kernel code, the OSAL, the ASD generated code and the handwritten code. Starting from the upper layer downwards, we find a standard implementation of a device structure initialisation in the dev module, offering the open, close, read, write and ioctl calls. The I2C core module is also standard, calling the functions i2c_add_adapter, i2c_algo_control and i2c_algo_transfer implemented in the adapter file myc.c. This file makes the connection between the driver entry points and the ASD generated I2C component code. So that the component can be configured (e.g. to set the correct I2C bus speed), there needs to be a way to hand configuration parameters down to a configuration function in the ASD file. ParamConverter converts between the speed in bits/sec and the register values necessary to obtain the desired speed. I2cMessageIterator repeats the actions necessary for reading or writing a single byte until the requested number of bytes is reached.

For implementing the ASD execution semantics, the ASD C Run Time Environment is used. To fulfil its job, this RTE calls upon an OSAL implementation for Linux kernel space, which this project is the first to use. The OSAL uses the Linux kernel to implement the required scheduling, synchronisation and timing functionality. The I2CHwComponent implements the interface to the I2C hardware, supported by a Hardware Abstraction Layer (vHAL) implementation already available for this component. This hardware component implementation is based on an interface model derived from the IP hardware-software interface documentation. This interface model of the hardware is used to verify the driver. The difficult bit is, of course, to make the model behave exactly the same as the hardware.
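To illustrate what a ParamConverter-style function does, the sketch below assumes a hypothetical controller whose SCL frequency is the peripheral clock divided by the sum of two clock registers. The IP_3204's actual register layout is NXP-specific and not described in this paper, so both the struct and the formula are assumptions.

```c
/* Hypothetical ParamConverter sketch: assumes a controller where
 * SCL frequency = pclk_hz / (clk_hi + clk_lo), split symmetrically.
 * The real IP_3204 registers are NXP-specific and may differ. */
typedef struct {
    unsigned clk_hi;   /* SCL high period, in pclk ticks */
    unsigned clk_lo;   /* SCL low period, in pclk ticks  */
} I2cClockRegs;

I2cClockRegs param_convert(unsigned pclk_hz, unsigned bus_speed_hz)
{
    I2cClockRegs r;
    unsigned div = pclk_hz / bus_speed_hz; /* whole SCL period in pclk ticks */
    r.clk_hi = div / 2;
    r.clk_lo = div - r.clk_hi;             /* keep hi + lo == div exactly */
    return r;
}
```

For example, with a 48 MHz peripheral clock and the standard-mode 100 kbit/s bus speed, this yields a total divisor of 480 ticks split 240/240.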
5. Results

NXP measured whether two of its key objectives had been met. Following completion of the project, it carried out:

1. Extensive stress testing, to check the stability and robustness of the product, and that the functionality was correct.
2. Performance analysis, in terms of speed of operation and footprint.

The results of these investigations are as follows. The code and data sizes of the original handwritten code can be seen in Table 1.

Table 1. Code and data size of original NXP driver code.

                             text    data    bss    total
  Original code built-in.o  12468      16    592    13076
This handwritten code is directly linked into the kernel. In order to facilitate testing, the ASD driver was implemented as a loadable kernel module. Using C code generated from the ASD models with the C-code generator (beta version) and combining this with handwritten code to interface to the I2C hardware, we get the results shown in Table 2.

Table 2. Code and data sizes for ASD generated C code + driver + ASD OSAL.

  Type of code                              text    data    bss    total
  Handwritten ASD runtime lib incl. OSAL    4824       0      0     4824
  Handwritten code                          4876     612    864     6352
  ASD generated code                       12048      40      0    12088
  Total code in mymodule.o                 21748     652    864    23264
  Final kernel module mymodule.ko          21840     920    864    23624
We can see from these tables that the difference between the code sizes is about 10 Kbytes. This difference is constant, since the implementation of the OSAL and ASD RTE does not depend on the driver. From inspection of the handwritten I2CHwComponent it is to be expected that there is room for optimization, which could make its code size significantly smaller. Code size optimization was, however, not the goal of this project; it will be considered in further work. Initial benchmarking has also shown that the performance of the code is acceptable, being only minimally slower than the handwritten code, as depicted in Table 3 below.

Table 3. Comparison of execution times.

  Execution time       Handwritten old driver    ASD generated code + OSAL
  Send of 2 bytes      380 microseconds          386 microseconds
  Time in interrupt     60 microseconds           20 microseconds
Several remarks can be made about these results. First, ASD components in general provide more functionality: they also capture "bad weather" behaviour that is not always captured, or not correctly captured, in conventional designs. In many cases this results in a small increase in code size. Second, the way the C code is generated from the model is very straightforward; work is underway to optimize the generated code so that it will be smaller. Even now, the larger code size is acceptable for NXP. Third, more time is spent in interrupt context in the existing handwritten case because more work is done there than in the ASD case, resulting in slightly faster performance for the handwritten driver itself. The ASD code blocks interrupts for a shorter time than the existing handwritten device driver code does, resulting in better overall system response.

Despite these positive findings, there are still some concerns:

• The ASD driver currently implements less functionality than the original driver (no multi-master, no chaining). Adding this functionality will have more impact on code size.
• Initially it did not survive a 3-hour stress test.
During integration and test of the ASD driver, a number of flaws were discovered. Some were related to mistakes in the handwritten code and some to modelling errors due to misinterpretation of the hardware-software interface. The remaining stress-test stability issue was determined to be caused by unexpected responses from the I2C EEPROM used during the stress test: the driver did not expect the response the EEPROM gives when writing to a bad block. With a fresh EEPROM there are no problems. Thus, the model needs to be enhanced to cope with this situation, which should be a comparatively simple exercise.

6. Conclusions

NXP believes ASD:Suite can also provide major benefits for developing defect-free device drivers. The structured way of capturing complete requirements and designs enables its software developers to model system behaviour correctly, verify the model and automatically generate defect-free code. They are already modelling hardware, and are looking at opportunities to combine this with ASD software models, since the hardware-software interface and the (lack of) specification of the corresponding protocols remain a source of errors. This project has clearly shown that ASD modelling helps in developing and verifying deeply embedded software and that using the C code generator is beneficial and practical for device drivers. The biggest advantage seen is the rigorous specification process that ASD enforces: software designers are forced to think before they implement, and ASD helps them ensure a complete and correct specification. It was not possible in this particular case to make a direct comparison of development effort with and without ASD, but other studies have shown that using ASD:Suite can reduce development time and cost by around 30%. Additional benefits include much easier and less costly maintenance. NXP's own investigations have demonstrated the quality of the product and the performance of the generated C code.
Even where there are requirements with timing constraints that cannot be modelled using ASD, race conditions and deadlocks due to unexpected interleaving of activities are prevented by the model checker, and it is a major advantage for developers to be able to perform manual timing checks on guaranteed defect-free code. Because the code generator produces verifiably correct code, the number of test cases developed and needed to gain confidence in the final product was considerably smaller than it would have been using a conventional approach to software development. The model checker revealed that there were more than 700,000 unique execution scenarios for the device driver; without ASD, it would have required over 700,000 test cases to test the software thoroughly. Thus a major reduction in testing effort was achieved.

7. Future Work

Some thoughts have been expressed as to whether the current OSAL interface models the ASD principles in the best way. Viewing an ASD component as a separate process, and using channels to implement interfaces, could be a more appropriate model for more modern OSes (QNX, OSEK, (μ-)velOSity, Integrity®), which offer higher-level message-passing primitives. It would also make the system scalable and offer a more efficient implementation under this kind of OS. For OSes not offering message-passing primitives, mutexes and condition variables can be used to implement the required functionality. This thinking is, however, still conceptual, and was not pursued in the context of this NXP project.

References

[1] Philippa J. Hopcroft and Guy H. Broadfoot. Combining the Box Structure Development Method and CSP. In ASE '04: Proceedings of the 19th IEEE International Conference on Automated Software Engineering, pages 340–345, Washington, DC, USA, 2004.
[2] An Introduction to ASD, http://www.verum.com/resources/papers.html
[3] NXP. I2C-bus Specification and User Manual, Rev. 3, www.nxp.com/acrobat_download/usermanuals/UM10204_3.pdf, NXP, 2007.
[4] Rutger van Beusekom. Nanda Technologies IM1000 Wafer Inspection System. Verum White Paper Study, Verum, 2008.
[5] R. Wiericx and L. Bouwmeester. Nucletron uses ASD to Reduce Development and Testing Times for Cone Beam CT Scan Software. Verum White Paper Study, Verum, 2009.
[6] S.J. Prowell and J.H. Poore. Foundations of Sequence-Based Software Specification. IEEE Transactions on Software Engineering, 29(5):417–429, 2003.
[7] H.D. Mills, R.C. Linger and A.R. Hevner. Principles of Information Systems Analysis and Design. Academic Press, 1986.
[8] C.A.R. Hoare. Communicating Sequential Processes. Prentice Hall, 1985.
[9] Formal Systems (Europe) Ltd. Failures-Divergence Refinement: FDR2 User Manual, 2003.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-117
Mobile Escape Analysis for occam-pi

Frederick R.M. BARNES
School of Computing, University of Kent, Canterbury, Kent, CT2 7NF, England
[email protected]

Abstract. Escape analysis is the process of discovering boundaries of dynamically allocated objects in programming languages. For object-oriented languages such as C++ and Java, this analysis leads to an understanding of which program objects interact directly, as well as what objects hold references to other objects. Such information can be used to help verify the correctness of an implementation with respect to its design, or provide information to a run-time system about which objects can be allocated on the stack (because they do not "escape" the method in which they are declared). For existing object-oriented languages, this analysis is typically made difficult by aliasing endemic to the language, and is further complicated by inheritance and polymorphism. In contrast, the occam-π programming language is a process-oriented language, with systems built from layered networks of communicating concurrent processes. The language has a strong relationship with the CSP process algebra, that can be used to reason formally about the correctness of occam-π programs. This paper presents early work on a compositional escape analysis technique for mobiles in the occam-π programming language, in a style not dissimilar to existing CSP analyses. The primary aim is to discover the boundaries of mobiles within the communication graph, and to determine whether or not they escape any particular process or network of processes. The technique is demonstrated by analysing some typical occam-π processes and networks, giving a formal understanding of their mobile escape behaviour.

Keywords. occam-pi, escape analysis, concurrency, CSP
Introduction

The occam-π programming language [1] is a highly concurrent process-oriented language, derived from classical occam [2], in which systems are built from layered networks of communicating processes. The semantics of classical occam are based largely on those of Hoare's Communicating Sequential Processes (CSP) [3], an algebra that can be used to reason about the concurrent behaviour of occam programs [4,5]. To occam, occam-π adds new mechanisms and language constructs for data, channel and process mobility, inspired by Milner's π-calculus [6]. In addition, occam-π offers a wealth of other features that allow the construction of dynamic and evolving software systems [7]. Some of these extensions, such as dynamic process creation, mobile barriers and channel-bundles, have already had CSP semantics defined for them [8,9,10], providing a basis for formal reasoning about them. These semantics are sufficient for reasoning about most occam-π programs in terms of interactions between concurrent components, typically to guarantee the absence of deadlock, or refinement of a specification. However, they do not adequately deal with escape analysis of the various mobile types, i.e. knowing in advance the range of movement of mobiles between processes and process networks. The escape analysis information for an individual process or network of processes is useful in several ways:
F.R.M. Barnes / Mobile Escape Analysis for occam-pi
• For checking design-level properties of a system, e.g. ensuring that private mobile data in one part of a system does not escape. • For the implementation, as it describes the components tightly coupled by mobile communication — relevant in shared-memory systems, where pointers are communicated between processes, and for the breakdown of concurrent systems in distributed execution. The remainder of this paper describes an additional mobility analysis for occam-π programs, in a style similar to the well-known traces, failures and divergences analyses of CSP [11]. Section 1 provides a brief overview of occam-π and its mobility mechanisms, in addition to current analysis techniques for occam-π programs. Section 2 describes the additions for mobile escape analysis, in particular, a new mobility model. Section 3 describes how mobile escape analysis is performed for occam-π program code, followed by initial applications of this to occam-π systems in section 4. Related research is discussed in section 5, with conclusions and consideration for future work in section 6. 1. occam-π and Formal Analysis The occam-π language provides a natural expression for concurrent program implementation, based on a communicating processes model as described by CSP. Whole systems are built from layered networks of communicating processes, which interact through a variety of synchronisation and communication mechanisms. The primary mechanism for process interaction is through channel communication, where two processes synchronise (with the semantics of CSP events), and communicate data. The occam-π “BARRIER” type provides synchronisation between any number of processes, but allows no communication (although barriers can be used to provide safe access to shared data [12]). The barrier type is roughly equivalent to the general CSP event, though our implementation does not support interleaving — synchronisation between subsets of enrolled processes. 
There are four distinct groups of mobile types in the occam-π language, covering all of the occam-π mobility extensions: mobile data, mobile channel-ends, mobile processes and mobile barriers. The operational semantics of these vary depending on the type of mobile (described below). Mobile variables, of all mobile types, are implemented primarily as pointers to dynamically allocated memory. To avoid the need for complex garbage collection (GC), strict aliasing rules are applied. For all mobile types, routines exist in the run-time system that allow these to be manipulated safely, including: allocation, release, input, output, assignment and duplication.

1.1. Operational Semantics of Mobile Types

Mobile data exists largely for performance reasons. Ordinarily, data is communicated over occam-π channels using a copying semantics — i.e. the outputting process keeps its original data unchanged, and the inputting process receives a copy (overwriting a local variable or parameter). With large data (e.g. 100 KiB or more), the cost of this copy becomes significant compared with the cost of the synchronisation. With mobile data, only a reference to the actual data is ever copied — a small fixed overhead [13]. However, in order to maintain the aliasing laws of occam (and to avoid parallel race-hazards on shared data), the outputting process must lose the data it is sending — i.e. it is moved to the receiving process. A "CLONE" operator exists for mobile data that creates a copy, for cases where the outputting process needs to retain the data after the output.
Mobile barriers allow synchronisation between arbitrary numbers of parallel processes. This has uses in a variety of applications, such as the simulation of complex systems [14], where barriers can be used to protect access to shared data (using a phased access pattern of global read then local write). When output by a process, a reference to a mobile barrier is moved, unless it is explicitly cloned, in which case the receiving process is enrolled on the barrier before the communication completes.

Mobile channel-ends refer to the end-points of mobile channel bundles. These are structured types that incorporate a number of ordinary channels. Unlike ordinary channels, however, these mobile channel-ends may be moved between processes — dynamically restructuring the process network. Mobile channel-ends may be shared or unshared. Unshared ends are always moved on output; shared channel-ends are always cloned on output. Communication on the individual channels inside a shared channel-end must be done within a "CLAIM" block, to ensure mutually exclusive access to those channels.

Mobile processes provide a mechanism for process mobility in occam-π [1]. Mobile processes are either active, meaning that they are connected to an environment and are running (or waiting for an event), or inactive, meaning that they are disconnected from any environment and are free to be moved between processes. Like mobile data, there is no concept of a shared mobile process, though a mobile process may contain other mobiles (shared and unshared) as part of its internal state.

The rules for mobile assignment follow those for communication — in line with the existing laws of occam. For example, assuming "x" and "y" are integer ("INT") variables, the two following fragments of code are semantically equivalent:

    CHAN INT c:
    PAR
      c ! y
      c ? x

≡

    x := y
This rule must be preserved when dealing with mobiles, whose references are either moved or duplicated, depending on the mobile type used. The semantics of communication are also used when passing mobile parameters to dynamically created (forked) processes [15] — renaming semantics are used for ordinary procedure calls.

1.2. Analysis of occam-π Programs

Starting with an occam-π process, it is moderately straightforward to construct a CSP expression that captures the process's behaviour [4,5]. Figure 1 shows the traditional "id" process and its implementation, which acts as a one-place buffer within a process network.

    PROC id (CHAN INT in?, out!)
      WHILE TRUE
        INT x:
        SEQ
          in ? x
          out ! x
    :

[Figure: the "id" process, with channels in? and out!]

Figure 1. One place buffer process.
If the specification is for a single-place buffer, this code represents the most basic implementation — all other implementations meeting the same specification are necessarily equivalent. The parameterised CSP equation for this process is simply:

    ID(in, out) = in → out → ID(in, out)
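A recursive equation such as this can be unrolled mechanically. As an illustration only (the encoding below is hypothetical, not part of the occam-π toolchain), a few lines of Python enumerate finite traces and refusals of ID:

```python
# ID(in, out) = in -> out -> ID(in, out): the process strictly
# alternates the events "in" and "out", starting with "in".

def id_traces(max_len):
    """All traces of ID up to length max_len, shortest first."""
    return [tuple("in" if i % 2 == 0 else "out" for i in range(k))
            for k in range(max_len + 1)]

def id_refusal(trace):
    """After 'trace', ID waits for one event and refuses the other."""
    return {"out"} if len(trace) % 2 == 0 else {"in"}

print(id_traces(3))    # [(), ('in',), ('in', 'out'), ('in', 'out', 'in')]
print(id_refusal(()))  # {'out'}: before any event, only "in" is accepted
```

Each element of `id_traces` is a prefix of the single infinite alternating trace.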
This captures the behaviour of the process (interaction with its environment by synchronisation on "in" and "out" alternately), but makes no statements about individual data values. CSP itself provides only limited support for describing the stateful data of a system. Where such reasoning is required, it would be preferable to use related algebras such as Circus [16] or CSP‖B [17]. Using existing and largely mechanical techniques, the traces, failures and divergences of this "ID" process can be obtained:

    traces ID = {⟨⟩, ⟨in⟩, ⟨in, out⟩, ⟨in, out, in⟩, . . .}
    failures ID = {(⟨⟩, {out}), (⟨in⟩, {in}), (⟨in, out⟩, {out}), (⟨in, out, in⟩, {in}), . . .}
    divergences ID = {}

As described in [11], the traces of a process are the sequences of events that it may perform. For the ID process, this is ultimately an infinite trace containing "in" and "out" alternately. The failures of a process describe under what conditions a process will deadlock (behave as STOP). These are pairs of traces and event-sets, e.g. (X, E), which state that if a process has performed the trace X and only the events E are offered, then it will deadlock. For example, the first failure for the ID process states that if the process has not performed any externally visible events, and it is only offered "out", then it will deadlock — because the process is actively waiting only for "in". The divergences of a process are similar to failures, except that these describe the conditions under which a process will livelock (behave as div). The ID process is divergence free.

2. Mobility Analysis

The primary purpose of the extra analysis is to track the escape of mobile items from processes. With respect to mobile items, processes can:

• create new mobile items;
• transport existing mobiles through their interfaces; and
• destroy mobile items.

Unlike traces, failures and divergences, the mobility of a process cannot be derived from a CSP expression of an occam-π process alone — requiring either the original code from which we would generate a CSP expression, or an augmented version of CSP that provides a more detailed representation of program behaviour, specifically the mobile operations listed above. The remainder of this section describes the representation (syntax) used for mobility sequences, and some simple operations on these.

2.1. Representation

The mobility of a process is defined as a set of sequences of tagged events, where the events involved represent channels in the process's environment. For the non-mobile "id" process discussed in section 1.2, this would simply be the empty set:

    mobility ID = {}
For a version of the "id" process that transports mobile data items:

    mobility MID = {⟨in?^a, out!^a⟩}

The name "a" introduced in the mobility specification has scope across the whole set of sequences (though in this case there is only a single sequence) and indicates that the mobile data received from "in" is the same as that output on "out". The direction (input or output) is relevant, since escape is asymmetric. Processes that create or destroy mobiles instead of transporting them are defined in similar ways. The syntax for representing and manipulating mobility sequences borrows heavily from CSP [3,11], specifically the syntax associated with traces.

2.1.1. Shared Mobiles

For unshared mobile items, simple mobility sequences will have at most two items¹, reflecting the fact that a process acquires a mobile and then loses it — and therefore always in the order of an input followed by an output. For shared mobile items, mobility sequences may contain an arbitrary number of outputs, as a process can duplicate references to that mobile. Where there is more than one output, the order is unimportant — knowing that the mobile escapes is sufficient. Shared mobiles are indicated explicitly — decorated with a "+". For example, a version of the "id" process that transports shared mobiles has the model:

    mobility SMID = {⟨in?^a+, out!^a+⟩}

2.1.2. Client and Server Channel Ends

As described in section 1.1, mobile channel bundles are represented in code as pairs of connected ends, termed client and server. In practice these refer to the same mobile item, but for the purpose of analysis we distinguish the individual ends — e.g. for some mobile channel bundle "a", we use "a" for the client-end and "ā" for the server-end. A version of "id" that transports unshared server-ends of a particular channel-type would have the mobility model:

    mobility USMID = {⟨in?^ā, out!^ā⟩}

These are slightly different from other mobiles in that they can appear as both superscripts (mobile items) and channel-names (carrying other mobile items). Recursive mobile channel-end structures can also carry themselves, expressed as, e.g. ⟨a!^a⟩. Where there are multiple channels inside a mobile channel-end, the individual channels can be referred to by their index, e.g. a[0]?^x, a[1]!^a, to make clear which particular channel (for communication) is involved.

2.1.3. Undefinedness

In certain situations, which are strictly program errors, there is the potential for undefined mobile items to escape a process. Such an undefined mobile cannot be used in any meaningful way, but should be treated formally. A process that declares a mobile and immediately outputs it undefined, for example, would have the mobility model:

    mobility BAD = {⟨out!^γ⟩}

The absence of such sequences can be used to prove that a process, or process network, does not generate any undefined mobiles.

¹ Higher order operations, e.g. communicating channels over channels, can produce mobility sequences containing more than two items — see section 3.7.
2.1.4. Alphabets

As is standard in CSP, we use sigma (Σ) to refer to the set of names on which a process can communicate. For mobility sequences, this can be divided into output channels (Σ!) and input channels (Σ?), such that Σ = Σ! ∪ Σ?. Ordinary mobile items (data, barriers) are not part of this alphabet; mobile channel-ends, however, are. The various channels that are in the alphabet of an occam-π process can also be grouped according to their type: Σ_t, where t is any valid occam-π protocol and T is the set of available protocols, such that t ∈ T. Following on, Σ_t = Σ!_t ∪ Σ?_t, and ∀ t : T · Σ_t ⊆ Σ. For referring to all channels that carry shared mobiles we have Σ+, with Σ+ = Σ!+ ∪ Σ?+.

2.2. Operations on Mobility Sequences

For convenience, the following operations are defined for manipulating mobility sequences. To illustrate these, the name S refers to a set of mobility sequences, S = {R1, R2, . . .}, each of which is a sequence of mobile actions, R = ⟨X1, X2, . . .⟩. Each mobile action is either an input, e.g. X1 = C?^x, or an output, e.g. X2 = D!^v.

2.2.1. Concatenation

For joining mobility sequences:

    ⟨X1, X2, . . .⟩ ⌢ ⟨Y1, Y2, . . .⟩ = ⟨X1, X2, . . ., Y1, Y2, . . .⟩

2.2.2. Channel Restriction

Used to remove named channels from mobility sequences:

    ⟨X1, C!^x, . . .⟩ − {C} = ⟨X1, . . .⟩

Note that this is not quite the same as hiding, the details of which are described later.
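These two operations are simple to express over a concrete encoding. The sketch below (a hypothetical encoding invented here, not taken from the paper's tooling) represents a mobile action as a (channel, direction, variable) tuple and a mobility sequence as a tuple of actions:

```python
# A mobile action is a (channel, direction, variable) tuple, e.g.
# ("in", "?", "a") for in?^a; a mobility sequence is a tuple of actions.

def concat(r, s):
    """Concatenation of two mobility sequences (CSP-style frown)."""
    return r + s

def restrict(r, channels):
    """Channel restriction: drop all actions on the named channels."""
    return tuple(act for act in r if act[0] not in channels)

seq = (("in", "?", "a"), ("c", "!", "a"))
print(concat(seq, (("out", "!", "b"),)))
# (('in', '?', 'a'), ('c', '!', 'a'), ('out', '!', 'b'))
print(restrict(seq, {"c"}))
# (('in', '?', 'a'),)
```

Note that restriction simply deletes actions in place; it does not join sequences the way hiding does.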
3. Analysing occam-π for Mobility

This section describes the specifics of extracting mobile escape information for occam-π processes. Where appropriate, the semantics of these in terms of CSP operators are given. A refinement relation over mobility sets is also considered.

3.1. Primitive Processes

The two primitive CSP processes STOP and SKIP are expressed in occam-π using "STOP" and "SKIP" respectively. Although "STOP" is often not used explicitly, it is implicit in certain occam-π constructs — for example, in an "IF" structure where none of the conditions evaluate to true, or in an "ALT" with no enabled guards. Both SKIP and STOP have empty mobility models. Divergence and chaos, for which there are no exact occam-π equivalents, have undefined though legal mobility behaviours — and are able to do anything that an occam-π process might.
    mobility SKIP = mobility STOP = {}

    mobility div = mobility CHAOS =
        {⟨C!^a⟩ | C ∈ Σ!} ∪ {⟨D?^x⟩ | D ∈ Σ?} ∪
        {⟨C?^v, D!^v⟩ | ∀ t : T · (C, D) ∈ (Σ?_t × Σ!_t)}
The models of divergence and chaos specify that the process may output defined mobiles on any of its output channels, consume mobiles from any of its input channels, and forward mobiles from any of its input channels to any of its output channels (where the types are compatible). However, neither divergence nor chaos will generate (and output) undefined mobiles, though they may forward undefined mobiles if these were ever received.

3.2. Input, Output and Assignment

Input and output are the basic building blocks of mobile escape in occam-π — they provide the means by which mobile items are moved. For example, a process that generates and outputs a mobile (which escapes):

    PROC P (CHAN MOBILE THING out!)
      MOBILE THING x:
      SEQ
        ...  initialise 'x'
        out ! x
    :
    mobility P = {⟨out!^x⟩}
Correspondingly, a process that consumes a mobile:

    PROC Q (CHAN MOBILE THING in?)
      MOBILE THING y:
      SEQ
        in ? y
        ...  use 'y'
    :
    mobility Q = {⟨in?^y⟩}
A similar logic applies to assignment, based on the earlier equivalence with communication. For example:

    PROC R (CHAN MOBILE THING in?, out!)
      MOBILE THING v, w:
      SEQ
        in ? v
        w := v
        out ! w
    :
    mobility R = {⟨in?^v, Lc!^v⟩, ⟨Lc?^w, out!^w⟩} \ {Lc}
The local channel-name Lc comes from the earlier model for assignment (as a communication between two parallel processes). The semantics for parallelism and hiding are described in the following sections. A compiler does not need to model assignment directly in this manner, however — it can track the movement of mobiles between local variables itself, and generate simpler (but equivalent) mobility sequences. For the above process "R":

    mobility R = {⟨in?^u, out!^u⟩}
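One way such compiler-side tracking might work is to treat each mobile as an abstract token that moves between local variables, recording only the channel actions it participates in; internal assignments then vanish from the model. This is an illustrative sketch (the statement encoding and helper names are invented here, not taken from the occam-π compiler):

```python
# Track mobiles as tokens moved between variables; emit one mobility
# sequence per token, skipping internal assignments entirely.

def mobility_of(stmts):
    env = {}                 # variable name -> token
    events = {}              # token -> list of (channel, direction)
    fresh = iter(range(1000))
    for op in stmts:
        if op[0] == "input":            # ("input", chan, var)
            t = next(fresh)
            env[op[2]] = t
            events[t] = [(op[1], "?")]
        elif op[0] == "assign":         # ("assign", dst, src): move semantics
            env[op[1]] = env.pop(op[2])
        elif op[0] == "output":         # ("output", chan, var)
            t = env.pop(op[2])
            events[t].append((op[1], "!"))
    return {tuple((c, d, t) for (c, d) in evs) for t, evs in events.items()}

# SEQ in ? v ; w := v ; out ! w  --  the assignment leaves no trace:
print(mobility_of([("input", "in", "v"),
                   ("assign", "w", "v"),
                   ("output", "out", "w")]))
# {(('in', '?', 0), ('out', '!', 0))}
```

The `env.pop` on assignment and output mirrors occam-π's move semantics: the source variable becomes undefined.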
3.3. Sequential Composition

Sequential composition provides one mechanism by which a mobile received on one channel can escape on another. In the case of the "id" process, whose mobility model is intuitively obvious (but best determined automatically by a compiler or other tool):

    SEQ
      in ? v
      out ! v
    mobility ID = {⟨in?^v, out!^v⟩}
In general, the mobility model for sequential processes, i.e. mobility(P ; Q), is formed by combining input sequences from mobility P with output sequences from mobility Q, matched by the particular mobile variable input or output. When combining processes in this and other ways, the individual variables representing mobile items may need to be renamed to avoid unintentional capture.

3.4. Choice

Programs may make choices either internally (e.g. with "IF" and "CASE") or externally (with an "ALT" or "PRI ALT"). The rules for internal and external choice are straightforward — simply the union of the sets representing the individual choice branches. For example:

    PROC plex.data (CHAN MOBILE THING in0?, in1?, out!)
      WHILE TRUE
        MOBILE THING v:
        ALT
          in0 ? v
            out ! v
          in1 ? v
            out ! v
    :

    mobility PD = {⟨in0?^a, out!^a⟩, ⟨in1?^b, out!^b⟩}
In general:

    mobility (P ⊓ Q) = (mobility P) ∪ (mobility Q)
    mobility (P □ Q) = (mobility P) ∪ (mobility Q)

3.5. Interleaving and Parallelism

Interleaving and parallelism, both specified by "PAR" in occam-π, have straightforward mobility models. For example, a "delta" process for SHARED mobile channel-ends, that performs its outputs in parallel:

    PROC chan.delta (CHAN SHARED CT.FOO! in?, out0!, out1!)
      WHILE TRUE
        SHARED CT.FOO! x:
        SEQ
          in ? x
          PAR
            out0 ! CLONE x
            out1 ! CLONE x
    :

    mobility CD = {⟨in?^a+, out0!^a+⟩, ⟨in?^b+, out1!^b+⟩}

This captures the fact that a mobile input on the "in" channel escapes to both of the output channels, indistinguishable from a non-interleaving process that makes an internal choice about where to send the mobile. In general:
    mobility (P ∥ Q) = (mobility P) ∪ (mobility Q)
Interleaving (e.g. P ||| Q) is a special form of the more general alphabetised parallelism, and is therefore not of huge concern for mobile escape analysis.

3.6. Hiding

Hiding is used to model the declaration and scope of channels in occam-π. In particular, it is also responsible for collapsing mobility structures — by removing channel names from them. Where occam-π programs are concerned, channel declarations typically accompany "PAR" structures. For example:

    PROC network (CHAN MOBILE THING in?, out!)
      CHAN MOBILE THING c:
      PAR
        thing.id (in?, c!)
        thing.id (c?, out!)
    :

    mobility NET = {⟨in?^a, c!^a⟩, ⟨c?^b, out!^b⟩} \ {c}
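Hiding the internal channel c joins the sequence ending with an output on c to the sequence beginning with an input on c. As an illustration, this step can be computed mechanically; the following Python encoding is a sketch under assumed representations (actions as (channel, direction, variable) tuples), one reading of the hiding rule rather than the paper's tooling:

```python
# Hide channel x in a mobility set: join sequences ending with x!^a to
# sequences beginning with x?^b (renaming b to a in the joined tail);
# fully matched sequences are removed, unmatched remainders preserved.

def hide(mob, x):
    enders = {s for s in mob if s and s[-1][0] == x and s[-1][1] == "!"}
    starters = {s for s in mob if s and s[0][0] == x and s[0][1] == "?"}
    result = set(mob) - enders - starters
    for e in enders:
        if not starters and e[:-1]:       # unmatched output: keep its start
            result.add(e[:-1])
        alpha = e[-1][2]
        for s in starters:
            beta = s[0][2]
            tail = tuple((c, d, alpha if v == beta else v)
                         for (c, d, v) in s[1:])
            if e[:-1] + tail:
                result.add(e[:-1] + tail)
    for s in starters:
        if not enders and s[1:]:          # unmatched input: keep its end
            result.add(s[1:])
    return result

net = {(("in", "?", "a"), ("c", "!", "a")),
       (("c", "?", "b"), ("out", "!", "b"))}
print(hide(net, "c"))
# {(('in', '?', 'a'), ('out', '!', 'a'))}
```

Applied repeatedly, the same function reproduces the stepwise reductions used for the larger networks in section 4.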
This reduces to the set:

    mobility NET = {⟨in?^a, out!^a⟩}

The general rule for which is:

    mobility (P \ x) =
        {M ⌢ N[α/β] | (M ⌢ ⟨x!^α⟩, ⟨x?^β⟩ ⌢ N) ∈ (mobility P × mobility P)}
        ∪ ((mobility P) − ({F ⌢ ⟨x!^α⟩ | F ⌢ ⟨x!^α⟩ ∈ mobility P}
                          ∪ {⟨x?^β⟩ ⌢ G | ⟨x?^β⟩ ⌢ G ∈ mobility P}))
        ∪ {H | (H ⌢ ⟨x!^α⟩) ∈ mobility P ∧ (⟨x?^β⟩ ⌢ I) ∉ mobility P ∧ H ≠ ⟨⟩}
        ∪ {J | (⟨x?^β⟩ ⌢ J) ∈ mobility P ∧ (J ⌢ ⟨x!^α⟩) ∉ mobility P ∧ J ≠ ⟨⟩}

The above specifies the joining of sequences that end with outputs on the channel x with sequences that begin with inputs on the channel x. The matching sequences are removed from the resulting set; however, the starts of unmatched output sequences and the ends of unmatched input sequences are preserved.

3.7. Higher Order Communication

So far, only the transport of mobiles over static process networks has been considered. However, in many real applications, mobile channels will be used to set up connections between processes, which are later used to transport other mobiles (including other mobile channel-ends). Assuming that the "CT.FOO" channel-type contains a single channel named "c", itself carrying mobiles, we might write:
    PROC high.order.cli (CHAN CT.FOO! in?)
      CT.FOO! cli:
      MOBILE THING v:
      SEQ
        in ? cli
        ...  initialise 'v'
        cli[c] ! v
    :

    mobility HOC = {⟨in?^a, a!^b⟩}
This captures the fact that the process emits mobiles on the bound name "a", which it received from its "in" channel. The type "CT.FOO!" specifies the client-end of the mobile channel². A similar process for the server-end of the mobile channel could be:

    PROC high.order.svr (CHAN CT.FOO? in?)
      CT.FOO? svr:
      MOBILE THING x:
      SEQ
        in ? svr
        svr[c] ? x
        ...  use 'x'
    :

    mobility HOS = {⟨in?^c̄, c̄?^d⟩}
Connecting these in parallel with a generator process (that generates a pair of connected channel-ends and outputs them), and renaming for parameter passing:

    PROC foo.generator (CHAN CT.FOO! c.out!, CHAN CT.FOO? s.out!)
      CT.FOO? svr:
      CT.FOO! cli:
      SEQ
        cli, svr := MOBILE CT.FOO
        PAR
          c.out ! cli
          s.out ! svr
    :

    mobility FG = {⟨c.out!^x, s.out!^x̄⟩}

    CHAN CT.FOO! c:
    CHAN CT.FOO? s:
    PAR
      foo.generator (c!, s!)
      high.order.cli (c?)
      high.order.svr (s?)

    mobility = {⟨c!^x, s!^x̄⟩, ⟨c?^a, a!^b⟩, ⟨s?^c̄, c̄?^d⟩} \ {c, s}
             = {⟨x!^b⟩, ⟨x̄?^d⟩}
This indicates a system in which a mobile is transferred internally, but never escapes. As such, we can hide the mobile channel event "x" (also "x̄"), giving an empty mobility set — concluding that no mobiles escape this small system, as we would have expected.

3.8. Mobility Refinement

The previous sections have illustrated a range of mobility sets for various processes and their compositions. Within CSP and related algebras is the concept of refinement, which operates on the traces, failures and divergences of processes, and can in general be used to test whether a particular implementation meets a given specification. In general, we write P ⊑ Q to mean that P is refined by Q, or that Q is more deterministic than P.

² The variable "cli" is a mobile channel bundle containing just one channel (named "c"), identified by a record subscript syntax: cli[c].
For mobile escape analysis, it is reasonable to suggest that there may be a related mobility refinement, whose definition is:

    P ⊑M Q  ≡  mobility Q ⊆ mobility P
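Read as subset-up-to-renaming, such a check could be prototyped by brute force. The sketch below is an illustrative encoding (the paper does not prescribe an algorithm); it searches for an injective renaming of the implementation's mobile names into the specification's:

```python
# P refined-by Q (mobility sense): every sequence of Q appears in P,
# up to a consistent injective renaming of mobile variable names.
from itertools import permutations

def names(mob):
    return sorted({v for seq in mob for (_c, _d, v) in seq})

def mobility_refines(p, q):
    """True iff mobility(q) is a subset of mobility(p) up to renaming."""
    nq = names(q)
    if not nq:
        return q <= p
    for target in permutations(names(p), len(nq)):
        ren = dict(zip(nq, target))
        if {tuple((c, d, ren[v]) for (c, d, v) in s) for s in q} <= p:
            return True
    return False

spec = {(("in", "?", "a"), ("out", "!", "a"))}
impl = {(("in", "?", "z"), ("out", "!", "z"))}
print(mobility_refines(spec, impl))                    # True
print(mobility_refines(impl, {(("out", "!", "g"),)}))  # False
```

Brute-force renaming is factorial in the number of mobile names, so this is only a prototype of the relation, not a scalable decision procedure.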
The interpretation of this is that Q "contributes less to mobile escape" than P, where the subset relation takes account of renaming within sets. This is not examined in detail here (an item for future work), but on initial inspection appears sensible — e.g. to test whether a specific implementation meets a general specification.

4. Application

As previously discussed, the aim of this analysis is to determine what mobiles (if any) escape a particular network of occam-π processes and, if so, how they escape with respect to that process network (i.e. on which input and output channels). Two examples of the technique are discussed here, one for static process networks and one for dynamically evolving process networks. The former is more typical of small-scale systems, such as those used in small (and memory limited) devices.

4.1. Static Process Networks

Figure 2 shows a network of parallel processes and the code that implements it. The individual components have the following mobile escape models:

    mobility delta = {⟨in?^a, out0!^a⟩, ⟨in?^b, out1!^b⟩}
    mobility choice = {⟨in?^a, out0!^a⟩, ⟨in?^b, out1!^b⟩}
    mobility gen = {⟨out!^a⟩}
    mobility plex = {⟨in0?^a, out!^a⟩, ⟨in1?^b, out!^b⟩}
    mobility sink = {⟨in0?^a⟩, ⟨in1?^b⟩}
[Figure: processes delta, choice, gen, plex and sink, connected by internal channels p, q, r and s, with external channels A?, B?, X! and Y!]

    PROC net (CHAN MOBILE THING A?, B?, X!, Y!)
      CHAN MOBILE THING p, q, r, s:
      PAR
        delta (A?, X!, p!)
        choice (B?, q!, r!)
        gen (s!)
        plex (p?, q?, Y!)
        sink (r?, s?)
    :

Figure 2. Parallel process network.
When combined, with appropriate renaming for parameter passing (and to avoid unintentional capture), this gives the mobility set:

    mobility Net = {⟨A?^a, X!^a⟩, ⟨A?^b, p!^b⟩, ⟨B?^c, q!^c⟩, ⟨B?^d, r!^d⟩, ⟨s!^e⟩,
                    ⟨p?^f, Y!^f⟩, ⟨q?^g, Y!^g⟩, ⟨r?^h⟩, ⟨s?^h⟩} \ {p, q, r, s}
Applying the rule for hiding to the channels p, q, r and s gives:

    \{p} −→ {⟨A?^a, X!^a⟩, ⟨A?^b, Y!^b⟩, ⟨B?^c, q!^c⟩, ⟨B?^d, r!^d⟩, ⟨s!^e⟩, ⟨q?^g, Y!^g⟩, ⟨r?^h⟩, ⟨s?^h⟩}
    \{q} −→ {⟨A?^a, X!^a⟩, ⟨A?^b, Y!^b⟩, ⟨B?^c, Y!^c⟩, ⟨B?^d, r!^d⟩, ⟨s!^e⟩, ⟨r?^h⟩, ⟨s?^h⟩}
    \{r} −→ {⟨A?^a, X!^a⟩, ⟨A?^b, Y!^b⟩, ⟨B?^c, Y!^c⟩, ⟨B?^d⟩, ⟨s!^e⟩, ⟨s?^h⟩}
    \{s} −→ {⟨A?^a, X!^a⟩, ⟨A?^b, Y!^b⟩, ⟨B?^c, Y!^c⟩, ⟨B?^d⟩}

The resulting mobility analysis indicates that mobiles input on A escape through output on X and Y, and that inputs received on B either escape through Y or are consumed internally. The fact that certain mobility sequences are not present in the result provides more information: that mobiles input on A are never discarded internally, and that the resulting network does not generate escaping mobiles.

4.2. Dynamic Process Networks

In dynamically evolving systems, RMoX in particular [18,19], connections are often established within a system for the sole purpose of establishing future connections. An example of this is an application process that connects to the VGA framebuffer (display) device via a series of other processes, then uses that new connection to exchange mobile data with the underlying device. Figure 3 shows a snapshot of connected graphics processes within a running RMoX system.
[Figure: the processes application, gfx.core, service.core, kernel, driver.core, vga.fb and vga, connected in the RMoX driver hierarchy]
Figure 3. RMoX driver connectivity.
Escape analysis allows for certain optimisations in process networks such as these. If the compiler (and associated tools) can determine that mobile data generated in "vga" or "vga.fb" is not discarded internally, nor escapes through the processes "gfx.core" and "application", then it will be safe to pass the real framebuffer (video) memory around for rendering. Without the guarantees provided by this analysis, there is a danger that parts of the video memory could escape into the general memory pool — with odd and often undesirable consequences³.

³ Mapping process memory (typically a process's workspace) into video memory, or vice-versa, does provide an interesting way of visualising process behaviour in RMoX, however.

Assuming that framebuffer memory originates and is consumed within "vga.fb", we have an occam-π process with the structure:

    PROC vga.fb (CT.DRV? link)
      CT.GUI.FB! fb.cli:
      CT.GUI.FB? fb.svr:
      SEQ
        fb.cli, fb.svr := MOBILE CT.GUI.FB      -- create channel-bundle
        ...  other initialisation and declarations
        PAR
          WHILE TRUE
            link[in] ? CASE
              CT.DRV.R! ret:
              open.device; ret                  -- request to open device
                IF
                  DEFINED fb.cli
                    ret[out] ! device; fb.cli   -- return bundle client-end
                  TRUE
                    ret[out] ! device.busy
              ...  other cases
          PLACED MOBILE []BYTE framebuffer AT ...:
          WHILE TRUE
            fb.svr[in] ? CASE                   -- request from connected client
              get.buffer
                fb.svr[out] ! buffer; framebuffer    -- outgoing framebuffer
              put.buffer; framebuffer           -- incoming framebuffer
                SKIP
    :
That has the mobility model:

    mobility VFB = {⟨link?^r, r!^a⟩, ⟨ā[1]!^b⟩, ⟨ā[0]?^c⟩}

The escape information here indicates that mobiles are generated and consumed at the server-end of the channel bundle ā, whilst the client-end of this bundle, a, escapes through another channel bundle r that the process receives from its link parameter. Instead of going into detail for the other processes involved, which would require a significant amount of space, the generic forwarding and use of connections is considered.

4.2.1. Client Processes

The mechanism by which dynamic connections to device-drivers and suchlike are established involves sending the client-end of a return channel-bundle along with the request. A client process (e.g. "application" from figure 3) therefore typically has the structure:

    PROC client (SHARED CT.DRV! to.drv)
      CT.DRV.R! r.cli:
      CT.DRV.R? r.svr:
      CT.GUI.FB! guilink:
      SEQ
        r.cli, r.svr := MOBILE CT.DRV.R         -- create response channel-bundle
        CLAIM to.drv
          to.drv[in] ! open.device; r.cli       -- send request
        r.svr[out] ? CASE                       -- wait for response
          device.busy
            ...  fail gracefully
          device; guilink
            ...  use 'guilink'
    :
This has the mobility model:

    mobility CLI = {⟨to.drv!^e, ē?^f⟩} ∪ M

where M is the mobility model for the part of the process that uses the "guilink" connection to the underlying service, and will communicate directly on the individual channels within f. Connecting this client and the "vga.fb" processes directly, with renaming for parameter passing, gives the following mobility set:

    {⟨A?^r, r!^a⟩, ⟨ā[1]!^b⟩, ⟨ā[0]?^c⟩, ⟨A!^e, ē?^f⟩} ∪ M
Hiding the internal link A gives:

    {⟨e!^a⟩, ⟨ā[1]!^b⟩, ⟨ā[0]?^c⟩, ⟨ē?^f⟩} ∪ M
If we take a well-behaved client implementation for M — i.e. one that inputs a mobile (framebuffer) from the underlying driver, modifies it in some way and then returns it, without destroying or creating these (M = {⟨f[1]?^x, f[0]!^x⟩}) — we get:

    {⟨e!^a⟩, ⟨ā[1]!^b⟩, ⟨ā[0]?^c⟩, ⟨ē?^f⟩, ⟨f[1]?^x, f[0]!^x⟩}

Subsequently hiding e, which represents the "CT.DRV.R" link, causes f to be renamed to a, giving the set:

    {⟨ā[1]!^b⟩, ⟨ā[0]?^c⟩, ⟨a[1]?^x, a[0]!^x⟩}

Logically speaking, and for this closed system, b and c must represent the same thing — in this case, mobile framebuffers. Thus we have a guarantee that mobiles generated within the "vga.fb" process are returned there, for this small system.

On the other hand, a less well-behaved client implementation for M could be one that occasionally loses one of the framebuffers received, instead of returning it (i.e. M = {⟨f[1]?^x, f[0]!^x⟩, ⟨f[1]?^y⟩}). This ultimately gives the mobility set:

    {⟨ā[1]!^b⟩, ⟨ā[0]?^c⟩, ⟨a[1]?^x, a[0]!^x⟩, ⟨a[1]?^y⟩}

As before, b and c must represent the same mobiles, so the only mobiles received back must have been those sent. However, the presence of the sequence ⟨a[1]?^y⟩ indicates that framebuffers can be received and then discarded by this client.

Another badly behaved client implementation is one that generates mobiles and returns these as framebuffers, in addition to the normal behaviour, e.g. M = {⟨f[1]?^x, f[0]!^x⟩, ⟨f[0]!^z⟩}. This gives the resulting mobility set:

    {⟨ā[1]!^b⟩, ⟨ā[0]?^c⟩, ⟨a[1]?^x, a[0]!^x⟩, ⟨a[0]!^z⟩}

In this case, b and c do not necessarily represent the same mobiles — as while x can only be b, c can be either x (and therefore b) or z. Thus there is the possibility that mobiles are returned to the "vga.fb" driver that did not originate there.

4.2.2. Infrastructure

Within RMoX, such client and server processes are normally connected through a network of processes that route requests around the system. From figure 3, this includes the "driver.core", "service.core" and "kernel" processes.

In earlier versions of RMoX [19], both requests and their responses were routed through the infrastructure. This is no longer the case — requests now include, as part of the request,
a mobile channel-end that is used for the response. This is a cleaner approach in many respects and is more efficient in most cases. From the client's perspective, a little more work is involved when establishing connections, since the return channel-bundle must be allocated.

Most of the infrastructure components within RMoX consist of a single server-end channel-bundle on which requests are received, whose client-end is shared between multiple processes, and multiple client-ends connecting to other server processes such as "vga.fb" and other infrastructure components. A very general implementation of an infrastructure component is:

    PROC route (CT.DRV? in, CT.DRV! out.this, SHARED CT.DRV! out.next)
      WHILE TRUE
        in[in] ? CASE
          CT.DRV.R! ret:
          open.device; ret
            IF
              request.for.this
                out.this[in] ! open.device; ret
              NOT invalid
                CLAIM out.next!
                  out.next[in] ! open.device; ret
              TRUE
                ret[out] ! no.such.device
          ...  other cases
    :
The mobility model of this process is:

    mobility Rt = {⟨in?^a, out.this!^a⟩, ⟨in?^b, out.next!^b⟩, ⟨in?^c⟩}

The last component indicates that this routing process may discard the request (and the response channel-end) internally — after it has reported an error back on the response channel, of course. With the "route" process as it is, there would need to be an additional process at the end of this chain that responds to all connection requests with an error, e.g.:

    PROC end.route (CT.DRV? in)
      WHILE TRUE
        in[in] ? CASE
          CT.DRV.R! ret:
          open.device; ret
            ret[out] ! no.such.device
          ...  other cases
    :

    mobility ERt = {⟨in?^x⟩}
Combining one "route" process and one "end.route" process with the existing "vga.fb" and "client" processes produces the network shown in figure 4. This has the following mobility model (with the "route" variables renamed to avoid capture):

    {⟨C?^r, r!^a⟩, ⟨ā[1]!^b⟩, ⟨ā[0]?^c⟩, ⟨A!^e, ē?^f⟩, ⟨B?^x⟩,
     ⟨A?^g, C!^g⟩, ⟨A?^h, B!^h⟩, ⟨A?^i⟩} ∪ M
[Figure: client connected to route via A, route to end.route via B, and route to vga.fb via C]

Figure 4. RMoX routing infrastructure.
Hiding the internal links A, B and C gives:

    \{A} −→ {⟨C?^r, r!^a⟩, ⟨ā[1]!^b⟩, ⟨ā[0]?^c⟩, ⟨ē?^f⟩, ⟨B?^x⟩, ⟨C!^e⟩, ⟨B!^e⟩} ∪ M
    \{B} −→ {⟨C?^r, r!^a⟩, ⟨ā[1]!^b⟩, ⟨ā[0]?^c⟩, ⟨ē?^f⟩, ⟨C!^e⟩} ∪ M
    \{C} −→ {⟨e!^a⟩, ⟨ā[1]!^b⟩, ⟨ā[0]?^c⟩, ⟨ē?^f⟩} ∪ M

This system has an identical mobile escape model to the earlier directly connected "client" and "vga.fb" system. As such, the system can still be sure that framebuffer mobiles generated by "vga.fb" are returned there.

5. Related Research

The use of escape analysis for determining various properties of dynamic systems stems from the functional programming community. One use here is for determining which parts of an expression escape a particular function, and whether they can therefore be allocated on the stack (i.e. they are local to the function) [20]. More recently, escape analysis has been used in conjunction with object-oriented languages such as Java [21]. Here it can be used to determine the boundaries of object references within the object graph, for the purposes of stack allocation and other garbage collector (GC) optimisations [22]. With the increasing use of multi-core and multi-processor systems, this type of analysis is also used to discover which objects are local to which threads (known as thread escape analysis), allowing a variety of optimisations [23].

While escape analysis for functional languages is generally well understood, it gets extremely complex for object-oriented languages such as C++ and Java. Features inherent to object-oriented languages, inheritance and polymorphism in particular, have a significant impact on formal reasoning. The number of objects typically involved also creates problems for automated analysis (state-space explosion). The escape analysis described here is more straightforward, but is sufficient for determining the particular properties identified earlier.
The compositional nature of occam-π and CSP helps significantly, allowing analysis to be done in a divide-and-conquer manner, or enabling analysis to be performed on a subset of the processes within a system (as shown in section 4.2.2).

6. Conclusions and Future Work

This paper has presented a straightforward technique for mobile escape analysis in occam-π, and its application to various kinds of process network. The analysis provides for the checking of particular design-time properties of a system, and can permit certain optimisations in the
implementation. At the top level of a system, this escape analysis can also provide hints towards an efficient distribution of the system across multiple nodes – by identifying those parts interconnected through mobile communication (and whose efficiency of implementation is greatly increased with shared memory). Although the work here has focused on occam-π, the techniques are applicable to other process-oriented languages and frameworks.

The semantic model for mobility presented here is not quite complete. Some of the formal rules for process composition have yet to be specified, though we have a good informal understanding of their operation. Another aspect yet to be fully considered is that of mobile processes. These can contain other mobiles as part of their state (within local variables), and as such warrant special treatment. The analysis techniques shown provide a very general model for mobile processes – in practice this results in either a larger state-space (where mobiles within mobile processes are tracked individually) or a loss in accuracy (e.g. treating a mobile process as CHAOS). Once a complete semantic model has been established, it can be checked for validity, and the concept of mobility refinement investigated thoroughly.

For the practical application of this work, the existing occam-π compiler needs to be modified to analyse and generate machine-readable representations of mobile escape. Some portion of this work is already in place, discussed briefly in [24], where the compiler has been extended to generate CSP-style behavioural models (in XML) of the occam-π code of individual PROCs. The mobile escape information obtained will be included within these XML models, incorporating attributes such as type. A separate, but not overly complex, tool will be required to manipulate and check particular properties of these – e.g. that an application process does not discard or generate framebuffer mobiles (section 4.2).
How such information can be recorded and put to use for compiler and run-time optimisations is an issue for future work.

Acknowledgements

This work was funded by EPSRC grant EP/D061822/1. The author would like to thank the anonymous reviewers for their input on an earlier version of this work.

References

[1] P.H. Welch and F.R.M. Barnes. Communicating mobile processes: introducing occam-pi. In A.E. Abdallah, C.B. Jones, and J.W. Sanders, editors, 25 Years of CSP, volume 3525 of Lecture Notes in Computer Science, pages 175–210. Springer Verlag, April 2005.
[2] Inmos Limited. occam 2.1 Reference Manual. Technical report, Inmos Limited, May 1995. Available at: http://wotug.org/occam/.
[3] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall, London, 1985. ISBN: 0-13-153271-5.
[4] M.H. Goldsmith, A.W. Roscoe, and B.G.O. Scott. Denotational Semantics for occam2, Part 1. In Transputer Communications, volume 1 (2), pages 65–91. Wiley and Sons Ltd., UK, November 1993.
[5] M.H. Goldsmith, A.W. Roscoe, and B.G.O. Scott. Denotational Semantics for occam2, Part 2. In Transputer Communications, volume 2 (1), pages 25–67. Wiley and Sons Ltd., UK, March 1994.
[6] R. Milner. Communicating and Mobile Systems: the Pi-Calculus. Cambridge University Press, 1999. ISBN: 0-521-65869-1.
[7] P.S. Andrews, A.T. Sampson, J.M. Bjørndalen, S. Stepney, J. Timmis, D.N. Warren, and P.H. Welch. Investigating patterns for the process-oriented modelling and simulation of space in complex systems. In S. Bullock, J. Noble, R. Watson, and M.A. Bedau, editors, Artificial Life XI: Proceedings of the Eleventh International Conference on the Simulation and Synthesis of Living Systems, pages 17–24. MIT Press, Cambridge, MA, 2008.
[8] Frederick R.M. Barnes. Dynamics and Pragmatics for High Performance Concurrency. PhD thesis, University of Kent, June 2003.
[9] P.H. Welch and F.R.M. Barnes. Mobile Barriers for occam-pi: Semantics, Implementation and Application. In J.F. Broenink, H.W. Roebbers, J.P.E. Sunter, P.H. Welch, and D.C. Wood, editors, Communicating Process Architectures 2005, volume 63 of Concurrent Systems Engineering Series, pages 289–316, Amsterdam, The Netherlands, September 2005. IOS Press. ISBN: 1-58603-561-4.
[10] P.H. Welch and F.R.M. Barnes. A CSP model for mobile channels. In Proceedings of Communicating Process Architectures 2008. IOS Press, September 2008.
[11] A.W. Roscoe. The Theory and Practice of Concurrency. Prentice Hall, 1997. ISBN: 0-13-674409-5.
[12] F.R.M. Barnes, P.H. Welch, and A.T. Sampson. Barrier synchronisations for occam-pi. In Hamid R. Arabnia, editor, Proceedings of PDPTA 2005, pages 173–179, Las Vegas, Nevada, USA, June 2005. CSREA press.
[13] F.R.M. Barnes and P.H. Welch. Mobile Data, Dynamic Allocation and Zero Aliasing: an occam Experiment. In Alan Chalmers, Majid Mirmehdi, and Henk Muller, editors, Proceedings of Communicating Process Architectures 2001. IOS Press, September 2001.
[14] P.H. Welch, F.R.M. Barnes, and F.A.C. Polack. Communicating Complex Systems. In Michael G. Hinchey, editor, Proceedings of the 11th IEEE International Conference on Engineering of Complex Computer Systems (ICECCS-2006), pages 107–117, Stanford, California, August 2006. IEEE. ISBN: 0-7695-2530-X.
[15] F.R.M. Barnes and P.H. Welch. Prioritised dynamic communicating and mobile processes. IEE Proceedings – Software, 150(2):121–136, April 2003.
[16] J.C.P. Woodcock and A.L.C. Cavalcanti. The Semantics of Circus. In ZB 2002: Formal Specification and Development in Z and B, volume 2272 of Lecture Notes in Computer Science, pages 184–203. Springer-Verlag, 2002.
[17] S. Schneider and H. Treharne. Communicating B Machines. In ZB 2002: Formal Specification and Development in Z and B, volume 2272 of Lecture Notes in Computer Science, pages 251–258. Springer-Verlag, January 2002.
[18] F.R.M. Barnes, C.L. Jacobsen, and B. Vinter. RMoX: a Raw Metal occam Experiment. In J.F. Broenink and G.H. Hilderink, editors, Communicating Process Architectures 2003, WoTUG-26, Concurrent Systems Engineering, ISSN 1383-7575, pages 269–288, Amsterdam, The Netherlands, September 2003. IOS Press. ISBN: 1-58603-381-6.
[19] C.G. Ritson and F.R.M. Barnes. A Process Oriented Approach to USB Driver Development. In A.A. McEwan, S. Schneider, W. Ifill, and P.H. Welch, editors, Communicating Process Architectures 2007, volume 65 of Concurrent Systems Engineering Series, pages 323–338, Amsterdam, The Netherlands, July 2007. IOS Press. ISBN: 978-1-58603-767-3.
[20] Y.G. Park and B. Goldberg. Higher order escape analysis: Optimizing stack allocation in functional program implementations. In Proceedings of ESOP '90, volume 432 of LNCS, pages 152–160. Springer-Verlag, 1990.
[21] B. Joy, J. Gosling, and G. Steele. The Java Language Specification. Addison-Wesley, 1996. ISBN: 0-201-63451-1.
[22] B. Blanchet. Escape analysis for Java(TM): Theory and practice. ACM Transactions on Programming Languages and Systems, 25(6):713–775, 2003.
[23] K. Lee, X. Fang, and S.P. Midkiff. Practical escape analyses: how good are they? In Proceedings of VEE '07, pages 180–190. ACM, 2007.
[24] F.R.M. Barnes and C.G. Ritson. Checking process-oriented operating system behaviour using CSP and refinement. In PLOS 2009. ACM, 2009. To appear.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-135
New ALT for Application Timers and Synchronisation Point Scheduling
(Two excerpts from a small channel based scheduler)

Øyvind TEIG and Per Johan VANNEBO
Autronica Fire and Security¹, Trondheim, Norway
{oyvind.teig, per.vannebo}@autronicafire.no

Abstract. During the design of a small channel-based concurrency runtime system (ChanSched, written in ANSI C), we saw that application timers (which we call egg and repeat timers) could be part of its supported ALT construct, even if their states live through several ALTs. There are no side effects on the ALT semantics, which enables waiting on channels, channel timeout and, now, the new application timers. Application timers are no longer busy-polled for timeout by the process. We show how the classical occam language may benefit from a spin-off of this same idea. Secondly, we wanted application programmers to be freed from their earlier practice of explicitly coding communication states at channel synchronisation points, which was required by a layered in-house scheduler. This led us to develop an alternative to the non-ANSI C "computed goto" (found in gcc). Instead, we use a switch/case over goto line-number tags in a synch-point table for scheduling. We call this table, one per process, a proctor table. The programmer does not need to manage this table: it is generated by a script and hidden within an #include file.

Keywords. application timers, alternative, synch-point scheduling
Introduction

This paper describes two ideas that have been implemented as part of a small runtime system (ChanSched), to be used in a forthcoming new product. ChanSched relates to processes; synchronous, zero-buffered, rendezvous-type, one-way data channels; and asynchronous signal/timeout data-free channels. It started with the first author showing the second author a private "designer's note" [1], where the problem of mixing communication states and application states is discussed. Channel state machines are visible in the code, on a par with application states. This state mix had been incurred by an earlier runtime system, where synchronous channels were running on top of an asynchronous runtime system. With ChanSched, we now wanted to make it simpler for new programmers to understand and use these channels as a fundamental design and implementation paradigm. Even if the earlier system, described in [2] and [3], has worked smoothly for years, the reasonable critique by readers was that the code was somewhat difficult to understand.
¹ A UTC Fire & Security Company. NO-7483 Trondheim, Norway. See http://www.autronicafire.no
Ø. Teig and P.J. Vannebo / Application Timers and Synchronisation Point Scheduling
So, the second author was inspired to build ChanSched from scratch, based on this earlier experience. Added goals were to use it in systems with low power consumption and a high formal product "approval level". During this development, which ping-ponged between the two of us, we solved two problems present in the previous system: how do we manage application timers without busy-polling, and how do we schedule to synchronisation points using only ANSI C?

1. Application Timers in a New ALT

1.1 The Problem and our Problem

The problem was that with the earlier system we had "busy-polled" application timers to trigger actions. The actions could be status checks once per hour, or lights blinking in certain patterns. On porting this to ChanSched, we wanted to avoid polling. We had written this channel-based runtime system in C, where its ALT could either have channel components, or a single² timer that could time out on silent channels, and nothing more. We called the latter an ALTTIMER, defined as a non-application timer. We also wanted several timers and channel handling to go on concurrently in the same ALT. Earlier experience with occam [4] was of some help when designing this, but when going through the referees' comments on this paper, we understood that the knowledge had withered. We were asked to show examples in occam-like syntax. Obeying this, we surprisingly (and, we thought, problematically for the paper) saw that occam in fact does not need to build application timers by polling – its designers had included them in the ALT! Something had been lost in our translation to C. However, the surprise persisted when we saw that this exercise showed that even classical occam could benefit from our thoughts. Looking over our shoulders, maybe there was a reason why the occam timers had timed out for us.

1.2 Our ANSI C based ChanSched System and the New Flora of Timers

We start by showing a ChanSched process.
Then, we proceed with a classic occam example, and finally to a suggestion for a new timer mechanism for occam. After that, we come back and discuss our implementation more thoroughly. The code (Listing 1) is an "extended" version of the Prefix process in Commstime³ ([5] and Figure 1). Our cooperative, non-preemptive scheduler runs processes by calling each process by name, via its address, resolved once. So, on every scheduling and rescheduling, line 3 is entered. There, the process's context pointer is restored from the parameter g_CP (pointing to a place in the heap), filled in by the scheduler. In the second half of this paper we shall see how PROCTOR_PREFIX causes cooperative synchronisation (or blocking) points in the code to be reached, by jumping over lines, even into the while loop. The consequence of this is that lines 5-7 are initialisation code. Knowing this, one can try to understand the code as one would occam: CHAN_OUT and gALT_END are synchronisation points, meaning that line 9 will run only when a communication has taken place, and line 18 when a communication or a timeout has happened. We will not go through every detail here, but concentrate on the timers.
² Our implementation handled only a single timer in its ALT.
³ When we developed ChanSched, we used Commstime as our natural test case. Since it has no ALT timer handling in any of its processes, we decided to insert our timers in P_Prefix, for no concrete reason.
01 void P_Prefix (void) // extended "Prefix"
02 {
03   Prefix_CP_a CP = (Prefix_CP_a)g_CP; // get process Context from Scheduler
04   PROCTOR_PREFIX()                    // jump table (see Section 2)
05   ... some initialisation
06   SET_EGGTIMER (CHAN_EGGTIMER, CP->LED_Timeout_Tick);
07   SET_REPTIMER (CHAN_REPTIMER, ADC_TIME_TICKS);
08   CHAN_OUT (CHAN_DATA_0, &CP->Data_0, sizeof(CP->Data_0)); // first output
09   while (TRUE)
10   {
11     ALT(); // this is the needed "PRI_ALT"
12     ALT_EGGREPTIMER_IN (CHAN_EGGTIMER);
13     ALT_EGGREPTIMER_IN (CHAN_REPTIMER);
14     gALT_SIGNAL_CHAN_IN (CHAN_SIGNAL_AD_READY);
15     ALT_CHAN_IN (CHAN_DATA_2, &CP->Data_2, sizeof (CP->Data_2));
16     ALT_ALTTIMER_IN (CHAN_ALTTIMER, TIME_TICKS_100_MSECS);
17     gALT_END();
18     switch (g_ThisChannelId)
19     {
20       ... process the guard that has been taken, e.g. CHAN_DATA_2
21       CHAN_OUT (CHAN_DATA_0, &CP->Data_0, sizeof (CP->Data_0));
22     };
23   }
24 }
Listing 1. EGGTIMER, REPTIMER and PROCTOR_PREFIX (ANSI C and macros). (See Figure 1 for process data-flow diagram)
Note that ALT guard preconditions are hidden – they are controlled with SET and CLEAR macros (none shown). Only the input macros beginning with 'g' check preconditions; the others do not waste time testing a constant TRUE value. As one may understand from the above, timers are seen as channels – one channel per timer. Listing 1 is discussed in more detail throughout this paper.

1.2.1 Flora of Timers: ALTTIMER

Line 16 is our ALTTIMER. As mentioned, when neither channel CHAN_SIGNAL_AD_READY⁴ nor CHAN_DATA_2 has communicated for the last 100 ms, the CHAN_ALTTIMER guard causes the ALT to be taken. (The ALT structure is said to be "taken" by the first ready guard.) When a channel guard is taken, the underlying timer associated with the ALTTIMER is stopped. It is restarted every time the ALT is entered.

1.2.2 ALTTIMER and Very Long Application Timers

The ALTTIMER was the only form of timer we had in our first port [2, 3]. Now we wanted to get further. On any scheduling from that timeout, or from any channel input, function calls had been made to handle application timers. An application timer is an object in local process context, used as a parameter to a timer library. We needed to start it, stop it, or poll it for timeout, every so often. Since channels were often silent, the scheduling caused by ALTTIMERs bounded the resolution of our application timers. If the ALTTIMER timeout were 100 ms, a 10-second application timeout would be accurate only to 100 ms. By polling with a library call, we could handle timeouts longer than the global system timer word. The 100 ms polls could easily build timeouts of days – out of reach of a shorter system timer, which increments a 10 ms tick into a 16-bit integer. The system timer is the basic mechanism behind the ALT timer; the same is true for occam. This extended flexibility may be the main rationale for application timer polling in smaller real-time systems.

⁴ In order to also test non-timer interrupts, we included an analogue input ready channel. The potentiometer value thus read was used to pulse control a blinking LED. This way our example became rather complete.
1.2.3 Flora of Timers: EGGTIMER and REPTIMER

In order to avoid polling application timers, we decided to implement two new timer types and make them usable directly in the ALT. An EGGTIMER times out once (at a predefined time) and a REPTIMER times out repeatedly (at a predefined sequence of equally spaced times). Generically, we will call them EGGREPTIMERs here. These timers are initialised in application code before the ALT. After initialisation, they will be started the first time they are seen in an ALT – the next time the ALT is reached, they will continue to run (discussed later). They are not stopped when their ALT is taken by some other guard, only when they have timed out. So long as their ALT remains on the process execution path (e.g. in a loop), sooner or later they will time out and be taken. Even then, the REPTIMER will already have continued, giving handling with no skew and low jitter. However, EGGREPTIMERs may be stopped by application code before they have timed out. In this respect, the semantics of ALTTIMER and EGGREPTIMER differ – an ALTTIMER has no meaning outside an ALT.

1.2.4 Arithmetic of Time

Observe that no use of our timers makes reference to system time, or to any derived value used to store some previous or future time. Therefore, no time arithmetic with values derived from system time may be done. We have in fact not yet seen any need for time arithmetic at process level⁵. If this is for some reason needed, we could easily add a function to read system time.

1.3 Timers in Classical occam

As we were forced to rediscover: occam is able to handle any timer, including our application timers. Listing 2 gives an example of an ALTTIMER and a REPTIMER⁶.

25 PROC P_Listing2 (VAL INT n, CHAN INT InChan?, OutChan!) -- extended "Prefix"
26   INT Timeout_ALTTIMER, Timeout_REPTIMER:
27   TIMER Clock_ALTTIMER, Clock_REPTIMER:
28   SEQ
29     OutChan ! n
30     Clock_REPTIMER ? Timeout_REPTIMER
31     Timeout_REPTIMER := Timeout_REPTIMER PLUS half.an.hour
32     WHILE TRUE
33       Clock_ALTTIMER ? Timeout_ALTTIMER
34       PRI ALT
35         Clock_REPTIMER ? AFTER Timeout_REPTIMER
36           SEQ
37             ... process every 30 minutes
38             Timeout_REPTIMER := Timeout_REPTIMER PLUS half.an.hour -- no skew, only jitter
39         INT Data:
40         InChan ? Data
41           ... process Data
42         Clock_ALTTIMER ? AFTER Timeout_ALTTIMER PLUS hundred.ms
43           ... MyChan pause, do background task (starvation possible)
44             -- skew and jitter
45 :
Listing 2. General timers in occam.

⁵ Note that the Consume process in a proper Commstime implementation in fact does use time arithmetic (for performance measurement). We measured consumed time with the debugger.
⁶ To free ourselves from the ChanSched ANSI C extended Commstime, the next two examples stand for themselves, reflecting only the additional timer aspects of P_Prefix.
Scheduling is enabled AFTER the specified timeouts. Therefore both timer types will cause jitter, but the REPTIMER will be without skew, provided a timeout is handled before the next one is due. This is the same as for the code in Listing 1. In occam, it is the process code that does the necessary time arithmetic. Inputting system time from any timer into an INT (lines 30 and 33) happens immediately – no synchronisation is involved that may cause blocking. These absolute time values are used to calculate the next absolute timeout value (lines 31 and 42), and are used by the ALT guards (lines 35 and 42). The language requires the programmer to manage these values. We remember working with occam code. Should the TIMER or the INT have clock or time in its name (or neither)? And when reading other people's code: is now a TIMER or an INT? The INTs holding those absolute time values obfuscated the thinking process.

1.4 New Timers for occam

46 PROC P_Listing3 (VAL INT n, CHAN INT InChan?, OutChan!) -- extended "Prefix"
47   TIMER My_ALTTIMER, My_REPTIMER: -- only timers, no variables
48   SEQ
49     OutChan ! n
50     SET_TIMER (REPTIMER, My_REPTIMER, 30, MINUTE, 24H)
51     SET_TIMER (ALTTIMER, My_ALTTIMER, 0, MILLISEC, 32BIT)
52     WHILE TRUE
53       PRI ALT
54         My_REPTIMER ? AFTER ()
55           ... process every 30 minutes (no timeout value to compute)
56             -- no skew, only jitter
57         INT Data:
58         InChan ? Data
59           ... process Data
60         My_ALTTIMER ? AFTER (100)
61           ... MyChan pause, do background task (starvation possible)
62             -- skew and jitter
63 :
Listing 3. Concrete configurable timers in a new occam.
Listing 3 shows a suggestion of how to rectify this. Here, we don't need to do time arithmetic. No declaration like line 26 is needed – we never see those INT values. We just have TIMERs, which now behave more like abstract data types: SET_TIMER is parameterised with a unit (granularity) and a length (maximum time). Line 50 sets a granularity of 1 minute and orders a tick every 30 minutes – it thus needs 60 x 24 = 1440 counts (i.e. INT16 is enough) to count up to 24 hours. Line 51 enables timeouts of up to 2³² milliseconds – some 49 days (INT32). Lines 50 and 51 also define the type of timer: ALTTIMER or REPTIMER. So, we now have a set of configurable timers, mixable within the same ALT. A restriction is that there may be only one ALTTIMER, since it defines the handling of silent channels. The semantics of EGGREPTIMERs are that they are treated like channels: associated content is not touched when the ALT is taken. In this case it means that a stopped or running timer will stay stopped or running. Observe that we have not banned the possibility of reading system time and doing arithmetic with time. We are not concerned about precise language syntax here; enough to say that we are uncertain about the parameters to AFTER in lines 54 and 60. Neither have usage rules and/or runtime checks for these timers been considered. Another point is that we suggest initialising an EGGREPTIMER with the SET command, and starting it in the ALT – as we do in ChanSched (see below). It may start to run straight away. However, in any case it will time out in the ALT. It is outside the scope of this paper to discuss the precise semantics or semantic differences of these two possibilities, or the enforcement of usage rules by a compiler. Our choice compares to calculating a new timeout value (initialise) and then using (start) that timeout in a classical occam AFTER.

1.5 More on our ANSI C-based ChanSched System

1.5.1 Some Implementation Issues

Our implementation relies on a separate timer server process P_Timers_Handler, handling any number of timeouts (Figure 1). It delivers timeouts through asynchronous send-type channels carrying no data, called TIMER channels, one per EGGREPTIMER.
Figure 1. Our test system, called “extended Commstime”. (See Listing 1 for P_Prefix code)
An individual asynchronous, non-blocking timer channel thus represents each timer, so any number of these may be signalled simultaneously. When a new EGGREPTIMER is added, the timer process is recompiled with a larger set of timers, controlled by a constant drawn from a table. So, once written and tested, the timer process requires no further editing. Initially, we did parameter handling of EGGREPTIMERs in the ALT, modelled on how the ALTTIMER is handled, where AFTER has a parameter. However, we soon realised that this was impractical, since start, restart and preemptive stop are often most easily coded (and, consequently, understood) away from the ALT. The macros/functions used to handle start, restart and stop do, however, fill in the timer data structure, including the value for the next timeout. So, there should be no synchronisation points between setting a timer and the ALT, as this could cause a wanted timeout to have passed before the ALT was reached. Section 1.4 discusses this in more detail for the new occam.
We use a function call, and no real channel communication, to set the EGGREPTIMER parameters used by the concurrent P_Timers_Handler. This does not violate exclusive usage of shared values, due to careful coding hidden from the user and the "run-to-completion" semantics of our runtime system. Therefore, no buffer process is needed. P_Timers_Handler takes its "tick updated" signal from a signal channel, sent directly from the system timer interrupt. Processor sleep continues until the next timeout, if there is no other pending work.

1.6 Discussion and Conclusion

The EGGTIMER and REPTIMER do not seem to interfere with the ALTTIMER, even if their states outlive the ALT. Outliving the ALT is not as unique as one may think: any channel would in a way outlive the ALT, since it is possible for a process to be first on a channel when the input ALT is not active. This is, in fact, a fundamental paradigm. We have not done any research to find existing uses of this concept in different programming environments; we certainly suspect it could exist. We feel that raising timers from "link level" ALTTIMERs to "application level" EGGREPTIMERs is a substantial move towards a more flexible programming paradigm for timers in our ANSI C based system. Now, none of the processes that need application timers needs to do any busy polling. This improves understanding, coding and battery life. Configurability of the timers has been shown. A new occam may benefit from these ideas as well. We have done no formal verification of EGGTIMERs or REPTIMERs.

2. Cooperative Scheduling via the Proctor Table

2.1 The Scheduler is Not as Transparent to User Code as we Thought

This section only considers Listing 1. As described in the introduction, the non-preemptive scheduler controlling an asynchronous message system, with processes that have run-to-completion semantics, had never been designed to reschedule to synchronisation points, since there are no synchronisation points in that paradigm.
So we built a layer on top of it ([2] and [3]), drawing heavily on the SPoC [6] occam-to-C translator. We had learnt that the asynchronous scheduler worked like the synchronous scheduler of SPoC, which indeed had a rich set of synchronisation points. However, SPoC had occam source code on top. The SPoC scheduler had unique states for channel communication (i.e. synchronisation points), and the compiler flattened application states and communication states. This is the model we had used, where we did the flattening of the two state spaces (in ANSI C) by hand. Discovering the obvious – making channel visibility like that of many channel-based libraries – was a long road [1]. The goal described in that note was to "send and send and then receive, including ALT" sequentially in code, with no visible communication states in between. So we decided to make a new cooperative scheduler from scratch, also motivated by the Safety Integrity Level (SIL) requirements defined by IEC 61508, where arguing along the CSP line of thinking is appreciated [7]. Our main criterion for the new scheduler was that, in some way or another, it should be able to reschedule a process to the code immediately following a synchronisation point. This applies when the process has been first on a channel (or set of input channels) and should not proceed until the second contender arrives at the other end. This is the same functionality as described in [2] and [3] – but with invisible synchronisation points.
2.2 The Proctor Scheduling Table

Our solution was the "proctor" jump table: a name invented by us, illustrating that it takes care of scheduling and acts on behalf of the scheduler. It is generated by standard ANSI C preprocessor constructs, by hand coding or by a script. Errors in the table would cause the compiler either to issue an error about a missing label, or to warn about an unused label. We raised that warning to an error, to make the scheme bulletproof. Listing 4 shows how the CHAN_OUT macro first stores the actual line number, then makes a label like SYNCH_8_L, which is the rescheduling point (in Listing 1). Observe that a C macro, no matter how many lines it appears to span, is laid out as a single line by the preprocessor. Now, the system has a legal label to which it can reschedule; so a goto (if automatically generated) is a viable mechanism to use.

64 #define SCHEDULE_AT goto
65
66 #define CAT(a,b,c,d,e) a##b##c##d##e         // Concatenate to f.ex. "SYNCH_8_L"
67 #define SYNCH_LABEL(a,b,c,d,e) CAT(a,b,c,d,e) // Label for Proctor-table
68
69 #define PROC_DESCHEDULE_AND_LABEL() \
70   CP->LineNo = __LINE__; \
71   return; \
72   SYNCH_LABEL(SYNCH,_,__LINE__,_,L):
73
74 #define CHAN_OUT(chan,dataptr,len) \
75   if (ChanSched_ChanOut(chan,dataptr,len) == FALSE) \
76   { \
77     PROC_DESCHEDULE_AND_LABEL(); \
78   } \
79   g_ThisAltTaken = FALSE
80
Listing 4. Some macros used to build, and use, line-number labels.
The proctor table takes us there. A goto line-number (SCHEDULE_AT), taking as parameter CP->LineNo (which is not on the stack but in process context), has survived the return:

81 #define PROCTOR_PREFIX()\
82   switch (CP->LineNo)\
83   {\
84     case 0: break;\
85     case 8: SCHEDULE_AT SYNCH_8_L;\
86     case 17: SCHEDULE_AT SYNCH_17_L;\
87     case 21: SCHEDULE_AT SYNCH_21_L;\
88     DEFAULT_EXIT\
89   }
Listing 5. The proctor-table.
This is standard ANSI C. We avoid the extension called "computed goto" (address), which is available in gcc, a compiler we do not use for these applications [8]. We could call our solution "scripted goto" (label), just to differentiate. Listing 6 shows the output of our script, which generates the proctor table file for us:

90 In P_Commstime.c there were 4 processes, and 10 synchronisation points
91 In P_Timers_Handler.c there was 1 process, and 1 synchronisation point
92 There were a total of 2 files, 5 processes and 11 syncronisation points
Listing 6. Log from the ProctorPreprocessor script.
When the scheduler always schedules the process to the function start in Listing 1, the proctor table macro causes the process to reschedule to the correct line. The code is truly invisible but available, since the macro body is contained in a separate #included file. Initially, a dummy CP->LineNo, set up by the run-time system, is set to zero. This takes the process through its initialising code: from the proctor table to the first synchronisation point, line 8 of Listing 1.

2.3 Discussion and Conclusion

The complexity of a preemptive scheduler – and the fact that we do not need one – makes this solution quite usable. It is safe and invisible to the user, who does not need to relate to link-level states (also called communication states or synchronisation points). So, the user needs to relate only to application states. The code is portable, standard ANSI C. Local process variables that reside on the stack will not survive a synchronisation point, so the programmer has to place these in the process context. The overhead of the proctor jump table also includes storing the next line number at run-time, but this is small and acceptable for us. These points are less frequent than function calls, but are comparable in cycle count.

3. Conclusions

Section 1 shows that differentiating configurable types of timers in the ALT may raise timers to a higher and more portable level. Section 2 displays the use of a standard ANSI C feature wrapped into a jump (proctor) table, in service of a cooperative scheduler. Making ANSI C process scheduling work with invisible channel communication and synchronisation states is a step forward for us. With EGGTIMERs and REPTIMERs, process application code is now easier to write, read and understand. We have also noted that there may be a need to add timer handling to an "extended Commstime", so that implementors could have a common platform for this as well.
Acknowledgement

We thank the management at Autronica Fire and Security's development department in Trondheim for allowing us to publish this work.

[Øyvind Teig is Senior Development Engineer at Autronica Fire and Security. He has worked with embedded systems for more than 30 years, and is especially interested in real-time language issues (see http://www.teigfam.net/oyvind/pub for publications and contact information). Per Johan Vannebo is Technical Expert at Autronica Fire and Security. He has worked with embedded systems for 13 years.]
References
[1] Ø. Teig, A scheduler is not as transparent as I thought (Why CSP-type blocking channel state machines were visible, and how to make them disappear), in 'Designer's Notes' #18, at author's home page, http://www.teigfam.net/oyvind/pub/notes/18_A_scheduler_is_not_so_transparent.html
[2] Ø. Teig, From message queue to ready queue (Case study of a small, dependable synchronous blocking channels API – Ship & forget rather than send & forget), in 'ERCIM Workshop on Dependable Software Intensive Embedded Systems', in cooperation with 31st EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA), Porto, Portugal, 2005. IEEE Computer Press, ISBN 2-912335-15-9. Also at http://www.teigfam.net/oyvind/pub/pub_details.html#Ercim05
[3] Ø. Teig, No Blocking on Yesterday's Embedded CSP Implementation (The Rubber Band of Getting it Right and Simple), in 'Communicating Process Architectures 2006', P.H. Welch, J. Kerridge and F.R.M. Barnes (Eds.), pp. 331-338, IOS Press, 2006. Also at http://www.teigfam.net/oyvind/pub/pub_details.html#NoBlocking
[4] Inmos Limited, The occam programming language, Prentice Hall, 1984. Also see http://en.wikipedia.org/wiki/occam_(programming_language)
[5] P.H. Welch and F.R.M. Barnes, Prioritised Dynamic Communicating Processes – Part I, in 'Communicating Process Architectures 2002', J. Pascoe, R. Loader and V. Sunderam (Eds.), pp. 321-352, IOS Press, 2002.
[6] M. Debbage, M. Hill, S. Wykes and D. Nicole, Southampton's Portable occam Compiler (SPoC), in R. Miles and A. Chalmers (Eds.), 'Progress in Transputer and occam Research', WoTUG-17, pp. 40-55, IOS Press, Amsterdam, 1994.
[7] IEC 61508, SIL: Safety Integrity Level, a safety-related metric used to quantify a system's safety level. See http://en.wikipedia.org/wiki/Safety_Integrity_Level
[8] Wikipedia, Computed goto. See http://en.wikipedia.org/wiki/Goto_(command)#Computed_GOTO
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-145
Translating ETC to LLVM Assembly Carl G. RITSON School of Computing, University of Kent, Canterbury, Kent, CT2 7NF, England
[email protected] Abstract. The LLVM compiler infrastructure project provides a machine independent virtual instruction set, along with tools for its optimisation and compilation to a wide range of machine architectures. Compiler writers can use the LLVM’s tools and instruction set to simplify the task of supporting multiple hardware/software platforms. In this paper we present an exploration of translation from stack-based Extended Transputer Code (ETC) to SSA-based LLVM assembly language. This work is intended to be a stepping stone towards direct compilation of occam-π and similar languages to LLVM’s instruction set. Keywords. concurrency, ETC, LLVM, occam-pi, occ21, tranx86
Introduction and Motivation

The original occam language toolchain supported a single processor architecture, that of the INMOS Transputer [1,2]. Following INMOS's decision to end development of the occam language, the sources for the compiler were released to the occam For All (oFA) project [3]. The oFA project modified the INMOS compiler (occ21), adding support for processor architectures other than the Transputer, and developed the basis for today's Kent Retargetable occam Compiler (KRoC) [4]. Figure 1 shows the various compilation steps for an occam or occam-π program. The occ21 compiler generates Extended Transputer Code (ETC) [5], which targets a virtual Transputer processor. Another tool, tranx86 [6], generates a machine object from the ETC for a target architecture. This is in turn linked with the runtime kernel CCSP [7] and other system libraries. Tools such as tranx86, octran and tranpc [8] have in the past provided support for IA-32, MIPS, PowerPC and Sparc architectures; however, with the progressive development of new features in the occam-π language, only IA-32 support is well maintained at the time of writing. This is a consequence of the development time required to maintain support for a large number of hardware/software architectures. In recent years the Transterpreter Virtual Machine (TVM), which executes linked ETC bytecode directly, has provided an environment for executing occam-π programs on architectures other than IA-32 [9,10]. This has been possible due to the small size of the TVM codebase and its implementation in architecture-independent ANSI C. Portability and maintainability are gained at the sacrifice of execution speed: a program executed in the TVM runs around 100 times slower than its equivalent tranx86-generated object code. In this paper we present a new translation for ETC bytecode, from the virtual Transputer instruction set to the LLVM virtual instruction set [11,12].
The LLVM compiler infrastructure project provides a machine independent virtual instruction set, along with tools for its optimisation and compilation to a wide range of machine architectures. By targeting a virtual instruction set that has a well developed set of platform backends, we aim to increase
Figure 1. Flow through the KRoC and Transterpreter toolchains, from source to program execution. This paper covers developments in the grey box.
the number of platforms the existing occam-π compiler framework can target. LLVM also provides a pass-based framework for optimisation at the assembly level, with a large number of pre-written optimisation passes (e.g. dead-code removal, constant folding, etc.). Translating to the LLVM instruction set gives us access to these ready-made optimisations, as opposed to writing our own, as has been done in the past [6]. The virtual instruction sets of the Java Virtual Machine (JVM) and .NET's Common Language Runtime (CLR) have also been used as portable compilation targets [13,14]. Unlike LLVM, these instruction sets rely on a virtual machine implementation and do not provide a clear path for linking with our efficient multicore language runtime [7]. This was a key motivating factor in choosing LLVM over the JVM or CLR. An additional concern regarding the JVM and CLR is that large parts of their code bases are concerned with language features not relevant to occam-π, e.g. class loading or garbage collection. Given our desire to support small embedded devices (section 3.7), it seems appropriate not to encumber ourselves with a large virtual machine. LLVM's increasing support for embedded architectures, XMOS's XCore processor in particular [15], provided a further motivation to choose it over the JVM or CLR. In section 1 we briefly outline the LLVM instruction set and toolchain. We describe the steps of our translation from ETC bytecode to LLVM assembly in section 2. Section 3 contains initial benchmark results comparing our translator's output via LLVM to that from tranx86. Finally, in section 4 we conclude and comment on directions for future work.

1. LLVM

In this section we briefly describe the LLVM project's infrastructure and its origins. Additionally, we give an introduction to LLVM assembly language as an aid to understanding the translation examples in section 2.
Lattner proposed the LLVM infrastructure as a means of allowing optimisation of a program not just at compile time, but throughout its lifetime [11]. This includes optimisation at compile time, link time and runtime, as well as offline optimisation, where a program may be tailored for a specific system, or profiling data collected from previous executions may be applied.
define i32 @cube (i32 %x) {
  %sq = mul i32 %x, %x    ; multiply x by x
  %cu = mul i32 %sq, %x   ; multiply sq by x
  ret i32 %cu             ; return cu
}

Figure 2. Example LLVM function which raises a value to the power of three.
The LLVM infrastructure consists of a virtual instruction set, a bytecode format for the instruction set, front-ends which generate bytecode from sources (including assembly), a virtual machine and native code generators for the bytecode. Having compiled a program to LLVM bytecode, it is then optimised before being compiled to native object code or JIT compiled in the virtual machine interpreter. Optimisation passes take bytecode (or its in-memory representation) as input, and produce bytecode as output. Each pass may modify the code or simply insert annotations to influence other passes, e.g. usage information. In this paper we discuss the generation of LLVM assembly language from ETC for use with LLVM's native code generators. The LLVM assembly language is strongly typed, and uses static single-assignment (SSA) form. It has neither machine registers nor an operand stack; rather, identifiers are defined when assigned to, and this assignment may occur only once. Identifiers have global or local scope; the scope of an identifier is indicated by its initial character. The example in Figure 2 shows a global function @cube which takes a 32-bit integer (given the local identifier %x), and returns it raised to the power of three. This example also highlights LLVM's type system, which requires all identifiers and expressions to have explicitly specified types. LLVM supports the separate declaration and definition of functions: header files declare functions, which have a definition at link time. The use of explicit functions, as opposed to labels and jump instructions, frees the programmer from defining a calling convention. This in turn allows LLVM code to function transparently with the calling conventions of multiple hardware and software ABIs. In addition to functions, LLVM provides a restricted form of traditional labels. It is not possible to derive the address of an LLVM label or assign a label to an identifier.
Furthermore, the last statement of a labelled block must be a branching instruction, either a branch to another label or a return statement. These restrictions give LLVM optimisations a well-defined view of program control flow, but do present some interesting challenges (see section 2.2). In our examples we have, where appropriate, commented the LLVM syntax; however, for a full definition of the LLVM assembly language we refer the reader to the project's website and reference manual [16].

2. ETC to LLVM Translation

This section describes the key steps in the translation of stack-based Extended Transputer Code (ETC) to the SSA-form LLVM assembly language.

2.1. Stack to SSA

ETC bases its execution model on that of the Transputer, a processor with a three-register stack. A small set of instructions have coded operands, but the majority consume (pop) operands from the stack and produce (push) results to it. A separate data stack called the workspace provides the source or target for most load and store operations. Blind translation from a stack machine to a register machine can be achieved by designating a register for each stack position and shuffling data between registers as operands are
LDC 0    ; load constant 0
LDL 0    ; load workspace location 0
LDC 64   ; load constant 64
CSUB0    ; assert stack 1 < stack 0, and pop stack
LDLP 3   ; load a pointer to workspace location 3
BSUB     ; subscript stack 0 by stack 1
SB       ; store byte in stack 1 to pointer stack 0

Figure 3. Example ETC code which stores a 0 byte to an array. The base of the array is workspace location 3, and the offset to be written is stored in workspace location 0.

LDC 0    ; STACK = () => (reg_1)
LDL 0    ; STACK = () => (reg_2)
LDC 64   ; STACK = () => (reg_3)
CSUB0    ; STACK = (reg_3, reg_2) => (reg_2)
LDLP 3   ; STACK = () => (reg_4)
BSUB     ; STACK = (reg_4, reg_2) => (reg_5)
SB       ; STACK = (reg_5, reg_1) => ()

Figure 4. Tracing the stack utilisation of the ETC in Figure 3, generating a register for each unique operand.

Figure 5. Data flow graph generated from the trace in Figure 4. [Diagram omitted: each instruction node (LDC 0, LDL 0, LDC 64, CSUB0, LDLP 3, BSUB, SB) is connected to the register nodes reg_1 … reg_5 it produces or consumes.]
pushed and popped. The resulting translation is not particularly efficient, as it has a large number of register-to-register copies. More importantly, this form of blind translation is not possible with LLVM's assembly language, as identifiers (registers) cannot be reassigned. Instead we must trace the stack activity of instructions, creating a new identifier for each operand pushed and associating it with each subsequent pop or reference of that operand. This is possible because all ETC instructions consume and produce constant numbers of operands. The process of tracing operands demonstrates one important property of SSA: it makes the data dependencies between instructions explicit. Figures 3, 4 and 5 show, respectively, a sample ETC fragment, its traced form, and a data flow graph derived from the trace. Each generated identifier is a node in the data flow graph, connected to its producer and consumer nodes. From the example we can see that only the SB instruction depends on the first LDC; therefore it can be reordered to any point before the SB, or in fact constant folded. This direct mapping to the data flow graph representation is what makes SSA form desirable for pass-based optimisation. We apply this tracing process to the standard operand stack and the floating point operand
; load workspace offset 1
%reg_1 = load i32* (getelementptr i32* %wptr_1, i32 1)
; add 1
%reg_2 = add i32 %reg_1, 1
; store result to workspace offset 2
store i32 %reg_2, (getelementptr i32* %wptr_1, i32 2)
; load workspace offset 3
%reg_3 = load i32* (getelementptr i32* %wptr_1, i32 3)
; synchronise barrier
call void @kernel_barrier_sync (%reg_3)
; load workspace offset 1
%reg_4 = load i32* (getelementptr i32* %wptr_1, i32 1)
; add 2
%reg_5 = add i32 %reg_4, 2
; store result to workspace offset 2
store i32 %reg_5, (getelementptr i32* %wptr_1, i32 2)

Figure 6. LLVM code example which illustrates the dangers of optimisation across kernel calls.
stack. A data structure in our translator provides the number of input and output operands for each instruction. Additionally, we trace modifications to the workspace register, redefining it as required. Registers from the operand stack are typed as 32-bit integers (i32), and operands on the floating point stack as 64-bit double precision floating point numbers (double). The workspace pointer is an integer pointer (i32*). When an operand is used as a memory address it is cast to the appropriate pointer type. In theory, these casts may hinder certain kinds of optimisations, but we have not observed this in practice.

2.2. Process Representation

While the programmer's view of occam is one of all processes executing in parallel, this model is in practice simulated by one or more threads of execution moving through the concurrent processes. The execution flow may leave processes at defined instructions, re-entering at the next instruction. The state of the operand stack after these instructions is undefined. Instructions which deschedule the process, such as channel communication or barrier synchronisation, are implemented as calls to the runtime kernel (CCSP) [7]. In a low-level machine code generator such as tranx86, the generator is aware of all registers in use and ensures that their state is not assumed constant across a kernel call. Consider the example in Figure 6: there is a risk that the code generator may choose to remove the second load of workspace offset 1 and reuse the register from the first load. However, the value of this register may have been changed by another process which is scheduled by the kernel before execution returns to the process in the example. While the system ABI specifies which registers should be preserved by the callee if modified, the kernel does not know which registers will be used by other processes it schedules. If the kernel is to preserve the registers then it must save all volatile registers when switching processes.
This requires space to be allocated for each process's registers, something the occam compiler does not do, as the instructions it generates were clearly specified to undefine the operand stack. More importantly, the code to store a process's registers must be rewritten in the system assembly language for each platform to be supported. Given our goal of minimal-maintenance portability this is not acceptable. Our solution is to break down monolithic processes into sequences of uninterruptible
Figure 7. Execution of the component functions of processes A and B is interleaved by the runtime kernel. [Diagram omitted: process A's functions f1 … fn and process B's functions g1 … gm alternate through the kernel.]
functions which pass continuations [17]. Control flow is then restricted such that it may only leave or enter a process at the junctures between its component functions. The functions of the process are then mapped directly to LLVM function definitions, which gives LLVM a view of the control flow identical to that of our internal representation. LLVM's code generation backends will then serialise state at points where control flow may leave the process. Figure 7 gives a graphical representation of this, as the runtime kernel interleaves the functions f1 to fn of process A with g1 to gm of process B. In practice the continuation is the workspace (wptr), with the address of the next function to execute stored at wptr[-1]. This is very similar to the Transputer's mechanism for managing blocked processes, except that the stored address is a function, and thus the dispatch mechanism is not a jump but a call. Thus the dispatch of a continuation (wptr) is the tail call: wptr[-1](wptr). We implement the dispatch of continuations in the generated LLVM assembly. Kernel calls return the next continuation as selected by the scheduler, which is then dispatched by the caller. This removes the need for system-specific assembly instructions in the kernel to modify execution flow, and thus greatly simplifies the kernel implementation. The runtime kernel can then be implemented as a standard system library. Figure 8 shows the full code listing of a kernel call generated by our translation tool. Two component functions of a process kroc.screen.process are shown (see section 2.5.1 for more details on function naming). The first constructs a continuation to the second, then makes a kernel call for channel input and dispatches the returned continuation.

2.3. Calling Conventions

When calling a process as a subroutine, we split the present process function and make a tail call to the callee, passing a continuation to the newly created function as the return address.
This calling scheme is essentially the same as the Transputer instructions CALL and RET. There are, however, some special cases, which we address in the remainder of this section. The occam language has both processes (PROC) and functions (FUNCTION). Processes may modify their writable (non-VAL) parameters, interact with their environment through channels and synchronisation primitives, and go parallel, creating concurrent subprocesses. Functions, on the other hand, may not modify their parameters, perform any potentially blocking operations or go parallel, but may return values (processes do not return values). While it is possible to implement occam's pure functions in LLVM using the normal call stack, we have not yet done so, for pragmatic reasons. Instead we treat function calls as process calls. Function returns are then handled by rewriting the return values into parameters to the continuation function. The main obstacle to supporting pure functions is that the occ21 compiler lowers functions to processes, which obscures functions in the resulting ETC output. It also allows some
; Component function of process "kroc.screen.process"
define private fastcc void @O_kroc_screen_process_L0.3_0 (i8* %sched, i32* %wptr_1) {
  ; ... code omitted ...
  ; Build continuation
  ; tmp_6 = pointer to workspace offset -1
  %tmp_6 = getelementptr i32* %wptr_1, i32 -1
  ; tmp_7 = pointer to continuation function as byte pointer
  %tmp_7 = bitcast void (i8*, i32*)* @O_kroc_screen_process_L0.3_1 to i8*
  ; tmp_8 = tmp_7 cast to a 32-bit integer
  %tmp_8 = ptrtoint i8* %tmp_7 to i32
  ; store tmp_8 (continuation function pointer) to workspace offset -1
  store i32 %tmp_8, i32* %tmp_6

  ; Make kernel call
  ; The call parameters are reg_8, reg_7 and reg_6
  ; The next continuation is returned by the call as tmp_9
  %tmp_9 = call i32* @kernel_Y_in (i8* %sched, i32* %wptr_1, i32 %reg_8, i32 %reg_7, i32 %reg_6)

  ; Dispatch the next continuation
  ; tmp_10 = pointer to continuation offset -1
  %tmp_10 = getelementptr i32* %tmp_9, i32 -1
  ; tmp_12 = pointer to continuation function cast as 32-bit integer
  %tmp_12 = load i32* %tmp_10
  ; tmp_11 = pointer to continuation function
  %tmp_11 = inttoptr i32 %tmp_12 to void (i8*, i32*)*
  ; tail call tmp_11 passing the continuation (tmp_9) as its parameter
  tail call fastcc void %tmp_11 (i8* %sched, i32* %tmp_9) noreturn
  ret void
}

; Next function in the process "kroc.screen.process"
define private fastcc void @O_kroc_screen_process_L0.3_1 (i8* %sched, i32* %wptr_1) {
  ; ... code omitted ...
}

Figure 8. LLVM code example, actual output from our translation tool, showing a kernel call for channel input. This demonstrates continuation formation and dispatch.
kernel operations (e.g. memory allocation) within functions. Hence to provide pure function support, the translation tool must reconstruct functions from processes, verify their purity, and have separate code generation paths for process and functions. We considered such engineering excessive for this initial exploration; however, as the LLVM optimiser is likely to provide more effective inlining and fusion of pure functions we intend to explore it in future work. In particular, the purity verification stage involved in such a translator should also be able to lift processes to functions, further improving code generation. Another area affected by LLVM translation is the Foreign Function Interface (FFI). FFI allows occam programs to call functions implemented in other languages, such as C [18,19]. This facility is used to access the system libraries for file input and output, networking and graphics. At present the code generator (tranx86) must generate not only hardware specific assembly, but structure the call to conform to the operating system specific ABI. LLVM greatly simplifies the FFI call process as it abstracts away any ABI specific logic. Hence
foreign functions are implemented as standard LLVM calls in our translator.

2.4. Branching and Labels

The Transputer instruction set has a relatively small number of control flow instructions:

• CALL – call subroutine (and a general call variant, GCALL),
• CJ – conditional jump,
• J – unconditional jump,
• LEND – loop end (a form of CJ which uses a counting block in memory),
• RET – return from subroutine.
In sections 2.2 and 2.3 we addressed the CALL- and RET-related elements of our translation; in this section we address the remaining instructions. The interesting aspect of the branching instructions J and CJ is their impact on the operand stack. An unconditional jump undefines the operand stack; this allows a process to be descheduled on certain jumps, which provided a preemption mechanism for long-running processes on the Transputer. The conditional jump instruction branches if the first element of the operand stack is zero, in doing so preserving the stack. If it does not branch, it instead pops the first element of the operand stack. As part of operand tracing during the conversion to SSA form (see section 2.1), each encountered label within a process is tagged with the present stack operands. For the purposes of tracing, unconditional jumps undefine the stack and conditional jumps consume the entire stack, outputting stackdepth−1 new operands. Having traced the stack, we compare the inputs of each label with the inputs inferred from the branch instructions which reference it, adjusting instruction behaviour as required. These adjustments can occur, for example, when the target of a conditional jump does not require the entire operand stack. While the compiler outputs additional stack depth information, this is not always sufficient; hence our introduction of an additional verification stage. The SSA syntax of LLVM's assembly language adds some complication to branching code. When a label is the target of more than one branching instruction, φ nodes (phi nodes) must be introduced for each identifier which depends on the control flow. Figure 9 illustrates the use of φ nodes in a contrived code snippet computing 1/n, where the result is 1 when n = 0. The φ node selects a value for %fraction according to the label control arrived from, acting as a merge of values in the data flow graph.
In our translation tool we use the operand stack information generated for each label to build appropriate φ nodes for labels which are branch targets. Unconditional branch instructions are then added to connect these labels together, as LLVM's control flow does not automatically transition between labels. Transputer bytecode is by design position independent: the arguments passed to start process instructions, loop ends and jumps are offsets from the present instruction. To support these offsets the occam compiler specifies the instruction arguments as label differences, Lt − Li, where Lt is the argument's target label and Li is a label placed before the instruction consuming the jump offset. While we can revert these differences to absolute label references by removing the subtraction of Li, LLVM assembly does not permit the derivation of label addresses. This prevents us from passing labels as arguments to kernel calls such as start process (STARTP). We overcome this by lifting labels for which the address is required to function definitions. This is achieved by splitting the process in the same way as is done for kernel calls (see section 2.2). Adjacent labels are then connected by tail calls, with the operand stack passed as parameters. There is no need to build continuations for these calls as control flow will not leave the process. Additionally, φ nodes are not required, as the passing of the operand stack as parameters provides the required renaming.
; Compare n to 0.0
%is_zero = fcmp oeq double %n, 0.0
; Branch to the correct label:
; zero if is_zero = 1, otherwise not_zero
br i1 %is_zero, label %zero, label %not_zero

zero:
; Unconditionally branch to the continue label
br label %continue

not_zero:
; Divide 1 by n
%nz_fraction = fdiv double 1.0, %n
; Unconditionally branch to the continue label
br label %continue

continue:
; fraction depends on the source label:
;   1.0 if the source is zero
;   nz_fraction if the source is not_zero
%fraction = phi double [ 1.0, %zero ], [ %nz_fraction, %not_zero ]

Figure 9. Example LLVM code showing the use of a phi node to select the value of the fraction identifier.
As an aside, earlier versions of our translation tool lifted all labels to function definitions, to avoid the complexity of generating φ nodes and to avoid tracking the use of labels as arguments. While it appeared that LLVM's optimiser was able to fuse many of these processes back together, it was felt that a layer of control flow was being obscured. In particular, this created output which was often hard to debug. Hence, we redesigned our translator to lift labels only when required.

2.5. Odds and Ends

This section contains some brief notes on other interesting areas of our translation tool.

2.5.1. Symbol Naming

While LLVM allows a wide range of characters in symbol names, the generation of symbol names for processes is consistent with that used in tranx86 [6]. Characters not valid in a C function name are converted to underscores, and an O_ prefix is added. This allows ANSI C code to manipulate occam process symbols by name. Only processes marked as global by the compiler are exported; internally generated symbols are marked as private and tagged with the label name to prevent internal collisions. declare statements are added to the beginning of the assembly output for all processes referenced within the body. These declarations may include processes not defined in the assembly output; however, these will have been validated by the compiler as existing in another ETC source. The resulting output can then be compiled to LLVM bytecode or system assembly and the symbols resolved by the LLVM linker or the system linker as appropriate.

2.5.2. Arithmetic Overflow

An interesting feature of the occam language is that its standard arithmetic operations check for overflow and trigger an error when it is detected. In the ANSI C TVM, emulating these arithmetic instructions requires a number of additional logic steps and calculations [10]. This is inefficient on CPU architectures which provide flags for detecting overflow. The LLVM assembly language does not provide access to the CPU flags, but instead provides intrinsics for addition, subtraction and multiplication with overflow detection. We have used these intrinsics (@llvm.sadd.with.overflow, @llvm.ssub.with.overflow and @llvm.smul.with.overflow) to efficiently implement the instructions ADD, SUB and MUL.

2.5.3. Floating Point

occam supports a wide range of IEEE floating-point arithmetic and provides the ability to set the rounding mode in number space conversions. While an emulation library exists for this arithmetic, a more efficient hardware implementation was present in later Transputers, and we seek to mirror this in our translator. However, we found that LLVM lacks support for setting the rounding mode of the FPU (this is still the case at the time of writing, with LLVM version 2.5). The LLVM assembly language specification defines all the relevant instructions to truncate their results. While not ideal, we exploit this fact by adding or subtracting 0.5 before converting a value, in order to simulate round-to-nearest. We do not support plus and minus rounding modes, as the compiler never generates the relevant instructions. We observed that the occ21 compiler only ever generates a rounding mode change instruction directly prior to a conversion instruction. Thus, instead of generating LLVM code for the mode change instruction, we tag the instruction which follows it with the new mode. Mode changes hence become static at the point of translation and can be optimised by LLVM, although this was not the motivation for the change.

3. Benchmarks

In this section we discuss preliminary benchmark results comparing the output of the existing tranx86 ETC converter to the output of our translation tool passed through LLVM's optimiser (opt) and native code generator (llc).
These benchmarks were performed using source code as-is from the KRoC subversion repository, revision 6002 (1), with the exception of the mandelbrot benchmark, from which we removed the frame rate limiter. Table 1 shows the wall-clock execution times of the various benchmarks we will now discuss. All our benchmarks were performed on an eight-core Intel Xeon workstation composed of two E5320 quad-core processors running at 1.86GHz. Pairs of cores share 4MiB of L2 cache, giving a total of 16MiB L2 cache across eight cores.

Table 1. Benchmark execution times, comparing tranx86 and LLVM based compilations.

  Benchmark      tranx86 (s)   LLVM (s)   Difference (tranx86 → LLVM)
  agents 8 32       29.6         27.6        -7%
  agents 8 64       91.8         86.5        -6%
  fannkuch           1.29         1.33       +3%
  fasta              6.78         6.90       +2%
  mandelbrot        27.0          8.74      -68%
  ring 250000        3.84         4.28      +12%
  spectralnorm      23.1         14.3       -38%
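The difference column of Table 1 can be recomputed directly from the two timing columns; a quick arithmetic check (our own sketch, not part of the benchmark suite):

```java
public class TableCheck {
    // Percentage difference between the two timing columns of Table 1:
    // (LLVM - tranx86) / tranx86, rounded to the nearest percent.
    static long diffPercent(double tranx86, double llvm) {
        return Math.round((llvm - tranx86) / tranx86 * 100.0);
    }

    public static void main(String[] args) {
        System.out.println(diffPercent(29.6, 27.6)); // agents 8 32: -7
        System.out.println(diffPercent(27.0, 8.74)); // mandelbrot: -68
        System.out.println(diffPercent(23.1, 14.3)); // spectralnorm: -38
    }
}
```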
3.1. agents

The agents benchmark was developed to compare the performance of the CCSP runtime [7] to that of other language runtimes. It is based on the occoids simulation developed as part of
(1) http://projects.cs.kent.ac.uk/projects/kroc/trac/log/kroc/trunk?rev=6002
C.G. Ritson / Translating ETC to LLVM Assembly
155
the CoSMoS project [20]. A number of agent processes move over a two-dimensional torus avoiding each other, with their behaviour influenced by the agents they encounter. Each agent calculates forces between itself and other agents it can see using only integer arithmetic. The amount of computation increases greatly with the density of agents, and hence we ran two variants for comparison: one with 32 initial agents per grid tile on an eight by eight grid, giving 2048 agents, and the other with double the density, 4096 agents on the same size grid. We see a marginal performance improvement in the LLVM version of this benchmark, which we attribute to LLVM's aggressive optimisation of the computation loops.

3.2. fannkuch

The fannkuch benchmark is based on a version from The Computer Language Benchmarks Game [21,22]. The source code involves a large number of reads and writes to relatively small arrays of integers. We notice a very small decrease in performance in the LLVM version of this benchmark. This may be the result of tranx86 generating a more efficient instruction sequence for array bounds checking.

3.3. fasta

The fasta benchmark is also taken from The Computer Language Benchmarks Game. A set of random DNA sequences is generated and output; this involves array accesses and floating-point arithmetic. Again, like fannkuch, we notice a negligible decrease in performance and attribute this to array bounds checks.

3.4. mandelbrot

We modified the occam-π implementation of the mandelbrot set generator in the ttygames source directory to remove the frame rate limiter and used this as a benchmark. The implementation farms lines of the mandelbrot set image to 32 worker processes for generation, and buffers allow up to eight frames to be calculated concurrently. The complex number calculations for the mandelbrot set involve large numbers of floating-point operations, and this benchmark demonstrates a vast improvement in LLVM's floating-point code generator over tranx86.
tranx86 generates FPU instructions, whereas LLVM generates SSE instructions; the latter appear to be more efficient on modern x86 processors. Additionally, as we track the rounding mode at the source level (see section 2.5.3), we do not need to generate FPU mode change instructions, which may be disrupting FPU pipelining in tranx86 generated code.

3.5. ring

Another CCSP comparison benchmark; this sets up a ring of 256 processes. Ring processes receive a token, increment it, and then forward it to the next ring node. We time 250,000 iterations of the ring, giving 64,000,000 independent communications. This allows us to calculate the communication times of tranx86 and our LLVM implementation at 60ns and 67ns respectively. We attribute the increase in communication time to the additional instructions required to unwind the stack when returning from kernel calls in our implementation. The tranx86 version of CCSP does not return from kernel calls (it dispatches the next process internally).

3.6. spectralnorm

The final benchmark from The Computer Language Benchmarks Game. This benchmark calculates the spectral norm of an infinite matrix. Matrix values are generated using floating-point arithmetic by a function which is called from a set of nested loops. The significant performance improvement with LLVM can be attributed to its inlining and more efficient floating-point code generation.

3.7. Code Size

Table 2 shows the size of the text section of the benchmark binaries.

Table 2. Binary text section sizes, comparing tranx86 and LLVM based compilations.

  Benchmark      tranx86 (bytes)   LLVM (bytes)   Difference (tranx86 → LLVM)
  agents             16410            36715          +124%
  fannkuch            3702             5522           +49%
  fasta               5134            10494          +104%
  mandelbrot          6098            12865          +111%
  ring                3453             6716           +94%
  spectralnorm        4065             6318           +55%

We can see that the LLVM output is typically twice the size of the equivalent tranx86 output. It is surprising that this increase in binary size does not adversely affect performance, as it increases the cache pressure. As an experiment we passed the -code-model=small option to LLVM's native code generator; however, this made no difference to binary size. Some of the increase in binary size may be attributed to the continuation dispatch code, which is inlined within the resulting binary rather than being part of the runtime kernel as with tranx86. The fannkuch and spectralnorm benchmarks make almost no kernel calls, therefore contain very few continuation dispatches, and accordingly show the least growth. Another possibility is LLVM aggressively aligning instructions to increase performance. Further investigation is required to establish whether binary size can be reduced, as it is of particular concern for memory-constrained embedded devices.

4. Conclusions and Future Work

In this preliminary work we have demonstrated the feasibility of translating the ETC output of the present occam-π compiler into the LLVM project's assembly language. With associated changes to our runtime kernel, this work provides a means of compiling occam-π code for platforms other than x86. We see this work as a stepping stone on the path to direct compilation of occam-π using LLVM assembly as part of a new compiler, Tock [23]. In particular, we have established viable representations of processes and a kernel calling convention, both fundamental details of any compiled representation of occam-π. The performance of our translations compares favourably with that of previous work (see section 3).
While our kernel call mechanism is approximately 10% slower, loop unrolling enhancements and dramatically improved floating-point performance offset this overhead. Typical applications are a mix of communication and computation, which should help preserve this balance. The occ21 compiler's memory-bound model of compilation presents an underlying performance bottleneck to translation-based optimisation such as that presented in this paper. This is a legacy of the Transputer's limited number of stack registers, and it is our intention to overcome this in the new Tock compiler. The ultimate aim of our work is to compile occam-π directly to LLVM assembly using Tock, bypassing ETC entirely. Aside from the portability aspects of this work, access to an LLVM representation of occam-π programs opens the door to exploring concurrency-specific optimisations within an established optimisation framework. Interesting optimisations, such as fusing parallel processes using vector instructions and removing channel communications in linear pipelines, could be implemented as LLVM passes. LLVM's bytecode has also been used for various forms of static verification; a similar approach may be able to verify aspects of a compiled occam-π program, such as the safety of its access to mobile data. Going further, it is likely that LLVM's assembly language would benefit from a representation of concurrency, particularly for providing information to concurrency-related optimisations.

Acknowledgements

This work was funded by EPSRC grant EP/D061822/1. We also thank the anonymous reviewers for comments which helped us improve the presentation of this paper.

References

[1] David A. P. Mitchell, Jonathan A. Thompson, Gordon A. Manson, and Graham R. Brookes. Inside The Transputer. Blackwell Scientific Publications, Ltd., Oxford, UK, 1990.
[2] INMOS Limited. The T9000 Transputer Instruction Set Manual. SGS-Thompson Microelectronics, 1993. Document number: 72 TRN 240 01.
[3] Michael D. Poole. occam-for-all – Two Approaches to Retargeting the INMOS occam Compiler. In Brian O'Neill, editor, Parallel Processing Developments – Proceedings of WoTUG 19, pages 167–178, Nottingham-Trent University, UK, March 1996. World occam and Transputer User Group, IOS Press, Netherlands.
[4] Kent Retargetable occam Compiler. (http://projects.cs.kent.ac.uk/projects/kroc/trac/).
[5] Michael D. Poole. Extended Transputer Code – a Target-Independent Representation of Parallel Programs. In Peter H. Welch and A.W.P. Bakkers, editors, Architectures, Languages and Patterns for Parallel and Distributed Applications, volume 52 of Concurrent Systems Engineering, April 1998. WoTUG, IOS Press.
[6] Frederick R.M. Barnes. tranx86 – an Optimising ETC to IA32 Translator. In Alan Chalmers, Majid Mirmehdi, and Henk Muller, editors, Communicating Process Architectures 2001, number 59 in Concurrent Systems Engineering Series, pages 265–282.
IOS Press, Amsterdam, The Netherlands, September 2001.
[7] Carl G. Ritson, Adam T. Sampson, and Frederick R. M. Barnes. Multicore Scheduling for Lightweight Communicating Processes. In John Field and Vasco Thudichum Vasconcelos, editors, Coordination Models and Languages, 11th International Conference, COORDINATION 2009, Lisboa, Portugal, June 9–12, 2009, Proceedings, volume 5521 of Lecture Notes in Computer Science, pages 163–183. Springer, June 2009.
[8] Peter H. Welch and David C. Wood. The Kent Retargetable occam Compiler. In Brian O'Neill, editor, Parallel Processing Developments – Proceedings of WoTUG 19, pages 143–166, Nottingham-Trent University, UK, March 1996. World occam and Transputer User Group, IOS Press, Netherlands.
[9] Christian L. Jacobsen and Matthew C. Jadud. The Transterpreter: A Transputer Interpreter. In Ian R. East, David Duce, Mark Green, Jeremy M. R. Martin, and Peter H. Welch, editors, Communicating Process Architectures 2004, volume 62 of Concurrent Systems Engineering Series, pages 99–106. IOS Press, September 2004.
[10] Christian L. Jacobsen. A Portable Runtime for Concurrency Research and Application. PhD thesis, Computing Laboratory, University of Kent, April 2008.
[11] Chris Lattner. LLVM: An Infrastructure for Multi-Stage Optimization. Master's thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL, December 2002.
[12] LLVM Project. (http://llvm.org).
[13] Rich Hickey. The Clojure Programming Language. In DLS '08: Proceedings of the 2008 Symposium on Dynamic Languages, pages 1–1, New York, NY, USA, 2008. ACM.
[14] Michel Schinz. Compiling Scala for the Java Virtual Machine. PhD thesis, Institut d'Informatique Fondamentale, 2005.
[15] David May. Communicating Process Architecture for Multicores. In Alistair A. McEwan, Wilson Ifill, and Peter H. Welch, editors, Communicating Process Architectures 2007, pages 21–32, July 2007.
[16] LLVM Language Reference Manual. (http://www.llvm.org/docs/LangRef.html).
[17] John C. Reynolds. The discoveries of continuations. Lisp and Symbolic Computation, 6(3–4):233–248, 1993.
[18] David C. Wood. KRoC – Calling C Functions from occam. Technical report, Computing Laboratory, University of Kent at Canterbury, August 1998.
[19] Damian J. Dimmich and Christian L. Jacobsen. A Foreign Function Interface Generator for occam-pi. In J. Broenink, H. Roebbers, J. Sunter, P. Welch, and D. Wood, editors, Communicating Process Architectures 2005, pages 235–248, Amsterdam, The Netherlands, September 2005. IOS Press.
[20] Paul Andrews, Adam T. Sampson, John Markus Bjørndalen, Susan Stepney, Jon Timmis, Douglas Warren, and Peter H. Welch. Investigating patterns for process-oriented modelling and simulation of space in complex systems. In S. Bullock, J. Noble, R. A. Watson, and M. A. Bedau, editors, Proceedings of the Eleventh International Conference on Artificial Life, Cambridge, MA, USA, August 2008. MIT Press.
[21] The Computer Language Benchmarks Game. (http://shootout.alioth.debian.org/).
[22] Kenneth R. Anderson and Duane Rettig. Performing Lisp analysis of the fannkuch benchmark. SIGPLAN Lisp Pointers, VII(4):2–12, 1994.
[23] Tock Compiler. (http://projects.cs.kent.ac.uk/projects/tock/trac/).
Communicating Process Architectures 2009
P.H. Welch et al. (Eds.)
IOS Press, 2009
© 2009 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-065-0-159
Resumable Java Bytecode – Process Mobility for the JVM

Jan Bækgaard PEDERSEN and Brian KAUKE
School of Computer Science, University of Nevada, Las Vegas, Nevada, United States
[email protected], [email protected]

Abstract. This paper describes an implementation of resumable and mobile processes for a new process-oriented language called ProcessJ. ProcessJ is based on CSP and the π-calculus; it is structurally very close to occam-π, but its syntax is much closer to the imperative part of Java (with new constructs added for process orientation). One of the targets of ProcessJ is Java bytecode to be executed on the Java Virtual Machine (JVM), and in this paper we describe how to implement the process mobility features of ProcessJ with respect to the Java Virtual Machine. We show how to add functionality to support resumability (and process mobility) by a combination of code rewriting (adding extra code to the generated Java target code) and bytecode rewriting.
Introduction

In this paper we present a technique to achieve process resumability and mobility for ProcessJ processes executed in one or more Java Virtual Machines. ProcessJ is a new process-oriented language with syntax close to Java and semantics close to occam-π [20]. In the next subsection we briefly introduce ProcessJ. We have developed a code generator (from ProcessJ to Java) and a rewriting technique for the Java bytecode (the result of compiling the Java code generated by the ProcessJ compiler) that alters the generated bytecode to save and restore state, as well as to support resuming execution in the middle of a code segment. This capability we call transparent mobility [16], which differs from non-transparent mobility in that the programmer does not need to be concerned with preserving the state of the system at any particular suspend or resume point. We do not, however, mean that processes may be implicitly suspended at arbitrary points in their execution.

ProcessJ

ProcessJ is a new process-oriented language. It is based on CSP [8] and the π-calculus [10]. Structurally it is very much like occam-π; it is imperative with support for synchronous communication through typed channels. Like occam-π, it supports mobility of processes. Syntactically it is very close to Java (but without objects), with added constructs needed for process orientation. ProcessJ currently targets the following execution platforms through different code generators (it produces source code which is then compiled using a compiler for the target language):

• Any platform that supports the KRoC [21,23] occam-π compiler. ProcessJ is translated to occam-π, and then passed to the KRoC compiler.
J.B. Pedersen and B. Kauke / Resumable Java Bytecode
• C and MPI [5], making it possible to write process-oriented programs for a distributed-memory cluster or wide area network. ProcessJ is translated to C with MPI message passing calls and passed to a C compiler.
• Java with JCSP [17,19], which will run on any architecture supporting a JVM. ProcessJ is translated to JCSP, which is Java with library-provided CSP support, and passed to the Java compiler.

In this paper we focus on the process mobility of ProcessJ for Java/JCSP. As the JVM itself provides no support for continuations, and the Java language provides a restricted set of flow control constructs on which to build such functionality, it was initially not clear whether transparent process mobility could be usefully implemented on this platform. Like any other process-oriented language, ProcessJ has the notion of processes (procedures executing in their own execution context), but since the translation is to Java, it is sometimes necessary to refer to methods when describing the generated Java code and Java bytecode. A simple example of a piece of ProcessJ code (without mobility) is a multiplexer that accepts input on two input channels (in1 and in2) and outputs on an output channel (out):

proc void mux(chan<int>.read in1, chan<int>.read in2, chan<int>.write out) {
  int data;
  while (true) {
    alt {
      data = in1.read(): out.write(data);
      data = in2.read(): out.write(data);
    }
  }
}
where chan<int>.read in1 declares in1 to be the reading end of a channel that carries integers, out.write(data) writes the value of data to the out channel, and alt represents an "alternation" between the two input guards, each guarding a channel write statement. Other approaches such as [2,16] consider thread migration (which involves general object migration) in the Java language, but since ProcessJ is not object-oriented, we do not need to be concerned with object migration at the programmer level. We do use an encapsulation object at the translation level from ProcessJ to Java to hold the data that is transferred (this object serves as a continuation for the translated ProcessJ process). In addition, mobile processes, as in occam-π, are started and resumed by simply calling them as regular procedures (which translates into invoking a regular non-static method in the resulting Java code). In this way, we can interpret the suspended mobile as a continuation [7] represented by the object which holds the code, the saved state, and information about where to continue the code execution upon resumption.

1. Resumability

We start by defining the term resumability. We denote a procedure as resumable if it can be temporarily terminated by a programmer-inserted suspend statement, with control returned to the caller, and at some later time restarted at the instruction immediately following the suspend point with the exact same local state, possibly in a different JVM located on a different machine (i.e. all local variables contain the same values as they did when the method
was terminated). In a process-oriented language, the only option a process has for communicating with its environment, that is, other processes, is through channel communication. This means that when the process is resumed, it might be with a different set of (channel, channel end, or barrier) parameters; in other situations a process might take certain parameters for initialization purposes [22], which must be provided with "dummy values" upon subsequent resumptions. Therefore, in this paper we consider resumability for procedures where the values of the local variables are saved and restored when the process is resumed, but where each process resumption can happen with a new set of actual parameters. This allows the environment to interact with the process. We start with a formal definition of resumability.

1.1. Formal Definition of Resumability

In this section we briefly define resumability for JVM bytecode in a more formal way (we disregard objects and fields, as neither is used in ProcessJ). Each invocation of a bytecode method has its own evaluation stack; recall that the JVM is a stack-based architecture, and all arithmetic takes place on the evaluation stack, which we can model as an array s of values:

  s = [e_0, e_1, ..., e_i]

In addition to an evaluation stack, each invocation has its own activation record (AR). (We consider non-static methods, but static methods are handled in a similar manner; the only difference is that a reference to this is stored at address 0 in the activation record for non-static method invocations.) We can also represent a saved activation record as an array:

  A = [this, p_1, ..., p_n, v_1, ..., v_m]

where this is a reference to the current object, the p_i are parameters, and the v_i are local variables (p_1 = A[1], ..., p_n = A[n], v_1 = A[n+1], ..., v_m = A[n+m]), where A[i] denotes the value of the parameter/local variable stored at address i.
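To make the pair (s, A) concrete, here is a toy Java model of one invocation's evaluation stack and activation record, mirroring s and A as just defined (this is our own illustration; the names are not from the paper):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ToyFrame {
    // Evaluation stack s and activation record A for one invocation,
    // mirroring s = [e_0, ..., e_i] and A = [this, p_1, ..., p_n, v_1, ..., v_m].
    final Deque<Integer> s = new ArrayDeque<>();
    final Integer[] A;

    ToyFrame(int slots) { A = new Integer[slots]; }

    // iload n: push the value in activation record slot n onto the stack.
    void iload(int n) { s.push(A[n]); }
    // istore n: pop the stack top into activation record slot n.
    void istore(int n) { A[n] = s.pop(); }
    // iconst c: push a constant; iadd: replace the top two entries by their sum.
    void iconst(int c) { s.push(c); }
    void iadd() { s.push(s.pop() + s.pop()); }

    public static void main(String[] args) {
        ToyFrame f = new ToyFrame(4);
        f.A[1] = 41;        // local variable in slot 1
        f.iload(1);         // stack: [41]
        f.iconst(1);        // stack: [41, 1]
        f.iadd();           // stack: [42]
        f.istore(1);        // A[1] = 42, stack empty
        System.out.println(f.A[1]); // 42
    }
}
```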
We do not need to store this in the saved activation record, as it is automatically placed at address 0 of the activation record for every invocation of a method, but we include it here because there are instructions that refer to address 0, where this is always stored for non-static methods. It is worth mentioning that the encapsulating object used in the ProcessJ to Java translation uses non-static methods and fields; this is necessary since a ProcessJ program might have more than one mobile process based on the same procedure. We can now define the semantic meaning of executing a basic block of bytecode instructions by considering the effect it has on the stack and the activation record. Only the last instruction of such a block can be a jump, so we are working with a block of code that will be executed completely. At this point, it is worth mentioning that at the end of a method invocation the stack is always empty; in addition, a ProcessJ suspend statement will translate to a return instruction, and at these return (suspend) points the evaluation stack will also be empty. We consider a semantic function E_JVM[[B]](s, A), where B = i_0 i_1 ... i_k is a basic block of bytecode statements, and define:

  E_JVM[[i_0 i_1 ... i_k]](s, A) = E_JVM[[i_1 ... i_k]](s', A'), where (s', A') = E_JVM[[i_0]](s, A)

We shall not give the full semantic specification for the entire instruction set of the Java Virtual Machine, as it would take up too much space in this paper, but most of the instructions are straightforward. A few examples are (we assume non-static invocations here):
  E_JVM[[iload 1]]([...], A) = ([..., a_1], A)
      where A = [this, a_1, ..., a_{n+m}]

  E_JVM[[istore 1]]([..., e_{k-1}, e_k], A) = ([..., e_{k-1}], [this, e_k, a_2, ..., a_{n+m}])
      where A = [a_0, a_1, ..., a_{n+m}]

  E_JVM[[goto X]]([...], A) = E_JVM[[B']]([...], A)
      where B' is the basic block that starts at address X.

  E_JVM[[ifeq X]]([e_0, e_1, ..., e_k], A) = E_JVM[[B']]([e_0, e_1, ..., e_{k-1}], A), if e_k = 0
      where B' is the basic block that starts at address X.

  E_JVM[[ifeq X]]([e_0, e_1, ..., e_k], A) = E_JVM[[B']]([e_0, e_1, ..., e_{k-1}], A), if e_k ≠ 0
      where B' is the basic block that immediately follows ifeq X.

  E_JVM[[invokevirtual f]]([..., q_0, q_1, ..., q_j], A) = ([..., r], A)
      where r is the return value of E_JVM[[B_f]]([], A') and A' = (q_0, q_1, ..., q_j, ⊥, ..., ⊥),
      q_0 is an object reference, and f(q_1, ..., q_j) is the actual invocation.
      B_f is the code for a non-void method f.

where ⊥ represents the undefined value. This is more of a semantic trick than reality, as no activation record entry is ever left undefined at the actual use of its value (the Java compiler assures this), but here we simply wish to denote that the locals might not yet have been assigned a value by the user code. Now let B = i_0 i_1 ... i_{j-1} i_j i_{j+1} ... i_k be a basic block of instructions (from the control flow graph associated with the code we are executing), and let i_j represent an imaginary suspend instruction (as mentioned, it eventually becomes a return):

  E_JVM[[B]]([], A) = E_JVM[[i_{j+1} ... i_k]](s', A')                              (1)
      where (s', A') = E_JVM[[i_0 i_1 ... i_{j-1}]]([], A)
or equivalently:

  E_JVM[[i_0 i_1 ... i_{j-1} i_j i_{j+1} ... i_k]](s, A) = E_JVM[[i_0 i_1 ... i_{j-1} i_{j+1} ... i_k]](s, A)

that is, simply ignore the suspend instruction i_j. Naturally, if the code is evaluated in two stages as in the first semantic definition, the invoking code must look something like this (assuming B is the body of a method foo()):

  foo(..);   // Execute i_0 ... i_{j-1}
  foo(..);   // Execute i_{j+1} ... i_k

We call this form of resumability "resumability without parameter change", since (1) uses A and not an A* where A* has the same local variables but different parameter values (i.e., the parameters passed to foo are exactly the same for both calls). Resumability without parameter change is not particularly interesting from a mobility standpoint in a process-oriented
language; typically we wish to be able to supply different parameters (most often channels, channel ends, and barriers) to the process when it is resumed, especially because the parameters could be channel ends which allow the process to interact with a new environment, that is, the process receiving the mobile. It turns out that if we can implement resumability without parameter change in the JVM (i.e., devise a method of restoring activation records between invocations), then the more useful type of resumability with parameter change comes totally free of charge! For completeness, let us define this as well. Consider again the basic block code B = i_0 i_1 ... i_{j-1} i_j i_{j+1} ... i_k, where again i_j represents a suspend instruction that returns control to the caller, and let us assume that the code in B is invoked by the calls foo(v_1, ..., v_n) and foo(v'_1, ..., v'_n) respectively:

  E_JVM[[i_0 ... i_{j-1} i_j i_{j+1} ... i_k]](s, A) = E_JVM[[i_{j+1} ... i_k]](s', A'')
      where
      A   = [a_0 = this, a_1 = v_1, a_2 = v_2, ..., a_n = v_n, a_{n+1} = ⊥, ..., a_{n+m} = ⊥]
      A'' = [a_0 = this, a_1 = v'_1, a_2 = v'_2, ..., a_n = v'_n, a_{n+1} = a'_{n+1}, ..., a_{n+m} = a'_{n+m}]
      (s', A') = E_JVM[[i_0 i_1 ... i_{j-1}]](s, A)
      A'  = [a'_0, ..., a'_{n+m}]

We call this "resumability with parameter change". The above extends to loops (through multiple basic block code segments) and to code blocks with more than one suspend instruction. As we can see from the semantic function E_JVM, the activation record must "survive" between invocations/suspend-resumptions; local variables are saved and restored, while parameters are not stored and change according to each invocation's actual parameters. Naturally, we must assure that the locations in the activation record holding the locals are restored before they are referenced again.

2. Target Bytecode Structure

All the extra code needed to save and restore state upon suspension and resumption can be generated by the ProcessJ code generator; only the code associated with resuming execution in the middle of a code block requires bytecode rewriting. Let us consider a very simple example with a single suspend statement (the following is a snippet of legal ProcessJ code):

type proc mobileFooType();

mobile proc void foo() implements mobileFooType {
  int a;
  a = 0;
  while (a == 0) {
    a = a + 1;
    suspend;
    a = a - 1;
  }
}
The resulting bytecode would look something like this:

public void foo();
  Code:
    0: iconst_0
    1: istore_1          ; a = 0;
    2: iload_1
    3: ifne 20           ; while (a == 0) {
    6: iload_1
    7: iconst_1
    8: iadd
    9: istore_1          ;   a = a + 1;
   10: ???               ;   suspend handled here
   13: iload_1
   14: iconst_1
   15: isub
   16: istore_1          ;   a = a - 1;
   17: goto 2            ; }
   20: return
Since the suspend is handled at line 10 by inserting a return instruction, we need to store the local state before the return; upon resuming execution, control must be transferred to line 13 rather than starting at line 0 again, and the state must be restored before executing line 13. This requires three new parts inserted into the bytecode:

1. Code to save the local state (in the above example, the local variable a) before the suspend statement in line 10.
2. Code to restore the local state before resuming execution of the instructions after the previous suspend statement, that is, after line 10 and before line 13.
3. Code to transfer control to the right point of the code depending on which suspend was most recently executed (before line 0).

Thus the goal is to automate the generation of such code. Parts 1 and 2 can be done completely in Java by the ProcessJ code generator, and part 3 can be done by a combination of Java code and bytecode rewriting. Before turning to this, let us first mention a few restrictions that mobile processes have in ProcessJ: processes have no return type (the equivalent in Java is a void method), and mobile processes cannot be recursive. The semantics of a recursive mobile process are not yet clear, and we do not see any obvious need for recursion of mobiles at this time.

3. Source Code Rewriting

As mentioned, the ProcessJ code generator emits Java source code, which is then compiled using the Java compiler, and the resulting bytecode is subsequently rewritten. Let us describe the Java code emitted from the ProcessJ compiler first. To transform the foo method from the previous section into a resumable process, we encapsulate it in a Java object that contains two auxiliary fields as well as the process rewritten as a Java method and two dummy placeholder methods.

1. The method is encapsulated in a new class:

public class Foo {
  private Object[] actRec;
  private static void suspend() { }
  private static void resume() { }
  private int jumpTarget = 0;

  public void foo() {
    ... switch statement that jumps to resume point.
    int a;
    a = 0;
    while (a == 0) {
      a = a + 1;
      ... code to save the current state.
      suspend();
      resume();
      ... code to restore the previous state.
      a = a - 1;
    }
  }
}
where actRec represents a saved activation record. The suspend and resume methods are just dummy methods added to satisfy the compiler (more about these later). Finally, a field jumpTarget has been added. jumpTarget holds non-negative values: 0 if execution is to start from the beginning, and 1, 2, ... if execution is to resume from somewhere within the code (i.e., not from the start).

2. The code for foo must also be rewritten to support resumability:

• Support must be added for saving and restoring the local variable part of the JVM activation record; this is done through the Object array actRec.
• A lookupswitch JVM instruction [9] must be added; based on the jumpTarget field it will jump to the instruction following the last suspend executed. A simple Java switch statement that switches on jumpTarget will translate to such a lookupswitch instruction.

3.1. Saving Local State

A Java activation record consists of two or three parts: local variables, parameters, and, for non-static methods, a reference to this stored at address 0 in the activation record. The layout is illustrated in Figure 1. Recall that we need the encapsulated method to be non-static. Since this never changes for an object, and since each resumption of the method provides a new set of parameters, all we have to save is the set of locals. As we rely on the JVM invocation instructions, each invocation of a method creates its own new JVM activation record that contains this, the provided parameters, and room for the locals. The first step in resuming the method is to restore the locals to the state they were in when the method was suspended. We use an array of Objects to store the m locals.
If the field jumpTarget has value 0, meaning that the method starts from the top (this is the initial invocation of the process), no restoration of locals is necessary, as execution starts from the beginning of the code (and the ProcessJ and Java compilers have assured that no path to a use of an uninitialized variable exists). On subsequent resumptions, the saved array of locals must be restored, and the value of the field jumpTarget determines from where execution should continue (immediately after the return instruction that suspended the previous activation of the method).
Figure 1. JVM Activation Records.
If for example a method has locals a, b, and c of integer type, we can save an Object array with their values in the following way by using the auto-boxing feature provided by the Java compiler: actRec = new Object [] { a , b , c }; jumpTarget = ...;
and they can be restored in the following manner:

    a = (Integer) actRec[0];
    b = (Integer) actRec[1];
    c = (Integer) actRec[2];
Both of these code blocks are generated by the ProcessJ code generator, the former before the suspend and the latter after.

3.2. Resuming Execution

When a method is resumed (by being invoked), the jumpTarget field determines where in the code execution should continue; namely, immediately after the return that suspended the previous invocation. We cannot achieve this with Java code translated by the Java compiler; to do so we would need a goto instruction (as well as support for labels), and although goto is a reserved word in Java, it is not currently in use. To achieve this objective, we must turn to bytecode rewriting. We need to insert a lookupswitch instruction that switches on the jumpTarget field and jumps to the address of the instruction following the return that suspended the previous invocation. We can generate parts of the code with help from the Java compiler; we insert code like this at the very beginning of the generated Java code:

    switch (jumpTarget) {
        case 1: break;
        default: break;
    }
There will be as many cases as there are suspends in the ProcessJ code. We get bytecode like this:

     4: lookupswitch { 1: 24;
                       default: 27 }
    24: goto 27
    27: ...
In the rewriting of the bytecode, all we have to do is replace the absolute addresses (24 and 27) in the switch by the addresses of the resume points. The addresses of the resume points can be found by keeping track of the values assigned to the jumpTarget field in the generated Java code, or by inspecting the bytecode as explained below. Since we replaced the suspend keyword by calls to the dummy suspend() and resume() methods, we can look for the static invocation of resume():

    52: aload_0
    53: iconst_1
    54: putfield      #5;  // Field jumpTarget:I
    57: invokestatic  #18; // Method suspend:()V
    60: invokestatic  #19; // Method resume:()V
    63: ...
(here found in line 60), and the two instructions immediately before the suspend call reveal the jumpTarget value that the address (60) should be associated with. The instruction in line 53 will be one of the iconst_X (X = 1,2,3,4,5) instructions or a bipush instruction. For the above, the lookupswitch should be rewritten as:

     4: lookupswitch { 1: 60;
                       default: 27 }
    24: goto 27
    27: ...
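The scan just described — find each invokestatic of resume(), then read the jumpTarget value from the iconst/bipush a few instructions earlier — might be sketched as follows. The Insn record and resumePoints method are hypothetical illustrations over a simplified in-memory instruction model, not part of the ProcessJ tool chain; a real implementation would parse the class-file bytes:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ResumePointScan {
    // Hypothetical, simplified view of one bytecode instruction.
    record Insn(int address, String op, int operand) {}

    // Maps each jumpTarget value to the address of its resume point,
    // following the pattern: aload_0; iconst_X; putfield jumpTarget;
    // invokestatic suspend; invokestatic resume.
    static Map<Integer, Integer> resumePoints(List<Insn> code) {
        Map<Integer, Integer> targets = new HashMap<>();
        for (int i = 3; i < code.size(); i++) {
            if (code.get(i).op().equals("invokestatic resume")) {
                Insn push = code.get(i - 3); // the iconst_X / bipush
                targets.put(push.operand(), code.get(i).address());
            }
        }
        return targets;
    }
}
```

For the listing above, such a scan would yield the mapping 1 → 60, which is exactly what is written into the rewritten lookupswitch.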
Furthermore, lines 57 and 60 must be rewritten to be a return (this cannot be done in the source, as the Java compiler would complain about unreachable code) and a nop, respectively. Alternatively, the resume method call can be removed, making the jump target the instruction following the suspend call.

4. Example

Let us rewrite the previous example to obtain this new Foo class:

    public class Foo {
        private Object[] actRec;
        private static void suspend() { }
        private static void resume() { }
        private int jumpTarget = 0;

        public void foo() {
            int a;
            switch (jumpTarget) {             // Begin: jump
                case 1: break;
                default: break;
            }                                 // End: jump
            a = 0;
            while (a == 0) {
                a = a + 1;
                actRec = new Object[] { a };  // Begin: save state
                jumpTarget = 1;               // End: save state
                suspend();
                resume();
                a = (Integer) actRec[0];      // restore state
                a = a - 1;
            }
            jumpTarget = 0;                   // Reset jumpTarget
        }
    }
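Since Java source cannot jump into the middle of a loop, the effect the bytecode rewrite achieves can only be approximated at the source level. The following hand-restructured class is a hypothetical illustration (not output of the ProcessJ compiler) that emulates the rewritten behaviour: each call runs until the suspension point and returns, and the next call restores the saved local and continues:

```java
// Hand-written emulation of the rewritten Foo: every call to foo()
// performs one "activation", returning where the rewritten bytecode
// would execute the return that replaces suspend().
class ResumableFoo {
    private Object[] actRec;
    private int jumpTarget = 0;

    int getJumpTarget() { return jumpTarget; }

    public void foo() {
        int a;
        if (jumpTarget == 1) {            // resume point
            a = (Integer) actRec[0];      // restore state
            a = a - 1;
        } else {
            a = 0;                        // initial invocation
        }
        while (a == 0) {
            a = a + 1;
            actRec = new Object[] { a };  // save state
            jumpTarget = 1;
            return;                       // suspend
        }
        jumpTarget = 0;                   // ran to completion
    }
}
```

As in the original example, every activation suspends again inside the loop, so jumpTarget remains 1 after each call.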
Note that jumpTarget should be set to 0 before each original return statement, to assure that the next time the process is resumed it starts from the beginning. This is very close to the code we really want, and best of all, it actually compiles. Note also that the line saving local state must include all locals in scope. If the rewriting were done solely in bytecode, this would require an analysis of the control flow graph (CFG) associated with the code – like the approach taken in the Southampton Portable Occam Compiler (SPOC) [12]. But since we generate the store code as part of the code generation from
the ProcessJ compiler, we have access to all scope information. It is further simplified by the fact that the scoping rules of ProcessJ follow those of Java (when removing fields and objects). Let us look at the generated bytecode. Because of the incomplete switch statement, every invocation of foo will always execute the a = 0 statement (i.e., start from the beginning):

    public void foo();
      Code:
        0: aload_0
        1: getfield      jumpTarget I          // switch (jumpTarget) {
        4: lookupswitch { 1: 24;               //   case 1: ...
                          default: 27 }        //   default: ...
       24: goto 27                             // }
       27: iconst_0
       28: istore_1                            // a = 0;
       29: iload_1                             // while (a == 0) {
       30: ifne 83
       33: iload_1
       34: iconst_1
       35: iadd
       36: istore_1                            //   a = a + 1;
       37: aload_0
       38: iconst_1
       39: anewarray     java/lang/Object
       42: dup
       43: iconst_0
       44: iload_1
       45: invokestatic  java/lang/Integer.valueOf(I)Ljava/lang/Integer;
       48: aastore
       49: putfield      actRec [Ljava/lang/Object;  // actRec = new Object[]{ a };
       52: aload_0
       53: iconst_1
       54: putfield      jumpTarget I          //   jumpTarget = 1;
       57: invokestatic  suspend()V            //   suspend;
       60: invokestatic  resume()V             //   resume point
       63: aload_0
       64: getfield      actRec [Ljava/lang/Object;
       67: iconst_0
       68: aaload
       69: checkcast     java/lang/Integer
       72: invokevirtual java/lang/Integer.intValue()I
       75: istore_1                            //   a = (Integer) actRec[0];
       76: iload_1
       77: iconst_1
       78: isub
       79: istore_1                            //   a = a - 1;
       80: goto 29                             // }
       83: aload_0
       84: iconst_0
       85: putfield      jumpTarget I          // jumpTarget = 0;
       88: return
    }
Lines 0–24 represent the switch statement, 38–54 the save state code, 57–60 the suspend/resume placeholder method calls, 63–75 the restore state code, and 83–88 the rewritten original return code.
As pointed out above, this code is not correct; a number of things still need to be changed:
• Line 4 is the jump table that must be filled with correct addresses. If the field jumpTarget equals 1, execution continues at the invocation of the dummy resume() method – line 60. The default label is already correct and can be left unchanged.
• Line 57, the dummy suspend() invocation, should be replaced by a return instruction (we could not simply place a Java return statement in the source code, because the compiler would complain about the code following the return statement being unreachable).
• Line 60, the dummy resume() invocation, should be replaced by a nop. This only serves as a placeholder; theoretically we could have used address 63 in the lookupswitch.

An example of use in ProcessJ could be this:

    proc void sender(chan<mobileFooType>.write ch) {
        // create mobile
        mobileFooType mobileFoo = new mobile foo;
        // invoke foo (1st invocation)
        mobileFoo();
        // send to different process
        ch.write(mobileFoo);
    }

    proc void receiver(chan<mobileFooType>.read ch) {
        mobileFooType mobileFoo;
        // receive mobileFooType process
        mobileFoo = ch.read();
        // invoke foo (2nd invocation)
        mobileFoo();
    }

    proc void main() {
        chan<mobileFooType> ch;
        par {
            sender(ch.write);
            receiver(ch.read);
        }
    }
The resulting Java/JCSP code looks like this:

    import org.jcsp.lang.*;

    public class PJtest {
        public static void sender(ChannelOutput ch_write) {
            Foo mobileFoo = new Foo();
            mobileFoo.foo();
            ch_write.write(mobileFoo);
        }

        public static void receiver(ChannelInput ch_read) {
            Foo mobileFoo;
            mobileFoo = (Foo) ch_read.read();
            mobileFoo.foo();
        }
        public static void main(String args[]) {
            final One2OneChannel ch = Channel.one2one();
            new Parallel(new CSProcess[] {
                new CSProcess() {
                    public void run() { sender(ch.out()); }
                },
                new CSProcess() {
                    public void run() { receiver(ch.in()); }
                }
            }).run();
        }
    }
One small change is still needed to support mobility across a network. Since the generated Java code is a class, it can be made serializable by making the generated classes implement the Serializable interface. An object of such a class can then be serialized and sent across a network; Welch et al. [18] provide such a mechanism in their jcsp.net package as well. Since the rewriting described here encapsulates the mobile process in a new class, objects of that class can be sent as data across the network, and the mobile process inside such an object can be resumed by invoking the method that encapsulates it (mobileFoo.foo() above).
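As a sketch of this idea, a suspended process object can survive a serialization round trip with its saved state intact. The MobileCounter class and the in-memory byte-array transport below are hypothetical stand-ins for a generated class and a real network connection:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical generated class: Serializable so a suspended process
// can travel across a network as plain data.
class MobileCounter implements Serializable {
    private Object[] actRec;
    private int jumpTarget = 0;
    private int result = 0;

    int getResult() { return result; }

    // One "activation" per invocation: counts upwards, suspending
    // (here, simply returning) after each step.
    public void step() {
        int n = (jumpTarget == 1) ? (Integer) actRec[0] : 0;  // restore or init
        n = n + 1;
        result = n;
        actRec = new Object[] { n };  // save state
        jumpTarget = 1;               // mark resume point
    }
}

class MobilityDemo {
    // Serialize to bytes, as if writing to a network channel.
    static byte[] freeze(Object o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(o);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserialize on the "receiving" side, ready to be resumed.
    static MobileCounter thaw(byte[] bytes) {
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (MobileCounter) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Invoking step() once, freezing, thawing, and invoking step() again continues from the saved state rather than restarting, because actRec and jumpTarget are serialized along with the object.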
5. Related Work and Other Approaches

Approaches to process mobility can be categorized as either transparent or non-transparent, sometimes termed strong and weak migration (mobility), respectively [2,6]. With non-transparent mobility, the programmer must explicitly provide the logic to suspend and resume the mobile process whenever necessary. Existing systems such as jcsp.mobile [3,4] already provide this functionality. Transparent mobility significantly eases the task of the programmer, but requires support from the run-time system which does not exist within the Java Virtual Machine. Some early approaches to supporting resumable programs in Java involved modification of the JVM itself [2]. In our view, however, one of the most important advantages of targeting the JVM is portability across the large installed base of Java runtime environments; therefore any approach that extends the JVM directly is of limited utility. Some success has been demonstrated using automated transformation of Java source code [6]. Due to the lack of support within the language for labeled gotos, this approach suffers from a proliferation of conditional guards and a corresponding increase in code size. Bytecode-only transformation methods targeting general thread resumability in Java are explored in [1] and [16]. These approaches require control flow analysis of the bytecode in order to generate code for the suspend point. Alternatively, the Kilim [15] actor-based framework uses a CPS bytecode transformation to support cooperatively scheduled lightweight threads (fibers) within Java. Another example of bytecode state capture can be found in the Java implementation of the object-oriented language Python (Jython [13]), where it supports generators, a limited form of co-routines [11]. This is perhaps the most similar to our implementation, even though generator functions are somewhat different in concept and application from ProcessJ procedures. We wish, however, to be able to utilize the existing Java compilers to produce optimized bytecode with our back-end. The process-oriented nature of ProcessJ allows us to adopt a simple hybrid approach that combines Java source and bytecode methods.

6. Conclusion

In this paper we have shown that a compiler for a process-oriented language can provide transparent mobility using the existing Java compiler tool chain with minimal modification. We developed a simple way to generate Java source code and rewrite Java bytecode to support resumability and, ultimately, process mobility for the ProcessJ language. We described the Java source code generated by the ProcessJ compiler, and also demonstrated how to rewrite the Java bytecode to save and restore local state between resumptions, as well as how to assure that execution continues with the same local state (but with possibly new parameter values) at the instruction following the previous suspension point.

7. Future Work

A number of interesting issues remain to be addressed. For ProcessJ, where we have channels, an interesting problem arises when assigning a parameter of channel end type to a local variable. If a local variable holds a reference to a channel end, and the process is suspended and sent to a different machine, the end of the channel now lives on a different physical machine. This is not a simple problem to solve; for occam-π, the pony [14] system addresses this problem.
One way to approach this problem is to include a channel server, much like the one found in JCSP.net [18], that keeps track of where channel ends are located; this is the approach we are working with for the MPI/C code generator. Mobile channels can be handled in the same way, but are outside the scope of this paper. Other issues that need to be addressed include how resource management is to be handled: if a mobile process contains references to (for example) open files that are not available on the JVM to which the process is sent, accessing such a file becomes impossible. We may wish to enforce certain kinds of I/O restrictions on mobile processes in order to more clearly define their behavior under mobility. With a little effort, the saving and restoring of state could be gathered at the beginning and the end of the method, saving some code/instructions, but for clarity we used the approach presented in this paper.

8. Acknowledgments

This work was supported by the UNLV President’s Research Award 2008/2009. We would like to thank the reviewers, who did a wonderful job in reviewing this paper. Their comments and suggestions have been valuable in producing a much stronger paper.
References

[1] Sara Bouchenak. Techniques for Implementing Efficient Java Thread Serialization. In ACS/IEEE International Conference on Computer Systems and Applications (AICCSA03), pages 14–18, 2003.
[2] Sara Bouchenak and Daniel Hagimont. Pickling Threads State in the Java System. In Third European Research Seminar on Advances in Distributed Systems, 2000.
[3] Kevin Chalmers and John Kerridge. jcsp.mobile: A Package Enabling Mobile Processes and Channels. In Jan Broenink, Herman Roebbers, Johan Sunter, Peter Welch, and David Wood, editors, Communicating Process Architectures 2005, pages 109–127, 2005.
[4] Kevin Chalmers, John Kerridge, and Imed Romdhani. Mobility in JCSP: New Mobile Channel and Mobile Process Models. In Alistair McEwan, Steve Schneider, Wilson Ifill, and Peter Welch, editors, Communicating Process Architectures 2007, pages 163–182, 2007.
[5] Jack Dongarra. MPI: A Message Passing Interface Standard. The International Journal of Supercomputers and High Performance Computing, 8:165–184, 1994.
[6] Stefan Fünfrocken. Transparent Migration of Java-based Mobile Agents – Capturing and Reestablishing the State of Java Programs. In Mobile Agents, pages 26–37. Springer Verlag, 1998.
[7] R. Hieb and R.K. Dybvig. Continuations and Concurrency. ACM Sigplan Notices, 25:128–136, 1990.
[8] C. A. R. Hoare. Communicating Sequential Processes. Communications of the ACM, 21(8):666–677, August 1978.
[9] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specification, 2nd Edition. Prentice Hall PTR, 1999.
[10] Robin Milner. Communicating and Mobile Systems: the π-Calculus. Cambridge University Press, 1999.
[11] Ana Lúcia De Moura and Roberto Ierusalimschy. Revisiting Coroutines. ACM Transactions on Programming Languages and Systems, 31:1–31, 2009.
[12] D.A. Nicole, M. Debbage, M. Hill, and S. Wykes. Southampton’s Portable Occam Compiler (SPOC). In A.G. Chalmers and R. Miles, editors, Proceedings of WoTUG 17: Progress in Transputer and Occam Research, volume 38 of Concurrent Systems Engineering, pages 40–55, Amsterdam, The Netherlands, April 1994. IOS Press. ISBN: 90-5199-163-0.
[13] Samuele Pedroni and Noel Rappin. Jython Essentials. O’Reilly Media, Inc., 2002.
[14] Mario Schweigler and Adam T. Sampson. pony – The occam-π Network Environment. In Peter Welch, Jon Kerridge, and Fred Barnes, editors, Communicating Process Architectures 2006, volume 64 of Concurrent Systems Engineering Series, pages 77–108, Amsterdam, The Netherlands, September 2006. IOS Press.
[15] S. Srinivasan and A. Mycroft. Kilim: Isolation-Typed Actors for Java. In Proceedings of the European Conference on Object Oriented Programming (ECOOP), pages 104–128. Springer, 2008.
[16] Eddy Truyen, Bert Robben, Bart Vanhaute, Tim Coninx, Wouter Joosen, and Pierre Verbaeten. Portable Support for Transparent Thread Migration in Java. In ASA/MA, pages 29–43. Springer Verlag, 2000.
[17] Peter H. Welch. Process Oriented Design for Java: Concurrency for All. In Hamid R. Arabnia, editor, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, volume 1, pages 51–57, Las Vegas, Nevada, USA, June 2000. CSREA Press. ISBN: 1-892512-52-1.
[18] Peter H. Welch, Jo R. Aldous, and Jon Foster. CSP Networking for Java (JCSP.net). Lecture Notes in Computer Science, 2330:695–708, 2002.
[19] Peter H. Welch and Paul D. Austin. Communicating Sequential Processes for Java (JCSP) Home Page. Systems Research Group, University of Kent, http://www.cs.kent.ac.uk/projects/ofa/jcsp.
[20] Peter H. Welch and Frederick R.M. Barnes. Communicating Mobile Processes: Introducing occam-π. In Ali E. Abdallah, Cliff B. Jones, and Jeff W. Sanders, editors, 25 Years of CSP, volume 3525 of Lecture Notes in Computer Science, pages 175–210. Springer Verlag, April 2005.
[21] Peter H. Welch, Jim Moores, Frederick R. M. Barnes, and David C. Wood. The KRoC Home Page. http://www.cs.kent.ac.uk/projects/ofa/kroc/.
[22] Peter H. Welch and Jan B. Pedersen. Santa Claus – with Mobile Reindeer and Elves. In Proceedings of Communicating Process Architectures, 2008.
[23] Peter H. Welch and David C. Wood. The Kent Retargetable occam Compiler. In Brian O’Neill, editor, Parallel Processing Developments, volume 47 of Concurrent Systems Engineering, pages 143–166, Amsterdam, The Netherlands, March 1996. World occam and Transputer User Group, IOS Press.
Communicating Process Architectures 2009
P.H. Welch et al. (Eds.)
IOS Press, 2009
© 2009 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-065-0-173
OpenComRTOS: A Runtime Environment for Interacting Entities

Bernhard H.C. SPUTH, Oliver FAUST, Eric VERHULST and Vitaliy MEZHUYEV
Altreonic; Gemeentestraat 61A bus 1; 3210 Linden, Belgium
{bernhard.sputh, oliver.faust, eric.verhulst, vitaliy.mezhuyev}@altreonic.com

Abstract. OpenComRTOS is one of the few Real-Time Operating Systems for embedded systems that was developed using formal modelling techniques. The goal was to obtain a proven dependable component with a clean architecture that delivers high performance on a wide variety of networked embedded systems, ranging from a single processor to distributed systems. The result is a scalable, reliable communication system with real-time capabilities. Moreover, the rigorous formal verification of the kernel algorithms led to an architecture with several properties that enhance the safety and real-time properties of the RTOS. The code size in particular is very small, typically 10 times smaller than that of an equivalent single-processor RTOS. The small code size allows much better use of the on-chip memory resources, which increases the speed of execution by reducing the wait states caused by the use of external memory. To date we have ported OpenComRTOS to the MicroBlaze processor from Xilinx, the Leon3 from ESA, the ARM Cortex-M3, the Melexis MLX16, and the XMOS. In this paper we concentrate on the MicroBlaze port, an environment where OpenComRTOS competes with a number of different operating systems, including the standard Xilinx Micro Kernel. This paper reports code size figures of OpenComRTOS on a MicroBlaze target. We found that this code size is considerably smaller than the published code sizes of other operating systems.

Keywords. OpenComRTOS, embedded systems, system engineering, RTOS
Introduction

Real-Time Operating Systems (RTOSs) are a key software module for embedded systems, often requiring properties of high reliability and safety. Unfortunately, most commercial as well as open source implementations cannot be verified or even certified, e.g. according to the DO-178B [1] or IEC 61508 [2] standards. Similarly, software engineering is often done in a non-systematic way, although well defined and established systems engineering processes exist [3,4]. The software is rarely proven to be correct, even though this is possible with formal model checkers [5]. In the context of a unified systems engineering approach [6], we undertook a research project in which we followed a stricter methodology, including formal model checking, to obtain a network-centric RTOS which can be used as a trusted component. The history of this project goes back to the early 1990s, when a distributed real-time RTOS called Virtuoso (Eonic Systems) [7] was developed for the INMOS transputer [8]. This processor had built-in support for concurrency as well as interprocess communication, and was enabled for parallel processing by way of four communication links. Virtuoso allowed such a network of processors to be programmed in a topology-transparent way. Later, the software evolved and was ported from single chip micro-controllers to systems with over
a thousand Digital Signal Processors, until the technology was acquired by Wind River, which removed it from the market after a few years. The OpenComRTOS project was motivated by the lessons learned from developing three Virtuoso generations. These lessons became part of the requirements. We list the most important ones:
• Scalability: The RTOS should support very small single processor systems, as well as widely distributed processing systems interconnected through external networks like the internet. To achieve this, the software components must be independent of the execution environment. In other words, it must be possible to map the software components onto the network topology.
• Heterogeneity: The RTOS should support systems which consist of multiple nodes with different CPU architectures. Naturally, different link technologies should be usable as well, ranging from low speed links such as RS232 up to high speed Ethernet links.
• Efficiency: The essence of multi-processor systems is communication. The challenge, from an RTOS point of view, is keeping the latency to a minimum while at the same time maximizing the performance. This is achieved when most of the critical code resides in the limited amount of on-chip memory.
• Small code size: This has a double benefit: a) performance and b) less complexity. Less complex systems have fewer potential sources of errors and side-effects.
• Dependability: As testing of distributed systems becomes very time consuming, it is mandatory that the system software can be trusted from the start. As errors typically occur in “corner cases”, the use of formal methods was deemed necessary.
• Maintainability and ease of development: The code needs to be clear and simple to facilitate the development of e.g. drivers; the latter have often been the weak point in system software.
OpenComRTOS provides a runtime environment which supports these requirements.
The remainder of this paper focuses on this runtime environment and its execution on a MicroBlaze target. But before we discuss OpenComRTOS in greater detail, we deduce two general points from the list of requirements. The scalability requirement imposes that data-communication is central in the RTOS architecture. The trustworthiness and maintainability aspects are addressed in the context of a systems engineering methodology. The use of common semantics during all activities is crucial, because only common semantics enable us to generate most of the implementation code from the modelling and simulation phase. Generated code is more trustworthy than handwritten code. The use of an “Interacting Entities” paradigm requires a runtime environment that supports concurrency and synchronization/communication in a native way between concurrent entities. OpenComRTOS is this runtime environment.

1. OpenComRTOS Architecture

Even with the problems mentioned above, Virtuoso was a successful product; the goal was to improve on its weaknesses. Its architecture had high performance, but was very hard to port and to maintain. Hence, for OpenComRTOS we adopted a layered architecture which is based on semantic layering. The lowest functionality level is limited to priority-based preemptive multitasking. On this level, Tasks exchange standardized Packets using an intermediate entity we call a Port. Two Tasks rendezvous by one Task sending a ‘put’ request and the other Task sending a ‘get’ request to the Port. The Port behaves similarly to the JCSP Any2AnyChannel [9]. Hence, Tasks can synchronise and communicate using Packets and Ports. The Packets are the essential workhorse of the system. They have header and data fields and are exclusively used for all services, rather than performing function calls or using jump tables. Hence, it becomes straightforward to provide services that operate in a transparent way across processor boundaries. In fact, for a Task it makes no difference in semantics whether a Port-Hub exists locally or on another node; the kernel takes care of routing the Packet to the node which holds the Port-Hub. Furthermore, Packets are very efficient, because kernel operations often come down to shuffling Packets around (using handlers) between system level data structures. At the next semantic level we added more traditional RTOS services like events, semaphores, etc. (see Table 2 for the included RTOS services). Finally, the architecture was kept simple and modular by developing kernel and drivers as Tasks. All these Tasks have a ‘Task input Port’ for accepting Packets from other Tasks. This has some unusual consequences, such as: a) the possibility to process interrupts received on one processor on another processor, b) the kernel having a lower priority than the drivers, or even c) having multiple kernel Tasks on a single node.

Figure 1. Open License Society: the unified view.

1.1. Systems Engineering Approach

The Systems Engineering approach of the Open License Society [6], outlined in Figure 1, is a classical one as defined in [3,4], but adapted to the needs of embedded software development. It is first of all an evolutionary process using continuous iterations. In such a process, much attention is paid to an incremental development requiring regular review meetings by several of the stakeholders. On an architectural level, the system or product under development is defined under the paradigm of “Interacting Entities”, which maps very well onto an RTOS-based runtime system. Applied to the development of OpenComRTOS, the process was started by elaborating an initial set of requirements and specifications. Next, an initial architecture was defined. From this point onwards, two groups started to work in parallel.
The first group worked out an architectural model, while the second group developed initial formal models using TLA+/TLC [10]. These models were incrementally refined. Note that no real attempt was made to model the complete system at once. This is not possible in a generic way, because formal TLA models cannot be parametrised: one must model a specific set of tasks and services, which very quickly leads to a state space explosion that limits the achievable complexity of such models. Hence, we modelled only specific parts; e.g. a model was built for each class of services (Ports, Events, Semaphores, etc.). This was sufficient, and has the benefit of yielding very clean, orthogonal models. Due to
the orthogonality of the models, there is no need to model the complete system, which has the big advantage that the models can be developed by different teams. At each review meeting between the software engineers and the formal modelling engineer, more details were added to the models, the models were checked for correctness, and a new iteration was started. This process was stopped when the formal models were deemed close enough to the implementation architecture. Next, a simulation model was developed on a PC (using Windows NT as a virtual target). This code was then ported to a real 16-bit micro-controller in the form of the MLX16 from Melexis, who at the time were sponsoring the development of OpenComRTOS. The MLX16 is a proprietary micro-controller used by Melexis to develop application-specific ICs; it has up to 2 KiB RAM and 32 KiB Flash. On this target a few specific optimizations were performed during the implementation, while fully maintaining the design and architecture. The software was written in ANSI C and verified for safe coding practices with a MISRA rule checker [11].

1.2. Lessons Learnt from Using Formal Modelling

The goal of using formal techniques is the ability to prove that the software is correct; this is an often heard claim from the formal techniques community. A first surprise was that each model gave no errors when verified by the TLC model checker. This is actually due to the iterative nature of the model development process, and partly its strength. From an initially rather abstract model, successive models are developed by checking them with the model checker, and hence each model is correct when the model checker finds no illegal states. As such, model checkers cannot prove that the software is correct; they can only prove that the formal model is correct. For a complete proof of the software, the whole programming chain as well as the target hardware would have to be modelled and verified.
This is an unachievable goal due to its complexity and the resulting state space explosion; nevertheless, it was attempted in the Verisoft project [12]. The model itself would be many times larger than the developed software. This indicates that if we could make use of verified target processors and verified programming language compilers, model checking would become practical, because it would be limited to modelling the application. Other issues related to formal modelling were also discovered. A first issue is that the TLC model checker treats every action as a critical section, whereas, e.g. in the case of an RTOS, many components operate concurrently, and real-time performance dictates that on a real target the critical sections are kept as short as possible. This forced us to avoid shared data structures. However, it would be helpful to have formal model assistance that indicates the required critical sections.

1.3. Benefits Obtained from Using Formal Modelling

As outlined above, the use of formal modelling was found to result in a much better architecture. This benefit results from successive iteration and review of the model. Another reason for the better architecture is the fact that formal model checkers provide a higher level of abstraction than the implementation. In the project we found that the semantics associated with specific programming terms involuntarily influence choices made by the architecture engineer. An example was the use of both waiting lists and Port buffers, which is one of the main concepts of OpenComRTOS. A waiting list is associated with just one waiting action, but one overlooks the fact that it also provides buffering behaviour. Hence, one waiting list is sufficient, resulting in a smaller and cleaner architecture. Formal modelling and abstract levels have helped to introduce, define, and maintain orthogonal architectural concepts. Orthogonality is key to small and safe, i.e. reliable, designs.
Similarly, even if there was a short learning curve to master the mathematical notation of TLA, with hindsight this was an advantage over e.g. SPIN [13], which uses a C-like syntax. The latter automatically leads to thinking in terms of implementation code with all its details, whereas the abstraction of TLA helps to think in more abstract terms. This also highlights the importance of specifying first, before implementation is started. A final observation is that using formal modelling techniques turned out to be a much more creative process than the mathematical framework suggests. TLA/TLC as such was primarily used as an architectural design tool, aiding the team in formulating ideas and testing them in a rather abstract way. This proved to be teamwork with lots of human interaction between the team members. The formal verification of the RTOS itself was basically a side-effect of building and running the models. Hence, this project has shown how a combination of teamwork with extensive peer review, formal modelling support, and a well defined goal can result in a “correct-by-design” product.

1.4. Novelties in the Architecture

Figure 2. OpenComRTOS-L0 view.

OpenComRTOS has a semantically layered architecture. Table 1 provides an overview of the available services at the different levels. At the lowest level, the minimum set of Entities provides everything that is needed to build a small networked real-time application. The Entities needed are Tasks (having a private function and workspace) and Interacting Entities, called Ports, to synchronize and communicate between the Tasks (see Figure 2). Ports act like channels in the tradition of Hoare’s CSP [14], but they allow multiple waiters and asynchronous communication. One of the Tasks is a Kernel Task which schedules the other Tasks in order of priority and manages Port-based services. Driver Tasks handle inter-node communication. Pre-allocated as well as dynamically allocated Packets are used as carriers for all activities in the RTOS, such as service requests to the kernel, Port synchronization, data-communication, etc. Each Packet has a fixed size header and a data payload with a user defined but global data size.
This significantly simplifies the Packet management, particularly at the communication layer. A router function also transparently forwards Packets in order of priority between the network nodes. The priority of a Packet is the same as the priority of the Task from which the Packet originates. In the next semantic level services and Entities were added, similar to those which can be found in most RTOSs: Boolean events, counting semaphores, FIFO queues, resources,
178
B.H.C. Sputh et al. / OpenComRTOS
memory pools, etc. The formal modelling leads to the definition of all these Entities as semantic variants of a common and generic entity type. We called this generic entity a "Hub". In addition, the formal modelling also helped to define "clean" semantics for such services, whereas ad-hoc implementations often have side-effects. Table 2 summarises the semantics.

Table 1. Overview of the available Entities on the different Layers.

Layer  Available Entities
L0     Task, Port
L1     Task, Hub based implementations of: Port, Boolean Event, Counting Semaphore, FIFO Queue, Resource, Memory Pool
L2     Mobile Entities: all L1 entities moveable between Nodes.
Table 2. Semantics of L1 Entities.

L1 Entity           Semantics
Event               Synchronisation on a Boolean value.
Counting Semaphore  Synchronisation with counter allowing asynchronous signalling.
Port                Synchronisation with exchange of a Packet.
FIFO queue          Buffered communication of Packets. Synchronisation when queue is full or empty.
Resource            Event used to create a logical critical section. Resources have an owner Task when locked.
Memory Pool         Linked list of memory blocks protected with a resource.
Table 3. Service synchronisation variants.

"Single-phase" services:
NW  Non-Waiting: when the matching filter fails, the Task returns with RC_Failed.
W   Waiting: when the matching filter fails, the Task waits until such an event happens.
WT  Waiting with a time-out: waiting is limited in time, defined by the time-out value.

"Two-phase" services:
Async  Asynchronous: when the entity is compatible with it, the Task continues independently of success or failure and will resynchronize later on. This class of services is called "two-phase" services.
The services are offered in a non-blocking variant (NW), a blocking variant (W), a blocking variant with time-out (WT), and an asynchronous variant (A) for services where this is applicable (currently in development). All services are topology transparent and there is no restriction on the mapping of Tasks and kernel Entities onto the network. See Tables 2 and 3 for details on the semantics. Using a single generic entity leads to more code reuse; as a result, the code size is at least 10 times smaller than for an RTOS with a more traditional architecture. One could of course remove all such application-oriented services and just use Hub-based services. Unfortunately, this has the drawback that services lose their specific semantic richness; e.g. resource locking clearly expresses that the Task enters a critical section in competition with other Tasks. Also, erroneous runtime conditions, like raising an event twice (with loss of the previous event), are easier to detect at application level than when only a generic Hub is used. During the formal modelling process, we also discovered weaknesses in the traditional way priority inheritance is implemented in most RTOSs. Fortunately, we found a way to reduce the total blocking time. In single-processor RTOS systems this is less of an issue, but in multi-processor systems all nodes can originate service requests, and resource locking is a distributed service. Hence, the waiting lists can grow longer and lower-priority Tasks can block higher-priority ones while waiting for the resource. This was solved by postponing the resource assignment until the rescheduling moment. Finally, by generalization, memory allocation has also been approached as a resource locking service. In combination with the Packet Pool, this opens new possibilities for safe and secure memory management; e.g. the OpenComRTOS architecture is free from buffer overflow by design. For the third semantic layer (L2), we plan to add dynamic support like mobility of code and of kernel Entities. A potential candidate is a light-weight virtual machine supporting capabilities as modelled in pi-calculus [15]. This is the subject of further investigations and will be reported in subsequent papers.

1.5. Inherent Safety Support

Because the L1 semantic layers are all statically linked, an application-specific image is generated by the tool-chain during the compilation process. As we don't consider security risks for the moment, our concern is limited to verifying whether or not the code is inherently safe. A first level of safety is provided by the formal modelling approach. Each service is intensively modelled and verified, with most "corner cases" detected at design time, prior to writing code. A second level is provided by the kernel services themselves. All services have well-defined semantics. Even when used asynchronously, the services become synchronous when resources are depleted. At such moments a Task is forced to wait, which allows other Tasks to proceed and free up resources (like Packets, buffer space, etc.); hence the system becomes "self-throttling".
A third level is provided by the data structures, mostly based on Packets. All single-phase services use statically allocated Packets which are part of the Task context. These Packets are used for service requests, even when going across processor boundaries. They also carry return values. For two-phase services, Packets must be allocated from a Packet Pool. When the Pool is empty, the system will start to throttle until Packets are released. Another specific architectural feature is that the system can never run out of space to store requests, because there is a strict limit on how many requests there can be in the system (the number of Packets). All queues are represented by linked lists, and each Packet contains the necessary header information; therefore no buffers are required to handle requests, which consequently cannot overflow. In the worst case, the application programmer has defined too few Packets in the Pool, and the buffers simply stop growing when all Packets are in use. A last level is the programming environment. All Entities are defined statically, so they are generated together with all other system-level data structures by a tool; hence no Entities can be created at runtime. Of course, dynamic support at L2 will require extra support. However, this can only be achieved reliably with hardware support, e.g. to provide protected memory spaces. The same applies to using stack spaces. In OpenComRTOS, interrupts are handled on a private and separate stack, so that the Tasks' stack spaces are not affected. On the MLX16 such a space can be protected, but it is clear that such an inexpensive mechanism should become the norm for all embedded processors. A full MMU is not only too complex and too large, it is simply not necessary. The kernel has various threshold detectors and provides support for profiling, but the details are outside the scope of this paper.

2.
OpenComRTOS on Embedded Targets

Porting OpenComRTOS to the MicroBlaze soft processor was the first major work done by Altreonic. One reason for choosing the MicroBlaze CPU as a first target was the prior experience of the new team with the MicroBlaze environment [16,17].

Figure 3. Hardware setup of the test system.

This section compares the MicroBlaze port with the port of OpenComRTOS to the MLX16. It also gives performance and code size figures for the other available ports of OpenComRTOS. The MicroBlaze soft processor is realised in a Field Programmable Gate Array (FPGA). FPGAs are emerging as an interesting design alternative for system prototyping and implementation for critical applications when the production volume is low [18]. We realised the target architecture with the Xilinx Embedded Development Kit 9.2 and synthesized it with Xilinx ISE version 9.2 on an ML403 board with a Virtex-4 XC4VFX12 FPGA clocked at 100 MHz. Our architecture, shown in Figure 3, is composed of one MicroBlaze processor connected to a Processor Local Bus (PLB). The PLB provides access to the TIMER and GPIO peripherals. The TIMER is used to measure the time it takes to execute context switches. The GPIO was used for basic debugging. The processor uses local memory to store the code and data of the Tasks it runs. This memory is implemented through Block RAMs (BRAMs). The MicroBlaze Debug Module (MDM) enables remote debugging of the MicroBlaze processor.

2.1. Code Size Figures

This section reports the code size figures of OpenComRTOS on the MicroBlaze target. To put these figures into perspective we did two things. First, the OpenComRTOS code size figures on the MicroBlaze target are compared with those on the other targets. To this point we have ported OpenComRTOS to the MicroBlaze processor from Xilinx [19], the Leon3 as used by ESA [20], the ARM Cortex-M3 [21], and the XMOS XS1-G4 [22]. The second comparison concerns the code size figures for a simple semaphore example, implemented using a) the Xilinx Micro-Kernel (XMK) and b) OpenComRTOS. The latter comparison is the more important one, because we can show that the OpenComRTOS version needs only 75% of the code size (.text segment) of the XMK version to achieve the same functionality.
Table 4 reports the code size figures of the individual L1 services for all the different targets we support. The entry 'Total L1 Services' is just the sum of the individual code sizes. The entry 'L1 Hub shared' represents the code necessary to achieve the functionality of the Hub, upon which all other L1 services depend. This explains why adding the Port functionality requires only 4-8 bytes of additional code. In general the code size figures are lower for the MLX16, ARM Cortex-M3 and XMOS due to their 16-bit instruction sets; both the MicroBlaze and the Leon3, in contrast, use a 32-bit instruction set. Even among the targets with 16-bit instruction sets we can see large differences in code size. One reason for this is the number of registers these targets have: the MLX16 has only four registers which need to be saved during a context switch, whereas the XMOS port has to save 13 registers during a context switch. This also has an impact on the performance figures, which are shown in Table 6.

2.1.1. Comparing OpenComRTOS Against the Xilinx Micro-Kernel

The Xilinx Micro-Kernel (XMK) is an embedded operating system from Xilinx for its MicroBlaze and PPC405 cores. In this section we compare the size of a comparable application example between OpenComRTOS and XMK for the MicroBlaze target. A complete comparison of code size figures between XMK and OpenComRTOS is not possible, because these
Table 4. OpenComRTOS L1 code size figures (in Bytes) obtained for our different ports.

Service            MLX16  MicroBlaze  Leon3  ARM   XMOS
L1 Hub shared      400    4756        4904   2192  4854
L1 Port            4      8           8      4     4
L1 Event           70     88          72     36    54
L1 Semaphore       54     92          96     40    64
L1 Resource        104    96          76     40    50
L1 FIFO            232    356         332    140   222
L1 PacketPool      NA     296         268    120   166
Total L1 Services  1048   5692        5756   2572  5414
Figure 4. Semaphore loop example project.
operating systems offer different services. However, to give an indication of the code size efficiency, we implemented a simple application based on two services both OSs offer. Figure 4 shows two tasks (T1, T2) which exchange messages and synchronise on two semaphores (S1, S2). In both cases 1 KiB stacks were defined, which is the default size for XMK; for the OpenComRTOS implementation 512 bytes would have been more than enough. Table 5 shows that the complete OpenComRTOS program requires about 15% less memory than the XMK version. This is an important result, because with OpenComRTOS there is more RAM available for user applications. This is particularly important when, either for speed reasons or for PCB size constraints, the complete application has to run in internal (BRAM) memory.

Table 5. XMK vs. OpenComRTOS code size in bytes.

OS           .text  .data  .bss   total
XMK          12496  348    7304   20148
OpenComRTOS  9400   1092   6624   17116
2.2. Performance Figures

The performance figures were evaluated by measuring the loop time. We define this loop time as the time a particular target takes to complete one loop of the semaphore loop example. The resulting measurement values allow us to compare the performance of OpenComRTOS on different target platforms. OpenComRTOS abstracts the hardware from the application programmer; therefore the application source code executed by the individual targets stays the same. To show how compact OpenComRTOS application code is, Listings 1 and 2 show the source code of the semaphore loop example which was used to measure the loop time figures. Listing 1 shows the code for the function T1, which represents task T1. The arguments of the function call are not used. Line 2 defines three variables of type 32-bit unsigned integer. All the work is done within the infinite loop starting at Line 3. In Line 4 the number of elapsed processor cycles is stored in the start variable. The code block from Line 5 to 8 signals semaphore 1 (S1) 1000 times and tests semaphore 2 (S2) also 1000 times; for the semantics of L1_SignalSemaphore and L1_TestSemaphore see Table 2. In Line 9 the elapsed processor cycles are stored in the stop variable.
1  void T1(L1_TaskArguments Arguments){
2      L1_UINT32 i=0, start=0, stop=0;
3      while(1){
4          start = L1_getElapsedCycles();
5          for(i = 0; i < 1000; i++){
6              L1_SignalSemaphore_W(S1);
7              L1_TestSemaphore_W(S2);
8          }
9          stop = L1_getElapsedCycles();
10     }
11 }

Listing 1. Source code for task T1.
1  void T2(L1_TaskArguments Arguments){
2      while(1){
3          L1_TestSemaphore_W(S1);
4          L1_SignalSemaphore_W(S2);
5      }
6  }

Listing 2. Source code for task T2.
For completeness, Listing 2 shows the source code of the function T2, which represents task T2. Similarly to T1, the task is represented by a function whose parameters are not used. Within the while-loop, from Line 2 to 5, semaphore S1 is tested before semaphore S2 is signaled. Both calls are blocking, as indicated by the postfix '_W'; see Table 3. After having obtained the start and stop values for all the targets, we use the following equation to calculate the loop time:

    Loop time = (stop - start) / (Clock speed × 1000)    (1)
This equation does not take into account the overhead of reading the elapsed clock cycles or of the loop implementation. This overhead is negligible compared with the processing time for signalling and testing the semaphores. Table 6 reports the measured loop times for the different targets. Each run of the loop requires eight context switches, because the Semaphores are accessed in the kernel context: any access to a Semaphore requires a switch into the kernel context and afterwards a switch back to the requesting task.

Table 6. OpenComRTOS loop times obtained for our different ports.

                 MLX16      MicroBlaze  Leon3       ARM         XMOS
Clock speed      6 MHz      100 MHz     40 MHz      50 MHz      100 MHz
Context size     4 × 16bit  32 × 32bit  32 × 32bit  16 × 32bit  14 × 32bit
Memory location  internal   internal    external    internal    internal
Loop time        100.8 μs   33.6 μs     136.1 μs    52.7 μs     26.8 μs
The loop times expose the differences between the individual architectures. What stands out is the performance of the MLX16 (which ran a stripped-down version of OpenComRTOS): despite its low clock speed of only 6 MHz, it is faster than the Leon3 running at more than six times that clock frequency. One of the main reasons is that the MLX16 only has to save and restore 4 16-bit registers during a context switch, compared to 32 32-bit registers in the case of the Leon3. Furthermore, the Leon3 uses only external memory, whereas all other targets use internal memory.

3. Conclusions

The OpenComRTOS project has shown that even for software domains which are often associated with 'black art' programming, formal modelling works very well. The resulting software is not only very robust and maintainable but also respectably compact and fast. It is also inherently safer than standard implementation architectures. Its use, however, must be integrated with a global systems engineering approach, because the process of incremental development and modelling is as important as using the formal model checker itself. The use of formal modelling has resulted in many improvements of the RTOS properties. The previous section analysed two distinct RTOS properties, namely code size and speed. With a code size as low as 1 KiB, a stripped-down version of OpenComRTOS fits in the memory of most embedded targets. When more memory is available, the full kernel fits in less than 10 KiB on many targets. Compared with the Xilinx Micro-Kernel, OpenComRTOS has about 75% of the code size. The loop time measurements brought out the differences between the individual target architectures. In general, however, the measured loop times confirm that OpenComRTOS performs well on a wide variety of possible targets.

Acknowledgments

The OpenComRTOS project is partly funded under an IWT project for the Flemish Government in Belgium. The formal modelling activities were provided by the University of Gent.

References

[1] RTCA.
DO-178B Software Considerations in Airborne Systems and Equipment Certification, January 1992.
[2] ISO/IEC. TR 61508 Functional Safety of electrical / electronic / programmable electronic safety-related systems, January 2005.
[3] The International Council on Systems Engineering (INCOSE) aims to advance the state of the art and practice of systems engineering. www.incose.org.
[4] Andreas Gerstlauer, Haobo Yu, and Daniel D. Gajski. RTOS Modeling for System Level Design. In DATE03, page 10130, Washington, DC, USA, 2003. IEEE Computer Society.
[5] Formal Systems (Europe) Ltd. Failures-Divergence Refinement: FDR Manual. http://www.fsel.com/fdr2 manual.html.
[6] The Open License Society researches and develops a systematic systems engineering methodology based on interacting entities and trustworthy components. www.openlicensesociety.org.
[7] Eonic Systems. Virtuoso: The Virtual Single Processor Programming System User Manual. Available at: http://www.classiccmp.org/transputer/microkernels.htm.
[8] M. D. May, P. W. Thompson, and P. H. Welch, editors. Networks, Routers and Transputers: Function, Performance and Applications. IOS Press, Amsterdam, Netherlands, 1993.
[9] P.H. Welch. Process Oriented Design for Java: Concurrency for All. In H.R. Arabnia, editor, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA2000), volume 1, pages 51-57. CSREA Press, June 2000.
[10] Leslie Lamport. Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2002.
[11] The MISRA Guidelines provide important advice to the automotive industry for the creation and application of safe, reliable software within vehicles. http://www.misra.org.
[12] Eyad Alkassar, Mark A. Hillebrand, Dirk Leinenbach, Norbert W. Schirmer, and Artem Starostin. The Verisoft Approach to Systems Verification. In Jim Woodcock and Natarajan Shankar, editors, VSTTE 2008, Lecture Notes in Computer Science, Toronto, Canada, October 2008. Springer.
[13] Gerard J. Holzmann. The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley Professional, September 2003.
[14] C. A. R. Hoare. Communicating Sequential Processes. Commun. ACM, 21(8):666-677, 1978.
[15] Robin Milner. Communicating and Mobile Systems: the Pi-Calculus. Cambridge University Press, June 1999.
[16] Bernhard Sputh, Oliver Faust, and Alastair R. Allen. Portable CSP Based Design for Embedded Multi-Core Systems. In Communicating Process Architectures 2006, September 2006.
[17] Bernhard Sputh, Oliver Faust, and Alastair R. Allen. A Versatile Hardware-Software Platform for In-Situ Monitoring Systems. In Alistair A. McEwan, Wilson Ifill, and Peter H. Welch, editors, Communicating Process Architectures 2007, pages 299-312, July 2007.
[18] Antonino Tumeo, Marco Branca, Lorenzo Camerini, Marco Ceriani, Matteo Monchiero, Gianluca Palermo, Fabrizio Ferrandi, and Donatella Sciuto. A Dual-Priority Real-Time Multiprocessor System on FPGA for Automotive Applications. In DATE08, pages 1039-1044. IEEE, 2008.
[19] Xilinx. MicroBlaze Processor Reference Guide. http://www.xilinx.com.
[20] Gaisler Research AB. SPARC V8 32-bit Processor LEON3 / LEON3-FT CompanionCore Data Sheet. http://www.gaisler.com/cms/.
[21] ARM. An Introduction to the ARM Cortex-M3 Processor. http://www.arm.com/.
[22] XMOS. XS1-G4 Datasheet 512BGA. http://www.xmos.com/.
Communicating Process Architectures 2009
P.H. Welch et al. (Eds.)
IOS Press, 2009
© 2009 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-065-0-185
185
Economics of Cloud Computing: a Statistical Genetics Case Study

Jeremy M. R. MARTIN a, Steven J. BARRETT a, Simon J. THORNBER a, Silviu-Alin BACANU a, Dale DUNLAP b, Steve WESTON c

a GlaxoSmithKline R&D Ltd, New Frontiers Science Park, Third Avenue, Harlow, Essex, CM19 5AW, UK
b Univa UD, 9737 Great Hills Trail, Suite 300, Austin, TX 78759, USA
c REvolution Computing, One Century Tower, 265 Church Street, Suite 1006, New Haven, CT 06510, USA

Abstract. We describe an experiment which aims to reduce significantly the costs of running a particular large-scale grid-enabled application using commercial cloud computing resources. We incorporate three tactics into our experiment: improving the serial performance of each work unit, seeking the most cost-effective computation cycles, and making use of an optimized resource manager and scheduler. The application selected for this work is a genetics association analysis and is representative of a common class of embarrassingly parallel problems.

Keywords. Cloud Computing, Grid Computing, Statistical Genetics, R
Introduction

GlaxoSmithKline is a leading multinational pharmaceutical company which has a substantial need for high-performance computing resources to support its research into creating new medicines. The emergence of affordable, on-demand, dynamically scalable hosted computing capabilities (commonly known as 'Cloud Computing' [1]) is of interest to the company: potentially it could allow adoption of an 'elastic' grid model whereby the company runs a reduced internal grid with the added capability to expand seamlessly on demand by harnessing external computing infrastructure. As these clouds continue to accumulate facilities comprising powerful parallel scientific software [2] and key public databases for biomedical research [3], they will become increasingly attractive places to support large, collaborative, in silico projects. This will improve as efforts [4,5] to evaluate and identify the underlying benefits continue within the commercial and public science sectors. In this article we explore the economic viability of transferring a particular Statistical Genetics application from the GSK internal PC grid to an external cloud. The application is written in the R programming language. One run of this application requires 60,000 hours of CPU time on the GSK desktop grid, split across 250,000 jobs which require approximately fifteen minutes of CPU time each. Each job uses approximately 50MB of memory and performs negligible I/O. The GSK desktop grid comprises the entire fleet of R&D desktop PCs within GSK: these are Pentium machines with an average speed of about 3 GHz. At any given time up to 1500 of these devices may be used concurrently, harvested for spare compute cycles using the Univa UD Grid MP system. Based on raw performance, each node in this grid is roughly 3 times faster than a 'small' instance on Amazon EC2 [6].
This suggests that the entire application would require approximately 180,000 hours of CPU time on the Amazon EC2 cloud, which implies that the naïve cost of running the
J.M.R. Martin et al. / Economics of Cloud Computing
application on the EC2, without any optimization, would be around $18,000 ($0.10 per CPU hour). This figure is too high to be attractive when compared with the running costs of the internal GSK grid infrastructure. However, if we were able to reduce that price by an order of magnitude, it would open up new possibilities for GSK to run an economical elastic grid. In order to tackle this problem we have established a three-way collaboration between GSK; Univa UD, a pioneering company in the area of grid and cloud computing resource management and scheduling; and REvolution Computing, which specialises in high-performance and parallel implementations of the R programming language. Our cost reduction strategy is based on three tactics:

• Pursuit of low-cost computing resources
• Reduction of serial execution time
• Efficient scheduling and resource management
The rest of the paper is organized as follows. In Section 1 we describe the Statistical Genetics grid application which we are using for this experiment. Section 2 gives a brief introduction to the R programming language. In Section 3 we review the current status of cloud computing. Section 4 provides a detailed explanation of our methods, and the results are presented in Section 5. Finally, we review the results and make predictions for the future uptake of cloud computing in science and technology.

1. Development of Methods for Genetic Analysis Using Simulation

Discovery of genetic variants responsible for certain diseases or adverse drug reactions is currently a high priority for both governmental and private organizations. As described in Bacanu et al. [7], the tools for discovery consist of statistical methods for detecting association between the phenotype of interest and biomarkers (genotypes in this case). While these methods are general in spirit, the type of data they are applied to is unfortunately very specific. For instance, the data for a very common "genome scan" consists of around one million or more variables (markers) that can be highly correlated regionally while, at the same time, the correlation structure is highly irregular and its magnitude varies widely even between adjacent genomic regions. Due to the size and irregularity of genetic data, the performance of association methods has to be assessed using very large simulation designs with very time-consuming individual simulations. The computational complexity of the problems is further compounded when the number of available subjects is small; under these circumstances asymptotic statistical theory cannot be used and the statistical significance has to be assessed via computationally intensive permutation tests. Consequently, to successfully develop and apply methods, statistical genetics researchers need extensive computational power such as very large clusters, grid or cloud computing.
Association analysis is the statistical technique for linking the presence of a particular genetic mutation in subjects to the susceptibility of suffering from a particular disease. To assess the presence of associations, subjects in study cohorts are divided into cases (those with the disease) and controls (those without). The DNA of each subject is genotyped at particular markers where relatively common mutations are present. The combination of genotype and phenotype (disease symptom) data is then analyzed to arrive at a statistical significance (for association) at each studied marker. A typical design for testing association methods involves running many instances (sometimes more than 250,000) of an identical program, each with different input data or, if
permutations are necessary, a randomly permuted copy of the input data. Each instance produces, for each marker, a single data point within a simulated probability distribution. Once all the instances have completed, the resultant probability distribution is used to approximate the statistical significance of the actual data, and hence whether there is a significant link between a particular genetic marker and the disease under study.
2. The R Programming Language

The R programming language [8] is a popular public-domain tool for statistical computing and graphics. R is a very high-level, function-based language and is supported by a vast online library of hundreds of validated statistical packages [9]. Because R is both a highly productive environment and freely available, it appeals both to academics who are teaching the next generation of statisticians and to those who are developing new 'state-of-the-art' statistical algorithms, which are becoming increasingly complex as methods grow progressively more sophisticated. The flip-side of this level of usefulness and programming productivity is that R programs may take a long time to execute when compared with equivalent programs written in low-level languages like C. However, the additional learning and effort required to develop complex numerical and statistical applications using C is not a realistic option for most statisticians, so there have been many initiatives to make R programs run faster. These fall into three general categories:

1. Task farm parallelisation: running a single program many times in parallel with different data across a grid or cluster of computers. This is the approach that has been used for the genetics application of this project: using the GSK desktop grid, a program which would take seven years to run sequentially can complete in a few days.

2. Explicit parallelisation in the R code using MPI or parallel loop constructs (e.g. R/Parallel [10]). The drawback of this approach is that it requires the R programmer to have some understanding of concurrency in order to use the parallel constructs or functions within the program.

3. Speeding up the performance of particular functions by improved memory handling, or by using multithreaded or parallelised algorithms 'beneath the hood' to accelerate particular R functions, e.g. REvolution R, Parallel R [11] or SPRINT [12]. With this approach the programmer does not need to make any changes to the code: the benefits are automatic. However, at present only certain R functions have been optimised in this way, so this approach will only work for codes which make significant use of those functions.

For the purpose of this experiment we will be using a combination of approach 1 with approach 3. The code has already been enabled to run as a collection of 250,000 independent work units (as described above). By also aiming to reduce the serial execution time of the code (by improved memory handling rather than by multithreading or parallelised algorithms) we hope to reduce our overall bill for the use of cloud virtual machines.
188
J.M.R. Martin et al. / Economics of Cloud Computing
3. Cloud Computing

Cloud computing is a style of computing in which dynamically scalable and virtualized resources are provided as a service over the Internet. Major cloud computing vendors, such as Amazon, HP and Google, can provide a highly reliable, pay-on-demand, virtual infrastructure to organizations. This is attractive to small commercial organizations which do not wish to invest heavily in internal IT infrastructure, and which need to adjust the power of their computing resources dynamically, according to demand. It may also be attractive to large global companies, such as pharmaceutical companies [13], who will potentially be able to set their internal computing capacity to match the demands of day-to-day operations, with an elastic capability to expand onto clouds to deal with spikes of activity. In either case the economics of cloud computing will be crucial to the likelihood of uptake [14]. As well as economic considerations, cloud computing comes with a number of additional complications, and there exists a certain amount of skepticism in the computing world about its true potential. Armbrust et al. [15] list ten obstacles to the adoption of cloud computing, as follows:

1. Availability of Service. Risks of system downtime might be unacceptable for business-critical applications.

2. Data Lock-In. Each cloud vendor currently has its own proprietary API and there is little interoperability, which prevents customers from easily migrating applications between clouds.

3. Data Confidentiality and Auditability. Cloud computing offerings are essentially public rather than private networks and so are more exposed to security attacks than internal company resources. There are also requirements for system auditing in certain areas of business (such as pharmaceuticals) which need to be made feasible on cloud virtual machines.

4. Data Transfer Bottlenecks. Migrating data to clouds is currently costly (around $150 per terabyte) and may also be very slow.
5. Performance Unpredictability. Significant variations in performance have been observed between cloud virtual machine instances of the same type from the same vendor. This is a common problem with virtualisation and can be a particular issue for high-performance computing applications which are designed to be load-balanced across homogeneous compute clusters.
6. Scalable Storage. Providing high-performance, shared storage across multiple cloud virtual machines is still considered to be an unsolved problem.
7. Bugs in Large Distributed Systems. Debugging distributed applications is notoriously difficult. Adding the cloud dimension increases the complexity and will require the development of specialised tools.
8. Scaling Quickly. Cloud vendors generally charge according to the quantity of resources that are reserved: virtual CPUs, memory and storage, even if they are not currently in use. Therefore a system which can scale the reserved resources up and down quickly according to actual usage, hence minimizing costs, would be very attractive to customers.
9. Reputation Fate Sharing. One customer's bad behaviour could potentially tarnish the reputation of all customers of a cloud, possibly resulting in blacklisting of IP addresses.
10. Software Licensing. Current licensing agreements tend to be node-locked, assigned to named users, or floating concurrent licenses. What is needed for cloud computing is for software vendors to charge by the CPU hour.

Each of these matters needs to be carefully reviewed when a company makes a decision whether to adopt cloud computing. They are all subject to ongoing technical research within the cloud computing community. Scientific software standards also need to be consistently adhered to, and specific issues such as random number generation addressed. For the Genetics application described here, GSK is able to avoid data confidentiality issues by taking steps to make the data anonymous. Data transfer bottlenecks and scalable storage are not major issues here, as this problem is highly compute-bound. The concern of data lock-in is to be handled by Univa UD's UniCloud product, which provides a standard interface whichever cloud vendor is used. The R programming environment is open source, so there are no associated license costs.

4. Tactics for Reducing Cost

The main objective of our experiment is to reduce the cost of running the GSK Statistical Genetics application by an order of magnitude from our starting estimate of $18K. We are using three separate, largely independent mechanisms to achieve this.

4.1 Pursuit of Low-Cost Computing Resources

Cloud vendors usually offer several different varieties of virtual computer, with differing levels of resources and correspondingly different costs. We will be trialling a number of these from two vendors, Amazon and Rackspace, as listed in Tables 1 and 2. (Note that R is platform neutral: it runs on both Windows and Linux.)

Table 1. Amazon EC2 instance types.
Instance          Memory  EC2 Compute Units  Storage  Architecture  Price per CPU hour (Linux)
Standard Small    1.7GB    1                 160GB    32-bit        $0.10
Standard Large    7.5GB    4                 850GB    64-bit        $0.40
Standard XL       15GB     8                 1690GB   64-bit        $0.80
High CPU Medium   1.7GB    5                 350GB    32-bit        $0.20
High CPU XL       7GB     20                 1690GB   64-bit        $0.80
Table 2. Rackspace cloud instance types.
Memory   Storage  Price per CPU hour (Linux)
256MB    10GB     $0.015
512MB    20GB     $0.03
1024MB   40GB     $0.06
2048MB   80GB     $0.12
Note that the CPU power of each Amazon EC2 instance is given as a multiple of their so-called 'compute unit', which is claimed to be equivalent to a 1.0-1.2 GHz 2007 Opteron processor [6]. The CPU performance of Rackspace cloud instances, however, is not provided on their website [16] and so, for the purpose of our project, has been determined by running benchmarks.

4.2 Reduction of Serial Execution Time

Given our starting estimate of $18K to run our application on a cloud, it would be well worth investing effort in optimizing the serial performance of each of the 250,000 work units. This can be achieved in two ways: either by performance tuning the R code of the application directly, or by optimizing the R run-time environment. REvolution Computing's version of the R runtime environment contains optimized versions of certain underlying mathematical functions with improved memory and cache utilization. Programs which make heavy use of the optimized functions may show a significant speed-up in serial execution time. In Section 5 we shall describe the results of using REvolution's tools in this case study, as well as of transforming the R code directly using application profiling. Another potential approach to improving the serial performance would be to rewrite the code in an efficient low-level language such as C or Fortran. However, this would be a very costly undertaking because of the complexity of the R packages used in the program. Also, the R program used for our Genetics analysis is subject to frequent revisions and improvements, and the flexibility to make frequent changes would disappear with this approach.

4.3 Efficient Scheduling and Resource Management

In order to achieve the highest throughput in a cloud environment, we need to be able to quickly provision a large environment, and also efficiently schedule a large number of jobs within that environment. An HPC cloud is essentially a cluster of a large number of virtual machines.
To be useful, we need the ability to quickly create a large clustered environment, consisting of dozens, or possibly hundreds or thousands, of virtual cores. Ideally, this process should be highly scalable, requiring little (if any) additional effort to build a 1,000-node cluster as opposed to a 2-node cluster. In our experiment, we used Univa UD's UniCloud product to provision the cluster. UniCloud works as follows: one virtual machine (the "installer node") is created manually within the cloud environment (for example, Amazon EC2). This machine is installed with CentOS Linux 5.2. The customer's account information is configured in a special file (which varies by cloud vendor), and this allows UniCloud to use the cloud vendor's API to create additional virtual machines programmatically. A UniCloud command ("ngedit") is used to define the configuration for additional nodes, and the "addhost" command is then
used to actually provision the nodes. This includes building the operating system, setting up ssh keys for password-less login, and installation of optional software packages. One of the optional packages installed by UniCloud is Sun Grid Engine [17], which is used as the batch job scheduler. This allows us to submit all 250,000 jobs at once. The jobs are queued by SGE and then farmed out to different machines as they become available. Without a scheduler, it would be impossible to fully utilize all the compute resources.

5. Experimental Results

We ran a representative subset of jobs on three flavours of cloud VM, two from Amazon and one from Rackspace, using the UniCloud scheduler and resource manager in order to achieve an efficient throughput of jobs. Because of the large number of individual jobs run in our experiment, the scheduler was able to achieve near-perfect efficiency in utilization of processor cores. We used these results to make the cost and execution time predictions in Table 3.

Table 3. Initial performance results.
                                 EC2 Standard XL   EC2 High CPU XL   Rackspace 256MB
Average Runtime Per Job          29.1 min          17.9 min          19.6 min
Jobs Per Hour Per Instance       8.25              26.82             12.23
Total Number of Jobs             250,000           250,000           250,000
Wall Clock Time (200 instances)  151.6 hr          46.6 hr           102.2 hr
Cost per hr                      $0.80             $0.80             $0.015
Total Cost                       $24,250.00        $7,458.33         $306.72
Using the Rackspace cloud works out over twenty times less expensive than the lowest cost seen with Amazon. However, there are fundamental differences between the two environments that make direct comparison difficult. An instance at Amazon EC2 provides a guaranteed minimum performance, as well as a certain amount of memory and disk space. Each Rackspace VM provides access to four processing cores and a specific amount of memory (in our case, 256MB). Each physical machine at Rackspace actually has two dual-core processors and 16GB of memory, so there could be up to 64 separate 256MB VMs sharing this hardware. Each VM can use up to the full performance of the machine if the other VMs are idle, but it is possible (although unlikely) that the performance will be significantly less. Rackspace's provisioning algorithm attempts to avoid this problem by distributing new VMs instead of stacking them, and most applications already deployed on the Rackspace cloud are not at all CPU-hungry. In our testing, the Rackspace 256MB instance was the most cost-efficient, because each of our jobs requires about 50MB of memory, so four concurrent jobs can fully utilize one of these virtual machines without swapping. The scheduler was able to keep each of these cores constantly busy until all the jobs had been processed. Note also that there is currently a limit of 200 concurrent VMs that a Rackspace user may hire in a single session, so Rackspace would appear to be significantly less scalable for HPC than Amazon. The best execution time on the Rackspace cloud is 102 hours. Amazon can do the analysis in 46 hours using 200 VM instances, and could potentially go much faster than this by creating 1,000 or more concurrent VMs.
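The predictions in Table 3 follow from simple arithmetic, which the sketch below reproduces. The per-instance concurrent-job counts (4 for Standard XL and Rackspace, 8 for High CPU XL) are our inference from the published figures, not values stated in the table:

```python
# Sketch of the cost arithmetic behind Table 3. The concurrent-job counts
# per instance are assumptions inferred from the published numbers.
def predict(runtime_min, concurrent_jobs, price_per_hr,
            total_jobs=250_000, instances=200):
    jobs_per_hr = 60.0 / runtime_min * concurrent_jobs     # per instance
    wall_clock_hr = total_jobs / (jobs_per_hr * instances)
    total_cost = wall_clock_hr * instances * price_per_hr
    return jobs_per_hr, wall_clock_hr, total_cost

# EC2 High CPU XL: 17.9 min/job, 8 concurrent jobs, $0.80/hr
jobs_per_hr, wall, cost = predict(17.9, 8, 0.80)
print(round(jobs_per_hr, 2), round(wall, 1), round(cost))  # 26.82 46.6 7458
```

Running the same function for the other two columns reproduces the remaining wall-clock and cost figures to within rounding.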
5.1 R Code Optimisation

Our performance optimization for the test R program was largely focused in two areas: language-level and processor optimizations. The R language is a rich, easy-to-use, high-level programming language. The flexibility of the language makes it easy to express computational ideas in many ways, often simplifying their implementation. However, that flexibility also admits many less-than-optimal implementations: it is possible for several mathematically equivalent R programs to have widely different performance characteristics. Fortunately, R code is easy to profile for time and memory utilization, allowing us to identify bottlenecks in programs and compare alternate implementations. The Rprof function produces a detailed log of the time spent in every function call over the course of a running program, as well as optional detailed memory consumption statistics. There exist several excellent open-source profile analysis and visualization packages that work with Rprof, including 'proftools' and 'profr'. These are essential tools for R performance tuning. Processor optimizations represent the other main approach to R performance optimization. Unlike language-level optimizations, processor optimizations are usually implemented through low-level numeric libraries; the R language relies on such libraries for basic computational kernels like linear algebra operations. Running the Statistical Genetics R code using REvolution R showed a marginal performance improvement (6%) over running with the public domain version of R. In order to explore further we used Rprof to analyse the code. This shows a breakdown of how much time the program spends in each function; for our program the Rprof output begins as follows:

Each sample represents 0.02 seconds.
Total run time: 1774.31999999868 seconds.
Total seconds: time spent in function and callees.
Self seconds: time spent in function alone.
% total  total seconds  % self  self seconds  name
100.00   1774.32        0.00      0.00        "source"
100.00   1774.32        0.00      0.02        "eval.with.vis"
 99.99   1774.18        0.00      0.00        "t"
 99.99   1774.18        0.34      6.12        "sapply"
 99.99   1774.18        0.00      0.00        "replicate"
 99.99   1774.14        1.09     19.34        "lapply"
 99.99   1774.10        9.79    173.74        "FUN"
 99.98   1774.02        0.09      1.68        "getpval"
 64.55   1145.30        1.51     26.88        "lm"
 32.21    571.48        0.03      0.60        "anova"
 32.17    570.88        0.02      0.40        "anova.lm"
 30.21    536.02        1.22     21.72        "eval"
 29.92    530.80        0.18      3.20        "anova.lmlist"
This shows that most of the execution time was within the ‘lm’ function (and related functions) for fitting linear models. The test program runs a large number of linear models, which are ultimately based on the QR matrix factorization. The particular QR factorization
used by R has not yet been adapted to take advantage of tuned numeric libraries. This explains why the speedup obtained was small. (However, optimization of the QR factorization code is work in progress and we expect it will provide a further 5 to 10% speedup.) Fortunately, it turns out that our program's performance may be significantly improved by a code transformation. The original code contains the following 'for loop' construct:

for(i in 1:ncov){
  if(i==1){
    model0

while n > 0:
    n -= 1

and found that a parallel execution of the task in threads performed 1.8 times slower than a sequential execution, and that performance improved if one (of two) CPU cores was disabled. These counter-intuitive results are often the basis for developers' calls for the GIL to be removed. The Python FAQ6 summarises why the GIL is unlikely to be removed from the reference implementation of the interpreter: essentially because alternative implementations of thread scheduling have caused a performance penalty to single-threaded programs. The current solution, therefore, is to provide programmers with alternatives: either to write single-threaded code, perhaps using coroutines for cooperative multitasking, or to take advantage of multiple cores and use processes and IPC in favour of threads and shared state. A second solution is to use another implementation of Python, apart from the CPython interpreter. Stackless Python [2] is an implementation which largely avoids the use of the C stack and has green threads (called "tasklets") as part of its standard library. Google's Unladen Swallow7 is still in the design phase, but aims to improve on the performance of CPython five-fold and intends to eliminate the GIL in its own implementation by 2010. This paper describes another alternative: to augment Python with a higher-level abstraction for message-passing concurrency, python-csp, based on Hoare's Communicating Sequential Processes [3]. The semantics of CSP are relatively abstract compared with libraries such as pthreads, and so the underlying implementation of CSP "processes" as either system threads, processes or coroutines is hidden from the user. This means that the user can choose an implementation which is suitable for the interpreter in use or the context of the application they are developing (for example, processes where a multi-core architecture is expected to be used, threads where one is not).
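The counting benchmark referred to above can be reproduced as follows (a sketch: the 1.8x slowdown figure will not reproduce exactly, and results depend on interpreter version, GIL behaviour and hardware, so the timings are merely reported, not asserted):

```python
import threading
import time

def count(n):
    # CPU-bound loop that never blocks on I/O, so under a GIL two threads
    # running it cannot make progress simultaneously.
    while n > 0:
        n -= 1

N = 2_000_000

t0 = time.time()
count(N)
count(N)                                 # sequential: two runs back to back
seq = time.time() - t0

t0 = time.time()
t1 = threading.Thread(target=count, args=(N,))
t2 = threading.Thread(target=count, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
par = time.time() - t0

print("sequential %.3fs, threaded %.3fs" % (seq, par))
```

On GIL-constrained interpreters the threaded run is typically no faster, and often slower, than the sequential one.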
Also, CSP was designed specifically to help avoid well-known problems with models such as shared-memory concurrency (such as deadlocks and race conditions) and to admit formal reasoning. Both properties assist the user by encouraging program correctness. The authors have a particular interest in using Python to implement complex tasks requiring coordination between many hosts. The SenSor simulator and development tool [4,5] provided facilities for prototyping algorithms and applications for wireless sensor networks in pure Python, using shared-memory concurrency. The burden of managing explicit locking in an already complex environment made the implementation of SenSor difficult. A new version of the tool is currently under development and will be built on the python-csp and jython-csp libraries. To deal with the different performance characteristics of threads and processes in the current implementations of Python, the python-csp library currently has a number of different implementations:

• the csp.cspprocess module, which contains an implementation of python-csp based on operating system processes, as managed by the multiprocessing library; and
• the csp.cspthread module, which contains an implementation of python-csp based on system threads.

There is also a version of the library called jython-csp that targets Jython, a version of Python which runs on the Java VM. jython-csp uses Java threads (which are also system threads), rather than Python threads. Jython allows the user to mix Java and Python code in programs (almost) freely. As such, jython-csp allows users to use any combination of Python and Java, including being able to write pure Java programs using wrappers for jython-csp types. In general, in this paper we will refer to python-csp; however, the syntax and semantics of the libraries can be assumed to be identical unless stated otherwise. The remainder of this paper describes the design and implementation of python-csp and jython-csp. Section 1 gives an overview of the design philosophy of the library and its syntax and (informal) semantics. Section 2 describes a longer python-csp program and gives a discussion of the design patterns used in the implementation. Section 3 begins an evaluation of python-csp by describing benchmark results using the Commstime program and comparing our work with similar libraries. python-csp and jython-csp have been benchmarked against PyCSP [6], another realisation of CSP in Python, and JCSP [7], a Java library. Section 4 outlines ongoing work on model checking channel synchronisation and non-deterministic selection in the python-csp implementation. Section 5 concludes and describes future work.

5 http://blip.tv/file/2232410
6 http://www.python.org/doc/faq/library/
7 http://code.google.com/p/unladenswallow/

S. Mount et al. / CSP as a Domain-Specific Language Embedded in Python and Jython

1. python-csp and jython-csp: Syntax and Semantics

The design philosophy behind python-csp is to keep the syntax of the library as "Pythonic" and familiar to Python programmers as possible. In particular, two things distinguish this library from others such as JCSP [7] and PyCSP [6]. Where languages such as Java have strong typing and sophisticated control over encapsulation, Python has a dynamic type system, often using so-called "duck typing" (which means that an object is said to implement a particular type if it shares enough data and operations with the type to be used in the same context as the type).
Where an author of a Java library might expect users to rely on the compiler to warn of semantic errors in the type-checking phase, Python libraries tend to trust the user to manage their own encapsulation and use run-time type checking. Although Python is a dynamically typed language, the language is helpful in that few, if any, type coercions are implicit. For example, where in Java a programmer could concatenate a String and an int when calling System.out.println, the equivalent expression in Python would raise an exception. In general, the Python type system is consistent, and this is largely because every Python type is reified as an object. Java differs from Python in this respect, as primitive Java types (byte, short, int, long, char, float, double, boolean) do not have fields and are not created on the heap. In Python, however, all values are (first-class) objects, including functions and classes. Importantly for the work described here, operators may be overloaded for new types, as each Python operator has an equivalent method inherited from the base object, for example:

>>> 1 + 2
3
>>> (1).__add__(2)
3
>>> [1] * 3
[1, 1, 1]
>>> ([1]).__mul__(3)
[1, 1, 1]
>>>
Lastly, Python comes with a number of features familiar to users of functional programming languages such as ML that are becoming common in modern, dynamically typed languages. These include generators, list comprehensions and higher-order functions.

CSP [3] contains three fundamental concepts: processes, (synchronous) channel communication and non-deterministic choice. python-csp provides two ways in which the user may create and use these CSP object types: one method where the user explicitly creates
296
S. Mount et al. / CSP as a Domain-Specific Language Embedded in Python and Jython
instances of types defined in the library and calls the methods of those types to make use of them; and another where users may use syntactic sugar implemented by overriding the Python built-in infix operators. Operator overloading has been designed to be as close to the original CSP syntax as possible, and is as follows:

Syntax    Meaning                                    CSP equivalent
P > Q     Sequential composition of processes        P; Q
P & Q     Parallel composition of processes          P ∥ Q
c1 | c2   Non-deterministic choice                   c1 □ c2
n * A     Repetition                                 n • A
A * n     Repetition                                 n • A
Skip()    Skip guard, always ready to synchronise    Skip
where:
• n is an integer;
• P and Q are processes;
• A is a non-deterministic choice (or ALT); and
• c1 and c2 are channels.
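Sugar of this kind rests on Python's operator methods. As an illustration (a simplified sketch, not the python-csp implementation), a toy process class can define __gt__ for sequential composition and __and__ for parallel composition:

```python
import threading

class Proc:
    """Toy process supporting P > Q (sequential) and P & Q (parallel).

    A sketch only: python-csp's real Seq/Par semantics are richer.
    """
    def __init__(self, fn, *args):
        self.fn, self.args = fn, args

    def start(self):
        self.fn(*self.args)

    def __gt__(self, other):
        # P > Q: run P to completion, then Q
        self.start()
        other.start()
        return other

    def __and__(self, other):
        # P & Q: run both in threads and wait for both to finish
        threads = [threading.Thread(target=p.start) for p in (self, other)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self

out = []
Proc(out.append, 1) > Proc(out.append, 2)   # sequential composition
print(out)  # [1, 2]
```

The parallel form `Proc(f) & Proc(g)` runs both callables in separate threads, so their output order is not guaranteed, mirroring the semantics described later in Section 1.2.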
The following sections describe each of the python-csp features in turn.

1.1. python-csp Processes

In python-csp a process can be created in two ways: either explicitly, by creating an instance of the CSPProcess class, or, more commonly, by using the @process decorator8. In either case, a callable object (usually a function) must be created that describes the run-time behaviour of the process. Listing 1 shows the two ways to create a new process, in this case one which opens a sensor connected to the USB bus of the host and continuously prints out a transducer reading every five minutes. Whichever method is used to create the process, P, a special keyword argument _process must be passed in with a default value. When the process P is started (by calling its start method), _process is dynamically bound to an object representing the underlying system thread or process which is the reification of the CSPProcess instance. This gives the programmer access to values such as the process identifier (PID) or thread name, which may be useful for logging and debugging purposes. When the start method of a CSPProcess object has returned, the underlying thread or process will have terminated. Once this has happened, accessing the methods or data of the corresponding _process variable will raise an exception.

# Using the CSPProcess class:
def print_rh_reading():
    rhsensor = ToradexRH()  # Oak temp/humidity sensor
    rhsensor.open()
    while True:
        data = rhsensor.get_data()
        print 'Humidity %g : Temp: %gC' % data[1:]
        dingo.platform.gumstix.sleep(60 * 5)  # 5 min

P = CSPProcess(print_rh_reading, _process=None)
P.start()

# Using the @process decorator:

8 A "decorator" in Python is a callable object "wrapped" around another callable. For example, the definition of a function fun decorated with the @mydec decorator will be replaced with fun = mydec(fun).
S. Mount et al. / CSP as a Domain-Specific Language Embedded in Python and Jython
297
@process
def print_rh_reading(_process=None):
    rhsensor = ToradexRH()  # Oak temp/humidity sensor
    rhsensor.open()
    while True:
        data = rhsensor.get_data()
        print 'Humidity %g : Temp: %gC' % data[1:]
        dingo.platform.gumstix.sleep(60 * 5)  # 5 min

P = print_rh_reading()
P.start()

Listing 1. Two ways to create a python-csp process.
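The decorator mechanism described in footnote 8 can be sketched as follows. This is illustrative only: the names ToyProcess and the factory body are our own, standing in for the real CSPProcess machinery:

```python
class ToyProcess:
    """Stand-in for CSPProcess: wraps a callable until start() is called."""
    def __init__(self, fn, *args, **kwargs):
        self._run = lambda: fn(*args, **kwargs)

    def start(self):
        self._run()

def process(fn):
    # Rebinds the decorated name to a factory, mirroring fun = mydec(fun):
    # calling the decorated function now builds a process instead of
    # executing the body immediately.
    def factory(*args, **kwargs):
        return ToyProcess(fn, *args, **kwargs)
    return factory

@process
def greet(name, _process=None):
    print("hello", name)

p = greet("world")   # builds a ToyProcess; nothing has run yet
p.start()            # prints: hello world
```

This explains why, in Listing 1, calling the decorated function returns a process object whose behaviour only runs once start is invoked.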
1.2. python-csp Parallel and Sequential Execution

CSP processes can be composed either sequentially or in parallel. In sequential execution each process starts and terminates before the next in the sequence begins. In parallel execution all processes run "at once", and therefore the order of any output they effect cannot be guaranteed. Parallel and sequential execution can be implemented in python-csp either by instantiating Par and Seq objects or by using the overloaded & or > operators. In general, using the overloaded infix operators results in clear, simple code where there are a small number of processes. Listing 2 demonstrates sequential and parallel process execution in python-csp.

>>> def P(n, _process=None):
...     print n
...
>>> # In sequence, using syntactic sugar
>>> P(1) > P(2)
1
2
>>> # In sequence, using objects
>>> Seq(P(1), P(2)).start()
1
2
>>> # In parallel, using syntactic sugar
>>> P(1) & P(2)
2
1
>>> # In parallel, using objects
>>> Par(P(1), P(2)).start()
1
2
Listing 2. Two ways to run processes sequentially and in parallel.
1.3. python-csp Channels

In CSP, communication between processes is achieved via channels. A channel can be thought of as a pipe (similar to UNIX pipes) between processes: one process writes data down the channel and the other reads. Since channel communication in CSP is synchronous, the writing process can be thought of as offering data which is only actually written to the channel when another process is ready to read it. This synchronisation is handled entirely by the language, meaning that details such as locking are invisible to the user. Listing 3 shows how channels can be used in python-csp.
In its csp.cspprocess implementation, python-csp uses an operating system pipe to transmit serialised data between processes. This has a resource constraint, as operating systems limit the number of file descriptors that each process may have open. This means that although python-csp programs can create any number of processes (until available memory is saturated), only a limited number of channels can be created: in practice, over 600 on an Ubuntu Linux PC. To compensate for this, python-csp offers a second implementation of channels called FileChannel. A FileChannel object behaves in exactly the same way as any other channel, except that it uses files on disk to store data being transferred between processes. Each read or write operation on a FileChannel opens and closes the operating system file, meaning that the file is not open for the duration of the application running time. Programmers can use FileChannel objects directly, or, if a new Channel object cannot be instantiated, then Channel will instead return a FileChannel object. A third class of channel is provided by python-csp, called a NetworkChannel. A NetworkChannel transfers data between processes via a socket listener which resides on each node in the network for the purpose of distributing channel data. By default, when the Python csp.cspprocess package is imported, a socket server is started on the host (if one is not already running).

@process
def send_rh_reading(cout, _process=None):
    rhsensor = ToradexRH()  # Oak temp/humidity sensor
    timer = TimerGuard()
    rhsensor.open()
    while True:
        data = rhsensor.get_data()
        cout.send('Humidity %g : Temp: %gC' % data[1:])
        timer.sleep(60 * 5)  # 5 min
        timer.read()         # Synchronise with timer guard.

@process
def simple_sensor(_process=None):
    ch = Channel()
    Printer(ch) & send_rh_reading(ch)

Listing 3. Two processes communicating via a channel.
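The synchronous handshake described in this section can be sketched with two queues (illustrative only: python-csp's real channels use OS pipes, files or sockets as described above):

```python
import queue
import threading

class ToyChannel:
    """Rendezvous channel: write() blocks until a reader takes the value."""
    def __init__(self):
        self._data = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def write(self, value):
        self._data.put(value)
        self._ack.get()          # wait for the matching read to complete

    def read(self):
        value = self._data.get()
        self._ack.put(None)      # release the blocked writer
        return value

ch = ToyChannel()
results = []
reader = threading.Thread(target=lambda: results.append(ch.read()))
reader.start()
ch.write(42)                     # returns only once the reader has the value
reader.join()
print(results)  # [42]
```

The acknowledgement queue is what makes the exchange a rendezvous rather than a buffered send: without it, write would return before any reader existed.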
1.4. Non-Deterministic Selection and Guards in python-csp

Non-deterministic selection (called "select" in JCSP and "ALT" or "ALTing" in occam) is an important feature of many process algebras. Select allows a process to choose between a number of waiting channel reads, timers or other "guards" which are ready to synchronise. In python-csp this is achieved via the select method of any guard type. To use select, an Alt object must be created and should be passed any number of Guard instances. See Listing 4 for an example. The select method can be called on the Alt object and will return the value returned from the selected guard. To implement a new guard type, users need to subclass the Guard class and provide the following methods:

• enable, which should attempt to synchronise;
• disable, which should roll back from an enable call to the previous state;
• is_selectable, which should return True if the guard is able to complete a synchronous transaction and False otherwise;
• select, which should complete the synchronous transaction and return the result (note this semantics is slightly different from that found in JCSP, as described by Welch [7], where an index to the selected guard in the guard array is returned by select);
S. Mount et al. / CSP as a Domain-Specific Language Embedded in Python and Jython
299
• poison, which should be used to finalize and delete the current guard and gracefully terminate any processes connected with it [8,9].

Since Python has little support for encapsulation, the list of guards inside an Alt object is available to any code which has access to the Alt.

@process
def printAvailableReading(cins, _process=None):
    alt = Alt(*cins)
    while True:
        print alt.select()

@process
def simpleSensorArray(_process=None):
    chans, procs = [], []
    for i in range(NUMSENSORS):
        chans.append(Channel())
        procs.append(sendRHReading(chans[-1]))
    procs.append(printAvailableReading(chans))
    Par(*procs).start()

Listing 4. Servicing the next available sensor reading with non-deterministic selection.
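A minimal guard satisfying the interface above might look as follows. This is a sketch of the calling protocol only, with invented names (ValueGuard, ToyAlt), not python-csp's internals:

```python
import random

class Guard:
    """Abstract guard protocol, per the method list above."""
    def enable(self): raise NotImplementedError
    def disable(self): raise NotImplementedError
    def is_selectable(self): raise NotImplementedError
    def select(self): raise NotImplementedError
    def poison(self): raise NotImplementedError

class ValueGuard(Guard):
    """Always-ready guard that yields a fixed value when selected."""
    def __init__(self, value):
        self.value, self.enabled = value, False
    def enable(self): self.enabled = True
    def disable(self): self.enabled = False
    def is_selectable(self): return self.enabled
    def select(self): return self.value   # returns the result, not an index
    def poison(self): self.enabled = False

class ToyAlt:
    """Enable all guards, pick one ready guard at random, then disable."""
    def __init__(self, *guards):
        self.guards = list(guards)
    def select(self):
        for g in self.guards:
            g.enable()
        ready = [g for g in self.guards if g.is_selectable()]
        chosen = random.choice(ready)
        result = chosen.select()
        for g in self.guards:
            g.disable()
        return result

print(ToyAlt(ValueGuard("a")).select())  # a
```

The enable/disable bracketing mirrors how a real Alt must offer to synchronise on every guard and then roll back the offers it did not take.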
Like many other implementations of CSP, python-csp implements a number of variants of non-deterministic selection:

• select enables all guards and either returns the result of calling select on the first available guard (if only one becomes available) or randomly chooses an available guard and returns the result of its select method. The choice is truly random, determined by a random number generator seeded from the urandom device of the host machine.
• priority_select enables all guards and, if only one guard becomes available, returns the result of its select method. If more than one guard becomes available, the first guard in the list passed to Alt is selected.
• fair_select enables all guards and, if only one guard becomes available, returns the result of its select method. Alt objects keep a reference to the guard which was selected on the previous invocation of any select method, if there has been such an invocation and the guard is still in the guards list. If fair_select is called and more than one guard becomes available, then fair_select gives lowest priority to the guard which was returned on the previous invocation of any of the select methods. This idiom reduces the likelihood of starvation, as every guard is guaranteed that no other guard will be serviced twice before it is selected.

There are two forms of syntactic sugar that python-csp implements to assist in dealing with Alt objects: a choice operator and a repetition operator. Using the choice operator, users may write:

result = guard_1 | guard_2 | ... | guard_n
which is equivalent to:

alt = Alt(guard_1, guard_2, ..., guard_n)
result = alt.select()
To repeatedly select from an Alt object n times, users may write:

gen = n * Alt(guard_1, guard_2, ..., guard_n)
or:

gen = Alt(guard_1, guard_2, ..., guard_n) * n
This construct returns a generator object which can be iterated over to obtain results from Alt.select method calls. Using generators in this way is idiomatic in Python and will be familiar to users. The following is a typical use case:

gen = Alt(guard_1, guard_2, ..., guard_n) * n
while True:
    ...
    gen.next()
    ...
Each time gen.next() is called within the loop, the select() method of the Alt object is called and its result returned. In addition to channel types, python-csp implements two commonly used guard types: Skip and TimerGuard. Skip is the guard which is always ready to synchronise; in python-csp its select method always returns None, which is the Python null value. TimerGuards are used either to suspend a running process (by calling their sleep method) or as part of a synchronisation where the guard becomes selectable after a timer has expired:

@process
def alarm(cout, _process=None):
    alt = Alt(TimerGuard())
    t0 = alt.guard[0].read()    # Fetch current time
    alt.guard[0].set_alarm(5)   # Selectable 5 secs from now
    alt.select()
    duration = alt.guard[0].read() - t0  # In seconds
    cout.write(duration)
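The repetition sugar described earlier (n * alt and alt * n) can be sketched with the __mul__ and __rmul__ operator methods returning a generator (an illustration with an invented RepeatAlt class, not the python-csp implementation; real select resolution over guards is elided):

```python
class RepeatAlt:
    """Sketch of Alt-style repetition sugar: n * alt yields a generator."""
    def __init__(self, *guards):
        self.guards = list(guards)

    def select(self):
        # Stand-in for real guard resolution: rotate through values.
        g = self.guards.pop(0)
        self.guards.append(g)
        return g

    def __mul__(self, n):
        # Lazy: select() runs once per iteration, not up front.
        return (self.select() for _ in range(n))

    __rmul__ = __mul__     # supports both  alt * n  and  n * alt

gen = 3 * RepeatAlt("x", "y")
print(list(gen))  # ['x', 'y', 'x']
```

Because the generator is lazy, each step of iteration corresponds to one fresh selection, matching the gen.next() loop shown above.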
1.5. Graceful Process Termination in python-csp

Terminating a parallel program without leaving processes running in deadlock is a difficult problem. The most widely implemented solution to this problem was invented by Peter Welch [10] and is known as "channel poisoning". The basic idea is to send a special value down a channel which, when read by a process, is propagated down any other channels known to that process before it terminates. In python-csp this can be effected by calling the poison method on any guard. A common idiom in python-csp, especially where producer-consumer patterns are implemented, is this:

alt = Alt(*channels)
for i in xrange(len(channels)):
    alt.select()
Here, it is intended that each guard in channels be selected exactly once. Once a guard has been selected, its associated writer process(es) will have finished their computation and terminated. To support this idiom efficiently, python-csp implements a method called poison on Alt objects which poisons the writer process(es) attached to the last selected guard and removes that guard from the list. It is used as follows:

a = Alt(*channels)
for i in xrange(len(channels)):
    a.select()
    a.poison()
By shortening the list of guards, less synchronisation is required on each iteration of the for loop, reducing the computational effort required by the select method.
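A stdlib-only sketch of this shrinking-guard-list idea follows (guards are modelled as simple queues of pending values; the real poison() also propagates the poison value to the attached writer processes):

```python
class PoisoningAlt(object):
    """Sketch of Alt.poison(): drop the last-selected guard from the list."""
    def __init__(self, *guards):
        self.guards = list(guards)
        self.last_selected = None

    def select(self):
        # Pick the first guard with pending data; a real Alt would
        # block and choose fairly among ready guards.
        for g in self.guards:
            if g['data']:
                self.last_selected = g
                return g['data'].pop(0)
        raise RuntimeError('no guard ready')

    def poison(self):
        # Each call shrinks the guard list, so later selects scan
        # (and synchronise with) fewer guards.
        self.guards.remove(self.last_selected)

channels = [{'name': i, 'data': [i * 10]} for i in range(3)]
a = PoisoningAlt(*channels)
results = []
for i in range(len(channels)):
    results.append(a.select())
    a.poison()

print(results, len(a.guards))  # [0, 10, 20] 0
```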
1.6. Built-In Processes and Channels

python-csp comes with a number of built-in processes and channels, aimed at speeding development. This includes all of the names defined in the JCSP "plugnplay" library. In addition to these and other built-in processes and guards, python-csp comes with analogues of every unary and binary operator in Python. For example, the Plus process reads two values from channels and then writes the sum of those values to an output channel. An example implementation9 of this might be:

@process
def Plus(cin1, cin2, cout, _process=None):
    while True:
        in1 = cin1.read()
        in2 = cin2.read()
        cout.write(in1 + in2)

Listing 5. A simple definition of the built-in Plus process.
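The reflection trick described in the footnote can be sketched as follows; ListChannel is a throwaway stub standing in for python-csp channels so the factory can be exercised without the library:

```python
import operator

def _applybinop(binop):
    """Sketch of the generic factory: build a process body that repeatedly
    reads two values and writes binop of them to the output channel."""
    def proc(cin1, cin2, cout):
        while True:
            in1 = cin1.read()
            in2 = cin2.read()
            cout.write(binop(in1, in2))
    return proc

# Every binary operator process can now be defined in one line.
Plus = _applybinop(operator.add)
Mul = _applybinop(operator.mul)

class ListChannel(object):
    """Throwaway channel stub: read() pops from a list, write() appends."""
    def __init__(self, items=None):
        self.items = list(items or [])
    def read(self):
        return self.items.pop(0)   # raises IndexError when exhausted
    def write(self, value):
        self.items.append(value)

cin1, cin2, cout = ListChannel([1, 2]), ListChannel([10, 20]), ListChannel()
try:
    Plus(cin1, cin2, cout)   # runs until an input channel is exhausted
except IndexError:
    pass
print(cout.items)  # [11, 22]
```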
1.7. jython-csp Implementation and Integration with Pure Java

jython-csp is a development of python-csp for integration with Jython. Jython has similar semantics to Python but runs on the Java runtime environment (JRE), which gives access to the large number of Java libraries, such as Swing, that are useful, platform-independent and well optimised. jython-csp works in much the same way as the original python-csp, with the ability to utilise any class from the standard Java and Python class libraries. jython-csp uses Java threads and would be expected to perform similarly to other CSP implementations based on Java threads, such as JCSP (e.g. [7]). A comparison between jython-csp and python-csp implementations of the Monte Carlo approximation to π is shown in Table 1.

Table 1. Results of testing Java threads against Python threads and OS processes.

Thread Library                Running time of π approximation (seconds)
jython-csp (Java Threads)     26.49
python-csp (Python Threads)   12.08
python-csp (OS Processes)      9.59
As we shall see in Section 3, the JCSP library performs channel communication very efficiently, so one might expect that jython-csp would also execute quickly. One speculation as to why jython-csp (using Java threads) performs poorly compared to the other CSP implementations is slow Java method dispatch within Jython. In addition to the change of threading library, jython-csp also takes advantage of Java locks and semaphores from the java.util.concurrent package. jython-csp has no dependency on non-standard packages; the library will work with any JRE which is compatible with Jython 2.5.0final.

9 In fact, these processes are implemented slightly differently, taking advantage of Python's support for reflection. Generic functions called _applybinop and _applyunaryop are implemented; then Plus may be defined simply as Plus = _applybinop(operator.__add__). The production version of this code is slightly more complex, as it allows documentation for each process to be provided whenever _applybinop and _applyunaryop are called.
1.8. Java-csp

java-csp is the integration of jython-csp with Java, allowing Java applications to utilise the flexibility of jython-csp. The Java CSP implementation attempts to emulate the built-in Java thread library (java.lang.Thread) with a familiar API. As with Java threads, there are two ways to use threads in an application:

• by extending the Thread class and overriding the run method; or
• by implementing the Runnable interface.

The java-csp implementation has a similar usage:

• extending the JavaCspProcess class and overriding the target method; or
• implementing the JavaCspProcessInterface interface.

jython-csp and python-csp use the pickle library as a means of serialising data down a channel. Pickle takes a Python/Jython object and returns a sequence of bytes; this approach only works on Python/Jython objects and is unsuitable for native Java objects. As a solution, java-csp implements a wrapped version of Java object serialization which allows Jython to write pure Java objects, which implement the Serializable interface, to a channel. In addition, Python/Jython objects can be written down a channel if they extend the PyObject class.

2. Mandelbrot Generator: an Example python-csp Program

2.1. Mandelbrot Generator

Listing 6 shows a longer example of a python-csp program as an illustration of typical coding style. The program generates an image of a Mandelbrot set which is displayed on screen using the PyGame library10 (typically used for implementing simple, 2D arcade games). The Mandelbrot set is an interesting example, since it is embarrassingly parallel, i.e. each pixel of the set can be calculated independently of the others. However, calculating each pixel of a large image in a separate process may be a false economy: channel communication is likely to be expensive, so the resulting code is likely to be I/O bound.
The code in Listing 6 is structured in such a way that each column in the image is generated by a separate "producer" process. Columns are stored in memory as a list of RGB tuples, which are then written to a channel shared by the individual producer process and a single "consumer" process. The main work of the consumer is to read each image column via an Alt object and write it to the display surface. This producer-consumer architecture is common to many parallel and distributed programs. It is not necessarily, however, the most efficient structure for this program. Since some areas of the set are essentially flat, and so simple to generate, many producer processes are likely to finish their computations early and spend much of their time waiting to be selected by the consumer. If this is the case, it may be better to use a "farmer" process to task a smaller number of "worker" processes with generating a small portion of the image, and then re-task each worker after it has communicated its results to the farmer. Practically, if efficiency is an important concern, these options need to be prototyped and carefully profiled in order to determine the most appropriate solution to a given problem.

from csp.cspprocess import *

@process
def mandelbrot(xcoord, (width, height), cout,
               acorn=-2.0, bcorn=-1.250, _process=None):
    # Generate image data for column xcoord
    ...
    cout.write((xcoord, imgcolumn))
    _process._terminate()

@process
def consume(IMSIZE, filename, cins, _process=None):
    # Create initial pixel data.
    pixmap = Numeric.zeros((IMSIZE[0], IMSIZE[1], 3))
    pygame.init()
    screen = pygame.display.set_mode((IMSIZE[0], IMSIZE[1]), 0)
    # Wait on channel data.
    alt = Alt(*cins)
    for i in range(len(cins)):
        xcoord, column = alt.select()
        alt.poison()  # Remove last selected guard and producer.
        # Update column of blit buffer.
        pixmap[xcoord] = column
        # Update image on screen.
        pygame.surfarray.blit_array(screen, pixmap)

def main(IMSIZE, filename):
    channels, processes = [], []
    for x in range(IMSIZE[0]):
        # Producer + channel per column.
        channels.append(Channel())
        processes.append(mandelbrot(x, IMSIZE, channels[x]))
    processes.insert(0, consume(IMSIZE, filename, channels))
    mandel = PAR(*processes)
    mandel.start()

Listing 6. Mandelbrot set in python-csp (abridged).

10 http://www.pygame.org
The worker-farmer architecture shows an improvement in performance over the producer-consumer architecture. The code in Listing 7 is structured in such a way that each column in the image is computed by a single process. Initially, a set of workers is created, each seeded with a column number; the pixel data for that column is generated and then written to a channel. When the data has been read, the "farmer" assigns a new column for the worker to compute. If there are no remaining columns to be generated, the farmer writes a terminating value, instructing the "worker" that there are no more tasks to be performed, and the worker terminates. The consumer has the same function as before, although with a smaller set of channels to choose from, so workers which have completed their assigned columns have a shorter time to wait until the Alt object selects them.

@process
def mandelbrot(xcoord, (width, height), cout,
               acorn=-2.0, bcorn=-1.250, _process=None):
    # Generate image data for column xcoord
    ...
    cout.write((xcoord, imgcolumn))
    xcoord = cout.read()
    if xcoord == -1:
        _process._terminate()

@process
def consume(IMSIZE, filename, cins, _process=None):
    global SOFAR
    # Create initial pixel data.
    pixmap = Numeric.zeros((IMSIZE[0], IMSIZE[1], 3))
    pygame.init()
    screen = pygame.display.set_mode((IMSIZE[0], IMSIZE[1]), 0)
    # Wait on channel data.
    alt = Alt(*cins)
    for i in range(IMSIZE[0]):
        xcoord, column = alt.pri_select()
        # Update column of blit buffer.
        pixmap[xcoord] = column
        # Update image on screen.
        pygame.surfarray.blit_array(screen, pixmap)
        if SOFAR < IMSIZE[0]:
            alt.last_selected.write(SOFAR)
            SOFAR = SOFAR + 1
        else:
            alt.last_selected.write(-1)

Listing 7. Mandelbrot set in python-csp (abridged), using the "farmer"/"worker" architecture.
The improvement in performance can be seen in Figure 1: using a smaller number of processes reduces the run time of the program.
Figure 1. Run times of “farmer” / “worker” Mandelbrot program with different numbers of CSP processes.
The graph shows a linear characteristic, which is to be expected since the select method in Alt is O(n).

3. Performance Evaluation and Comparison to Related Work

The Commstime benchmark was originally implemented in occam by Peter Welch at the University of Kent at Canterbury and has since become the de facto benchmark for CSP implementations such as occam-π [11], JCSP [7] and PyCSP [6]. Table 2 shows the results of running the Commstime benchmark on JCSP version 1.1rc4, PyCSP version 0.6.0 and python-csp. To obtain fair results, the implementation of Commstime used in this study was taken directly from the PyCSP distribution, with only syntactic changes made to ensure that the tests run correctly and are fairly comparable. In each case, the type of "process" used has been varied and the default channel implementation has been used. In the case of python-csp, channels are reified as UNIX pipes. The JCSP implementation uses the One2OneChannelInt class. The PyCSP version uses the default PyCSP Channel class for each process type. All tests were run on an Intel Pentium dual-core 1.73 GHz CPU with 1 GB RAM, running Ubuntu 9.04 (Jaunty Jackalope) with Linux kernel 2.6.28-11-generic. Version 2.6.2 of the CPython interpreter was used, along with Sun Java(TM) SE Runtime Environment (build 1.6.0 13-b03) and Jython 2.5.0.

Table 2. Results of testing various CSP libraries against the Commstime benchmark. In each case the default Channel class is used.

CSP implementation   Process reification   min (μs)   max (μs)   mean (μs)   standard deviation
JCSP                 JVM thread             15          29         23.8        4.29
PyCSP                OS process            195.25      394.97     330.34      75.82
PyCSP                Python thread         158.46      311.2      292.2       47.21
PyCSP                Greenlet               24.14       25.37      24.41       0.36
python-csp           OS process             67.6       155.97     116.75      35.53
python-csp           Python thread         203.05      253.56     225.77      17.51
jython-csp           JVM thread            135.05      233        157.8       30.78
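For a rough sense of what such per-communication figures measure, the following stdlib-only sketch times a pickled write/read round-trip over a UNIX pipe, the mechanism python-csp uses to reify channels (the length-prefixed framing here is an invented assumption for the sketch, not python-csp's actual wire format):

```python
import os
import pickle
import time

def pipe_channel():
    """Sketch: a one-directional channel reified as an OS pipe, with
    pickle for serialisation (as python-csp does for channel data)."""
    r, w = os.pipe()
    def write(value):
        data = pickle.dumps(value)
        # Length-prefix each message so the reader knows how much to read.
        os.write(w, len(data).to_bytes(4, 'big') + data)
    def read():
        size = int.from_bytes(os.read(r, 4), 'big')
        return pickle.loads(os.read(r, size))
    return read, write

read, write = pipe_channel()
n = 1000
start = time.perf_counter()
for i in range(n):
    write(i)
    assert read() == i
elapsed_us = (time.perf_counter() - start) / n * 1e6
print('%.2f microseconds per write/read pair' % elapsed_us)
```

This is not the Commstime benchmark itself; it only illustrates the kind of per-communication cost the benchmark aggregates.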
The results show that channel operations in jython-csp are faster than channel operations between python-csp processes reified as Python threads, but slower than the OS-process-based version of python-csp. This is a surprising result, as Java threads are better optimised than Python threads (because of the way the Python GIL is implemented) and, as the results for JCSP show, it is possible to implement CSP channels very efficiently in pure Java. The loss of performance is likely to be due to the way in which methods are invoked in Jython. Rather than compiling all Jython code directly to Java bytecode (as was possible when Jython supported the jythonc tool), Jython wraps Python objects in Java at compile time and executes pure Python code in an instance of a Python interpreter. Mixing Python and Java code, as the jython-csp library does, can therefore result in poor performance. It may be possible to ameliorate these problems by implementing more of the jython-csp library in pure Java code. It is also possible that future versions of Jython will improve the performance of method invocation and/or provide a method of compiling Python code directly to Java bytecode.

The difference between the python-csp and PyCSP libraries is also surprising. python-csp implements channel synchronisation in a simple manner, with two semaphores protecting each channel and two reentrant locks guarding against conflicts between multiple readers and/or writers. PyCSP has a very different architecture and synchronisation strategy, which may account for the difference in performance. More detailed profiling of the two libraries, together with the completion of the model checking work described in Section 4 (to ensure that python-csp is not simply faster because it is somehow incorrect), will form part of our future work.

3.1. Related Work

There are some differences between the implementation of python-csp and other realisations of CSP, such as occam-π, JCSP [7] and PyCSP [6].
In particular, any channel object may have multiple readers and writers. There are no separate channel types such as JCSP's One2OneChannel. This reflects the simplicity that Python programmers are used to and the PEP 20 [1] maxim that "There should be one-- and preferably only one --obvious way to do it." Also, when a Channel object (or a variant of one) is instantiated, the instance itself is returned to the caller. In contrast, other systems return separate "reader" and "writer" objects, often implemented as the read and write methods of the underlying channel. This is similar to the implementation of operating system pipes in many libraries, where a reader and a writer for the pipe are returned by a library call, rather than an abstraction of the pipe itself. The authors of these other CSP realisations would argue that their design is less error prone and that
they are protecting the error-prone programmer from inadvertently reading from a channel that is intended to be an "output" channel of the given process, or writing to a channel that is intended to be an "input" to the process. However, python-csp comes with strong tool support which may ameliorate some of these potential errors, and the profusion of channel types in some systems may confuse rather than simplify. Similarly, Alt.select methods return the value read by a guard, rather than the index of the selected guard (as in JCSP) or a reference to the selected guard (as in PyCSP). The last guard selected is stored in the field Alt.last_selected and so is available to users.

Some pragmatic concessions to the purity of python-csp have been made. In particular, the three default channel types (Channel, FileChannel and NetworkChannel) have very different performance and failure characteristics, and so are implemented separately and conspicuously to the user. The chances of a network socket failing, and the potential causes of such a failure, differ greatly from those of an operating system pipe. Equally, a process which times out waiting for a channel read will need to wait considerably longer for a read on a FileChannel than on a Channel. These are non-trivial differences in semantics and it seems beneficial to make them explicit.

4. Correctness of Synchronisation in python-csp

To verify the correctness of synchronisation in python-csp, a formal model was built using a high-level systems description language called PROMELA (PROcess MEta LAnguage). This choice is motivated by convenience: a large number of PROMELA models are available in the public domain, and some of the features of the SPIN (Simple PROMELA INterpreter) tool environment [12], which interprets PROMELA, greatly facilitate our static analysis.
PROMELA is a non-deterministic language, loosely based on Dijkstra's guarded command language notation and borrowing the notation for I/O operations from Hoare's CSP. The model was divided into the two primary processes: the read and the write. Synchronisation between several readers and several writers was modelled with semaphores in PROMELA. Process type declarations consist of the keyword proctype, followed by the name of the process type and an argument list. For example, n instances of the process type read are defined as follows:

active [n] proctype read() {
  do
  :: (rlock) -> rlock = false;    /* obtain rlock */
     atomic {                     /* wait(sem)... acquire available */
       available > 0 -> available--
     }
     c_r ? msg;                   /* get data from pipe ... critical section */
     taken++;                     /* release taken */
     rlock = true;                /* release rlock */
  od;
}
The do-loop (terminated by od) is an infinite loop. The body of the loop obtains the rlock flag to protect from races between multiple readers, then blocks until an item becomes available in the pipe, then gets the item from the pipe (in the critical section), then announces the item has been read, then releases the rlock flag. The :: symbol indicates the start of a command sequence block. In a do-loop, a non-deterministic choice will be made among the command sequence blocks. In this case, there is only one to choose from. The write process is declared in a similar style to the read process. The body of the do-loop obtains the wlock flag
to protect from races between multiple writers, then places an item in the pipe, then makes the item available for the reader, then blocks until the item has been read, and finally releases the wlock.

active [n] proctype write() {
  do
  :: (wlock) -> wlock = false;    /* obtain wlock */
     s_c ! msg;                   /* place item in pipe ... critical section */
     available++;                 /* release available */
     atomic {
       taken > 0; taken--         /* acquire taken */
     }
     wlock = true;                /* release wlock */
  od;
}
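These PROMELA processes model the synchronisation python-csp itself uses (two semaphores plus a read lock and a write lock). The same discipline can be sketched directly with Python threading primitives; this Channel is an illustrative stand-in for cross-checking the model's logic, not python-csp's implementation:

```python
import threading

class Channel(object):
    """Sketch of the modelled synchronisation: 'available' and 'taken'
    semaphores pair up writers with readers; rlock/wlock protect against
    races between multiple readers and multiple writers respectively."""
    def __init__(self):
        self.available = threading.Semaphore(0)  # items placed, not yet read
        self.taken = threading.Semaphore(0)      # items read, not yet acked
        self.rlock = threading.Lock()
        self.wlock = threading.Lock()
        self.slot = None

    def write(self, value):
        with self.wlock:              # obtain wlock
            self.slot = value         # place item in pipe (critical section)
            self.available.release()  # release available
            self.taken.acquire()      # block until the item has been read

    def read(self):
        with self.rlock:              # obtain rlock
            self.available.acquire()  # wait for an item to become available
            value = self.slot         # get data (critical section)
            self.taken.release()      # announce the item has been read
            return value

chan = Channel()
results = []
readers = [threading.Thread(target=lambda: results.append(chan.read()))
           for _ in range(3)]
for t in readers:
    t.start()
for v in [1, 2, 3]:
    chan.write(v)
for t in readers:
    t.join()
print(sorted(results))  # [1, 2, 3]
```

Because each write blocks on `taken` until its value has been read, the slot can never be overwritten before it is consumed, mirroring the assertion the PROMELA monitor checks.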
The channel algorithm used in this model is defined as:

chan s_c = [0] of { mtype };  /* rendezvous channel */
chan c_r = [0] of { mtype };

active proctype data() {
  mtype m;
  do
  :: s_c ? m -> c_r ! m
  od
}
The first two statements declare s_c and c_r to be channels. These provide communication for writing into and reading from the pipe, with no buffering capacity (i.e., communication is via rendezvous), carrying messages consisting of a byte (mtype). The body of the do-loop retrieves the received message and stores it in the local variable m on the receiver side. Data is always stored in empty channels and always retrieved from full channels.

Firstly, the model was run in SPIN random simulation mode. The SPIN simulator enables users to gain early feedback on their system models, which helps to develop the designer's understanding of the design space before advancing to any formal analysis. However, SPIN provides a limited form of support for verification in terms of assertion checking, i.e. the checking of local and global system properties during simulation runs. For example, a process called monitor was devoted to asserting that a read process will not be executed if the buffer is empty, and that the buffer cannot be overwritten by a write process before the read process is executed.

proctype monitor() {
  do
  :: assert(taken

" and "[]" were used in their place. The Hydra lexer supports four basic expression types, namely identifiers, characters, integers and booleans. Identifiers start with a lowercase letter of the alphabet and can be followed by any combination of uppercase and lowercase letters, digits and the underscore character. Characters can be any valid ASCII character, denoted between single quotes. Integers are simply defined as a series of digits. Finally, Boolean expressions are denoted by either "True" or "False" and are case-sensitive.
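The four token classes just described can be sketched with regular expressions; the token names and the dispatch loop below are assumptions for illustration, with only the character classes taken from the text:

```python
import re

# Patterns for the four basic Hydra token types described above.
TOKEN_PATTERNS = [
    ('BOOLEAN',    r'\b(True|False)\b'),    # case-sensitive literals
    ('IDENTIFIER', r'[a-z][A-Za-z0-9_]*'),  # lowercase first letter
    ('INTEGER',    r'\d+'),                 # a series of digits
    ('CHARACTER',  r"'[\x00-\x7f]'"),       # ASCII char in single quotes
]

def tokenise(text):
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for name, pattern in TOKEN_PATTERNS:
            m = re.match(pattern, text[pos:])
            if m:
                tokens.append((name, m.group(0)))
                pos += len(m.group(0))
                break
        else:
            raise SyntaxError('unexpected character: %r' % text[pos])
    return tokens

print(tokenise("flag_1 True 42 'c'"))
```

Listing BOOLEAN before IDENTIFIER matters: `True` starts with an uppercase letter so it can never be an identifier, but the reverse ordering would matter in a grammar where literals and identifiers overlapped.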
W.B. Tristram and K.L. Bradshaw / Hydra: A Python Framework for Parallel Computing
3.4. Hydra Framework Implementation

3.4.1. Using Hydra

Before going into the implementation of the framework, it would be beneficial to describe the manner in which CSP is used within a Python program. The mechanism chosen is fairly simple, although somewhat less than ideal. The process is described below, along with a very simple example, which defines a process, declares an integer variable and assigns it a value.

from Hydra.csp import cspexec

code = """
[[
  prod ::
    data : integer;
    data := 4;
]];
"""

cspexec(code, progname='simple')
Firstly, the Hydra csp module must be imported. The csp module provides the cspexec method, which takes the string containing the CSP algorithm as an argument and an optional program name argument. The cspexec method is responsible for converting the algorithm and managing the execution of the processes. Since the CSP algorithm is represented as a string, it is possible to specify the algorithm inline as a triple-quoted string or the algorithm could be specified in a separate file which could then be read in and supplied to the cspexec method. The program is then run by simply executing the Python program as usual. 3.4.2. Implementation Decisions The parallel production, although very simple in its appearance, is of paramount importance as it defines the concurrent architecture of the program. It takes a list of one or more processes to be executed in parallel. During execution, these processes are spawned asynchronously and may execute in parallel, thus achieving one of the project goals: execution of code over multiple processors. However, the prototype implementation exhibits two distinct behaviours. For the top-level parallel construct, it generates the appropriate code for executing over multiple Python interpreters, but for any nested parallel statements, PyCSP’s Parallel method is used. The rationale behind this is that current desktop computers have at most eight processor cores, therefore, implementing every process in a new Python interpreter instance is not likely to scale adequately. In the future, this distinction could be made explicitly controllable by the programmer. Another important set of CSP constructs is the input and output commands. These essentially define the channels of communication between processes and provide a synchronisation mechanism in the form of a rendezvous. 
Channels are named according to their source and destination processes and are carefully tracked and recorded so that they can be correctly registered by the PYRO nameserver before execution. The input and output commands generate simple read and write method calls on the appropriate Channel objects. While it is possible to communicate with external code through these channels by acquiring the appropriate Channel object via PYRO, such actions are not recommended as they can affect the correct operation of the Hydra program. The process construct is represented as a PyCSP Process. The necessary care is taken to retrieve the relevant Channel objects from the PYRO nameserver using the getNamedChannel method, based on the recorded channels. Since CSP allows for the definition of anonymous processes, a technique for handling and defining these methods was devised. The technique is fairly simple and involves giving each process an internal name. It
is worth noting that since anonymous processes have no user-defined name, it is not possible to use input and output commands within these processes. The repetitive, alternative and guarded statements are implemented using appropriately constructed Python while and if-else statements. To simplify code generation, all while blocks start with an if statement with the condition always false. This means that only elif statements need to be generated when translating the alternative construct. Input guards are implemented using PyCSP’s Alternative class and the priSelect method that uses order of appearance as an indicator of priority. Array declarations need to be treated specially by the declaration code as CSP permits the declaration of array bounds, which can lead to out-of-bounds errors if the programmer attempts to reference an uninitialised list variable. Therefore, arrays are declared as a Python list with the appropriate number of elements all set to None. Python has a number of keywords that cannot be used as method or variable names. Therefore, all user-defined identifiers are sanitised by simply prefixing an underscore to any identifiers that clash with known keywords. Python expressions and statements embedded in the CSP are handled by simply expressing them as normal Python code. This allows the programmer to use external modules and leverage the power of Python within their CSP code. 3.5. Process Distribution and Execution Once the programmer has defined their Hydra CSP-based program within a Python file, this file can be executed as a normal Python program (this runs in its own Python interpreter), which then calls the cspexec method of the Hydra framework. The cspexec method controls the execution process from receiving the CSP algorithm, having the ANTLR-based translator convert it, registering the appropriate channels, to finally starting the execution of the concurrent program. 
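Two of the translation rules just described, keyword sanitisation and array declaration, can be sketched as follows (the function names here are illustrative, not Hydra's actual API):

```python
import keyword

def sanitise(name):
    """Prefix an underscore to identifiers that clash with Python
    keywords, as the Hydra translator does for user-defined CSP names."""
    return '_' + name if keyword.iskeyword(name) else name

def declare_array(bound):
    """Declare an array as a Python list pre-filled with None, so that
    referencing an element before assignment cannot raise an
    out-of-bounds error."""
    return [None] * bound

print(sanitise('lambda'))   # _lambda
print(sanitise('total'))    # total
print(declare_array(4))     # [None, None, None, None]
```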
A relatively simple approach was taken to bootstrapping and executing the relevant processes once code generation was complete. One of the problems encountered with using PyCSP’s network channel functionality is that all channels need to be registered with the PYRO nameserver before the processes are able to retrieve the remote Channel objects. There is no easy way to add this registration process to the generated program without encountering situations where one process requests a channel that has not yet been registered. This problem was addressed by registering all the necessary channels beforehand in the cspexec method of the Hydra.csp module. The translation process returns a list of channels and processes that need to be configured and executed, which is then processed by a simple loop that registers the appropriate channel names with the PYRO nameserver. This can be seen in the Python code snippet below. # Iterate through required channels for i , chan in enumerate ( outpt . channels ): cn = One2OneChannel () # Create new channel chans . append ( cn ) # Keep track of the channel # Register the channel with the PYRO nameserver r e g i s t e r N a m e d C h a n n e l ( chans [ i ] , chan )
Since this happens before process execution, there is no chance of channels being unregistered or multiple registrations occurring for the same channel name, thus breaking interprocess communication. Once the channels have been registered, the processes are asynchronously executed by spawning a new Python interpreter using a loop and Python threads. The cspexec method then waits for the processes to finish executing and allows the user to view the results before ending the program. This process can be seen in the following Python code snippet.
class runproc(Thread):
    def __init__(self, procname, progname):
        Thread.__init__(self)
        self.procname = procname
        self.progname = progname

    def run(self):
        os.system('python ' + self.progname + '.py ' + self.procname)

def cspexec(cspcode, progname='hydraexe', debug=False):
    # Translation and channel registration occurs here
    proclist = []
    # Iterate through the list of defined processes
    for proc in outpt.procs:
        # Create a new Python interpreter for each top-level
        # process and start it in a new thread.
        newproc = runproc(proc, progname)
        proclist.append(newproc)
        newproc.start()
    # Wait for processes to finish before terminating
    for proc in proclist:
        proc.join()
An overview of the Hydra translation and execution process is shown in Figure 1.
Figure 1. Hydra translation and execution process.
Since all the processes are defined within the same Python file, it was necessary to provide some means for the new Python interpreters to execute the correct process method. During code generation, simple command-line argument handling is added to the output file that allows for the correct method to be executed based on the supplied argument. This is shown in the Python code snippet below.
# Snippet from generated Python code
def __program(_proc_):
    # PyCSP processes omitted
    # Process selection
    if _proc_ == "producer":
        Sequence(producer())
    elif _proc_ == "consumer":
        Sequence(consumer())
    else:
        print 'Invalid process specified.'

__program(sys.argv[1])
In the above snippet, PyCSP's Sequence construct is used to start the appropriate Process, because PyCSP Processes are only executed when they are used in a PyCSP Parallel or Sequence construct.

4. Analysis

Two forms of testing were performed on the Hydra prototype. Firstly, the system was tested with a number of sample programs; the resulting output code was then manually inspected to determine whether it was an accurate representation of the CSP algorithm. Secondly, the code was executed and the operating system's process and CPU load monitoring tools were used to determine whether or not the program was executing over multiple processor cores. Since this implementation is merely a prototype for investigating the feasibility of using CSP within Python, the actual performance of the framework was not considered. The primary focus was on enabling multiprocessor execution in Python. Once that had been achieved, further work could focus on refining the framework and optimising for performance. All testing was performed on the system configuration specified in Table 1.

Table 1. Testing platform configuration.

Component          Specification
CPU                AMD Opteron 170 (2 cores @ 2.0GHz)
Motherboard        ASUS A8R32-MVP Deluxe
Memory             2x1GB G.Skill DDR400
Hard Disk          Seagate 320GB 16MB Cache
Network            Marvel Gigabit On-board Network Card
Operating System   Microsoft Windows 2003 Server SP2
Python             Version 2.5.2
4.1. Generated Code Analysis One of the example Hydra CSP algorithms can be seen in the code listing below. This is a simple program with two processes (only the producer is shown). The producer process outputs the value of x to the consumer process 10000 times and the consumer process simply inputs the value received from producer and stores it in y. The value of x is updated with the current time expressed in seconds from a fixed starting time. The example also highlights the use of Python import statements and the use of Python statements within CSP.
W.B. Tristram and K.L. Bradshaw / Hydra: A Python Framework for Parallel Computing

    from Hydra.csp import cspexec

    prodcons = """
    _include { from time import time }
    [[
      -- producer process: sends the value of x to consumer
      producer ::
        x : integer;
        x := 1;
        *[ { x < 10000 } ->
             { print "prod: x = " + str(x) };
             consumer ! x;
             x := { time() };
        ];
      ||
      consumer ::
        -- code omitted
    ]];
    """

    cspexec(prodcons, progname='prodcons')
One of the resulting Python processes can be seen in the listing below.

    import sys
    from pycsp import *
    from pycsp.plugNplay import *
    from pycsp.net import *
    from time import time

    def __program(_proc_):
        @process
        def producer():
            __procname = 'producer'
            __chan_consumer_out = getNamedChannel("producer->consumer")
            x = None
            x = 1
            __lctrl_1 = True
            while (__lctrl_1):
                if False:
                    pass
                elif x < 10000:

0 ⇒ w = 0).

    Guard(r, w) = w = 0 & startRead → Guard(r + 1, w)
                □ endRead → Guard(r − 1, w)
                □ r = 0 ∧ w = 0 & startWrite → Guard(r, w + 1)
                □ endWrite → Guard(r, w − 1).

The problem with the above design is that writers may be permanently locked out of the database if there is always at least one reader using the database (even if no individual reader uses the database indefinitely). The following version gives priority to writers, by not allowing a new reader to start using the database if there is a writer waiting:
G. Lowe / Extending CSP with Tests for Availability
    Guard(r, w) = w = 0 & notReady startWrite & startRead → Guard(r + 1, w)
                □ endRead → Guard(r − 1, w)
                □ r = 0 ∧ w = 0 & startWrite → Guard(r, w + 1)
                □ endWrite → Guard(r, w − 1).

This idea can be extended further, to achieve fairness to both types of process; the parameter priRead records whether priority should be given to readers.²

    Guard(r, w, priRead) = w = 0 ∧ priRead & startRead → Guard(r + 1, w, false)
                □ w = 0 & notReady startWrite & startRead → Guard(r + 1, w, false)
                □ endRead → Guard(r − 1, w, false)
                □ r = 0 ∧ w = 0 ∧ ¬priRead & startWrite → Guard(r, w + 1, true)
                □ r = 0 ∧ w = 0 & notReady startRead & startWrite → Guard(r, w + 1, true)
                □ endWrite → Guard(r, w − 1, true).

We now consider a few examples in order to better understand aspects of the semantics of processes with readiness tests: it turns out that some standard algebraic laws no longer hold. Throughout these examples, we omit alphabets from the parallel composition operator where they are obvious from the context.

Example 1 Consider P ‖ Q where P = a → STOP and Q = if ready a then b → STOP else error → STOP. Clearly, it is possible for Q to detect that a is ready and so perform b. Could Q detect that a is not ready, and so perform error? If P makes a available immediately then clearly the answer is no. However, if it takes P some time to make a available, then Q could test for the availability of a before P has made it available. We believe that any implementation of prefixing will take some time to make a available: for example, in a multi-threaded implementation, scheduling decisions will influence when the a becomes available; further, the code for making a available will itself take some time to run. This is the intuition we follow in the rest of the paper. This decision has a considerable impact on the semantics: it will mean that all processes will take some time to make events available (essentially since all the CSP operators maintain this property).
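The intuition that prefixing takes time can be observed directly with threads. In this Python sketch (ours, not part of the paper's formal development), the offering thread plays P = a → STOP and the main thread plays Q's readiness test; a test made before the offering thread completes its internal step sees a as not ready:

```python
import threading
import time

a_available = threading.Event()  # models "event a is currently offered"

def prefix_a():
    # a -> STOP: the implementation takes some time (a tau step)
    # before the event a actually becomes available.
    time.sleep(0.05)
    a_available.set()

p = threading.Thread(target=prefix_a)
p.start()
early = a_available.is_set()   # Q's "ready a" test, made too early
p.join()
late = a_available.is_set()    # the same test after P's internal step
print(early, late)
```

The early test almost certainly observes False, which is exactly the error branch of Example 1 below being taken.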
Returning to Example 1, in the combination P ‖ Q, Q can detect that a is not available initially and so perform error.

Example 2 (if ready a then P else Q) \ {a} = P \ {a}: the hiding of a means that the ready a test succeeds, since there is nothing to prevent a from happening.

Example 3 External choice is not idempotent. Consider P = a → STOP ⊓ b → STOP and Q = ready a & ready b & error → STOP. Then P ‖ Q cannot perform error, but (P □ P) ‖ Q can, if the two nondeterministic choices are resolved differently. We do not allow external choices to be resolved by ready or notReady tests: we consider these tests to be analogous to evaluation of standard boolean conditions in if statements, or boolean guards, which are evaluated internally.

Example 4 The process R = ready a & P □ notReady a & Q is not the same as if ready a then P else Q, essentially since the former checks for the readiness of a twice, but the latter checks
² In principle one could merge the first two branches, by using a guard w = 0 ∧ (priRead ∨ notReady startWrite); however, allowing complex guards that mix booleans with readiness testing would complicate the semantic definitions.
only once. When in the presence of the process a → STOP, R can evolve to the state P □ Q (if the ready a test is made after the a becomes available, and the notReady a test is made before the a becomes available) or to STOP □ STOP = STOP (if the ready a test is made before the a becomes available, and the notReady a test is made after the a becomes available); neither of these is, in general, a state of if ready a then P else Q.

Example 5 ready a & ready b & P is not the same as ready b & ready a & P. Consider P = error → STOP and Q = a → STOP ▷ b → STOP. Then (ready a & ready b & P) ‖ Q can perform error, but (ready b & ready a & P) ‖ Q cannot. Similar results hold for notReady, and for a mix of ready and notReady guards.

The above example shows why we do not allow more complex guards, such as ready a ∧ ready b & P: any natural implementation of this process would have to test for the availability of a and b in some order, but the order in which those are tested can make a difference.

3. Operational Semantics

In this section we give operational semantics to the language of CSP extended with tests for the readiness or non-readiness of events. For simplicity, we omit interleaving, the [|A|] form of parallel composition, and renaming from the language we consider.

As normal, we write P −a→ P′, for a ∈ Σ ∪ {τ} (where Σ is the set of visible events, and τ represents an internal event), to indicate that P performs the event a to become P′. In addition, we include transitions to indicate successful readiness or non-readiness tests:

• We write P −ready a→ P′ to indicate that P detects that the event a is ready, and evolves into P′;
• We write P −notReady a→ P′ to indicate that P detects that the event a is not ready, and evolves into P′.

Note the different fonts between ready and notReady, which are part of the syntax, and ready and notReady, which are part of the semantics. Define, for A ⊆ Σ:

    ready A = {ready a | a ∈ A},        notReady A = {notReady a | a ∈ A},
    A† = A ∪ ready A ∪ notReady A,      A†τ = A† ∪ {τ}.

Transitions, then, will be labelled by elements of Σ†τ. We think of the −ready a→ and −notReady a→ transitions as being internal in the sense that they cannot be directly observed by any parallel peer. We refer to elements of Σ†τ as actions, and restrict the word events to elements of Στ. Below we use standard conventions, writing, e.g., P −a→ for ∃ P′ • P −a→ P′, and P −a↛ for ¬(∃ P′ • P −a→ P′).

Recall our intuition that a process such as a → P may not make the a available immediately. We model this by a τ transition to a state where the a is indeed available. It turns out that this latter state is not expressible within the syntax of the language (this follows from Lemma 11, below). Within the operational semantic definitions, we will write this state as aˇ → P. We therefore define the semantics of prefixing by the following two rules:

    a → P −τ→ aˇ → P,        aˇ → P −a→ P.

We stress, though, that the aˇ → ... notation is only for the purpose of defining the operational semantics, and is not part of the language.
The following rules for normal events are completely standard. For brevity, we omit the symmetrically equivalent rules for external choice (□) and parallel composition (A‖B). The identifier a ranges over visible events.

    P −τ→ P′  implies  P □ Q −τ→ P′ □ Q
    P −a→ P′  implies  P □ Q −a→ P′

    P −τ→ P′  implies  P ▷ Q −τ→ P′ ▷ Q
    P −a→ P′  implies  P ▷ Q −a→ P′
    P ▷ Q −τ→ Q

    P ⊓ Q −τ→ P
    P ⊓ Q −τ→ Q

    P −α→ P′ and α ∈ (A − B) ∪ {τ}  implies  P A‖B Q −α→ P′ A‖B Q
    P −a→ P′ and Q −a→ Q′ and a ∈ A ∩ B  implies  P A‖B Q −a→ P′ A‖B Q′

    P −a→ P′ and a ∈ A  implies  P \ A −τ→ P′ \ A
    P −α→ P′ and α ∈ (Σ − A) ∪ {τ}  implies  P \ A −α→ P′ \ A

    μ X • P −τ→ P[μ X • P/X]
    div −τ→ div
The following rules show how the tests for readiness operate:

    if ready a then P else Q −ready a→ P,
    if ready a then P else Q −notReady a→ Q.

The remaining rules show how the readiness tests are promoted by various operators. We omit symmetrically equivalent rules for brevity. The rules for the choice operators are straightforward:

    P −ready a→ P′  implies  P □ Q −ready a→ P′ □ Q
    P −notReady a→ P′  implies  P □ Q −notReady a→ P′ □ Q

    P −ready a→ P′  implies  P ▷ Q −ready a→ P′ ▷ Q
    P −notReady a→ P′  implies  P ▷ Q −notReady a→ P′ ▷ Q
The rules for parallel composition are a little more involved. A ready a action can occur only if all processes with a in their alphabet are able to perform a:

    P −ready b→ P′ and Q −b→ and b ∈ B  implies  P A‖B Q −ready b→ P′ A‖B Q
    P −ready a→ P′ and a ∉ B  implies  P A‖B Q −ready a→ P′ A‖B Q

A notReady a action requires at least one parallel peer with a in its alphabet to be unable to perform a. In this case, the action is converted into a τ:

    P −notReady b→ P′ and Q −b↛ and b ∈ B  implies  P A‖B Q −τ→ P′ A‖B Q
    P −notReady b→ P′ and Q −b→ and b ∈ B  implies  P A‖B Q −notReady b→ P′ A‖B Q
    P −notReady a→ P′ and a ∉ B  implies  P A‖B Q −notReady a→ P′ A‖B Q
Note that in the second rule, the notReady b may yet be blocked by some other parallel peer.

If a ready a action can be performed in a context where a is then hidden, then all relevant parallel peers are able to perform a; hence the transition can occur; the action is converted into a τ:

    P −ready a→ P′ and a ∈ A  implies  P \ A −τ→ P′ \ A
    P −α→ P′ and α ∈ ready(Σ − A) ∪ notReady(Σ − A)  implies  P \ A −α→ P′ \ A

Note that there is no corresponding rule for notReady a: in the context P \ A, if P can perform notReady a (with a ∈ A) then all parallel peers with a in their alphabet are able to perform a, and so the a is available; hence the notReady a action is blocked for P \ A.

The following two lemmas can be proved using straightforward structural inductions. First, ready a and notReady a actions are available as alternatives to one another.

Lemma 6 For every process P:

    (∃ Q • P −ready a→ Q) ⇔ (∃ Q′ • P −notReady a→ Q′).

Informally, the two transitions correspond to taking the two branches of a construct of the form if ready a then R else R′. The if construct may be only part of the process P above, and so R and R′ may be only part of Q and Q′ above.

Initially, each process can perform no standard events. This is a consequence of our assumption that a process of the form a → P cannot perform the a from its initial state.

Lemma 7 For every process P expressible using the syntax of the language (so excluding the aˇ → ... construct), and for every standard event a ∈ Σ, P −a↛. Of course, P might have a τ transition to a state where visible events are available.

4. Denotational Semantics

We now consider how to build a compositional denotational semantic model for our language. We want the model to record at least the traces of visible events performed by processes: any coarser model is likely to be trivial. In order to consider what other information is needed in the model, it is useful to consider (informally) a form of testing: we will say that test T distinguishes processes P and Q if P ‖ T and Q ‖ T have different traces of visible events. In this case, the denotational model should also distinguish P and Q.

We want to record within traces the ready and notReady actions that are performed. For example, the processes b → STOP and ready a & b → STOP are distinguished by the test STOP (with alphabet {a}); we will distinguish them denotationally by including the ready a action in the latter's trace.

Further, we want to record the events that were available as alternatives to those events that were actually performed. For example, the processes a → STOP □ b → STOP and
a → STOP ⊓ b → STOP can be distinguished by the test ready a & b → STOP; we will distinguish them denotationally by recording that the former offers a as an alternative to b. We therefore add actions offer a and notOffer a to represent that a process is offering or not offering a, respectively. These actions will synchronise with ready a and notReady a actions. We write

    offer A = {offer a | a ∈ A},        notOffer A = {notOffer a | a ∈ A},
    A‡ = A† ∪ offer A ∪ notOffer A,     A‡τ = A‡ ∪ {τ}.
A trace of a process will, then, be a sequence of actions from Σ‡. We can calculate the traces of a process in two ways: by extracting them from the operational semantics, and by giving compositional rules. We begin with the former. We augment the operational semantics with extra transitions as follows:

• We add offer a loops on every state P such that P −a→;
• We add notOffer a loops on every state P such that P −a↛.

Formally, we define a new transition relation ⤳ by:

    P −α⤳ Q ⇔ P −α→ Q,    for α ∈ Σ†τ,
    P −offer a⤳ P ⇔ P −a→,
    P −notOffer a⤳ P ⇔ P −a↛.
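These self-loops are easy to prototype over an explicit transition graph. In the following Python sketch (ours; the state names and the depth bound are illustrative), an LTS maps each state to its outgoing (action, state) pairs; augment adds the offer/notOffer loops, and traces collects depth-bounded observable traces with τ erased, in the spirit of the trace extraction defined next:

```python
# a -> STOP, with the tau step that precedes the offer of a
LTS = {
    "a->STOP":  [("tau", "a^->STOP")],
    "a^->STOP": [("a", "STOP")],
    "STOP":     [],
}
SIGMA = {"a"}

def augment(lts, sigma):
    """Add 'offer a' / 'notOffer a' self-loops to every state."""
    out = {}
    for state, trans in lts.items():
        enabled = {act for act, _ in trans if act in sigma}
        loops = [("offer " + a, state) for a in sorted(enabled)]
        loops += [("notOffer " + a, state) for a in sorted(sigma - enabled)]
        out[state] = trans + loops
    return out

def traces(lts, state, depth):
    """Observable traces of length <= depth; tau steps are erased."""
    result = {()}
    if depth == 0:
        return result
    for act, nxt in lts[state]:
        for tr in traces(lts, nxt, depth - 1):
            result.add(tr if act == "tau" else (act,) + tr)
    return result
```

For the state a->STOP this yields traces such as ⟨notOffer a, a⟩ and ⟨offer a, a⟩, matching the healthiness conditions of Lemma 9.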
Appendix A gives rules for the ⤳ relation that can be derived from the rules for the −→ relation and the above definition. We can then extract the traces (of Σ‡ actions) from the operational semantics (following [4, Chapter 7]):

Definition 8 We write P −tr→ Q, for tr = ⟨α1, ..., αn⟩ ∈ (Σ‡τ)∗, if there exist P0 = P, P1, ..., Pn = Q such that Pi −αi+1⤳ Pi+1 for i = 0, ..., n−1. We write P =tr⇒ Q, for tr ∈ (Σ‡)∗, if there is some tr′ such that P −tr′→ Q and tr = tr′ \ τ.

The traces of process P can then be defined to be the set of all tr such that P =tr⇒. The following lemma states some healthiness conditions concerning this set.

Lemma 9 For all processes P expressible using the syntax of the language (so excluding the aˇ → ... construct), the set T = {tr | P =tr⇒} satisfies the following conditions:

1. T is non-empty and prefix-closed.
2. T includes (notOffer Σ)∗, i.e., the process starts in a state where no standard events are available.
3. offer and notOffer actions can always be removed from or duplicated within a trace: tr⟨α⟩tr′ ∈ T ⇒ tr⟨α, α⟩tr′ ∈ T ∧ tr tr′ ∈ T, for α ∈ offer Σ ∪ notOffer Σ.
4. ready a and notReady a actions are available as alternatives to one another: tr⟨ready a⟩ ∈ T ⇔ tr⟨notReady a⟩ ∈ T.
5. Either an offer a or a notOffer a action is always available: tr tr′ ∈ T ⇒ tr⟨offer a⟩tr′ ∈ T ∨ tr⟨notOffer a⟩tr′ ∈ T.
Proof: (Sketch)
1. This follows directly from the definition of =⇒.
2. This follows from Lemma 7 and the definition of ⤳.
3. This follows directly from the definition of ⤳: offer and notOffer transitions always form self-loops.
4. This follows directly from Lemma 6.
5. This follows directly from the definition of ⤳: each state has either an offer a or a notOffer a loop. ∎

4.1. Compositional Traces Semantics

We now give compositional rules for the traces of a process. The semantics for each process will be an element of the following model.

Definition 10 The Readiness-Testing Traces Model contains those sets T ⊆ (Σ‡)∗ that satisfy conditions 2–5 of Lemma 9.

We write tracesR[[P]] for the traces of P.³ Below we will show that these are congruent to the operational definition above.

STOP and div are equivalent in this model: they can perform no standard events; they can only signal that they are not offering events.

    tracesR[[STOP]] = tracesR[[div]] = (notOffer Σ)∗.

The process a → P can initially signal that it is not offering events; it can then signal that it is offering a but not offering other events; it can then perform a, and then continue like P.

    tracesR[[a → P]] = Init ∪ {tr⟨a⟩tr′ | tr ∈ Init ∧ tr′ ∈ tracesR[[P]]}
    where Init = {tr tr′ | tr ∈ (notOffer Σ)∗ ∧ tr′ ∈ ({offer a} ∪ notOffer(Σ − {a}))∗}.

The process if ready a then P else Q can initially signal that it is not offering events; it can then either detect that a is ready and continue as P, or detect that a is not ready and continue like Q.

    tracesR[[if ready a then P else Q]] =
        (notOffer Σ)∗ ∪
        {tr⟨ready a⟩tr′ | tr ∈ (notOffer Σ)∗ ∧ tr′ ∈ tracesR[[P]]} ∪
        {tr⟨notReady a⟩tr′ | tr ∈ (notOffer Σ)∗ ∧ tr′ ∈ tracesR[[Q]]}.

The process P ▷ Q can either perform a trace of P, or can perform a trace of P with no standard events, and then (after the timeout) perform a trace of Q. The process P ⊓ Q can perform traces of either of its components.

    tracesR[[P ▷ Q]] = tracesR[[P]] ∪
        {trP trQ | trP ∈ tracesR[[P]] ∧ trP |` Σ = ⟨⟩ ∧ trQ ∈ tracesR[[Q]]},
    tracesR[[P ⊓ Q]] = tracesR[[P]] ∪ tracesR[[Q]].

Before the first visible event, the process P □ Q can perform an offer a action if either P or Q can do so; it can perform a notOffer a action if both P and Q can do so. Therefore, P

³ We include the subscript "R" in tracesR[[P]] to distinguish this semantics from the standard traces semantics, traces[[P]].
and Q must synchronise on all notOffer actions before the first visible event. Let tr ‖_{notOffer Σ} tr′ be the set of ways of interleaving tr and tr′, synchronising on all notOffer actions (this operator is a specialisation of the ‖_X operator defined in [4, page 70]). The three sets in the definition below correspond to the cases where (a) neither process performs any visible events (so the two processes synchronise on notOffer actions throughout the execution), (b) P performs at least one visible event (after which, Q is turned off), and (c) the symmetric case where Q performs at least one visible event.

    tracesR[[P □ Q]] =
        {tr | ∃ trP ∈ tracesR[[P]], trQ ∈ tracesR[[Q]] •
            trP |` Σ = trQ |` Σ = ⟨⟩ ∧ tr ∈ trP ‖_{notOffer Σ} trQ} ∪
        {tr⟨a⟩trP′ | ∃ trP⟨a⟩trP′ ∈ tracesR[[P]], trQ ∈ tracesR[[Q]] •
            trP |` Σ = trQ |` Σ = ⟨⟩ ∧ a ∈ Σ ∧ tr ∈ trP ‖_{notOffer Σ} trQ} ∪
        {tr⟨a⟩trQ′ | ∃ trP ∈ tracesR[[P]], trQ⟨a⟩trQ′ ∈ tracesR[[Q]] •
            trP |` Σ = trQ |` Σ = ⟨⟩ ∧ a ∈ Σ ∧ tr ∈ trP ‖_{notOffer Σ} trQ}.
In order to give a semantic equation for parallel composition, we define a relation to capture how traces of parallel components are combined.⁴ We write (trP, trQ) −A‖B→ tr if the traces trP of P and trQ of Q can lead to the trace tr of P A‖B Q. Let

    privateA = (A − B) ∪ offer(A − B) ∪ notOffer A ∪ ready(Σ − B) ∪ notReady(Σ − B);

these are the actions that the process with alphabet A can perform without any cooperation from the other process. Let

    syncA,B = (A ∩ B) ∪ offer(A ∩ B);

these are the actions that the two processes synchronise upon. The relation −A‖B→ is defined by:

    (⟨⟩, ⟨⟩) −A‖B→ ⟨⟩;
    if (trP, trQ) −A‖B→ tr and b ∈ B, then
        (⟨α⟩trP, trQ) −A‖B→ ⟨α⟩tr,                        for α ∈ privateA,
        (⟨α⟩trP, ⟨α⟩trQ) −A‖B→ ⟨α⟩tr,                     for α ∈ syncA,B,
        (⟨ready b⟩trP, ⟨offer b⟩trQ) −A‖B→ ⟨ready b⟩tr,
        (⟨notReady b⟩trP, ⟨notOffer b⟩trQ) −A‖B→ tr,
        (⟨notReady b⟩trP, ⟨offer b⟩trQ) −A‖B→ ⟨notReady b⟩tr;
    the symmetric equivalents of the above cases.

In the second clause: the first case corresponds to P performing a private action; the second case corresponds to P and Q synchronising on a shared action; the third case corresponds to a readiness test of P detecting that Q is offering b; the fourth case corresponds to a non-readiness test of P detecting that Q is not offering b; the fifth case corresponds to a non-readiness test of P detecting that Q is offering b. The reader might like to compare this definition with the corresponding operational semantics rules for parallel composition.

The semantics of parallel composition is then as follows; note that each component is restricted to its own alphabet, and that the composition can perform arbitrary notOffer(Σ − A − B) actions:

⁴ One normally defines a set-valued function to do this, but in our case it is more convenient to define a relation, since this leads to a much shorter definition.
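The relation is small enough to execute. The following Python sketch (ours; the (kind, event) encoding of actions is invented for illustration) enumerates all tr such that (trP, trQ) −A‖B→ tr, implementing the clauses above together with their symmetric counterparts:

```python
def combine(trP, trQ, A, B):
    """All traces tr with (trP, trQ) -A||B-> tr.  Traces are tuples of
    (kind, event) actions, kind being one of "ev", "offer", "notOffer",
    "ready", "notReady"."""

    def private(action, own, other):
        kind, e = action
        if kind in ("ev", "offer"):
            return e in own and e not in other    # (A - B), offer(A - B)
        if kind == "notOffer":
            return e in own                       # notOffer A
        return e not in other                     # ready/notReady (Sigma - B)

    def sync(action):
        kind, e = action
        return kind in ("ev", "offer") and e in A and e in B

    def go(tP, tQ):
        out = set()
        if not tP and not tQ:
            out.add(())
        if tP and private(tP[0], A, B):           # P moves privately
            out |= {(tP[0],) + t for t in go(tP[1:], tQ)}
        if tQ and private(tQ[0], B, A):           # Q moves privately
            out |= {(tQ[0],) + t for t in go(tP, tQ[1:])}
        if tP and tQ:
            hP, hQ = tP[0], tQ[0]
            rest = lambda: go(tP[1:], tQ[1:])
            if hP == hQ and sync(hP):             # shared action
                out |= {(hP,) + t for t in rest()}
            # readiness tests against the peer's offer/notOffer reports;
            # keep=False is the notReady/notOffer case, which becomes a tau
            for test_kind, report_kind, keep in (("ready", "offer", True),
                                                 ("notReady", "notOffer", False),
                                                 ("notReady", "offer", True)):
                (kP, eP), (kQ, eQ) = hP, hQ
                if kP == test_kind and eP in B and (kQ, eQ) == (report_kind, eP):
                    out |= {(hP,) + t if keep else t for t in rest()}
                if kQ == test_kind and eQ in A and (kP, eP) == (report_kind, eQ):
                    out |= {(hQ,) + t if keep else t for t in rest()}
        return out

    return go(tuple(trP), tuple(trQ))
```

For example, combining P's trace ⟨offer a, a⟩ with Q's trace ⟨ready a, a⟩ (both alphabets {a}) yields only ⟨ready a, a⟩, and a notReady a test meeting a notOffer a report is erased, as in the fourth clause.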
    tracesR[[P A‖B Q]] =
        {tr | ∃ trP ∈ tracesR[[P]], trQ ∈ tracesR[[Q]] •
            trP |` ((Σ − A) ∪ offer(Σ − A) ∪ notOffer(Σ − A)) = ⟨⟩ ∧
            trQ |` ((Σ − B) ∪ offer(Σ − B) ∪ notOffer(Σ − B)) = ⟨⟩ ∧
            (trP, trQ) −A‖B→ tr \ notOffer(Σ − A − B)}.

The semantic equation for hiding of A captures that notReady A and offer A actions are blocked, A and ready A actions are internalised, and arbitrary notOffer A actions can occur.

    tracesR[[P \ A]] = {tr | ∃ trP ∈ tracesR[[P]] •
        trP |` (notReady A ∪ offer A) = ⟨⟩ ∧ trP \ (A ∪ ready A) = tr \ notOffer A}.

We now consider the semantics of recursion. Our approach follows the standard method using complete partial orders; see, for example, [4, Appendix A.1].

Lemma 11 The Readiness-Testing Traces Model forms a complete partial order under the subset ordering ⊆, with tracesR[[div]] as the bottom element.

Proof: That tracesR[[div]] is the bottom element follows from item 2 of Lemma 9. It is straightforward to see that the model is closed under arbitrary unions, and hence is a complete partial order. ∎

The following lemma can be proved using precisely the same techniques as for the standard traces model; see [4, Section 8.2].

Lemma 12 Each of the operators is continuous with respect to the ⊆ ordering.

Hence from Tarski's Theorem, each mapping F definable using the operators of the language has a least fixed point given by ⋃_{n≥0} F^n(div). This justifies the following definition.

    tracesR[[μ X • F(X)]] = the ⊆-least fixed point of the semantic mapping corresponding to F.

The following theorem shows that the two ways of capturing the traces are congruent.

Theorem 13 For all traces tr ∈ (Σ‡)∗: tr ∈ tracesR[[P]] iff P =tr⇒.

Proof: (Sketch) By structural induction over the syntax of the language. We give a couple of cases in Appendix B. ∎

Theorem 14 For all processes, tracesR[[P]] is a member of the Readiness-Testing Traces Model (i.e., it satisfies conditions 2–5 of Lemma 9).

Proof: This follows directly from Lemma 9 and Theorem 13. ∎

We can relate the semantics of a process in this model to the standard traces semantics. Let φ be the function that replaces readiness tests by nondeterministic choices, i.e., φ(if ready a then P else Q) = φ(P) ⊓ φ(Q), and φ distributes over all other operators (e.g. φ(P A‖B Q) = φ(P) A‖B φ(Q)). The standard traces of φ(P) are just the projection onto standard events of the readiness-testing traces of P.

Theorem 15 traces[[φ(P)]] = {tr |` Σ | tr ∈ tracesR[[P]]}.
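The mapping φ is straightforward to realise as a syntax-tree transformation. In this Python sketch (ours; the tuple-based AST and constructor names are invented), readiness tests become internal choices and every other constructor is rebuilt around transformed children:

```python
def phi(p):
    """Replace 'if ready a then P else Q' by internal choice of phi(P), phi(Q)."""
    op = p[0]
    if op == "if_ready":                 # ("if_ready", a, P, Q)
        _, _a, P, Q = p
        return ("intchoice", phi(P), phi(Q))
    if op == "prefix":                   # ("prefix", a, P)
        return ("prefix", p[1], phi(p[2]))
    if op in ("extchoice", "intchoice", "timeout"):
        return (op, phi(p[1]), phi(p[2]))
    if op == "par":                      # ("par", A, B, P, Q)
        return ("par", p[1], p[2], phi(p[3]), phi(p[4]))
    if op == "hide":                     # ("hide", A, P)
        return ("hide", p[1], phi(p[2]))
    return p                             # STOP, div

example = ("par", {"a"}, {"a", "b"},
           ("prefix", "a", ("STOP",)),
           ("if_ready", "a", ("prefix", "b", ("STOP",)), ("STOP",)))
```

Applied to the example (essentially Example 1 with STOP in place of the error branch), the readiness test becomes the internal choice b → STOP ⊓ STOP, matching Theorem 15's projection.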
4.2. Failures

We now consider how to refine the semantic model, to make it analogous to the stable failures model [4], i.e. to record information about which events can be stably refused. The refusal of events seems, at first sight, to be very similar to those events not being offered, as recorded by notOffer actions. The difference is that refusals are recorded only in stable states, i.e. where no internal events are available: this means that if an event is stably refused, it will continue to be refused (until a visible event is performed); on the other hand, notOffer actions can occur in any state, and may subsequently become unavailable. So, for example:

• a → STOP ▷ STOP is equivalent to a → STOP in the Readiness-Testing Traces model, since the traces of STOP are included in the initial traces of a → STOP; but a → STOP ▷ STOP can stably refuse a initially, whereas a → STOP cannot.
• a → STOP ▷ STOP ▷ a → STOP has the trace ⟨offer a, notOffer a, offer a⟩ (where the notOffer a action is from the intermediate STOP state) whereas a → STOP does not; but neither process can stably refuse a before a is performed.

Recall that in the standard model, stable failures are of the form (tr, X), where tr is a trace and X is a set of events that are stably refused. For the language in this paper, should refusal sets contain actions other than standard events? Firstly, we should not consider states with ready or notReady transitions to be stable: recall that we consider these actions to be similar to τ events, in that they are not externally visible. We define:

    stable P ⇔ ∀ α ∈ ready Σ ∪ notReady Σ ∪ {τ} • P −α↛.

Therefore such actions are necessarily unavailable in stable states, so there is no need to record them in refusal sets. There is also no need to record the refusal of an offer a action, since this will happen precisely when the event a is refused.

It turns out that including notOffer actions within refusal sets can add to the discriminating power of the model. Consider

    P = a → STOP ▷ STOP,    Q = (a → STOP ▷ STOP) ⊓ a → STOP.

Then P and Q have the same traces, and have the same stable refusals of standard events. However, Q can, after the empty trace, stably refuse {b, notOffer a} (i.e., stably offer a and stably refuse b), whereas P cannot. We therefore have a choice as to whether or not we include notOffer actions within refusal sets. We choose not to, because the distinctions one can make by including them do not seem useful, and excluding them leads to a simpler model: in particular, the refusal of notOffer actions does not contribute to the performance or refusal of any standard events. I suspect that including notOffer actions within refusal sets would lead to a model similar in style to the stable ready sets model [9,10]. Hence, we define, for X ⊆ Σ:

    P ref X ⇔ stable P ∧ ∀ x ∈ X • P −x↛.

We then define the stable failures of a process in the normal way:

    (tr, X) ∈ failuresR[[P]] ⇔ ∃ Q • P =tr⇒ Q ∧ Q ref X.    (2)
Definition 16 The Readiness-Testing Stable Failures Model contains those pairs (T, F) where T ⊆ (Σ‡ )∗ , F ⊆ (Σ‡ )∗ × P Σ, T satisfies conditions 2–5 of the Readiness-Testing Traces Model, and also
6. If (tr, X) ∈ F then tr ∈ T.

Below, we give compositional rules for the stable failures of a process. Since the notion of refusal is identical to that in the standard stable failures model, the refusal components are calculated precisely as in that model, and so the equations are straightforward adaptations of the rules for traces. The only point worth noting is that in the construct if ready a then P else Q, no failures are recorded before the if is resolved.

    failuresR[[div]] = {},
    failuresR[[STOP]] = {(tr, X) | tr ∈ (notOffer Σ)∗ ∧ X ⊆ Σ},
    failuresR[[a → P]] = {(tr, X) | tr ∈ Init ∧ a ∉ X} ∪
        {(tr⟨a⟩tr′, X) | tr ∈ Init ∧ (tr′, X) ∈ failuresR[[P]]}
        where Init = {tr tr′ | tr ∈ (notOffer Σ)∗ ∧ tr′ ∈ ({offer a} ∪ notOffer(Σ − {a}))∗},
    failuresR[[if ready a then P else Q]] =
        {(tr⟨ready a⟩tr′, X) | tr ∈ (notOffer Σ)∗ ∧ (tr′, X) ∈ failuresR[[P]]} ∪
        {(tr⟨notReady a⟩tr′, X) | tr ∈ (notOffer Σ)∗ ∧ (tr′, X) ∈ failuresR[[Q]]},
    failuresR[[P ▷ Q]] = {(tr, X) | (tr, X) ∈ failuresR[[P]] ∧ tr |` Σ ≠ ⟨⟩} ∪
        {(trP trQ, X) | trP ∈ tracesR[[P]] ∧ trP |` Σ = ⟨⟩ ∧ (trQ, X) ∈ failuresR[[Q]]},
    failuresR[[P ⊓ Q]] = failuresR[[P]] ∪ failuresR[[Q]],
    failuresR[[P □ Q]] =
        {(tr, X) | ∃ (trP, X) ∈ failuresR[[P]], (trQ, X) ∈ failuresR[[Q]] •
            trP |` Σ = trQ |` Σ = ⟨⟩ ∧ tr ∈ trP ‖_{notOffer Σ} trQ} ∪
        {(tr⟨a⟩trP′, X) | ∃ (trP⟨a⟩trP′, X) ∈ failuresR[[P]], trQ ∈ tracesR[[Q]] •
            trP |` Σ = trQ |` Σ = ⟨⟩ ∧ a ∈ Σ ∧ tr ∈ trP ‖_{notOffer Σ} trQ} ∪
        {(tr⟨a⟩trQ′, X) | ∃ trP ∈ tracesR[[P]], (trQ⟨a⟩trQ′, X) ∈ failuresR[[Q]] •
            trP |` Σ = trQ |` Σ = ⟨⟩ ∧ a ∈ Σ ∧ tr ∈ trP ‖_{notOffer Σ} trQ},
    failuresR[[P A‖B Q]] =
        {(tr, Z) | ∃ (trP, X) ∈ failuresR[[P]], (trQ, Y) ∈ failuresR[[Q]] •
            trP |` ((Σ − A) ∪ offer(Σ − A) ∪ notOffer(Σ − A)) = ⟨⟩ ∧
            trQ |` ((Σ − B) ∪ offer(Σ − B) ∪ notOffer(Σ − B)) = ⟨⟩ ∧
            (trP, trQ) −A‖B→ tr \ notOffer(Σ − A − B) ∧
            Z − (Σ − A − B) = (X ∩ A) ∪ (Y ∩ B)},
    failuresR[[P \ A]] = {(tr, X) | ∃ (trP, X ∪ A) ∈ failuresR[[P]] •
        trP |` (notReady A ∪ offer A) = ⟨⟩ ∧ trP \ (A ∪ ready A) = tr \ notOffer A},
    failuresR[[μ X • F(X)]] = the ⊆-least fixed point of the semantic mapping corresponding to F.

The fixed-point definition for recursion can be justified in a similar way to that for traces. The congruence of the above rules to the operational definition of stable failures (i.e., equation (2)) can be proved in a similar way to Theorem 13. Conditions 2–5 of the Readiness-Testing Stable Failures Model are satisfied, because of the corresponding result for traces
(Theorem 14). Condition 6 follows directly from the definition of a stable failure, and the congruence of the operational and denotational semantics. The following theorem relates the semantics of a process in this model to the standard stable failures semantics. Theorem 17 failures[[φ(P)]] = {(tr |` Σ, X) | (tr, X) ∈ failuresR [[P]]}.
5. Model Checking

In this section we illustrate how one can use a standard CSP model checker, such as FDR [5,6], to analyse processes in the extended language of this paper. We just give an example here, in order to give the flavour of the translation; we discuss prospects for generalising the approach in the concluding section of the paper.

We consider the following solution to the readers and writers problem.

    Guard(r, w) = w = 0 ∧ r < N & notReady startWrite & startRead → Guard(r + 1, w)
                □ r > 0 & endRead → Guard(r − 1, w)
                □ r = 0 ∧ w = 0 & startWrite → Guard(r, w + 1)
                □ w > 0 & endWrite → Guard(r, w − 1).

This is the solution from Section 2 that gives priority to writers, except we impose a bound of N upon the number of readers, and add guards to the second and fourth branches, in order to keep the state space finite. We will show that this solution is starvation-free as far as the writers are concerned: i.e. if a writer is trying to gain access then one such writer eventually succeeds.

We will simulate the above guard process using standard CSP, in particular simulating the ready, notReady, offer and notOffer actions by fresh CSP events on channels ready, notReady, offer and notOffer. Each process is translated into a form that uses these channels, following the semantics presented earlier; the simulation will have transitions that correspond to the ⤳ transitions of the original, except it will have a few additional τ transitions that do not affect the failures-divergences semantics. More precisely, let α̂ be the event used to simulate the action α; for example, if α = ready e then α̂ = ready.e. Then each process P is simulated by a translation trans(P), where if P −α⤳ Q then trans(P) −α̂→ (−τ→)∗ trans(Q), and vice versa. In particular, for each standard event e, we must add an offer.e or notOffer.e loop to each state.

For convenience, and to distinguish between the source and target languages, we present the simulation using prettified machine-readable CSP.⁵ The standard events and the channels to simulate the non-standard actions are declared as follows:

    channel startWrite, endWrite, startRead, endRead
    E = {startWrite, endWrite, startRead, endRead}
    channel ready, notReady, offer, notOffer : E

We start by defining some helper processes. The following process is the translation of STOP: it can only signal that it is not offering standard events.

    STOPT = notOffer?e → STOPT

⁵ The CSP text below is produced (almost) directly from the machine-readable CSP using LaTeX macros.
The following process is the translation of e → P: initially it can signal that it is not offering standard events; it can then timeout into a state where e is available, after which it acts like P; in this latter state it can also signal that it is offering e but no other standard events.⁶

    Prefix(e,P) = notOffer?e1 → Prefix(e,P) ▷ Prefix1(e,P)
    Prefix1(e,P) = e → P
                 □ offer.e → Prefix1(e,P)
                 □ notOffer?e1:diff(E,{e}) → Prefix1(e,P)
The reader might like to compare these with the ⤳ semantics in Appendix A. In order to simulate the Guard process, we simulate each branch as a separate parallel process: the branches of Guard synchronise on notOffer actions before the choice is resolved, so the processes simulating these branches will synchronise on appropriate notOffer events. The branches are simulated as below:

    Branch(1,r,w) = if w==0 and r<N then notReady.startWrite → Prefix(startRead, Restart(1,r+1,w)) else STOPT
    Branch(2,r,w) = if r>0 then Prefix(endRead, Restart(2,r−1,w)) else STOPT
    Branch(3,r,w) = if r==0 and w==0 then Prefix(startWrite, Restart(3,r,w+1)) else STOPT
    Branch(4,r,w) = if w>0 then Prefix(endWrite, Restart(4,r,w−1)) else STOPT

When one branch executes and reaches a point corresponding to a recursion within the Guard process, all the other branch processes need to be restarted, with new values for r or w. We implement this by the executing branch signalling on the channel restart.

    R = {0..N}       -- possible values of r
    W = {0..1}       -- possible values of w
    BRANCH = {1..4}  -- branch identifiers
    channel restart : BRANCH.R.W
    Restart(i,r,w) = restart!i.r.w → Branch(i,r,w)
Each branch can receive such a signal from another branch, as an interrupt, and restart with the new values for r and w.7

Branch'(i,r,w) = Branch(i,r,w) △ restart?j:diff(BRANCH,{i})?r'.w' → Branch'(i,r',w')

6 The operator diff represents set difference.
7 The △ is an interrupt operator; the left-hand side is interrupted when the right-hand side performs an event.
Below we will combine these Branch' processes in parallel so as to simulate Guard. We will need to be able to identify which branch performs certain events. For events e other than notOffer events, we rename e performed by branch i to c.i.e. We rename each notOffer.e event performed by branch i to both itself and notOffer1.i.e: the former will be used before the choice is resolved (synchronised between all branch processes), and the latter will be used after the choice is resolved (privately to branch i):8

EE = union(E, {|ready, notReady, offer|})  −− events other than notOffer
channel c : BRANCH.EE   −− c.i.e represents event e done by Branch(i,·,·)
channel notOffer1 : BRANCH.E
Branch''(i,r,w) =
  Branch'(i,r,w)
    [[ e ↦ c.i.e | e ← EE ]]
    [[ notOffer.e ↦ notOffer.e, notOffer.e ↦ notOffer1.i.e | e ← E ]]
alpha(i) = {|c.i, restart, notOffer, notOffer1.i|}  −− alphabet of branch i
Below we will combine the branch processes in parallel, together with a regulator process Reg that, once a branch has done a standard event to resolve the choice, blocks all events of the other branches until a restart occurs; further, it forces processes to synchronise on notOffer events before the choice is resolved, and subsequently allows the unsynchronised notOffer1 events.9

Reg = c?i?e → (if member(e,E) then Reg'(i) else Reg)
    □ notOffer?e → Reg
Reg'(i) = c.i?e → Reg'(i)
        □ restart.i?r?w → Reg
        □ notOffer1.i?e → Reg'(i)
We build the guard process by combining the branches and regulator in parallel, hiding the restart events, and reversing the above renaming.10

Guard0(r,w) = (‖ i : BRANCH • [alpha(i)] Branch''(i,r,w))
                [| {|c, restart, notOffer, notOffer1|} |] Reg
Guard(r,w) = (Guard0(r,w) \ {| restart |})
               [[ c.i.e ↦ e | e ← EE, i ← BRANCH ]]
               [[ notOffer1.i.e ↦ notOffer.e | e ← E, i ← BRANCH ]]
We can check the simple safety property that the guard allows at most one active writer at a time, and never allows both readers and writers to be active.

Spec(r,w) = w==0 and r<N & startRead → Spec(r+1,w)
          □ r>0 & endRead → Spec(r−1,w)
          □ r==0 and w==0 & startWrite → Spec(r,w+1)
          □ w>0 & endWrite → Spec(r,w−1)
internals = {|ready, notReady, offer, notOffer, writerTrying|}
assert Spec(0,0) ⊑T Guard(0,0) \ internals
8 union represents the union operation.
9 member tests for membership of a set.
10 The ‖ is an indexed parallel composition, indexed by i; here the ith component has alphabet alpha(i). The notation [| A |] is the machine-readable CSP version of the interface parallel operator with interface A.
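As a quick sanity check outside FDR, the counting automaton underlying Spec can be explored exhaustively in a few lines of Python (an illustrative sketch; N = 3 is an arbitrary choice, and the event names appear only as comments):

```python
# Exhaustively explore the reader/writer counting automaton that Spec
# encodes, and assert the safety invariant on every reachable state:
# at most one writer, and never readers and a writer together.
N = 3

def successors(state):
    r, w = state
    if w == 0 and r < N:  yield (r + 1, w)  # startRead
    if r > 0:             yield (r - 1, w)  # endRead
    if r == 0 and w == 0: yield (r, w + 1)  # startWrite
    if w > 0:             yield (r, w - 1)  # endWrite

seen, stack = {(0, 0)}, [(0, 0)]
while stack:
    r, w = stack.pop()
    assert w <= 1 and not (r > 0 and w > 0), "safety violated"
    for nxt in successors((r, w)):
        if nxt not in seen:
            seen.add(nxt)
            stack.append(nxt)

print(sorted(seen))  # → [(0, 0), (0, 1), (1, 0), (2, 0), (3, 0)]
```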
This test succeeds, at least for small values of N. In order to verify a liveness property, we need to model the readers and writers themselves. Each reader alternates between performing startRead and endRead, or may decide to stop (when not reading). Each writer is similar, but, for later convenience, we add an event writerTrying to indicate that it is trying to perform a write. It is important that the startWrite event becomes available immediately after the writerTrying event, in order for the liveness property below to be satisfied, hence we have the following form.

channel writerTrying
Reader = Prefix(startRead, Prefix(endRead, Reader)) ⊓ STOPT
Writer = (writerTrying → Prefix1(startWrite, Prefix(endWrite, Writer)) ⊓ STOPT)
       □ notOffer.startWrite → Writer
Following the semantic definitions, we need to synchronise the notOffer.startWrite events of the individual writers, and we need to synchronise the offer.startWrite and notOffer.startWrite events of the writers with the ready.startWrite and notReady.startWrite events of the guard, respectively. For convenience, we block the remaining offer and notOffer events, since we make no use of them, and processes do not change state when they perform such an event.

Readers = ||| r:{1..N} • Reader
Writers = ([| {notOffer.startWrite} |] w:{1..N} • Writer)
            [[ offer.startWrite ↦ ready.startWrite,
               notOffer.startWrite ↦ notReady.startWrite ]]
ReadersWriters = Readers ||| Writers
System = let SyncSet = union(E, {ready.startWrite, notReady.startWrite})
         within (Guard(0,0) [| SyncSet |] ReadersWriters)
                  [| {|offer, notOffer|} |] STOP
We now consider the liveness property that the guard is fair to the writers, in the sense that if at least one writer is trying to gain access, then one of them eventually succeeds. Testing for this property is not easy: the only way to test that a startWrite event eventually becomes available is to hide the readers' events, and to check that startWrite becomes available without a divergence (so after only finitely many readers' events); however, hiding all the readers' events will lead to a divergence when no writer is trying to gain access (at which point refinement tests do not act in the way we would like). What we therefore do is use a construction that has the effect of hiding the readers' events when a writer is trying to gain access, but leaving the startRead events visible when no writer is trying to gain access. We then test against the following specification (where all other irrelevant events are hidden).

WLSpec(n) = writerTrying → WLSpec(n+1)
          □ n>0 & startWrite → WLSpec(n−1)
          □ n==0 & (startRead → WLSpec(n) ⊓ STOP)

The parameter n records how many writers are currently trying; when n>0 (i.e. at least one writer is trying), this process insists that a writer can start after a finite amount of (hidden) activity by the readers; when n==0 (i.e. no writer is trying), the process allows arbitrary startRead events.
The way to implement the state-dependent hiding described above is to rename startRead events both to themselves and a new event startRead', put this in parallel with a regulator that allows startRead events when no writer is trying and startRead' events when at least one writer is trying, and then hide startRead' and other irrelevant events.

channel startRead'
System' = (System
             [[ startRead ↦ startRead, startRead ↦ startRead' ]]
           [| {writerTrying, startWrite, startRead, startRead'} |] Reg1(0))
          \ {|endRead, endWrite, ready, notReady, startRead'|}
Reg1(n) = writerTrying → Reg1(n+1)
        □ n>0 & startWrite → Reg1(n−1)
        □ n==0 & startRead → Reg1(n)
        □ n>0 & startRead' → Reg1(n)
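The regulator is just a four-way guarded state machine over the count n. The following Python sketch (hypothetical function name; events as strings) mirrors its transitions and returns None for an event that is blocked in the current state:

```python
# Sketch of the Reg1 regulator: n counts writers currently trying.
# startRead is allowed only when no writer is trying; the hidden
# variant startRead' only when at least one writer is trying.
def reg1_step(n, event):
    if event == "writerTrying":
        return n + 1
    if event == "startWrite" and n > 0:
        return n - 1
    if event == "startRead" and n == 0:
        return n
    if event == "startRead'" and n > 0:
        return n
    return None  # event blocked in this state

n = 0
for e in ["startRead", "writerTrying", "startRead'", "startWrite", "startRead"]:
    n = reg1_step(n, e)
    assert n is not None  # every event in this trace is allowed
```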
We can then use FDR to verify

assert WLSpec(0) ⊑FD System'
In particular, this means that the right-hand side is divergence-free, so when n>0 the startWrite events will become available after a finite number of the hidden events.

6. Discussion

In this paper we have considered an extension of CSP that allows processes to test whether an event is available. We have formalised this construct by giving an operational semantics and congruent denotational semantic models. We have illustrated how we can use a standard model checker to analyse systems in this extended language. In this final section we discuss some related work and possible extensions to this work.

6.1. Comparison with Standard Models

There have been several different denotational semantic models for CSP. Most of these are based on the standard syntax of CSP, with the standard operational semantics. It is not possible to compare these models directly with the models in this paper, since we have used a more expressive language, with the addition of readiness tests; however, we can compare them with the sub-language excluding this construct. For processes that do not use readiness tests, the two models of this paper are more distinguishing than the standard traces and stable failures models, respectively. Theorems 15 and 17 show that our models make at least as many distinctions as the standard models. They distinguish processes that the standard models identify, such as

a → STOP ⊓ (a → STOP □ b → STOP)   and   a → STOP ⊓ b → STOP:
the former, but not the latter, has the trace ⟨offer a, b⟩. In [10], Roscoe gives a survey of the denotational models based on the standard syntax. The most distinguishing of those models based on finite observations (and so not modelling divergences) is the finite linear model, FL. This model uses observations of the form ⟨A0, a0, A1, a1, . . . , an−1, An⟩, where each ai is an event that is performed, and each Ai is either (a) a set of events, representing that those events are offered in a stable state (and so ai ∈ Ai), or (b) the special value • representing no information about what events are stably offered (perhaps because the process did not stabilise).
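The distinction claimed above can be checked mechanically on hand-coded transition systems. In the Python sketch below (an illustration, not the paper's machinery), tau edges model the internal resolution of ⊓, and offer-labelled self-loops model availability signals; only the first process admits the trace ⟨offer a, b⟩:

```python
# Hand-coded LTSs with offer-signals for the two example processes:
# P1 = a -> STOP |~| (a -> STOP [] b -> STOP)
# P2 = a -> STOP |~| b -> STOP
P1 = {
    "init": [("tau", "A"), ("tau", "AB")],
    "A":    [("offer a", "A"), ("a", "stop")],
    "AB":   [("offer a", "AB"), ("offer b", "AB"), ("a", "stop"), ("b", "stop")],
    "stop": [],
}
P2 = {
    "init": [("tau", "A"), ("tau", "B")],
    "A":    [("offer a", "A"), ("a", "stop")],
    "B":    [("offer b", "B"), ("b", "stop")],
    "stop": [],
}

def has_trace(lts, trace, state="init"):
    """Can the LTS perform this sequence of visible labels (tau-steps free)?"""
    if not trace:
        return True
    want, rest = trace[0], trace[1:]
    for label, nxt in lts[state]:
        if label == want and has_trace(lts, rest, nxt):
            return True
        if label == "tau" and has_trace(lts, trace, nxt):
            return True
    return False

assert has_trace(P1, ["offer a", "b"])       # the external-choice branch
assert not has_trace(P2, ["offer a", "b"])   # no state offers a and then does b
```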
For processes with no readiness tests, FL is incomparable with the models in this paper. Our models distinguish processes that FL identifies, essentially because the latter records the availability of events only in stable states, whereas our models record this information also in unstable states. For example, the processes

(b → STOP ▷ a → STOP) ⊓ b → STOP   and   a → STOP ⊓ b → STOP
are distinguished in our models since just the former has the trace ⟨offer b, a⟩; however they are identified in FL (and hence all the other finite observation models from [10]) because this b is not stably available. Conversely, FL distinguishes processes that our models identify, such as

a → b → STOP ⊓ (a → STOP ▷ STOP)   and   a → (b → STOP ⊓ STOP) ▷ STOP,
since the former has the observation ⟨{a}, a, •, b, •⟩, but the latter does not, since its a is performed from an unstable state; however, they are identified by our failures model (and hence our traces model) since this records stability information only at the end of a trace. I believe it would be straightforward to extend our models to record stability information throughout the trace in the way FL does.

Roscoe also shows that each of the standard finite observation models can be extended to model divergences in three essentially different ways. I believe it would be straightforward to extend the model in this paper to include divergences following any of these techniques.

6.2. Comparison with Other Prioritised Models

As we described in the introduction, the readiness tests can be used to implement a form of priority. There have been a number of previous attempts to add priority to CSP. Lawrence [11] models priorities by representing a process as a set of triples of the form (tr, X, Y), meaning that after performing trace tr, if a process is offered the set of events X, then it will be willing to perform any of the events from Y. For example, a process that initially gives priority to a over b would include the triple (⟨⟩, {a, b}, {a}). Fidge [12] models priorities using a set of “preferences” relations over events. For example, a process that gives priority to a over b would have the preferences relation {a → a, b → b, a → b}. In [13,14], I modelled priorities within timed CSP using an order over the sets (actually, multi-sets) of events that the process could do at the same time. For example, a process that gives priority to a over b (and would rather do either than nothing) at time 0 would have the ordering (0, {a}) ≥ (0, {b}) ≥ (0, {}). All the above models are rather complex: I would claim that the model in this paper is somewhat simpler, and has the advantage of allowing a translation into standard CSP, in order to use the FDR model checker.
One issue that sometimes arises when considering priority is what happens when two processes with opposing priorities are composed in parallel. For example, consider P ‖ Q where P and Q give priority to a and b respectively:

P = a → P1 □ notReady a & b → P2,
Q = b → Q1 □ notReady b & a → Q2.

There are essentially three ways in which this parallel composition can behave (in a context that doesn't block a or b):

• If P performs its notReady test before Q, the test succeeds; if, further, P makes the b available before Q performs its notReady test, then Q's notReady test will fail, and so the parallel composition will perform b;
• The symmetric opposite of the previous case, where Q performs its notReady test and makes a available before P performs its notReady test, and so the parallel composition performs a;
• If both processes perform their notReady tests before the other process makes the corresponding event available, then both tests will succeed, and so both events will be possible.

I consider this to be an appropriate solution. By contrast, the model in [11] leads to this system deadlocking; in [13,14] I used a prioritised parallel composition operator to give priority to the preferences of one component, thereby avoiding this problem, but at the cost of considerable complexity.

6.3. Readiness Testing for Channels

Most CSP-like programming languages make use of channels that pass data. In such languages, one can often test for the availability of channels, rather than of individual events. Note, though, that there is an asymmetry between the reader and writer of the channel: the reader will normally want to test whether there is any event on the channel that is ready for communication (i.e. the writer is ready to communicate), whereas the writer will normally want to test if all events on the channel are ready for communication (i.e. the reader is ready to communicate). It would be interesting to extend the language of this paper with constructs

if ready all A then P else Q   and   if ready any A then P else Q,

where A is a set of events (e.g. all events on a channel), to capture this idea.

6.4. Model Checking

In Section 5, we gave an indication as to how to simulate the language of this paper using standard CSP, so as to use a model checker such as FDR. This technique works in general. This can be shown directly, by exhibiting the translation. Alternatively, we can make use of a general result from [15], where Roscoe shows that any operator with an operational semantics that is “CSP-like” —essentially that the operator can turn arguments on, interact with arguments via visible events, promote τ events of arguments, and maybe turn arguments off— can be simulated using standard CSP operators. The derived semantics of this paper is “CSP-like” in this sense, so we can use those techniques to simulate this semantics.

We intend to automate the translation. One difficulty is that the translation from [15] can produce processes that are infinite state, because they branch off extra parallel processes at each recursion of the simulated process. In Section 5 we avoided this problem by restarting the Branch processes that constituted the guard at each recursion (this uses an idea also due to Roscoe), whereas Roscoe's approach would branch off new processes at each recursion. I believe the technique in Section 5 can be generalised and probably automated.

6.5. Full Abstraction

The denotational semantic models we have presented turn out not to be fully abstract with respect to may-testing [16]. Consider the process if ready a then P else P. This is denotationally distinct from P, since its traces have ready a or notReady a events added to the traces of P. Yet there seems no way to distinguish the two processes by testing: i.e., there is no good reason to consider those processes as distinct. We believe that one could form a fully abstract semantics as follows. Consider the relation ∼ over sets of traces defined by
(S ∪ {tr ⌢ tr′}) ∼ (S ∪ {tr ⌢ ⟨ready a⟩ ⌢ tr′, tr ⌢ ⟨notReady a⟩ ⌢ tr′}) for all S ∈ P(Σ∗), tr, tr′ ∈ Σ∗, a ∈ Σ. In other words, two sets are related if one is formed from the other by adding ready a and notReady a actions in the same place. Let ≈ be the transitive reflexive closure of ∼. This relation essentially abstracts away irrelevant readiness tests. We conjecture that P and Q are testing equivalent iff traces[[P]] ≈ traces[[Q]], and that it might be possible to produce a compositional semantics corresponding to this equivalence. It is not clear that the benefits of full abstraction are worth this extra complexity, though.

Acknowledgements

I would like to thank Bill Roscoe and Bernard Sufrin for interesting discussions on this work. I would also like to thank the anonymous referees for a number of very useful suggestions.

References

[1] Peter Welch, Neil Brown, James Moores, Kevin Chalmers, and Bernhard Sputh. Integrating and extending JCSP. In Communicating Process Architectures, pages 48–76, 2007.
[2] Peter Welch and Neil Brown. Communicating sequential processes for Java (JCSP). http://www.cs.kent.ac.uk/projects/ofa/jcsp/, 2009.
[3] Gregory R. Andrews. Foundations of Multithreaded, Parallel, and Distributed Programming. Addison-Wesley, 2000.
[4] A. W. Roscoe. The Theory and Practice of Concurrency. Prentice Hall, 1997.
[5] A. W. Roscoe. Model-checking CSP. In A Classical Mind, Essays in Honour of C. A. R. Hoare. Prentice Hall, 1994.
[6] Formal Systems (Europe) Ltd. Failures-Divergence Refinement—FDR 2 User Manual, 1997.
[7] C. A. R. Hoare. Communicating Sequential Processes. Prentice Hall, 1985.
[8] P. J. Courtois, F. Heymans, and D. L. Parnas. Concurrent control with “readers” and “writers”. Communications of the ACM, 14(10):667–668, 1971.
[9] E. R. Olderog and C. A. R. Hoare. Specification-oriented semantics for communicating processes. Acta Informatica, 23(1):9–66, 1986.
[10] A. W. Roscoe. Revivals, stuckness and the hierarchy of CSP models. Journal of Logic and Algebraic Programming, 78(3):163–190, 2009.
[11] A. E. Lawrence. Triples. In Proceedings of Communicating Process Architectures, pages 157–184, 2004.
[12] C. J. Fidge. A formal definition of priority in CSP. ACM Transactions on Programming Languages and Systems, 15(4):681–705, 1993.
[13] Gavin Lowe. Probabilities and Priorities in Timed CSP. DPhil thesis, Oxford, 1993.
[14] Gavin Lowe. Probabilistic and prioritized models of Timed CSP. Theoretical Computer Science, 138:315–352, 1995.
[15] A. W. Roscoe. On the expressiveness of CSP. Available via http://web.comlab.ox.ac.uk//files/1383/complete(3).pdf, 2009.
[16] R. de Nicola and M. C. B. Hennessy. Testing equivalences for processes. Theoretical Computer Science, 34:83–133, 1984.
A. Derived Operational Semantics

The definition of the signal relation, and the operational semantic rules for the −→ relation, can be translated into the following defining rules for a single derived relation; we write P −α→ P′ for a derived transition with label α.

STOP:

STOP −notOffer a→ STOP,  for a ∈ Σ

Prefixing:

a → P −notOffer b→ a → P,  for b ∈ Σ
a → P −τ→ ǎ → P
ǎ → P −offer a→ ǎ → P
ǎ → P −a→ P
ǎ → P −notOffer b→ ǎ → P,  for b ≠ a

Readiness tests:

if ready a then P else Q −ready a→ P
if ready a then P else Q −notReady a→ Q
if ready a then P else Q −notOffer b→ if ready a then P else Q,  for b ∈ Σ

External choice (plus the symmetric rules):

P −τ→ P′  implies  P □ Q −τ→ P′ □ Q
P −a→ P′  implies  P □ Q −a→ P′
P −ready a→ P′  implies  P □ Q −ready a→ P′ □ Q
P −notReady a→ P′  implies  P □ Q −notReady a→ P′ □ Q
P −offer a→ P′  implies  P □ Q −offer a→ P′ □ Q
P −notOffer a→ P′ and Q −notOffer a→ Q′  implies  P □ Q −notOffer a→ P′ □ Q′

Timeout:

P −τ→ P′  implies  P ▷ Q −τ→ P′ ▷ Q
P −a→ P′  implies  P ▷ Q −a→ P′
P ▷ Q −τ→ Q
P −ready a→ P′  implies  P ▷ Q −ready a→ P′ ▷ Q
P −notReady a→ P′  implies  P ▷ Q −notReady a→ P′ ▷ Q
P −offer a→ P′  implies  P ▷ Q −offer a→ P′ ▷ Q
P −notOffer a→ P′  implies  P ▷ Q −notOffer a→ P′ ▷ Q

Hiding:

P −α→ P′  implies  P \ A −α→ P′ \ A,  for α ∈ (Σ − A) ∪ {τ} ∪ ready(Σ − A) ∪ notReady(Σ − A) ∪ offer(Σ − A) ∪ notOffer(Σ − A)
P −α→ P′  implies  P \ A −τ→ P′ \ A,  for α ∈ A ∪ ready A
P \ A −notOffer a→ P \ A,  for a ∈ A

Parallel composition (plus the symmetric rules):

P −α→ P′  implies  P A‖B Q −α→ P′ A‖B Q,  for α ∈ privateA ∪ {τ}
P −α→ P′ and Q −α→ Q′  implies  P A‖B Q −α→ P′ A‖B Q′,  for α ∈ syncA,B
P −ready b→ P′ and Q −offer b→ Q′  implies  P A‖B Q −ready b→ P′ A‖B Q′,  for b ∈ B
P −notReady b→ P′ and Q −notOffer b→ Q′  implies  P A‖B Q −notReady b→ P′ A‖B Q′,  for b ∈ B
P A‖B Q −notOffer d→ P A‖B Q,  for d ∈ Σ − A − B
B. Congruence of the Operational Semantics

In this appendix, we prove some of the cases in the proof of Theorem 13: tr ∈ tracesR[[P]] iff P =tr⇒.

Hiding We prove the case of hiding in Theorem 13 using the derived rules in Appendix A.

(⇒) Suppose tr ∈ tracesR[[P \ A]]. Then there exists some trP ∈ tracesR[[P]] such that trP ↾ (notReady A ∪ offer A) = ⟨⟩ and trP \ (A ∪ ready A) = tr \ notOffer A. By the inductive hypothesis, P =trP⇒, i.e., P −α1→ · · · −αn→ for some α1, . . . , αn such that trP = ⟨α1, . . . , αn⟩ \ {τ}. From the derived operational semantics rules, P \ A has the same transitions but with each αi ∈ A ∪ ready A replaced by a τ, i.e., transitions corresponding to the trace trP \ (A ∪ ready A). Further, using the third derived rule for hiding, arbitrary notOffer A self-loops can be added to the transitions, giving transitions corresponding to trace tr. Hence P \ A =tr⇒.

(⇐) Suppose P \ A =tr⇒. Consider the transitions of P that lead to this trace. By consideration of the derived rules, we see that P =trP⇒ for some trace trP such that trP ↾ (notReady A ∪ offer A) = ⟨⟩ and trP \ (A ∪ ready A) = tr \ notOffer A. By the inductive hypothesis, trP ∈ tracesR[[P]]. Hence, tr ∈ tracesR[[P \ A]].

Parallel Composition We prove the case of parallel composition in Theorem 13 using the derived rules from Appendix A.

(⇒) Suppose tr ∈ tracesR[[P A‖B Q]]. Then there exist trP ∈ tracesR[[P]] and trQ ∈ tracesR[[Q]] such that trP ↾ ((Σ − A) ∪ offer(Σ − A) ∪ notOffer(Σ − A)) = ⟨⟩, trQ ↾ ((Σ − B) ∪ offer(Σ − B) ∪ notOffer(Σ − B)) = ⟨⟩, and (trP, trQ) →A‖B tr \ notOffer(Σ − A − B). By the inductive hypothesis, P =trP⇒ and Q =trQ⇒. So P −α1→ · · · −αn→ and Q −β1→ · · · −βm→ for some α1, . . . , αn, β1, . . . , βm such that trP = ⟨α1, . . . , αn⟩ \ {τ} and trQ = ⟨β1, . . . , βm⟩ \ {τ}. We then have that P A‖B Q =tr \ notOffer(Σ−A−B)⇒, since each event implied by (trP, trQ) →A‖B tr \ notOffer(Σ − A − B) has a corresponding transition implied by the operational semantics rules (formally, this is a case analysis over the clauses of →A‖B, combined with a straightforward induction on m + n). Further, using the final derived rule for parallel composition, arbitrary notOffer(Σ − A − B) self-loops can be added to the transitions, giving transitions corresponding to trace tr. Hence P A‖B Q =tr⇒.

(⇐) Suppose P A‖B Q =tr⇒. Then by item 3 of Lemma 9, P A‖B Q =tr \ notOffer(Σ−A−B)⇒. Consider the transitions of P and Q that lead to this trace according to the operational semantics rules, say P −α1→ · · · −αn→ and Q −β1→ · · · −βm→. Let trP = ⟨α1, . . . , αn⟩ \ {τ} and trQ = ⟨β1, . . . , βm⟩ \ {τ}; so P =trP⇒ and Q =trQ⇒. By the inductive hypothesis, trP ∈ tracesR[[P]] and trQ ∈ tracesR[[Q]]. Also, by consideration of the operational semantics rules, trP ↾ ((Σ − A) ∪ offer(Σ − A) ∪ notOffer(Σ − A)) = ⟨⟩ and trQ ↾ ((Σ − B) ∪ offer(Σ − B) ∪ notOffer(Σ − B)) = ⟨⟩. Further, (trP, trQ) →A‖B tr \ notOffer(Σ − A − B), since each transition implied by the operational semantics rules has a corresponding event implied by the definition of →A‖B (formally, this is a case analysis over the operational semantics rules, combined with a straightforward induction on m + n). Hence tr ∈ tracesR[[P A‖B Q]].
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-349
Design Patterns for Communicating Systems with Deadline Propagation

Martin KORSGAARD 1 and Sverre HENDSETH
Department of Engineering Cybernetics, Norwegian University of Science and Technology

Abstract. Toc is an experimental programming language based on occam that combines CSP-based concurrency with integrated specification of timing requirements. In contrast to occam with strict round-robin scheduling, the Toc scheduler is lazy and does not run a process unless there is a deadline associated with its execution. Channels propagate deadlines to dependent tasks. These differences from occam necessitate a different approach to programming, where a new concern is to avoid dependencies and conflicts between timing requirements. This paper introduces client-server design patterns for Toc that allow the programmer precise control of timing. It is shown that if these patterns are used, the deadline propagation graph can be used to provide sufficient conditions for schedulability. An alternative definition of deadlock in deadline-driven systems is given, and it is demonstrated how the use of the suggested design patterns allows the absence of deadlock to be proven in Toc programs. The introduction of extended rendezvous into Toc is shown to be essential to these patterns.

Keywords. real-time programming, Toc programming language, design patterns, deadlock analysis, scheduling analysis
Introduction

The success of a real-time system depends not only on its computational results, but also on the time at which those results are produced. Traditionally, real-time programs have been written in C, Java or Ada, where parallelism and timeliness rely on the concepts of tasks and priorities. Each timing requirement from the functional specification of the system is translated into a periodic task, and the priority of each task is set based on a combination of the task's period and importance. Tasks are implemented as separate threads running infinite loops, with a suitable delay statement at the end to enforce the task period. Shared-memory based communication is the most common choice for communication between threads. With this approach, the temporal behaviour of the program is controlled indirectly, through the choice of priorities and synchronization mechanisms.

The occam programming language [1] has a concurrency model that only supports synchronous communication, and where parallelism is achieved using structured parallels instead of threads. Control of timeliness is only possible with a prioritized parallel construct, which is insufficient for many real-time applications [2]. The occam-π language [3] improves on this by allowing control of absolute priorities. The occam process model is closely related to the CSP process algebra [4,5], and CSP is well suited for formal analysis of occam programs.

Toc [6,7] is an experimental programming language based on occam that allows the specification of timing requirements directly in the source code. The goal of Toc is to create a

1 Corresponding Author: Martin Korsgaard, Department of Engineering Cybernetics, O.S. Bragstads plass 2D, 7491 Trondheim, Norway. Tel.: +47 73 59 43 76; Fax: +47 73 59 45 99; E-mail: [email protected].
language where the temporal response is specified explicitly rather than indirectly, and where the programmer is forced to consider timing as a primary concern, rather than something to be solved when the computational part of the program is complete. A prototype compiler and run-time system has been developed.

In Toc, the timing requirements from the specification of the system can be expressed directly in source code. This is achieved with a dedicated deadline construct and synchronous channels that also propagate deadlines to dependent tasks. A consequence of deadline propagation is an implicit deadline inheritance scheme. The Toc scheduler is lazy in that it does not execute statements without an associated deadline. Server processes or other shared resources do not execute on their own, but must be driven by deadlines implicitly given by their access channels. The laziness means that temporal requirements cannot be ignored when programming, and tasks that traditionally would be implemented as deadline-less “background tasks” must now be given explicit deadlines.

occam programs use a strict, round-robin scheduler that closely matches CSP's principle of maximum progress. This is one reason why CSP is well suited to model occam programs. However, laziness is not closely matched by a principle of maximum progress, and that makes CSP-based analysis of Toc programs harder.

For all parallel systems that use synchronization there is a risk of deadlock. In occam, the client-server design paradigm can be used to avoid deadlocks by removing the possibility of a circular wait [8,9]. Here, each process acts as either a client or a server in each communication, and each communication is initiated by a request and terminated by a reply. If the following three criteria are met, then the system is deadlock-free:

1. Between a request to a server and the corresponding reply, a client may not communicate with any other process.
2. Between accepting a request from a client and the corresponding reply, a server may not accept requests from other clients, but may act as a client to other servers.
3. The client-server relation graph must be acyclic.

An alternative deadlock-free design pattern described in [8] is I/O-PAR, where deadlock-freedom is guaranteed when all processes proceed in a sequential cycle of computation followed by all I/O in parallel. The whole system must progress in this fashion, similar to bulk synchronous parallelism (BSP) [10]. It is also possible to use a model checker such as SPIN [11] or FDR2 [12] to prove the absence of deadlocks under more general conditions.

For some real-time systems, proving the absence of deadlocks is not sufficient, and a formal verification of schedulability is also required. Toc is scheduled using earliest deadline first (EDF), and for EDF systems, a set of independent, periodic tasks with deadline equal to period is schedulable if and only if:
∑i Ci / Ti ≤ 1    (1)
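Equation (1) translates directly into a utilisation test. The following sketch (function name and task representation are assumptions, not from the paper) checks it for a list of (Ci, Ti) pairs:

```python
# EDF schedulability test of equation (1) for independent periodic tasks
# with deadline equal to period: total utilisation must not exceed 1.
def edf_schedulable(tasks):
    """tasks: list of (C, T) pairs, worst-case execution time and period."""
    return sum(c / t for c, t in tasks) <= 1.0

assert edf_schedulable([(1, 4), (2, 8), (1, 2)])      # 0.25 + 0.25 + 0.5 = 1.0
assert not edf_schedulable([(2, 4), (2, 8), (1, 2)])  # 1.25 > 1
```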
where Ci is the worst-case computation time (WCET) of task i, and Ti is the period of that task [13]. Equation 1 ignores overheads associated with scheduling.

Scheduling analysis is more complicated for task sets that are not independent. One effect that complicates analysis of dependent tasks is priority inversion [14], where a low-priority task holding a resource blocks a higher-priority task that requires it. Priority inversions are unavoidable in systems where tasks share resources. In contrast, an unbounded priority inversion is a serious error, and occurs when the low-priority task holding the resource required by the high-priority task is further preempted by intermediate-priority tasks, in which case the high-priority task may be delayed indefinitely. Solutions to this problem include the priority inheritance and the priority ceiling algorithms, both defined in [15]. These algorithms are restricted to priority-based systems. Other systems can be scheduled
and analyzed by using the Stack Resource Policy [16], for example. SRP supports a wide range of communication and synchronization primitives, but is otherwise similar to the priority ceiling algorithm. Unfortunately, schedulability analysis of task sets that communicate synchronously using rendezvous is not well developed [17]. If deadlines may be shorter than periods, necessary and sufficient schedulability conditions can be found using the Processor Demand Criterion [18], but the computation is only practical if the relative start-times of all the tasks are known (a “complete task set”). If the relative start-times are not known (an “incomplete task set”) then the problem of determining schedulability is NP-hard.

To use these analysis methods one needs to know the WCET C of each task. This is far from trivial, and may be impossible if the code contains variable-length loops or recursion. Fortunately, programmers of real-time systems usually write code that can easily be found to halt, reducing the problem to finding out when. Execution times vary with input data and the state of the system, meaning that measurements of the execution time are not sufficient. The use of modern CPU features such as superscalar pipelines and caches further increases this variation. Despite these difficulties, bounds on execution times can still be found in specific cases using computerized tools such as aiT [19]. The process is computationally expensive [20] and subject to certain extra requirements such as manual annotations of the maximum iterations in a loop [21]. aiT is also limited to fixed task sets of fixed-priority tasks.

This paper discusses how analysis of schedulability and deadlocks can be enabled in Toc by using a small set of design patterns. The paper begins with a description of the Toc timing mechanisms in Section 1 and the consequences of these mechanisms.
Section 2 describes deadlocks in deadline-propagating systems, and how they differ from deadlocks in systems based on the principle of maximum progress. Section 3 describes the suggested design patterns and how they are used to program a system in an intuitive manner that helps avoid conflicts between timing requirements. Section 4 explains how schedulability analysis can be done on systems that are designed using these design patterns. The paper ends with a discussion and some concluding remarks.

1. Temporal Semantics of Toc

The basic timing operator in Toc is TIME, which sets a deadline for a given block of code, with the additional property that the TIME construct is not allowed to terminate before the deadline. The deadline is a scheduling guideline, and there is nothing in the language to stop deadlines from being missed in an overloaded system. In contrast, the minimum execution time constraint is absolute and is strictly enforced. A task in Toc is defined as the process corresponding to a TIME construct. A periodic task with deadline equal to period is simply a TIME construct wrapped in a loop, and a periodic task with a relative deadline shorter than its period is written using two nested TIME constructs: the outer restricts the period while the inner sets the deadline. Toc handles the timing of consecutive task instances in a way that removes drift [6, 7]. A sporadic task triggered by a channel can be created by putting the TIME construct in sequence after the channel communication. A TIME construct will drive the execution of all statements in its block with the given deadline whenever this deadline is the earliest in the system. If channel communications are part of that block, then the task will also drive the execution of the other end of those channels up to the point where the communications can proceed.
A task is stalled if it cannot drive a channel because the target process is waiting for a minimum execution time to elapse; execution of the task will continue when the target becomes ready.
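The drift-free handling of consecutive task instances mentioned above can be illustrated by computing each release time from the previous release rather than from the completion of the previous instance. The following Python sketch is an illustrative model, not Toc code; the function names and parameters are assumptions made for the example:

```python
def release_times(start, period, n):
    """Drift-free release schedule: each release is exactly one period
    after the previous one, so completion-time jitter never accumulates."""
    return [start + i * period for i in range(n)]

def drifting_release_times(start, period, finish_jitter, n):
    """Naive scheme: the next release is measured from when the previous
    instance finished, so per-instance jitter accumulates as drift."""
    times, t = [], start
    for _ in range(n):
        times.append(t)
        t = t + period + finish_jitter  # drift grows by finish_jitter each period
    return times

print(release_times(0, 10, 5))              # [0, 10, 20, 30, 40]
print(drifting_release_times(0, 10, 1, 5))  # [0, 11, 22, 33, 44]
```

The first function corresponds to the behaviour the TIME construct guarantees; the second shows what happens without it.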
Figure 1. Order of Execution when task communicates with passive process.
Figure 2. Use of Extended Rendezvous. (a) x will not be passed on due to laziness. (b) x will be forwarded with the deadline inherited from input.
Laziness in Toc means that primitive processes (assignment, input/output, SKIP and STOP) that are not driven by a deadline will not be executed, even if the system is otherwise idle. However, constructed processes will be evaluated even without a deadline in cases where a TIME construct can be reached without first executing primitive processes. Data from the compiler’s parallel-usage-rules checker is used to determine which process to execute to make the other end of a given channel ready. An example of the order of execution is given in Figure 1. Here, the process P.lazy() will not be executed because it is never given a deadline.

1.1. Timed Alternation

In Toc, alternations are not resolved arbitrarily; the choice of alternative is resolved by analyzing the timing requirements of the system. Choice resolution is straightforward when there is no deadline driving the ALT itself, so that the only reason the process with the ALT is executing at all is that it is driven by one of its guard channels, which by the definition of EDF must have the earliest deadline of all ready tasks in the system at that time. Always selecting the driving alternative ensures that the choice resolution benefits the earliest-deadline task in the system. This usage corresponds to the ALT being a passive server with no timing requirements of its own, which handles requests from clients using the deadline of the most urgent client. Other cases cannot be resolved as easily: if the ALT itself has a deadline, if the ALT is driven by a channel that is not part of the ALT itself, or if the driving channel is blocked by a Boolean guard, then the optimal choice resolution with respect to timing cannot easily be found. If the design patterns presented in this paper are used, none of these situations will arise, and these issues are therefore not discussed here.

1.2. Use of Extended Rendezvous

A consequence of laziness is that driving a single channel communication with another process cannot force execution of the other process beyond what is required to complete the communication. One consequence of this is that it is difficult to write multiplexers or simple data-forwarding processes. This is illustrated in Figure 2a: if something arrives on input then it will not immediately be passed on to output, because there is no deadline driving the execution of the second communication. The output will only happen when driven by the deadline of the next input.
A workaround is to add a deadline to the second communication, but this is awkward, and the choice of deadline would be either arbitrary or a repeated specification of an existing deadline, neither of which is desirable. Another solution, though inflexible, is to avoid forwarding data altogether, but this would put severe restrictions on program organization. Extended inputs [3] fix the problem. An extended input is a language extension from occam-π that allows the programmer to force the execution of a process “in the middle” of a channel communication, after the processes have rendezvoused but before they are released. Syntactically, this during process is indented below the input, and a double question mark is used as the input operator. In Toc, the consequence is that the deadline used to drive the communication will also drive the during process. A correct way to forward a signal or data is shown in Figure 2b.

2. Deadlock in Timed Systems

A deadlock in the classical sense is a state where a system cannot make progress because every process in the system is waiting for one of the other processes. For a system to deadlock in this way there must be at least one chain of circular waits. In Toc, when the earliest-deadline task requires communication over a channel, it passes its deadline to the process that will communicate next on the other side of that channel. If this process in turn requires communication with another process, the deadline propagates to that process as well. Thus a task that requires communication over a channel is never blocked; it simply defers its execution to the task that it is waiting for. This can only fail if a process, through others, passes a deadline onto itself. In that case the process cannot communicate on one channel before it has communicated on another, resulting in a deadlock.
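The circular-propagation condition just described can be checked statically on a pattern-level deadline-propagation graph. The following Python sketch assumes a simple adjacency-list encoding of the graph (an assumption for illustration; Toc itself has no such API):

```python
def has_propagation_cycle(graph):
    """Detect a cycle in a deadline-propagation graph.

    graph: dict mapping each process name to the processes it propagates
    its deadline to.  A cycle means some process can, through others,
    pass a deadline onto itself.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for succ in graph.get(node, ()):
            if colour.get(succ, WHITE) == GREY:   # back edge: cycle found
                return True
            if colour.get(succ, WHITE) == WHITE and visit(succ):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)

acyclic = {"T1": ["S1"], "T2": ["S1"], "S1": []}
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(has_propagation_cycle(acyclic))  # False
print(has_propagation_cycle(cyclic))   # True
```

An acyclic propagation graph rules out the deadlock of Definition 1 below; this is the pattern-level check that the client-server paradigm makes possible.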
This leads to the following alternative definition of deadlock in Toc:

Definition 1 (Deadlock) A system has deadlocked if and only if there exists a process that has propagated its deadline onto itself.

One advantage of this definition is that it takes laziness into account. Deadlock avoidance methods that ignore laziness are theoretically more conservative, because it is possible to construct programs that would deadlock but do not, because the unsafe code is lazy and never executes. In occam one can avoid deadlocks with the client-server paradigm, provided that clients and servers follow certain rules and the client-server relation graph is acyclic. This ensures that there are no circular waits. In Toc, one can use the same paradigm to avoid deadlocks by ensuring that there is no circular propagation of deadlines. In Toc there can also be another type of deadlock-like state, the slumber, where a system or subsystem makes no progress because it has no deadlines. Slumber occurs, for example, in systems built from interacting sporadic tasks, where sporadic tasks are triggered by other tasks. If a trigger is not sent, perhaps because of an incorrectly written conditional, the system may halt as all processes wait for some other process to trigger them. Even though the result resembles a deadlock, a system doing nothing, the cause of a slumber is laziness and incorrect propagation of deadlines rather than a circular wait.

3. Design Patterns

By essentially restricting the way a language is used, design patterns can make a program more readable and help avoid common errors [22]. The goals of the design patterns for Toc presented in this paper are:
Figure 3. Diagram symbols for design patterns. An arrow points in the direction of deadline propagation, not data flow.
Figure 4. A task driving a task distorts timing.
1. To be able to reason about the absence of deadlocks in a system.
2. To allow for schedulability analysis.
3. To provide a flexible way of implementing the desired real-time behaviour of a system.

The design patterns should be designed in such a way that knowledge of a program at the pattern level is sufficient for both deadlock and schedulability analysis, so that the analysis does not need to take further details of the source code into account. The design patterns are compatible with the client-server paradigm, so that it can be used to avoid deadlocks. Schedulability analysis is done by analyzing the timing response of each pattern separately, and then using the deadline propagation graph of the system at the pattern level to combine these timing responses into sufficient conditions for schedulability.

3.1. The Task

A task is any process with a TIME construct, and is depicted with a double circle as shown in Figure 3. All functionality of the system must be within the responsibility of a task. Tasks are not allowed to communicate directly with one another, because this can distort timing. This is illustrated by the two tasks in Figure 4: T1 has a deadline and period of 10 ms and T2 of 20 ms, which means that T1 wants to communicate over the channel once in every 10 ms period, while T2 can only communicate half as often. A task’s minimum period is absolute, so T1, which was intended to be a 10 ms period task, will often have a period of 20 ms, missing its deadline while stalled waiting for T2. Moreover, T2 will often execute because it is driven by T1 and not because of its own deadline, in which case it is given a higher scheduling priority than its specification indicates. The result of direct communication in this case is that two simple, independently specified timing requirements become both dependent and distorted.

3.2. The Passive Server

A process is passive if it has no deadline of its own.
Thus a passive server is driven exclusively by deadlines inherited from clients through its access channels. A passive server is depicted as a single circle, and a communication between a task and a server is shown as an arrow
from the driving client to the server. This is usually, but not necessarily, the same as the direction of data flow. The protocol for an access to a server may involve one or more communications. The first communication is called the request, and any later communications are called replies. To follow the client-server paradigm, if the protocol involves more than one communication, the client may not communicate with other processes between the request and the last reply, and the server may not accept requests from other clients in this period. The server is, however, allowed to perform communications with other servers as a client, propagating the deadline further on. The possibility of preemption means that the temporal behaviour of tasks using the same server is not entirely decoupled: if a task that uses a server preempts another task holding the same server, then the server may not be in a state where it can immediately handle the new request. In this case, the deadline of the new request must also drive the handling of the old request, adding to the execution requirements of the earlier-deadline task. This is equivalent to priority inversion in preemptive, priority-based systems. The preempted task executes its critical section with the earlier task’s deadline, resolving the situation in a manner analogous to priority inheritance.

3.3. Sporadic Tasks and Events

A passive server is useful for sharing data between tasks, but because a server executes with the client’s deadline, a different approach is needed to start a sporadic task that is to execute with its own deadline. A communication that starts a new task will be called a trigger; the process transmitting the trigger is called the source and the sporadic task being triggered is called the target. A simple way to program a sporadic task would be to have a TIME construct follow in sequence after a channel communication.
When another task drives communication on the channel, the TIME construct will be discovered and the sporadic task will be started. However, this approach is problematic, for reasons similar to why tasks should not communicate directly. A target sporadic task that has been started will not accept another trigger before its deadline has elapsed; i.e., if the target task has a relative deadline D, then the source must wait at least D between triggers to avoid being stalled. If a new trigger arrives too early, the source will drive the target and eventually stall. This illustrates the need for events that can be used to start a sporadic task without the side effects of stalling the source or driving the target. Events must appear synchronous when the target is ready and asynchronous when it is not: if the target is waiting for a trigger it must be triggered synchronously by an incoming event, but if the target is busy then incoming events must not require synchronization. The event pattern is a way of implementing this special form of communication in Toc. An event pattern consists of an event process between the source and the target, where the communication between the event process and the target must follow certain rules. An event process is depicted in diagrams as an arrow through a square box. An event process forwards a trigger from the source to the target when the target is ready, and stops it from being forwarded when the target is busy. The event process must therefore know the state of the target. After a trigger is sent to the target, the target can be assumed to be busy, but the target must explicitly communicate to the event process when it becomes ready. Together, the event and target processes must follow two rules: 1. After the target notifies the event process that it is ready, the target cannot communicate with other processes before it is triggered by a new event. 2.
A trigger from a source should never drive the target to become ready. This implies that sending a trigger will never stall the source.
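The two rules above can be illustrated with a minimal event-process model. The Python sketch below is an illustrative state machine, not Toc semantics; it implements one of the possible designs mentioned later (discarding triggers while the target is busy), and the class and method names are assumptions:

```python
class EventProcess:
    """Minimal model of an event process between a source and a target.

    Forwards a trigger only when the target has reported that it is
    ready; triggers arriving while the target is busy are discarded
    (buffering them would be another valid design).
    """
    def __init__(self):
        self.target_ready = False
        self.forwarded = []  # triggers actually delivered to the target

    def notify_ready(self):
        # Rule 1: after this, the target waits only for the next trigger.
        self.target_ready = True

    def trigger(self, event):
        # Rule 2: a trigger never drives the target to become ready; if
        # the target is busy the trigger is absorbed, so the source
        # continues without stalling.
        if self.target_ready:
            self.forwarded.append(event)
            self.target_ready = False  # target is now busy
        # else: discard silently

ev = EventProcess()
ev.trigger("t0")     # target not ready: discarded
ev.notify_ready()
ev.trigger("t1")     # delivered
ev.trigger("t2")     # target busy again: discarded
print(ev.forwarded)  # ['t1']
```

Note that `trigger` never blocks and never executes target code, which is exactly the decoupling the rules are designed to guarantee.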
Because of these rules, a deadline from the source will never propagate beyond the target, so the event process is never involved in a circular propagation of a deadline. This in turn means that an event can never be the source of a deadlock in the super-system, even if communications with the event process make the super-system cyclic. It is therefore safe to disregard events when using the client-server paradigm to avoid deadlocks, or equivalently, to model an event as a passive server to both the source and the target. This sets an event apart from other processes, which is emphasized by the use of a square diagram symbol instead of a circle. There are many possible designs for an event process: it may discard the input when the target is not ready, or it may buffer the input. It may have multiple inputs or outputs, and it may also hold an internal state that determines which events to output.

4. Schedulability Analysis

Traditionally, threads execute concurrently and lock resources when required, so that other threads are blocked when trying to access those resources. In Toc, on the other hand, the earliest-deadline task is always driving execution, but the code being executed may belong to other tasks. A task is never blocked. This means that the schedulability analysis can assume that tasks are independent, but the WCET for each task must then include the execution time of all the parts of dependent processes it may also have to execute. All sporadic tasks will here be assumed to execute continuously with their minimum periods. This is a conservative assumption that enables schedulability analysis purely at the pattern level, without having to analyze actual source code to find out which sporadic tasks may be active at the same time. To find the worst-case execution time, each task will also be assumed to be preempted by all other tasks as many times as possible.
The analysis will also be simplified by assuming that D = T for all tasks, which in code terms means that there are no nested TIME constructs. Because no assumption can be made about the relative start-times of tasks in Toc, this restriction is necessary to make the schedulability analysis problem solvable in polynomial time. In this setting, the schedulability criterion is simply the standard EDF criterion given in Equation 1. The relative deadlines Di = Ti are read directly from the code, so the main problem of the schedulability analysis is to find Ci: the worst-case execution time of each task, including execution of other tasks due to deadline propagation. This quantity is made up of three components:

1. The base WCET when neither preempting nor being preempted.
2. The WCET penalty of being preempted by others.
3. The WCET penalty of preempting other processes.

Dependencies between processes will be analyzed using the deadline-propagation graph of the system. An example graph is given in Figure 5.

4.1. WCET of a Server or Event

The execution time of a server is complex due to preemption. If a server protocol involves at least one reply, then a client waits for the server in one of two phases: waiting for the server to accept its request, and waiting for the server up to the last reply. In the first phase the server is ready to accept any client, and may switch clients if preempted, but in the second phase it will complete the communication with the active client even if preempted (executing with the deadline of the preempting task). If the server handles the request with an extended rendezvous, then this is equivalent to a second phase even if the server has no reply, as the server cannot switch clients during an extended rendezvous.
Figure 5. Example deadline propagation graph with tasks and servers.
Figure 6. Timing of a passive server.
The WCETs for the two phases of a protocol with one reply are illustrated in Figure 6. For a server s, the WCET of the server in the first phase is denoted Cs,request and consists of lazy post-processing of the previous access or initialization of the server. The WCET of the server in the second phase is denoted Cs,reply, which is zero if there is no second phase. Cs,reply consists in part of the WCET of code local to the process of the server, Ĉs,reply, but also of execution used by other servers accessed by s. If acc_s is the set of servers accessed directly by s, then

C_{s,reply} = \hat{C}_{s,reply} + \sum_{s' \in acc_s} \left( C_{s',request} + C_{s',reply} \right) \qquad (2)
This is a recursive formula over the set of servers. As long as the deadline-propagation network is acyclic, the recursion will always terminate, as there will always be a server at the bottom for which acc_s = Ø. It is assumed that each server accesses another server directly only once for each of its own requests, but the calculations can easily be adjusted to accommodate multiple accesses by simply adding the WCET of those servers multiple times. For a server s, the maximum WCET of any client in the client process between the request and the last reply is important when calculating the effect of preemptions, and is denoted Cs,client. This quantity is also zero if there are no replies.
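The recursion of Equation 2 over an acyclic server-access graph can be written down directly. The following Python sketch uses illustrative server names and costs (the data layout is an assumption for the example):

```python
def c_reply(server, local_reply, request, accesses, _memo=None):
    """Recursive WCET of a server's second phase (Equation 2).

    local_reply[s]: WCET of code local to server s's reply phase (C-hat).
    request[s]:     WCET of server s's first (request) phase.
    accesses[s]:    servers accessed directly by s (acc_s).
    Terminates because the access graph is assumed to be acyclic.
    """
    if _memo is None:
        _memo = {}
    if server not in _memo:
        _memo[server] = local_reply[server] + sum(
            request[s2] + c_reply(s2, local_reply, request, accesses, _memo)
            for s2 in accesses[server]
        )
    return _memo[server]

local_reply = {"s1": 3, "s2": 1, "s3": 2}
request = {"s1": 1, "s2": 1, "s3": 1}
accesses = {"s1": ["s2", "s3"], "s2": ["s3"], "s3": []}
# s3: 2;  s2: 1 + (1 + 2) = 4;  s1: 3 + (1 + 4) + (1 + 2) = 11
print(c_reply("s1", local_reply, request, accesses))  # 11
```

The base case is a server with acc_s = Ø, as noted above; memoization merely avoids recomputing shared servers.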
From a temporal point of view, an event process behaves like a server to both the source and the target. As with a server, a source that sends a trigger to an event may, up to a certain point, be preempted by another source of that event. The same applies to a target when it notifies the event that it is ready. When doing the schedulability analysis one can therefore treat an event as a server.

4.2. Schedulability Analysis of the System

A client can be preempted by multiple other clients, and by each of these clients multiple times. As seen in the source code in Figure 6, a server cannot change clients while in phase two. Therefore, if a client is preempted in the second phase, its work will be driven to completion with the deadline of the preempting task. In contrast, if a client is preempted in the first phase, its work is in effect given away to the preempting task, and it will have to execute Cs,request over again to make the server ready after the preempting process has finished. The worst case is that the client has executed almost all of Cs,request every time it is preempted, in which case each preemption leads to another Cs,request of work for the preempted client. From the preempting client’s point of view, preempting a task in its first phase in a server may save up to Cs,request of execution, because of the work already done by the task being preempted. On the other hand, if it preempts in the second phase, it will have to drive the current client to finish before the server becomes ready. This may also include execution of code in the process of the preempted client. The worst case is that the current client has just begun the second phase, in which case preempting it leads to Cs,client + Cs,reply of extra work for the preempting client. For any two tasks A and B, where DA < DB, the deadline propagation graph can be used to identify the servers in the network where A will have to drive B if A preempts B.
Such a server is called a critical server for the pair of tasks (A, B).

Definition 2 (Critical Server) A critical server for the pair of tasks (A, B) is a server that can be accessed directly or indirectly by both A and B, but not only through the same server.

If a server s2 is accessible by two tasks only through server s1, then preemption must necessarily happen in s1, because both tasks cannot hold s1 at the same time, which is necessary to access s2. In Figure 5, servers 1, 2 and 3 are critical servers for (A, B), while server 3 is the only critical server for both (A, C) and (B, C). The set of critical servers for any two tasks A and B is denoted crit(A, B). If task A is to preempt task B, then A must start after B and have an earlier absolute deadline. For tasks A and B, where DA < DB, this means that A can preempt B at most

\left\lfloor \frac{D_B}{D_A} \right\rfloor \qquad (3)

times for each instance of B under EDF. In other words, Equation 3 is the number of consecutive instances of A that fit completely within a single instance of B. Note that a single instance of A can preempt B at most once. ĈX is the execution time required by code local to X; the max function returns the largest numerical value of a set. T is the set of tasks. With these definitions, CA can be written as:
C_A = \hat{C}_A + \sum_{s \in acc_A} \left( C_{s,request} + C_{s,reply} \right) + \sum_{X \in T,\; D_X
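The preemption bound of Equation 3, and the EDF criterion of Equation 1 that serves as the final schedulability test, can be sketched in Python. This covers only these two pieces, not the full computation of CA; the task parameters are illustrative:

```python
def max_preemptions(d_a, d_b):
    """Equation 3: with D_A < D_B, task A can preempt a single instance
    of B at most floor(D_B / D_A) times under EDF -- the number of
    consecutive instances of A that fit completely within one instance of B."""
    assert d_a < d_b
    return d_b // d_a

def edf_schedulable(tasks):
    """Equation 1 with D = T: schedulable iff sum(C_i / T_i) <= 1.

    tasks: list of (C, T) pairs, where C is the (inflated) WCET and
    T the period of each task."""
    return sum(c / t for c, t in tasks) <= 1.0

print(max_preemptions(10, 35))              # 3 instances of A fit in one B
print(edf_schedulable([(2, 10), (5, 20)]))  # utilization 0.45 -> True
```

In a full analysis the C values passed to `edf_schedulable` would be the inflated WCETs obtained from the expression for CA above, with the preemption counts supplied by `max_preemptions`.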