Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2860
Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo
Daniel Geist Enrico Tronci (Eds.)
Correct Hardware Design and Verification Methods 12th IFIP WG 10.5 Advanced Research Working Conference, CHARME 2003 L’Aquila, Italy, October 21-24, 2003 Proceedings
Series Editors

Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors

Daniel Geist
Haifa University, IBM Haifa Research Lab
Mount Carmel, Haifa, Israel
E-mail:
[email protected]

Enrico Tronci
University of Rome "La Sapienza", Computer Science Department
Via Salaria 113, 00198 Rome, Italy
E-mail:
[email protected]

Cataloging-in-Publication Data applied for

A catalog record for this book is available from the Library of Congress.

Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliographie; detailed bibliographic data is available on the Internet.
CR Subject Classification (1998): F.3.1, B, D.2.4, F.4.1, I.2.3, J.6

ISSN 0302-9743
ISBN 3-540-20363-X Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York, a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.de

© 2003 IFIP International Federation for Information Processing, Hofstrasse 3, A-2361 Laxenburg, Austria
Printed in Germany

Typesetting: Camera-ready by author, data conversion by PTP-Berlin GmbH
Printed on acid-free paper    SPIN: 10966464    06/3142    5 4 3 2 1 0
Preface
This volume contains the proceedings of CHARME 2003, the 12th Advanced Research Working Conference on Correct Hardware Design and Verification Methods. CHARME 2003 continues the series of working conferences devoted to the development and use of leading-edge formal techniques and tools for the design and verification of hardware and hardware-like systems. Previous events in the ‘CHARME’ series were held in Edinburgh (2001), Bad Herrenalb (1999), Montreal (1997), Frankfurt (1995), Arles (1993) and Turin (1991). This series of meetings was organized in cooperation with IFIP WG 10.5 and 10.2. Prior meetings, stretching back to the earliest days of formal hardware verification, were held under various names in Miami (1990), Leuven (1989), Glasgow (1988), Grenoble (1986), Edinburgh (1985) and Darmstadt (1984). We now have a well-established convention whereby the European CHARME conference alternates with its biennial counterpart, the International Conference on Formal Methods in Computer-Aided Design (FMCAD), which is held in even-numbered years in the USA.

CHARME 2003 took place during 21–24 October 2003 at the Computer Science Department of the University of L’Aquila, Italy. It was cosponsored by the IFIP TC10/WG10 Working Group on Design and Engineering of Electronic Systems. The CHARME 2003 scientific program comprised:

– A morning Tutorial by Daniel Geist aimed at industrial and academic interchange.
– Two Invited Lectures by Wolfgang Roesner and Fabio Somenzi.
– Regular Sessions, featuring 24 papers selected out of 65 submissions, ranging from foundational contributions to tool presentations.
– Short Presentations, featuring 8 short contributions each accompanied by a short presentation.

The conference, of course, also included informal tool demonstrations, not announced in the official program.

The topics in 2003 represented a change in the traditional conference repertoire. The motivation for this change was the general feeling that the tools and methodologies of the last decade have run their course. Specifically, hardware design today is driven to be specified at a higher level of abstraction, with the advent of design languages such as SystemC and SystemVerilog. This stems from the fact that there is a definite crisis in our ability to harness the silicon that can today be manufactured on a single chip. The distinction between software and hardware is also getting blurry, since the architectures of systems-on-chips (SOCs) do not always determine up front what part of the chip’s functionality should be implemented in hardware and what part should be implemented in software as embedded code (firmware).
This situation of large silicon real estate raises many questions, and there are currently very few answers. It is up to the CHARME community to pioneer new directions in which the silicon industry should head in order to sustain the great success it has had in recent times. Our choice was to emphasize modelling and software in this conference. We hope that these will turn out to be the right choices, but only time will tell if we were right.

We are very grateful to the program committee and to all the referees for their assistance in selecting the conference papers. Warm recognition is due to Giuseppe Della Penna, Benedetto Intrigila and Igor Melatti for taking care of the CHARME 2003 organization. Special thanks are due to Giuseppe Della Penna for the CHARME 2003 Web site, flier and poster design, as well as for taking care of too many aspects of the CHARME 2003 organization to mention them all. IBM Labs in Haifa took care of printing and mailing CHARME 2003 fliers. We are grateful to Ms. Tamar Yogev for assisting us in this effort.

The organizers are very grateful to IBM, Intel, the University of L’Aquila, and Regione Abruzzo, whose sponsorship made a significant contribution to financing the event. Warm recognition is due to the technical support team: Markus Bajohr at the University of Dortmund, together with Martin Karusseit of METAFrame Technologies, provided invaluable assistance to all the people using the online service during the crucial months preceding the conference. Finally, we are grateful to Ms. Anna Kramer and to all the Springer LNCS editorial team for their first-class support during the preparation of this volume.
October 2003
Daniel Geist and Enrico Tronci
Organization
CHARME 2003 was organized by the Department of Computer Science, University of L’Aquila.
Executive Committee

Conference Chair: Enrico Tronci (University of Rome, Italy)
Program Chair: Daniel Geist (IBM, Israel)
Organizing Chair: Benedetto Intrigila (University of L’Aquila, Italy)
Publicity Chairs: Giuseppe Della Penna (University of L’Aquila, Italy) and Igor Melatti (University of L’Aquila, Italy)
Program Committee

Alan Hu (British Columbia), Alan Mycroft (Cambridge), Anna Slobodova (Intel), Armin Biere (Swiss F.I. of Tech.), Byron Cook (Microsoft), Carl Pixley (Synopsys), Daniel Geist (IBM), Dominique Borrione (Grenoble), Eli Singerman (Intel), Enrico Tronci (Rome), Ganesh Gopalakrishnan (Utah), Hans Eveking (Darmstadt), John O’Leary (Intel), Ken McMillan (Cadence), Laurence Pierre (Marseille), Limor Fix (Intel), Mark Aagaard (Waterloo), Mary Sheeran (Chalmers), Moshe Vardi (Rice), Ofer Strichman (Carnegie-Mellon), Steve Johnson (Indiana), Thomas Kropf (Bosch), Tiziana Margaria (Dortmund), Tom Melham (Oxford), Warren Hunt (Texas at Austin)
Referees

M. Aagaard, G. Al Sammane, S. Ben-David, A. Biere, D. Borrione, M. Boubekeur, K. Claessen, B. Cook, E. Dumitrescu, N. Eén, H. Eveking, D. Fisman, L. Fix, R. Fraer, D. Geist, R. Gerth, L. Gluhovsky, G. Gopalakrishnan, A. Gupta, J. Harrison, A. Hu, W. Hunt, S. Johnson, R. Jones, G. Kamhi, S. Keidar, D. Kroening, T. Kropf, T. Margaria, K. McMillan, T. Melham, M. Müller-Olm, M. Moulin, A. Mycroft, Z. Nevo, O. Niese, N. Narasimhan, J. O’Leary, J. Ouaknine, L. Pierre, C. Pixley, I. Rabinovitz, O. Rüthing, J. Schmaltz, O. Shacham, M. Sheeran, E. Singerman, A. Slobodova, B. Steffen, O. Strichman, M. Theobald, E. Tronci, M. Vardi, J. Yang, K. Yorav, E. Zarpas

Sponsoring Institutions

University of L’Aquila
Regione Abruzzo
IBM Corporation
Intel Corporation
Table of Contents
Invited Talks

What Is beyond the RTL Horizon for Microprocessor and System Design? . . . . . . 1
Wolfgang Roesner

The Charme of Abstract Entities . . . . . . 2
Fabio Somenzi
Tutorial

The PSL/Sugar Specification Language: A Language for all Seasons . . . . . . 3
Daniel Geist
Software Verification

Finding Regularity: Describing and Analysing Circuits That Are Not Quite Regular . . . . . . 4
Mary Sheeran

Predicate Abstraction with Minimum Predicates . . . . . . 19
Sagar Chaki, Edmund Clarke, Alex Groce, Ofer Strichman

Efficient Symbolic Model Checking of Software Using Partial Disjunctive Partitioning . . . . . . 35
Sharon Barner, Ishai Rabinovitz
Processor Verification

Instantiating Uninterpreted Functional Units and Memory System: Functional Verification of the VAMP . . . . . . 51
Sven Beyer, Chris Jacobi, Daniel Kröning, Dirk Leinenbach, Wolfgang J. Paul

A Hazards-Based Correctness Statement for Pipelined Circuits . . . . . . 66
Mark D. Aagaard

Analyzing the Intel Itanium Memory Ordering Rules Using Logic Programming and SAT . . . . . . 81
Yue Yang, Ganesh Gopalakrishnan, Gary Lindstrom, Konrad Slind
Automata Based Methods

On Complementing Nondeterministic Büchi Automata . . . . . . 96
Sankar Gurumurthy, Orna Kupferman, Fabio Somenzi, Moshe Y. Vardi
Coverage Metrics for Formal Verification . . . . . . 111
Hana Chockler, Orna Kupferman, Moshe Y. Vardi

“More Deterministic” vs. “Smaller” Büchi Automata for Efficient LTL Model Checking . . . . . . 126
Roberto Sebastiani, Stefano Tonetta
Short Papers 1

An Optimized Symbolic Bounded Model Checking Engine . . . . . . 141
Rachel Tzoref, Mark Matusevich, Eli Berger, Ilan Beer

Constrained Symbolic Simulation with Mathematica and ACL2 . . . . . . 150
Ghiath Al Sammane, Diana Toma, Julien Schmaltz, Pierre Ostier, Dominique Borrione

Semi-formal Verification of Memory Systems by Symbolic Simulation . . . . . . 158
Husam Abu-Haimed, Sergey Berezin, David L. Dill

CTL May Be Ambiguous When Model Checking Moore Machines . . . . . . 164
Cédric Roux, Emmanuelle Encrenaz
Specification Methods

Reasoning about GSTE Assertion Graphs . . . . . . 170
Alan J. Hu, Jeremy Casas, Jin Yang

Towards Diagrammability and Efficiency in Event Sequence Languages . . . . . . 185
Kathi Fisler

Executing the Formal Semantics of the Accellera Property Specification Language by Mechanised Theorem Proving . . . . . . 200
Mike Gordon, Joe Hurd, Konrad Slind
Protocol Verification

On Combining Symmetry Reduction and Symbolic Representation for Efficient Model Checking . . . . . . 216
E. Allen Emerson, Thomas Wahl
On the Correctness of an Intrusion-Tolerant Group Communication Protocol . . . . . . 231
Mohamed Layouni, Jozef Hooman, Sofiène Tahar

Exact and Efficient Verification of Parameterized Cache Coherence Protocols . . . . . . 247
E. Allen Emerson, Vineet Kahlon
Short Papers 2

Design and Implementation of an Abstract Interpreter for VHDL . . . . . . 263
Charles Hymans

A Programming Language Based Analysis of Operand Forwarding . . . . . . 270
Lennart Beringer

Integrating RAM and Disk Based Verification within the Murϕ Verifier . . . . . . 277
Giuseppe Della Penna, Benedetto Intrigila, Igor Melatti, Enrico Tronci, Marisa Venturini Zilli

Design and Verification of CoreConnect™ IP Using Esterel . . . . . . 283
Satnam Singh
Theorem Proving

Inductive Assertions and Operational Semantics . . . . . . 289
J Strother Moore

A Compositional Theory of Refinement for Branching Time . . . . . . 304
Panagiotis Manolios

Linear and Nonlinear Arithmetic in ACL2 . . . . . . 319
Warren A. Hunt, Jr., Robert Bellarmine Krug, J Moore
Bounded Model Checking

Efficient Distributed SAT and SAT-Based Distributed Bounded Model Checking . . . . . . 334
Malay K Ganai, Aarti Gupta, Zijiang Yang, Pranav Ashar

Convergence Testing in Term-Level Bounded Model Checking . . . . . . 348
Randal E. Bryant, Shuvendu K. Lahiri, Sanjit A. Seshia

The ROBDD Size of Simple CNF Formulas . . . . . . 363
Michael Langberg, Amir Pnueli, Yoav Rodeh
Model Checking and Application

Efficient Hybrid Reachability Analysis for Asynchronous Concurrent Systems . . . . . . 378
Enric Pastor, Marco A. Peña

Finite Horizon Analysis of Markov Chains with the Murϕ Verifier . . . . . . 394
Giuseppe Della Penna, Benedetto Intrigila, Igor Melatti, Enrico Tronci, Marisa Venturini Zilli

Improved Symbolic Verification Using Partitioning Techniques . . . . . . 410
Subramanian Iyer, Debashis Sahoo, Christian Stangier, Amit Narayan, Jawahar Jain
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
What Is beyond the RTL Horizon for Microprocessor and System Design?

Wolfgang Roesner
IBM Server Group, 11400 Burnet Road, Austin, Texas 78758, USA
[email protected]

Abstract. The current state of hardware logic design and verification is discussed based on the project flow used for IBM’s Power4 and Power5 projects. The frequency and power requirements for these high-end chips constrain the logic design to a detailed RT-level in order to control physical effects. On the other hand, the complexity of the designs, which embrace many speculative mechanisms to push functional performance to higher levels, forces an early specification of the microarchitecture with a high-level model. A review of how high-level modeling has advanced is based on a discussion of which mechanisms of abstraction raise the specification above the RT-level. A critique of specification language design leads to an appeal to the formal verification community to focus efforts on the front-end of the high-level design process, to help shape modeling languages with formally defined semantics that avoid the mistakes made in the past with ad-hoc language designs.
The Charme of Abstract Entities

Fabio Somenzi
University of Colorado at Boulder
[email protected]

Abstract. Abstraction is fundamental in combating the state explosion problem in model checking. Automatic techniques have been developed that eliminate presumed irrelevant detail from a model and then refine the abstraction until it is accurate enough to prove the given property. This abstraction refinement approach, initially proposed by Kurshan, has received great impetus from the use of efficient satisfiability solvers in the check for the existence of error traces in the concrete model. Today it is widely applied to the verification of both hardware and software. For complex proofs, the challenge is to keep the abstract model small while carrying out most of the work on it. We review and contrast several refinement techniques that have been developed with this objective. These techniques differ in aspects that range from the choice of decision procedures for the various tasks, to the recourse to syntactic or semantic approaches (e.g., “moving fence” vs. predicate abstraction), and to the analysis of bundles of error traces rather than individual ones.
Supported in part by SRC contract 2002-TJ-920.
The PSL/Sugar Specification Language: A Language for all Seasons

Daniel Geist
IBM Haifa Research Lab, Haifa University, Mount Carmel, Haifa, Israel
[email protected]

Abstract. The Accellera EDA standards body has recently approved PSL, a standard property specification language for use in assertion-based verification via simulation and formal verification tools. This language, which is based on the Sugar language from IBM, is now supported by many EDA vendors. More than 40 individuals representing over 20 companies participated in the efforts to form the PSL standard from its Sugar basis.

The tutorial comprises 2 parts. In the first part, we describe the basic principles of PSL/Sugar, focusing on the ease with which complex design behaviors may be described with concise, readable PSL/Sugar assertions that crisply capture design intent. We summarize the temporal constructs of the language, including parameterized sequences and properties, directives, and modeling capabilities. We cover the general timing model of PSL/Sugar, which transparently supports both (single- or multi-clock) synchronous and asynchronous design, and, time permitting, we explain how PSL/Sugar has been defined to ensure consistent semantics for both simulation and formal verification applications.

In the second part of the tutorial, we present several applications of PSL/Sugar, ranging from simple to advanced assertion-based verification solutions. These include use of PSL/Sugar for dynamic assertion checking and formal model checking, including support for environment modeling and assume/guarantee reasoning. Examples of commercial verification tools which support the PSL/Sugar language will also be presented. Participants in the tutorial will have an excellent opportunity to learn about both the language and its applications directly from the speaker, Dr. Danny Geist, who heads a research group in the IBM Haifa lab where Sugar was conceived.
Finding Regularity: Describing and Analysing Circuits That Are Not Quite Regular

Mary Sheeran
Chalmers University of Technology
[email protected]

Abstract. We demonstrate some simple but powerful methods that ease the problem of describing and generating circuits that exhibit a degree of regularity, but are not as beautifully regular as the text-book examples. Our motivating example is not a circuit, but a piece of C code that is widely used in graphics applications. It is a sequence of compare-and-swap operations that computes the median of 25 inputs. We use the example to illustrate a set of circuit design methods that aid in the writing of sophisticated circuit generators.
1 Introduction
In arithmetic and digital signal processing, many algorithms are well understood, and result in efficient regular circuits. The functional approach to hardware design has proved particularly well-suited to the development of such circuits [3, 10]. Here, we continue to explore this theme; this paper is not about verification, but about design methods – a valid, if under-represented, topic of the Charme conference. We emphasise the description of circuits, as we feel that ease of describing the intended circuit is a key to design productivity. The methods presented here go beyond what can be done in VHDL or C, through the use of higher order functions and polymorphism, which are features of many functional programming languages. The examples shown use Lava, a hardware design system implemented as an embedded domain specific language in the functional programming language Haskell [2].

Batcher’s classic odd even merge sorting algorithm illustrates the power and elegance of the combinator-based approach to describing complex networks:

oemerge :: Int -> ([a] -> [a]) -> [a] -> [a]
oemerge 1 s2 = s2
oemerge n s2 = ilv (oemerge (n-1) s2) ->- odds s2

oesort :: Int -> ([a] -> [a]) -> [a] -> [a]
oesort 0 s2 = id    -- the identity function
oesort n s2 = two (oesort (n-1) s2) ->- oemerge n s2
Here, ilv, for interleave, is a combinator that applies the given function to the odd and even elements of a list of inputs, to produce the odd and even elements of the output list. So, the function ilv reverse applied to the list [1..8] gives [7,8,5,6,3,4,1,2]. reverse is a Haskell function whose type is [a] -> [a]. It takes a list of elements of any type a to a list of elements of the same type. It is a polymorphic function and works at many types. Similarly, ilv has type ([a] -> [b]) -> [a] -> [b]. It takes a function from list of a to list of b and returns a function of the same type. In functional programming parlance, it is a higher order function; it takes a function and returns a function. We use polymorphic higher order functions like ilv to capture circuit interconnection patterns.

A second such function is two, which applies a function to the first n elements and to the second n elements of a 2n-length input list, so that, for instance, two reverse [1..8] is [4,3,2,1,8,7,6,5]. Serial composition is written ->-, and odds s2 applies s2 to pairs of adjacent elements of the input, but starting with the second element rather than the first.

The function oesort is parameterised both on an integer and on a two-input, two-output sorter component, s2. The integer and the s2 parameter determine the size and type of the resulting network. For instance, oesort 3 intSort2 is a circuit that sorts lists of integers of length 2^3, built from a component that sorts a 2-list of integers, intSort2.

intSort2 :: [Signal Int] -> [Signal Int]
intSort2 [x,y] = [imin (x,y), imax(x,y)]
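To make these combinators concrete, here is a minimal sketch of plain-list versions (our own illustration; the real Lava combinators work over circuit descriptions rather than ordinary lists):

-- A sketch of the connection-pattern combinators on plain Haskell lists.
-- Serial composition: feed the output of f into g.
(->-) :: ([a] -> [b]) -> ([b] -> [c]) -> [a] -> [c]
(f ->- g) xs = g (f xs)

-- Apply f to each half of the input list.
two :: ([a] -> [b]) -> [a] -> [b]
two f xs = f ys ++ f zs
  where (ys, zs) = splitAt (length xs `div` 2) xs

-- Apply f separately to the odd- and even-position elements,
-- then riffle the two results back together.
ilv :: ([a] -> [b]) -> [a] -> [b]
ilv f xs = riffle (f odds') (f evens')
  where
    odds'  = [x | (x, i) <- zip xs [0 :: Int ..], even i]
    evens' = [x | (x, i) <- zip xs [0 :: Int ..], odd i]
    riffle (a:as) (b:bs) = a : b : riffle as bs
    riffle as     bs     = as ++ bs

-- Apply the 2-sorter s2 to adjacent pairs, starting at the second element.
odds :: ([a] -> [a]) -> [a] -> [a]
odds s2 (x:xs) = x : go xs
  where go (a:b:rest) = s2 [a, b] ++ go rest
        go rest       = rest
odds _  xs     = xs

-- An ordinary 2-sorter with which to instantiate the patterns:
sort2 :: Ord a => [a] -> [a]
sort2 [x, y] = [min x y, max x y]

With these definitions, oesort 3 sort2 [3,2,1,6,5,4,0,7] evaluates to [0,1,2,3,4,5,6,7], and ilv reverse [1..8] gives the [7,8,5,6,3,4,1,2] quoted above.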
To illustrate the combinators, oesort 3 s2 is shown in figure 1. Values flow through the network from left to right, and the vertical lines are 2-sorters. The first (or leftmost) value of the input list is input along the top wire.

Fig. 1. two (oesort 2 s2) ->- oemerge 3 s2

The oesort pattern can be instantiated with many different comparator components, depending on the context in which the sorter is to be placed. The same description can be used to give bit-parallel and bit-serial implementations, simply by plugging in new comparator components. The object of study is the connection pattern from which both combinational and sequential sorters can be built.

To perform verification, we plug in a 2-sorter on bits (bitSort2) and, using the 0-1 principle [7], verify functional correctness by generating and checking a propositional formula that states that a fixed-size circuit obeys the required sorting property. The 0-1 principle states that if a network with n input lines sorts all 2^n sequences of 0s and 1s into nondecreasing order, it will sort any arbitrary sequence of n numbers into nondecreasing order. We have studied the design and analysis of sorting networks in a previous paper [3], and we use the same verification methods in this paper.

The problem that we want to address here is the fact that not all circuits are beautiful. They don’t all have a number of inputs that is a power of two, and they don’t all have such an obvious recursive structure. For example, how would we describe any 7-sorter that contains the minimal number of comparators (which is known to be 16 [7])? More generally, how do we describe circuits that are somewhat regular? Via a running example, a median circuit, we present a series of ideas for how to make more sophisticated circuit descriptions, using polymorphism and higher order functions. Shadow values and clever components are aids to writing circuit generators. Non Standard Interpretation is an old idea that we (and others) have used before. Here, we use ordinary polymorphism and components of different types, and do not rely on Haskell’s type classes (although type classes are used extensively in the Lava implementation). Finally, we needed to extend our range of combinators in order to explore a variety of solutions to the median problem. We have deliberately not used the more esoteric parts of Haskell, in the hope of making the ideas usable in other contexts.

The median example was inspired not by a circuit, but by a piece of C-code, due to Paeth, which appears in Graphics Gems I, a book of classic graphics algorithms [11]. It is a sequence of 99 compare-swap operations that arranges an array of 25 inputs so that the median element is in the middle position, and all smaller elements are at lower indices (and hence all larger are at larger indices). I first came across a transliteration of this code in reference [5], where it is claimed (informally and without justification) that this function cannot be performed in fewer than 99 comparison-swaps without further information about the input. The application area of such programs (and circuits) is median filtering of digital images, in which n by n windows of the image have their middle pixel replaced by the median pixel, thus removing white noise. A 5 by 5 kernel (as it is called) is often used, so the algorithm is of practical interest. A common approach is to actually sort the 25 pixels, using Batcher’s odd even merge sort, but in a more general variant that allows the division of the input into two parts of unequal length. That would take 138 comparators.
2 Shadow Values I
The user of Lava describes circuits by writing circuit generators. For example, in the oesort example above, the recursive description is instantiated at a particular size, and with a particular type of comparator, in order to produce a circuit. When we simulate, say, an 8-sorter on integers, what happens is that in the background a representation of the concrete circuit is created, and the simulate function walks over that representation:
Main> simulate (oesort 3 intSort2) [3,2,1,6,5,4,0,7]
[0,1,2,3,4,5,6,7]
Here, the values that flow through the circuit are of type Signal Int and are circuit level values (even though they look like integers). The component intSort2 sorts two such circuit level integers. However, the 3 that is a parameter to oesort is an ordinary Haskell integer. This is an important distinction, at least intuitively, as the Lava user must be able to tell what is a circuit description and what is a more general Haskell function. There are circuit level values (with Signal types), and there are ordinary Haskell values that are used in the generation of circuits. Once we have got to a concrete circuit in the internal netlist representation, all the ordinary Haskell values have disappeared. But in writing the Haskell code that is to be used to generate such a netlist, we can make use of ordinary Haskell values, and can make decisions about how the circuit should look, based upon them.

A common pattern is to pair a Haskell value with each circuit level value. The shadow values can control the shape of the resulting circuit. The simplest form of shadow value is just a boolean that indicates whether the corresponding wire should have any components attached to it. The Haskell function tomarked f applies f only to those inputs that are paired with True. It simply passes through those inputs that are paired with False.

Main> tomarked (map (*2)) [(1,True),(3,False),(5,True)]
[(2,True),(3,False),(10,True)]
Here, only the first and third values are doubled. We can use this idea when generating circuits. If f is a connection pattern that places instances of the component s in a particular way on n inputs, to give n outputs, we might want to get a circuit with n − i inputs by deleting the top i wires and all components attached to them. The resulting circuit will take n − i inputs. We pair each of those n − i real inputs with True, and then add i dummy inputs paired with False. Then, we can apply f (tomarked s) to the resulting marked list, secure in the knowledge that the dummy wires will never be touched. Then, we can drop the dummy wires, and all the marks, to produce n − i circuit level outputs. This is what the function cutTop i does. Similarly, cutTopBottom i j cuts i wires at the top and j on the bottom.

Note that a component that is an argument to tomarked must be flexible, in that it may be required to deal with a number of arguments that is smaller than usual, because of the presence of inputs marked with False. In our sorting example, this means that we need a component that is not just a 2-sorter, but that can also deal with one or even zero inputs. The function smallSort takes a two-input sorter and makes it flexible in this way. We will have reason to extend this function later.

smallSort s2 []    = []
smallSort s2 [a]   = [a]
smallSort s2 [a,b] = s2 [a,b]
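A possible plain-list rendering of tomarked and cutTop (an assumed reconstruction for illustration, not the Lava source):

-- Apply f only to the True-marked values; pass False-marked ones through.
tomarked :: ([a] -> [a]) -> [(a, Bool)] -> [(a, Bool)]
tomarked f xs = weave (f [a | (a, True) <- xs]) xs
  where
    weave _      []                = []
    weave fs     ((a, False) : ys) = (a, False) : weave fs ys
    weave (b:fs) ((_, True)  : ys) = (b, True)  : weave fs ys

-- Run the pattern pat over i dummy wires plus the real inputs, then
-- drop the dummies and the marks again.  The dummy payload is
-- arbitrary: marked False, it is never touched by any component.
cutTop :: Int
       -> (([(a, Bool)] -> [(a, Bool)]) -> [(a, Bool)] -> [(a, Bool)])
       -> ([a] -> [a]) -> [a] -> [a]
cutTop i pat comp xs = [a | (a, True) <- pat (tomarked comp) padded]
  where padded = replicate i (head xs, False) ++ [(x, True) | x <- xs]

Note how tomarked hands the component only the marked values of each group, which is exactly why the component must be flexible in the number of its inputs, as smallSort arranges.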
For example, we can make a 7-sorter from an 8-input odd even merge sorter by using cutTop 1 and (oesort 3). The resulting network is shown in figure 2.
Fig. 2. cutTop 1 (oesort 3) (smallSort s2)
It is derived from the network shown in figure 1 by omitting the top wire, and the three comparators connected to it. In this instance, the resulting netlist has only 7 inputs and 7 outputs, and it no longer looks very regular. All history of how that netlist was generated using shadow booleans is forgotten at this stage.

The reader might argue that one could just use padded inputs and leave the pruning of unnecessary gates and wires to the lower level design tools. However, we find this approach more convenient and less error prone. We have found that padding makes for unreadable circuit descriptions, and can lead to the introduction of bugs. Also, we often make designs in which we first develop abstract circuits (say with integers whose representation has not yet been chosen flowing on wires). We want to be able to prune these circuits at an early stage in the design, before we are ready to produce input to lower level design tools.

Formal verification using a SAT-solver is done in the usual way [3]. (Satzoo is a SAT-solver developed by Eén here at Chalmers [6]. The function satzoo creates a file in DIMACS format that is passed to the solver, the output of which is then passed back to the Haskell interpreter.)

sortCheck n cct = satzoo (prop_doesSortsize (cct (smallSort bitSort2)) n)

Main> sortCheck 7 (cutTop 1 (oesort 3))
Satzoo: ... (t=0.0) Valid.
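For small sizes, the same check can be replayed by brute force on the plain-list sketch, without a SAT-solver. The following is our own illustration (Lava instead generates a propositional formula and hands it to the solver):

-- A Bool analogue of the Lava bitSort2 component: min is and, max is or.
bitSort2 :: [Bool] -> [Bool]
bitSort2 [x, y] = [x && y, x || y]

-- All 2^n bit vectors of length n.
allBitLists :: Int -> [[Bool]]
allBitLists 0 = [[]]
allBitLists n = [b : bs | b <- [False, True], bs <- allBitLists (n - 1)]

sorted :: Ord a => [a] -> Bool
sorted xs = and (zipWith (<=) xs (tail xs))

-- By the 0-1 principle, passing this check for all 2^n bit vectors
-- implies that the network sorts arbitrary inputs.
sorts01 :: Int -> (([Bool] -> [Bool]) -> [Bool] -> [Bool]) -> Bool
sorts01 n cct = all (sorted . cct bitSort2) (allBitLists n)

For example, sorts01 8 (oesort 3) checks an 8-sorter against all 256 bit vectors.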
Because we consider only restricted forms of networks, we choose not to prove that the networks permute their inputs. Such proofs, if required, can also be done using a SAT-solver in Lava.
3 Non-standard Interpretation
We have already seen how to verify sorting networks by using a 2-sorter on bits and the 0-1 principle. This is an example of non-standard interpretation, in which we replace the circuit components with others that are intended to gather information about the circuit. We then simulate the circuit with the new components, and suitable initialising inputs, to perform the required analysis.

To count the number of comparators in a circuit, we replace each comparator by a component that adds one to its left hand input and passes its right hand input through unchanged. Then, at the end, we sum all of the numbers appearing on the output. (This simple method works as long as all of the information-carrying wires eventually reach the output, but that is the case for all of our networks.) We simulate the resulting circuit on a list of zeros. Note that csize2 is most definitely a circuit level component, whose inputs and outputs are lists of integer signals. It is included as a first step towards the use of such functions during circuit generation, rather than, as here, during simulation. A more general count function would be a recursive function over the internal data type representing circuits.

csize2 :: [Signal Int] -> [Signal Int]
csize2 [i,j] = [plus(1,i),j]

count n cct = simulate (cct (smallSort csize2) ->- sum) (replicate n 0)

Main> count 7 (cutTop 1 (oesort 3))
16
The 7-sorter has as few comparators as possible. Circuit depth is just as easy to calculate. Again, integers flow on the wires, and the depth of the output of a comparator is one more than the integer maximum of the inputs. The 7-sorter has optimal depth (which is 6) [7].

cdepth2 :: [Signal Int] -> [Signal Int]
cdepth2 [x,y] = [m,m] where m = plus(1,imax(x,y))

depth n cct = simulate (cct (smallSort cdepth2) ->- imaximum) (replicate n 0)
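The same non-standard interpretations can be replayed on the plain-list sketch with ordinary Ints (again our illustration, not the Lava components):

-- Comparator count: charge one comparator to the left wire.
csize2' :: [Int] -> [Int]
csize2' [i, j] = [i + 1, j]

-- Depth: both outputs are one deeper than the deeper input.
cdepth2' :: [Int] -> [Int]
cdepth2' [x, y] = [m, m] where m = 1 + max x y

count' :: Int -> (([Int] -> [Int]) -> [Int] -> [Int]) -> Int
count' n cct = sum (cct csize2' (replicate n 0))

depth' :: Int -> (([Int] -> [Int]) -> [Int] -> [Int]) -> Int
depth' n cct = maximum (cct cdepth2' (replicate n 0))

For instance, count' 8 (oesort 3) and depth' 8 (oesort 3) yield 19 and 6, the familiar size and depth of Batcher's 8-input network.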
Cutting 2 wires on the top of an 8-sorter also gives a size-optimal circuit with 12 comparators. We don’t do so well when the number of inputs to the sorter is just above a power of two, rather than just below. The smallest known 9-sorters have 25 comparators, but cutting 7 wires from a 16-input odd even merge sort gives a 28-comparator sorter. Our next step is to generalise the combinators ilv and two to be multi-way rather than two-way. This leads us to a generalisation of odd even merge sort, and also broadens the range of sorters and other networks that can be described easily.
4 Generalised Combinators
Recall that two f applies f to each half of a list. Its generalisation, parI i f, applies f to each ith part of the list, so that, for instance, parI 5 f applies f to each fifth of the list. The function concat flattens a list of lists back into a list. The general version of ilv instead chops the list into i-length sublists and transposes, to give i sublists, before applying map f and then returning the list to its original order.

parI i f = chopinto i ->- map f ->- concat

ilvI i f = chop i ->- transpose ->- map f ->- transpose ->- concat
Armed with these new combinators, we can generalise oesort, provided we can figure out what odds should become. Well, odds s2 sorts an almost-sorted list. It is able to sort the list by comparing only adjacent elements, and it compares only those elements that have not already been compared. For i = 3, it turns out that the new pattern, which we will call fmerge i, should compare elements a distance two apart, and then adjacent elements, while refraining from comparing elements whose relation is already known. In general, fmerge i should first compare elements a distance i − 1 apart, then i − 2 and so on, down to 1. The function dist i k ss applies ss to elements a distance k apart, but avoids comparing elements in each i-length sublist.

fmerge i ss = compose [dist i k ss | k <- [i-1, i-2 .. 1]]

oemergeI i 1 ss = id
oemergeI i n ss = ilvI i (oemergeI i (n-1) ss) ->- fmerge i ss

oesortI i 0 ss = id
oesortI i n ss = parI i (oesortI i (n-1) ss) ->- oemergeI i n ss

Think of the second parameter to oesortI as the number of dimensions. The instance oesortI i j sorts a list of length i to the power of j. The i parameter, the size of each dimension, must be odd, although 2 works as a special case (and gives Batcher’s odd even merge sort shown earlier). For larger even-length dimensions, some extra comparators are needed, but we will not pursue this topic here.

Now, if we are to use this general sorting algorithm for i greater than 2, we must be able to make sorting components (for use as the ss parameter) for more than two inputs. To do this, we extend the function smallSort that was introduced earlier. The 3-sorter is made from three comparators, and is completely standard. The 4- and 5-sorters are made from oesort (and are optimal in both size and depth). Larger sized sorters are easily included in a similar way, and it may then make sense to change the style of the definition to a case analysis on the length of the input.

sort3l s2 [x,y,z] = [a,b,c]
  where [x1,y1] = s2 [x,y]
        [y2,c]  = s2 [y1,z]
        [a,b]   = s2 [x1,y2]

smallSort s2 []          = []
smallSort s2 [a]         = [a]
smallSort s2 [a,b]       = s2 [a,b]
smallSort s2 [a,b,c]     = sort3l s2 [a,b,c]
smallSort s2 [a,b,c,d]   = oesort 2 s2 [a,b,c,d]
smallSort s2 [a,b,c,d,e] = cutTop 3 (oesort 3) (smallSort s2) [a,b,c,d,e]
If we restrict oesortI to two dimensions, we get the sorting algorithm proposed by Kolte et al. [8] from Motorola. In that case, the rows and columns of the i × i grid are first sorted, and then the call of fmerge i sorts all the diagonal lines, starting with the main diagonals. What we add here is both a much more streamlined verification process and the generalisation to more than two dimensions. The paper by Kolte et al. proposes an elaborate scheme for testing the proposed sorting network, but the use of a SAT-solver and the 0-1 principle is a much easier option. On the other hand, the Motorola paper develops software for a complete median filter that gives impressive performance on a particular architecture. It would be very interesting to develop an efficient median filter on an FPGA and compare its performance with more standard implementations. That is future work.

Using 3 dimensions, for example, we can quickly analyse a 27-sorter (made from 3- and 2-sorters) to find that it has depth 20 and size 154. This is one comparator smaller (though considerably deeper) than the general two-way odd even merge. We will make use of oesortI 3 3 later, when constructing the 25-median circuit. Further discussion of the algorithm oesortI is beyond the scope of this paper. We believe that fmerge could be improved for larger dimension sizes, and Van Voorhis’ work shows how to deal with even-length dimensions [12]. Independent of the example, we are pleased with the simplicity of the generalised combinators. They give the user access to a broader range of connection patterns, without the need to learn many new combinators.

Now, we return to the 25-median problem. To solve it, we need to use more complicated shadow values than those that we have seen so far. We aim to keep only those parts of a sorter that contribute to arranging the outputs of the median circuit into an order that satisfies the specification.
5 Shadow Values II
We saw in section 3 that we can gather information about an instantiated circuit by simulating it using specially designed circuit level components like csize2. Here, we use similar ideas, but in the world of shadow values. Shadow values have so far been unchanging Boolean values. Now, we make them more dynamic and more complicated. The idea is to use shadow values to record information about the circuit so far, allowing decisions to be made about how the rest of the circuit should look. For the median example, what we want to do is to figure out for each “wire” in the circuit whether or not it is still in the running to be the median, and so needs to be processed further. And we want to do this figuring out at circuit generation time. This is not straightforward, and requires some insights into the mathematics of sorting. We cannot go into the details here, but the reader is referred to the work of Van Voorhis to see the kinds of arguments that are required [12].

Our approach is to rewrite our sorter so that the first steps are to sort the different dimensions of the input. So, for example, a two-dimensional sorter will start by sorting the rows and columns, and a three-dimensional sorter will sort along each of the three axes. This pattern is called a butterfly network. It is straightforward to rewrite oesortI into a butterfly network of sorters followed by the rest, which we call bafterI. boesortI 3 2 is shown in Figure 3. It is essentially the same as the optimal 25-comparator 9-sorter due to Floyd [7].

Fig. 3. bflyI 3 2 (smallSort s2) ->- fmerge 3 (smallSort s2)

bflyI i 0 f = id
bflyI i n f = parI i (bflyI i (n-1) f) ->- (iter (n-1) (ilvI i) f)

boemergeI i 1 ss = id
boemergeI i n ss = ilvI i (boemergeI i (n-1) ss) ->- fmerge i ss

bafterI i 1 ss = id
bafterI i n ss = parI i (bafterI i (n-1) ss) ->- boemergeI i n ss

boesortI i n ss = bflyI i n ss ->- bafterI i n ss
The reason why we do this is that the sortedness of the different dimensions, which is the result of the initial butterfly network, remains unaffected throughout the rest of the network. Also, inside the butterfly, sorting each new dimension leaves the previously sorted dimensions still sorted. So, after the butterfly, it is easy to figure out, for a given wire, how many other wires are greater than or smaller than it. We give each wire an address that records what happened in the butterfly. So, for example, the address [2,1,2] is given to a wire that has “passed through” the top, bottom, and top of three 2-way comparators. After the butterfly, this wire is greater than or equal to the following set of wires:
[[1,1,1],[1,1,2],[2,1,1],[2,1,2]]. Similarly, in the case of 27 inputs, the address [3,1,2] is less than or equal to the addresses [[3,3,3],[3,3,2],[3,2,3],[3,2,2],[3,1,3],[3,1,2]], after a butterfly of 3-sorters. Such calculations have been implemented in the functions under and overI. To calculate the list of addresses greater than a given one, one needs to know the size of the dimensions.
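The paper does not show the definitions of under and overI, but one plausible formulation (an assumption on our part) is componentwise comparison of addresses, and it reproduces both example address sets above:

type Address = [Int]

-- All addresses known, after the butterfly, to be less than or equal
-- to the given one: the componentwise-smaller addresses.
underI :: Address -> [Address]
underI = mapM (\c -> [1 .. c])

-- For greater-or-equal the dimension size is needed (all dimensions
-- of size dim here).
overI :: Int -> Address -> [Address]
overI dim = mapM (\c -> [c .. dim])

Here underI [2,1,2] yields exactly the four addresses listed above, and overI 3 [3,1,2] the six addresses of the 27-input case.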
Now, inside bafterI, on each shadow wire, we keep lists of the addresses that are over and under it. The shadow component for the 2-sorter manipulates and updates these lists, which represent sets of addresses, and so do not contain duplicates. The standard function nub removes duplicates from a list.

combs2 :: [([Address],[Address])] -> [([Address],[Address])]
combs2 [(l1,g1),(l2,g2)] = [(nub (l1++l2),g1), (l2,nub(g1++g2))]

So, the wire that “passes through” the lower part of the comparator gets a new (over, under) pair containing the union of the two input over lists, but only the lower under list. For the upper wire, the situation is dual. Then, the lengths of these lists give good information about the status of a wire, and its relation to the remaining wires.

On the input to the circuit, we provide information about the target for each wire. In our case, we place a single (shadow) integer on each wire, and the wire should be taken out of the running (in the same way as with the simple shadow Booleans that we saw earlier) once it is known to be either greater than or less than that number of other wires. The target remains unchanged, while the address lists grow longer as one moves through the network. (One could choose to use two integers for the target, which could be different for the over and under lists, but that is not necessary in the median examples shown here.) The new shadow component is combine id combs2, where

combine f g [(a,x),(b,y)] = [(fa,gx),(fb,gy)]
  where [fa,fb] = f [a,b]
        [gx,gy] = g [x,y]
Each wire has a shadow value of type (Int,([Address],[Address])), that is, a pair of an integer and a pair of lists of addresses. A wire is certain not to be the median if the number of distinct addresses that are either smaller than or greater than it is large enough. The target is set to 1 + n/2, where n is the number of inputs to the median circuit. Just after the butterfly, the address lists are all singletons containing the address of the wire to which they are attached. The function placeTargetAddressI introduces the required initial shadow values.

To be able to make use of these shadow values, we must generalise tomarked. The function onPredicate p f causes f to be applied only to those inputs for which the predicate p is true of the shadow value. Recall that the version of oesortI with the butterfly in the first columns was

boesortI i n ss = bflyI i n ss ->- bafterI i n ss
Following this definition, we define

medI i j ss = bflyI i j ss ->-
              placeTargetAddressI i j ->-
              bafterI i j (onPredicate ok (smallSort comp)) ->-
              unmark
  where comp = combine ss (combine id combs2)
        ok   = not . (notmedianI i)
We leave the butterfly alone, but transform bafterI so that it performs the calculations described above when deciding whether or not to include a comparator. The result is promising:

Main> medCheck 27 (medI 3 3)
Satzoo: ... (t=0.3) Valid.
Main> count 27 (medI 3 3)
114
Main> count 27 (boesortI 3 3)
154
We have a circuit that correctly places the median input in the middle output, and all of the smaller values to the left of it in the output list. This property is checked by the observer medCheck, whose key function is reallyMedian, which checks that a given value is larger than all of the elements of a given list, and smaller than all of the elements of another. Logical implication (written ==>) is the ordering on bits, and andl is a multi-input and gate.

reallyMedian a smaller bigger =
  andl ([s ==> a | s <- smaller] ++ [a ==> b | b <- bigger])

Predicate Abstraction with Minimum Predicates

Example 3. Let Ps = {(a == 0), (b > 5), (c < d)}, V1 = {0, 1, 1} and V2 = {1, 0, 1}. Then Γs(V1) = (¬(a == 0)) ∧ (b > 5) ∧ (c < d) and Γs(V2) = (a == 0) ∧ (¬(b > 5)) ∧ (c < d).
Computing the transitions between the states in A(Π, P) requires a theorem prover. We add a transition between two abstract states unless we can prove that there is no transition between their corresponding concrete states. If we cannot prove this, we say that the two states (or the two formulas representing them) are admissible. This problem can be reduced to the problem of deciding whether ¬(ψ1 ∧ ψ2) is valid, where ψ1 and ψ2 are arbitrary quantifier-free first-order logic formulas. In general this problem is known to be undecidable. However, for our purposes it is sufficient that the theorem prover be sound and always terminate. Several publicly available theorem provers (such as Simplify [11]) have this characteristic. Given arbitrary formulas ψ1 and ψ2, we say that the formulas are admissible if the theorem prover returns false or unknown on ¬(ψ1 ∧ ψ2). We denote this by Adm(ψ1, ψ2). Otherwise the formulas are inadmissible, denoted by ¬Adm(ψ1, ψ2).

A Procedure for Constructing A(Π, P). We now define A(Π, P). Formally, it is a triple (SA, IA, TA) where:

– SA = ∪s∈SCF {s} × Vs is the set of states.
– IA = {ICF} × VICF is the initial set of states.
– TA ⊆ SA × SA is the transition relation, defined as follows: ((s1, V1), (s2, V2)) ∈ TA iff (s1, s2) ∈ TCF and one of the following conditions holds:
  1. L(s1) is an assignment statement and Adm(Γs1(V1), WP(Γs2(V2), L(s1))).
  2. L(s1) is a branch statement with a branch condition c, L(s2) is its then successor, Adm(Γs1(V1), Γs2(V2)) and Adm(Γs1(V1), c).
  3. L(s1) is a branch statement with a branch condition c, L(s2) is its else successor, Adm(Γs1(V1), Γs2(V2)) and Adm(Γs1(V1), ¬c).
  4. L(s1) is a goto statement and Adm(Γs1(V1), Γs2(V2)).
  5. L(s1) is a return statement and s2 is the final state.

Example 4. Recall the CFA from Example 1 and the predicates corresponding to CFA nodes discussed in Example 2. The A(Π, P) obtained in this case appears in Figure 1(c). Let us see why there is a transition from (L0, ⊥) to (L1, true). Since L(L0) is an assignment statement, by rule 1 above we compute the following expressions:

– ΓL0(⊥) = true
– ΓL1(true) = (x == 1)
– L(L0) = (x = 1;)
– WP(ΓL1(true), L(L0)) = WP((x == 1), x = 1;) = (1 == 1) = true
– Adm(true, true)
Thus, we add a transition from (L0, ⊥) to (L1, true). Examining a possible transition from (L0, ⊥) to (L1, false), we similarly compute ΓL1(false) = (¬(x == 1)) and WP((¬(x == 1)), x = 1;) = (¬(1 == 1)). Since ¬Adm(true, (¬(1 == 1))), there is no transition between these two abstract states. The presence or absence of other transitions can be explained in a similar manner. As no state labeled by L4 is reachable, we have proven that our example property holds.

Clearly, if we do not limit the size of Ps, |SA| is exponential in |P|. Hence so are the worst-case space and time complexities of constructing A(Π, P).

Input: A trace τ of A(Π, P) s.t. γ(τ) = s1, . . . , sn
Output: true iff τ is valid (can be simulated on the concrete system)
Variable: X of type formula
Initialize: X := true
For i = n to 1
  If si is an assignment, X := WP(X, si)
  Else if si is a branch with condition c and (i < n):
    If si+1 is the 'then' successor of si, X := X ∧ c, else X := X ∧ ¬c
  If (X ≡ false) return false
Return true

Fig. 3. Algorithm T C to check the validity of a trace of Π.
2.2 Trace Concretization
A trace of A(Π, P) is a finite sequence (s1, V1), . . . , (sn, Vn) such that (i) for 1 ≤ i ≤ n, (si, Vi) ∈ SA, (ii) (s1, V1) ∈ IA and (iii) for 1 ≤ i < n, ((si, Vi), (si+1, Vi+1)) ∈ TA. Given such a trace τ = (s1, V1), . . . , (sn, Vn) of A(Π, P), the concretization of τ is defined as γ(τ) = L(s1), . . . , L(sn). Thus, the concretization of an abstract trace is a trace of Π: a sequence of statements that correspond to some trace in the control flow graph of Π.
2.3 Trace Checking
The T C algorithm, described in Figure 3, takes Π and a counterexample τ as inputs and returns true if γ(τ) is a valid trace of Π. This is a backward traversal based algorithm. There is an equivalent algorithm [3] that is forward traversal based and uses strongest postconditions instead of weakest preconditions.
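To make the backward traversal concrete, here is a small Haskell sketch of T C over a toy statement language. Everything here (the Expr and Stmt types, and the unsat oracle standing in for a call to a prover such as Simplify) is our own illustrative assumption, not code from the MAGIC tool.

data Expr = Lit Int | Var String | Eq Expr Expr
          | Not Expr | And Expr Expr | TrueE
  deriving (Eq, Show)

data Stmt = Assign String Expr   -- x := e
          | Branch Expr Bool     -- condition c; True if the trace took 'then'
  deriving Show

-- WP of a formula through an assignment: substitute e for x.
wp :: Expr -> Stmt -> Expr
wp f (Assign x e) = subst x e f
wp f _            = f

subst :: String -> Expr -> Expr -> Expr
subst x e f = case f of
  Var y | y == x -> e
  Eq a b         -> Eq  (subst x e a) (subst x e b)
  Not a          -> Not (subst x e a)
  And a b        -> And (subst x e a) (subst x e b)
  _              -> f

-- Fig. 3, walked backwards over the trace: conjoin branch conditions as
-- taken (recording the taken direction in Branch plays the role of the
-- i < n test in Fig. 3), and report false as soon as the accumulated
-- formula becomes unsatisfiable.
traceCheck :: (Expr -> Bool) -> [Stmt] -> Bool
traceCheck unsat stmts = go TrueE (reverse stmts)
  where
    go _ []       = True
    go x (s : rest)
      | unsat x'  = False
      | otherwise = go x' rest
      where x' = case s of
                   Assign _ _   -> wp x s
                   Branch c tkn -> And x (if tkn then c else Not c)

On the trace of Example 4's failed case, [Assign "x" (Lit 1), Branch (Eq (Var "x") (Lit 1)) False], the accumulated formula becomes ¬(1 == 1), so the check fails, mirroring the WP computation shown above.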
2.4 Checking Trace Elimination
Given a spurious counterexample τ = (s1, V1), . . . , (sn, Vn) and a set of branches P, we will need to determine if P eliminates τ. To do so we: (i) construct A(Π, P) and (ii) determine if there exists a trace τ′ of A(Π, P) such that γ(τ′) = γ(τ). The algorithm, called TraceEliminate, is described in Figure 4.¹
¹ Note that in practice this step can be carried out in an on-the-fly manner without constructing the full A(Π, P).
Input: Spurious trace τ s.t. γ(τ) = s1, . . . , sn and a set of predicates P
Output: true if τ is eliminated by P and false otherwise
Compute A(Π, P) = (SA, IA, TA)
Variable: X, Y of type subset of SA
Initialize: X := {(s, V) ∈ SA | s = s1}
If (X = ∅) return true
For i = 2 to n do
  Y := {(s′, V′) ∈ SA | (s′ = si) ∧ ∃(s, V) ∈ X . ((s, V), (s′, V′)) ∈ TA}
  If (Y = ∅) return true
  X := Y
Return false

Fig. 4. Algorithm TraceEliminate to check if a spurious trace can be eliminated.
3 Predicate Minimization
We now present the algorithm for discovering a minimal set of branches P of a program Π that will help us prove or disprove a safety property φ.
3.1 The Sample-and-Eliminate Algorithm
Algorithm Sample-and-Eliminate, described in Figure 5, is based on an abstraction refinement loop that keeps the set of predicates minimal throughout the process. It is modeled after the Sample-and-Separate algorithm [6], where it is used in a CEGAR framework for hardware verification. At each step it finds a counterexample if one exists and checks whether it corresponds to a concrete counterexample, as usual. Unlike previous approaches [3,9], however, it finds a minimal set of predicates that eliminates all the concrete spurious traces found so far (in the last line of the loop). Our approach to solving this minimization problem is the subject of Section 3.2.

Input: Program Π, safety property φ
Output: true if proved that Π |= φ, false if proved Π ⊭ φ, and unknown otherwise.
Variable: T set of spurious counterexamples, P set of predicates
Initialize: T := ∅, P := ∅
Forever do
  If MC(A(Π, P), φ) = true return true
  Else let τ be the abstract counterexample
  If T C(τ) = true return false
  If P is the set of all branches in Π then return unknown
  T := T ∪ {τ}
  P := minimal set of branches of Π that eliminates all elements of T

Fig. 5. Algorithm Sample-and-Eliminate uses a minimal set of predicates taken from a program's branches to prove or disprove Π |= φ, if such a proof is possible.
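The shape of this loop is easy to capture in a few lines of Haskell. The sketch below is our own illustration: the model checker mc, the trace checker tc and the minimizer are assumed function parameters, not MAGIC's actual interfaces.

data Result = Proved | Disproved | Unknown deriving Show

sampleAndEliminate
  :: ([pred] -> Maybe trace)    -- MC: Nothing means the property holds
  -> (trace -> Bool)            -- TC: True means the trace is concrete
  -> ([trace] -> Maybe [pred])  -- minimal eliminating set, if one exists
  -> Result
sampleAndEliminate mc tc minimize = go [] []
  where
    go preds traces =
      case mc preds of
        Nothing -> Proved
        Just t
          | tc t      -> Disproved
          | otherwise ->
              case minimize (t : traces) of
                Nothing     -> Unknown   -- even all branches fail: give up
                Just preds' -> go preds' (t : traces)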
3.2 Minimizing the Eliminating Set
The last line of Sample-and-Eliminate presents the following problem: given a set of spurious counterexamples T and a set of candidate predicates P (all the branches of Π in our case), find a minimal set p ⊆ P which eliminates all the traces in T. We present a three-step algorithm for solving this problem.

First, find a mapping T → 2^(2^P) between each trace in T and the set of sets of predicates in P that eliminate it. This can be achieved by iterating through every p ⊆ P and τ ∈ T, using TraceEliminate to determine if p can eliminate τ. This approach is exponential in |P|, but below we list several ways to reduce the number of attempted combinations:

– Limit the size or number of attempted combinations to a small constant, e.g. 5, assuming that most traces can be eliminated by a small set of predicates.
– Stop after reaching a certain size of combinations if any eliminating solutions have been found.
– Break up the control flow graph into blocks and only consider combinations of predicates within blocks (keeping combinations in other blocks fixed).
– Use data flow analysis to only consider combinations of related predicates.
– For any τ ∈ T, if a set p eliminates τ, ignore all supersets of p with respect to τ (as we are seeking a minimal solution).

Second, encode each predicate pi ∈ P with a new Boolean variable pbi. We use the terms 'predicate' and 'the Boolean encoding of the predicate' interchangeably.

Third, derive a Boolean formula σ, based on the predicate encoding, that represents all the possible combinations of predicates that eliminate the elements of T. We use the following notation in the description of σ. Let τ ∈ T be a trace:

– kτ denotes the number of sets of predicates that eliminate τ (1 ≤ kτ ≤ 2^|P|).
– s(τ, i) denotes the i-th set (1 ≤ i ≤ kτ) of predicates that eliminates τ. We use the same notation for the conjunction of the predicates in this set.

The formula σ is defined as follows:

    σ =def ∧_{τ∈T} ∨_{i=1..kτ} s(τ, i)        (1)
For any satisfying assignment to σ, the predicates whose Boolean encodings are assigned true are sufficient for eliminating all elements of T. From the various possible satisfying assignments to σ, we look for the one with the smallest number of positive assignments. This assignment represents the minimal number of predicates that are sufficient for eliminating T. Since σ includes disjunctions, it cannot be solved directly with a 0-1 ILP solver. We therefore use PBS [1], a solver for pseudo-Boolean formulas.

A pseudo-Boolean formula is of the form Σ_{i=1..n} ci · bi ∼ k, where bi is a Boolean variable and ci is a rational constant for 1 ≤ i ≤ n, k is a rational constant, and ∼ represents one of the inequality or equality relations ({<, ≤, >, ≥, =}). Each such constraint can be expanded to a CNF formula (hence the name pseudo-Boolean), but this expansion can be exponential in n. PBS does not perform this expansion, but rather uses an algorithm designed in the spirit of the Davis-Putnam-Loveland algorithm that handles these constraints directly. PBS accepts as input standard CNF formulas augmented with pseudo-Boolean constraints. Given an objective function in the form of a pseudo-Boolean formula, PBS finds an optimal solution by repeatedly tightening the constraint over the value of this function until it becomes unsatisfiable. That is, it first finds a satisfying solution and calculates the value of the objective function according to this solution. It then adds a constraint that the value of the objective function should be smaller by one. This process is repeated until the formula becomes unsatisfiable.

The objective function in our case is to minimize the number of chosen predicates (by minimizing the number of variables that are assigned true):

    min Σ_{i=1..n} pbi        (2)
Example 5. Suppose that the trace τ1 is eliminated by either {p1, p3, p5} or {p2, p5} and that the trace τ2 can be eliminated by either {p2, p3} or {p4}. The objective function is min Σ_{i=1..5} pbi and is subject to the constraint:

    σ = ((pb1 ∧ pb3 ∧ pb5) ∨ (pb2 ∧ pb5)) ∧ ((pb2 ∧ pb3) ∨ pb4)

The minimal satisfying assignment in this case is pb2 = pb5 = pb4 = true.
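For small instances, the same optimisation can be replayed by brute force in Haskell. This is our own illustration only: it enumerates subsets and is exponential, whereas PBS solves the pseudo-Boolean formulation directly.

import Data.List (nub, sortOn, subsequences)

-- One inner list per trace: the eliminating sets s(tau,1), ..., s(tau,k).
minimalChoice :: Eq p => [[[p]]] -> [p]
minimalChoice eliminatingSets =
  head [ cand
       | cand <- sortOn length (subsequences universe)
       , all (any (`isIn` cand)) eliminatingSets ]
  where
    universe   = nub (concat (concat eliminatingSets))
    s `isIn` c = all (`elem` c) s   -- s is contained in the candidate

On Example 5, minimalChoice [[[1,3,5],[2,5]], [[2,3],[4]]] returns a choice of size three; the optimum is not unique, and {p2, p5, p4} above is one of them.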
Other techniques for solving this optimization problem are possible, including minimal hitting sets and logic minimization. The PBS step, however, has not been a bottleneck in any of our experiments.
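To make the optimization concrete, the following sketch (hypothetical Python, not the authors' implementation) finds a minimum-size predicate subset satisfying σ by enumerating candidate subsets in increasing size, mirroring objective (2); a production implementation would instead hand σ and the objective to a pseudo-Boolean solver such as PBS.

from itertools import combinations

def minimize_predicates(predicates, eliminating_sets):
    """predicates: list of predicate ids.
    eliminating_sets: for each trace tau, the list s(tau, 1..k_tau),
    each a frozenset of predicates that together eliminate tau.
    Returns a minimum-size subset of predicates that contains at
    least one eliminating set for every trace, i.e. satisfies sigma."""
    for size in range(len(predicates) + 1):
        for subset in combinations(predicates, size):
            chosen = set(subset)
            # sigma: for every trace, some eliminating set is fully chosen
            if all(any(s <= chosen for s in sets_for_tau)
                   for sets_for_tau in eliminating_sets):
                return chosen
    return None  # unreachable if every trace has an eliminating set

# Example 5 from the text:
sets = [[frozenset({1, 3, 5}), frozenset({2, 5})],  # trace tau_1
        [frozenset({2, 3}), frozenset({4})]]        # trace tau_2
print(minimize_predicates([1, 2, 3, 4, 5], sets))
# {2, 3, 5}: one minimum-size solution; pb2 = pb4 = pb5 is another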
4
Experiments and Conclusions
We implemented our technique inside the MAGIC [4] tool. MAGIC was designed to check weak simulation of properties of labeled transition systems (LTSs) derived from C programs. We experimented with MAGIC with and without predicate optimization. We also performed experiments with a greedy predicate minimization strategy implemented on top of MAGIC. In each iteration, this strategy first adds predicates sufficient to eliminate the spurious counterexample to the predicate set P. Then it attempts to reduce the size of the resulting P by using the algorithm described in Figure 6. The advantage of this approach is that it requires only a small overhead (polynomial) compared to Sample-and-Eliminate, but on the other hand it does not guarantee an optimal result. Further, we performed experiments with Berkeley's BLAST [9] tool. BLAST also takes C programs as input, and uses a variation of the standard CEGAR loop based on lazy abstraction, but without minimization.
Input: Set of predicates P
Output: Subset of P that eliminates all spurious counterexamples so far
Variable: X of type set of predicates

LOOP:
    Create a random ordering p1, . . . , pk of P
    For i = 1 to k do
        X := P \ {pi}
        If X can eliminate every spurious counterexample seen so far
            P := X
            Goto LOOP
Return P

Fig. 6. Greedy predicate minimization algorithm.
Lazy abstraction refines an abstract model while allowing different degrees of abstraction in different parts of a program, without requiring recomputation of the entire abstract model in each iteration. Laziness and predicate minimization are, for the most part, orthogonal techniques. In principle a combination of the two might produce better results than either in isolation.

Benchmarks. We used two kinds of benchmarks. A small set of relatively simple benchmarks was derived from the examples supplied with the BLAST distribution and regression tests for MAGIC. The difficult benchmarks were derived from the C source code of openssl-0.9.6c, several thousand lines of code implementing the SSL protocol used for secure transfer of information over the Internet. A critical component of this protocol is the initial handshake between a server and a client. We verified different properties of the main routines that implement the handshake. The names of benchmarks derived from the server routine and the client routine begin with ssl-srvr and ssl-clnt, respectively. In all our benchmarks, the properties are satisfied by the implementation. The server and client routines have roughly 350 lines each but, as our results indicate, are non-trivial to verify.

Results. Figure 7 summarizes our results. Time for all experiments is given in seconds. All experiments were performed on an AMD Athlon XP 1600 machine with 900 MB of RAM running RedHat 7.1. The column Iter reports the number of iterations through the CEGAR loop necessary to complete the proof. Predicates are listed differently for the two tools. For BLAST, the first number is the total number of predicates discovered and used, and the second number is the number of predicates active at any one point in the program (due to lazy abstraction this may be smaller). In order to force termination we imposed a limit of three hours on the running time. We denote by '*' in the Time column examples that could not be solved in this time limit. In these cases the other columns indicate relevant measurements made at the point of forceful termination. For MAGIC, the first number is the total number of expressions used to prove the property, i.e., |⋃_{s∈S_CF} P_s|.
Program      | BLAST (Time Iter Pred Mem) | MAGIC (Time Iter Pred Mem) | MAGIC+GREEDY (Time Iter Pred Mem) | MAGIC+MINIMIZE (Time Iter Pred Mem)
funcall-nes  | 1 3 13/10 × | 5 2 10/9/1 × | 6 2 10/9/1 × | 5 2 10/9/1 ×
fun lock     | 5 7 7/7 × | 5 4 8/3/3 × | 5 5 8/3/3 × | 6 4 8/3/3 ×
driver.c     | 1 4 3/2 × | 6 5 6/2/4 × | 5 5 6/2/4 × | 5 5 6/2/4 ×
read.c       | 6 11 20/11 × | 5 2 15/5/2 × | 6 3 15/5/1 × | 5 2 15/5/1 ×
socket-y-01  | 5 13 16/6 × | 5 3 12/4/2 × | 5 3 12/4/2 × | 6 3 12/4/2 ×
opttest.c    | 7499 38 37/37 231 | 145 5 7/7/8 63 | 150 5 4/4/4 63 | 247 25 4/4/4 63
ssl-srvr-1   | 2398 16 33/8 175 | 250 12 56/5/22 43 | * 103 16/3/5 51 | 226 14 5/4/2 38
ssl-srvr-2   | 691 13 68/8 60 | 752 16 72/6/30 72 | 2106 62 8/4/3 34 | 216 14 5/4/2 38
ssl-srvr-3   | 1162 14 32/7 103 | 331 12 56/5/22 47 | * 100 22/3/7 53 | 200 12 5/4/2 38
ssl-srvr-4   | 284 11 27/5 44 | 677 14 63/6/26 72 | 8465 69 14/4/5 56 | 170 9 5/4/2 38
ssl-srvr-5   | 1804 19 52/5 71 | 71 5 22/4/8 24 | * 117 23/5/9 56 | 205 13 5/4/2 36
ssl-srvr-6   | * 39 90/10 805 | 11840 23 105/11/44 1187 | * 84 22/4/8 337 | 359 14 8/4/3 89
ssl-srvr-7   | 359 11 76/9 37 | 2575 20 94/7/38 192 | * 99 19/3/6 62 | 196 11 5/4/2 38
ssl-srvr-8   | * 25 35/5 266 | 130 8 32/5/14 58 | * 97 19/4/7 142 | 211 10 8/4/3 40
ssl-srvr-9   | 337 10 76/9 36 | 2621 15 65/8/28 183 | 8133 99 11/4/4 69 | 316 20 11/4/4 38
ssl-srvr-10  | 8289 20 35/8 148 | 561 16 75/6/30 73 | * 97 12/3/4 77 | 241 14 8/4/3 38
ssl-srvr-11  | 547 11 78/11 51 | 4014 19 89/8/36 287 | * 87 26/4/9 65 | 356 24 8/4/3 38
ssl-srvr-12  | 2434 21 80/8 120 | 7627 22 102/9/42 536 | * 122 23/4/8 180 | 301 17 8/4/3 42
ssl-srvr-13  | 608 12 79/12 54 | 3127 17 75/9/32 498 | * 106 19/4/7 69 | 436 29 11/4/4 38
ssl-srvr-14  | 10444 27 84/10 278 | 7317 22 102/9/42 721 | * 115 18/3/6 254 | 406 20 8/4/3 52
ssl-srvr-15  | * 31 38/5 436 | 615 15 81/28/5 188 | 2112 37 8/4/3 118 | 179 7 8/4/3 40
ssl-srvr-16  | * 33 87/10 480 | 3413 21 98/8/40 557 | * 103 22/3/7 405 | 356 17 8/4/3 58
ssl-clnt-1   | 348 16 28/5 43 | 110 10 43/4/18 25 | 225 27 5/4/2 20 | 156 12 5/4/2 31
ssl-clnt-2   | 523 15 28/4 52 | 156 11 53/5/20 31 | 1393 63 5/4/2 23 | 185 18 5/4/2 29
ssl-clnt-3   | 469 14 29/5 49 | 421 13 52/7/24 58 | * 136 29/4/10 28 | 195 21 5/4/2 29
ssl-clnt-4   | 380 13 27/4 45 | 125 10 35/5/18 27 | 152 29 5/4/2 20 | 191 19 5/4/2 29
TOTAL        | 81794 447 1178/221 3584 | 46904 322 1428/185/559 4942 | 163163 1775 381/102/129 2182 | 5375 356 191/107/67 880
AVERAGE      | 3146 17 45/9 171 | 1804 12 55/7/22 235 | 6276 68 15/4/5 104 | 207 14 7/4/3 42
Fig. 7. Results for BLAST and MAGIC with different refinement strategies. '*' indicates a run time longer than 3 hours. '×' indicates negligible values.
The number of predicates (the second number) may be smaller, as MAGIC combines multiple mutually exclusive expressions (e.g., x == 1, x < 1, and x > 1) into a single, possibly non-binary predicate, having a number of values equal to the number of expressions (plus one, if the expressions do not cover all possibilities). The final number for MAGIC is the size of the final P. For experiments in which memory usage was large enough to be a measure of state space size rather than overhead, we also report memory usage (in megabytes). The first MAGIC results are for the MAGIC tool operating in the standard refinement manner: in each iteration, predicates sufficient to eliminate the spurious counterexample are added to the predicate set. The second MAGIC results are for the greedy predicate minimization strategy. The last MAGIC results are for predicate minimization. Rather than solving the full optimization problem, we simplified the problem as described in Section 3. In particular, for each trace we only considered the first 1,000 combinations and only generated 20 eliminating combinations. The combinations were considered in increasing order of size. After all combinations of a particular size had been tried, we checked whether at least one eliminating combination had been found. If so, no further combinations were tried. In the smaller examples we observed no loss of optimality due to these restrictions. We also studied the effect of altering these restrictions on the larger benchmarks, and we report on our findings later.
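For concreteness, the greedy strategy of Figure 6 can be rendered as the following hypothetical Python sketch; eliminates_all stands in for MAGIC's internal check that a predicate set rules out every spurious counterexample seen so far.

import random

def greedy_minimize(P, eliminates_all):
    """Figure 6: repeatedly try dropping a single predicate; restart
    the scan whenever a drop succeeds (the 'Goto LOOP')."""
    P = set(P)
    restart = True
    while restart:
        restart = False
        order = list(P)
        random.shuffle(order)  # random ordering p1, ..., pk
        for p in order:
            X = P - {p}
            if eliminates_all(X):
                P = X
                restart = True
                break
    return P

# Toy usage: suppose any superset of {2, 4} eliminates everything.
print(greedy_minimize({1, 2, 3, 4}, lambda X: {2, 4} <= X))  # {2, 4}

As the text notes, this costs only a small (polynomial) overhead per iteration but does not guarantee an optimal result, and its behavior depends heavily on the random ordering.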
                 ssl-srvr-4                       ssl-srvr-15                      ssl-clnt-1
ELM   SUB    Time It |P| Mem TG MG       Time It |P| Mem TG MG       Time It |P| Mem TG MG
50    250    656  8  2   64  34 1        1170 15 3   72  86 1        1089 13 2   67  66 1
100   250    656  8  2   64  34 1        1169 15 3   72  86 1        1089 13 2   67  66 1
150   250    657  8  2   64  34 1        1169 15 3   72  86 1        1090 13 2   67  66 1
200   250    656  8  2   64  34 1        1170 15 3   72  86 1        1089 13 2   67  66 1
250   250    656  8  2   64  34 1        1168 15 3   72  86 1        1090 13 2   67  66 1
Fig. 8. Results for optimality. ELM = MAXELM, SUB = MAXSUB, It is the number of iterations, TG is the total number of eliminating subsets generated, and MG is the maximum size of any eliminating subset generated.
For the smaller benchmarks, the various abstraction refinement strategies do not differ markedly. However, for our larger examples, taken from the SSL source code, the refinement strategy is of considerable importance. Predicate minimization, in general, reduced verification time: though there were a few exceptions to this rule, the average running time was considerably lower than for the other techniques, even with the cutoff on the running time. Moreover, predicate minimization reduced the memory needed for verification, which is an even more important bottleneck. Given that the other techniques were in some cases cut off before verification was complete, the results are even more compelling. The greedy approach kept memory use fairly low, but almost always failed to find near-optimal predicate sets and converged much more slowly than the usual monotonic refinement or predicate minimization approaches. Further, it is not clear how much final memory usage would be improved by the greedy strategy if it were allowed to run to completion. Another major drawback of the greedy approach is its unpredictability. We observed that on any particular example, the greedy strategy might or might not complete within the time limit in different executions. Clearly, the order in which this strategy tries to eliminate predicates in each iteration is critical to its success. Given that the strategy performs poorly on most of our benchmarks using a random ordering, more sophisticated ordering techniques may perform better. We leave this issue for future research.

Optimality. We experimented with two of the parameters that affect the optimality of our predicate minimization algorithm: (i) the maximum number of examined subsets (MAXSUB) and (ii) the maximum number of eliminating subsets generated (MAXELM); that is, the procedure stops the search if MAXELM eliminating subsets were found, even if fewer than MAXSUB combinations were tried. We first kept MAXSUB fixed and took measurements for different values of MAXELM on a subset of our benchmarks, viz. ssl-srvr-4, ssl-srvr-15 and ssl-clnt-1. Our results, shown in Figure 8, clearly indicate that optimality is practically unaffected by the value of MAXELM. Next we experimented with different values of MAXSUB (the value of MAXELM was set equal to MAXSUB). The results we obtained are summarized in Figure 9. It appears that, at least for our benchmarks, increasing MAXSUB
             ssl-srvr-4                          ssl-srvr-15                         ssl-clnt-1
SUB     Time  It |P| Mem TG MT MG       Time  It |P| Mem TG MT MG       Time   It |P| Mem  TG MT MG
100     262   8  2   44  34 2  1        396   12 3   50  62 2  1        310    11 2   40   58 2  1
200     474   7  2   57  27 2  1        917   14 3   65  81 2  1        683    12 2   51   63 2  1
400     1039  9  2   71  38 2  1        1110  8  3   76  45 2  1        2731   13 2   208  67 3  1
800     2182  7  2   165 25 2  1        2797  9  3   148 51 2  1        5843   14 2   296  75 3  1
1600    6718  9  2   410 35 3  1        10361 11 3   410 76 3  1        13169  12 2   633  61 3  1
3200    13656 9  2   461 40 3  1        14780 9  3   436 50 3  1        36155  12 2   1155 67 4  1
6400    26203 9  2   947 31 3  1        33781 10 3   792 51 3  1        >57528 4  1   2110 22 4  1
Fig. 9. Results for optimality. SUB = MAXSUB, It is the number of iterations, TG is the total number of eliminating subsets generated, MT is the maximum size of subsets tried, and MG is the maximum size of eliminating subsets generated.
leads only to increased execution time without reduced memory consumption or number of predicates. The additional number of combinations attempted or constraints allowed does not lead to improved optimality. The most probable reason is that, as shown by our results, even though we are trying more combinations, the actual number or maximum size of eliminating combinations generated does not increase significantly. It would be interesting to investigate whether this is a feature of most real-life programs. If so, it would allow us, in most cases, to achieve near optimality by trying out only a small number of combinations or only combinations of small size. Acknowledgments. We thank Rupak Majumdar and Ranjit Jhala for their help with BLAST.
References
1. F. Aloul, A. Ramani, I. Markov, and K. Sakallah. PBS: A backtrack search pseudo-Boolean solver. In Symposium on the Theory and Applications of Satisfiability Testing (SAT), pages 346–353, 2002.
2. T. Ball and S. Rajamani. Automatically validating temporal safety properties of interfaces. Lecture Notes in Computer Science, 2057, 2001.
3. T. Ball and S. K. Rajamani. Generating abstract explanations of spurious counterexamples in C programs. Technical Report MSR-TR-2002-09, Microsoft Research, Redmond, January 2002.
4. S. Chaki, E. Clarke, A. Groce, S. Jha, and H. Veith. Modular verification of software components in C. In International Conference on Software Engineering (ICSE), to appear, 2003.
5. E. Clarke, O. Grumberg, M. Talupur, and D. Wang. Making predicate abstraction efficient: eliminating redundant predicates. In CAV'03, to appear.
6. E. Clarke, A. Gupta, J. Kukula, and O. Strichman. SAT based abstraction-refinement using ILP and machine learning techniques. In Proc. 14th Intl. Conference on Computer Aided Verification (CAV'02), volume 2404 of LNCS, July 2002.
7. D. Dams and K. S. Namjoshi. Shape analysis through predicate abstraction and model checking. In VMCAI, 2003.
8. S. Das, D. L. Dill, and S. Park. Experience with predicate abstraction. In Computer Aided Verification, pages 160–171, 1999.
9. T. A. Henzinger, R. Jhala, R. Majumdar, and G. Sutre. Lazy abstraction. In Symposium on Principles of Programming Languages, pages 58–70, 2002.
10. R. P. Kurshan. Computer-Aided Verification of Coordinating Processes: The Automata-Theoretic Approach. Princeton University Press, 1995.
11. G. Nelson. Techniques for Program Verification. PhD thesis, Stanford, 1980.
12. V. Rusu and E. Singerman. On proving safety properties by integrating static analysis, theorem proving and abstraction. Lecture Notes in Computer Science, 1579:178–192, 1999.
13. S. Graf and H. Saidi. Construction of abstract state graphs with PVS. In O. Grumberg, editor, Computer Aided Verification, volume 1254, pages 72–83. Springer Verlag, 1997.
Efficient Symbolic Model Checking of Software Using Partial Disjunctive Partitioning Sharon Barner and Ishai Rabinovitz IBM Haifa Research Laboratory, Haifa, Israel
Abstract. This paper presents a method for taking advantage of the efficiency of symbolic model checking using disjunctive partitions, while keeping the number and the size of the partitions small. We define a restricted form of a Kripke structure, called an or-structure, for which it is possible to generate small disjunctive partitions. By changing the image and pre-image procedures, we keep even smaller partial disjunctive partitions in memory. In addition, we show how to translate a (software) program to an or-structure, in order to enable efficient symbolic model checking of the program using its disjunctive partitions. We build one disjunctive partition for each state variable in the model directly from the conjunctive partition of the same variable and independently of all other partitions. This method can be integrated easily into existing model checkers, without changing their input language, and while still taking advantage of reduction algorithms which prefer conjunctive partitions.
1
Introduction
Symbolic model checking suffers from the known problem of state explosion. This explosion usually happens while performing the image or pre-image computation. In order to cope with this problem, symbolic model checkers use partitioned transition relations [8]. Using ordered conjunctive partitioning [7] is quite simple and sometimes allows early quantification while computing the image or pre-image; this serves to decrease the needed memory. The RuleBase model checker [1] uses ordered conjunctive partitioning, and previous work showed its application to general purpose software [5,6]. In this paper, we show how disjunctive partitioning can be used to increase the efficiency of symbolic model checking for software. Disjunctive partitioning, first introduced in [8], has several advantages over conjunctive partitioning. First, both image and pre-image computations are more efficient using disjunctive partitions, since quantification distributes over disjunction but not over conjunction [9,8]. For the same reason, distributed model checking using disjunctive partitions is also more scalable than using conjunctive partitioning, since each process can do the quantification on its own. As a result, the "heavy" computation is divided by the number of processes. Despite the advantages of disjunctive partitioning, use of the technique is generally hindered by the difficulty in building the partitions. The method presented in [8] is efficient only for asynchronous circuits. It builds the disjunctive
partitions using an interleaving model, which allows only one wire to change its value at a time. Both [2] and [4] suggested how to build disjunctive partitions for synchronous circuits. In [2], we see how to decompose an FSM into smaller FSMs, and then use this decomposition to split the conjunctive partitioned transition relation into a disjunction of conjunctive partitioned transition relations. In [4], a set of mutually exclusive events is used to decompose the behavior of the circuit into disjunctive partitions. Large disjunctive partitions are split into conjunctive partitions, which results in a DNF partitioning as in [2]. Both methods need additional information on the circuit in order to get a good decomposition. Disjunctive partitioning is also used in [10], where each transition is a separate disjunctive partition. The contribution of [10] is in presenting the order in which the transitions should be executed in order to achieve improved performance. While all the above works are applicable to models generated for software, applying them to software is problematic. The method of [8] is applicable to parallel software, but does not decompose each process into disjunctive partitions. On the other hand, [10] creates a large number of disjunctive partitions. The methods of [2] and [4] are not automated and require additional information from the user. We introduce a new method applicable to software models in which the decomposition is generated automatically, without additional information from the user. The number of disjunctive partitions created is similar to that of the conjunctive partitions for the same model, and the BDD size of the disjunctive partitions is comparable to that of the conjunctive partitions. Software has the feature that in each step there is little change in the program variables. It is quite easy to build a model for software where each step changes only the pc (program counter) and, at most, one additional state variable. We present a modeling language called ODL, which is natural for defining such models. We also present a method for translating from conjunctive partitions to disjunctive partitions and vice versa. These translations can be easily adopted by any symbolic model checker that uses conjunctive partitioning, which may thereby benefit from the advantages of disjunctive partitioning. In the traditional image computation algorithm, each disjunctive partition must represent the next value for all variables, so the disjunctive partition of state variable x should indicate the change of x and pc, and the fact that all other variables keep their value. The latter information might severely impact the BDD size of the partition. In this work, we change the image and pre-image computation in such a way that they can work on the partial disjunctive partition of x, which represents only the changes of x and pc, and not the fact that all other variables keep their value. Using this algorithm decreases the BDD size needed to represent the disjunctive partitions and improves the image computation. This method is applicable not only to software models, but also to some other methods ([8], [2] and [4]) based on the fact that only a subset of the variables in the model can change their value in each disjunctive partition. Finally, we suggest two schemes for distributed model checking that use the disjunctive partitioning.
In our work we implemented the translation from conjunctive partitioned transition relation to disjunctive partitioned transition relation. We show that the size of the partial disjunctive partitions is equal to, or even smaller than, the size of the conjunctive partitions. In addition, we show that calculating reachability analysis using disjunctive partitions significantly outperforms calculation using conjunctive partitions. The remainder of this paper is structured as follows: Section 2 states the preliminaries. Section 3 presents the generation of the model from the software and the ODL modeling language. Section 4 presents the translation between conjunctive and disjunctive partitions, and vice versa. Section 5 introduces partial disjunctive partitions and their advantages, and Section 6 presents the distributed version. In Section 7 we present some experimental results. We conclude and suggest some directions for future work in Section 8.
2
Preliminaries
A finite program can be modeled by a Kripke structure M over a set of atomic propositions AP. M = (S, S0, R, L), where S is a finite set of states, S0 is a set of initial states, R ⊆ S × S is a total transition relation, and L : S → 2^AP is a labeling function that labels each state with the set of atomic propositions that are true in that state. The states of the Kripke structure are coded by a set of state variables v̄. Each valuation of v̄ is a state in the structure.

Model checking is a technique for verifying finite state systems represented as Kripke structures. The basic operations in model checking are the image computation and the pre-image computation. Given a set of states S and a transition relation R, represented in symbolic model checking by the BDDs S(v̄) and R(v̄, v̄′) respectively, the image computation finds the set of all states related by R to some state in S, and the pre-image computation finds the set of all states such that some state in S is related to them by R. More precisely, image(S(v̄), R(v̄, v̄′)) = ∃v̄(S(v̄) ∧ R(v̄, v̄′)) and pre_image(S(v̄), R(v̄, v̄′)) = ∃v̄′(S(v̄′) ∧ R(v̄, v̄′)). The result of image(S(v̄), R(v̄, v̄′)) is over v̄′. In order to get the result over v̄, all BDD variables are "unprimed".

A conjunctive partitioned transition relation is composed of a set of partitions and_Ri such that R(v̄, v̄′) = ⋀_i and_Ri(v̄, v̄′). In case each state variable can be described by a single conjunctive partition (as in this work), we have that and_Rvi = (vi′ = f_vi(v̄)), and thus each partition is a function of v̄ and vi′ rather than v̄ and v̄′. The image computation in this case is image(S(v̄)) = ∃v̄(S(v̄) ∧ (⋀_vi and_Rvi(v̄, vi′))).

Computing ∃x A(v̄) is referred to as quantifying x out of A. Early quantification [8] can make image and pre-image computations even more efficient. Early quantification is done by quantifying a variable x out of the intermediate BDD result after conjuncting the last conjunctive partition that is dependent on x. Quantifying a variable out of the intermediate BDD may reduce the size of the BDD and as a result make the image computation easier.
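For intuition, here is an explicit-state analogue of these two operations (a hypothetical Python sketch; real symbolic model checkers perform the conjunction and quantification on BDDs, which is where partitioning matters).

def image(S, R):
    """Successors of S: the analogue of exists v (S(v) and R(v, v'))."""
    return {t for (s, t) in R if s in S}

def pre_image(S, R):
    """Predecessors of S: the analogue of exists v' (S(v') and R(v, v'))."""
    return {s for (s, t) in R if t in S}

R = {(0, 1), (1, 2), (2, 0)}  # a toy transition relation
print(image({0, 1}, R))       # {1, 2}
print(pre_image({0}, R))      # {2}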
A disjunctive partitioned transition relation is composed of a set of disjunctive partitions or_Ri such that R(v̄, v̄′) = ⋁_i or_Ri(v̄, v̄′). In the case where each state variable can be changed only in a single disjunctive partition, we have that or_Rvi = (vi′ = f_vi(v̄)) ∧ (∀y ≠ vi : y′ = y). The image computation when using disjunctive partitions is done by calculating image(S(v̄)) = ∃v̄(S(v̄) ∧ (⋁_vi or_Rvi(v̄, v̄′))). Because existential quantification distributes over disjunction, every quantification is "early", and thus image(S(v̄)) = ⋁_vi ∃v̄(S(v̄) ∧ or_Rvi(v̄, v̄′)). Because the quantification is done "early" for every vi in the disjunctive partitioning, all intermediate BDD results depend only on v̄′, while when using conjunctive partitions the intermediate BDD results may depend both on v̄ and v̄′. Thus, using disjunctive partitions usually results in smaller intermediate BDDs than when using conjunctive partitions. Note that as opposed to a conjunctive partition, the naive disjunctive partition depends on the entire vector v̄′, rather than just a single vi′. We return to this point later and show how to avoid it by modifying the image computation.

Let A ⊆ S be a set of states and let x̄ be a set of variables. We use the notation A|x̄ to indicate the projection of the set A onto x̄. That is: A|x̄ = {s ∈ S | ∃a ∈ A such that s and a agree on all values of the variables in x̄}.
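Since existential quantification distributes over disjunction, the image under a disjunctively partitioned relation is simply the union of independently computed (and independently quantified) per-partition images, as the following continuation of the sketch above illustrates (again hypothetical Python over explicit state sets).

def image_disjunctive(S, or_partitions):
    """image(S) is the union over i of image(S, or_R_i); each
    partition is handled independently, so no monolithic transition
    relation is ever built."""
    result = set()
    for or_R in or_partitions:
        result |= {t for (s, t) in or_R if s in S}  # per-partition image
    return result

parts = [{(0, 1)}, {(1, 2), (2, 0)}]
print(image_disjunctive({0, 2}, parts))  # {0, 1}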
3
Generating a Model from Software
Previous work showed the application of symbolic model checking to general purpose software [5,6] by translating C source code to EDL (Environment Description Language), a dialect of SMV [9], which is the input language of the RuleBase model checker. EDL, like SMV, is naturally suited to the building of conjunctive partitions. That previous work was based on a specially-built parser and was limited to a small subset of C. In this work, we build a similar model using a full-blown compiler front-end. The most important thing about this model is that it has the following structure.

Definition 1. An or-structure is a Kripke structure in which for every two states s, s′: if R(s, s′) then s and s′ differ from each other only in the values of the pc and no more than a single additional state variable x.

The model we build has a state variable for each global variable in the C code and a state variable named pc (program counter) that holds the value of the next statement to be performed. The model also has stacks to support local variables, functions and recursion, and some special variables to support arrays and pointers (without pointer arithmetic). The basics of the generation process are explained here using a simple example. Afterward, we will discuss the special treatment of pointers and arrays. The translation process first translates the C code to intermediate code. There are two reasons for using intermediate code: 1. It will ease the support of other input languages in the future. 2. It generates the pc in a way such that for each value of pc, a maximum of one memory location changes its value.
(a) C code of div.c:

1: z = 0;
2: while (x > 0) {
3:     z += 5;
4:     x--;
5: }

(b) The intermediate code of div.c:

18: r1 ← 0
19: z ← r1
22:
24: r2 ← x
26: pc ← (r2 > 0) ? 29 : 27
27: pc ← 53
29:
35: r3 ← z
37: r4 ← 5
39: r5 ← r3 + r4
41: z ← r5
44: r6 ← x
46: r7 ← r6 − 1
48: x ← r7
50: pc ← 22
53:
Fig. 1. Example of translation from C to intermediate code.
One may object to using intermediate code because it increases the number of values pc can get, and therefore increases the number of states in the model. While this is true, the number of pc values is only multiplied by a small factor and therefore adds only 2 or 3 bits to the state variables, which is negligible.

In Figure 1.a we can see a fragment of a C program. The code has two global variables called x and z. This code is translated to an assembly-like intermediate code shown in Figure 1.b. In the intermediate code, there is a list of instructions, each with a unique pc (program counter), listed at the beginning of each line. The pc is updated to the pc of the next line if not specified otherwise. The first two lines indicate the behavior for pc = 18 and pc = 19. This is the intermediate code generated for line 1 in the C code (z = 0). At pc = 18 the value 0 is inserted into r1, and r1 is inserted into z at pc = 19. Lines like the one for pc = 22, which don't have any code, are used as jump targets and only update the pc to the pc of the next line. Lines for pc = 24 through 27 perform the while condition: first, at pc = 24, x is inserted into r2, and then at pc = 26 it is checked whether it is bigger than 0. A true answer sets the pc to 29 (entering the loop), while a false answer sets it to the pc of the next line, which in turn sets the pc to 53, after the loop.

Next we translate the intermediate code into a model. There are two possible translations. The first one is to translate the intermediate code to a language that has the style of a guarded transition system. Each transition is of the form: pc = PC1 =⇒ (X ← f(X, Y, Z) ∧ pc ← PC2). The guard is always a condition on the value of the pc (each value of the pc has exactly one transition), and the transition changes the value of the pc and perhaps the value of one additional
(a) Model in ODL representation:

define main_r1  0
define main_r2  x
define main_r3  z
define main_r4  5
define main_r5  main_r3 + main_r4
define main_r6  x
define main_r7  main_r6 − 1

pc = 19 =⇒ (z ← main_r1 ∧ pc ← 22)
pc = 22 =⇒ (pc ← 26)
pc = 26 =⇒ (pc ← if (main_r2 > 0) then 29 else 27)
pc = 27 =⇒ (pc ← 53)
pc = 29 =⇒ (pc ← 41)
pc = 41 =⇒ (z ← main_r5 ∧ pc ← 48)
pc = 48 =⇒ (x ← main_r7 ∧ pc ← 50)
pc = 50 =⇒ (pc ← 22)

(b) Model in EDL representation:

define main_r1  0
define main_r2  x
define main_r3  z
define main_r4  5
define main_r5  main_r3 + main_r4
define main_r6  x
define main_r7  main_r6 − 1

next(pc) ← case
    pc = 19 : 22
    pc = 22 : 26
    pc = 26 : if (main_r2 > 0) then 29 else 27
    pc = 27 : 53
    pc = 29 : 41
    pc = 41 : 48
    pc = 48 : 50
    pc = 50 : 22
    else : pc
esac;
next(x) ← case
    pc = 48 : main_r7
    else : x
esac;
next(z) ← case
    pc = 19 : main_r1
    pc = 41 : main_r5
    else : z
esac;
Fig. 2. Example of div.c translation to EDL and ODL.
state variable.¹ We refer to this language as ODL. The translation to ODL is presented in Figure 2.a. The other possibility is to translate the intermediate code to EDL (Figure 2.b). For both possibilities we model the registers using a define. In this way, the registers won't use any bits in the model. This is possible because the intermediate code defines and uses each register only once. The translation to ODL is very simple. Each line in the intermediate code is translated to a guarded expression representing the changes for this value of the pc. For example, at pc = 19, z gets main_r1 (the define that represents register r1), and pc is set to 22. In the EDL code, we need to gather all the assignments to a state variable in the same place. For instance, the code for next(z) includes assignments for the lines for pcs 19 and 41 of Figure 1.b. Another difference is that in ODL it is implicit that every state variable that is not mentioned keeps its value, while EDL codes this explicitly. At first glance, it seems preferable to translate to ODL because it is simpler to translate C code to ODL, and it is simpler to translate ODL to disjunctive
¹ Note that this transition may change a different variable depending on the value of other state variables. However, only one state variable will change its value at any one time. For instance, an assignment of the form a[i] = 5 will change a[0] or a[1], etc., depending on the value of i. But only one array location will change at any one time.
partitions. But translating the C code to EDL allows us to use RuleBase to read EDL, build the conjunctive partitions, and perform pre-model-checking reductions. A reduction is simply a conservative abstraction, that is, one that preserves both positive and negative truth values. Conjunctive partitions are more natural for performing simple reductions such as constant propagation, as well as other more sophisticated reductions performed by RuleBase. Thus, even if we did not have conjunctive partitions, we would want to build them and translate the result of the reduction back to disjunctive partitions. Thus, we present methods for translating from conjunctive to disjunctive partitioning and vice versa in order to enable flexibility in our tool. In practice, using the reductions and translating the reduced conjunctive partitioning to disjunctive partitioning indeed proved to be useful. In addition, analyzing the translations enables us to bound the size of the disjunctive partitions with respect to the conjunctive partitions.
3.1
Dealing with Pointers and Arrays
Modeling pointers and arrays creates a problem, because in general an assignment to a variable X from an array or a pointer causes X to depend on more memory locations than an assignment from a scalar. In a naive approach, the BDD size of the partition for X will be quite large, because of the dependence on multiple variables. Furthermore, the large number of variables in a single partition results in many constraints on the BDD order for the entire model, which might result in a larger BDD size not just for the partition in question, but for the entire design. We solve this problem by using cut-points [3]. Our translation adds four variables for each array. For array ar we add: l_index_ar, l_array_ar, r_index_ar and r_array_ar (the prefix l/r means that the array is on the left/right side of the assignment). We translate an assignment x = ar[i] to the three assignments described in Figure 3(a), and an assignment ar[i] = x to the three assignments described in Figure 3(b). When using this translation on code containing the assignments x = ar[i]; x = ar[j]; y = ar[i]; y = ar[j];, we get that r_index_ar depends on i and j, r_array_ar depends on r_index_ar and all cells of ar, and x and y depend only on r_array_ar. Without cut-points, both x and y would have depended on i, j and all cells of array ar. With pointers, the problem is even more severe because there are generally more memory locations that can be affected by a pointer dereference than cells in an array. Still, the same idea is useful for pointers. Note that using cut-points and defines for modeling registers causes a problem when translating statements like x = a[i] + a[j]. We avoid this problem by splitting such statements into two: temp = a[i]; x = temp + a[j]. Our translation has another attribute. An assignment such as a[a[i]] = 5 is translated in the intermediate code into two different accesses to the array, one to get a[i] and the second to assign to a[a[i]], so that our translation creates the code in Figure 3(c).
(a) Translating x = ar[i]:

r_index_ar = i
r_array_ar = ar[r_index_ar]
x = r_array_ar

(b) Translating ar[i] = x:

l_index_ar = i
l_array_ar = x
ar[l_index_ar] = l_array_ar

(c) Translating a[a[i]] = 5:

r_index_a = i
r_array_a = a[r_index_a]
l_index_a = r_array_a
l_array_a = 5
a[l_index_a] = l_array_a
Fig. 3. Translation of array expressions
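The cut-point rewriting of Figure 3 is a purely syntactic transformation; a hypothetical Python sketch of the read and write cases (variable names follow the figure) could look as follows.

def rewrite_array_read(x, ar, i):
    """'x = ar[i]' becomes the three assignments of Fig. 3(a)."""
    return [(f"r_index_{ar}", i),
            (f"r_array_{ar}", f"{ar}[r_index_{ar}]"),
            (x, f"r_array_{ar}")]

def rewrite_array_write(ar, i, x):
    """'ar[i] = x' becomes the three assignments of Fig. 3(b)."""
    return [(f"l_index_{ar}", i),
            (f"l_array_{ar}", x),
            (f"{ar}[l_index_{ar}]", f"l_array_{ar}")]

print(rewrite_array_read("x", "ar", "i"))
# [('r_index_ar', 'i'), ('r_array_ar', 'ar[r_index_ar]'), ('x', 'r_array_ar')]

Afterwards x depends only on r_array_ar, which is exactly what keeps the BDD partition of x small.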
3.2
Splitting of Self-Assignment Statements
Assignment statements in the code can be of two kinds:
1. Self-assignment statement: an assignment to a variable x in which the assigned value is a function of x (e.g., x += y or x = x + w + z). Such an assignment can be further divided into two kinds: a constant self-assignment statement, where we update the variable with a constant (e.g., x *= 4, x++), and a variable self-assignment statement (e.g., x += y, x = x * b + c).
2. Foreign-assignment statement: an assignment to a variable x in which the assigned value does not depend on the value of x (e.g., x = y or x = w + z).
In order to reduce BDD size and achieve better performance we split variable self-assignment statements like x += y into two: temp = x; x = temp + y. This split increases the number of pc values and adds one variable (shared by all splits) but improves the overall performance. The reason will be explained in Section 4.1. Constant self-assignment statements can remain as is.
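A hypothetical sketch of this splitting step (assignments are (lhs, expression) pairs; the string substitution is naive and only illustrative):

def split_self_assignment(lhs, rhs_vars, rhs_expr):
    """Split a *variable* self-assignment such as 'x += y', i.e.
    x = x + y, into 'temp = x; x = temp + y'. Constant
    self-assignments like 'x *= 4' are left untouched."""
    if lhs in rhs_vars and rhs_vars != {lhs}:
        return [("temp", lhs),
                (lhs, rhs_expr.replace(lhs, "temp"))]
    return [(lhs, rhs_expr)]

print(split_self_assignment("x", {"x", "y"}, "x + y"))
# [('temp', 'x'), ('x', 'temp + y')]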
4
Translating between Disjunctive and Conjunctive Partitions
In this section, we show how to build the disjunctive partition of a state variable x, or_Rx(v̄, v̄′), from its conjunctive partition and_Rx(v̄, x′) and vice versa. Our construction is applicable only to or-structures where each dereference, such as arrays and pointers, is broken by a cut-point. Let pc be the state variable that codes the program counter of the program and let ȳ be the state variables which are different from pc and x.

Definition 2. dep_states_x(v̄) is the set of states such that for every s ∈ dep_states_x(v̄) there exists s′ such that R(s, s′) and x has different values in s and s′.

Intuitively, dep_states_x(v̄) contains all the states related to lines in the C program where x is assigned a value, except for the case where x is assigned the same value it had before the assignment.
Definition 3. dep_pcs_x(pc) is the set of pc values which are related to statements in which x may change.²

Definition 4. The partial disjunctive partition of a state variable x, denoted por_Rx(pc, x, ȳ, x′, pc′), is the disjunctive partition or_Rx(v̄, v̄′) without the requirement that the variables in ȳ are left unchanged:

    or_Rx(v̄, v̄′) = por_Rx(pc, x, ȳ, x′, pc′) ∧ (ȳ = ȳ′)
4.1
Building Disjunctive Partitions from Conjunctive Partitions
We now show how to build each disjunctive partition from the conjunctive partition of the same state variable and the conjunctive partition of pc.

Translation for x ≠ pc: First we show how to build or_Rx(v̄, v̄′) for x ≠ pc.

1. Calculate dep_states_x(v̄):

    dep_states_x(v̄) = ∃x′(and_Rx(v̄, x′) ∧ (x ≠ x′))

2. Intersect the quantification of x from dep_states_x(v̄) with the conjunctive partitions of x and pc:

    por_Rx(pc, x, ȳ, x′, pc′) = (∃x(dep_states_x(v̄))) ∧ and_Rx(v̄, x′) ∧ and_Rpc(v̄, pc′)

3. Intersect por_Rx(pc, x, ȳ, x′, pc′) with ȳ = ȳ′ to indicate that the other variables do not change:

    or_Rx(v̄, v̄′) = por_Rx(pc, x, ȳ, x′, pc′) ∧ (ȳ = ȳ′)

We use dep_states_x(v̄) in our construction and not dep_pcs_x(pc) because two states in which the pc value is identical do not necessarily change the same state variable. For example, consider the C statement a[i] = 5 and assume that it is related to pc = 7. For each value of i this statement changes a different state variable. Thus, the value pc = 7, which is related to this statement, will be in more than one disjunctive partition. If we had used dep_pcs_x(pc), the state {pc = 7; i = 2} would have been both in the partition of a[2] and in that of a[1]. As a result, after conjuncting the disjunctive partition of a[1] with ȳ = ȳ′ it would have contained another transition, one that does not exist in the original model and changes only pc and not a[1] or a[2]. This transition would have been entered into the disjunctive partition of a[1] because a[2] is in ȳ. The quantification that appears in por_Rx(pc, x, ȳ, x′, pc′) is discussed in detail later.
² x may not always change its value in a certain pc. For example, when x is a cell in an array, a[0], and the assignment is a[i] = 5, a[0] is assigned a value only if i = 0 and stays unchanged otherwise.
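Before turning to the pc partition, an explicit-state analogue of steps 1-3 may help fix the idea (hypothetical Python; states are (pc, x, y) tuples, relations are sets of state pairs, and BDD quantification becomes projection).

def dep_states(R):
    """Definition 2: states from which some transition changes x."""
    return {s for (s, t) in R if s[1] != t[1]}

def por_R_x(R):
    """Partial disjunctive partition of x: keep the transitions whose
    source agrees with some dep-state on (pc, y) -- the 'exists x' of
    steps 1-2 -- and record only (pc', x'); the next values of the
    other variables are deliberately left unconstrained, which is
    exactly the conjunct that step 3 would add back."""
    dep_pc_y = {(s[0], s[2]) for s in dep_states(R)}
    return {(s, (t[0], t[1]))
            for (s, t) in R if (s[0], s[2]) in dep_pc_y}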
Translation for pc: Calculating por_Rpc(pc, x, ȳ, pc′) is a bit different.

1. Calculate dep_pcs_x(pc) for each x ≠ pc:

    dep_pcs_x(pc) = dep_states_x(v̄)|pc

2. Calculate the set of pc values jump_pcs(pc) that are related to statements in which pc is the only state variable that is changed. These pc values are related to statements in which there is a control branch like an if statement:

    jump_pcs(pc) = ¬(⋁_{x≠pc} dep_pcs_x(pc))

3. Intersect and_Rpc(v̄, pc′) with jump_pcs(pc) to get the value of pc′ for this pc value:

    por_Rpc(pc, x, ȳ, pc′) = jump_pcs(pc) ∧ and_Rpc(v̄, pc′)

4. Intersect por_Rpc(pc, x, ȳ, pc′) with ȳ = ȳ′, where ȳ is all variables that are different from pc:

    or_Rpc(v̄, v̄′) = por_Rpc(pc, x, ȳ, pc′) ∧ (ȳ = ȳ′)

Discussion: The general idea is that transitions in which only the pc changes should be in the partition of the pc, and transitions in which both the pc and some variable x change should be in the partition of x. Naively, this means that a line with some assignment would appear in the partition of the variable being assigned, while a line without an assignment would appear in the partition of the pc. However, things are not so simple. Consider the assignment x = 5. If x has the value 5 before the assignment, then a transition from this line changes only the pc. If x has another value before the assignment, then this line changes both x and the pc. A naive construction of the or-partitions from the and-partitions would put the transition from a state where x has the value 5 into the partition of the pc, rather than into the partition of x. We would like to put this transition into the partition of x, because in this way the BDDs will be in some sense "cleaner", that is, we hope that the BDD size will be smaller. Two other related problems are the case of assignments of the form x += y, where y has the value 0, and the case of assignments to a[i] for some array a, where i is out of the array bounds. Our method deals with such cases as explained below. In order to deal with assignments of the form x = 5 we quantify x out of dep_states_x(v̄) before conjuncting the result of the quantification with and_Rx(v̄, x′) and and_Rpc(v̄, pc′). By doing so we ignore the value of x before the assignment. The need to deal with assignments of the form x += y (for which y = 0 may cause a problem in a naive construction) is the source of the splitting of variable self-assignment statements into two, as described in Section 3.2 above. This way, we avoid dealing with such assignments in the construction itself.
The problem with assignments such as a[i] = 5 needs some explanation. Consider an array a[0..2] of size three and a statement a[i] = 5, where i equals 7. Because a[7] is not a real variable in the program, there is no corresponding state variable in our model (otherwise the model would have been unbounded). Thus, in such a case, in our model only the pc is changed, and the conjunctive partitioned transition relation contains a transition which changes only the pc. But this statement is related to transitions that do change variable values (for i < 3), and thus does not "belong" in the partition for the pc (according to our notion of "cleanness"). It is possible to overcome this problem by adding a new overflow variable to the model, the disjunctive partition of which will capture this behavior. Finally, we note that in the general case, our translation does not work for statements such as a[i] = a[5] or a[a[i]] = a[i] + 1. However, when the model is generated as we suggested in Section 3.1, such statements are always split up into several statements and therefore the problem is avoided.
4.2
Building Conjunctive Partitions from Partial Disjunctive Partitions
We previously discussed how to build a disjunctive partition from a conjunctive partition. In this subsection, we present the translation in the opposite direction.

1. We first calculate dep_pcs_x(pc) simply by looking at the pcs that appear in por_Rx(v̄, x′, pc′):

    dep_pcs_x(pc) = por_Rx(v̄, x′, pc′)|pc

2. Now we can calculate and_Rx(v̄, x′). It is formed from a union of two sets: the states in which x changes its value and the states in which x keeps its value:

    and_Rx(v̄, x′) = (∃pc′(por_Rx(v̄, x′, pc′))) ∨ (¬dep_pcs_x(pc) ∧ x = x′)

3. Now we can calculate and_Rpc(v̄, pc′). It is calculated by gathering the pc-to-pc′ transitions in all the partial disjunctive partitions of the variables and disjoining them with por_Rpc(v̄, pc′):

    and_Rpc(v̄, pc′) = por_Rpc(v̄, pc′) ∨ (⋁_{x≠pc} por_Rx(v̄, pc′, x′)|(pc,pc′))
5
Using Partial Disjunctive Partitions
In the previous section, we showed how to calculate disjunctive partitions. Using this, we can take advantage of the superior efficiency of disjunctive partitioning. However, if the sizes of the disjunctive partitions are larger than those of the corresponding conjunctive partitions, it is not certain that we have gained anything.
In this section we examine the answer to this question. First, let us look at or_Rx(v̄, v̄′). By definition, or_Rx(v̄, v̄′) = por_Rx(pc, x, ȳ, pc′, x′) ∧ (ȳ = ȳ′). It is possible to build an example in which |or_Rx(v̄, v̄′)| = O(n · |por_Rx(pc, x, ȳ, pc′, x′)|), where n is the number of state variables. An example is the assignment x ← y, where x is the first variable in the BDD order after pc and y is the last state variable in the BDD order. In order to avoid this factor, we do not calculate or_Rx(v̄, v̄′). We calculate only por_Rx(pc, x, ȳ, x′, pc′) and rewrite the procedures that calculate the image and pre-image operations in such a way as to use por_Rx(pc, x, ȳ, x′, pc′) instead of or_Rx(v̄, v̄′). In the next subsection, we present the new algorithm for image and pre-image computation and prove its correctness. After that, we will bound the size of por_Rx(pc, x, ȳ, x′, pc′).
5.1
Image and Pre-Image Computations Using Partial Disjunctive Partitions
When computing the image (pre-image) using disjunctive partitions, it is possible to calculate the image (pre-image) on each disjunctive partition independently and then union the results. In this subsection, we show how to compute the image or pre-image when only por_Rx(pc, x, ȳ, pc′, x′) is given for each variable x.

Lemma 1. pre_image(S(pc′, x′, ȳ′), or_Rx(pc, x, ȳ, pc′, x′, ȳ′)) = pre_image(S(pc′, x′, ȳ), por_Rx(pc, x, ȳ, pc′, x′))

From this lemma, we get a simple algorithm that in the first step unprimes ȳ in S(pc′, x′, ȳ′) (linear in the size of the BDD), and then performs the ordinary pre-image algorithm on the result. The proof of this lemma is given in the full version of this paper.

Lemma 2. image(S(pc, x, ȳ), or_Rx(pc, x, ȳ, pc′, x′, ȳ′)) = image(S(pc, x, ȳ′), por_Rx(pc, x, ȳ′, pc′, x′))

Here again, we have a simple algorithm. First prime ȳ in S(pc, x, ȳ) and in por_Rx(pc, x, ȳ, pc′, x′), and then calculate the image using the results. The proof is almost the same as that of the previous lemma.
5.2
Bounding the Size of the Partial Disjunctive Partitions
In this subsection, we bound the size of partial disjunctive partitions. The proofs of these claims are long, technical, and tedious. Proof sketches are given in the full version of this paper. Despite the relatively large upper bound, in practice, these extreme examples are rare. See Section 7 for experimental results. Since every variable is dependent on pc, it seems wise to place pc as the first state variable in the BDD ordering. All of the following lemmas assume that the BDD ordering follows this idea.
We define por̃_Rx(v̄, x′) to be por_Rx(v̄, x′, pc′) without the condition on the value of pc′:

    por̃_Rx(v̄, x′) = (∃x(dep_states_x(v̄))) ∧ and_Rx(v̄, x′)

We can now rewrite the definition of por_Rx(v̄, x′, pc′) using por̃_Rx(v̄, x′):

    por_Rx(v̄, x′, pc′) = por̃_Rx(v̄, x′) ∧ and_Rpc(v̄, pc′)

The following lemmas will first bound the size of por̃_Rx(v̄, x′) and only then the size of por_Rx(v̄, x′, pc′).
6
Scalability for Distributed Model Checking
We now turn to the scalability of disjunctive partitioning. We claim that symbolic model checking with disjunctive partitioning is not only more efficient than with conjunctive partitioning, it also scales better. This is a direct result of the fact that quantification distributes over disjunctive partitioning, but not over conjunctive partitioning. Since image(S(v̄)) = ⋁_x ∃v̄(S(v̄) ∧ or_Rx(v̄, v̄′)), when using disjunctive partitions or partial disjunctive partitions we can calculate the image using one partition on each processor, including the quantification, and then union the results of all processors. Because the image computation may be exponential in the number of BDD nodes and the union operation is linear in the number of BDD nodes, distributing the partitions between n processors divides the "heavy" work by n. Note that when the image computation is done distributively using conjunctive partitions, it requires another step in which the partial results are "anded" together before quantification. Thus, the work done after all the processors have calculated their results may still be exponential in the number of BDD nodes. We now suggest two distributed algorithms for disjunctive partitions. The first algorithm is simple and uses a master and several slaves. The master will send S(v̄) to all the slaves and start sending each idle slave a disjunctive partition. Each slave that gets a disjunctive partition will perform the image computation with this partition and union it with its previous computations. When there are no more partitions and all slaves are idle, the master will gather all the slaves' results and union them. Reachability computation is then performed by repeated image computations using the former algorithm. One drawback of this scheme is that while the master computes the union of all the slaves' results, the slaves are idle. The second algorithm avoids this problem. In this algorithm, each process Pi is responsible for several partitions T_Ri, and has its own reachability set RSi. There is also a (shared) queue of sets of states, and each process has two pointers into this queue: a shared pointer for entering sets into the queue and a private one for reading from the queue. As a result, all processors read all the sets that enter the queue. At the beginning the queue holds the initial set of states. Each process Pi, at each iteration, takes the next set S from the queue (according to its pointer), removes from it the parts it has already handled, S = S \ RSi, and adds the result to
Example              Num of vars   Conjunctive partitions              Disjunctive partitions
                                   Reachability time  Maximal step     Reachability time  Maximal step
simple               505           11024 s            95.7 s           23.5 s             0.54 s
factorial            159           31.8 s             0.9 s            0.11 s             0.01 s
insert sort          197           264.6 s            3.9 s            15.23 s            0.08 s
quick sort           282           10197 s            10 s             172 s              0.8 s
merge sort           654           952 s              7.77 s           0.62 s             0.04 s
pointer quick sort   693           1546 s             5.8 s            57 s               1.8 s
pointer merge sort   716           > 8 h              > 99 s           78 s               0.25 s
Fig. 4. Comparison of reachability computation using conjunctive partitions against using partial disjunctive partitions.
RSi, then calculates the image of S using T_Ri, getting image_i = image(S, T_Ri). In order to continue only with the new states, the reachable states are removed from image_i, getting new_i = image_i \ RSi. In the case where new_i ≠ ∅, it is put in the next entry of the queue. When all processors are trying to read from the queue and they are all pointing to an empty slot in the queue, the algorithm has ended. At the end, each process has the whole reachability set, because it saw all the image computation results of all processes in the queue and no new set of states is entered into the queue.
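The master/slave scheme can be sketched as follows (hypothetical Python with explicit state sets standing in for BDDs; a real implementation would ship BDDs between processes, the per-partition image being the expensive, parallelized step).

from concurrent.futures import ProcessPoolExecutor

def partition_image(args):
    frontier, or_R = args
    return {t for (s, t) in or_R if s in frontier}  # one slave's work

def distributed_reachability(init, partitions, workers=4):
    """Master/slave reachability: slaves compute per-partition images
    of the current frontier in parallel; the master unions them."""
    reached, frontier = set(init), set(init)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        while frontier:
            jobs = [(frontier, p) for p in partitions]
            new = set().union(*pool.map(partition_image, jobs)) - reached
            reached |= new
            frontier = new
    return reached

if __name__ == "__main__":  # guard needed for process pools
    parts = [{(0, 1)}, {(1, 2), (2, 0)}]
    print(distributed_reachability({0}, parts))  # {0, 1, 2}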
7
Experimental Results
We implemented the translation from conjunctive partitioned transition relation to partial disjunctive partitioned transition relation in the IBM model checker RuleBase [1]. We compared reachability analysis using conjunctive partitions with reachability analysis using partial disjunctive partitions on models that were translated from software programs. These software programs were written in C and contain pointers and arrays. In both cases, we applied dynamic BDD reordering. In order to obtain a fair comparison between these algorithms, we ran each one twice. In the first run, the algorithm reordered the BDD with no time limit in order to find a good BDD order. The initial order of the second run was the BDD order found by the first run. The partial disjunctive partitioning outperforms the conjunctive partitioning with respect to execution time, as shown in Figure 4. We compared the sizes of partial disjunctive partitions with those of conjunctive partitions under the same BDD order. The table in Figure 5 shows the maximal and minimal ratios between the size of a specific variable's partial disjunctive partition and the size of its conjunctive partition. We specifically note the ratio for the pc variable and the size of its partial disjunctive partition. In addition, we show the maximal conjunctive partition and the maximal partial disjunctive partition, not including pc. We observed that the partial disjunctive partitions were of the same order of magnitude as, or even smaller than, the conjunctive partitions. This was achieved by the use of partial disjunctive partitions instead of ordinary disjunctive partitions. In our experiments we found that the
                            Ratios between partition sizes              Partition sizes
Example              Vars   Min disj/conj  Max disj/conj  pc disj/conj  Disj pc  Max conj  Max disj
simple               505    0.65           1.54           1.00          8101     10788     10777
factorial            159    0.53           1.27           1.00          3562     1447      1433
insert sort          197    0.58           1.38           0.98          3201     360       390
quick sort           282    0.53           1.34           1.00          12595    1422      1124
merge sort           654    0.47           1.49           1.00          7925     8346      8341
pointer quick sort   693    0.46           1.71           1.00          17650    62225     52155
pointer merge sort   716    0.35           1.30           0.99          6861     66987     32145
Fig. 5. Comparison between size of conjunctive partitions and partial disjunctive partitions.
size of each ordinary disjunctive partition (or_Rx(v̄, v̄′)) was up to 84 times the size of its corresponding partial disjunctive partition.
8
Conclusions and Future Work
Using partial disjunctive partitions seems to be a successful and natural scheme for software models. In this work, we show how to apply disjunctive partitioning to software models while keeping the partitions small. We also show how to enhance the image and pre-image computation to support our partial disjunctive partitions and make model checking algorithms more efficient. However, this is only the beginning and there are a number of directions for future work. As we note above, we handle variables with a large number of bits by creating a single partition for each variable containing the behaviors of all its bits. Future work will explore the possibility of implementing the DNF partitioned transition relation [4], where the disjunctive partition of a state variable is composed of conjunctive partitions of its bits. As we claimed in Section 6, disjunctive partitioned transition relation is natural for distributed algorithms. It seems wise to implement and explore both algorithms presented in that section. Special attention should be given to finding a good distribution of the disjunctive partitions over the processes in order to achieve good load balancing. Acknowledgments. We thank Cindy Eisner, Yoad Lustig and Ziv Nevo for many helpful discussions.
References
1. I. Beer, S. Ben-David, C. Eisner, and A. Landver. RuleBase: an industry-oriented formal verification tool. In Proc. DAC96, pp. 655–660, 1996.
2. G. Cabodi, P. Camurati, L. Lavagno, and S. Quer. Disjunctive partitioning and partial iterative squaring: an effective approach for symbolic traversal of large circuits. In Proc. DAC97, pp. 728–733, 1997.
3. G. Cabodi, P. Camurati, and S. Quer. Auxiliary variables for extending symbolic traversal techniques to data paths. In Proc. DAC94, pp. 289–293, 1994.
4. W. Chan, R. Anderson, P. Beame, and D. Notkin. Improving efficiency of symbolic model checking for state-based system requirements. In Proc. ISSTA98, 1998.
5. C. Eisner. Model checking the garbage collection mechanism of SMV. In S. D. Stoller and W. Visser, editors, Electronic Notes in Theoretical Computer Science, volume 55. Elsevier Science Publishers, 2001.
6. C. Eisner and D. Peled. Comparing symbolic and explicit model checking of a software system. In Proc. SPIN2002, LNCS 2318, pp. 230–239, 2002.
7. D. Geist and I. Beer. Efficient model checking by automated ordering of transition relation partitions. In Proc. CAV94, LNCS 818, pp. 299–310, 1994.
8. J.R. Burch, E.M. Clarke, and D.E. Long. Symbolic model checking with partitioned transition relations. In A. Halaas and P.B. Denyer, editors, International Conference on Very Large Scale Integration, pp. 49–58, 1991.
9. K. McMillan. Symbolic Model Checking. Kluwer Academic Publishers, 1993.
10. M. Solé and E. Pastor. Traversal techniques for concurrent systems. In Proc. FMCAD 2002, LNCS 2517, pp. 220–237, 2002.
Instantiating Uninterpreted Functional Units and Memory System: Functional Verification of the VAMP

Sven Beyer1, Chris Jacobi2, Daniel Kröning3, Dirk Leinenbach1, and Wolfgang J. Paul1

1 Saarland University, Computer Science Department, 66123 Saarbrücken, Germany
{sbeyer,dirkl,wjp}@cs.uni-sb.de
2 IBM Deutschland Entwicklung GmbH, Processor Dev. II, 71032 Böblingen, Germany
[email protected]
3 Carnegie Mellon University, Computer Science, Pittsburgh, PA
[email protected]

Abstract. In the VAMP (verified architecture microprocessor) project we have designed, functionally verified, and synthesized a processor with full DLX instruction set, delayed branch, Tomasulo scheduler, maskable nested precise interrupts, pipelined fully IEEE compatible dual precision floating point unit with variable latency, and separate instruction and data caches. The verification has been carried out in the theorem proving system PVS. The processor has been implemented on a Xilinx FPGA.
1 Introduction
Previous Work. Work on the formal verification of processors has so far concentrated mainly on the following aspects of architectures: i) Processors with in-order scheduling, one or several pipelines including forwarding, stalling, and interrupt mechanisms [3,13,28]. The verification of the very simple, nonpipelined FM9001 processor has been reported in [2]. Using the flushing method from [3] and uninterpreted functions for modeling execution units, superscalar processors with multicycle execution units, exceptions, and branch prediction have been verified by automatic BDD-based methods [28]. Also, one can transform specification machines into simple pipelines (with forwarding and stalling mechanism) by an automatic transformation, and automatically generate formal correctness proofs for this transformation [15]. ii) Tomasulo schedulers with reorder buffers for the support of precise interrupts [5,8,16,24]. Exploiting symmetries, McMillan [16] has shown the correctness of a powerful Tomasulo scheduler with a remarkable degree of automation. Using theorem proving, Sawada and Hunt [24] show the correctness of an entire out-of-order processor, precise interrupts, and a store buffer for the memory unit. They also consider self-modifying code (by means of a sync instruction).
The work reported here was done while the author was with Saarland University.
Research supported by the DFG graduate program ‘Effizienz und Komplexität von Algorithmen und Rechenanlagen’.
Research supported by the DFG graduate program ‘Leistungsgarantien für Rechnersysteme’.
iii) Floating point units (FPUs). The correctness of an important collection of floating point algorithms is shown in [21,22] using the theorem prover ACL2. Correctness proofs using a combination of theorem proving and model checking techniques for the FPUs of Pentium processors are claimed in [4,19]. As the verified unit is part of an industrial product, not all details have been published. Based on the constructions and on the paper and pencil proofs in [18], a fully IEEE compatible FPU has been verified [1,11] (using mostly but not exclusively theorem proving).

iv) Caches. Multiple cache coherence protocols have been formally verified, e.g., [6,17,25,26]. Paper and pencil proofs are extremely error prone, and hence the generation of proofs for interactive theorem proving systems is slow. The method of choice is model checking. The compositional techniques employed by McMillan [17] even allow for the verification of parameterized designs, i.e., cache coherence is shown for an arbitrary number of processors.

Simplifications, Abstractions, and Restrictions. Except for the work on floating point units, the cache coherence protocol in [6], and the FM9001 processor [2], none of the papers quoted above states that the verified design actually has been implemented. All results cited above except [1,2,6,11] use several simplifications and abstractions: i) The realized instruction set is restricted: always included are the six instructions considered in [3]: load word, store word, jump, branch equal zero, three-register ALU operations, and ALU immediate operations. Five typical extra instructions are trap, return from exception, move to and from special registers, and sync [24]. The branch equal zero instruction is generalized in [28] by an uninterpreted test evaluation function. Most notably, the verification of machines with load/store operations on half words and bytes has apparently not been reported. In [27] the authors report an attempt to handle these instructions by automatic methods which was unsuccessful due to memory overflow. ii) Delayed branch is replaced by non-deterministic speculation (speculating branch taken/not taken). iii) Sometimes, non-implementable constructs are used in the verification of the processors: e.g., Hosabettu et al. [8] use tags from an infinite set. Obviously, this is not directly implementable in real hardware. iv) The verification of the FPUs covers neither the handling of denormal numbers nor that of exception flags. The verification of a dual precision FPU has not been reported (though, obviously, Intel's and AMD's FPUs are capable of dual precision). v) No verification of a memory unit with caches has been reported. Eiriksson [6] only reports the verification of a bit-level implementation of a cache coherence protocol without data consistency. vi) The verification of pipelines or Tomasulo schedulers with instantiated floating point units and memory units with caches and main memory bus protocol has not been reported. Indeed, in [27] the authors state: “An area of future work will be to prove that the correctness of an abstract term-level model implies the correctness of the original bit-level design.”

Results and Overview. In the VAMP (verified architecture microprocessor) project we have designed, functionally verified, and synthesized a processor with full DLX
instruction set, delayed branch, Tomasulo scheduler, maskable nested precise interrupts, pipelined fully IEEE 754 [9] compatible dual precision floating point unit with variable latency, as well as separate, coherent instruction and data caches. We use only finite tags in the hardware. Thus, all abstractions, restrictions, and simplifications mentioned above have been removed. Specification and verification were performed using the interactive theorem proving system PVS [20]. All formal specifications and proofs are on our web site.1 The hardware description was automatically extracted from PVS and translated into Verilog HDL by a tool sketched in section 7. The hardware, with non-verified rudimentary software, is up and running on a Xilinx FPGA. The Verilog design can also be downloaded from our web site.

In section 2, we summarize the fixed point instruction set, its floating point extension, and the interrupt support realized. We give a micro-architectural overview with a focus on the memory system. Section 3 describes the correctness criterion, the main proof strategy, and the integration of the execution units into the Tomasulo core. Correctness criterion and proof strategy are based on scheduling functions [14,18] (similar to the stg-component of MAETTs [23]). The model of the execution units is more general than previous models in a nontrivial way, without complicating the interactive proofs too much. Section 4 presents a delayed branch mechanism, which is automatically constructed and proven correct by the methods for automatic pipeline construction from [15], and summarizes the specification of an interrupt mechanism for maskable nested precise interrupts and delayed PC from [18]. Section 5 deals with the integration of the floating point unit from [11] into our Tomasulo scheduler. Section 6 deals with loads and stores of double words, words, half words, and bytes at a 64-bit cache/memory interface. We also sketch correctness proofs of the implementation of a simple coherence protocol between data cache and instruction cache, as well as the implementation of a main memory bus protocol. Section 7 describes the implementation of the VAMP on a Xilinx FPGA. Section 8 gives an overview of the verification effort for the various parts of the project, summarizes our work, and sketches directions of some future work.
2 Overview of the VAMP Processor
Instruction Set. The full DLX instruction set from [7] is realized. This includes loads and stores for double words, words, half words, and bytes, various shift operations, and two jump-and-link operations. Loads of bytes and half words can be unsigned or signed. In order to support the pipelining of instruction fetches, delayed branch with one delay slot is used. Note that delayed branch changes the sequential semantics of program execution.

The floating point extension of the DLX instruction set from [18] is supported. The user sees a floating point register file with 32 registers of single precision numbers as well as a single floating point condition code register FCC. Pairs of floating point registers can be accessed as registers for double precision numbers (with an even register address). Supported operations are: i) loads and stores for singles and doubles. ii) +, −, ×, ÷ both for single and double precision numbers. iii) test-and-set, whose result is stored in FCC.
1 http://www-wjp.cs.uni-sb.de/forschung/projekte/VAMP/
Fig. 1. Main data paths of the VAMP processor
iv) conditional branches as a function of FCC. v) conversions between singles, doubles, and integers. vi) moves between the general purpose register file and the floating point register file. Operations are fully IEEE compatible [9]. In particular, all four rounding modes, denormal numbers, and exponent wrapping as a function of the interrupt masks are realized.

Interrupt Support. Presently, the interrupts from Table 1 in section 4 are supported. Interrupts are maskable and precise. Floating point interrupts are accumulated in 5 bits of a special purpose register IEEEf (IEEE flag), as prescribed by the IEEE standard. All special purpose registers (details in section 4) are collected into a special purpose register file. Operations supporting the interrupt mechanism are: i) moves between general purpose registers and special purpose registers. ii) trap. iii) return-from-exception.

Microarchitecture Overview. Figure 1 gives a high level overview of the VAMP microarchitecture. Stages IF and ID are a pipelined implementation of delayed branch, as explained in section 4. Stages EX, C, and WB realize a Tomasulo scheduler with 5 execution units, a fair scheduling policy on the common data bus CDB, and a reorder buffer ROB (for precise interrupts). The execution units are: i) MEM: a memory unit with variable latency and internal pipelining; there is presently no store buffer. ii) XPU: the fixed point unit. iii) FPU1 to FPU3: specialized pipelined floating point units with variable latency. FPU1 performs additions and subtractions, FPU2 performs multiplications and divisions, and FPU3 performs test-and-set as well as conversions.

The data output of the reorder buffer is 64 bits wide. The floating point register file FPR is physically realized as 16 registers, each 64 bits wide. The general purpose register file GPR and the special purpose register file SPR are both 32 bits wide, and have 32 and 9 entries, respectively. They are connected to the low-order bits of the ROB output.
Fig. 2. Data paths of the VAMP memory unit
Figure 2 depicts a simplified view of the memory unit. Internally, it has two pipeline stages. The first stage does address and control signal computations. The second stage performs the actual data cache access via signals adr, din, and dout. Instructions are fetched from the instruction cache via signals pc and inst. The memory interface Mif internally consists of a data cache, an instruction cache, and a main memory. The caches are kept coherent (this does not suffice to guarantee correct execution of self-modifying code). Details are explained in section 6.
3 Correctness Criterion and Tomasulo Algorithm
Notations. We consider a specification machine $S$ and an implementation machine $I$. Configurations of these machines are tuples, whose components $R_S$ and $R_I$, respectively, are registers or memories. Register contents are bit strings. Memory contents are modeled as mappings from addresses (bit strings) to bit strings. For example, $PC_S$ denotes the program counter of the specification machine, and $mem_I$ denotes the main memory of the implementation machine. The specification machine processes a sequence of instructions $I_0, I_1, \ldots$ at the rate of one instruction per step. We denote by $R_S^i$ the content of component $R$ before execution of instruction $I_i$. One step of the implementation machine is a hardware cycle, and we denote by $R_I^T$ the content of component $R$ during cycle $T$. The fetch of the 4 bytes of an instruction into the instruction register $IR$ of the implementation machine during cycle $T$ can be specified by $IR_I^{T+1} := mem_I^T[PC_I^T + 3 : PC_I^T]$. Although the instruction register is not a visible register, one can specify the desired content $IR_S^i$ of the instruction register of the specification machine for instruction $I_i$ as a function of the visible components by $IR_S^i = mem_S^i[PC_S^i + 3 : PC_S^i]$. Defining the
next configuration $c_S^{i+1}$ of the specification machine involves many such intermediate definitions, e.g., the immediate constant $imm_S^i$, the effective address $ea_S^i$, etc. Starting from the visible components $R_S$, we extend the configuration of the specification machine in this way by numerous (redundant) secondary components.

Scheduling Functions. For hardware cycles $T$ and pipeline stages $k$ of the implementation machine, we formally define an integer valued scheduling function $sI(k, T)$ [14], where $sI(k, T) = i$ has the intended meaning that instruction $I_i$ is in stage $k$ during cycle $T$. By treating instruction numbers like integer valued tags,2 the definition of these functions is straightforward. We initialize $sI(k, 0) := 0$ for all stages. We then “clock” these tags through the pipeline stages under the control of the update enable signals3 $ue_k$ for the output registers of stage $k$. If a stage is not clocked, the scheduling function is not changed, i.e., $sI(k, T) := sI(k, T-1)$ if $\neg ue_k^{T-1}$. Note that we introduce separate “stages” $k$ for each reservation station and ROB entry. For the fetch stage4, e.g., we define $sI(fetch, T) := sI(fetch, T-1) + 1$ if $ue_{fetch}^{T-1}$, meaning that the content of the fetch stage progresses by one instruction in the instruction stream $I_0, I_1, \ldots$ If stage $k$ receives data from stage $k'$ in cycle $T$, we define $sI(k, T) := sI(k', T-1)$. Note that this covers the case that a stage can receive data from two different stages $k'$ and $k''$, since in a fixed cycle $T$, it receives data from only one of these stages. This occurs at the ROB, e.g., where we allow bypassing branch instructions from the instruction register directly into the ROB without going through an execution unit. Thus, the ROB can receive data from the CDB and from the instruction register.

As a form of bookkeeping for the memory unit, we introduce an additional “stage” $mem'$. The corresponding scheduling function $sI(mem', T)$ equals $sI(mem, T)$ if the memory unit is empty or the instruction in the unit has not accessed the main memory yet. Otherwise, we set $sI(mem', T) := sI(mem, T) + 1$. We need this bookkeeping function in order to model whether the memory has already been updated by a store instruction.

Correctness Criterion. We are interested in the content of the main memory $mem$ and the register files $RF \in \{GPR, FPR, SPR\}$ after certain instructions $I_i$, respectively before instruction $I_{i+1}$. The main memory is an output “register” of stage $mem$, and the register files are output “registers” of stage $wb$. The functional correctness criterion requires an instruction $I_i$ in stage $mem$ of the implementation machine $I$ to see the same memory content as the corresponding instruction of the specification machine $S$; formally, $mem_I^T = mem_S^{sI(mem', T)}$. The corresponding condition for register files is $RF_I^T = RF_S^{sI(wb, T)}$. In general, we prove by induction on $T$ for all stages $k$ and all output registers $R$ of stage $k$ that $R_I^T = R_S^{sI(k, T)}$, where $R_S^i$ can be a visible or
2 Having integer valued tags is only a proof trick. In hardware, we only use finite tags. During the proof of correctness for the Tomasulo scheduler, we prove that these finite tags properly match the infinite instruction numbers.
3 Update enable signals are sometimes called ‘register activates’. They are used to (de-)activate the updating of register contents.
4 We introduce symbolic names for some stages k, e.g., fetch and mem.
redundant component of the configuration of the specification machine. Note that for technical reasons, we claim for the instruction register that $IR_I^T = IR_S^{sI(fetch, T)-1}$.

The liveness criterion states that all instructions that are not interrupted reach the writeback stage. At the time of submission of this paper, we have separate formal liveness proofs for the scheduler and the execution units; we are currently working on combining them into a single formal liveness proof for the entire machine.

Paper and pencil proofs for the correctness of Tomasulo schedulers tend to follow a canonical pattern: i) For instruction $I_i$ and register operand $R$, one defines $last(i, R)$ as the index of the last instruction before $I_i$ which wrote register $R$. ii) One shows by induction that the formal definitions of tags and valid bits have the intended meaning. In our setting, this means that the finite tags in hardware correspond to the integer valued tags provided by the scheduling function $sI$. iii) Finally, one has to show that the reservation station of instruction $I_i$ reconstructs $R_S^{last(i, R)}$. The rest is easy.

Fig. 3. Model of an execution unit

It is important to observe that the structure of these paper and pencil proofs and their formal (theorem proving) counterparts do not depend much on the fixed or variable latency of execution units or on whether these units are pipelined. The scheduler recognizes instructions completed by the execution units simply by examining the tags returned from the units. The situation is very different for model checking [28].

Integration of Execution Units. The proofs for the scheduler and the proofs for the execution units are separated by the following specifications for the execution units [11,10]. Notations refer to figure 3.

i) $stall_{in}^T \Rightarrow \neg valid_{out}^T$, i.e., if the scheduler asserts $stall_{in}$, the execution unit does not return a valid instruction.
ii) $\forall T \, \exists T' > T : \neg stall_{out}^{T'}$, i.e., the $stall_{out}$ signal is never active indefinitely.
iii) Instructions dispatched with $tag_{in} = tg$ at time $T$ will eventually (at time $T' \geq T$) return a result with the same tag, i.e., $tag_{out}^{T'} = tg$. Moreover, $data_{out}^{T'} = f(data_{in}^T)$, where $f$ is the (combinatorial) function the execution unit is supposed to compute.
iv) For each time $T$ at which a result with tag $tg$ is returned, there is an earlier time $T' \leq T$ such that an instruction with tag $tg$ was dispatched at time $T'$, and tag $tg$ was not returned between $T'$ and $T$. Hence, the execution units do not create spurious outputs.
Note that the instructions do not need to leave the execution units in the order in which they enter the units; all FPUs, e.g., exploit this by allowing instructions on some special operands to overtake other instructions. Moreover, multiplications may overtake divisions (cf. [10] for details). The four conditions above must be shown for each of the execution units, provided the scheduler guarantees the following three conditions: i) No instruction is dispatched
to an execution unit which sends a $stall_{out}$ signal to its reservation station. ii) The execution units are not stalled forever by the producers. iii) Tag-uniqueness: no tag which is dispatched into an execution unit is already in use.
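For illustration, these conditions can be checked on a finite recorded trace. The sketch below is our own rendering, not part of the PVS development; it checks tag-uniqueness, condition iv) (no spurious results), and the data part of condition iii), while the liveness conditions proper would require reasoning over infinite traces.

```python
# Trace checker for the execution-unit interface. Events are
# ("dispatch", tag, data) or ("result", tag, data); `f` is the
# combinatorial function the unit is supposed to compute.

def check_unit_trace(trace, f):
    in_flight = {}                            # tag -> dispatched operand
    for event, tag, data in trace:
        if event == "dispatch":
            assert tag not in in_flight       # tag-uniqueness (scheduler side)
            in_flight[tag] = data
        elif event == "result":
            assert tag in in_flight           # iv): no spurious outputs
            assert data == f(in_flight.pop(tag))  # iii): correct data
    return True

trace = [("dispatch", 1, 3), ("dispatch", 2, 5),
         ("result", 2, 10), ("result", 1, 6)]     # out-of-order completion
assert check_unit_trace(trace, lambda x: 2 * x)
```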
4 Delayed Branch and Maskable Nested Precise Interrupts
In the delayed branch mechanism, taken branches yield a new PC of the form $PC + imm + 4$, taken branches are delayed, and $PC + 8$ is saved to the register file during jump-and-link. In the equivalent delayed PC mechanism [14,18], one uses an intermediate program counter $PC'$ with branch targets $PC' + imm$, all fetches use a delayed program counter $DPC$, and $PC' + 4$ is saved during jump-and-link. Figure 4 depicts a pipelined implementation of the delayed PC mechanism in the VAMP processor. This construction and its formal correctness proof are automatically obtained by the method for automatic pipeline construction from [15]. Indeed, fetching instructions from the intermediate program counter $PC'$ is—not only intuitively but formally—forwarding of $DPC$. The roles of the multiplexers above $PC'$ and $DPC$ are explained in the following paragraphs about interrupts.

Fig. 4. VAMP PC Environment

The formal specification of the interrupt mechanism for delayed PC is based on the definitions of [18, Chap. 5, 9.1]. Table 1 shows the supported interrupts.5 The special purpose registers for the interrupt mechanism are: i) a status register $SR$ for interrupt masks, ii) two registers $ECA$ for the exception cause and $EData$ for parameters passed to the interrupt service routine, iii) two registers $EPC$ and $EDPC$ for return addresses for $PC'$ and $DPC$, and iv) a register $IEEEf$ for the accumulation of masked floating point exceptions.

At issue time of an instruction $I_i$, it is unknown whether $I_i$ will be interrupted and whether the interrupt requires the interrupted instruction to be repeated or not. Therefore, we have to save two pairs of potential return addresses in the reorder buffer: $(PC'^i_S, DPC_S^i)$ for interrupts of type “repeat”, and the results of the uninterrupted next $PC'$ and next $DPC$ computations $(PC'^{u,i+1}_S, DPC_S^{u,i+1})$ for interrupts of type “continue”. The data paths of the PC environment are shown in figure 4.

Interrupt handling in the specification machine $S$ depends on the components $ECA$ and $EData$. In the implementation, these two registers are treated as additional results of the execution units; thus, execution units have up to four 32-bit results. This affects the width of the ROB. The formal correctness of these components in the ROB at writeback time is asserted without additional verification effort by the consistency of the Tomasulo scheduler. Further lemmas are needed for the correctness of the PCs stored in the ROB. The return-from-exception instruction is treated like any other instruction; no special effort is needed here.
5 Page fault signals are presently tied to zero.
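A minimal executable sketch of the delayed PC mechanism, under our simplifications (no interrupts, no jump-and-link, and the branch decision passed in as a flag):

```python
# Delayed PC: all fetches use DPC, which trails the intermediate program
# counter PC'; branch targets are PC' + imm. The instruction at the old
# PC' still executes, giving one delay slot.

def spec_step(pc_prime, dpc, mem, taken, imm):
    instr = mem[dpc]                                  # fetch uses DPC
    next_pc_prime = pc_prime + (imm if taken else 4)
    return next_pc_prime, pc_prime, instr             # DPC := old PC'
```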
Table 1. Implemented interrupts

index  name                   maskable  type
0      reset                  no        abort
1      illegal instruction    no        repeat
2      misalignment           no        repeat
3      page fault on fetch    no        repeat
4      page fault load store  no        repeat
5      trap                   no        continue
6      arithmetic overflow    yes       continue
7      FPU overflow           yes       continue
8      FPU underflow          yes       continue
9      FPU loss of accuracy   yes       continue
10     FPU division by zero   yes       continue
11     FPU invalid            yes       continue
12     FPU unimplemented      no        continue
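For illustration only, the table can be rendered as a small resolution routine. The maskable/type data below mirrors Table 1; resolving multiple pending causes by the lowest index is our assumption and is not stated in the paper.

```python
# Interrupt resolution over the cause bits in ECA and the masks in SR.
MASKABLE = {6, 7, 8, 9, 10, 11}          # arithmetic overflow, FPU flags
TYPE = {0: "abort", 5: "continue", 12: "continue"}
TYPE.update({i: "repeat" for i in (1, 2, 3, 4)})
TYPE.update({i: "continue" for i in range(6, 12)})

def pending_interrupt(eca_bits, masks):
    """eca_bits/masks: index -> bool. Returns (index, type) or None.
    Assumption: lowest pending unmasked index wins."""
    for i in range(13):
        if eca_bits.get(i) and (i not in MASKABLE or not masks.get(i)):
            return i, TYPE[i]
    return None
```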
Since the main memory is updated before writeback of an instruction, one has to guarantee that in case of an interrupt, all stores prior to the interrupted instruction are executed, but none of the stores of the instructions after it. In particular, one has to show that a store that has reached the writeback stage has also accessed the main memory, i.e., that it did not enter the wrong execution unit.
5 Floating Point Unit
Execution Units. The FPUs and their verification are described in [11]. The construction and verification of the combinatorial circuits are based on the paper and pencil proofs from [18]. The internal control of the iterative unit for multiplication and division is complex: during cycles when the division unit performs a subtraction step, the multiplier can be used by multiplication operations or by multiplication steps of other division operations. Moreover, operations with special operands are processed in a single cycle. Thus, in general, the units do not process instructions in order, but that is not required by the specifications from section 3. We remark that we have formal proofs but no paper and pencil proofs for the correctness and liveness of the floating point control. The control was constructed and verified with the help of a model checker [10].

At first sight, floating point operations have two operands and one result. However, the rounding mode (stored in a special purpose register $RM$) and the interrupt masks (stored in $SR$) are two further operands of every floating point operation. Moreover, there is aliasing in connection with the addressing of the floating point registers: each single precision floating point register can be accessed by single precision operations as well as by double precision operations. The ISA does not preclude the construction of a double precision operand by two writes with single precision to the upper and lower half of a double precision register. It can be necessary to forward these two results from separate places when the double precision operand is read. This is easily realized by treating the upper half and the lower half of double precision operands as separate operands. Thus, reservation stations for dual precision floating point units have 6 operands.

IEEE Flags and Synchronization. The exception flags for interrupts 6 to 12 are part of the result of every floating point operation $I_i$. They are accumulated in the special purpose register $IEEEf$ during writeback of $I_i$. We have already seen in section 4 that this affects
the width of the reorder buffer. A move operation $I_j$ which reads from register $IEEEf$ is issued only after the entire reorder buffer is empty. This simple modification of the issue logic makes it very easy to prove that the flags of all floating point operations preceding $I_j$ are accumulated when $IEEEf$ is read by $I_j$. A move instruction from $IEEEf$ to general purpose register 0, which is constantly 0, acts as a sync operation for self-modifying code, as explained at the end of the following section.
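The issue-logic restriction just described amounts to a one-line predicate; the sketch below is ours, with an assumed instruction encoding.

```python
# Moves that read IEEEf wait until the reorder buffer has drained, so all
# preceding floating point flags are already accumulated when they issue.
def may_issue(instr, rob_empty):
    if instr.get("reads_ieeef"):
        return rob_empty
    return True
```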
6 Memory Interface
Loads and Stores with Variable Operand Width. The formal specification of the semantics of the memory instructions is based on the definitions in [18, Chap. 3]. Accesses are characterized by their effective address $ea$ and their width in bytes $d \in \{1, 2, 4, 8\}$. An access is aligned if $ea \bmod d = 0$. Effective addresses $ea$ define a double word address $da(ea) = \lfloor ea/8 \rfloor$ and a byte address $ba(ea) = ea \bmod 8$. A simple “alignment lemma” states that for aligned accesses, the memory operand $mem[ea + d - 1 : ea]$ equals bytes $[ba(ea) + d - 1 : ba(ea)]$ of the double word addressed by $da(ea)$ at the memory interface.6 Details can be found in [18].

Circuits called shift4load and shift4store are used in order to ensure that data is loaded and stored correctly. These circuits are shown in figure 2. “Shift for store” denotes shifting the data, say the halfword which is to be stored, into the correct position of a doubleword before it is sent to the 64-bit wide memory interface. Similarly, “shift for load” denotes extraction of the requested portion (say a halfword) of the 64 bits delivered from the memory interface. Also, sign extension is done during “shift for load” for signed byte and halfword loads. Shift for store and shift for load are implemented by means of two simplified shifters with some control logic [18].

The proof of correctness of the VAMP memory interface is structured hierarchically. First, we verify the VAMP with an idealized memory interface $m_{spec}$, a dual-ported memory without caches. Second, we show that a cache memory interface with split caches backed up by a unified main memory $m_{impl}$ behaves exactly like the dual-ported memory $m_{spec}$. Thus, $m_{spec}$ serves as the specification for the cache memory interface. By putting these two independent proofs together, we obtain the correctness of the VAMP with split caches with respect to the memory $mem_S$ of the specification machine.

Cache Specification and Implementation. The memory $m_{spec}$ is defined recursively, i.e., it is updated on the double word address $a$ iff a write access to address $a$ terminates. Separate byte enables $mwb_b$ allow for updating only some of the 8 bytes stored at address $a$. Formally, we have for any byte $b < 8$ and any double word address $a$:
T +1
:=
din[b]T m spec[8 · a + b]T
a = adrT ∧ mwT ∧ mwbTb ∧ dbusy T else
The memory interface is implemented with split caches connected to a single main memory as depicted in figure 5. We use a write-back policy for the data cache, i.e., on a 6
6 Note that this specifies a little endian memory organization.
write access of the CPU, the data cache is updated and the corresponding data is marked as dirty. Thus, a slow access to the main memory is avoided. If dirty data is to be evicted from the cache, it is written back to the main memory in order to ensure data consistency.

The protocol used to keep the caches coherent works as follows: If a cache signals a hit on a CPU access, the data is read directly from the cache or written to it, depending on the type of the CPU access. This allows for memory accesses that take only one cycle to complete. If, on the other hand, the cache signals a miss, the corresponding data has to be loaded into the cache. The control first examines the other cache in order to find out if it holds the required data. In this case, the data in the other cache is invalidated. If the data to be invalidated is dirty, this requires an additional write back to the main memory.

This consistency protocol guarantees exclusiveness, i.e., for any address, at most one of the two caches signals a hit. In this way, we ensure that on a hit of the instruction cache, the data cache does not contain newer data. The instruction and data caches are implemented as k-way sectored set-associative caches using an LRU replacement policy. Cache sectors consist of 4 double words since the bus protocol supports bursts of length 4.

Fig. 5. Cache memory interface

Typical Lemmas. The inductive invariant used to show consistency of split caches as described above consists of three parts. Two of these parts are obvious: if the data or instruction cache, respectively, signals a hit, then its output data equals the specified memory content. However, an invariant consisting only of these two claims is not inductive since caches are reloaded from the main memory. Therefore, we need a third part of our invariant stating the consistency of data in the main memory. Thus, we also claim that on a clean hit or a miss in cycle $T$ on address $Dadr^T$ in the data cache, the main memory $m_{impl}$ at this address $Dadr^T$ contains the specified memory content. Note that on a clean hit in the data cache, we thus claim data consistency in both the data cache and the main memory. Formally, we have the following claim:

$$\begin{aligned}
Ihit^T &\Rightarrow Idout[b]^T = m_{spec}[8 \cdot Iadr^T + b]^T \;\wedge \\
Dhit^T &\Rightarrow Ddout[b]^T = m_{spec}[8 \cdot Dadr^T + b]^T \;\wedge \\
\neg(Dhit^T \wedge dirty^T) &\Rightarrow m_{impl}[8 \cdot Dadr^T + b]^T = m_{spec}[8 \cdot Dadr^T + b]^T.
\end{aligned}$$

This invariant is strong enough to show transparency of the whole memory interface since the data word returned to the CPU on a read access is just the cache output in case of a hit, or the data written to the cache during reload in case of a miss. Note that the invariant relies on the exclusiveness property of the protocol, which has to be verified as part of the proof of the invariant.
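The addressing and shifting machinery of this section can be made concrete in a few lines of Python. The sketch below is ours: byte strings stand in for the 64-bit doubleword bus, and the slice bounds play the role of shift4store/shift4load and the byte enables.

```python
# Byte-addressed view of a doubleword memory with aligned accesses of
# width d at effective address ea (little endian, as specified).

def da(ea): return ea // 8          # double word address
def ba(ea): return ea % 8           # byte address within the double word

def store(mem, ea, d, value_bytes):
    """Write d bytes; only the enabled bytes of the doubleword change."""
    dw = mem.setdefault(da(ea), bytearray(8))
    dw[ba(ea):ba(ea) + d] = value_bytes          # shift4store + byte enables

def load(mem, ea, d):
    """Extract the requested d bytes from the 64-bit doubleword."""
    dw = mem.get(da(ea), bytearray(8))
    return bytes(dw[ba(ea):ba(ea) + d])          # shift4load

m = {}
store(m, 10, 2, b"\x34\x12")        # aligned halfword at ea = 10
assert load(m, 10, 2) == b"\x34\x12"
```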
Fig. 6. 4-burst write timing diagram

Fig. 7. Burst control FSD
Bus Protocol. The main memory is accessed via a bus protocol featuring bursts. The bus protocol signals ready data by raising $brdy$ one cycle in advance. A sample timing of a 4-burst write is depicted in figure 6. Note that the data input $din$ one cycle after $brdy$ is written to the main memory, and that the end of the access is signaled by $reqp \wedge brdy$. As part of our correctness proof for the memory interface, we have formalized this bus protocol and proved by means of theorem proving that an automaton7 according to figure 7 implements this protocol correctly. The main invariant for this proof is the following: in the cycle of the $i$-th memory access of the burst, i.e., after the $i$-th $brdy$, the automaton is in state $mem$ for the $i$-th time. In the cycle of the last memory access, the automaton is in state $last\_mem$.

Self-Modifying Code. We consider self-modifying code independent of the implementation of the memory interface. As an additional precondition for the correctness of code, we demand that in case an instruction is fetched from a memory location $adr$, there is a special sync instruction between the last write to $adr$ and the fetch from $adr$.8 In the VAMP architecture, this sync instruction is implemented without additional hardware by a special move from the $IEEEf$ register to R0, as mentioned in section 5. We have formally verified that this use of the sync instruction suffices to show the correctness of the implementation in case of self-modifying code.
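The precondition on self-modifying code is easy to state as a trace check. The sketch below is ours, with an assumed event encoding; it asserts that every fetch from an address is separated from the last write to that address by a sync.

```python
# Events: ("write", adr), ("sync",), ("fetch", adr).
def check_sync_condition(trace):
    dirty = set()                   # addresses written since the last sync
    for event in trace:
        if event[0] == "write":
            dirty.add(event[1])
        elif event[0] == "sync":
            dirty.clear()
        elif event[0] == "fetch":
            assert event[1] not in dirty, "fetch without intervening sync"
    return True

assert check_sync_condition([("write", 100), ("sync",), ("fetch", 100)])
```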
7 Synthesis
We have translated the PVS hardware description of the VAMP processor to Verilog HDL using an automated tool called pvs2hdl. The tool unrolls recursive definitions and then performs a fairly straightforward translation. The Verilog representation of the
7 Note that this bus control FSD is only a part of the FSD for the cache memory interface.
8 This implies the correspondency condition from [23].
processor (including caches and floating point unit) has been synthesized, implemented, and tested on a Xilinx FPGA hosted on a PCI board. Some additional unverified hardware for controlling the VAMP processor and for accessing its memory from the host PC is also present on this FPGA.

The VAMP processor occupies about 18000 slices of a Xilinx Virtex FPGA. This accounts for a gate count of 1.5 million gates as reported by the Xilinx tools. The design contains 9100 bits of registers (not counting memory and caches) and runs at 10 MHz. Note that we assume a fully synchronous design, i.e., all registers share the same clock, and RAM blocks for register files or caches are also updated synchronously to this clock; thus, concerning timing, they can be treated like registers. In a fully synchronous design, valid data is needed only at the rising edge of the clock with certain setup and hold times. The synthesis software analyzes all paths between inputs and registers, registers and registers, and registers and outputs; thus, it can guarantee that our logical design can be implemented with a certain maximum clock speed preserving all our proved properties. In particular, we fully ignore any glitches, i.e., instabilities in signals during a clock period that are resolved before the next rising edge of the clock, since these glitches do not influence fully synchronous designs. Our approach therefore does not cover designs where certain signals must be kept stable for several cycles, i.e., where glitches must not occur. This is the case for asynchronous EDO-RAM chips that need stable addresses for a fixed amount of time. Since we use synchronous RAM chips, our proofs guarantee the correctness of the design regardless of any occurring glitches.

We have ported gcc and the GNU C library to the VAMP in order to execute test programs on the VAMP. As was to be expected from our verified design, we found no errors in the VAMP processor. When testing some cases of denormal results of floating point operations, however, we found differences between the VAMP FPU and Intel's Pentium II FPU. This is due to some discrepancies between Intel's FPU and the IEEE standard. See [11] for further details.
8 Conclusion
Verification Effort. The formal verification of the VAMP microprocessor took about eight person-years; for the translation tool and the synthesis on the FPGA, an additional person-year was required. Table 2 summarizes the verification effort for the different parts of the VAMP. Note especially that “Putting it all together” took a whole person-year, for several reasons. First of all, the proof of the Tomasulo core from [12] was only generic and had to be applied to the VAMP architecture, especially the VAMP instruction set. Unfortunately, in spite of thorough planning on our part, the interfaces between the different parts did not match exactly; thus, a lot of effort went into patching the interfaces. Additionally, self-modifying code and the special implementation of the $IEEEf$ register had to be considered. Also, interrupt support and a memory unit still had to be added to the formally verified Tomasulo core. Last but not least, PVS does not scale well to projects this large; typechecking of the VAMP alone already takes more than two hours on our fastest machine.

To the best of our knowledge, we have reported for the first time the formal verification of i) a processor with the full DLX instruction set including load and store
Table 2. Verification effort

Part                     Effort in years  Lemmas  Proof steps
Tomasulo core & ALU      2                521     14367
FPU                      3                1046    25936
Cache Memory Interface   2                566     24432
Putting it all together  1                415     23887
Total                    8                2548    88622
instructions for bytes, half words, words, and double words, ii) a processor with delayed branch, iii) a processor with maskable nested interrupts, iv) a processor with an integrated floating point unit, and v) a memory system with separate instruction and data caches. More importantly, the above mentioned constructions and proofs are integrated into a single design and a single correctness proof. Thus, we can be sure that no oversimplifications have been made in any part of the design. PVS ensures that there are no proof gaps left. The design is synthesized9 and implemented on an FPGA. The complexity of the design is comparable to industrial controllers with FPUs. To the best of our knowledge, VAMP is by far the most complex processor formally verified so far.

We see several directions for further work in the near future: i) adding a store buffer to the memory unit; ii) the treatment of a memory management unit with separate translation look-aside buffers for data and instructions; iii) proving formally that a machine with memory management unit and appropriate page fault handlers as part of the operating system gives a single user program the view of a uniform virtual memory, which requires arguing about hardware and software simultaneously; iv) redoing as much as possible of the present correctness proof with automatic methods. For such methods, any subset of our lemmas lends itself as a benchmark suite with a very nice property: we know that it can be completed to the correctness proof of a full bit-level design.
9 The trivial proof of synthesizability.

References

1. C. Berg and C. Jacobi. Formal verification of the VAMP floating point unit. In Proc. 11th CHARME, volume 2144 of LNCS, pages 325–339. Springer, 2001.
2. B. Brock, W. A. Hunt, and M. Kaufmann. The FM9001 microprocessor proof. Technical Report 86, Computational Logic Inc., 1994.
3. J. R. Burch and D. L. Dill. Automatic verification of pipelined microprocessor control. In CAV 94, volume 818 of LNCS, pages 68–80, Stanford, California, USA, 1994. Springer-Verlag.
4. Y.-A. Chen, E. M. Clarke, P.-H. Ho, Y. Hoskote, T. Kam, M. Khaira, J. W. O'Leary, and X. Zhao. Verification of all circuits in a floating-point unit using word-level model checking. In FMCAD, volume 1166 of LNCS, pages 19–33. Springer, 1996.
5. W. Damm and A. Pnueli. Verifying out-of-order executions. In CHARME, IFIP WG10.5, pages 23–47, Montreal, Canada, 1997. Chapman & Hall.
6. A. P. Eiriksson. The formal design of 1M-gate ASICs. In G. Gopalakrishnan and P. Windley, editors, FMCAD 98, volume 1522 of LNCS, pages 49–63. Springer, 1998.
7. J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, CA, second edition, 1996.
8. R. Hosabettu, M. Srivas, and G. Gopalakrishnan. Proof of correctness of a processor with reorder buffer using the completion functions approach. In Computer-Aided Verification, CAV '99, volume 1633 of LNCS, pages 47–59, Trento, Italy, 1999. Springer-Verlag.
9. Institute of Electrical and Electronics Engineers. ANSI/IEEE standard 754–1985, IEEE Standard for Binary Floating-Point Arithmetic, 1985.
10. C. Jacobi. Formal verification of complex out-of-order pipelines by combining model-checking and theorem-proving. In CAV, volume 2404 of LNCS. Springer, 2002.
11. C. Jacobi. Formal Verification of a fully IEEE compliant floating point unit. PhD thesis, Saarland University, Germany, 2002.
12. D. Kroening. Formal Verification of Pipelined Microprocessors. PhD thesis, Saarland University, Computer Science Department, 2001.
13. D. Kröning, S. Müller, and W. Paul. Proving the correctness of pipelined micro-architectures. In 3. ITG/GI/GMM-Workshop Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen, pages 89–98. VDE Verlag, 2000.
14. D. Kröning, S. Müller, and W. Paul. Proving the correctness of processors with delayed branch using delayed PCs. pages 579–588, 2000.
15. D. Kröning and W. Paul. Automated pipeline design. In Proc. of the 38th Design Automation Conference, pages 810–815. ACM Press, 2001.
16. K. McMillan. Verification of an implementation of Tomasulo's algorithm by compositional model checking. In CAV 98, volume 1427 of LNCS. Springer, June 1998.
17. K. McMillan. Parameterized verification of the FLASH cache coherence protocol by compositional model checking. In CHARME 2001, volume 2144 of LNCS. Springer, 2001.
18. S. M. Mueller and W. J. Paul. Computer Architecture. Complexity and Correctness. Springer, 2000.
19. J. O'Leary, X. Zhao, R. Gerth, and C.-J. H. Seger. Formally verifying IEEE compliance of floating-point hardware. Intel Technology Journal, Q1, 1999.
20. S. Owre, N. Shankar, and J. M. Rushby. PVS: A prototype verification system. In CADE 11, volume 607 of LNAI, pages 748–752. Springer, 1992.
21. D. M. Russinoff. A mechanically checked proof of IEEE compliance of the floating point multiplication, division and square root algorithms of the AMD-K7 processor. LMS Journal of Computation and Mathematics, 1:148–200, 1998.
22. D. M. Russinoff. A case study in formal verification of register-transfer logic with ACL2: The floating point adder of the AMD Athlon processor. In FMCAD-00, volume 1954 of LNCS. Springer, 2000.
23. J. Sawada and W. A. Hunt. Trace table based approach for pipelined microprocessor verification. In CAV 97, volume 1254 of LNCS. Springer, 1997.
24. J. Sawada and W. A. Hunt. Processor verification with precise exceptions and speculative execution. In CAV 98, volume 1427 of LNCS. Springer, 1998.
25. X. Shen, Arvind, and L. Rudolph. CACHET: an adaptive cache coherence protocol for distributed shared-memory systems. In International Conference on Supercomputing, 1999.
26. J. Stoy, X. Shen, and Arvind. Proofs of correctness of cache-coherence protocols. In FME, volume 2021 of LNCS. Springer, 2001.
27. M. N. Velev and R. E. Bryant. Superscalar processor verification using efficient reductions of the logic of equality with uninterpreted functions to propositional logic. In CHARME, volume 1703 of LNCS. Springer, 1999.
28. M. N. Velev and R. E. Bryant. Formal verification of superscalar microprocessors with multicycle functional units, exception, and branch prediction. In DAC. ACM, 2000.
A Hazards-Based Correctness Statement for Pipelined Circuits

Mark D. Aagaard

Electrical and Computer Engr., University of Waterloo
[email protected]

Abstract. The productivity and scalability of verifying pipelined circuits can be increased by exploiting the structural and behavioural characteristics that distinguish pipelines from other circuits. This paper presents a formal model of pipelines that augments a state machine with information to describe the transfer of parcels between stages, and reading and writing state variables. Using our model, we created a definition of correctness that is based on the well-established principles of structural, control, and data hazards. We have proved that any pipeline that satisfies our hazards-based definition of correctness is guaranteed to satisfy the conventional correctness statement of Burch-Dill style flushing.
1 Introduction
In early verifications of pipelined circuits, the manual effort to discover abstraction functions limited both the productivity and scalability of verification. Burch and Dill’s use of flushing a pipeline to derive an abstraction function automatically [5] improved verification productivity and scalability by sheltering the user from the complexities of the pipeline. Unfortunately, realistic circuits are beyond the scope of such push-button verification. To scale verification to larger pipelines, researchers invented a variety of decomposition strategies. Jones et al. used knowledge about pipeline behaviour to create incremental flushing [8]. Pnueli et al. [4] and Sawada and Hunt [12] used pipeline behaviour as a guide for defining intermediate models. Hosabettu et al. developed completion functions to decompose pipelines stage-by-stage [7]. McMillan used knowledge about the behaviour of pipelines to guide assume-guarantee decomposition [10]. We believe that a model of state machines that captures the distinguishing structure and behaviour of pipelined circuits will improve verification productivity and scalability. The structure of a pipeline is a network of stages through which parcels (instructions) flow. The behaviour of a pipeline can be described using the principles of structural, control, and data hazards. This paper presents a formal model and a correctness statement for pipelines based on stages, parcels, and hazards. Our goals were: remain true to the intuitive meaning of pipelines and hazards, separate orthogonal concerns into distinct correctness obligations, and support cutting-edge optimizations. Our model of pipelines augments a state machine with pipeline-specific functions and predicates (Section 2): transferring a parcel between stages, writing to a variable,
This work was supported in part by the Natural Sciences and Engineering Research Council of Canada and by the Semiconductor Research Corporation Contract RID 1030.001.
and reading from a variable. The model supports superscalar and out-of-order execution, external kill signals, exceptions, external interrupts, bypass registers, and register renaming [2]. Our correctness statement, PipeOk, separates correctness obligations relating to different hazards, datapath functionality, and flushing (Section 3). We have proved that any pipeline that satisfies PipeOk is guaranteed to satisfy the standard Burch-Dill flushing correctness statement (Section 4).

PipeOk contains thirteen correctness obligations that provide a natural decomposition strategy. Each obligation describes a single type of behaviour, for example, write-after-write hazards. Because hazards are well understood by both verification and design engineers, verification engineers will be able to more easily discuss test plans, verification strategies, and counterexamples with designers. Because each obligation focuses on a single type of behaviour, verifying the obligations will be amenable to powerful abstraction mechanisms. For example, the ordering of reads and writes can be verified separately for each variable and need only reason about consecutive operations.

To prove that PipeOk implies Burch-Dill correctness, we prove that PipeOk implies Flushpoint Equality (flushed states are externally equivalent to specification states) and then use the previously proven result that Flushpoint Equality implies Burch-Dill correctness [3]. We prove that PipeOk implies Flushpoint Equality by showing: read and write operations happen in the correct order, the result of each write operation is correct, and finally that flushing works correctly.
2 Modelling Pipelines
This section describes our formal model of pipelines. We begin with an informal description of the “parcel view” of a pipeline, which motivates our approach. The remainder of the section presents the model, auxiliary functions to relate a pipeline to its specification, and conditions to ensure that the auxiliary functions are consistent.

2.1 The Parcel View of a Pipeline

A pipeline is a network of stages. Parcels, or instructions, flow through the stages and read from and write to variables, or signals, in the pipeline. Figure 1 shows the runs of a sample program on an instruction set architecture specification, a four-stage pipelined microprocessor, and a “parcel view” of the pipeline. Each run is annotated to show when each parcel moves between stages and when each variable is read or written. The value of a variable is denoted by the label of the instruction that writes to the variable.

Conventional verification strategies compare a snapshot of the pipeline state to a specification state. Because a pipeline state contains the effects of multiple partially executed parcels, it is difficult to relate the implementation to the specification. For example, step 4 of the pipeline contains parcels A, B, C, and D, which represent portions of steps 1, 2, 3, and 4 of the specification. A recent trend has been to examine the implementation only when it is in a flushed state, such as steps 0 and 9 of the pipeline, which are externally equivalent to steps 0 and 5 of the specification.
The sample program of Figure 1:

A: R1 = #1
B: R2 = R1 + #1
C: R1 = R1 + #3
D: R2 = R1 + R2
E: R2 = R2 + #1

Fig. 1. Specification, pipeline and parcel view of a sample program
Table 1. Definition of a pipeline

Conventional state machine
  state                         Set of states.
  Nsr : state → state → bool    Next-state relation.
  isInit : state → bool         Initial-state predicate.
Pipeline sets
  stage      Set of identifiers for stages in the pipeline, including Top and Bot.
  addr_i     Set of identifiers for data storage variables in the pipeline.
  isExt : (a : addr_i) → (q : state) → bool      Variable is externally visible.
  isStore : (a : addr_i) → (q : state) → bool    Variable is for data storage.
  subPipes : (s : stage) → pipe                  One pipe record for each stage.
Probes
  xfr : (q : state) → (s1 : stage) → (s2 : stage) → bool
             In state q, a parcel transfers from s1 to s2.
  Wr : (a : addr_i) → (q : state) → (s : stage) → bool
             A parcel in s writes to address a in state q.
  Rd : (a : addr_i) → (q : state) → (s : stage) → bool
             A parcel in s reads from address a in state q.
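To make the record structure of Table 1 concrete, here is one possible rendering as a datatype. This is our sketch, not the paper's formalization; the paper's model is a mathematical record, and the field and type names below merely mirror Table 1.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

@dataclass
class Pipe:
    stages: Set[str]                              # includes "Top" and "Bot"
    addrs: Set[str]                               # data storage variables
    nsr: Callable[[dict, dict], bool]             # next-state relation
    is_init: Callable[[dict], bool]               # initial-state predicate
    is_ext: Callable[[str, dict], bool]           # externally visible?
    is_store: Callable[[str, dict], bool]         # data storage variable?
    xfr: Callable[[dict, str, str], bool]         # parcel transfer probe
    wr: Callable[[str, dict, str], bool]          # write probe
    rd: Callable[[str, dict, str], bool]          # read probe
    sub_pipes: Dict[str, "Pipe"] = field(default_factory=dict)
```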
The slice to the left (right) of each parcel shows the variables as read (written) by the parcel. Gray backgrounds denote values that are inconsistent with the specification. For example, in step 2 of the parcel view, R1 is shown in gray, because R1 is I in the pipeline and A in the specification. The parcel for B is able to execute correctly, because it reads its operand from the bypass register, which corresponds to R1 at that time.

The parcel view of pipelines was inspired by two observations: first, for each parcel, the only state variables that are relevant to its correctness are those that it reads or writes; second, if every parcel is executed correctly, then the pipeline is correct. Our proof that our correctness statement, PipeOk, implies Burch-Dill flushing relies on the parcel view of the pipeline. We have proved that if the order of read and write operations with respect to parcels in the pipeline is the same as the order with respect to states in the specification, then data dependencies are obeyed.
2.2 Formal Model of Pipelines
Our formal model of pipelines (Table 1) augments a standard model of non-deterministic state machines with predicates to detect when parcels transfer between stages, read from state variables, and write to state variables. We use these predicates to compute the parcel view of a pipeline from the next-state relation.

The predicate xfr detects the transfer of a parcel between two stages. We have defined instantiations of xfr for a wide variety of protocols for transferring parcels [1]. Transfers can often be detected using one or two signals, such as the valid bits for the stages. In the set of stages, Top and Bot are virtual stages: they do not exist in the pipeline. For input/output pipelines, such as systolic arrays or execution units in microprocessors, Top represents the module in the environment from which parcels enter the pipeline and Bot represents the module to which parcels exit. For closed systems, such as microprocessors
Table 2. Functions for comparing a pipeline and specification

Sets
  addr_s     Set of identifiers for data storage variables in the specification.
  data_s     Set of data values in the specification.
Structural-hazard correctness
  Match : (σ : run) → (t1 : time) → (tn : time) → bool
             The parcel that enters at time t1 exits at time tn.
Control-hazard correctness
  ShouldExit : (σ : run) → (t : time) → bool
             The parcel that enters should eventually exit.
Data-hazard and datapath correctness
  addrmap : (a : addr_i) → (q : state) → addr_s
             Maps addresses of the implementation to addresses in the specification.
  datamap : (a : addr) → (q : state) → data_s
             Maps the data in q.a to the corresponding specification data value.
Flushing correctness
  Flush : state → state        Flushes a state.
  IsFlushed : state → bool     A state is flushed.
2.3 Relating Implementations and Specifications
To verify a pipeline against a specification, we need to compare the behaviours of the pipeline and the specification. Typically, this is done with a function that says how many instructions are fetched and an external-equivalence relation. Table 2 shows the analogous objects for our model. We use Match to identify the entrance and exit time of each parcel. Match supports superscalar pipelines by instantiating the type time with a pair of a clock cycle and a port [1]. When working with hierarchical pipelines, we want to treat the stages as black boxes. The Match relation allows us to match parcels entering and exiting stages while hiding the internal structure of the stage. We have found five common instantiations for Match: degenerate, constant latency, in-order, unique tags, and tagged in-order [1].
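To make the Match interface concrete, here is a hedged Python sketch of two of the five instantiations named above (constant latency and unique tags). The (cycle, port) encoding of time follows the discussion; the helper names tag_in and tag_out are hypothetical:

    from typing import Callable, Tuple

    Time = Tuple[int, int]    # (clock cycle, port): supports superscalar pipelines

    def match_const_latency(latency: int) -> Callable:
        # Constant-latency stage: a parcel entering at cycle c exits at cycle c + latency.
        def match(run, t1: Time, tn: Time) -> bool:
            (c1, p1), (cn, pn) = t1, tn
            return cn == c1 + latency and pn == p1
        return match

    def match_unique_tags(tag_in: Callable, tag_out: Callable) -> Callable:
        # Unique-tag stage: entry and exit are paired by a tag carried with the parcel;
        # tag_in(run, t) / tag_out(run, t) report the tag entering/exiting at time t.
        def match(run, t1: Time, tn: Time) -> bool:
            tag = tag_in(run, t1)
            return tag is not None and tag == tag_out(run, tn)
        return match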
Table 3. Consistency conditions on pipelines and specifications

Specification conditions
  1. The specification is deterministic. This is required for flushpoint-equality correctness to imply Burch-Dill correctness. Implementations may be non-deterministic.
Traversal conditions
  2. If ShouldExit is true, then a parcel entered the pipeline.
  3. Parcels cannot transfer from the pipeline to the "Top" stage.
  4. Parcels cannot transfer from the "Bot" stage to the pipeline.
  5. Time increases monotonically as parcels traverse through the pipeline.
  6. IsFlushed cannot be true while a parcel is traversing through the pipeline.
  7. A storage operation can happen in a stage only if a parcel is in the stage.
Storage conditions
  8. If an address map changed, then a write must have happened in Impl.
  9. If a data map changed, then a write must have happened in Impl.
  10. If a Spec variable changed, then a write must have happened in Spec.
  11. When a pipeline is flushed, external equality and storage equality are identical.
Flushing conditions
  12. Flush is idempotent on flushed pipelines.
  13. All reachable states are reachable from a flushed state.
  14. From any state, a flushed state can be reached eventually.
The predicate ShouldExit says whether a parcel that enters the pipeline should be executed. We have identified instantiations for ShouldExit that include external kill signals, branch prediction, internal exceptions, and external interrupts [2]. We separate external equivalence into two functions: addrmap, which defines a mapping between variables in the pipeline and specification, and datamap, which maps data in the pipeline to the specification. Address maps may depend on the current state: the identity of the specification variable that a bypass register represents depends upon the contents of the bypass register. When an implementation variable does not represent any specification variable (e.g., a bypass register when it contains a bubble), addrmap returns ⊥, as shown in steps 0–2 for the pipeline in Figure 1. To relate PipeOk to flushpoint equality and Burch-Dill flushing, we require that each pipeline defines a function Flush and a predicate IsFlushed.
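As a small illustration of a state-dependent address map, consider a single bypass register. The dictionary-based state and field names below are assumptions made for the sketch, not the paper's model:

    BOTTOM = None    # plays the role of the paper's ⊥: "no specification variable"

    def addrmap_bypass(a: str, q: dict):
        # The specification variable that a bypass register represents depends on
        # the current state: q["bypass_tag"] names the architectural register the
        # bypass currently shadows, or is BOTTOM when the bypass holds a bubble.
        if a == "bypass":
            return q["bypass_tag"]
        return a    # architectural registers map to themselves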
2.4 Consistency Conditions
Table 3 summarizes the conditions required for the predicates and functions in the pipeline model to be consistent with the behaviour of the state machine in the model. The complete mathematical definitions appear in a technical report [2].
3 Correctness Obligations
We begin with a summary of our notation. We then present our correctness obligations according to the different types of hazards, datapath functionality, and flushing.
3.1 Notation
When working with theorems relating a run of a specification to a run of an implementation, we often find it useful to draw "box" or commuting diagrams (Figure 2a). In Figure 2a, x and y refer to the states shown as circles. Properties associated with states and edges are listed in Figure 2b. We denote the tth element of a run σ as σ^t. We use run m σ to mean that σ is a run of the state machine m, as defined by: ∀t. m (σ^t) (σ^{t+1}). As a syntactic shorthand, we write m q q′ rather than m.Nsr q q′, and we drop the name of the pipeline when referring to parameters other than Nsr.
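Since "run m σ" quantifies over all times, it can only be sampled on a finite prefix in practice; a one-line Python sketch, assuming a list of states and a two-place next-state predicate:

    def is_run_prefix(nsr, sigma) -> bool:
        # Check ∀t. m (σ^t) (σ^{t+1}) on the finite prefix we actually have.
        return all(nsr(sigma[t], sigma[t + 1]) for t in range(len(sigma) - 1))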
Fig. 2a. Graphical notation. Each "box" diagram abbreviates a formula over the states x and y and the edge property f, namely one of:
  (P x) ∧ (Q y)
  (P x) ∧ (Q y) ∧ (f x y)
  (P x) ∧ (Q y) ∧ (x < y)
  (P x) ∧ (Q y) =⇒ (f x y)
  (P x) ∧ (Q y) =⇒ (x < y)
  (P x) =⇒ ∃ y. Q y
  (P x) =⇒ ∃ y. (Q y) ∧ (f x y)
  ILLEGAL: (P x) ∧ (f x y) =⇒ (∃ y. Q x)

Fig. 2b. State and step properties:
  0 : the initial ("0th") state
  R : a read is performed
  W : a write is performed
  F : the state is flushed
  W̄ : no write is performed

Fig. 2c. Variable identifiers:
  a : address;  q : state;  s : stage;  t : time;  σ : run of a state machine

Fig. 2. Notation and conventions
3.2 Top-Level Correctness Statements

Our top-level correctness statement, Definition 1, PipeOk, is the conjunction of thirteen correctness obligations. Each correctness obligation guarantees that a particular type of behaviour is implemented correctly. Section 3.3 describes structural-hazard correctness; Section 3.4 describes data-hazard correctness; Section 3.5 describes datapath functionality correctness; Section 3.6 describes additional correctness obligations needed to ensure that flushed states are externally equivalent to specification states. There are no correctness obligations that address only control hazards. Instead, control hazards permeate both structural hazard correctness and data hazard correctness. For structural hazards, we make sure that correctly speculated parcels are executed and incorrectly speculated parcels do not exit the pipeline. For data hazards, we make sure that incorrectly speculated parcels do not leave behind data results that are read by correctly speculated parcels.
Definition 1 Correctness of pipelines
PipeOk Impl Spec ≡
  Struct-hazard correctness
    1 EnterTotFun Impl ∧  2 ExitTotFun Impl ∧  3 MatchIffTrav Impl ∧
  Data-hazard correctness
    4 RawHazOk Impl Spec ∧  5 WawHazOk Impl Spec ∧  6 WarHazOk Impl Spec ∧
    7 SpecRdTotFun Impl Spec ∧  8 SpecWrTotFun Impl Spec ∧  9 ImplWrTotFun Impl Spec ∧
  Datapath correctness
    10 DatapathOk Impl Spec ∧
  Flushing correctness
    11 ImplWrFlush Impl Spec ∧  12 SpecWrFlush Impl Spec ∧  13 ImplInvalidateFlush Impl Spec

3.3 Structural-Hazard Correctness Obligations
Structural hazard correctness is concerned with contention between parcels for resources in the pipeline. Typical bugs associated with structural hazards are loss of parcels, duplication of parcels, generation of bogus parcels inside the pipeline, deadlock, and livelock. A pipeline handles its structural hazards correctly if there is a one-to-one mapping between parcels that enter the pipeline and should exit and those parcels that do exit, and if the parcels that exit do so in the correct order.

Definition 2 tracks a parcel as it traverses from stage to stage in a pipeline. The expression (t1, s1) ⇝σ (tn, sn) means that in the run σ, a parcel enters the stage s1 at t1, traverses from s1 to sn, and exits the stage sn at tn. In the base case, s1 and sn are the same stage. In the inductive case, there is an intermediate stage s2 such that the parcel transfers from s1 to s2 and then traverses from s2 to sn. To detect when the parcel exits s1, we use the matching relation provided by s1, according to our hierarchical model of pipelines. Definition 2 supports pipelines with loops, because Match separately identifies each iteration. We use ⇝ to define Trav, which means a parcel traverses through the pipeline from Top to Bot.

Definition 2 Traversing between stages in a pipeline (⇝)
(t1, s1) ⇝σ (tn, sn) ≡
    (s1.Match σ t1 tn ∧ s1 = sn)
  ∨ (∃ t2, s2. s1.Match σ t1 t2 ∧ xfr σ t2 s1 s2 ∧ (t2, s2) ⇝σ (tn, sn))

Obligation 1, EnterTotFun, says that for each time (t1) that a parcel enters the pipeline and should exit, there exists exactly one time (t2) such that the parcel exits at t2 (total and functional). Obligation 2, ExitTotFun, says that each parcel that exits the pipeline (XfrOut) comes from exactly one parcel that entered the pipeline and should have exited (surjective and injective). Together, Obligations 1 and 2 guarantee that the relationship between entering and exiting parcels is bijective.
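Definition 2 transcribes directly into a recursive search; the following Python sketch assumes (hypothetically) that match(s, t) returns the unique exit time of the parcel that entered stage s at time t, and that xfr_at(t, s) returns the stage the exiting parcel transfers into, if any:

    def traverses(match, xfr_at, t1, s1, tn, sn, fuel=64):
        # (t1, s1) ~> (tn, sn): either the base case, or transfer to s2 and recurse.
        if fuel == 0:                       # guard against ill-formed logs
            return False
        t2 = match(s1, t1)                  # when does the parcel exit stage s1?
        if t2 is None:
            return False
        if s1 == sn and t2 == tn:           # base case: s1 = sn
            return True
        s2 = xfr_at(t2, s1)                 # inductive case: s1 -> s2, then s2 ~> sn
        return s2 is not None and traverses(match, xfr_at, t2, s2, tn, sn, fuel - 1)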
Obligation 1 Each entrance results in exactly one exit
EnterTotFun Impl ≡
  ∀ σi, t1. run Impl σi ∧ IsFlushed σi^0 ∧ ShouldExit σi t1
    =⇒ ∃! t2. Trav Impl σi t1 t2

Obligation 2 Each exit comes from exactly one entrance
ExitTotFun Impl ≡
  ∀ σi, t2. run Impl σi ∧ IsFlushed σi^0 ∧ XfrOut σi t2
    =⇒ ∃! t1. Trav Impl σi t1 t2 ∧ ShouldExit σi t1

Obligation 3, MatchIffTrav, says that parcels that exit the pipeline do so in the correct order, as defined by the pipeline-specific matching relation (Match). MatchIffTrav allows pipelines to be treated as black boxes in hierarchical verification, by relating the traversal of parcels inside the pipeline, Trav, to the entrance and exit of parcels.

Obligation 3 Match correctly identifies when a parcel traverses the pipeline
MatchIffTrav Impl ≡ ∀ σ, t1, t2. [Match Impl σ t1 t2] ⇐⇒ [Trav Impl σ t1 t2]
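On a logged finite run, Obligations 1 and 2 amount to a bijectivity check between entrances and exits. A Python sketch over assumed event sets (entrance times with ShouldExit true, exit times with XfrOut true, and the observed Trav pairs):

    def struct_hazards_ok(entries, exits, trav_pairs):
        by_entry, by_exit = {}, {}
        for (t1, t2) in trav_pairs:
            by_entry.setdefault(t1, set()).add(t2)
            by_exit.setdefault(t2, set()).add(t1)
        # Obligation 1: each should-exit entrance has exactly one exit time.
        enter_tot_fun = all(len(by_entry.get(t1, set())) == 1 for t1 in entries)
        # Obligation 2: each observed exit comes from exactly one should-exit entrance.
        exit_tot_fun = all(len(by_exit.get(t2, set())) == 1 and
                           by_exit[t2] <= set(entries)
                           for t2 in exits)
        return enter_tot_fun and exit_tot_fun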
3.4 Data-Hazard Correctness Obligations
A data dependency exists between a producing (writing) instruction and a consuming (reading) instruction if the producing instruction writes to an address that the consuming instruction reads from and no instruction between the producer and the consumer writes to that address. A pipeline implements data dependencies correctly if every data dependency in the specification is obeyed in the implementation. Data hazards are categorized as: read-after-write, write-after-read, and write-after-write. If a pipeline handles all three types of data hazards correctly, then it implements data dependencies correctly.

In Figure 3, the gray lines represent orderings between specification and implementation operations that will violate the dependency between Wi and Ri. Read-after-write (Raw) hazard correctness guarantees that Ri occurs after Wi. Together, write-after-write and write-after-read hazard correctness guarantee that no write will occur to this address between Wi and Ri. Write-after-write (Waw) correctness guarantees that no programmatically earlier write happens after Wi. Write-after-read (War) correctness guarantees that no programmatically later write will occur before Ri. Figure 3 has many simplifications that are violated by optimizations such as bypass registers, register renaming, and out-of-order execution. Our formalization supports these optimizations using dynamic address maps, multiple writes, and out-of-order writes [2]. The data hazard obligations ensure that reads and writes in the implementation occur in the correct order. We use the symbols W[...]
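The three hazard orderings can be phrased as a simple check on a logged execution. This Python sketch assumes an invented log format (execution-ordered (kind, addr, seq) events, with seq the program order of the instruction), not the paper's formalization:

    def dependency_obeyed(events, addr, w_seq, r_seq):
        # Check the dependency of reader r_seq on writer w_seq at address addr.
        pos = {(k, a, q): i for i, (k, a, q) in enumerate(events)}
        t_w = pos[("W", addr, w_seq)]
        t_r = pos[("R", addr, r_seq)]
        if not t_w < t_r:                                 # Raw: W_i happens before R_i
            return False
        for i, (k, a, q) in enumerate(events):
            if k == "W" and a == addr and q != w_seq:
                if q < w_seq and i > t_w:                 # Waw: no earlier write after W_i
                    return False
                if q > r_seq and i < t_r:                 # War: no later write before R_i
                    return False
        return True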
A Compositional Theory of Refinement for Branching Time

P. Manolios

D. Geist and E. Tronci (Eds.): CHARME 2003, LNCS 2860, pp. 304–318, 2003. © Springer-Verlag Berlin Heidelberg 2003

Abstract. We prove that if one system refines another, then a refinement map always exists. The existence of refinement maps in the linear time framework was studied in an influential paper by Abadi and Lamport. My interest in proving analogous results for the branching time framework [...] does not depend on any of the conditions found in the work of Abadi and Lamport, e.g., machine closure, finite invisible nondeterminism, internal continuity, and the use of history and prophecy variables. A direct consequence is that refinement maps always exist in the linear time framework, subject only to the use of prophecy-like variables [...].

1 Introduction

[...] Two important concepts that notions of correctness must account for are:

– Stuttering: Since the specification is defined at a more abstract level than the implementation, notions of correctness should allow for stuttering, where the implementation may require several steps before matching a single step of the specification [...].
– Refinement: The implementation may contain more state components and may use different data representations than the specification. Refinement maps are used to show how to view an implementation state as a specification state [1].

The classic paper on the topic by Abadi and Lamport [1], which has motivated the work appearing in this paper, contains an in-depth discussion of these topics. The main idea is to use refinement maps to prove that systems have related infinite computations, by reasoning locally, about states and their successors, instead of globally, about infinite paths. Abadi and Lamport prove a theorem about when such refinement maps exist in the linear time framework, where the semantics of systems and properties correspond to sets of infinite sequences. My approach differs in that I work in the branching time framework, where the semantics of systems are given by sets of infinite trees. Even so, the results can be applied to the linear time framework, as I explain later.

The theorem proved by Abadi and Lamport holds only under certain conditions. Briefly, they allow one to add history and prophecy variables to the implementation, they require that the implementation is machine closed, and they require that the specification has finite invisible nondeterminism and is internally continuous. My theorems do not depend on these conditions, but there are important differences between the two approaches that are explored in depth later.

There are two main reasons why I chose to work in the branching-time framework. The first is that in the simple case where one is dealing with finite-state systems, it makes sense to use algorithms that can check if one finite-state system refines another. For example, in [16] we use algorithms for deciding stuttering bisimulation to complete a proof of correctness for the alternating bit protocol (this is an infinite-state problem that was reduced to a finite-state problem using a theorem prover). The branching time notions of simulation and bisimulation, due to Milner and [...]. The literature on this topic is vast and contains many fine surveys [23, 14, 5]. In addition, there have been various extensions of the Abadi and Lamport result [1], including [5, 9, 2, 8]. [...]

2 Preliminaries

[...] and membership (∈), conjunction (∧) and disjunction (∨), implication (⇒), and, finally, binary equivalence (≡). Spacing is used to reinforce binding: more space indicates lower binding. A binary relation R on a set W is reflexive if ⟨∀x ∈ W :: xRx⟩; R is symmetric if ⟨∀x, y ∈ W :: xRy ≡ yRx⟩; R is antisymmetric if ⟨∀x, y ∈ W :: xRy ∧ yRx ⇒ x = y⟩; and R is transitive if ⟨∀x, y, z ∈ W :: xRy ∧ yRz ⇒ xRz⟩. A binary relation is a preorder if it is reflexive and transitive. A preorder that is also symmetric is an equivalence relation. [...] the natural numbers. An infinite sequence is a function from ω. When I write e ∈ σ, for a sequence σ, I mean that e is in the range of σ. A well-founded structure is a pair ⟨W, ≺⟩, where W is a set and ≺ is a binary relation on W such that there are no infinitely decreasing sequences on W with respect to ≺. I use [...] to compare natural numbers and [...] to compare ordinal numbers.

A transition system (TS) is a structure ⟨S, →, L⟩, where S is a set of states, → ⊆ S × S is the transition relation, and L is the labeling function: its domain is S and it tells us what is observable at a state. I also require that → is left-total: for every s ∈ S, there is some u ∈ S such that s → u. Notice that a transition system is a labeled graph where the nodes are states and are labeled by L. A path σ is a sequence of states such that for adjacent states s and u, s → u. A path σ is a fullpath if it is infinite; fp.σ.s denotes that σ is a fullpath starting at state s, and σ^i denotes the suffix fullpath ⟨σ.i, σ.(i+1), ...⟩. I use [...] for the concatenation of paths, where the left path is finite [...]. Temporal logic was proposed as a formalism for specifying the correctness of computing systems in a landmark paper by Pnueli [44]. I assume that the reader is familiar with temporal logic.
" $./..&-(+' $(*/)%.(,+ #&!+&*&+. 8QRQQBOFKD PFJRI>QFLK ABMBKAP LK QEB KLQFLK LC J>Q@EFKD 4 KLT AB!KB* 4 PQ>OQ TFQE >K FKCLOJ>I >@@LRKQ* 3FSBK > OBI>QFLK ( LK > PBQ * ( TB P>V QE>Q >K FK!KFQB PBNRBK@B $ %LC BIBJBKQP COLJ * & J>Q@EBP >K FK!KFQB PBNRBK@B Æ %LC BIBJBKQP COLJ * & FC QEB PBNRBK@BP @>K ?B M>OQFQFLKBA FKQL KLK)BJMQV( !KFQB PBDJBKQP PR@E QE>Q BIBJBKQP FK OBI>QBA PBDJBKQP >OB OBI>QBA ?V ( * 2LO BU>JMIB( FC QEB !OPQ PBDJBKQ LC $ E>P QEOBB BIBJBKQP >KA QEB !OPQ PBDJBKQ LC Æ E>P PBSBK BIBJBKQP( QEBK B>@E LC QEB QEOBB BIBJBKQP FP OBI>QBA ?V ( QL B>@E LC QEB PBSBK BIBJBKQP* 4 RPB J>Q@EFKD( TEBOB QEB FK!KFQB PBNRBK@BP >OB CRIIM>QEP LC > QO>KPFQFLK PVPQBJ( QL AB!KB PQRQQBOFKD PFJRI>QFLK*
Definition 1 (Match). Let i range over ω. Let INC be the set of strictly increasing sequences of natural numbers starting at 0; formally, INC = {π ∈ ω → ω : π.0 = 0 ∧ ⟨∀i ∈ ω :: π.i < π.(i+1)⟩}. The i-th segment of an infinite sequence σ with respect to π ∈ INC, written ⟨σ, π, i⟩, is given by the sequence ⟨σ.(π.i), ..., σ.(π.(i+1) − 1)⟩. For B ⊆ S × S; π, ξ ∈ INC; i, j ∈ ω; and infinite sequences σ and δ, I abbreviate ⟨∀x, y : x ∈ ⟨σ, π, i⟩ ∧ y ∈ ⟨δ, ξ, j⟩ : xBy⟩ by ⟨σ, π, i⟩ B ⟨δ, ξ, j⟩. In addition, corr(B, σ, π, δ, ξ) ≡ ⟨∀i ∈ ω :: ⟨σ, π, i⟩ B ⟨δ, ξ, i⟩⟩ and match(B, σ, δ) ≡ ⟨∃π, ξ ∈ INC :: corr(B, σ, π, δ, ξ)⟩.
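On finite prefixes, corr is directly executable. A Python sketch with list-encoded sequences and finite, strictly increasing index sequences (an approximation only: the paper's corr quantifies over all i ∈ ω):

    def corr_finite(B, sigma, pi, delta, xi):
        assert pi[0] == 0 == xi[0]
        assert all(a < b for a, b in zip(pi, pi[1:]))    # pi, xi: finite prefixes of INC
        assert all(a < b for a, b in zip(xi, xi[1:]))
        for i in range(min(len(pi), len(xi)) - 1):
            seg_s = sigma[pi[i]:pi[i + 1]]               # i-th segment of σ w.r.t. π
            seg_d = delta[xi[i]:xi[i + 1]]               # i-th segment of δ w.r.t. ξ
            if not all(B(x, y) for x in seg_s for y in seg_d):
                return False
        return True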
"
,2770 $# +9D5< A5B , & ' # , ! , & 1 -'#& " % ,.) ++ 3=@@ $'& $& #& Æ& " %. " -'#" & " " % ,.) ++ 3=@@ $'& $& #" & Æ& " " % + -&/ % ! ++ / " $# / - ) , / ! Æ# / - ).. !
!
The above lemma allows us to reason about segments using case analysis, where the three cases are: both segments are of length 1; the right segment is of length 1 and the left is of length greater than 1; and the left segment is of length 1 and the right is of length greater than 1.

[...] every fullpath starting at s can be matched by some fullpath starting at w.

Definition 2 (Stuttering simulation). B ⊆ S × S is a stuttering simulation on TS M = ⟨S, →, L⟩ iff for all s, w such that sBw, both L.s = L.w and ⟨∀σ : fp.σ.s : ⟨∃δ : fp.δ.w : match(B, σ, δ)⟩⟩ hold.

[...] a finite sequence of states may be matched to a single state. This is where the well-founded relation comes in: we must show that in these cases there is an appropriate measure function over a well-founded relation that decreases. Formally, we have:
Definition 3 (WFS). B ⊆ S × S is a well-founded simulation on TS M = ⟨S, →, L⟩ iff:

(Wfs1) ⟨∀s, w ∈ S : sBw : L.s = L.w⟩
(Wfs2) There exist functions rankt : S × S → W and rankl : S × S × S → ω, with ⟨W, ≺⟩ well-founded, such that
  ⟨∀s, u, w ∈ S : sBw ∧ s → u :
    (a) ⟨∃v : w → v : uBv⟩ ∨
    (b) (uBw ∧ rankt(u, w) ≺ rankt(s, w)) ∨
    (c) ⟨∃v : w → v : sBv ∧ rankl(v, s, u) < rankl(w, s, u)⟩⟩

Well-founded simulation completely characterizes stuttering simulation; thus, we can use well-founded simulation as a sound and complete proof rule.
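The point of WFS is that it is locally checkable: on a finite transition system the definition can be executed verbatim, given candidate rank functions. A Python sketch (the argument conventions are our assumptions):

    def is_wfs(states, succs, L, B, rank_t, rank_l, prec):
        # succs(s): successors of s; prec: the well-founded order on rank_t values.
        for s in states:
            for w in states:
                if not B(s, w):
                    continue
                if L(s) != L(w):                                        # Wfs1
                    return False
                for u in succs(s):                                      # s -> u
                    a = any(B(u, v) for v in succs(w))                  # Wfs2a
                    b = B(u, w) and prec(rank_t(u, w), rank_t(s, w))    # Wfs2b
                    c = any(B(s, v) and rank_l(v, s, u) < rank_l(w, s, u)
                            for v in succs(w))                          # Wfs2c
                    if not (a or b or c):
                        return False
        return True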
Proposition 1. Every WFS is an STS.

Proof. Let [...]; define the fullpath δ and the increasing sequences π, ξ recursively as follows: δ.0 = w, π.0 = 0, ξ.0 = 0 [...]. The idea is that from π.i, ξ.i, and δ.(ξ.i), we define π.(i+1), ξ.(i+1), and δ up to ξ.(i+1), with [...].

I now prove that every STS is a WFS. For the proof, we have to exhibit the rank functions required by the definition of WFS. Here is a high-level overview. The value of rankt(s, w) is important only if sBw, as otherwise there are no restrictions required by the definition of WFS. If sBw, then consider the largest subtree of the computation tree rooted at s such that every state in the subtree matches [...]. The "rank" (a kind of height) of this subtree is the value of rankt(s, w). The "rank" of s is greater than the "rank" of any of its children in the tree, so Wfs2b is satisfied. The value of rankl(w, s, u) is important only if sBw and s → u, as otherwise there are no restrictions required by the definition of WFS. If sBw and s → u, then rankl(v, s, u) is the length of the shortest path from v that matches ⟨s, u⟩. In the case of Wfs2c, we can choose the next successor of v in this path to satisfy the condition.

Given a TS M = ⟨S, →, L⟩, the computation tree obtained by unfolding M starting from s is defined as follows. The nodes of the tree are finite sequences over S. The tree is defined to be the smallest tree satisfying the following:
1. The root is ⟨s⟩.
2. If ⟨s, ..., u⟩ is a node and u → v, then ⟨s, ..., u, v⟩ is a node whose parent is ⟨s, ..., u⟩.
Definition 4 (Btree). Given an STS M, if ¬(sBw), then Btree(s, w) is the empty tree; otherwise, Btree(s, w) is the largest subtree of the computation tree rooted at s such that for every non-root node of the tree, ⟨s, ..., u⟩, we have that uBw and [...].

Lemma 2. Every path in Btree(s, w) is finite.
Since the child relation on nodes in Btree(s, w) is well-founded, we can recursively define a labeling function, ρ, that assigns an ordinal to nodes in the tree as follows: ρ.x = ⟨∪ y : y is a child of x : (ρ.y) + 1⟩. This is the standard "rank" function encountered in set theory [13]. The label of a tree is the label of its root.
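Because Lemma 2 makes every path in Btree finite, on a finite tree the ordinal labeling collapses to a natural number and the supremum becomes a maximum. A Python sketch:

    def rho(children, x):
        # Rank of node x: least ordinal strictly above the ranks of x's children.
        kids = children(x)
        return 0 if not kids else max(rho(children, y) + 1 for y in kids)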
Lemma 3. If |S| ≤ κ, where κ is an infinite cardinal (i.e., ω ≤ κ), then for all s, w, Btree(s, w) is labeled with an ordinal of cardinality at most κ.

Lemma 4. If sBw, s → u, and u ∈ Btree(s, w), then ρ.Btree(u, w) < ρ.Btree(s, w).

Definition 5 (length). Given M, an STS, length(v, s, u) = 0 if [...]; otherwise, length(v, s, u) is the length of the shortest initial segment starting at v that matches ⟨s, u⟩. Formally:

length(v, s, u) = ⟨min σ, δ, π, ξ : fp.σ.s ∧ σ.1 = u ∧ fp.δ.v ∧ π, ξ ∈ INC ∧ corr(B, σ, π, δ, ξ) [...] : [...]⟩

If sBv and s → u, the above range is non-empty and length(v, s, u) < ω.
Lemma 5. If sBv, s → u, and ⟨∃σ, δ, π, ξ : fp.σ.s ∧ σ.1 = u ∧ fp.δ.v ∧ π, ξ ∈ INC [...] : corr(B, σ, π, δ, ξ) ∧ [...]⟩, then ⟨∃x : v → x : length(x, s, u) < length(v, s, u) ∧ sBx⟩.

Proposition 2. Every STS is a WFS.

[...] A consequence of the above theorem is that all of the properties proved for [...]. [Refinement maps migh]t be [...]; e.g., suppose that the specification system represents numbers in decimal but the implementation system represents numbers in binary, or that numbers in the specification are spread across several registers in the implementation, and so on. Often refinement maps are especially clear, which makes it easy to check that they are [...]. [Suppose th]at associated with states is a set of variables, each of a particular type. Furthermore, suppose that the variables in the implementation are a superset of the variables in the specification and that the refinement map just hides the implementation variables that do not appear in the specification. Then, it is clear that the refinement map is a reasonable one.

More precisely, given TS M = ⟨S, →, L⟩, if L has the following structure, we say that M is typed. Let Vars be a set and let Type be a function whose domain is Vars. Think of Vars as the variables of TS M, where Type gives the type of the variables. For all s ∈ S, L.s is a function with domain Vars such that (L.s).v ∈ Type.v. The lemma below shows why the appropriateness of refinement maps that hide some of the implementation variables is easy to ascertain.

Lemma 6. If M = ⟨S, →, L⟩ and M′ = ⟨S′, →′, L′⟩ are typed [...], then for every pair [...]
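The typed-TS observation is easy to make concrete: when the specification's variables are a subset of the implementation's, the refinement map is mere projection. A closing Python sketch (dictionary-valued labels are an assumed encoding, not the paper's):

    def hide(spec_vars):
        # Refinement map that forgets implementation-only variables: an
        # implementation label (variable -> value) is projected onto the
        # specification's variables (assumes Vars_spec ⊆ Vars_impl).
        def r(label: dict) -> dict:
            return {v: label[v] for v in spec_vars}
        return r

    # e.g., hide({"pc", "regs"}) views an implementation state that carries extra
    # latches and wires as a specification state over pc and regs alone.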