e    ::= x | e.f | e.<P>m(e) | new C<T>(e)     expressions
Q    ::= class C<X ◁ T> ◁ N {T f; M}           class declarations
M    ::= <X ◁ T> T m (T x) {return e;}         method declarations
v    ::= new C<T>(v)                           values
N    ::= C<T> | Object                         class types
R    ::= N | X                                 non-existential types
T, U ::= ∃Δ.N | ∃∅.X                           types
P    ::= T | ★                                 type parameters
Δ    ::= X→[Bl Bu]                             type environments
Γ    ::= x:T                                   variable environments
B    ::= T | ⊥                                 bounds
x                                              variables
C                                              classes
X, Y                                           type variables

Fig. 1. Syntax of TameFJ
A Model for Java with Wildcards
TameFJ is not a strict subset of the Java language. However, a Java program written in a subset of Java (corresponding to the syntax of Wild FJ) can be easily translated to a TameFJ program, as we discuss in Sect. 4. Part of that translation is to perform Java's inference of type parameters for method calls (except where this involves wildcards). As is common [17], we regard this as a separate pre-processing step and do not model it in TameFJ. TameFJ is an extension of FGJ [13]. The major extension to FGJ is the addition of existential types, used to model wildcard types. Typing, subtyping and reduction rules must be extended to accommodate these new types, and to handle wildcard capture. We use existential types in the surface syntax and, in contrast to Wild FJ, do not create them during type checking; this simplifies the formal system and our proofs significantly. In particular, capture conversion is dealt with more easily in our system because fresh type variables do not have to be supplied. We also 'pack' existential types more declaratively, by using subtyping, rather than explicitly constructing existential types. This means that we avoid obtaining the awkward¹ type ∃X.X, found both in [17] and our previous work² [5]. TameFJ has none of the limitations of our previous approach [5]: we allow lower bounds, have more flexible type environments, allow quantification of more than one type variable in an existential type, and have more flexible subtyping. Thus, together with the absence of open and close expressions, TameFJ is much closer to the Java programming language.

3.1 Notation and Syntax
TameFJ is a calculus in the FJ [13] style. We use vector notation for sequences; for example, x stands for a sequence of 'x's. We use ∅ to denote the empty sequence. We use a comma to concatenate two sequences. We implicitly assume that concatenation of two sequences of mappings only succeeds if their domains are disjoint. We use ◁ as a shorthand for extends and ▷ for super. The function fv() returns the free variables of a type or expression, and dom() returns the domain of a mapping. We assume that all type variables, variables, and fields are named uniquely. The syntax of TameFJ is given in Fig. 1. The syntax for expressions and class and method declarations is very similar to Java, except that we allow ★ as a type parameter in method invocations. In TameFJ (as opposed to Java), all actual type parameters to a method invocation must be given. However, where a
¹ There is no corresponding type in Java, so it is unclear how such a type should behave.
² Such a type is required in earlier work because the construction ∃Δ.T appears in the conclusion of type rules, where T is a previously derived type. Since T may be a type variable, one may construct ∃X.X; this cannot happen in our calculus. Under a standard interpretation of existential types, types of the form ∃X ◁ T.X have no observably different behaviour from T because Java subtyping already involves subclass polymorphism. Rigorous justification of this fact is outside the scope of this paper, but is part of planned future work.
N. Cameron, S. Drossopoulou, and E. Ernst
type parameter is existentially quantified (corresponding to a wildcard in Java), we may use ★ to mark that the parameter should be inferred. Such types can not be named explicitly because they can not be named outside of the scope of their type. The ★ marker is not a replacement for ? in Java; ★ can not be used as a parameter in TameFJ types, and ? can not be used as a type parameter to method calls in Java. Note that we treat this as a regular variable.

The syntax of types is that of FGJ [13] extended with existential types. Non-existential types consist of class types (e.g., C<T>) and type variables, X. Types (T) are existential types, that is, a non-existential type (R) quantified by an environment (Δ, i.e., a sequence of formal type variables and their bounds), for example, ∃X→[∃∅.D ∃∅.Object].C<X>. Type variables may only be quantified by the empty environment, e.g., ∃∅.X. In the text and examples, we use the shorthands C for C<>, ∃X.C<X> for ∃X→[⊥ Object].C<X>, and R for ∃∅.R. Existential types in TameFJ correspond to types parameterised by wildcards in Java. Using T as an upper or lower bound on a formal type variable corresponds to using extends T or super T, respectively, to bound a wildcard. This correspondence is discussed further in Sect. 4. The bottom type, ⊥, is used only as a lower bound and models the situation in Java where a lower bound is omitted.

Substitution in TameFJ is defined in the usual way with a slight modification. For the sake of consistency, formal type variables are quantified by the empty environment when used as a type in a program (∃∅.X). Therefore, we define substitution on such types to replace the whole type, that is, [T/X]∃∅.X = T.

A variable environment, Γ, maps variables to types. A type environment, Δ, maps type variables to their bounds. Where the distinction is clear from the context, we use "environment" to refer to either sort of environment.
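The correspondence between existential types and wildcards can be seen directly in Java source: a wildcard with an extends or super clause plays the role of an existentially quantified type variable with an upper or lower bound. The following minimal Java sketch illustrates this (the class names Shape and Circle are ours, chosen to match later examples, and are not part of the calculus):

```java
import java.util.ArrayList;
import java.util.List;

// Each wildcard type below corresponds to a TameFJ existential type:
//   List<? extends Shape>  ~  ∃X→[⊥ Shape].List<X>
//   List<? super Circle>   ~  ∃X→[Circle Object].List<X>
class Shape {}
class Circle extends Shape {}

public class WildcardDemo {
    // The existentially bound variable is only usable at its bounds:
    static Shape first(List<? extends Shape> xs) { return xs.get(0); }
    static void put(List<? super Circle> xs) { xs.add(new Circle()); }

    public static void main(String[] args) {
        List<Circle> cs = new ArrayList<Circle>();
        put(cs);                                // lower bound: may write Circles
        System.out.println(first(cs) != null);  // upper bound: may read Shapes
    }
}
```
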
3.2 Subtyping
The subclassing relation between non-existential types (⊑) reflects the class hierarchy. Subclassing of type variables is restricted to reflexivity because they have no place in the subclass hierarchy. Subtyping (<:) relates existential types; for example, List<Shape> <: ∃X.List<X>
Subclasses: R ⊑ R

  class C<X ◁ Tu> ◁ N {...}
  ──────────────────────────   (SC-Sub-Class)
  C<T> ⊑ [T/X]N

  ─────────   (SC-Reflex)
  R ⊑ R

  R ⊑ R′    R′ ⊑ R″
  ─────────────────   (SC-Trans)
  R ⊑ R″

Extended subclasses: Δ ⊢ B ⊑ B

  class C<X ◁ Tu> ◁ N {...}
  ───────────────────────────────   (XS-Sub-Class)
  Δ ⊢ ∃Δ′.C<T> ⊑ ∃Δ′.[T/X]N

  ──────────   (XS-Bottom)
  Δ ⊢ ⊥ ⊑ B

  ──────────   (XS-Reflex)
  Δ ⊢ B ⊑ B

  Δ ⊢ B ⊑ B′    Δ ⊢ B′ ⊑ B″
  ──────────────────────────   (XS-Trans)
  Δ ⊢ B ⊑ B″

  dom(Δ′) ∩ fv(∃X→[Bl Bu].N) = ∅    fv(T) ⊆ dom(Δ, Δ′)    Δ, Δ′ ⊢ [T/X]Bl ...
  ───────────────────────────────────────────────────────────────────────────   (T-Class)
  ... ⊢ class C<X ◁ Tu> ◁ N {T f; M} ok

Fig. 4. TameFJ class and method typing rules
Forward references are only allowed to occur as parameters of the bounding type. In the well-formedness rule, this is addressed by allowing forward references when checking that the bounds are well-formed types, but not when checking the subtype and subclass relationships of the bounds. This reflects Java, where (in a class or method declaration) <X ◁ Y, Y ◁ Object> is illegal, due to the forward reference in the bound of X; however, <X ◁ List<Y>, Y ◁ Object> is legal.

3.4 Typing
Method and class type checking judgements are given in Fig. 4 and are mostly straightforward. The only interesting detail is the correct construction of type environments for checking well-formedness of types and type environments. The override relation allows method overriding, but does not allow overloading.
Expression typing: Δ; Γ ⊢ e : T | Δ

  ──────────────────────   (T-Var)
  Δ; Γ ⊢ x : Γ(x) | ∅

  Δ ⊢ C<T> ok    fields(C<T>) = f    fType(f, C<T>) = U    Δ; Γ ⊢ e : U | ∅
  ──────────────────────────────────────────────────────────────────────────   (T-New)
  Δ; Γ ⊢ new C<T>(e) : ∃∅.C<T> | ∅

  Δ; Γ ⊢ e : ∃Δ′.N | ∅    fType(f, N) = T
  ────────────────────────────────────────   (T-Field)
  Δ; Γ ⊢ e.f : T | Δ′
  Δ; Γ ⊢ e : U | Δ′    Δ, Δ′ ⊢ U <: T    Δ ⊢ T ok
  ─────────────────────────────────────────────────   (T-Subs)
  Δ; Γ ⊢ e : T | ∅

  class C<X ◁ Tu> ◁ N {U f; M}
  ──────────────────────────────
  fType(fi, C<T>) = [T/X]Ui

  class C<X ◁ Tu> ◁ N {U f; M}    U m (U x) {return e0;} ∈ M
  ────────────────────────────────────────────────────────────
  mBody(m, C<T>) = (x; [T/X]e0)

  class C<X ◁ Tu> ◁ N {U f; M}    m ∉ M
  ───────────────────────────────────────
  mBody(m, C<T>) = mBody(m, [T/X]N)

  class C<X ◁ Tu> ◁ N {U f; M}    U m (U x) {return e0;} ∈ M
  ────────────────────────────────────────────────────────────
  mType(m, C<T>) = [T/X](U → U)

  class C<X ◁ Tu> ◁ N {U f; M}    m ∉ M
  ───────────────────────────────────────
  mType(m, C<T>) = mType(m, [T/X]N)

Fig. 7. Method and field lookup functions for TameFJ
In the following paragraphs we describe unpacking and packing, followed by descriptions of type checking using T-Field and T-Invk, accompanied by examples.

Unpacking an existential type (∃Δ.R) entails separating the environment (Δ) from the quantified type (R). Δ can be used to judge premises of a rule and must be added to the guarding environment in the rule's conclusion. R can be used without quantification in the rule; bound type variables in R will now be free, and we must take care that these do not escape the scope of the type rule. If the result of type checking an expression contains escaping type variables (indicated by a non-empty guarding environment), then we must find a supertype (using T-Subs) in which there are no free variables, and use this as the expression's type. In the case that an escaping type variable occurs as a type parameter (e.g., X in C<X>), the type may be packed to an existential type (e.g., ∃X.C<X>) using the subtyping rule XS-Env. In the case that the type variable is the whole type, i.e., ∃∅.X, the upper bound of X can be used as the result type by using S-Bound.

Field Access. In T-Field, the fType function applied to the unpacked type (N) of the receiver gives the type of the field (T). Because T may contain type variables bound in the environment Δ′, the judgement must be guarded by Δ′.

Example — Field Access. The following example of the derivation of a type for a field access expression demonstrates the sequence of unpacking, finding the field type, and finding a supertype that does not contain free variables. In the
example, the type labelled 1 is unpacked to 2. The type labelled 3 would escape its scope, and so its supertype (4) must be used as the result of type checking. We assume that the TreeNode class declaration has a field datum with type Y and that Γ = x:∃X→[⊥ Shape].TreeNode<X>.

  ∅; Γ ⊢ x : ∃X→[⊥ Shape].TreeNode<X>¹ | ∅    fType(datum, TreeNode<X>²) = X³
  ─────────────────────────────────────────────────────────────────────────────
  ∅; Γ ⊢ x.datum : X³ | X→[⊥ Shape]²
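In Java terms, the same sequence occurs when a field whose declared type is the class's type variable is read through a wildcard-typed reference: the capture-converted variable cannot escape, so the result is typed at the variable's upper bound. A minimal illustrative sketch (our own classes, mirroring the example's names):

```java
// Illustrative analogue of the field-access example: datum has type Y inside
// TreeNode, but through the wildcard type the captured variable cannot escape,
// so the read is typed at its upper bound, Shape.
class Shape {}
class Circle extends Shape {}

class TreeNode<Y extends Shape> {
    Y datum;
    TreeNode(Y datum) { this.datum = datum; }
}

public class CaptureDemo {
    public static void main(String[] args) {
        // Corresponds to x : ∃X→[⊥ Shape].TreeNode<X>
        TreeNode<? extends Shape> x = new TreeNode<Circle>(new Circle());
        // x.datum has the captured type; it is usable only at its bound
        Shape s = x.datum;
        System.out.println(s instanceof Circle);
    }
}
```
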
  ∅, X→[⊥ Shape] ⊢ ...    sift(..., X³, X) = (Tree<Z>², Tree<X>)
  ∅; this:C ⊢ this : C | ∅    mType(walk, C) = <X> Tree<X>→List<X>
  ∅; this:C ⊢ y : ∃Z.Tree<Z>¹ | ∅    match(Tree<Z>², Tree<X>, ★, X, Z²)

Here no such T exists, and hence matching, and thus type checking, fails.

<X> Pair<X, X> make(List<X> x) {...}
<X> void compare(Pair<X, X> x) {...}
void m() {
    ∃U,V.Pair<U,V> p;
    ∃Z.List<Z> b;
    this.compare(p);             //1, type incorrect
    this.compare(this.make(b));  //2, OK
}

Type Inference. As is usual with formal type systems, we consider type inference to be performed in a separate phase before type checking. Due to the presence of existential types, some inferred type parameters can not be named and are marked with ★. These parameters must be inferred during type checking. In T-Invk we only allow the inference of types where they are used as parameters to an actual parameter type (e.g., X in <X>void m(Tree<X> x)...). This is enforced by the sift function (defined in Fig. 6), which excludes pairs of actual and formal parameter types where the formal parameter type is a formal type variable of the method.
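The Pair example above behaves the same way in real Java: each use of a wildcard captures a fresh type variable, so a value of type Pair<?, ?> cannot witness that its two components share a type, while nesting the calls lets inference pick a single type. An illustrative sketch (class and method names are ours):

```java
import java.util.ArrayList;
import java.util.List;

// Each wildcard captures a *fresh* variable, so Pair<?, ?> cannot prove
// its two components have the same type.
class Pair<A, B> {
    final A fst; final B snd;
    Pair(A fst, B snd) { this.fst = fst; this.snd = snd; }
}

public class InferenceDemo {
    static <X> Pair<X, X> make(List<X> x) {
        return new Pair<X, X>(x.get(0), x.get(0));
    }
    static <X> void compare(Pair<X, X> x) { System.out.println("compared"); }

    public static void main(String[] args) {
        Pair<?, ?> p = new Pair<String, String>("a", "a");
        // compare(p);           // rejected: the wildcards capture distinct
        //                       // variables U and V, and no single X fits both
        List<String> b = new ArrayList<String>();
        b.add("a");
        compare(make(b));        // accepted: X is inferred from make's result
    }
}
```
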
Computation: e ; e′

  fields(C<T>) = f
  ──────────────────────   (R-Field)
  new C<T>(v).fi ; vi

  v = new N(v′)    mBody(m, N) = (x; e0)    mType(m, N) = <Y ◁ B> U → U    match(sift(N, U, Y), P, Y, T)
  ─────────────────────────────────────────────────────────────────────────────────────────────────────   (R-Invk)
  v.<P>m(v) ; [v/x, v/this, T/Y]e0

Congruence: e ; e′

  e ; e′
  ────────────   (RC-Field)
  e.f ; e′.f

  e ; e′
  ─────────────────────────   (RC-Inv-Recv)
  e.<P>m(e) ; e′.<P>m(e)

  ei ; ei′
  ─────────────────────────────────   (RC-Inv-Arg)
  e.<P>m(..ei..) ; e.<P>m(..ei′..)

  ei ; ei′
  ───────────────────────────────────────   (RC-New-Arg)
  new C<T>(..ei..) ; new C<T>(..ei′..)

Fig. 8. TameFJ reduction rules
3.5 Operational Semantics
The operational semantics of TameFJ is defined in Fig. 8. Most rules are simple and similar to those in FGJ. The interesting rule is R-Invk, which requires actual type parameters that do not include ★; these are found using the match relation. Avoiding the substitution of ★ for a formal type variable in the method body prevents the creation of invalid expressions, such as new C<★>(). Since we are dealing only with values when using this rule, there will be no existential types, and so all type parameters could be specified. However, there is no safe way to substitute the appropriate types for ★s during execution, because each ★ may mark a different type. In this rule, mBody (defined in Fig. 6) is used to look up the body (an expression) and the formal parameters of the method.

3.6 Type Soundness
We show type soundness for TameFJ by proving progress and subject reduction theorems [27], stated below. We prove these with empty environments since, at runtime, variables and type variables should not appear in expressions. A non-empty guarding environment is required in the statement of the progress theorem, because we use structural induction over the type rules; if this environment were empty, the inductive hypothesis could not be applied in the case of T-Subs.
In the remainder of this section, we summarise some selected lemmas; we list most other lemmas in the appendix. We give full proofs in the extended version of this paper, available from: http://www.doc.ic.ac.uk/˜ncameron/papers/cameron ecoop08 full.pdf

Theorem 1 (Progress). For any Δ, e, T, if ∅; ∅ ⊢ e : T | Δ then either e ; e′ for some e′, or there exists a v such that e = v.

Theorem 2 (Subject Reduction). For any e, e′, T, if ∅; ∅ ⊢ e : T | ∅ and e ; e′ then ∅; ∅ ⊢ e′ : T | ∅.

To prove these two theorems, 40 supporting lemmas are required. These establish 'foundational' properties of the system, properties of substitution, properties of subtyping and subclassing (discussed in Sect. 3.2), which functions and relations always give well-formed types, and properties specific to each case of subject reduction and progress. Two of the most interesting lemmas concern the match relation:

Lemma 36 (Subclassing preserves matching (receiver)). If Δ ⊢ ∃Δ1.N1 ⊑ ∃Δ2.N2 and mType(m, N2) = U2 → U2 and mType(m, N1) = U1 → U1 and match(sift(R, U2, Y2), P, Y2, T) and ∅ ⊢ Δ ok and Δ, Δ′ ⊢ T ok, then match(sift(R, U1, Y1), P, Y1, T).

Lemma 37 (Subclassing preserves matching (arguments)). If Δ ⊢ ∃Δ1.R1 ⊑ ∃Δ2.R2 and match(sift(R2, U, Y), P, Y, T) and fv(U) ∩ Z = ∅ and Δ2 = Z→[Bl Bu] and ∅ ⊢ Δ ok and Δ ⊢ ∃Δ1.R1 ok and Δ ⊢ P ok, then there exists U′ where match(sift(R1, U, Y), P, Y, [U′/Z]T) and Δ, Δ1 ⊢ U′ ok.

We show in the proof that such issues do not affect T, because these types are found only from the actual parameter types of the method call. Lemma 37 performs a similar duty, but for the types of the actual parameters. The conclusion defines a 'valid' substitution, which is given by lemma 35 (see Sect. 3.2). The types T in match are found from the actual parameter types and so, in contrast to lemma 36, these types are affected by the substitution in the conclusion of the lemma.
Lemma 31 (Inversion Lemma (object creation)). If Δ; Γ ⊢ new C<T>(e) : T | Δ′ then Δ′ = ∅ and Δ ⊢ C<T> ok and fields(C<T>) = f and fType(f, C<T>) = U and Δ; Γ ⊢ e : U | ∅ and Δ ⊢ ∃∅.C<T> <: T.

... translates to ∃∅.C<...>. A subtle aspect of the translation is that wildcards can inherit their upper bound from the upper bound of the corresponding formal type variable in the class declaration. Since we want to avoid doing this in the calculus, we must take care of it in the translation, which is achieved as in the following example: for a class C declared as class C<X extends Circle>..., the type C<?> is translated to ∃X→[⊥ ∃∅.Circle].C<X>. When an upper bound is declared both for a wildcard and in the corresponding class declaration, the 'smallest' type is taken as the upper bound, if the types are subtypes of each other. Hence, the translation of C<...> and F<...> is not immediately obvious, because in Java there is no finite type expression for the least supertype of all legal type arguments to F, i.e., the upper bound of the type argument X is not denotable in Java. However, in TameFJ this upper bound is, in fact, denotable: it is just ∃Y→[⊥ F<...>].F<Y>. Indeed, our translation of F<...> gives this type. In the case of F<...> where the wildcard is translated to the fresh variable Y, the upper bound will be the least subtype of ∃Z.F<Z> (the translation of the given bound, where Z is fresh) and F<...> (the bound derived from the class declaration). Since the latter is more strict, it is used, even though this appears to contradict the rule of using fresh type variables for each wildcard; in fact it does no such thing: the second wildcard is translated to a fresh type variable, but is then forgotten.
5 Related Work
In this section we discuss related work. We distinguish three categories: the evolution of wildcards, formal and informal specifications of Java wildcards, and related systems with type soundness results.

Wildcards are a form of use-site variance. This means that the variance of a type is determined at the instantiation of the type. The first uses of variant generic types in object-oriented languages were declaration-site variance, where the variance of a type is determined by the class declaration. Use-site variance was first expressed in terms of structural virtual types [23]. The concept developed into Variant Parametric Types [14], which were extended to give Java wildcards.

Wildcards in Java are officially (and informally) described in the Java Language Specification [11]. Wildcards and generics are described in detail in [3]. Wildcards were first described in a research paper in [24], again informally, but with some description of their formal properties and of the correspondence to existential types. The most important formal description of wildcards is the Wild FJ calculus [17], referred to throughout this paper. Wildcards have also
been described in terms of access restriction [26] and flow analysis [8] (actually Variant Parametric Types). Variant Parametric Types [14] could be thought of as a partial model for Java wildcards (notably missing wildcard capture, but different in several subtler ways also). The calculus in [14] was proven type sound, and as such it can be regarded as a partial soundness result for wildcards. In [5] we describe a sound partial model for wildcards using a more traditional existential types approach. In particular, the calculus has explicit open and close expressions, as opposed to the implicit versions found in this paper and in other approaches [14,17]. Subtyping of existential types in [5] is taken from the full variant of System F<:.

The operator ++ stands for trace concatenation. To establish reasonable properties of concurrent programs, we assume reasonable properties of the underlying sequential language:

Definition 6. We say that program P is well-formed if sequential validity of trace t in P implies:
1. any trace t′ ≤ t is sequentially valid (prefix closedness),
2. if the last action of t is a read with value v, then the trace obtained from t by replacing the value in the last action by v′ is also sequentially valid in P (final read value independence),
J. Ševčík and D. Aspinall
3. |t| > 0 implies πK(t0) = St (start action first),
4. πK(ti) = Fin implies i = |t| − 1 (finish action last),
5. θ = θinit implies ∀i. 1 ≤ i < |t| − 1 → ∃v. πK(ti) = Wr(v) ∨ πK(ti) = Wrv(v), and πK(t|t|−1) = Fin (initialisation thread only contains writes).

The well-formedness of programs should not be hard to establish for any reasonable sequential language. The next definition places some sensible restrictions on executions.

Definition 7. We say that an execution ⟨A, P, ≤po, ≤so, W, V⟩ is well-formed if:
1. A is finite.
2. ≤po restricted to actions of one thread is a total order; ≤po does not relate actions of different threads.
3. ≤so is total on synchronisation actions of A.
4. ≤so is consistent with ≤po.
5. W is properly typed: for every non-volatile read r ∈ A, W(r) is a non-volatile write; for every volatile read r ∈ A, W(r) is a volatile write.
6. Locking is proper: for all lock actions l ∈ A on monitors m and all threads θ different from the thread of l, the number of locks in θ before l in ≤so is the same as the number of unlocks in θ before l in ≤so.
7. Program order is intra-thread consistent: for each thread θ, the trace of θ in E is sequentially valid for Pθ.
8. ≤so is consistent with W: for every volatile read r of a variable v we have W(r) ≤so r, and for any volatile write w to v, either w ≤so W(r) or r ≤so w.
9. ≤hb is consistent with W: for all reads r of v it holds that ¬(r ≤hb W(r)) and there is no intervening write w to v, i.e. if W(r) ≤hb w ≤hb r and w writes to v then W(r) = w.
10. The initialisation thread θinit finishes before any other thread starts, i.e., ∀a, b ∈ A. K(a) = Fin ∧ T(a) = θinit ∧ K(b) = St ∧ T(b) ≠ θinit → a ≤so b.

The following definition of legal execution constitutes the core of the Java Memory Model. In our work, we use a weakened version of the memory model that we suggested in [5] and which permits more transformations than the original version. In Tbl. 1, we label this version by 'JMM-Alt'.

Definition 8. A well-formed execution ⟨A, P, ≤po, ≤so, W, V⟩ with happens-before order ≤hb is legal if there is a finite sequence of sets of actions Ci and well-formed executions Ei = ⟨Ai, P, ≤poi, ≤soi, Wi, Vi⟩ with happens-before ≤hbi and synchronises-with <swi such that C0 = ∅, Ci−1 ⊆ Ci for all i > 0, ⋃Ci = A, and for each i > 0 the following rules are satisfied:
1. Ci ⊆ Ai.
2. For all reads r ∈ Ci we have W(r) ≤hb r ⇐⇒ W(r) ≤hbi r, and ¬(r ≤hbi W(r)).
3. Vi|Ci = V|Ci.
4. Wi|Ci−1 = W|Ci−1.
On Validity of Program Transformations in the Java Memory Model
5. For all reads r ∈ Ai − Ci−1 we have Wi(r) ≤hbi r.
6. For all reads r ∈ Ci − Ci−1 we have W(r) ∈ Ci−1.
7. If y ∈ Ci is an external action and x ≤hb y, then x ∈ Ci.

The original definition of legality from [11,18] differs in rules 2 and 6, and adds rule 8:
2. ≤hbi |Ci = ≤hb |Ci.
6. For all reads r ∈ Ci − Ci−1 we have W(r) ∈ Ci−1 and Wi(r) ∈ Ci−1.
8. If x <sswi y ≤hbi z and z ∈ Ci − Ci−1, then x <swj y for all j ≥ i, where <sswi is the transitive reduction of ≤hbi without any ≤poi edges, and the transitive reduction of ≤hbi is a minimal relation such that its transitive closure is ≤hbi.

The reasons for weakening the rules are invalidity of reordering of independent statements, broken JMM causality tests 17–20 [21], and redundancy. For details, see [5,6].

For reasoning about validity of reordering, we define observable behaviours of executions and programs. Intuitively, a program P has an observable behaviour B if B is a subset of external actions of some execution of P, and B is downward closed on happens-before order (restricted to external actions). The JMM captures non-termination as a behaviour in the definition of allowable behaviours.

Definition 9. An execution ⟨A, P, ≤po, ≤so, W, V⟩ with happens-before order ≤hb has a set of observable behaviours O if for all x ∈ O we have that y ≤hb x or y ≤so x implies y ∈ O or T(y) = θinit. Moreover, there is no x ∈ O such that T(x) = θinit. The allowable behaviours may contain a special external hang action if the execution does not terminate. We will use the notation Ext(A) for all external actions of set A, i.e., Ext(A) = {a | K(a) = Ex}.

Definition 10. A finite set of actions B is an allowable behaviour of a program P if either
– there is a legal execution E of P with a set of observable behaviours O such that B = Ext(O), or B = Ext(O) ∪ {hang} and E is hung, or
– there is a set O such that B = Ext(O) ∪ {hang}, and for all n ≥ |O| there is a legal execution E of P with set of actions A, and a set of actions O′ such that (i) O and O′ are observable behaviours of E, (ii) O ⊆ O′ ⊆ A, (iii) n ≤ |O′|, and (iv) Ext(O′) = Ext(O).
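The happens-before order that these definitions manipulate is observable in real Java programs: a synchronisation action is exactly what makes a plain write visible to another thread. A minimal, deterministic sketch (our own illustration, not from the formalism), using Thread.join as the synchronisation action:

```java
// Minimal illustration of a happens-before edge. The write to data in the
// child thread happens-before the read in main, because thread termination
// synchronises-with the return of Thread.join.
public class HappensBeforeDemo {
    static int data = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> { data = 42; });
        writer.start();
        writer.join();               // synchronisation action: hb edge to main
        System.out.println(data);    // guaranteed to read 42, not 0
    }
}
```

Without the join (or a volatile flag, or a lock), no happens-before edge would order the write and the read, and the read could legally return 0.
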
B Proof
We prove validity of irrelevant read elimination, elimination of redundant write before write, elimination of redundant read after write, and reordering of non-volatile memory accesses to different variables.
The plan of the proof is straightforward: for any behaviour B of a transformed program P′ we need to show that the original program P had the same behaviour. Given a legal execution E′ of P′ with behaviour B, we build a legal execution of P with (almost) the same behaviour. Using this construction, we will prove that transformations do not introduce new allowable behaviours (Def. 10), except hanging. The issues with hanging are tricky; its definition does not correspond with the committing semantics.

Effects of Transformations on Traces. First, we define the notion of a transformed program loosely enough that redundant read/write elimination, irrelevant read elimination and reordering fit our definition. The idea is that for any trace of the transformed program there should be a trace of the original program that is just reordered, with the redundant and irrelevant operations added. To describe the effects of irrelevant read elimination formally, we define wildcard traces that may contain star (∗) symbols instead of some values. For example, the sequence [⟨Wr(x), 2⟩, ⟨Rd(y), ∗⟩, ⟨Rd(x), 3⟩] is a wildcard trace. If tˆ is a wildcard trace, then [[tˆ]] stands for the family of all (normal) traces with the ∗ symbols replaced by some values.
Given a wildcard trace tˆ, we say its ith component tˆi = ⟨a, v⟩ is
– an irrelevant read if a is a read and v is the wildcard symbol ∗,
– a redundant read if a is a read of some x and the most recent access of x is a write of the same value, and there is no synchronisation or external action in between; formally, there must be j < i such that tˆj = ⟨Wr(x), v⟩ and for each k such that j < k < i it must be that tˆk = ⟨Wr(y), v′⟩ or tˆk = ⟨Rd(y), v′⟩ for some y ≠ x and some v′,
– a redundant write if a is a write to some x and one of these two cases holds: (i) the write is overwritten by a subsequent write to the same variable and there are no synchronisation or external actions, and no read of x, in between, or (ii) tˆi is the last access of the variable in the trace and there are no synchronisation or external actions in the rest of the trace.

Definition 11. We will say that P′ is a transformed program from P if for any trace t′ in P′ there is a wildcard trace tˆ and a function f : {0, . . . , |t′| − 1} → {0, . . . , |tˆ| − 1} such that:
1. all traces in [[tˆ]] are sequentially valid in P,
2. if t′ is finished in P′ then all traces in [[tˆ]] are finished in P,
3. the function f is injective,
4. the action kind-value pair t′i is equal to tˆf(i),
5. for 0 ≤ i ≤ j < |t′| we have that f(i) ≤ f(j) if any of the following reordering restrictions holds: (a) t′i or t′j is a synchronisation or external action, or (b) t′i and t′j are conflicting memory accesses, i.e., accesses to the same variable such that at least one is a write,
6. if there is an index j < |tˆ| such that f(i) ≠ j for all i, then tˆj must be a redundant read, a redundant write, or an irrelevant read.
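At the source level, the redundant-write and redundant-read-after-write cases above correspond to familiar compiler optimisations. The following sketch (our own illustrative methods, not from the paper) shows an original program fragment and its transformed counterpart; in the original's trace, the write of 1 is a redundant write and the read of x is a redundant read after write, so eliminating them is covered by Def. 11:

```java
// Source-level view of the trace transformations defined above.
public class EliminationDemo {
    static int x;

    static int original() {
        x = 1;          // redundant write: overwritten below, no read between
        x = 2;
        int r = x;      // redundant read after write: must see 2
        return r;
    }

    static int transformed() {
        x = 2;          // the overwritten write is eliminated
        int r = 2;      // the read is replaced by the written value
        return r;
    }

    public static void main(String[] args) {
        System.out.println(original() == transformed());
    }
}
```
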
A multi-thread program P′ is a transformed program of P if all single-thread programs of P′ are transformed programs of the single-thread programs of P with the same index.

For space reasons we omit the link between the concrete syntax and the meaning in terms of traces. It is straightforward to establish, by induction on derivations in the operational semantics, that if we obtain a program P′ from a program P by a memory trace preserving transformation, or by an elimination of a redundant read after write, or by an elimination of a redundant write before write, or by an elimination of an irrelevant read, or by reordering of independent non-volatile memory accesses, then the set of traces of P′ is a transformed program from the set of traces of P. The only non-trivial part is proving that reordering of independent non-volatile memory accesses on the source level corresponds to a trace transformation if the trace of the transformed program ends in between the reordered statements. In this case we can consider the missing part of the statement as being eliminated (either as a redundant write or an irrelevant read), and finish the proof.

Transforming Executions. Let P′ be a program transformed from P, and E′ = ⟨A′, P′, ≤po′, ≤so′, W′, V′⟩ be a legal execution of P′. Our goal is to construct a legal execution E of P with the same observable behaviours. The main idea of the construction is to take the memory trace of each thread in E′ and use Def. 11 to obtain a trace of P, and a mapping of actions and program order of E′ to actions and program order of our newly constructed execution. We will also need to restore the actions that were eliminated by the transformation and construct the visibility functions W and V for the reconstructed actions.

Given an execution E′ = ⟨A′, P′, ≤po′, ≤so′, W′, V′⟩ we construct the untransformed execution E of P: for each thread θ ≠ θinit let TrE′(θ) be the trace of θ in E′. By the definition of transformed program (Def. 11), there must be a wildcard trace of P; let's denote it by tˆθ and the corresponding transformation function by fθ. For the initialisation thread θinit we define tˆθinit = [⟨St, 0⟩] ++ TrE′(θinit)|W ++ InitE′ ++ [⟨Fin, 0⟩], where TrE′(θinit)|W is the trace of the initialisation thread of E′ restricted to (possibly volatile) write actions, and InitE′ is any sequence of initialisation writes for all variables that appear in any component of tˆθ (θ ≠ θinit) but are not initialised in E′. We set fθinit(i) = i if 0 ≤ i < |TrE′(θinit)| − 1, and fθinit(|TrE′(θinit)| − 1) = |tˆθinit| − 1.

From the traces tˆθ we build action traces tθ of the same length. For 0 ≤ i < |tˆθ|, we set the i-th component of tθ to be
– the fθ−1(i)-th element of TrE′(θ) if fθ−1(i) exists, or
– a fresh action a such that K(a) = tˆθi and T(a) = θ, if there is no j such that i = fθ(j).

We use the action traces tθ to construct our untransformed execution E = ⟨A, P, ≤po, ≤so, W, V⟩:
1. A = {tθi | 0 ≤ i < |tˆθ|},
2. the order ≤po is the order induced by the traces tθ, i.e. ≤po = {(a, b) | T(a) = T(b) ∧ ι(tT(a), a) ≤ ι(tT(a), b)},
3. the order ≤so is equal to ≤so′,
4. the write-seen function W(a) is
   – W′(a) if a ∈ A′,
   – the most recent write to x in ≤hb if a ∉ A′ and a is a read from x,
   – a otherwise, i.e., if a is not a read,
5. V(a) is the corresponding value in the wildcard trace tˆθ, i.e., V(a) = πV(tˆθj) where θ = T(a) and j = ι(tθ, a).

Lemma 1. Let P′ be a transformation of P, E′ be a well-formed execution of P′ with happens-before order ≤hb′, and E be the untransformed execution of P with happens-before order ≤hb. Let x and y be two actions from E′ such that either of them is a synchronisation action, or they are conflicting memory accesses, or T(x) = T(y). Then x ≤hb y if and only if x ≤hb′ y.

Proof. Observe that by the reordering restriction of Def. 11 we have x ≤po y iff x ≤po′ y for all x and y from E′ such that x or y is a synchronisation or external action, or x and y are conflicting memory accesses. By induction on the transitive closure definition of ≤hb we get that for any z ≤hb y, either z ≤po y or there is a synchronisation action s such that z ≤po s ≤hb y. With the observation above we conclude that x ≤hb y implies x ≤hb′ y if x is in E′ and x or y is a synchronisation action, or x and y are conflicting memory accesses, or T(x) = T(y). On the other hand, we prove that z ≤hb′ x implies that either z ≤po′ x or there is a synchronisation action s such that z ≤po′ s ≤hb′ x, by induction on the definition of ≤hb′. This implies the other direction of the equivalence.

Lemma 2. Let P′ be a transformation of P, E′ be a well-formed execution of P′, and E be the untransformed execution of P. Then E is a well-formed execution of P.

Proof. Properties 1–8 and 10 of well-formedness (Def. 7) are satisfied directly by our construction.
We prove property 9, the hb-consistency, i.e., that for all reads r in E, ¬(r ≤hb W(r)) and there is no write w to the same variable as W(r) such that W(r) ...

... y, and where T is the least upper bound type of x and y. Both Range and ReverseRange are value types, implementing the Iterable interface in Java. They are thus usable in the "for-each" style loops introduced in Java 5. For example, the following code defines a loop over the range of values greater than or equal to bit.zero, and less than or equal to bit.one:

for ( bit b : bit.zero :: bit.one ) {
    System.out.println(b);
}
Furthermore, the programmer need not indicate whether a range is ascending or descending, even when the operands' values cannot be statically determined. For instance, in the code below, depending on the arguments used to invoke printBits, the object generated by begin :: end can be either a Range or a ReverseRange:

void printBits(bit begin, bit end) {
    for ( bit b : begin :: end ) {
        System.out.println(b);
    }
}
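Lime's Range and ReverseRange are language-level types. As a rough plain-Java sketch of the behaviour described above (class and enum names here are ours, not Lime's), a single construction can pick ascending or descending iteration at run time from the operands' ordinals:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical model of Lime's begin :: end range operator in plain Java.
enum Bit { zero, one }

final class EnumRange<E extends Enum<E>> implements Iterable<E> {
    private final List<E> values = new ArrayList<>();

    EnumRange(E begin, E end) {
        E[] all = begin.getDeclaringClass().getEnumConstants();
        // ascending ("Range") if begin <= end, descending ("ReverseRange") otherwise
        int step = begin.ordinal() <= end.ordinal() ? 1 : -1;
        for (int i = begin.ordinal(); i != end.ordinal() + step; i += step)
            values.add(all[i]);
    }

    @Override
    public Iterator<E> iterator() { return values.iterator(); }
}
```

With this sketch, `for (Bit b : new EnumRange<>(Bit.zero, Bit.one))` plays the role of Lime's `bit.zero :: bit.one`, and swapping the operands yields the descending traversal.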
80
S.S. Huang et al.
In Lime programs, programmers often need to iterate over the entire range of possible values of a value enum. A convenient shorthand is provided for iterating over this range. For example, for (bit b) {...} is equivalent to for (bit b : bit.first :: bit.last) {...}. Such a default range is always an ascending one. The :: operator is automatically defined for any value type supporting the operators ++, --, < and >. Lime also supports the :: operator for Java's primitive integral types, such as int, short, etc.

Compiler-Defined Fields and Methods. In addition to operators, static fields first and last are automatically defined to reference the smallest and largest values in a value enum's range. For instance, bit.first returns bit.zero. These fields may seem redundant for a known enum type, but they become invaluable when we iterate over the range of an enum type variable, where the exact values of the enum are not known statically. Methods next() and prev() are generated to return the values succeeding and preceding the value invoking the method: bit.first.next() returns bit.one. Since objects of value enum's (and in fact, all value types) do not have object identity at the Lime language level (i.e., all instances of bit.zero should be treated as the same object), the Lime compiler automatically generates equals(Object o) and hashCode() methods for these enum's. The compiler also overloads the == operator for value enum's to invoke equals(Object o). (An exception is when == is used inside the definition of equals(Object o) itself.) Note that this is exactly the opposite of Java's default: the equals(Object o) method in Java defaults to invoking == and comparing object identity.

User-defined Operators. Lime also allows programmers to define their own custom operators, or even override the automatically defined ones. For instance, we can define a unary complement operator for bit:

public bit ~ this { return this++; }
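The compiler-generated first/last fields and next()/prev() methods can be approximated in plain Java via enum ordinals. The sketch below is ours, not Lime's generated code; in particular, the wrap-around of next() at the last constant is an assumption, suggested by the complement example above:

```java
// Hypothetical Java model of Lime's compiler-defined enum helpers.
enum Bit2 { zero, one }

final class EnumOps {
    static <E extends Enum<E>> E first(Class<E> c) {
        return c.getEnumConstants()[0];          // e.g. bit.first -> bit.zero
    }
    static <E extends Enum<E>> E last(Class<E> c) {
        E[] all = c.getEnumConstants();
        return all[all.length - 1];              // e.g. bit.last -> bit.one
    }
    static <E extends Enum<E>> E next(E v) {
        E[] all = v.getDeclaringClass().getEnumConstants();
        return all[(v.ordinal() + 1) % all.length];      // wraps (our assumption)
    }
    static <E extends Enum<E>> E prev(E v) {
        E[] all = v.getDeclaringClass().getEnumConstants();
        return all[(v.ordinal() + all.length - 1) % all.length];
    }
}
```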
A binary operator can be defined similarly. For instance, the operator & for bit can be defined as follows:

public bit this & (bit that) {
    return (this == one && that == one) ? one : zero;
}
Operator definitions are converted into compiler-generated methods, dispatched off of the this operand. For example, the ∼ operator definition becomes: public bit $COMP() { return this++; }, and the definition for & thus becomes public bit $AND(bit that) { ... }. 2.2
enum-indexed Arrays
Lime also extends Java with enum-indexed arrays. For example, int[bit] twoInts; declares an int array, named twoInts. The size of twoInts is bounded by the number of values in the value enum type bit. Thus, twoInts has a fixed size of 2. Furthermore, only an object of the array size’s enum type can be used to index into
Liquid Metal: Object-Oriented Programming
81
an enum-indexed array. The following code demonstrates the use of enum-indexed arrays.

int i = twoInts[0];        // ILLEGAL! 0 is not of type bit
int j = twoInts[bit.zero]; // OK
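In plain Java, a similar only-this-enum-can-index guarantee can be approximated with an EnumMap, which gives the fixed size and rules out out-of-bounds access by construction. This sketch (ours, not part of Lime) models int[bit]:

```java
import java.util.EnumMap;

// Hypothetical Java model of a Lime enum-indexed array such as int[bit]:
// only values of the index enum can be used as subscripts.
enum BitIdx { zero, one }

final class EnumIndexedIntArray<E extends Enum<E>> {
    private final EnumMap<E, Integer> slots;

    EnumIndexedIntArray(Class<E> index) {
        slots = new EnumMap<>(index);
        for (E e : index.getEnumConstants()) slots.put(e, 0);  // default-initialised
    }

    int get(E i)         { return slots.get(i); }
    void set(E i, int v) { slots.put(i, v); }
    int size()           { return slots.size(); }
}
```

Here `new EnumIndexedIntArray<>(BitIdx.class)` has a fixed size of 2, mirroring the `twoInts` example; an `int` subscript like `twoInts[0]` simply does not type-check.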
An enum-indexed array has space automatically allocated for it by the compiler. enum-indexed arrays provide a nice way to express fixed-size arrays, where the type system can easily guarantee that array indexes can never be out of bounds. Both of these are important properties for laying out the program in hardware – but are also valuable for writing exception-free software and for compiling it for efficient execution. 2.3
A More Complex Value Type: Unsigned
Using value enum’s and enum-indexed arrays, we can now define a much more interesting value class, Unsigned32: public value enum thirtytwo { b0,b1,...,b31; } public value class Unsigned32 { bit data[thirtytwo]; public Unsigned32(bit vector[thirtytwo]) { data = vector; } public boolean this ^ (Unsigned32 that) { bit newData[thirtytwo]; for ( thirtytwo i) newData[i] = this.data[i] ^ that.data[i]; return new Unsigned32(newData); } ... // define other operators and methods. } Unsigned32 is an OO representation of a 32-bit unsigned integer. It uses the value enum type thirtytwo to create an enum-indexed array of exactly 32 bit’s, holding the data for the integer. The definition of Unsigned32 exposes another interesting feature of the Lime compiler. Recall that a value type must have a default value that is assigned to uninitialized variables of that type. This means each value type must provide a default constructor for this purpose. Notice however that there is no such default constructor defined for Unsigned32. Conveniently, the Lime compiler can automatically generate this constructor for value types. The generated constructor initializes each field to its default value. Recall that one of the typing requirements of value types is that all fields must be references to value types. Thus, each field must also have a default value constructor defined (or generated) for it. The base case of our recursive argument ends with value enum’s. Thus, it is always possible for the Lime compiler to generate a default constructor.
Implications for the Compiler. Even though Unsigned32 is defined using high-level abstractions, the combination of value enum’s and enum-indexed arrays exposes it to bit-level analysis. We can easily analyze the code to see that an object of Unsigned32 requires exactly 32 bits: each element of data[thirtytwo] is of type bit, which requires exactly 1 bit; there are 32 of them in data[thirtytwo]. This high level abstraction provides the Lime compiler with a lot of flexibility in both software and hardware representations of a value object and its operations. In software, Lime programs can be compiled down to regular Java bytecode and run on a virtual machine. We can choose to represent objects of Unsigned32 and thirtytwo as true objects, and iterations over values of thirtytwo are done through next() method calls on the iterator. However, without any optimizations, this would yield very poor performance compared to operations on a primitive int. We can thus also choose to use the bit-level information, and the knowledge that value objects do not have mutable state, to perform optimizations such as semantic expansions [2]. Using semantic expansions, value objects are treated like primitive types, represented in unboxed formats. Method invocations are treated as static procedure calls. These choices can be made completely transparent to the programmer. The same analogy holds for hardware. Existing hardware description languages such as VHDL [3] and SystemC [4] require programmers to provide detailed data layouts for registers, down to the meaning of each bit. In contrast, Lime’s high level abstraction allows the compiler to be very flexible with the way object data is represented in hardware. For instance, in order to perform “dynamic dispatch” in hardware, each object must carry its own type information in the form of a type id number. 
However, we can also strip an object of its type information when all target methods can be statically determined, and thereby save space. The hardware layout choices are again transparent to the programmer. The definition of Unsigned32 exposes bit-level parallelism when it is natural to program at that level. Even greater speedups can be gained through coarser-grained parallelism, where entire blocks of code are executed in a parallel or pipelined fashion. Very sophisticated algorithms have been developed to discover loop dependencies and identify which loops can be parallelized or pipelined safely. The knowledge that objects are immutable makes Lime programs even more amenable to these techniques. Our eventual goal is to design language constructs that promote a style of programming where different forms of parallelism are easily discovered and easily exploited. 2.4
Generic Value Types
A closer inspection of Unsigned32 shows that its code is entirely parametric to the value enum type used to represent the length of the array data. No matter what enum is used to size data, the definitions for the constructor and operator ^ are exactly the same, modulo the substitution of a different enum for thirtytwo. A good programming abstraction mechanism should allow us to define these operations once in a generic way. Lime extends the type genericity mechanism in
Java to offer exactly this type of abstraction. The following is a generic definition of Unsigned<W>, where the type parameter W can be instantiated with different value enum's to represent integers of various bit widths:

public value class Unsigned<W extends Enum<W>> {
    bit data[W];

    public Unsigned(bit vector[W]) { data = vector; }

    public Unsigned<W> this ^ (Unsigned<W> that) {
        bit newData[W];
        for ( W i )
            newData[i] = this.data[i] ^ that.data[i];
        return new Unsigned<W>(newData);
    }
    ... // similarly parameterize operator definitions
}
Thus, to represent a 32-bit integer, we simply use the type Unsigned<thirtytwo>. Similarly, we could use Unsigned<sixtyfour> to represent a 64-bit integer, where sixtyfour is defined as follows:

public value enum sixtyfour { b0, b1, ..., b63; }
Note that type parameters to value types are assumed to be value types, and can only be instantiated with value types. For notational convenience, Lime offers a limited form of type aliasing. A typedef declaration can appear wherever variable declarations are allowed, and is similarly scoped. For example, the following statement declares Unsigned32 as an alias for Unsigned<thirtytwo>:

typedef Unsigned32 = Unsigned<thirtytwo>;
We use the aliased forms of the Unsigned<W> class for the remainder of the paper. 2.5
Type-Checking Value Types
In order to ensure that the objects of value types are truly immutable, we must impose the following rules on the definition of a value type:
1. A field of a value type must be final, and of a value type. The keyword final is assumed in the definition of value types and is inserted by the Lime compiler. Compile-time checks make sure that assignments to fields happen only in initializers.
2. The supertypes of a value type must also be value types (with the exception of Object).
3. The type parameter of a value type is assumed to be a value type during type checking, and can only be instantiated by value types.
4. Objects of value types can only be assigned to variables of value types.
Fig. 2. Block level diagram of DES and Lime code snippet
The first three rules are fairly straightforward. The last rule requires a bit more elaboration. The Lime compiler requires that value types can only be subtypes of other value types, except for Object. Therefore, the only legal assignment from a value type to a non-value type is an assignment to Object. In this case, we "box" the object of value type into an object of lime.lang.BoxedValue, a special Lime compiler class. The boxed value can then be used as a regular object. In fact, this is the technique used when a value type is used in synchronized, or when wait() is invoked on it. The method equals(Object o) requires special treatment under these rules. The equals method must take an argument of type Object. It is inefficient to box up a value type to pass into the equals of another value type, which then has to strip the boxed value before comparison. Thus, the Lime compiler allows a value type to be passed into the equals of value types without being boxed. Because these equals methods have been type-checked to ensure that they do not mutate fields, it is safe to do so. It is also important to point out that an array holding objects of value types is not a value type itself. Neither is an enum-indexed array holding objects of value types. The contents of the array can still mutate. A value array, then, is expressed as (value int[]) valInts. Similarly for value enum-indexed arrays. A value array must be initialized when it is declared. All further writing into the array is disallowed. Our syntax does not allow multiple levels of immutability in arrays. It is not possible to express a mutable array of value arrays, for example. The value keyword at the outside means that the entire array, at all dimensions, is immutable.
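The initialize-once behaviour of a (value int[]) array can be sketched in plain Java as a wrapper that snapshots its contents at construction and exposes no writes. This is a rough model of ours for the semantics just described, not Lime's implementation:

```java
// Hypothetical Java model of a Lime value array: contents are fixed
// when the array is declared and initialized; all later writes are
// simply not expressible through this type.
final class ValueIntArray {
    private final int[] data;

    ValueIntArray(int... init) {
        data = init.clone();   // snapshot: no aliasing with the caller's array
    }

    int get(int i) { return data[i]; }
    int length()   { return data.length; }
}
```

Because the constructor clones its argument, later mutation of the source array cannot be observed through the wrapper, matching the "all further writing is disallowed" rule.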
Finally, methods finalize(), notify(), and notifyAll() can never be called on objects of value types. Objects of value types have no storage identity, thus these methods do not make sense for value objects.
3
Running Example
The Liquid Metal system is somewhat complex, consisting of a front-end compiler that generates bytecode or an FPGA-oriented spatial intermediate representation (SIR), a high-level SIR compiler, a layout planner, a low-level compiler, and finally a synthesis tool. In order to demonstrate how all of these components fit together, we will use a single running example throughout the rest of the paper. Our example program implements the Data Encryption Standard (DES). The program inputs plain text as 64-bit blocks and generates encrypted blocks (cipher text) of the same length through a series of transformations. The organization of the DES algorithm and its top-level implementation in Lime are shown in Figure 2. The transformations occur in 16 identical rounds, each of which encrypts the input block using an encryption key. The plain text undergoes an initial permutation (IP) of the bit-sequence before the first round.

public value class Unsigned<T> {
    ...
    Unsigned<T> permute( (value T[T]) permTable ) {
        bit newBits[T];
        for ( T i ) {
            newBits[i] = data[permTable[i]];
        }
        return new Unsigned<T>(newBits);
    }
    ...
}

// initial permutation (IP)
import static DES.sixtyfour.*;
public value class IP {
    public static (value sixtyfour[sixtyfour]) Permutation = {
        b57, b49, b41, b33, b25, b17, b9,  b1,
        b59, b51, b43, b35, b27, b19, b11, b3,
        b61, b53, b45, b37, b29, b21, b13, b5,
        b63, b55, b47, b39, b31, b23, b15, b7,
        b56, b48, b40, b32, b24, b16, b8,  b0,
        b58, b50, b42, b34, b26, b18, b10, b2,
        b60, b52, b44, b36, b28, b20, b12, b4,
        b62, b54, b46, b38, b30, b22, b14, b6
    };
    ...
}

Fig. 3. DES code snippets showing initial permutation

Fig. 4. Permutation pattern for IP

Similarly, the
static Unsigned32 Fiestel(KeySchedule keys, Sixteen round, Unsigned32 R) {
    // half-block expansion
    Unsigned48 E = expand(R);
    // key mixing
    Unsigned48 K = keys.keySchedule(round);
    Unsigned48 S = E ^ K;
    // substitutions
    Unsigned4 Substitutes[eight];
    fourtyeight startBit = fourtyeight.b0;
    for ( eight i ) {
        // extract 6-bit piece
        fourtyeight endBit = startBit + fourtyeight.b5;
        Unsigned6 bits = S.extractSixBits(startBit, endBit);
        // substitute bits
        Substitutes[i] = Sbox(i, bits);
        // move on to next 6-bit piece
        startBit += fourtyeight.b6;
    }
    // concatenate pieces to form a 32-bit half block again
    thirtytwo k;
    bit[thirtytwo] pBits;
    for ( eight i ) {
        for ( four j ) {
            pBits[k] = Substitutes[i].data[j];
            k++;
        }
    }
    // permute result and return
    Unsigned32 P = new Unsigned32(pBits);
    return reversePermute(P);
}
Fig. 5. DES Fiestel round
Fig. 6. Block level diagram of Fiestel round
bit-sequence produced in the final round is permuted using a final permutation (FP). The output of the initial permutation is partitioned into two 32-bit half blocks. One half (R) is transformed using a Feistel function. The result of the
function is then exclusive-OR’ed (xor) with the other half (L). The two halves are then interchanged and another round of transformations occurs. The initial and final permutations consume a 64-bit sequence and produce a sequence of bits according to a specific permutation pattern. The pattern for the initial permutation is illustrated in Figure 4. We implemented the permutations using a lookup table as shown in Figure 3. The permute method loops through the output bit indexes in order, and maps the appropriate input bit to the corresponding output bit. The enumerations and their iterators make it possible to readily name each individual bit, and as a result, bit-permutations are easy to implement. The ability to specify transformations at the bit-level provides several advantages for hardware synthesis. Namely, the explicit enumeration of the bits decouples their naming from a platform-specific implementation, and as a result there are no bit-masks or other bit-extraction routines that muddle the code. Furthermore, the enumeration of the individual bits means we can closely match permutations and similar transformations to their Verilog or VHDL counterparts. As a result, the compiler can command a lot of freedom in transforming the code. It has also been shown that such a bit-level representation of the computation leads to efficient code generation for conventional architectures and processors that support short-vector instructions [5,6]. There are also various benefits for a programmer. For example, the permute method can process the input or output bits in any order, according to what is most convenient. Similarly, off-by-one errors are avoided, through the use of enum-indexed arrays. The Fiestel method performs the transformations illustrated in Figure 6. The 32-bit R half block undergoes an expansion to 48-bits, and the result is mixed with an encryption key using an xor operation. 
The result is then split into eight 6-bit pieces, each of which is substituted with a 4-bit value using a unique substitution box (Substitutes[i]). The eight 4-bit resultant values are concatenated to form a 32-bit half block that is in turn permuted. The Fiestel method and coding rounds run in hardware on the FPGA. The main method, shown below, runs in software on the CPU.

public static void main(String[] argv) {
    Unsigned64 key  = makeUnsigned64("0xFEDCBA9876543210");
    Unsigned64 text = makeUnsigned64("0x0123456789ABCDEF");
    KeySchedule keys = new KeySchedule(key);
    Unsigned64 cipher = DEScoder(keys, text);
    System.out.println(Long.toHexString(cipher.longValue()));
}
The program exercises co-execution between hardware and software, and demonstrates the use of varying object sizes and object-oriented features in hardware.
4
From Lime to the Virtual Machine
Lime programs can be compiled to regular Java bytecode and executed on any Java VM. The Lime bytecode generation performs two steps in addition to those of the
Java compiler. First, the Lime compiler generates bytecode to add "value" to value types:
– Default constructors, equals(Object o), and hashCode() methods are created for those value classes that do not define them.
– Uninitialized variables of value types are rewritten with default initializers.
– The operator definitions listed in Section 2.1 are added for value enum's. Value types that support the ++, --, < and > operators have the range operator, ::, defined for them.
– Operator expressions are converted to the appropriate operator method calls. E.g., x == y is converted to x.equals(y), assuming x is of a value type.

For the purpose of separate compilation, all value types are translated to implement the lime.lang.Value interface. When loaded as a binary class, this interface indicates to the Lime compiler that it is a value class. Additional interfaces are added for value types supporting different operators. For example, all value types supporting the operator < implement the interface lime.lang.HasGT, where HasGT contains one operator, boolean this < (T op2). Next, instantiations of value generic types must be expanded. Generics is a powerful abstraction tool for programmers. However, generic value classes also significantly complicate our compilation process. To see why, consider generating a default constructor for Unsigned<W>. This constructor needs to initialize the data field to a bit-array of size w, where w is the number of values defined for enum type W. However, the value of w changes for each concrete instantiation of W. We have no way of initializing this field without knowing what W is type-instantiated to. For this reason, the erasure-based compilation technique used by Java generics is not applicable. We must employ an expansion-based compilation scheme, where each instantiation of Unsigned<W> creates a different type.
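The erasure problem can be seen in plain Java: under erasure, a generic class cannot ask how many constants its type parameter's enum has, which is exactly the number the generated Unsigned<W> constructor needs. The usual Java workaround, shown below as a contrast to Lime's expansion scheme (class and enum names are ours), is to pass a Class token:

```java
// Under erasure, W alone cannot yield |W|; a Class token recovers it
// at run time. Lime instead expands each instantiation at compile time.
enum Four { b0, b1, b2, b3 }

final class UnsignedSketch<W extends Enum<W>> {
    final boolean[] data;   // one flag per "bit" of the width enum

    UnsignedSketch(Class<W> width) {
        // the size is only available through the token, not through erased W
        data = new boolean[width.getEnumConstants().length];
    }
}
```

This illustrates why Lime's compiler, which must generate the default constructor with no such token in the source, expands `Unsigned<thirtytwo>` and `Unsigned<sixtyfour>` into distinct types instead.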
Java generic classes not annotated with the value modifier are translated using the standard erasure technique, as long as they do not instantiate generic value types with their type variables. As a result, pure Java code that is compiled with our compiler remains backward compatible. There are of course numerous optimizations that exploit bit-width information and the immutable properties of value types (see Section 2.3 for examples). Such optimizations are well studied and understood. In this paper, we primarily focus on the less understood parts of our language, such as translating Object-Oriented semantics down to the hardware fabric. The Lime frontend compiler (source to bytecode or spatial intermediate representation) is implemented using the JastAdd extensible Java compiler [7].
5
Liquid Metal Runtime for Mixed-Mode Execution
A Lime program may run in mixed-mode. That is, some parts of the program will run in the virtual machine, and some parts will run in hardware (FPGA). An example mixed-mode architecture is a CPU coupled with an FPGA coprocessor, or a desktop workstation with an FPGA PCI card. Yet another example is an FPGA with processors that are embedded within the fabric. We use a Xilinx
Virtex-4 board as an instance of the latter. The Virtex-4 is also our evaluation platform for this paper. Programs that run in software use its embedded IBM PowerPC 405 which runs at a frequency of 300 MHz. The processor boots an embedded Linux kernel and can run a JVM. The Liquid Metal runtime (LMRT) provides an API and a library implementation that allows a program to orchestrate its execution on a given computational platform. It simplifies the exchange of code and data between processing elements (e.g., PowerPC and FPGA), and automatically manages data transfer and synchronization where appropriate. The API calls are typically generated automatically by our compiler, although a programmer can make use of the API directly and manually manage the computation when it is desirable to do so. The LMRT organizes computation as a set of code objects and buffer objects. A buffer is either an input buffer, an output buffer, or a shared buffer. A code object reads input data from an attached input buffer. Similarly it writes its output to an attached output buffer. Data is explicitly transferred (copied) between input and output buffers. In contrast, a shared buffer simultaneously serves as an input and output buffer for multiple code objects. All communication between code objects is done through buffers. 5.1
Code Objects
The LMRT assumes there is a master processing element that initiates all computation. For example, the VM running on the PowerPC processor serves as the master on our Virtex board. The VM can invoke the LMRT API through JNI. The master creates code objects, attaches input and output buffers, and then runs, pauses, or deletes the code object as the computation evolves. A code object embodies a set of methods that carry out computation. It can contain private mutable data that persists throughout its execution (i.e., stateful computation). However, code objects are not allowed to maintain references to state that is mutated in another object. A Lime program running wholly in the virtual machine can be viewed as a code object with no input or output buffers. A program running in mixed-mode consists of at least two code objects: one running in software, and the other running in hardware. Data is exchanged between them using buffer objects. 5.2
Buffer Objects
A buffer is attached to a code object, which can then access the buffered data using read and write operators. The LMRT defines three modes to read data from or write data to a buffer.
– FIFO: The buffer is a first-in first-out queue, and it is accessed using push or pop methods. For example, code running in the VM can push data into the buffer, and code running in the FPGA pops data from the buffer.
– DMA: The buffer serves as a local store, with put operations to write data to the buffer, and get operations to read data from it. The put and get commands operate on contiguous chunks of data.
– RF: The buffer serves as a scalar register file, shared between code objects.

The LMRT makes it possible to decouple the application-level communication model from the implementation in the architecture. That is, a buffer decouples (1) the program view of how data is shared and communicated between code objects from (2) the actual implementation of the I/O network in the target architecture. Hence a program can use a pattern of communication that is suitable for the application it encodes, while the compiler and runtime system can determine the best method for supporting the application-level communication model on the architecture. 5.3
The LMRT Hardware Interface Layer
One of the main reasons for the LMRT is to automatically manage communication and synchronization between processing elements. In a mixed-mode environment, communication between the VM and FPGA has to be realized over a physical network interconnecting the FPGA with the processor where the VM is running. In our current Virtex platform, we use the register file (RF) interface between the processor and the FPGA. The RF is synthesized into the fabric itself. It is directly accessible from the FPGA. From the processor side, the registers are memory-mapped to a designated region of memory. The RF we use consists of 32 registers, each 32 bits wide. The 32 registers are partitioned into two sets. The first is read-accessible from the FPGA, but not write-accessible; those registers are read/write-accessible from the VM. The second set is read-accessible from the VM, but not write-accessible; the registers in the second set are read/write-accessible from the FPGA. The FIFO and DMA communication styles are implemented using the RF model. The FIFO model maintains head and tail pointers and writes the registers in order. The DMA model allows for 15x32 bits of data transfer, with 32 bits used for tags. While we use a register file interface between the VM and the FPGA, other implementations are feasible. Namely, we can implement a FIFO or a DMA directly in the FPGA fabric, and compile the code objects to use these interfaces. This kind of flexibility makes it possible both to experiment with different communication models and to adapt the interconnect according to the characteristics of the computation.
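As a software-level illustration of the FIFO-over-RF idea (a sketch of ours, not the actual LMRT implementation), head and tail indices can walk a small fixed-size register array in order:

```java
// Hypothetical model of a FIFO layered over a fixed register file:
// a circular buffer whose capacity is the number of registers.
final class RegisterFileFifo {
    private final int[] regs;
    private int head, tail, count;

    RegisterFileFifo(int nRegs) { regs = new int[nRegs]; }

    boolean push(int v) {
        if (count == regs.length) return false;   // full: producer must wait
        regs[tail] = v;
        tail = (tail + 1) % regs.length;
        count++;
        return true;
    }

    int pop() {
        if (count == 0) throw new IllegalStateException("empty");
        int v = regs[head];
        head = (head + 1) % regs.length;
        count--;
        return v;
    }
}
```

In the real system, one side of this structure would be driven by the VM through the memory-mapped registers and the other by logic in the FPGA fabric.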
6
From Lime to a Spatial Intermediate Representation
Compiling a Lime program to execute on the FPGA requires a few transformations. Some transformations are necessary to correctly and adequately handle object orientation in hardware. Others are necessary for exposing parallelism and generating efficient circuits. Performance efficiency in the FPGA is attributed to several factors [8]:
1. Custom datapaths: a custom datapath elides extraneous resources to provide a distinct advantage over a predefined datapath in a conventional processor.
2. Multi-granular operations: a bit-width-cognizant datapath, ALUs, and operators tailor the circuitry to the application, often leading to power and performance advantages.
3. Spatial parallelism: FPGAs offer flexible parallel structures to match the parallelism in an application. Hence bit-, instruction-, data-, and task-level parallelism are all plausible forms of parallelism. We refer to parallelism in the FPGA as spatial since computation typically propagates throughout the fabric.

In this paper we focus exclusively on the issues related to discovering spatial parallelism and realizing such parallelism in hardware. Toward this purpose, we employ a spatial intermediate representation (SIR) that facilitates the analysis of Lime programs. The SIR also provides a uniform framework for refining the inherent parallelism in the application so that it is best suited to the target platform. 6.1
Spatial Intermediate Representation
The SIR exposes both computation and communication. It is based on the synchronous dataflow model of computation [9,10]. The SIR is a graph of filters interconnected with communication channels. A filter consists of a single work method that corresponds to a specific method call derived from a Lime program. A filter may contain other methods that are called helpers. The difference between the work method and the helpers is that only the work method may read data from its input channel or write data to its output channel. For example, each static call to permute() in the DES example corresponds to a specific filter in the SIR. A filter consumes data from its input channel, executes the work method, and writes its results to an output channel. The input and output of the permute method that performs the initial permutation is an Unsigned64 value. Hence, the work method for permute consumes 64 bits and produces 64 bits on every execution. The filter work method runs repeatedly as long as a sufficient quantity of input data is available. Filters are independent of each other, do not share state, and can run autonomously. Filters have a single input channel and a single output channel. A filter may communicate its output data to multiple filters by routing the data through a splitter. A splitter can either duplicate the input it receives and pass it on to its siblings, or it can distribute data in a roundrobin manner according to a specified set of weights. The splitter’s counterpart is a joiner. A joiner collects and aggregates data from multiple filters in a roundrobin manner, and routes the resultant data to another filter. The single-input to single-output restriction placed on filters, and the routing of data through splitters and joiners for fan-out and fan-in imposes structure on the SIR graphs. The structure can occasionally lead to additional communication compared to an unstructured graph. In DES, this
Fig. 7. SIR example for box substitutions in DES
occurs between Fiestel rounds where the values of L and R are interchanged.¹ However, we believe that the benefits of a structured SIR outweigh its drawbacks, and prior work has shown that structured graphs can be practically refined to their unstructured counterparts [11].

The SIR graph in Figure 7 illustrates an example derived from the box substitutions (Sbox) that occur in the Fiestel rounds. In the figure, the output of the xor operator is duplicated to eight filters labeled Extract, each of which implements the extractSixBits method but for different bit indexes. For example, the left-most filter labeled Extract b0..b5 inputs a 32-bit value and always extracts a value consisting of the bits at locations b0..b5. Similarly, the Extract b42..b47 filter always extracts the bits b42..b47. The output of the former is the input to the Sbox 1 filter, which performs the appropriate bit substitutions for bits b0..b5. The Extract and Sbox filters make up a producer-consumer pair and are said to form a pipeline. Pipelines in the SIR graph expose pipeline parallelism that is readily exploited in hardware. The output of each Sbox is routed to a joiner that collects each of the 4-bit pieces in a round-robin manner and outputs a 32-bit half block.

Filters, like objects, may have fields. The fields are initialized using an init method whose parameters must be resolved when the SIR is constructed. Each of the Extract filters is initialized with the start and end bits that it is responsible for. Similarly, each of the Sbox filters is initialized with a table that encodes the unique substitution pattern for the bits it is responsible for. The fields of a filter cannot be shared and are conceptually stored in a local memory that is exclusive to that filter. In Figure 7, the cylinders labeled Box 1..8 store the substitution boxes. The Extract filters require no storage since the initialization parameters are constant-propagated throughout the filter work method.

¹ Figure 2 illustrates unstructured communication. It is left as an exercise for the reader to determine the structured SIR equivalent.

Liquid Metal: Object-Oriented Programming
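The splitter and joiner plumbing described above can be sketched in a few lines of plain Java. This is a toy model of the semantics only; the class and method names are ours, not Lime's or StreamIt's:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model of the SIR's communication primitives: filters have one input
// and one output channel; splitters and joiners provide fan-out and fan-in.
class Sir {
    // A filter's work method fires whenever enough input is available.
    interface Filter<I, O> { O work(I input); }

    // A duplicate splitter copies the entire input stream to every sibling.
    static <T> List<Queue<T>> duplicate(List<T> input, int siblings) {
        List<Queue<T>> outs = new ArrayList<>();
        for (int i = 0; i < siblings; i++) outs.add(new ArrayDeque<>(input));
        return outs;
    }

    // A round-robin joiner takes one item from each upstream channel in turn,
    // producing a single merged stream.
    static <T> List<T> roundRobinJoin(List<Queue<T>> ins) {
        List<T> out = new ArrayList<>();
        boolean more = true;
        while (more) {
            more = false;
            for (Queue<T> q : ins)
                if (!q.isEmpty()) { out.add(q.poll()); more = true; }
        }
        return out;
    }
}
```

For instance, duplicating the stream [1, 2] to three siblings and joining them round-robin yields [1, 1, 1, 2, 2, 2], matching the duplicate and round-robin behaviour described above.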
6.2 Compiling Lime to SIR
There are three key considerations in translating a Lime program into a spatial representation. We must determine the dataflow of the program: which objects (or primitive values) need to be passed from filter to filter, and which can be statically initialized (or calculated from statically initialized variables). We must also determine what constitutes a filter: what Lime code is a filter responsible for executing? Lastly, we must determine how important object-oriented features can be supported in hardware: how are objects represented? How do we support virtual method dispatch? How do we handle object allocation?

Answering these questions requires us to first construct a control flow graph from program entry to exit, including inlining recursive method calls.² The only cycles the control flow graph can have are those produced by Lime's looping primitives, such as for or while. The inlining of recursive method calls necessarily places a restriction on the type of programs that can be synthesized into hardware: programs involving recursive method calls that are not statically bounded are out of the reach of synthesis.

The basic approach is to construct a dataflow graph of non-static data in a program. Methods that receive non-static data as input are conceptually mapped to filters. The flow of data between methods outlines the overall SIR topology.

Determining Dataflow. We use constant propagation to determine which variables have statically computable values. For example, in the for (eight i) { ... } loop used in the box substitution in Fiestel, the variable i is statically initialized to eight.b0, and subsequently updated by i + eight.b1 during each iteration. This updated value can be computed from statically known values. Thus, i does not need to be an input to a filter work method. Instead, it is used as a filter initializer or mapped to a filter field. On the other hand, bits is initialized by the expression S.extractSixBits(startBit, endBit).
S does not have a statically computable value—its value depends on the filter input to method Fiestel. Thus, the computation of S.extractSixBits(startBit, endBit) requires S as an input. (Note that the receiver of a method invocation is considered an input as well.) Consequently, bits is not statically computable either, and must be the output of the filter/pipeline for the expression S.extractSixBits(startBit, endBit). Using standard dataflow techniques, we can determine the data necessary at each point of the program.

Defining Filters. The identifying characteristic of filters is that they perform input or output (I/O) of data that is not statically computable. Once we determine what data is needed for input and output at each program location, we decompose the program into (possibly nested) I/O "containers", and then construct filters and pipelines from these containers.

The entry and exit of a Lime method form natural bounds for an outermost I/O container. For example, an outermost I/O container is constructed for method Fiestel. Within these bounds, we identify two types of I/O containers. First, an I/O container is indicated by a method or constructor invocation where at least one of the arguments (including this, if the method call is not static) has been identified as a filter input. For example, S.extractSixBits(startBit, endBit) in Fiestel becomes an I/O container, with S as its input. We then analyze the declaration of extractSixBits, and inline the I/O containers for the method declaration inside the container for the method invocation. A second type of I/O container is formed from branching statements such as for loops or if/else, where the body of a branch references filter inputs. Each branching container may include nested containers, depending on the body of the branch. For example, the box substitution for (eight i) { ... } loop in Fiestel becomes a branching I/O container. Nested within it are a series of containers, such as the one for the method call S.extractSixBits(startBit, endBit), as well as a container for Sbox(i, bits). Figure 8 illustrates the I/O containers identified for Fiestel in Figure 3. Note that the expression E^K constitutes an I/O container because operator ^ is defined for Unsigned. Thus, E^K is turned into the method invocation E.$XOR(K). Also note that non-I/O statements, such as loop index updates (e.g., sIndex += fourtyeight.b6;), become local to their enclosing I/O container. For space reasons, ... represents elided I/O containers.

² There is no good way to deal with unbounded recursion in hardware.
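The static-versus-input classification that drives this decomposition can be sketched as a simple reachability check: a variable is statically computable exactly when nothing it depends on is a filter input. The names below are illustrative, not the Lime compiler's internals:

```java
import java.util.Map;
import java.util.Set;

// Toy version of the dataflow classification: a value is statically
// computable iff everything it is computed from is statically computable;
// anything that (transitively) reads a filter input must itself flow
// through the filter.
class Dataflow {
    // deps maps each variable to the variables its defining expression reads.
    static boolean isStatic(String var, Map<String, Set<String>> deps,
                            Set<String> filterInputs) {
        if (filterInputs.contains(var)) return false;
        for (String d : deps.getOrDefault(var, Set.of()))
            if (!isStatic(d, deps, filterInputs)) return false;
        return true;
    }
}
```

Mirroring the example above: i depends only on constants, so it becomes a filter field or init value; bits depends on S, a filter input, so it must be filter output.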
SIR from I/O Containers. An I/O container has a natural mapping down to the SIR. An I/O container with no nested containers maps naturally to a filter. Its work method contains all the statements enclosed by the container. These are generally arithmetic computations that have a straightforward mapping to hardware. If these statements involve any static references, the definitions of the referenced data or methods are declared as local fields in the filter or as local variables in the work method. Filters (or pipelines) from I/O containers at the same nesting level are connected to form a pipeline. Thus, an I/O container with nested containers is mapped down to a pipeline formed by its children.

Fig. 8. I/O containers for Fiestel

A branching I/O container formed by a for statement creates a more interesting mapping to the SIR. First, the work statements or nested I/O containers within the loop body are turned into a filter (or pipeline, respectively). If the loop iterations are independent of each other with respect to the filter input and output data, then the filter (pipeline) that makes up the loop body is considered data-parallel. It can be replicated once for each iteration of the loop. This effectively translates the Lime code to a data-parallel representation in the SIR. A data splitter is added at the beginning of the for I/O container. The splitter duplicates the incoming data and sends it down each replicated loop-body filter (pipeline). Data that is not part of the filter input and that may depend on the loop index is used as init values for the filter construction. A joiner is then added at the exit of the for I/O container to assemble the output of each replicated filter (pipeline).

When we cannot determine that the loop iterations are independent, we have to explore an alternative mapping; in this case, the computation is considered stateful. We can statically unroll the loop and connect the unrolled loop-body filters (pipelines) sequentially to form longer pipelines. Alternatively, we can create a feedback loop such that the output of the loop-body filter (pipeline) feeds back into itself. This second option, however, is untested in our SIR compiler. Similar split/join structures are generated for other branching-statement I/O containers. Applying these rules, it is easy to see how the I/O containers from Figure 8 can be mapped to exactly the SIR structure in Figure 7.

Object Representation in Hardware. The most general way an object can be represented in hardware is by serializing it into an array of bits that is either packed into registers or stored in memory. The kinds of Lime programs most amenable to synthesis to hardware use data with efficient representations. Lime's language design is geared toward exposing such representations from a high level, as we illustrated in Section 2. Objects of value types have no mutable state, and thus can be safely packed into registers instead of being stored in much slower memory.

Dynamic Dispatch in Hardware. One of the defining features of object-oriented paradigms is the dynamic dispatch of methods. In order to perform dynamic dispatch in hardware, we assign a unique identifier to each type, which
is then carried by each object of that type. Thus, object representation may require bits for the type identifier to be serialized as well. When mapping an I/O container resulting from a virtual method invocation to SIR filters, we must generate a pipeline for each possible target method of the virtual call. All pipelines from target methods are then added to a switch connector. The condition for the switch is the type identifier carried by the incoming this object. A pipeline for a target method is only invoked if the type identifier of the input this object equals the type identifier of the method's declaring class or one of its subclasses. We use analyses such as RTA [12] to reduce the number of potential target methods that need to be synthesized. If the target method of a virtual call can be statically identified, then the object does not have to carry a type identifier.

Object Allocation in Hardware. Lime programs can use the new keyword to create new objects. However, laying out a program in hardware means that all memory needed must be known ahead of time. Thus, a program for synthesis must statically resolve all uses of new, with space allocated in registers or memory. Repeatedly new-ing objects in an unbounded loop, where the objects have lifetimes persisting beyond the life of the loop, is not permitted in synthesized programs.
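The switch-connector scheme can be modelled in software as a table from type identifiers to per-target pipelines. This is an illustrative sketch (the names are ours), not the synthesized hardware:

```java
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Toy model of hardware dynamic dispatch: each object carries a type id,
// and a switch connector routes the input to the pipeline synthesized for
// the matching target method.
class Dispatch {
    // One synthesized pipeline per possible target method, keyed by type id.
    static int dispatch(Map<Integer, IntUnaryOperator> pipelines,
                        int typeId, int input) {
        IntUnaryOperator target = pipelines.get(typeId);
        if (target == null)
            throw new IllegalStateException("no target for type " + typeId);
        return target.applyAsInt(input);
    }
}
```

Analyses like RTA shrink the pipelines table; when it has a single entry, the dispatch (and the type-identifier bits) can be eliminated entirely, as noted above.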
7 SIR Compiler
The SIR that we adopt is both a good match for synthesis and convenient for performing coarse-grained optimizations that impact the realized parallelism. We build heavily on the StreamIt compiler [13] to implement our SIR and our SIR compiler. The StreamIt compiler is designed for the StreamIt programming language, in which programs describe SIR graphs algorithmically and programmatically using language constructs for filters, pipelines, splitters/joiners, and feedback loops. Feedback loops create cycles in the SIR graph, although we do not currently handle cycles.

7.1 Lowering Communication
The SIR compiler transforms the SIR graph to reduce communication overhead and cost. In an FPGA, excessive fan-out and fan-in are not desirable. Hence, the compiler attempts to collapse subgraphs that are dominated by a splitter and post-dominated by a joiner. This transformation is feasible when the filters that make up the subgraph are stateless. In a Lime program, methods of a value class are stateless. For example, the Extract and Sbox filters in the SIR graph shown in Figure 7 are stateless, since neither of the two has any mutable state. However, since each of these filters is specialized for a specific set of bits, collapsing the subgraph results in at least one stateful filter, namely the Sbox filter in this case. The collapsed graph is shown in Figure 9. Each execution of the work method updates the state of the filter (shown as i in the figure) so that
[Figure: the collapsed pipeline, consisting of XOR, Extract b0..b5, Sbox, and Permute filters, each with a controller and a work method; annotations show S (48 bits), bits (6 bits), the filter state i, the Box 1..8 local memory, a 32-bit channel, and the permutation P.]
Fig. 9. Result of collapsing SIR shown in Figure 7
on the first execution it performs the substitution that correlates with Sbox 8, on its second execution it performs the substitution for Sbox 7, and so on until its ninth execution, where it resets the state and resumes with Sbox 8. The Extract filter does not need to keep track of its execution counts if the compiler can determine that each of the Extract filters in the original graph operated in order on mutually exclusive bits. Such an analysis requires dataflow analysis within the filter work method, and is aided by very aggressive constant propagation, loop unrolling, and dead code elimination. More powerful analysis is also possible when filters carry out affine computations [14,15,6]. The SIR compiler employs these techniques to reduce overall communication. The impact on the generated hardware can be significant in terms of speed (time) and overall area (space).

We demonstrate the space-time tradeoff by synthesizing the SIR graphs in Figures 7 and 9. The results for these two implementations appear as Sbox Parallel Duplicate and Sbox State, respectively, in Figure 10. The evaluation platform is a Virtex-4 FPGA with an embedded PowerPC processor (PPC). The speedup results compare the performance of each hardware implementation to the implementation that yields the best results on the PPC. We also performed two other transformations: Sbox Parallel Roundrobin and Sbox Coarse. The former uses dataflow analysis to determine that a roundrobin splitter can replace the duplicate splitter and the Extract filters in Figure 7. The latter eliminates the state from the Sbox filter in Figure 9 by substituting all 48 bits in one execution of the work method.

The results are as one should expect. The fastest hardware implementation uses the roundrobin splitter and parallel Sbox filters.
This implementation is roughly 3x faster than the duplicate splitter implementation and 100% more space-efficient, since the roundrobin splitter avoids needless communication and aggressively optimizes the datapaths between filters. Its area overhead is still 50% larger than that of the most compact implementation, namely Sbox State, which is a pipelined implementation with an aggressively optimized datapath. The coarse implementation
[Figure: two bar charts, one showing speedup relative to the best PPC405 implementation (Sbox Coarse, scale 0 to 10) and one showing area in FPGA slices (scale 0 to 1600), for the Sbox Parallel Roundrobin, Sbox Parallel Duplicate, Sbox State, and Sbox Coarse variants.]
Fig. 10. Speedup and area results for different SIR realizations of Sbox
is the slowest of the four variants, since it performs the most work per execution of the work method and affords little opportunity for pipeline parallelism. It is, however, the best implementation for software, although it is worth noting that it does not use the natural width of the machine in this case. In other words, the version of Sbox Coarse that we benchmark in software uses the same granularity as the FPGA and runs the work methods at bit granularity. The purpose of the performance comparison is to illustrate the space-time trade-off that exists. In Section 9 we compare our synthesis results to various optimized baselines.

7.2 Load Balancing
The SIR compiler also attempts to refine the SIR graph to realize a more load-balanced graph. This is important because it minimizes the effects of a bottleneck in the overall design. Toward this end, we currently use the heuristics and optimizations described in [11] and implemented in the StreamIt compiler. The compiler uses various heuristics to fuse adjacent filters when it is profitable to do so. The heuristics rely on a work-estimation methodology to detect load imbalance. In our case, the work estimate is a simple scalar measure of the critical path length through the filter work method, calculated using predefined latencies for individual primitive operations. We believe, however, that there are other ways of dealing with load imbalance on an FPGA platform, but we have not yet thoroughly investigated the alternatives.
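The work-estimation measure described above, critical-path length under fixed per-operation latencies, can be sketched as a longest-path computation over the work method's operations. The operation names and latency numbers below are invented for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy work estimator: the scalar work of a filter is the critical-path
// length through its work method, computed from predefined per-operation
// latencies.
class WorkEstimate {
    record Op(String name, int latency, List<String> deps) {}

    // Longest finish time over all operations. Assumes ops are listed in
    // dependency order (a topological order of the work method's dataflow).
    static int criticalPath(List<Op> ops) {
        Map<String, Integer> finish = new HashMap<>();
        int worst = 0;
        for (Op op : ops) {
            int start = 0;
            for (String d : op.deps()) start = Math.max(start, finish.get(d));
            finish.put(op.name(), start + op.latency());
            worst = Math.max(worst, start + op.latency());
        }
        return worst;
    }
}
```

Filters whose estimates differ greatly are candidates for fusion (or replication), which is exactly how the heuristics above detect and repair load imbalance.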
8 Getting to Hardware
The last step in our toolflow is HDL code generation. It is accomplished using our SIR-to-Verilog compiler, called the Crucible. The Crucible compiles each filter in the SIR to a Verilog hardware module, and then assembles the modules together according to the dataflow edges in the SIR graph. The Crucible also generates the HDL interfaces used to exchange data between the processor and the FPGA in order to support mixed-mode execution. The interfaces work in conjunction with the
Table 1. Comparison of DES implementations on different processing platforms

  processor     PPC 405      FPGA     Pentium-II   Core 2 Duo
  frequency     300 MHz      129 MHz  400 MHz      2.66 GHz
  throughput    27 Mbps      30 Mbps  45 Mbps      426 Mbps
  performance   1            1.11     1.69         16
  DES version   C reference  Lime     C reference  C reference
Liquid Metal runtime to provide the network between processing elements, as well as the API implementation on the FPGA side. The completed design is finally synthesized using commercial synthesis tools to produce a bitstream that can be used to program the FPGA. We use the Xilinx synthesis tool (XST) for this purpose. FPGAs typically require vendor-specific tools, so for other targets the appropriate synthesis tool is used. The Crucible controls and guides the synthesis tool by setting appropriate synthesis parameters that impact resource allocation policies, arithmetic circuit implementation, the placement of objects in FPGA memory, etc. The Crucible is best suited to guide these policies since it has a global view of the application.

The Crucible addresses both micro-functional (intra-filter) and macro-functional (inter-filter) synthesis issues. It extends the Trimaran [16] compiler with optimizations and heuristics that are space-time aware. We leverage many of the existing analyses and optimizations in Trimaran to optimize the code within each filter. These optimizations include critical path reduction, region formation for instruction-level parallelism, predication, vectorization, and aggressive instruction scheduling algorithms. In addition, the Crucible is bit-width cognizant; although the compiler can perform bit-width analysis, we primarily rely on the high-level semantics of the Lime program to elide or augment the analysis where feasible.

In the micro-functional sense, the compiler operates on a control flow graph (CFG) consisting of operations and edges. Operations are grouped into basic blocks. Basic blocks are in turn grouped into procedures. Each procedure typically represents a filter. The code generation applies a bottom-up algorithm to the CFG, starting with the operations. It generates Verilog for each operation, then routes the operands between them. Basic blocks serve as a hierarchical building block.
They are composed together with dataflow edges, eventually encompassing the entire procedure. Since procedures represent filters, it is also necessary to generate the FIFOs that interconnect them according to the SIR. The size of each FIFO is either determined from the SIR according to the data types exchanged between filters, or chosen using a heuristic that is subject to space-time constraints. This is an example of a macro-functional optimization: if too little buffering is provided, then throughput decreases as modules stall to send or receive data, whereas too much buffering incurs substantial space overheads. Macro-functional optimizations require careful consideration of area and performance trade-offs to judiciously maximize application throughput at the lowest cost. In addition to the buffering considerations, the Crucible also generates hardware controllers that stage the execution of the filters in hardware. The results
presented in this paper use a basic execution model that executes a filter's work method when its input data is ready; reads from an empty channel (or writes to a full channel) block the filter on that channel until other filters make progress. A fuller description of the Crucible and its optimizations is beyond the scope of this paper.
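This execution model has a direct software analogue: filters connected by fixed-capacity FIFOs, where a full FIFO stalls the producer and an empty one stalls the consumer. A sketch using java.util.concurrent (the names are ours, not the Crucible's):

```java
import java.util.concurrent.ArrayBlockingQueue;

// Software analogue of the hardware execution model: a producer filter and
// a consumer filter exchange data through a fixed-capacity FIFO. put()
// blocks while the FIFO is full and take() blocks while it is empty, the
// same stalling behaviour described for the generated hardware.
class FifoDemo {
    static int runPipeline(int[] inputs, int capacity) {
        ArrayBlockingQueue<Integer> fifo = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                for (int x : inputs) fifo.put(x); // stalls when the FIFO is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < inputs.length; i++)
                sum += fifo.take();               // stalls when the FIFO is empty
            producer.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return sum;
    }
}
```

Shrinking capacity trades throughput for space, which is exactly the macro-functional FIFO-sizing trade-off discussed above.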
9 Experimental Results
We compiled and synthesized the DES Lime code from Section 3 to run in an FPGA. We measured the application throughput at steady state in Mbits per second (Mbps). We compare our results to an optimized implementation of DES (the reference implementation) running on an Intel Pentium-II at 400 MHz, a Core 2 Duo processor with a frequency of 2.66 GHz, and a 300 MHz PPC 405, which is the embedded processor available in the Virtex-4 LX200. The frequency of the DES design generated from the Lime program is 129 MHz. The results are summarized in Table 1. The row labeled performance shows the relative throughput compared to the PPC 405. The PPC is a reasonable baseline since it is manufactured in the same technology as the FPGA fabric. Compared to the embedded processor, the FPGA implementation is 11% faster. It achieves about 67% of the throughput of a reasonably optimized DES coder running on a Pentium-II, and is 14x slower than the fastest processor we tested.

The results show that we can achieve a reasonable hardware implementation of DES starting from a high-level program that was relatively easy to implement. Compared to the reference C implementations that we found and studied, we believe the Lime program is easier to understand. In addition, the Lime program is arguably more portable, since computation is explicitly expressed at the bit level and is therefore platform-independent. This is in contrast to software implementations that have to match the natural processing width of their target platforms and hence express computation at the granularity of bytes or words instead of bits. We believe that starting with a bit-level implementation is more natural for a programmer, since it closely follows the specification of the algorithm.

The FPGA implementation that we realized from the Lime program requires nearly 84% of the total FPGA area. This is a significant portion of the FPGA.
The area requirement is high because we map the entire DES coder pipeline (all 16 rounds) to hardware and do not reuse any resources. The spatial mapping is straightforward to realize, but there are alternative mapping strategies that can significantly reduce the area; in particular, sharing resources and trading off space for throughput is an important consideration. We showed an example of this kind of trade-off earlier using the Sbox code (refer to Figure 10). We believe that there is significant room for improvement in this regard, and this is an active area of research that we are pursuing.

Our goal, however, is not to be the world's best high-level synthesis compiler. Rather, our emphasis is on extending the set of object-oriented programming features that we can readily and efficiently implement in hardware, so that skilled
Java programmers can transparently tap the advantages of FPGAs. In the current work, we showed that we can support several important features, including value types, generics, object allocation, and operator overloading. We are also capable of supporting dynamic dispatch in hardware, although the DES example did not provide a natural way to showcase this feature.
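As a sanity check on Table 1, the performance row is simply each platform's throughput divided by the 27 Mbps PPC 405 baseline (the helper below is ours, and the table's figures are rounded):

```java
// Relative performance as reported in Table 1: throughput normalized to the
// PPC 405 baseline of 27 Mbps.
class Speedup {
    static double relative(double throughputMbps, double baselineMbps) {
        return throughputMbps / baselineMbps;
    }
}
```

For example, the FPGA's 30 Mbps gives 30/27 ≈ 1.11 (the 11% speedup quoted above), and the Core 2 Duo's 426 Mbps gives 426/27 ≈ 16.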
10 Related Work

10.1 Languages with Value Types
Kava [1] is an early implementation of value types as lightweight objects in Java, and the design of Lime is very much inspired by it. However, Kava was designed before enum types or generics were introduced into Java. Thus, Kava chose a different type hierarchy, which put value types and Object at the same level. This design does not fit well with the current Java design; Lime remedies this by using a value modifier. Lime also provides support for value generic types. Additionally, Kava value types are not automatically initialized, nor are default constructors generated.

C# [17] offers value types in the form of structs. One important difference between C# value types and Lime value types is that C# value types cannot inherit from other value types. Inheritance and dynamic dispatch of methods are key features of the OO paradigm, and value types should be able to take advantage of these abstractions. Furthermore, C# struct references must be manually initialized by the programmer, even though a default constructor is provided for each struct. Lime value type references are automatically initialized, similar to the way primitive types are treated.

Recent work by Zibin et al. [18] has shown a way to enforce immutability using an extra immutability type parameter. In that work, a class can be defined such that it can be used in a mutable or immutable context. In Lime, a value class and a mutable class must be separately defined. The method proposed in [18] is an interesting way to integrate a functional style with Java's inherently mutable core; we could incorporate similar techniques in Lime in the future.

10.2 Synthesizing High-Level Languages
Many researchers have worked on compilers and new high-level languages for generating hardware in the past few years. Languages such as SystemC [4] have been proposed to provide the same functionality as lower-level languages such as Verilog and VHDL at a higher level of abstraction. SystemC is a set of library routines and macros implemented in C++ which makes it possible to simulate concurrent processes, each described by ordinary C++ syntax. Similarly, Handel-C [19] is another hardware/software construction language with C syntax that supports behavioral descriptions of hardware. SA-C [20] is a single-assignment, high-level synthesizable language. An SA-C program can be viewed as a graph where nodes correspond to operators and edges to data paths. Dataflow graphs are ideal (data-driven, timeless) abstractions for hardware circuits.
StreamC [21] is a compiler that focuses on extensions to C for expressing communication between parallel processes. Spark [22] is another C-to-VHDL compiler, which supports transformations such as loop unrolling, common sub-expression elimination, copy propagation, etc. DEFACTO [23] and ROCCC [24] are two other hardware generation systems that take C as input and generate VHDL code as output. To the best of our knowledge, none of these compilation systems supports high-level object-oriented techniques.

Work by Chu [25] proposes object-oriented circuit-generators. Circuit-generators, parameterized code which produces a digital design, enable designers to conveniently specify reusable designs in a familiar programming environment. Although object-oriented techniques can be used to design these generators, the system is not intended for both hardware and software programming in a parallel system. Additionally, the syntax used in the proposed system is not appropriate for large-scale object-oriented software designs.
11 Conclusion
In this paper, we introduce Lime, an OO language for programming heterogeneous computing environments. The Lime architecture provides end-to-end support, from a high-level OO programming language, through compilation to both the Java VM and the FPGA, to a runtime that allows mixed-mode operation, so that code can run partly on the VM and partly on the FPGA, delegating work to the fabric best suited to a given task. Lime is a first step toward a system that can "JIT the hardware", truly taking advantage of the multitude of computing architectures.
Acknowledgments This work is supported in part by IBM Research and the National Science Foundation Graduate Research Fellowship. We thank Bill Thies and Michael Gordon of MIT for their help with the StreamIt compiler, Stephen Neuendorffer of Xilinx for his help with the Xilinx tools and platforms, and the reviewers for their helpful comments and appreciation of our vision.
References

1. Bacon, D.F.: Kava: A Java dialect with a uniform object model for lightweight classes. Concurrency—Practice and Experience 15, 185–206 (2003)
2. Wu, P., Midkiff, S.P., Moreira, J.E., Gupta, M.: Efficient support for complex numbers in Java. In: Java Grande, pp. 109–118 (1999)
3. IEEE: 1076 IEEE standard VHDL language reference manual. Technical report (2002)
4. IEEE: IEEE standard SystemC language reference manual. Technical report (2006)
5. Narayanan, M., Yelick, K.A.: Generating permutation instructions from a high-level description. In: Workshop on Media and Streaming Processors (2004)
6. Solar-Lezama, A., Rabbah, R., Bodík, R., Ebcioğlu, K.: Programming by sketching for bit-streaming programs. In: PLDI 2005: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 281–294. ACM, New York (2005)
7. Ekman, T., Hedin, G.: The JastAdd extensible Java compiler. In: OOPSLA 2007: Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications, pp. 1–18. ACM, New York (2007)
8. Babb, J., Frank, M., Lee, V., Waingold, E., Barua, R., Taylor, M., Kim, J., Devabhaktuni, S., Agarwal, A.: The Raw benchmark suite: Computation structures for general purpose computing. In: Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (1997)
9. Lee, E.A., Messerschmitt, D.G.: Static scheduling of synchronous data flow programs for digital signal processing. IEEE Transactions on Computers (1987)
10. Bhattacharyya, S.S., Murthy, P.K., Lee, E.A.: Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, Dordrecht (1996)
11. Gordon, M., Thies, W., Amarasinghe, S.: Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (2006)
12. Bacon, D.F.: Fast and effective optimization of statically typed object-oriented languages. PhD thesis (1997)
13. StreamIt (2003), http://cag.csail.mit.edu/streamit
14. Lamb, A.A., Thies, W., Amarasinghe, S.: Linear analysis and optimization of stream programs. In: PLDI (2003)
15. Agrawal, S., Thies, W., Amarasinghe, S.: Optimizing stream programs using linear state space analysis. In: CASES (2005)
16. Trimaran Research Infrastructure (1999), http://www.trimaran.org
17. Hejlsberg, A., Wiltamuth, S., Golde, P.: C# Language Specification. Addison-Wesley Longman Publishing Co., Inc., Boston (2003)
18.
Zibin, Y., Potanin, A., Ali, M., Artzi, S., Kieżun, A., Ernst, M.D.: Object and reference immutability using Java generics. In: ESEC/FSE 2007: Proceedings of the 11th European Software Engineering Conference and the 15th ACM SIGSOFT Symposium on the Foundations of Software Engineering, Dubrovnik, Croatia (2007)
19. Handel-C Language Overview (2004), http://www.celoxica.com
20. Najjar, W., Bohm, W., Draper, B., Hammes, J., Rinker, R., Beveridge, J., Chawathe, M., Ross, C.: High-level language abstraction for reconfigurable computing (2003)
21. Mencer, O., Hubert, H., Morf, M., Flynn, M.J.: Stream: Object-oriented programming of stream architectures using PAM-Blox. In: FPL, pp. 595–604 (2000)
22. Gupta, S.: Spark: A high-level synthesis framework for applying parallelizing compiler transformations (2003)
23. Diniz, P.C., Hall, M.W., Park, J., So, B., Ziegler, H.E.: Bridging the gap between compilation and synthesis in the DEFACTO system. In: Lecture Notes in Computer Science, pp. 52–70 (2001)
24. Guo, Z., Buyukkurt, B., Najjar, W., Vissers, K.: Optimized generation of data-path from C codes for FPGAs. In: Design Automation Conference (2005)
25. Chu, M., Sulimma, K., Weaver, N., DeHon, A., Wawrzynek, J.: Object oriented circuit-generators in Java. In: Pocek, K.L., Arnold, J. (eds.) IEEE Symposium on FPGAs for Custom Computing Machines, pp. 158–166. IEEE Computer Society Press, Los Alamitos (1998)
Kilim: Isolation-Typed Actors for Java
(A Million Actors, Safe Zero-Copy Communication)

Sriram Srinivasan and Alan Mycroft

University of Cambridge Computer Laboratory, Cambridge CB3 0FD, UK
{Sriram.Srinivasan,Alan.Mycroft}@cl.cam.ac.uk

Abstract. This paper describes Kilim, a framework that employs a combination of techniques to help create robust, massively concurrent systems in mainstream languages such as Java: (i) ultra-lightweight, cooperatively-scheduled threads (actors), (ii) a message-passing framework (no shared memory, no locks) and (iii) isolation-aware messaging. Isolation is achieved by controlling the shape and ownership of mutable messages – they must not have internal aliases and can only be owned by a single actor at a time. We demonstrate a static analysis built around isolation type qualifiers to enforce these constraints. Kilim comfortably scales to handle hundreds of thousands of actors and messages on modest hardware. It is fast as well – task-switching is 1000x faster than Java threads and 60x faster than other lightweight tasking frameworks, and message-passing is 3x faster than Erlang (currently the gold standard for concurrency-oriented programming).
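To make the share-nothing model concrete, here is a minimal mailbox-based actor in plain Java. It illustrates the programming style only; Kilim's actual actors are far lighter than Java threads and add the isolation typing described below:

```java
import java.util.concurrent.LinkedBlockingQueue;

// A minimal share-nothing actor: each actor owns a private mailbox, and the
// only interaction between actors is message passing. This sketch is not
// Kilim's API; it merely shows the style (no shared memory, no locks in
// user code).
class Actors {
    static class Actor {
        private final LinkedBlockingQueue<String> mailbox = new LinkedBlockingQueue<>();

        void send(String msg) { mailbox.add(msg); }   // enqueue a message

        String receive() {                            // block until a message arrives
            try {
                return mailbox.take();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
    }
}
```

Kilim's contribution is making this style cheap (cooperatively scheduled actors rather than threads) and safe (isolation type qualifiers guarantee a mutable message has a single owner), rather than the mailbox mechanics themselves.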
1 Imagine No Sharing
Computing architectures are getting increasingly distributed, from multiple cores in one processor and multiple NUMA processors in one box, to many boxes in a data centre and many data centres. The shared memory mindset – synonymous with the concurrent computation model – is at odds with this trend. Not only are its idioms substantially different from those of distributed programming, it is extremely difficult to obtain correctness, fairness and efficiency in the presence of fine-grained locks and access to shared objects. The “Actor” model, espoused by Erlang, Singularity and the Unix process+pipe model, offers an alternative: independent communicating sequential entities that share nothing and communicate by passing messages. Address-space isolation engenders several desirable properties: component-oriented testing, elimination of data races, unification of local and distributed programming models and better optimisation opportunities for compilers and garbage collectors. Finally, data-independence promotes failure-independence [1]: an exception in one actor cannot fatally affect another.

1.1 Motivation
The actor and message-passing approach, with its coarse-grained concurrency and loosely-coupled components, is a good fit for split-phase workloads (CPU, network and disk) [4] and service-oriented workflows. With a view to immediate industrial adoption, we impose the following additional requirements: (a) no changes to Java syntax or to the JVM, (b) lightweight actors¹, (c) fast messaging, (d) no assumptions made about a message receiver's location and implementation language, (e) widespread support for debugging, logging and persistence.

J. Vitek (Ed.): ECOOP 2008, LNCS 5142, pp. 104–128, 2008. © Springer-Verlag Berlin Heidelberg 2008

[Fig. 1. javac output post-processed by Kilim weaver. Pipeline: annotated source → javac → bytecode → Kilim weaver (parse, typecheck with external annotations, heap model, isolation check, CPS transform) → bytecode.]

1.2 The Kilim Solution
This paper introduces Kilim², an actor framework for Java that contains a bytecode post-processor (“weaver”, see Fig. 1) and a run-time library. We list below some important features as well as the design points:

Ultra-lightweight threads. Kilim's weaver transforms methods identified by a @pausable annotation into continuation-passing style (CPS) to provide cooperatively-scheduled lightweight threads with automatic stack management and a trampolined call stack [3, 20]. These actor threads are quick to context-switch and do not need pre-allocated private heaps. The annotation is similar in spirit to checked exceptions in that all callers and overriding methods must be marked @pausable as well.

Messages as a special category. For the reasons outlined above, we treat message types as philosophically distinct from, and much simpler than, other Java objects. Messages are:
– Unencapsulated values without identity (like their on-the-wire counterparts, XML, C++ structs, ML datatypes and Scala's case classes). The public structure permits pattern-matching, structure transformation, delegation and flexible auditing at message exchange points; these are much harder to achieve in the presence of encapsulation.
– Not internally aliased. A message object may be pointed to by at most one other message object (and then only by one field or array element of it). The resulting tree structure can be serialized and cloned efficiently and effortlessly stored in relational and XML schemas. The lack of internal aliasing is less limiting in practice than would first appear, mostly because loosely-coupled components tend to have simple interfaces. Examples include events or messages in most server frameworks, windowing systems, the Singularity operating system [18] and CORBA valuetypes.
– Linearly owned. A message can have at most one owner at any time. This allows efficient zero-copy message transfer where possible. The programmer has to explicitly make a copy if needed, and the imperative to avoid copies puts a noticeable “back pressure” on the programmer.

Statically-enforced isolation. We enforce the above properties at compile-time. Isolation is interpreted as interference-freedom, obtained by keeping the set of mutable objects reachable from an actor's instance fields and stack totally disjoint from another actor's. Kilim's weaver performs a static intra-procedural heap analysis that takes hints from isolation qualifiers specified on method interfaces.

Run-time support. Kilim contains a run-time library of type-parametrised mailboxes for asynchronous message-passing with I/O throttling and prioritised alting [23]; SEDA-style I/O conditioning [36] is omnipresent. Mailboxes can be incorporated into messages, π-calculus [28] style. Space prevents us from presenting much of the run-time framework; this paper concentrates on the compile-time analysis and transformations.

The contribution of this work is the synthesis of ideas found in extant literature and in picking particular design points that allow portability and immediate applicability (no change to the language or the JVM).

¹ For example, threads are too heavyweight to assign per HTTP connection or per component in composable communication protocol state machines.
² Kilims are flexible, lightweight Turkish flat rugs woven with fine threads.

1.3 Isolation Qualifiers and Capabilities: A Brief Overview
Drossopoulou et al. [16], in their brief survey, present the choices of syntactic representations for controlling aliasing. One issue they raise is the need to “develop lightweight and yet powerful [shape] systems”. We have adopted “only trees may be transferred between actors” as our guiding principle. The motivations given in Sec. 1.1 led us to choose a scheme with (i) a marker interface Message to identify tree-shaped message types, which may contain primitive types, references to Messages and arrays of the above; and (ii) three qualifiers (@free, @cuttable, @safe) on method parameters, which we formalise within a calculus. These qualifiers can be understood in terms of two orthogonal capabilities of an object in a tree: first, whether it is pointed to by another object or not (called a root in the latter case) and second, whether or not it is structurally modifiable (whether its pointer-valued fields are assignable). The latter is a transitive property; an object is structurally modifiable if its parent is.
Given this, an object is free³ if it is the root of a tree and is structurally modifiable. A cuttable object may or may not be the root, but is structurally modifiable. An object with a safe capability cannot be structurally modified (transitively so), and does not care whether or not it is the root. These capabilities represent, in decreasing order, the amount of freedom offered by an object (in our ability to modify it, to send it to another actor, or to place it on either side of a field assignment). We use the term send (sent) to mean that the message is effectively transferred out of the sender's space, after which the sender is not permitted access to the message. Clearly, in all cases, a node in our Message tree can have at most one other object pointing to it⁴; in Boyland's terminology [9], all fields of our Messages are unique, which provides a system-wide invariant that permits an easy intuitive grasp of our isolation qualifiers as deep qualifiers. The cut operator (see below) can be read as an explicit version of the notion of destructive reads [9]. The cuttable and safe capabilities can be seen as variants of Boyland's borrowed. The relationship between qualifiers and capabilities is this: the qualifiers are specified on method interfaces and imply an interface contract between a method and its caller and, in addition, bestow the corresponding capability on the object referred to by the method parameter. Sec. 3 gives the specifics. The cut operator performs a specific structural modification: it cuts a branch of a tree, severing a subtree from its parent. In addition, it grants the root of the subtree a free capability. Only new and cut can create free objects. As an aside, we provide an additional (unchecked) escape interface Sharable that allows the programmer to identify classes that do not follow our message restrictions, yet can be safely transferred across to another thread. These may include immutable classes and those with internal aliasing.
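As a sketch, the two orthogonal properties and the capability they induce can be modelled in a few lines of Java. This is an illustration of the lattice only, with hypothetical names, not Kilim's implementation:

```java
// Illustrative model (not Kilim source): a capability is determined by two
// orthogonal booleans -- is the object a root, and is it structurally
// modifiable? free > cuttable > safe in decreasing order of freedom.
enum Capability {
    FREE, CUTTABLE, SAFE;

    static Capability of(boolean isRoot, boolean modifiable) {
        if (!modifiable) return SAFE;        // immutable, root or not
        return isRoot ? FREE : CUTTABLE;     // modifiable: root => free
    }

    // An object must be at least as capable as the qualifier demands.
    boolean atLeast(Capability required) {
        return this.ordinal() <= required.ordinal();
    }
}
```

A @free parameter then demands `atLeast(FREE)`, which only a root, structurally modifiable object satisfies.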
2 Example
Fig. 2 shows a simple Actor class TxtSrvr that blocks on a mailbox awaiting a message, transforms the message and responds to a reply-to mailbox specified in the message itself. TxtMsg is a message class, identified as such with the marker interface Message. The programming model for actors (TxtSrvr here) is similar to that for Java threads – replace Thread with Actor and run() with execute(). Similarly, an actor is spawned thus: new TxtSrvr().start(); The entry point of a Kilim task is execute(), the only method of the actor required to be public. Its other non-private methods may only have message-compatible parameters and results. The @pausable annotation on a method informs Kilim's weaver that the method may (directly or transitively) call other pausable methods such as Actor.sleep() and Mailbox.get().

³ Note: parameters have qualifiers, objects have capabilities; we write @free for the programmer-supplied qualifier and free for the corresponding object's capability.
⁴ At most one heap alias. Multiple local variables may also have the same pointer value.
import kilim.*;

class Mbx extends Mailbox<TxtMsg> {}

class TxtSrvr extends Actor {
    Mbx mb;
    TxtSrvr(Mbx mb) { this.mb = mb; }

    @pausable
    public void execute() {
        while (true) {
            TxtMsg m = mb.get();
            transform(m);
            reply(m);
        }
    }

    @pausable
    void reply(@free TxtMsg m) {
        m.replymb.put(m);
    }

    // @safe is default, so optional
    void transform(@safe TxtMsg m) { ... }
}

class TxtMsg implements Message {
    Mbx replymb;
    byte[] data;
}

// Sample driver code
// spawn actor
Mbx outmb = new Mbx();
new TxtSrvr(outmb).start();

// Send and recv message
Mbx replymb = new Mbx();
byte[] data = ...
outmb.put(new TxtMsg(replymb, data));
... = replymb.get();

Fig. 2. Example Kilim code showing annotations for message and stack management. Kilim's semantic extensions are in bold.
The blocking call (to Mailbox.get()) in an infinite loop illustrates automatic stack management. A typical state-machine framework would have the programmer rewrite this in a callback-oriented style and arrange to return to a main loop; this style is prevalent even in multi-threaded settings because threads are expensive and slow resources. Kilim's mailboxes are type-specific and thread-safe message queues, and being sharable objects (see Sec. 5.2), they can be passed around in messages. They support blocking, timed-blocking and non-blocking variants of get and put. An actor may simultaneously wait for a message from one of many mailboxes using select (like CSP's alt [23]). Rudimentary I/O throttling is provided in the form of bounded queue sizes (the default is unbounded), and the caller of Mailbox.put() is suspended if the queue is full (which is why reply() must be marked as pausable in the example). The isolation qualifier @free on the reply() method's parameter is a contract between the caller (execute()) and the callee. The weaver checks that the caller supplies an object with a free capability to the callee and subsequently does not use any local variables pointing to or into the message. In turn, reply cedes
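Kilim's Mailbox API is not reproduced in full here; the following rough sketch of the get/put variants described above is built on java.util.concurrent, with illustrative method names. Real Kilim mailboxes suspend actors cooperatively rather than blocking kernel threads:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Minimal sketch of a type-specific, thread-safe mailbox with blocking,
// timed-blocking and non-blocking get/put, in the spirit of the text.
// Names and the bounded-queue back-pressure are illustrative only.
class SimpleMailbox<T> {
    private final BlockingQueue<T> q;

    SimpleMailbox(int capacity) { q = new ArrayBlockingQueue<>(capacity); }

    void put(T msg) throws InterruptedException { q.put(msg); }  // blocks if full
    T get() throws InterruptedException { return q.take(); }     // blocks if empty
    boolean putnb(T msg) { return q.offer(msg); }                // non-blocking
    T getnb() { return q.poll(); }                               // non-blocking
    T get(long millis) throws InterruptedException {             // timed-blocking
        return q.poll(millis, TimeUnit.MILLISECONDS);
    }
}
```

A bounded capacity makes the non-blocking put fail (and the blocking put suspend) once the queue is full, which is the back-pressure behaviour described above.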
FuncDcl ::= free_opt m( p:α ) { (lb : Stmt)* ; }
Stmt ::= x := new | x := y | x := y.f | x.f := y | x := cut(y.f)
       | x := y[·] | x[·] := y | x := cut(y[·])
       | return x | x := m( y ) | if/goto lb

x, y, p ∈ variable names    f ∈ field names    lb ∈ label names
m ∈ function names          sel ∈ field names ∪ {[·]}
[·] is the pseudo field name for array access
α, β ∈ isolation qualifiers {free, cuttable, safe}
null is treated as a special read-only variable

Fig. 3. Core syntax. All expressions are in A-normal form. Variables not appearing in the parameter list are assumed to be local variables.
all rights to the message after calling the mailbox’s put() method (because the latter too has a @free annotation on its formal parameters). The transform() method does not require its supplied arguments to be free. This means that execute() is permitted to use the message object after transform() returns. Note also that transform() is not marked with @pausable, which guarantees us that it does not call any other pausable methods.
3 Core Language
Fig. 3 shows our core syntax, a Java-like intra-procedural language. The language is meant for the isolation-checking phase only; it focuses solely on message types, and its statements have a bearing on variable and heap aliasing only. We confine ourselves to purely intra-procedural reasoning for speed, precision and localising the effect of changes to code (whole-program analyses sometimes show errors in seemingly unrelated pieces of code). Primitive fields and normal Java objects, while tracked for the CPS transformation phase, are not germane to the issue of isolation checking. A program in this language is already in A-normal form (all intermediate expressions named).

Isolation Qualifiers and Capabilities. We mentioned earlier that isolation qualifiers (α, β) are specified in the form of annotations on method parameters and return values. Like normal types, they represent the capabilities of the arguments expected (an object must be at least as capable). Internally to the method, the qualifiers represent the initial capability for each parameter object; the object's capability may subsequently change (unlike its Java type). Other objects' capabilities are inferred by a data-flow analysis (Sec. 5). In all cases, we enforce the invariant that there can be at most one heap pointer to any message object. The list below informally describes object capabilities (Fig. 8 has the precise semantics). It bears repeating that they reflect a lattice composed of two boolean properties: whether or not the object is a root node, and whether or not its pointer-valued fields are assignable (structurally modifiable).

free: The free capability is granted to the root of a tree by new and by cut, and to a method parameter marked as @free. A free object is guaranteed to be a root, but not vice versa. It is field-assignable to another non-safe object and can be used as an argument to a method with any qualifier.

cuttable: This capability is granted to an object obtained via a field lookup of another non-safe object, from downgrading a free object by assigning it to a field of another (it is no longer a root), and to a method parameter marked @cuttable. This capability permits the object to be cut, but not to be assigned to another object (because it is not necessarily a root). This capability is transitive: an object is cuttable if its parent is.

safe: The safe capability is granted to a method parameter marked @safe or (transitively) to any object reachable from it. A safe object may not be structurally modified, further heap-aliased, or sent to another actor.

The qualifiers on method parameters impose the following interface contracts on callers and callees:

@free: This allows the method to treat the parameter (transitively, the entire tree rooted there) as it sees fit, including sending it to another actor. The type system ensures that the caller of the method supplies a free argument, and subsequently forbids the use of all local variables that may point to any part of the tree (reachable from the argument).

@cuttable: The caller must assume that the corresponding object may be cut anywhere, and must therefore forget about all local variables that are reachable from the argument (because the objects they refer to could be cut off and possibly sent to another actor).

@safe: The caller can continue to use a message object (and all aliases into it) if it is passed to a @safe parameter. The callee cannot modify the structure.
The cut operator severs a subtree from its (cuttable) parent thus:

y = cut(x.sel)   =def   y = x.sel; x.sel = null;
Crucially, and in addition, it marks y as free; ordinarily, performing the two operations on the right hand side would only mark y as cuttable. The cut operator works identically on fields and arrays. Because it is a single operation and because messages (and their array-valued components) are tree-structured by construction, the subtree can be marked free. Remark 1. The most notable aspect of this calculus is that we amplify the requirement that at most one actor owns a given message into the stronger one that at most one dynamically active method stack frame may refer to a free message. This is justified by the requirements that (i) a free object is a root object and (ii) the rules on passing it to a method expecting a @free parameter cause all local variables pointing to it to be marked inaccessible. Therefore
inter-actor communication primitives of the form send and receive are treated as simple method calls; in other words, all that is required of an inter-actor messaging facility like the mailbox is that it annotate its parameters and return values (for send and receive operations respectively) with @free, thereby trivially isolating the intricacies of inter-actor and inter-thread interaction, the Java memory model, serialization, batched I/O, scheduling, etc.

Remark 2. One could readily add an intermediate qualifier between @cuttable and @safe, say @cutsafe, which permits all modifications except cutting. That is, it could allow additions to the tree and nullification, but not extraction via cut for possible transfer of ownership.

In addition to matching object capabilities with isolation qualifiers on method parameters, Kilim enforces a rule to eliminate parameter-induced aliasing: arguments to a method must be pairwise disjoint (trees may not overlap) if any one of them is non-safe, and the return value, if any, must be free and disjoint from the input parameters.

3.1 Why Qualifiers on Variables Are Not Enough
One might hope that a simple type system à la PacLang [17] can be created by associating variables of Message type with isolation type qualifiers, which change with the program point. However, such type systems do not take relationships between variables into account. For example, if we know that x and y are aliases, or that y points within the structure rooted at x, then passing x to a method accepting a free message (e.g. Mailbox.put()) must result in not only x but also y being removed from the variables accessible in the scope of the actor. In other words, while it is convenient to think of variables as having a qualifier such as @free, it is really the objects that have such a qualifier. We need to analyse methods to infer variable dependencies; the next two sections expand on this subject. We split isolation checking into two phases for exposition, although the implementation performs them pointwise on the control flow graph. These two phases are covered in Sec. 4 and Sec. 5.
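The point can be illustrated with a toy sketch (hypothetical, not the weaver's code): when x is sent, every variable bound to the same tree must become inaccessible, not just x itself:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: qualifiers belong to objects, so sending x must also
// invalidate any variable aliasing the same tree. Kilim derives the set of
// dependant variables from its heap graph; here we use identity equality.
class Scope {
    Map<String, Object> vars = new HashMap<>();   // variable -> tree root
    Set<String> inaccessible = new HashSet<>();

    // Mark every variable bound to the same object as x inaccessible.
    void send(String x) {
        Object tree = vars.get(x);
        for (Map.Entry<String, Object> e : vars.entrySet())
            if (e.getValue() == tree) inaccessible.add(e.getKey());
    }
}
```

A real analysis must also invalidate variables pointing *into* the tree, which is exactly what the heap graph of Sec. 4 makes possible.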
4 Heap Graph Construction
A program may create an unbounded set of message objects at run-time. A compile-time analysis of such a program requires that we first create an abstract model of the heap, called a heap graph. Each node of this (necessarily finite) graph represents a potentially infinite set of run-time objects that have something in common with each other at a given program point, and different heap analyses differ on the common theme that binds the objects represented by the node. We base our heap graph abstraction on a simple variant of shape analysis [37]; we claim no novelty. Our contribution is the set of design choices (isolation qualifiers, tree-structure, local analysis, the cut operator) that make the problem
G : ⟨L, E⟩ — a heap graph is a pair of local-variable info L and edges E
L ∈ P(Var × LNode) — L is a relation between local variable names and nodes (LNode is logically the nodes of the graph)
E ∈ P(Node × sel × Node) — E is a set of Node–Node edges labelled with field names
LNode ∈ P(Var) — a heap graph node; in this formalism the name of the node consists of the set of local variable names that may point to it. Well-formedness: ⟨x, N⟩ ∈ L ⇔ x ∈ N
Node ∈ P(Var) ∪ {∅} — labelled nodes plus the summary node
Convenience: L(x) =def { N | ⟨x, N⟩ ∈ L } — the set of LNodes to which a local variable might point

Fig. 4. Heap Graph formalism following [37]
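Fig. 4's formalism transcribes almost directly into code. The following sketch (illustrative only) represents nodes by their label sets and checks the edge-disjointness invariant, i.e. that no edge connects two nodes whose labels overlap:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of Fig. 4's heap graph: a node is the set of variable names that
// may point to it; edges are labelled triples. Set.of() plays the summary
// node with label ∅. Not the weaver's actual representation.
class HeapGraph {
    record Edge(Set<String> src, String sel, Set<String> tgt) {}

    Set<Set<String>> nodes = new HashSet<>();
    Set<Edge> edges = new HashSet<>();

    // L(x): the set of nodes to which local variable x might point
    // (well-formedness: <x, N> in L iff x is in N's label).
    Set<Set<String>> L(String x) {
        Set<Set<String>> res = new HashSet<>();
        for (Set<String> n : nodes)
            if (n.contains(x)) res.add(n);
        return res;
    }

    // Invariant: no edge may connect two nodes with overlapping labels.
    boolean invariantHolds() {
        for (Edge e : edges) {
            Set<String> overlap = new HashSet<>(e.src());
            overlap.retainAll(e.tgt());
            if (!overlap.isEmpty()) return false;
        }
        return true;
    }
}
```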
simpler and faster to reason about; it is a shape-enforcement problem rather than a general analysis problem.

A heap graph G (see Fig. 4) is a pair ⟨L, E⟩; L is the set of associations between variable names and nodes, and E represents the set of labelled edges between nodes. A node may be pointed to by more than one variable and is identified by a label that is merely the set of variable names pointing to it (a reverse index). Fig. 5 shows example heap graphs at two program points. The sample heap graph at l1 is represented algebraically as follows⁵:

L = { ⟨a, {a}⟩, ⟨b, {b,d}⟩, ⟨d, {b,d}⟩, ⟨c, {c,d}⟩, ⟨d, {c,d}⟩, ⟨e, {e}⟩ }
E = { ⟨{a}, f, {b,d}⟩, ⟨{a}, f, {c,d}⟩, ⟨{b,d}, g, {e}⟩, ⟨{c,d}, g, {e}⟩ }

    a = new; b = new; c = new
    if ...
        a.f = b
        d = b
    else
        a.f = c
        d = c
    e = d.g       // program point l1
    d = null
    b.g = null    // program point l2

[Graphs (edges only): at l1, {a} –f→ {b,d}, {a} –f→ {c,d}, {b,d} –g→ {e}, {c,d} –g→ {e}; at l2, {a} –f→ {b}, {a} –f→ {c}, {c} –g→ {e}.]

Fig. 5. Sample heap graphs at l1 and l2. Only edges E are shown; L is implicit.

The common theme among run-time objects represented by a shape-analysis node is that they are all referred to by the set of variables in the node's label, at that program point, for any given run of the program – a node is an aliasing configuration. In addition to the labelled nodes mentioned thus far, there is one generic summary node with the special label ∅ that represents all heap objects not directly referred to by a local variable. When a node ceases to be pointed to by any variable, its label set becomes empty and it is merged with the summary node (hence ‘∅’ – by analogy with the empty-set symbol). Note that edges originate or end in labelled nodes only; the heap graph does not know anything about the connectivity of anonymous objects (inside the ∅ node).

The most important invariant in heap graph construction is that there cannot be an edge between two nodes whose labels are not disjoint. Without the invariant, an edge such as ⟨{x, y}, f, {x, u}⟩ would represent the following impossible situation: x and y point to the same set of run-time objects (at that program point, on any run of the program). These objects in turn are connected to another bunch of objects, referred to by x and u. This is clearly not possible, because x's objects would have both an outgoing and an incoming edge while its aliasing partners (y and u) have only one or the other. Non-disjoint alias sets can coexist in the graph, as long as they do not violate this invariant.

Given the control flow graph CFG mentioned earlier, we use the following equations to construct the heap graph G after every program point. The analysis is specified in terms of an iterative forward flow performed on the lattice ⟨G, ⊆⟩. We merge the heap graphs at control-flow join points to avoid exponential growth in the set of graphs (like [37], unlike [29]). This means all transfer functions operate on a single heap graph (rather than a set of graphs).

G^init_out = ⟨{}, {}⟩
G^l_in = ⋃ { G^l′_out | (l′, l) ∈ CFG }
G^l_out = ⟦·⟧(G^l_in)

The second equation merges the graphs from the CFG node's incoming edges (a simple set union of node and edge sets). ⟦·⟧ represents the transfer functions for each CFG node (Fig. 6). Note that if/goto and return do not have transfer functions; they are turned into edges of the CFG. The transfer functions are simpler than the ones in shape analysis because they do not deal with sharing (attempts to share are faulted in the isolation-checking phase). Note that the heap graph may have nodes with multiple incoming edges, but each such edge is a may-alias edge, not an edge that induces sharing. The node labelled e in Fig. 5 represents two disjoint sets of run-time objects, one of which has incoming edges from the {b, d} set of objects and the other from {c, d}.

The transfer function for x := y.f deserves some attention. It associates x with all nodes T pointed to by y.f, which may or may not have been created as yet by the analysis procedure. Fig. 7 covers both possibilities.

⁵ Parallels to shape analysis [37]: G is their static shape graph, L is Ev with a layer of subscripting removed; we write ⟨y, {x, y, z}⟩ for their ⟨y, n_{x,y,z}⟩.
In the case where a node does not exist, it is treated as a discrete blob inside the summary node, implicitly referred to by y.f (the grey region in Fig. 7). In
Notation: V (any Node), S (source Node), T (target Node)

S_x =def S ∪ {x}
S_x^y =def S ∪ {x} if y ∈ S, S otherwise

kill(G, x):        L′ = { ⟨v, V \ {x}⟩ | ⟨v, V⟩ ∈ L ∧ v ≠ x }
                   E′ = { ⟨S \ {x}, sel, T \ {x}⟩ | ⟨S, sel, T⟩ ∈ E }

⟦entry(mthd)⟧G:    L′ = ⋃_i { ⟨p_i, {p_i}⟩ } where p_i is the i-th parameter of mthd;  E′ = {}

⟦x := new⟧G:       ⟨L″, E″⟩ = kill(G, x);  L′ = L″ ∪ { ⟨x, {x}⟩ };  E′ = E″

⟦x := y⟧G:         ⟨L″, E″⟩ = kill(G, x);  L′ = { ⟨v, V_x^y⟩ | ⟨v, V⟩ ∈ L″ };  E′ = { ⟨S_x^y, sel, T_x^y⟩ | ⟨S, sel, T⟩ ∈ E″ }

⟦x.f := y⟧G:       E″ = E \ { ⟨S, f, ∗⟩ ∈ E | x ∈ S };
                   E′ = E″ if y ≡ null, otherwise E″ ∪ { ⟨S, f, T⟩ | x ∈ S ∧ y ∈ T };  L′ = L

⟦x[·] := y⟧G:      E′ = E if y ≡ null, otherwise E ∪ { ⟨S, [·], T⟩ | x ∈ S ∧ y ∈ T };  L′ = L

⟦x := y.sel⟧G:     ⟨L″, E″⟩ = kill(G, x);
                   L′ = L″ ∪ { ⟨t, T_x⟩ | ⟨t, T⟩ ∈ L″ ∧ ⟨y, S⟩ ∈ L″ ∧ ⟨S, sel, T⟩ ∈ E″ }
                          ∪ { ⟨x, T_x⟩ | ⟨y, S⟩ ∈ L″ ∧ ⟨S, sel, T⟩ ∈ E″ }
                   E′ = (E″ \ { ⟨y, sel, ∗⟩ ∈ E″ }) ∪ { ⟨y, sel, T_x⟩ | ⟨y, sel, T⟩ ∈ E″ } ∪ { ⟨T_x, sel, U⟩ | ⟨T, sel, U⟩ ∈ E″ }

⟦x := cut(y.sel)⟧ = ⟦y.sel := null⟧ ∘ ⟦x := y.sel⟧

⟦x := m(v̄)⟧G:      ⟨L″, E″⟩ = kill(G, x);  L′ = L″ ∪ { ⟨x, {x}⟩ };  E′ = E″

Fig. 6. Transfer functions ⟦·⟧ for heap graph construction. They transform G : ⟨L, E⟩ to G′ : ⟨L′, E′⟩. ‘∗’ represents wildcards and sel represents field and array access.
this case, the node is materialized [37] out of the summary node and all edges outgoing from that node are replicated and attached to the newly materialized node. This replication is necessary because we do not have precise information about which portion of the anonymous heap (represented by the summary node) is responsible for the outgoing edges (the grey blob, or the non-grey portion). Note that we do not have to replicate the incoming edges because we know that nodes are not shared and that the newly materialized node is already pointed to by the y.f edge.
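The materialization step just described can be sketched as follows (an illustration of the idea, not the weaver's implementation): the fresh node {x} receives a copy of every edge leaving the summary node, while incoming edges are not replicated:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: materialize node {x} as the target of y.f when y.f's target is
// still anonymous. We replicate only the summary node's OUTGOING edges,
// because we cannot tell which part of the anonymous heap they belong to;
// incoming edges need not be replicated, since nodes are unshared and the
// new node is reached only via y.f. Illustrative names throughout.
class Materialize {
    record Edge(Set<String> src, String sel, Set<String> tgt) {}
    static final Set<String> SUMMARY = Set.of();   // the ∅ node

    static Set<Edge> apply(Set<Edge> edges, Set<String> yNode, String f, String x) {
        Set<String> fresh = Set.of(x);
        Set<Edge> out = new HashSet<>(edges);
        out.add(new Edge(yNode, f, fresh));              // y.f now points at {x}
        for (Edge e : edges)
            if (e.src().equals(SUMMARY))                 // replicate summary's
                out.add(new Edge(fresh, e.sel(), e.tgt()));  // outgoing edges
        return out;
    }
}
```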
Shape analysis provides strong nullification and disjointness [37], as illustrated in Fig. 5 by the transition from heap graph at l1 to that of l2 . Unfortunately, shape analysis cannot do the same for arrays: setting “x[i] = y” tells us nothing at all about x[j]. However, cut performs strong nullification even on arrays, because our type system ensures that the array’s components are disjoint both mutually and from the variable on the right hand side. Remark 3. There is an important software engineering reason for having cut, instead of relying on shape analysis to inform us about disjointness: we want to make explicit in the code the act of cutting a branch from the tree and giving the subtree a free capability. Most methods do not need to cut; they can have the default @safe qualifier, which allows them to (transitively) modify the arguments, but not cut or send the object.
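The operational reading of cut can be sketched as a destructive read (Kilim checks and marks this statically in the weaver; there is no such library call):

```java
// Illustrative model of cut on a tree-shaped message:
//   y = cut(x.left)  ==  y = x.left; x.left = null;
// and, crucially, y is thereafter a root, hence 'free'.
class Node {
    Node left, right;   // tree-shaped: at most one parent per node
}

class CutDemo {
    static Node cutLeft(Node x) {
        Node y = x.left;
        x.left = null;   // sever the branch: x no longer reaches y
        return y;        // y is the root of a detached subtree
    }
}
```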
[Figure omitted]

Fig. 7. Example heap graph before and after transformation by x := y.f. Double lines show the newly materialized node and edge. The grey blob is the portion of the anonymous heap that is the implicit target of y.f.
5 Isolation Capability Checking
Having built heap graphs at every program point, we now associate each labelled node n in each heap graph with a capability κ(n), as mentioned earlier. All run-time objects represented by n implicitly have the same capability. Fig. 8 shows the monotone transfer functions operating over the capability lattice in a simple forward-flow pass. At CFG join points, the merged heap graph's nodes are set to the minimum of the capabilities of the corresponding nodes in the predecessor heap graphs (in the CFG). For example,

a = new        // κ(a) := free
if ...
    b.f = a    // κ(a) := cuttable
               // join point: κ(a) := min(free, cuttable)
send(a)        // ERROR: κ(a) is not free
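The join rule in the example can be sketched as a pointwise minimum over the capability order (illustrative names; ⊥ stands for variables made inaccessible):

```java
// Sketch of join-point merging: capabilities are combined with min over
// the order free > cuttable > safe > ⊥. Enum constants are declared in
// decreasing order, so a larger ordinal means a lower capability.
enum Cap {
    FREE, CUTTABLE, SAFE, BOTTOM;

    static Cap min(Cap a, Cap b) {
        return a.ordinal() >= b.ordinal() ? a : b;
    }
}

class SendCheck {
    // send(a) requires κ(a) = free (the mailbox's put has a @free parameter).
    static boolean canSend(Cap k) { return k == Cap.FREE; }
}
```

Replaying the example: κ(a) is free on one path and cuttable on the other, so the join yields cuttable and the subsequent send is rejected.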
Assumption 1: the current method's signature is free mthd(p̄ : ᾱ).
Assumption 2: E and L used (e.g.) in dependants result from the heap graph analysis for the current instruction.

⟦entry(mthd)⟧κ:    κ′ = [ p̄ ↦ ᾱ ]

⟦x := new T⟧κ:     κ′ = κ[x ↦ free]

⟦x := y⟧κ:         κ′ = κ[x ↦ κ(y)]

⟦x.f := y⟧κ:       precondition: κ(y) = free
                   κ′ = κ[y ↦ cuttable]

⟦x := y.f⟧κ:       κ′ = κ[x ↦ s], where s = safe if κ(y) = safe, and s = cuttable if κ(y) ∈ {free, cuttable}

⟦x := m(ȳ)⟧κ:      precondition: β_i ⊑ κ(y_i) ∧ (∀i ≠ j)(disjoint(y_i, y_j) ∨ β_i = β_j = safe)
                   κ′ = κ[ dependants(y_i) ∪ {y_i} ↦ ⊥, if β_i = free ]
                         [ dependants(y_i) ↦ ⊥, if β_i = cuttable ]
                         [ x ↦ free ]
                   (assumption: m's signature is free m(β̄); the return value is always free)

⟦x := cut(y.f)⟧κ:  precondition: κ(y) ∈ {free, cuttable}
                   κ′ = κ[x ↦ free]

⟦return x⟧κ:       precondition: κ(x) = free ∧ ∀i (α_i = cuttable ⇒ disjoint(x, p_i))
                   κ′ = κ (no change)
where: κ(n) : LNode → Capability gives the Capability associated with a node n ∈ LNode (Capability,

05  ref_folder := (ref<SharedFolder>) ref_object;
06  SharedFolder folder := dereference(ref_folder);   // creates a proxy
07  external folder_ep := endpoint("folder");
08  internal my_ep := new_endpoint();
09  my_ep.AddedElement += ...;   // here's code that registers an event handler
10  connection my_connection := connect(folder_ep, my_ep);
11  // some code to store the newly created proxy and endpoint connection references
    }
}
A. References to Live Objects. Operations that can be performed on these references include reflection (inspecting the referenced object's type), casting, and dereferencing (example uses are shown in Code 1, in lines 03, 05, and 06 respectively). Dereferencing results in the local runtime launching a new proxy of the referenced object (recall from Section 3.1 that references include complete instructions for how to do this). The proxy starts executing immediately, but its endpoints are disconnected. A reference to the new proxy is returned to the caller (in our example it is assigned to a local variable folder). This reference controls the proxy's lifetime. When it is discarded and garbage collected, the runtime disconnects all of the proxy's endpoints and terminates it. To prevent this from happening, in our example code we must store the proxy reference before exiting (we would do so in line 11). Whereas a proxy must have a reference to it to remain active, a reference to a live object is just a pointer to a recipe for constructing a proxy for that object, and can be discarded at any time. An important property of object references is that they are serializable, and may be passed across the network or process boundaries between proxies of the same or even different live objects, as well as stored in a file, etc. The reference can be dereferenced anywhere in the network, always producing a functionally equivalent proxy – assuming, of course, that the node on which this occurs is capable of running the proxy. In an ideal world, the environmental constraints would permit us to determine whether a proxy actually can be instantiated in a given setting, but the world is obviously not ideal. Determining whether a live object can be dereferenced in a given setting, without actually doing so, is probably not possible. The types of live object references are based on the types of live objects, which we will define formally below.
To avoid ambiguity, if Θ is a live object type, and x is a reference to an object of type Θ, we will write ref<Θ> to refer to the type of entity x. The semantics of casting live object references is similar to that for regular objects. Recall that if a regular reference of type IFoo points to an object that implements IBar, we can cast the reference to IBar even if IFoo is not a subtype of IBar, and while as a
Programming with Live Distributed Objects
475
result the type of the reference will change, the actual referenced object will not. In a similar manner, casting a live object reference of type ref to some ref produces a reference that has a different type, and yet dereferencing either of these references, the original one or the one obtained by casting, result in the local runtime creating the same proxy, running the same code, with the same endpoints. A reference can be cast to ref for as long as the actual type of the live object is a subtype of Θ. B. References to Proxies. The type of a proxy reference is simply the type of the object it runs, i.e. if the object is of type Θ, references to its proxies are of type Θ. Proxy references can be type cast just like live object references. One difference between the two constructs is that proxy references are local and can’t be serialized, sent, or stored. Another difference is that they have the notion of a lifetime, and can be disposed or garbage collected. Discarding a proxy reference destroys the locally running proxy, as explained earlier, and is like assigning null to a regular object reference in a language like Java. The live object is not actually destroyed, since other proxies may still be running, but if all proxy references are discarded (and proxies destroyed), the protocol ceases to run, as if it were automatically garbage collected. Besides disposing, the only operation that can be performed on a proxy reference is accessing the proxy endpoints for the purpose of connecting to the proxy. An example of this is seen in line 07, where we request the proxy of the shared folder object to return a reference to its local instance of the endpoint named “folder”. C. References to Endpoint Instances. There are two types of references to endpoint instances, external and internal. An external endpoint reference is obtained by enumerating endpoints of a proxy through the proxy reference, as shown in line 07. 
The only operation that can be performed with an external reference is to connect it to a single other, matching endpoint (line 10). After connecting successfully, the runtime returns a connection reference that controls the connection's lifetime. If this reference is disposed, the two connected endpoints are disconnected, and the proxies that own them are notified via explicit disconnect events. An internal endpoint reference is returned when a new endpoint is programmatically created using operator new (line 08). This is typically done in the constructor code of a proxy. Each proxy must create an instance of each of the object's endpoints in order to be able to communicate with its environment. The proxy stores the internal references to each of its endpoints for private use, and provides external references to outside code on request, when its endpoints are being enumerated. Internal references are also created when a proxy needs to dynamically create a new endpoint, e.g. to interact with a proxy of some subordinate object that it has dynamically instantiated. An internal reference is a subtype of an external reference. Besides allowing connection to other endpoints, it also provides a "portal" through which the proxy that created it can send events to, and receive events from, other connected proxies. Sending is done simply by method calls, and receiving by registering event callbacks (line 09). An important difference between external and internal endpoint references is that the former can be serialized, passed across network and process boundaries, and then connected to a matching endpoint in the target location. The runtime can implement this, e.g., by establishing a TCP connection to pass events back and forth between proxies communicating this way. This is possible because events are serializable.
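The connection semantics just described (connect two matching endpoints, get back a connection reference, dispose it to disconnect and notify both sides) can be sketched as follows. All names here are illustrative assumptions, not the platform's real API.

```java
// Hypothetical sketch of endpoint connections: connecting two matching endpoints
// yields a connection reference; disposing that reference disconnects both
// endpoints and fires their registered disconnect callbacks.
import java.util.ArrayList;
import java.util.List;

public class Endpoint {
    private final String type;   // matching is modeled here as equality of type names
    private final List<Runnable> onDisconnect = new ArrayList<>();
    private boolean connected = false;

    public Endpoint(String type) { this.type = type; }

    public boolean isConnected() { return connected; }

    // Proxies register callbacks to learn about disconnection events.
    public void registerDisconnectCallback(Runnable cb) { onDisconnect.add(cb); }

    // Connect this endpoint to a single other, matching endpoint.
    public Connection connect(Endpoint other) {
        if (!type.equals(other.type) || connected || other.connected)
            throw new IllegalStateException("endpoints do not match or are busy");
        connected = true;
        other.connected = true;
        return new Connection(this, other);
    }

    public static class Connection implements AutoCloseable {
        private final Endpoint a, b;
        Connection(Endpoint a, Endpoint b) { this.a = a; this.b = b; }

        // Disposing the connection reference disconnects both endpoints
        // and notifies the proxies that own them.
        @Override public void close() {
            a.connected = false;
            b.connected = false;
            a.onDisconnect.forEach(Runnable::run);
            b.onDisconnect.forEach(Runnable::run);
        }
    }

    public static void main(String[] args) {
        Endpoint x = new Endpoint("folder"), y = new Endpoint("folder");
        Connection c = x.connect(y);
        System.out.println(x.isConnected());
        c.close();
        System.out.println(x.isConnected());
    }
}
```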
476
K. Ostrowski et al.
Internal endpoint references are not serializable. This is crucial, for it provides isolation. Since any interaction between objects must pass through endpoints, and events exchanged over endpoints must be serializable, this ensures that an internal endpoint reference created by a proxy cannot be passed to other objects, or even to other proxies of the same object. Only the proxy that created an endpoint has access to the endpoint's portal functionality, and can send or receive events with it. D. References to Connections. Connection references control the lifetime of connections. Besides disposing, the only functionality they offer is to register callbacks, to be invoked upon disconnection. These references are not strongly typed. They may be created either programmatically (as in line 10 in Code 1), or by the runtime during the construction of a composite proxy. The latter is discussed in detail in Section 4.2. E. Template Object References. Template references are similar to generics in C# or templates in C++. Templates are parameterized descriptions of proxies; when dereferencing them, their parameters must be assigned values. Template types do not support subtyping, i.e. references of template types cannot be cast or assigned to references of other types. The only operation allowed on such references is conversion to non-template references by assigning their parameters, as described in Section 4.2. Template object references can be parameterized by other types and by values. The types used as parameters can be object, endpoint, or event types. Values used as parameters must be of serializable types, just like events, but otherwise can be anything, including string and int values, live object references, external endpoint references, etc. Example (c). A channel object template can be parameterized by the type of messages that can be transmitted over the channel. Hence, one can e.g.
define a template of a reliable multicast stream and instantiate it to a reliable multicast stream of video frames. Similarly, one can define a template dissemination protocol based on IP multicast, parameterized with the actual IP multicast address to use. A template shared folder containing live objects could be parameterized by the type of objects that can be stored in the folder and the reference to the replication object it uses internally. F. Casting Operator Extensions. This is a programmable reflection mechanism. Recall that in C# and C++, one can often cast values to types they do not derive from. For example, one can assign an integer value to a floating-point type. Conversion code is then automatically generated and injected into this assignment, and one can define custom casting operators to be used in such situations. Our model also supports this feature. If an external endpoint or an object reference is cast to a mismatching reference type, the runtime can try to generate a suitable wrapper. Example (d). Consider an application designed to use encrypted communication. The application has a user interface object u exposing a channel endpoint, which it would like to connect to a matching endpoint of an encrypted channel object. But suppose that the application has a reference to a channel object c that is not encrypted, and that exposes a channel endpoint of a type lacking the required security constraints. When the application tries to connect the endpoints of u and c, normally the operation would fail with a type mismatch exception. However, if the channel endpoint of c can be made compatible with the endpoint of u by injecting encryption code into the connection, the compiler or the runtime might generate such wrapper code instead. Notice
that proxies for this wrapper would run on all nodes where the channel proxy runs, and hence could implement fairly sophisticated functionality. In particular, they could implement an algorithm for secure group key replication. In effect, we are able to wrap the entire distributed object: an elegant example of the power of the model. The same can be done for object references. While casting a reference, the runtime may return a description of a composite reference that consists of the old proxy code, plus the extra wrapper, to run side by side (we discuss composite references in Section 4.2). In addition to encryption or decryption, this technique could be used to automatically inject buffering code, code that translates between push and pull interfaces, code that persists or orders events, code that automatically converts event data types, and so on. Currently, our platform uses casting only to address certain kinds of binary incompatibilities, as explained in Section 5.2. In future work, we plan to extend the platform to support more sophisticated uses of casting, e.g. as in the example above, and to define rules for choosing casting operators when more than one is available.

4.2 Construction and Composition

As noted in Section 4.1, a live object exists if references to it exist, and it runs if any proxies constructed from these references are active. Creating new objects thus boils down to creating references, which are then passed around and dereferenced to create

Code 2. An example live object reference, based on a shared document template, parameterized by a reliable communication channel. The channel is composed of a dissemination object and a reliability object, connected to each other via their "UnreliableChannel" endpoints, much like r and u in Figure 2. The "ReliableChannel" endpoint of the reliability object is exposed by the channel.
The dissemination object reference is to be found as an object named "MyChannel", of type "Channel", in an online repository ("Id" and "Channel" are predefined types). The reference to the repository is to be found, as an object named "QuickSilver", of type "Folder", i.e. containing channels, in another online repository: the "registry" object (see Section 4.2).

01 parameterized object // an object based on a parameterized template
02    using template primitive object 3 // the id of a "shared document" template
03 {
04    parameter "Channel" :
05       composite object // a complex object built from multiple component objects
06       {
07          component "DisseminationObject" :
08             external object "MyChannel" as Channel from
09                external object "QuickSilver" as Folder from
10                   primitive object 2 // the id of a predefined "registry" object
11          component "ReliabilityObject" :
12             ... // specification of some loss recovery object, omitted for brevity
13          connection // an internal connection between a pair of component endpoints
14             endpoint "UnreliableChannel" of "DisseminationObject"
15             endpoint "UnreliableChannel" of "ReliabilityObject"
16          export // endpoints of the components to be exposed by the composite object
17             endpoint "ReliableChannel" of "ReliabilityObject"
18       }
19 }
478
K. Ostrowski et al.
running applications. Object references are hierarchical: references to complex objects are constructed from references to simpler objects, plus the logic to "glue" them together. The construction can follow four patterns, for composite, external, parameterized, and primitive objects. We shall now discuss these, illustrating them with an example object reference that uses each of the patterns, shown in Code 2. A. Composite References. A composite object consists of multiple internal objects, running side by side. When such an object is instantiated, the proxies of the internal objects run on the same nodes (like objects r and u in Figure 2). A composite proxy thus consists of multiple embedded proxies, one for each of the internal objects. A composite reference contains an embedded reference for each of the internal proxies, plus the logic that glues them together. In the example reference shown in lines 05 to 18 in Code 2, there is a separate section "component name : reference" for each of the embedded objects, specifying its internal name and reference. This is followed by a section of the form "connection endpoint1 endpoint2" for each internal connection. Finally, for every endpoint of an embedded internal object that is to be exposed by the composite object as its own, there is a separate section "export endpoint". When a proxy is constructed from a composite reference, the references to the internal proxies and connections are kept by the composite proxy, and discarded when the composite proxy is disposed of (Figure 3). The lifetimes of all internal proxies are thus tied to the lifetime of the composite. Embedded objects and their proxies thus play a role analogous to member fields of a regular object. B. External References. An external reference is one that has not been embedded and must be downloaded from somewhere.
It is of the form "external object name as type from reference", where reference is a reference to a live object that represents some online repository containing live object references, and name is the name of the object whose reference is to be retrieved from this repository. The type Θ of the retrieved object is expected to be a subtype of type, and the type of the external reference is ref⟨type⟩. One example of such a reference is shown in lines 08 to 10, and another (embedded in the first one) in lines 09 to 10. The repository could be any object of type Θ ≤ folder, where folder is a built-in type of objects with a simple dictionary-like interface. Objects of this type have an endpoint with an input event get(n) and with output events item(n, r) and missing(n). To retrieve an external reference, the runtime creates a repository object proxy from the embedded reference, runs it, connects to its folder endpoint, submits the get event, and awaits a response. Once the response arrives, the repository proxy can be discarded. The "as type" clause allows the runtime to statically determine the type of the reference without having to engage in any protocol. In the case of composite, parameterized, or primitive references, the runtime can derive the type directly from the description. The "as type" clause can still be used in these other categories of references as an explicit type cast, in case it is necessary, e.g., to hide some of the object's endpoints. The types in the reference (such as Channel in line 08 or Folder in line 09) could either refer to the standard, built-in types, or they could be described
explicitly using a language based on the formalisms in Section 3.2. To keep our example simple, we assume that all types are built-in, and we refer to them by names. C. Parameterized References. These references are based on the template objects introduced in Section 4.1. They include a section "using template reference", where reference is an embedded template object reference, and a list of assignments of parameter values, each in a separate section of the form "parameter name : argument", where the argument can be a type description or a primitive value, e.g. an embedded object reference. For example, the reference in Code 2 is parameterized with a single parameter, Channel. The type of the parameter needn't be explicitly specified, for it is determined by the template. In our example, the template expects a live object reference to a reliable communication channel. The specific reference used here to instantiate the template is the composite reference in lines 05 to 18. D. Primitive References. The kinds of references mentioned so far provide a means for recursively constructing complex objects from simple ones, but the recursion needs to terminate somewhere. Hence, the runtime provides a certain number of built-in protocols that can be selected by a known 128-bit identifier (lines 02 and 10 in Code 2). Of course, even a 128-bit namespace is finite, and many implementations of the live objects runtime could exist, each offering different built-in protocols. To avoid chaos, we reserve primitive references only for objects that either cannot be referenced using other methods, or where doing so would be too inefficient. We will discuss two such objects: the library template and the registry object.
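The four reference patterns, and the way dereferencing recurses through them until it bottoms out at a primitive reference, can be sketched as follows. The types and the `describeProxy` method are hypothetical stand-ins for the reference descriptions in Code 2-5, not the actual runtime's data model.

```java
// Illustrative sketch of the four reference patterns (composite, external,
// parameterized, primitive). Resolving a reference recurses through embedded
// references and terminates at primitive (built-in protocol) references.
import java.util.List;
import java.util.Map;

public interface Ref {
    String describeProxy();  // what proxy would be constructed from this reference

    // Primitive: a built-in protocol selected by a well-known identifier.
    record Primitive(String id) implements Ref {
        @Override public String describeProxy() { return "builtin:" + id; }
    }

    // External: a named reference to be fetched from a repository object.
    // The repository's contents are modeled here as a simple map.
    record External(String name, Map<String, Ref> repository) implements Ref {
        @Override public String describeProxy() {
            return repository.get(name).describeProxy();
        }
    }

    // Parameterized: a template instantiated with parameter values.
    record Parameterized(Ref template, Map<String, String> params) implements Ref {
        @Override public String describeProxy() {
            return template.describeProxy() + params;
        }
    }

    // Composite: embedded references whose proxies run side by side.
    record Composite(List<Ref> components) implements Ref {
        @Override public String describeProxy() {
            StringBuilder sb = new StringBuilder("composite(");
            for (Ref c : components) sb.append(c.describeProxy()).append(' ');
            return sb.append(')').toString();
        }
    }

    static void main(String[] args) {
        Ref registry = new Primitive("2");
        Ref channel = new External("MyChannel", Map.of("MyChannel", registry));
        System.out.println(new Composite(List.of(registry, channel)).describeProxy());
    }
}
```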
Fig. 3. A live object class diagram for the composite object in Code 2 (left) and the structure of the composite proxy (right). When constructing a composite proxy, the runtime automatically constructs all the internal proxies and the internal connections between them, and stores their references in the composite proxy. Embedded proxies and connections are destroyed together with the composite proxy. The latter can expose some of the internal endpoints as its own.

Code 3. An example live object reference for a custom protocol, implemented in a library that can be downloaded from http://www.mystuff.com/mylibrary.dll. Objects running this protocol are of type "MyType1", and can be found in the library under the name "MyProtocol1". The library template provides the folder abstraction introduced in Section 4.2.

01 external object "MyProtocol1" as MyType1 // my own, custom implementation
02 from parameterized object // an instance of the library template
03    using template primitive object 1 // an id of a built-in library template
04 {
05    parameter "URL" : http://www.mystuff.com/mylibrary.dll
06 }
Fig. 4. An example of a hybrid multicast object m, constructed from two local protocols x, y that disseminate data in two different regions of the network, e.g. two LANs, combined using a tunnel object t that acts as a repeater and replicates messages between the two LANs. Different proxies of the composite object m, running on different nodes, are configured differently, e.g. some use an embedded proxy of object x, while others use an embedded proxy of object y.

Code 4. A portable reference to the "hybrid" object m from Figure 4, built using the registry.

01 external object "MyChannel" as Channel
02    from external object "MyPlatform" as Folder
03       from primitive object 2 // the registry
Code 5. An example of a “proper” use of the registry, to specify a locally configured multicast platform, which could then be used by external references like the one in Code 4. Here, the local instance of the communication platform is configured with the address of a node that controls a region of the Internet, from which other objects can be bootstrapped.
01 parameterized object
02    using template external object "MyPlatform" as Folder
03       from parameterized object // from a binary downloaded from the url below
04          using template primitive object 1 // the library template
05          { parameter "URL" : http://www.mystuff.com/mylibrary.dll }
06    { parameter "LocalController" : tcp://192.168.0.100:60000 }
D.1 Library. A library is an object of type folder, representing a binary containing executable code, from which one can retrieve references to the live objects implemented by the binary. The library template is parameterized by the URL of the location where the binary resides (see Code 3, lines 02 to 06). The binary can be in any of the known formats that allow the runtime to locate proxy code, object and type definitions in it, either via reflection, or by using an attached manifest (we show one example of this in Section 5.2). After a proxy of a library is created, the proxy downloads the binary and loads it. When an object reference retrieved from a library is dereferenced, the library locates the corresponding constructor in the binary, and invokes it to create the proxy. D.2 Registry. The registry object is again a live object of type folder, i.e. a mapping of names to object references. The registry's references are stored locally on each node and can be edited by the user; in general, the mapping on each node may be different. Proxies respond to requests by returning the locally stored references.
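Both the library and the registry expose the dictionary-like folder interface described earlier: an input event get(n) answered by either item(n, r) or missing(n). A minimal sketch of that contract, with callback methods standing in for the runtime's event-based endpoint machinery (the class and method names are assumptions):

```java
// Minimal sketch of the built-in "folder" abstraction: an endpoint with an
// input event get(n) and output events item(n, r) / missing(n). A registry
// would hold a node-local, user-editable map of names to serialized references.
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

public class FolderEndpoint {
    private final Map<String, String> entries = new HashMap<>(); // name -> serialized reference
    private BiConsumer<String, String> onItem = (n, r) -> {};    // output event item(n, r)
    private Consumer<String> onMissing = n -> {};                // output event missing(n)

    // Local configuration, e.g. registry entries edited by the user on this node.
    public void put(String name, String reference) { entries.put(name, reference); }

    public void registerItemCallback(BiConsumer<String, String> cb) { onItem = cb; }
    public void registerMissingCallback(Consumer<String> cb) { onMissing = cb; }

    // Input event get(n): respond with item(n, r) if the entry exists, else missing(n).
    public void get(String name) {
        String r = entries.get(name);
        if (r != null) onItem.accept(name, r);
        else onMissing.accept(name);
    }

    public static void main(String[] args) {
        FolderEndpoint registry = new FolderEndpoint();
        registry.put("QuickSilver", "ref:folder");
        registry.registerItemCallback((n, r) -> System.out.println(n + " -> " + r));
        registry.registerMissingCallback(n -> System.out.println(n + " missing"));
        registry.get("QuickSilver");
        registry.get("Unknown");
    }
}
```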
The registry enables construction of complex heterogeneous objects that can use different internal objects in different parts of the network, as follows. Example (e). Consider a multicast protocol constructed in the following manner: there are two LANs, each running a local IP-multicast-based protocol to locally disseminate messages: local multicast objects x and y (Figure 4). A pair of dedicated machines on these LANs also runs proxies of a tunneling object t, connected to proxies of x and y. Object t acts as a "repeater", i.e. it copies messages between x and y, so that proxies running both of these protocols receive the same messages. Now, consider an application object a, deployed on nodes in both LANs, with some of its proxies connected to x, and some to y. From the point of view of object a, the entire infrastructure consisting of x, y, and t can be thought of as a single, composite multicast object m. Object m is heterogeneous in the sense that its proxies on different machines have a different internal structure: some have an embedded object x and some use y. Logically, however, m is a single protocol, and we would like to be able to fully express it in our model. The problem stems from the fact that, on one hand, references to m must be complete descriptions of the protocol, so they should have references to x and y embedded; yet on the other hand, references containing local configuration details are not portable. The registry object solves this problem by introducing a level of indirection (Code 4). The reader might be concerned that the portability of live objects is threatened by use of the registry: references that involve the registry rely on all nodes having properly configured registry entries. For this reason, we use the registry sparingly, just to bootstrap the basic infrastructure. Objects placed in the registry would represent entire products, e.g.
"the communication infrastructure developed by company XYZ", and would expose the folder abstraction introduced earlier, whereby specific infrastructure objects can be loaded. An example of such proper use is shown in Code 5.
5 System

5.1 Embedding Live Objects into the Operating System Via Drag and Drop

Our implementation of the live objects runtime runs on Microsoft Windows² with .NET Framework 2.0. The system has two major components: an embedding of live objects into Windows drag and drop technologies, discussed here, and an embedding of the new language constructs into .NET, discussed in Section 5.2. Our drag and drop embedding is visually similar to Croquet [53] and Kansas [54], and mimics that employed in Windows Forms, tools such as Visual Studio (or similar ones for Java), and in the Object Linking and Embedding (OLE) [8], XAML [40], and ActiveX standards used in Microsoft Windows to support the creation of compound documents with embedded images, spreadsheets, drawings, etc. The primary goal is to enable non-programmers to create live collaborative applications, live documents, and business applications that have complex, hierarchical structures and non-trivial internal logic, just by dragging visual components and content created by others from toolbars, folders, and other documents into new documents or design sheets.
² Porting our system from C#/.NET to Mono, to run under Linux, or building a Java/J2EE version of the runtime, shouldn't be a problem, but we haven't yet undertaken this task.
Our hope is that a developer who understands how to create a web page, and who understands how to use databases and spreadsheets as part of their professional activities, would use live objects to glue together these kinds of components, sensors capturing real-world data, and other kinds of information, to create content-rich applications that can then be shared by emailing them to friends, placing them in a shared repository, or embedding them into standard productivity applications. Live object references are much like other kinds of visual components that can be dragged and dropped. References are serialized into XML, and stored in files with a ".liveobject" extension. These ".liveobject" files can easily be moved about. Thus, when we talk about emailing a live application, one can understand this to involve embedding a serialized object reference into an HTML email. On arrival, the object can be activated in place. This involves deserializing the reference (potentially running online repository protocols to retrieve some of its parts), followed by an analysis of the object's type. Live objects can also be used directly from the desktop browser interface. We configured the Windows shell to interpret actions such as a double-click on ".liveobject" files by passing the XML content of the file to our subsystem, which processes it as described above. Note that although our discussion has focused on GUI objects, the system also supports services that lack user interfaces. We have created a number of live object templates based on reliable multicast protocols, including 2-dimensional and 3-dimensional desktops, text notes, video streams, live maps, and 3-dimensional objects such as airplanes and buildings. These can be mashed up to create live applications such as the ones on our web site (Figure 5). Although the images in Figure 5 are evocative of multi-user role-playing systems such as Second Life, live objects differ in important ways.
In particular, live objects can run directly on the user nodes, in a peer-to-peer fashion. In contrast, systems such as Second Life are tightly coupled to the data centers on which the content resides and is updated in a centralized manner. In Second Life, the state of the system lives in the data center; live objects keep state replicated among users. When a new proxy joins, it must obtain some form of a checkpoint to initialize itself, or it starts in a null state. As noted earlier, live objects support drag and drop. The runtime initiates a drag by creating an XML representation of the dragged object's reference and placing it in the clipboard. When a drop occurs, the reference is passed on to the application handling the drop. The application can store it as XML, or it can deserialize it, inspect the type of the dropped object, and take a corresponding action based on that type. For example, the spatial desktop in Figure 5 only supports objects with a 3-dimensional user interface. Likewise, the only types of objects that can be dropped onto airplanes are those that represent textures or streams of 3-dimensional coordinates. The decision in each case is made by the application logic of the object handling the drop. Live objects can also be dropped into OLE-compliant containers, such as Microsoft Word documents, emails, spreadsheets, or presentations. In this case, an OLE component is inserted, with an embedded XML representation of the dragged object's reference. When the OLE component is activated (e.g. when the user opens the document), it invokes the live objects runtime to construct a proxy, and attaches to its user interface endpoint (if there is one). This way, one can create documents and presentations in which, instead of static drawings, the embedded figures display content powered by any type of distributed protocol. Integration with spreadsheets and databases is also possible, although a
little trickier, because these need to access the data in the object, and must trigger actions when a new event occurs. As mentioned above, one can drag live objects into other live objects. In effect, the state of one object contains a reference to some other live object. This is visible in the desktop example in Figure 5. This example illustrates yet another important feature. When one object contains a reference to another (as is the case for a desktop containing references to objects dragged onto it), it can dynamically activate it: dereference it, connect to the proxy of the stored object, and interact with that proxy. For example, the desktop object automatically activates references to all visual objects placed on it, so that when the desktop is displayed, so are all the objects whose references have been dragged onto it.
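The drag-and-drop flow described above (serialize the reference to XML, hand it to the drop target, inspect the type, accept or reject) can be sketched as follows. The XML shape and all names are illustrative; they are not the platform's actual ".liveobject" schema.

```java
// Hypothetical sketch of the drag-and-drop flow: a reference is serialized to
// XML and placed on the clipboard; the drop target deserializes it, inspects
// the dropped object's type, and accepts or rejects the drop.
public class DragAndDrop {
    public record ObjectRef(String type, String recipe) {}

    // Drag: serialize the reference (here, to a trivial XML string).
    public static String serialize(ObjectRef ref) {
        return "<liveobject type=\"" + ref.type() + "\">" + ref.recipe() + "</liveobject>";
    }

    // Drop: deserialize the clipboard payload back into a reference.
    public static ObjectRef deserialize(String xml) {
        String type = xml.substring(xml.indexOf("type=\"") + 6, xml.indexOf("\">"));
        String recipe = xml.substring(xml.indexOf("\">") + 2, xml.lastIndexOf("</"));
        return new ObjectRef(type, recipe);
    }

    // A 3D desktop, for instance, would only accept objects with a 3D user interface.
    public static boolean acceptsDrop(ObjectRef ref, String requiredType) {
        return ref.type().equals(requiredType);
    }

    public static void main(String[] args) {
        ObjectRef ref = new ObjectRef("Desktop3D", "builtin:5");
        String xml = serialize(ref);
        System.out.println(xml);
        System.out.println(deserialize(xml).equals(ref));
    }
}
```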
Fig. 5. Screenshots of our platform running live objects with attached user interface logic. The 3-dimensional space, the area map embedded in this space, as well as each of the airplanes and buildings (left), is a separate live object with its own embedded multicast channel. Similarly, the green desktop and the text notes and images embedded in it are independent live objects. Each of these objects can be viewed and accessed from anywhere on the network, and separately embedded in other objects to create various web-style mash-ups, collaborative editors, online multiplayer games, and so on. Users create these by simply dragging objects into one another. Our download site includes a short video demonstrating the ease with which applications such as these can be created.
By now, the reader will realize that in the proposed model, individual nodes might end up participating in large numbers of distributed protocol instances. Opening a live document of the sort shown in Figure 5 can cause the user's machine to join hundreds of instances of a reliable, totally ordered multicast protocol with state transfer, supporting the objects embedded in the document. This might lead to scalability concerns. In our prior work, we demonstrated that this problem is not a showstopper: our Quicksilver Scalable Multicast (QSM) system [46] can support thousands of overlapping multicast groups, communicating at network speeds with low overhead.

5.2 Embedding Live Object Language Constructs into .NET Via Reflection

Extending a platform such as .NET to support the new constructs discussed in Section 4.1 would require extending the underlying type system and runtime, thus precluding incremental deployment. Instead, we leverage the .NET reflection mechanism to implement dynamic type checking. This technique doesn't require modifications to the
.NET CLR, and it can be used with other managed environments, such as Java. The key idea is to use ordinary .NET types as "aliases" representing our distributed types. Whenever such an alias type is used in .NET code, the live objects runtime "understands" that what is "meant" by the programmer is actually the distributed type. Aliases are defined by decorating .NET types with attributes, as in Code 6 and Code 7. Example (f). Consider a template object type channel for multicast channels, parameterized by the .NET type of the messages that can be transmitted. One defines an alias type as a .NET interface annotated with ObjectTypeAttribute (Code 6, line 01). When a library object (Section 4.2) loads a new binary, the runtime scans the binary for .NET types annotated this way and registers them on its internal list of aliases.

Code 6. A .NET interface can be associated with a live object type using an "ObjectType" attribute (line 01). The interface may then be used anywhere to represent the corresponding live object type. The live objects runtime uses reflection to parse such annotations in the binaries it loads, and builds a library of built-in objects, object types, and templates. Object and type templates are defined by specifying and annotating generic arguments (line 03).

01 [ObjectTypeAttribute] // annotates "IChannel" as an alias for a live object type
02 interface IChannel<
03    [ParameterAttribute(ParameterClass.ValueClass)] MessageType>
04 {
05    [EndpointAttribute("Channel")]
06    EndpointTypes.IDual<Interfaces.IChannel<MessageType>,
07       Interfaces.IChannelClient<MessageType>>
08    ChannelEndpoint { get; } // returns an external reference to endpoint "Channel"
09 }
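In Java, an analogous alias mechanism could be built on runtime annotations and reflection. The sketch below mirrors how the runtime might scan a freshly loaded binary for alias types; all annotation and class names here are invented for illustration, not part of any real framework.

```java
// A Java analogue of the reflection-based alias mechanism: a runtime annotation
// marks an interface as an alias for a live object type, and a scanner uses
// reflection to find such aliases among loaded classes.
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

public class AliasScanner {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface ObjectType {}           // plays the role of ObjectTypeAttribute

    @ObjectType
    public interface IChannel<MessageType> {} // an alias for a live object type

    public interface NotAnAlias {}            // unannotated: ignored by the scanner

    // Scan a set of classes (e.g. those found in a loaded binary) for aliases.
    public static List<Class<?>> findAliases(Class<?>... candidates) {
        List<Class<?>> aliases = new ArrayList<>();
        for (Class<?> c : candidates)
            if (c.isAnnotationPresent(ObjectType.class))
                aliases.add(c);
        return aliases;
    }

    public static void main(String[] args) {
        System.out.println(findAliases(IChannel.class, NotAnAlias.class));
    }
}
```

A real implementation would walk every type in the loaded binary (e.g. via a class-path scan) rather than an explicit candidate list, and would then inspect each alias's generic parameters and endpoint-property annotations, as the text describes for .NET.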
Parameters of the represented live object type are modeled as generic parameters of the alias. Each generic parameter is annotated with ParameterAttribute (line 03), to specify the kind of parameter it represents. The classes of parameters supported by the runtime include Value, ValueClass, ObjectClass, EndpointClass, and others we won't discuss here. Value parameters are simply serializable values, including .NET types or references to live objects. The others represent the types of values, types of live objects, and types of endpoints. For example, we could define a live object type template parameterized by the type of another live object. A practical use of this is a typed folder template, i.e. a folder that contains only references to live objects of a certain type. For example, an instance of this template could be a folder that contains reliable communication channels of a particular type. Another good example is a factory object that creates references of a particular type, e.g. an object that configures new reliable multicast channels in a multicast platform. An alias interface for a live object type is expected to specify only .NET properties, each annotated with EndpointAttribute (line 05). Each property defines one named endpoint for all live objects of this type. The property can only have a getter (line 08), which must return a value of a .NET type that is an alias for some endpoint type. The example in Code 6 uses the alias EndpointTypes.IDual. This is an alias template built into the runtime, parameterized by two .NET interfaces.
Programming with Live Distributed Objects
Code 7. A live object template is defined by decorating a generic class definition (line 01), its generic class parameters (line 03), and constructor parameters (line 08) with .NET attributes. To specify the template live object's type, the class must implement an interface that is annotated to represent a live object type (line 04, referencing the definition shown in Code 6). In the body of the class, we create the endpoints to be exposed by the proxy (created in lines 11-12, exposed in lines 19-25), handle incoming events (line 27), and send events through its endpoints.
01 [ObjectAttribute("89BF6594F5884B6495F5CD78C5372FC6")]
02 sealed class MyChannel<
03   [ParameterAttribute(ParameterClass.ValueClass)] MessageType>
04   : ObjectTypes.IChannel<MessageType>,  // specifies the live object type
05     Interfaces.IChannel  // we implement handlers to all incoming events, see line 12
06 {
07   public MyChannel(
08     [Parameter(ParameterClass.Value)]  // also a parameter of the template
09     ObjectReference membership_reference)
10   {
11     this.myendpoint = new Endpoints.Dual<
12       Interfaces.IChannel, Interfaces.IChannelClient>(this);
13     ...  // the rest of the constructor would contain code very similar to that in Code 1
14   }
15
16   // this is our internal reference to the channel endpoint, the endpoint's "backdoor"
17   private Endpoints.Dual<
18     Interfaces.IChannel, Interfaces.IChannelClient> myendpoint;
19   EndpointTypes.IDual<
20     Interfaces.IChannel<MessageType>,
21     Interfaces.IChannelClient<MessageType>>
22   ObjectTypes.IChannel.ChannelEndpoint
23   {
24     get { return myendpoint; }  // returns an external endpoint reference
25   }
26
27   Interfaces.IChannel.Send(MessageType message) { ... }  // handler for one of the incoming events of the channel endpoint; details omitted
28   ...  // the rest of the alias definition, containing all the other event handlers etc.
29 }
The methods defined by these interfaces, again accordingly annotated, are used by the runtime to compile the list of this endpoint's incoming and outgoing events, and similar annotations can be used to express its constraints. When an alias defined this way is used in some context with its generic parameters assigned (lines 05-07), the runtime treats it as an alias for the specific endpoint type, with the specific events defined by those interfaces.

Having defined the object's type, we can define the object itself. This is again done via annotations. An example definition of a live object template is shown in Code 7. A live object template is defined as a .NET class, the instances of which represent the object's proxies. The class is annotated with ObjectAttribute (line 01) to instruct the runtime to build a live object definition from it. This template has two parameters: the type parameter representing the type of messages carried by the channel (line 03), and a "value" parameter, the reference to the membership object that this channel
should use (lines 08-09). To specify the type of the live object, the class inherits from an alias (line 04). This forces our class to implement properties returning the appropriate endpoints (lines 19-25). The actual endpoints are created in the constructor (lines 11-12). While creating the endpoints, we connect event handlers for incoming events (hooking the proxy up in line 12, and implementing the handlers, as in line 27).

While the use of aliases is convenient as a way of specifying distributed types, alias types are, of course, not distributed, and the .NET runtime does not understand the subtyping rules we defined in Section 3.2. The actual type checking is done dynamically. When the programmer invokes a method of a .NET alias to request a type cast, or to create a connection between endpoints, the runtime uses its internal list of aliases to identify the distributed types involved and performs the type checking by itself. The physical .NET types of the aliases are irrelevant. Indeed, if the runtime determines that two different .NET types are actually aliases for the same distributed type, it will inject wrapper code, as demonstrated below.

Example (g). Suppose that binary Foo.dll defines an object type alias IChannel as in Code 6, and an object template alias MyChannel as in Code 7. Now, suppose that a different, unrelated binary Bar.dll also defines an alias IChannel in exactly the same way, as in Code 6, and then uses this alias, e.g. in the definition of an application object that could use channels of the corresponding distributed type. If both binaries are loaded by the live objects runtime, we will end up with two distinct, binary-incompatible .NET aliases IChannel that represent the same distributed type. Whenever the programmer makes an assignment between these two types, the runtime dynamically creates, compiles, and injects the appropriate wrapper to forward method calls between the incompatible interfaces, making the assignment legal in .NET.
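The wrapper-injection idea in Example (g) can be demonstrated in miniature with reflective proxies. The sketch below is a hedged analogy in Java rather than .NET: FooChannel and BarChannel stand in for the two binary-incompatible IChannel aliases, and java.lang.reflect.Proxy plays the role of the runtime's generated wrapper, forwarding each call by method name and parameter types. All names here are invented for illustration:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Two binary-incompatible interfaces that are structurally identical,
// as if loaded from Foo.dll and Bar.dll in the paper's Example (g).
interface FooChannel { String send(String message); }
interface BarChannel { String send(String message); }

final class WrapperInjector {
    // Wrap an implementation of one interface so it can be used where the
    // other, structurally identical interface is expected: each call is
    // forwarded by method name and parameter types via reflection.
    @SuppressWarnings("unchecked")
    static <T> T wrap(Class<T> target, Object impl) {
        InvocationHandler h = (proxy, method, args) -> {
            for (Class<?> iface : impl.getClass().getInterfaces()) {
                try {
                    Method m = iface.getMethod(method.getName(), method.getParameterTypes());
                    return m.invoke(impl, args);
                } catch (NoSuchMethodException ignored) { /* try next interface */ }
            }
            throw new UnsupportedOperationException(method.getName());
        };
        return (T) Proxy.newProxyInstance(target.getClassLoader(),
                                          new Class<?>[] { target }, h);
    }
}
```

Under these assumptions, wrapping a FooChannel as a BarChannel lets code written against either alias interoperate, much as the live objects runtime makes assignments between equivalent aliases legal in .NET.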
6 Conclusions

Our paper described the architecture and implementation of a system supporting live distributed objects, a strongly typed, object-oriented platform in which distributed protocols are treated as first-class objects. The platform is working and quite versatile, but is still a work in progress. Future challenges include implementing our security and WAN architectures (designed but not yet operational), providing runtime monitoring and debugging tools, and automated self-configuration and tuning.

Acknowledgements. Our work was funded by AFRL/IF, AFOSR, NSF, I3P and Intel. We'd like to thank Mahesh Balakrishnan, Kathleen Fisher, Paul Francis, Lakshmi Ganesh, Rachid Guerraoui, Chi Ho, Maya Haridasan, Annie Liu, Tudor Marian, Greg Morrisett, Andrew Myers, Anil Nerode, Robbert van Renesse, Yee Jiun Song, Einar Vollset, and Hakim Weatherspoon for the feedback they provided.
Bristlecone: A Language for Robust Software Systems

Brian Demsky and Alokika Dash

University of California, Irvine
Abstract. We present Bristlecone, a programming language for robust software systems. Bristlecone applications have two components: a high-level organization description that specifies how the application’s conceptual operations interact, and a low-level operational description that specifies the sequence of instructions that comprise an individual conceptual operation. Bristlecone uses the high-level organization description to recover the software system from an error to a consistent state and to reason how to safely continue the software system’s execution after the error. We have implemented a compiler and runtime for Bristlecone. We have evaluated this implementation on three benchmark applications: a web crawler, a web server, and a multi-room chat server. We developed both a Bristlecone version and a Java version of each benchmark application. We used injected failures to evaluate the robustness of each version of the application. We found that the Bristlecone versions of the benchmark applications more successfully survived the injected failures.
1 Introduction

Software faults pose a significant challenge to developing reliable, robust software systems. The current approach to addressing software faults is to work hard to minimize the number of software faults through development processes, automated tools, and testing. While minimizing the number of software faults is a critical component in the development process for reliable software, it is not sufficient: the faults that inevitably slip through the development and testing processes will still cause deployed systems to fail.

The Lucent 5ESS telephone switch, the Ericsson AXD301 ATM switch, and the IBM MVS operating system are examples of critical systems that use recovery routines to automatically recover from software failures [1,2]. The software in these systems contains a set of manually coded recovery procedures that detect errors and then take actions to automatically recover from the errors. The reported results indicate that the recovery routines can provide an order of magnitude increase in the reliability of these systems [3]. This additional reliability comes at a significant additional development cost: the recovery routines for the Lucent 5ESS telephone switch constitute more than 50% of the switch's software [4]. As a result of these high costs, recovery procedures have been primarily relegated to the domain of critical infrastructure software that can justify the cost.

A wide range of other applications including desktop applications such as web browsers, office applications, games, servers, and control systems could potentially benefit from lower-cost automated recovery. The goal of Bristlecone is to

[J. Vitek (Ed.): ECOOP 2008, LNCS 5142, pp. 490–515, 2008. © Springer-Verlag Berlin Heidelberg 2008]
provide a lower-cost approach to software recovery that will enable a larger class of applications to benefit from this technique.

The key inspiration for this research is the observation that many software errors propagate through software systems to cause further damage, either through data structure corruption or control-flow–induced coupling between conceptual operations. We have developed Bristlecone, a programming language for robust software systems, to address the error propagation problem. The basic idea is to address error propagation by having developers write software systems as a set of decoupled tasks, with each task encapsulating an individual conceptual operation. The developer also provides specifications that describe how these decoupled tasks interact and, optionally, what consistency properties should hold for the data structures. The runtime checks for data structure consistency violations and monitors for illegal operations (such as illegal memory accesses or arithmetic errors) to detect software errors. If the runtime detects an error in the execution, it rolls back the data structures to their state at the beginning of the task's execution, and then uses the task specifications to adapt the execution of the software system to avoid re-executing the same error and to make forward progress.

Alternatively, we can view Bristlecone as a programming language that allows a large space of possible execution paths for any given software system, with an implicit ordering of how desirable any given path is. If the most desirable path results in an error, the runtime rolls back the execution enough to follow a different path, thereby avoiding the error. The result is a robust software system that can continue to successfully provide service even in the presence of errors.

1.1 Bristlecone Language

Figure 1 gives an overview of the components in the Bristlecone system.
We can view software systems as a composition of thousands of conceptual operations; in practice, the correct execution of any conceptual operation is likely to be independent of many of the other conceptual operations. However, many traditional programming languages force developers to linearize the conceptual operations of a software system. This linearization tightly couples these conceptual operations: if one conceptual operation fails, it becomes unclear how to safely execute any future conceptual operations.

Bristlecone avoids artificially coupling operations by providing the developer with the task program construct. The developer uses a task to encompass a single conceptual operation. Tasks are represented in Figure 1 as rectangles. A set of task specifications loosely couple the tasks together. Each task contains a task specification that the runtime uses to determine (1) when to execute the task, (2) what data the task needs, and (3) how the task changes the role this data plays in the computation. If a task fails, the runtime uses the task specifications to reason how to adapt the future execution of the software system so that the execution does not depend on the failed task.

Bristlecone contains the following components (represented by rounded boxes in the figure):

• Bristlecone Compiler: The Bristlecone compiler compiles the tasks and task specifications into C code. Our implementation then uses the gcc C compiler to generate executables. The ellipse labeled Compiled Tasks represents the compiled tasks.
[Figure 1 shows the Bristlecone system: a software system's tasks (Task A, Task B, Task C), each with its task specification, are fed to the Bristlecone compiler, which compiles the code into compiled tasks and compiled task specifications. The Bristlecone runtime then (1) controls task sequencing, (2) reverts failed tasks, (3) adapts execution after failure, and (4) provides standard runtime functionality (GC, system calls, etc.).]

Fig. 1. Overview of the Bristlecone System
• Runtime: The runtime uses the compiled code and compiled specifications generated by the compiler (represented by the ellipses in the figure) to execute the software system. It uses the consistency checker to detect errors that silently corrupt data structures. The runtime then uses rollback to recover consistent data structures if it detects a software error. Finally, it uses the task specifications to determine when to execute the tasks and how to recover from errors.

1.2 Scope

Bristlecone is not suitable for all software systems. Certain computations, such as some scientific simulations, are inherently tightly coupled. While Bristlecone may detect errors in such software systems, it is unlikely to enable these systems to recover in any meaningful way. For other computations, it may be desirable for a software system to shut down rather than deviate from a specific designed behavior or produce a partial result.

Bristlecone is designed for software systems that place a premium on continued execution and that can tolerate some degradation from a specific designed behavior. For example, we expect that Bristlecone will be useful for financial server software, e-commerce systems, office applications, web browsers, online game servers, sensor networks, and control systems for physical phenomena. For applications like finance, Bristlecone can be used to develop software systems that only process error-free transactions and back out all changes that corrupt data structures, while still ensuring that cosmetic errors do not cause potentially expensive downtime.

Ultimately, the software developer must decide whether using this approach is reasonable for a given software system. This decision could depend on the environment in which a system is deployed. For example, in systems with redundant backup systems, we expect that developers would
design the primary system to fail-fast and the backup system to be robust in the presence of errors.

1.3 Contributions

This paper makes the following contributions:

• Bristlecone Language: It presents a programming language which exposes both the conceptual operations and the ordering and data dependences between these conceptual operations to the compiler and runtime system.

• Recovery Strategy: It presents a strategy for repairing the damage caused by a software error and adapting the software system's execution in response to the error to enable it to safely continue execution.

• Experience: It presents our experience using Bristlecone to develop three robust software systems: a web crawler, a web server, and a multi-room chat server. For each benchmark, we developed both a Bristlecone version and a Java version. We designed the Java versions to be resilient: they use threads to tolerate failures. Our experience indicates that the Bristlecone versions are able to successfully recover from significantly more of the injected failures.

The remainder of the paper is structured as follows. Section 2 presents an example that illustrates our approach. Section 3 presents the Bristlecone language. Section 4 presents the runtime system. Section 5 presents our experience using Bristlecone to develop several robust software applications. Section 6 discusses related work; we conclude in Section 7.
2 Example

We next present a web server example that illustrates the operation of Bristlecone. This web server has specialized e-commerce functionality and maintains state to track inventory. As the example web server executes, the conceptual state, or role, of objects in the computation evolves. This evolution changes the way that the software system uses the object and can change the functionality that the object supports. For example, the Java connect method changes the functionality of a Socket object in a computation: after the connect method is invoked, data can be written to or read from that Socket object.

The Bristlecone language provides flags to track the conceptual state of an object. The runtime uses the conceptual state of the object, as indicated by the object's flags, to determine which conceptual operations or tasks to invoke on the given object. When a task exits, it can change the values of the flags of its parameter objects.

2.1 Classes

Figure 2 gives part of the WebRequest class definition. The web server example uses instances of the WebRequest class to manage connections to the web server. The WebRequest class definition declares three flags: the initialized flag, which
class WebRequest {
  /* This flag indicates that the WebRequest
     object is in its initial state. */
  flag initialized;

  /* This flag indicates that the system has
     received a request to send a requested file. */
  flag file_req;

  /* This flag indicates that the connection
     should be logged. */
  flag write_log;
  ...
}
Fig. 2. WebRequest Class Declaration
indicates whether the connection is in the initial state; the file_req flag, which indicates that the server has received a file request from this client connection; and the write_log flag, which indicates whether the connection information is available for logging.

In many cases, the developer may need to invoke a task on multiple objects that are related in some way. Bristlecone provides a tag construct, which the developer can use to group objects together. New tag instances are created using tag allocation statements of the form tag tagname=new tag(tagtype). Such a tag allocation statement allocates a new tag instance of type tagtype and assigns the variable tagname to this tag instance. The developer can tag multiple objects with a tag instance to group them, and then use that tag instance to ensure that the runtime invokes a task on two or more objects in the group defined by the tag instance. The developer can tag an object by including the statement add tagname in an object allocation site to tag the newly allocated object, or in a taskexit statement to tag a parameter object. The example uses the connection tag to group a WebRequest object with the corresponding Socket object that provides the TCP connection for that web request. Tag instances can be added to objects when the object is allocated, and they can be added to or removed from a task's parameter objects when the task exits.

2.2 Tasks

Bristlecone software systems consist of a collection of interacting tasks. The key difference between tasks and methods is that the runtime invokes a task when the heap contains objects with the specified flag settings to serve as the task's parameters. Note that while the runtime controls task invocation, tasks can call methods. The runtime uses a task's specification to determine which objects serve as the task's parameters and when to invoke the task.
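To make this dispatch model concrete, here is a minimal, hypothetical sketch in Java (not Bristlecone's actual runtime, which compiles tasks to C): heap objects carry sets of flags, each task pairs a guard with a body, and a scheduler loop invokes any task whose guard some object satisfies. It also models the recovery strategy from Section 1: on an error, it restores the object's flags from a snapshot and never retries that failing invocation. All class names here are invented for illustration.

```java
import java.util.*;
import java.util.function.Consumer;

// A heap object carrying Bristlecone-style flags.
class Obj {
    final Set<String> flags = new HashSet<>();
}

// A task: a guard (flags that must be set) plus a body that,
// like a taskexit statement, updates the object's flags.
class Task {
    final String name;
    final Set<String> guard;
    final Consumer<Obj> body;
    Task(String name, Set<String> guard, Consumer<Obj> body) {
        this.name = name; this.guard = guard; this.body = body;
    }
}

class MiniRuntime {
    // Invoke any task whose guard is satisfied by some heap object,
    // until no task is enabled. On an error, roll the object's flags
    // back to their pre-task snapshot and blacklist that invocation.
    static void run(List<Task> tasks, List<Obj> heap) {
        Set<String> failed = new HashSet<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Task t : tasks)
                for (Obj o : heap) {
                    String key = t.name + "#" + System.identityHashCode(o);
                    if (failed.contains(key) || !o.flags.containsAll(t.guard)) continue;
                    Set<String> snapshot = new HashSet<>(o.flags);
                    try {
                        t.body.accept(o);            // body performs its taskexit flag updates
                    } catch (RuntimeException e) {
                        o.flags.clear();
                        o.flags.addAll(snapshot);    // rollback: restore consistent state
                        failed.add(key);             // adapt: avoid re-executing the same error
                    }
                    progress = true;
                }
        }
    }
}
```

With tasks mimicking the web server's readRequest and sendPage, an injected failure in a logging task rolls its flag changes back while the rest of the system continues to make forward progress, which is the behavior the paper describes.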
Each task declaration consists of the keyword task, the task’s name, the task’s parameters, and the body of the task. Figure 3 gives the task declarations for the web server example. We indicate the omission of the Java-like imperative code inside the task declarations with ellipses. The first task declaration declares a task named
/* This task starts the web server */
task startup(StartupObject start in initialstate) {
  ...
  ServerSocket ss=new ServerSocket(80);
  Logger l=new Logger() (initialized:=true);
  taskexit(start: initialstate:=false);
}

/* This task accepts incoming connection requests
   and creates a Socket object. */
task acceptConnection(ServerSocket ss in pending_socket) {
  ...
  tag t=new tag(connection);
  WebRequest w=new WebRequest(...)(initialized:=true, add t);
  ss.accept(t);
  ...
}

/* This task reads a request from a client. */
task readRequest(WebRequest w in initialized with connection t,
                 Socket s in IO_Pending with connection t) {
  ...
  if (received_complete_request)
    taskexit(w: initialized:=false, file_req:=true, write_log:=true);
}

/* This task sends the request to the client. */
task sendPage(WebRequest w in file_req with connection t,
              Socket s with connection t) {
  ...
  taskexit(w: file_req:=false);
}

/* This task logs the request. */
task logRequest(WebRequest s in write_log, Logger l in initialized) {
  ...
  taskexit(s: write_log:=false);
}

Fig. 3. Flag Specifications for Tasks
startup that takes a StartupObject object as a parameter and points the parameter variable start to this object. The declaration also contains a guard that states that the StartupObject object must have its initialstate flag set before the runtime can invoke this task. The runtime invokes the task when there exist parameter objects in the heap that satisfy the parameters’ guard expressions. Before exiting,
the taskexit statement in the startup task resets the initialstate flag in the StartupObject to false to prevent the runtime from repeatedly invoking the startup task.

Task declarations can contain constraints on tag bindings to ensure that the parameter objects are related. A tag binding constraint contains the keyword with followed by the type of the tag and the tag variable. For example, the task declaration task readRequest(WebRequest w in initialized with connection t, Socket s in IO_Pending with connection t) ensures that the runtime only invokes the readRequest task on a set of parameter objects in which the first parameter object is bound to an instance of a connection tag and the second parameter object is bound to the same connection tag instance. When the task executes, the tag variable t is bound to that connection tag instance.

2.3 Error-Free Execution

Figure 4 gives a diagram of the dependences between tasks in the web server example. The ellipses in the diagram represent tasks and the edges represent the control and data dependences between the tasks. The rectangle labeled Runtime initialization represents the initialization performed by the Bristlecone runtime. From this diagram, we can see that the web server performs the following operations in an error-free execution (although not necessarily in this order):

1. Startup: When a Bristlecone program is executed, the Bristlecone runtime creates a StartupObject object and then sets its initialstate flag to true. Setting this flag causes the runtime to invoke the startup task in our example. Note that the code never explicitly calls a task. Instead, the runtime keeps track of the status of the flags of objects in the heap and invokes a task when the heap contains objects with the specified flag settings to serve as parameters. When the runtime invokes the startup task, the startup task creates a ServerSocket object to accept incoming connections to the web server.
Next, it creates a Logger object to manage logging web page requests and sets its initialized flag to indicate that the object is ready to provide logging functionality. Finally, it resets the StartupObject object's initialstate flag to false to prevent the runtime from repeatedly invoking the startup task.

2. Accepting an Incoming Connection: At some point, the web server will receive an incoming connection request from a web browser. This causes the runtime to set the ServerSocket object's pending_socket flag to true, which in turn causes the runtime to invoke the acceptConnection task with this ServerSocket object as its parameter. The acceptConnection task creates a WebRequest object to store the connection's state and calls the accept method on the ServerSocket to create a Socket object to manage communication with the web browser. Note that the acceptConnection task creates a new connection tag instance to group the Socket object and WebRequest object together, by binding this tag instance to the WebRequest object and then passing this tag instance into the accept method to bind the newly created Socket object.
Bristlecone: A Language for Robust Software Systems
[Figure 4 shows the task diagram for the web server: Runtime initialization produces a StartupObject {initialstate}, which triggers the startup Task; the startup Task produces a Logger {initialized} and a ServerSocket {}; an incoming connection yields ServerSocket {pending_socket}, which triggers the acceptConnection Task; that task produces a WebRequest and a Socket sharing a connection tag; Socket {IO_pending} triggers the readRequest Task, whose WebRequest {file_req, write_log, ...} output feeds both the sendPage Task and the logRequest Task.]

Fig. 4. Task Diagram for the Web Server
3. Reading a Request: After a connection is established, the client web browser sends a web page request to the server. In response to this incoming web page request, the runtime sets the Socket object's IO_pending flag to true (this flag is declared with the external keyword to indicate that the runtime manages setting and clearing it; the current runtime implementation of Bristlecone is single-threaded and therefore uses non-blocking I/O, while future runtime implementations will support multiple concurrent tasks and transactional blocking I/O [5]). This in turn causes the runtime to invoke the readRequest task. The readRequest task checks whether the server has received the complete request (a client browser can split a long request across multiple packets, so it may be necessary to invoke the readRequest task multiple times to receive a single request). If it has received the complete request, it sets both the file_req flag and the write_log flag to true and resets the initialized flag to false. These flag changes cause the runtime to eventually invoke both the sendPage and the logRequest tasks and prevent repeated invocations of the readRequest task on the same object.

4. Sending the Page: The runtime invokes the sendPage task when the WebRequest object's file_req flag is set to true. The sendPage task then reads the requested file and sends the contents of the file to the client browser. The sendPage task then resets the file_req flag to false to prevent repeated invocations of the sendPage task.

5. Logging the Request: The runtime invokes the logRequest task when both the WebRequest object's write_log flag is set to true and the Logger object's
B. Demsky and A. Dash
initialized flag is set to true. The logRequest task writes a log entry to record which web page was requested. The logRequest task then resets the write_log flag to false to prevent repeated invocations of the logRequest task.

2.4 Error Handling

The Bristlecone runtime uses task specifications to automatically recover from errors. For example, suppose that the logRequest task fails while updating the Logger object. If the web server were written in a traditional programming language, it could be difficult to recover from such a failure. While some traditional languages provide exception handling mechanisms, using them effectively is challenging — the developer must both identify which failures are likely to occur and reason about how to recover from those failures. Alternatively, the program could simply ignore the failure. Unfortunately, if the web server were to simply ignore the failure, it could easily leave the Logger object in an inconsistent state, possibly causing a catastrophic failure later. To address this issue, Bristlecone tasks have transactional semantics — upon failure, the Bristlecone runtime aborts the enclosing transaction to return the affected objects, including the Logger object, to consistent states. The runtime then records that the logRequest task failed when invoked on the combination of those specific WebRequest and Logger objects. The runtime uses this record to avoid re-executing the same specific failure. At this point, the Bristlecone runtime has returned the web server to a known consistent state and must now determine how to safely continue the web server's execution. The traditional problem with using transactions to recover from deterministic software faults is that after aborting a transaction the software system cannot make forward progress — retrying the same transaction will cause the system to repeat the same failure.
Bristlecone solves this problem by using the flags, tags, and task specifications to determine which other tasks are safe to execute after the error. Although the software fault prevents the system from logging this request, since the file_req flag is set to true, the task specification for the sendPage task allows the runtime to invoke the sendPage task. Therefore, the runtime can still safely serve the web page request. The end result is that the software system is able to safely continue to execute even in the presence of software errors. Bristlecone successfully isolates the effects of the error to a minimal part of the web server's execution — only a single task is aborted, and the abort is logged. Without Bristlecone, the web server could potentially leave the Logger object in an inconsistent state, possibly causing the web server to fail to log future requests. If a web server written in a conventional language were designed to log requests before serving them, corruption of the log data structure could even cause the server to stop serving requests.
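The recovery behavior described above can be sketched as a toy dispatcher in Java. This is a hypothetical illustration of the semantics, not Bristlecone's actual runtime (which compiles to C); it simplifies guards to a single flag per task and models one WebRequest object's flags as a set of strings:

```java
import java.util.*;

public class RecoverySketch {
    // Flags of a single WebRequest object (names from the paper's example).
    static Set<String> flags = new HashSet<>(Arrays.asList("file_req", "write_log"));
    static Set<String> failedTasks = new HashSet<>();   // record of failed task/parameter combos
    static List<String> served = new ArrayList<>();

    static void run(String task, String guardFlag, Runnable body) {
        if (!flags.contains(guardFlag) || failedTasks.contains(task)) return;
        Set<String> snapshot = new HashSet<>(flags);    // checkpoint for transactional abort
        try {
            body.run();
        } catch (RuntimeException e) {
            flags = snapshot;                           // roll back the flag state
            failedTasks.add(task);                      // never retry this combination
        }
    }

    public static void main(String[] args) {
        run("logRequest", "write_log", () -> { throw new RuntimeException("fault"); });
        run("sendPage", "file_req", () -> { served.add("page"); flags.remove("file_req"); });
        // The fault in logRequest does not prevent the page from being served.
        System.out.println(served + " " + failedTasks); // prints: [page] [logRequest]
    }
}
```

The key point the sketch captures is that dispatch is driven by flag state rather than control flow, so an aborted logRequest leaves sendPage enabled.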
3 Language Design

The Bristlecone language includes a task specification language that describes how to orchestrate task execution. Bristlecone introduces object flags to store the conceptual
flagdecl := flag flagname; | external flag flagname;
tagdecl := tagtype tagname;
taskdecl := task name(taskparamlist)
taskparamlist := taskparamlist, taskparam | taskparam
taskparam := type name in flagexp | type name in flagexp with tagexp
flagexp := flagexp and flagexp | flagexp or flagexp | !flagexp | (flagexp) | flagname | true | false
tagexp := tagexp and tagtype tagname | tagtype tagname
statements := ... | taskexit(flagactionlist) | tag tagname = new tag(tagtype) | new name(params)(flagortagactions) | assert(expression)
flagactionlist := flagactionlist; name : flagortagactions | name : flagortagactions
params := ... | tag tagname
flagortagactions := flagortagactions, flagortagaction | flagortagaction
flagortagaction := flagaction | tagaction
flagaction := flagname := boolliteral
tagaction := add tagname | clear tagname
Fig. 5. Task Grammar
state of the object. Each task contains a corresponding task specification that describes which objects the task operates on, when the task should execute, and how the task affects the conceptual state of objects. Bristlecone is an object-oriented, type-safe language with syntax similar to Java. Figure 5 presents the grammar for Bristlecone's task extensions to Java. We omit the Java-like imperative component of Bristlecone from the grammar to save space.

The developer includes a flag declaration inside a class declaration to declare that objects of that class contain the declared flag. Flag declarations use the flag keyword followed by the flag's name. The developer may optionally use the external keyword to specify that the flag is set and reset by the runtime system. External flags are intended to handle asynchronous events such as communication over the Internet or mouse clicks; they are typically declared in library code, with the corresponding runtime component setting and clearing the flag.

The developer can use tags to enforce relations between the parameters of a task. The developer can create new tag instances with the new tag statement and a tag type. Note that there may be many instances of a given type of tag. Each instance of a tag type is distinct — objects labeled by two different instances of the same tag type are not grouped together. The developer can bind tags to objects when an object is allocated, or bind or unbind tags to or from parameter objects at the task's exit.

The developer declares a task using the task keyword followed by the task's name, the task's parameters, and the task's code. Each task parameter declaration contains the parameter's name, the parameter's type, a flag guard expression that specifies the state of the parameter's flags, and an (optional) tag guard expression that specifies the tags the object has. The task may be executed when all of its parameters are available.
A parameter is available if the heap contains an object of the appropriate type, that object’s flags satisfy the parameter’s guard expression, and that object contains the tag instances that the parameter’s guard expression specifies. Bristlecone adds a modified new statement that specifies the initial flag settings and tag bindings for a newly allocated object. These take effect when the task exits. Bristlecone contains a taskexit statement that
specifies how the task changes the state of the flags or tag bindings of its parameter objects at that task exit point. Bristlecone contains an assert statement that can be used to specify correctness properties that must hold. The goal of assert statements is to provide a mechanism to detect higher-level errors that do not cause low-level exceptions. The compiled application uses the assert statements to detect errors at runtime — if it detects an error, the runtime system will invoke the recovery algorithm. These assertion statements can be used with data structure consistency checking tools [6,7], JML assertions [8], or design by contract methodologies [9]. In many cases, the assertions can be generated automatically using dynamic invariant detection tools [10,11,12].
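The flag guard expressions of Fig. 5 (flagexp) are boolean formulas over an object's flags. The following Java sketch models their evaluation against an object's flag settings; it is a hypothetical illustration (the guard "initialized and !write_log" and the combinator names are our own), not part of the Bristlecone implementation:

```java
import java.util.*;
import java.util.function.Predicate;

public class GuardSketch {
    // An object's conceptual state: the set of flags currently set to true.
    static Set<String> flagsOf(String... set) {
        return new HashSet<>(Arrays.asList(set));
    }

    // Guard combinators mirroring Fig. 5's flagexp grammar (and / ! / flagname).
    static Predicate<Set<String>> flag(String name) { return fs -> fs.contains(name); }
    static Predicate<Set<String>> and(Predicate<Set<String>> a, Predicate<Set<String>> b) { return a.and(b); }
    static Predicate<Set<String>> not(Predicate<Set<String>> a) { return a.negate(); }

    public static void main(String[] args) {
        // Guard for a hypothetical parameter: "initialized and !write_log".
        Predicate<Set<String>> guard = and(flag("initialized"), not(flag("write_log")));
        System.out.println(guard.test(flagsOf("initialized")));              // true
        System.out.println(guard.test(flagsOf("initialized", "write_log"))); // false
    }
}
```

A parameter is then "available" exactly when some heap object's flag set makes its guard true, which is the check the runtime performs before dispatching a task.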
4 Runtime System

The Bristlecone runtime is responsible for dispatching tasks, detecting errors, and recovering from errors.

4.1 Task Execution

Recall that the task specification contains guard expressions for all of the task's parameters and that the runtime executes a task when parameter objects are available that satisfy these guards. We next discuss how our implementation efficiently performs task dispatch. A naive approach to task dispatch could potentially be very inefficient — a parameter's guard expression is quantified over all objects in the heap!

Parameter Sets. The runtime maintains a parameter set for each parameter of each task. A parameter set contains all of the objects that satisfy the corresponding parameter's guard. For each object type, the runtime precomputes a list of parameter sets that objects of this type can potentially be a member of. When a task exit changes an object's flag settings or tag bindings, the runtime updates that object's membership in the parameter sets by traversing the precomputed list of possible parameter sets for the class and evaluating whether the object satisfies the guard expression to be a member of the parameter set.

Bristlecone also uses the parameter sets as root sets for garbage collection. Objects in Bristlecone are garbage collected if (1) the object is unreachable from any potential parameter objects and (2) the object cannot be a parameter object of any task as determined by membership in a parameter set. Note that it is possible to write incorrect programs that leave objects in task queues (e.g., for a two-parameter task with tagged parameters, the program might change only one parameter object's flags, leaving the other parameter object in the queue). We have developed a static analysis that the developer can use to automatically identify this type of memory leak [13].

Task Queue. A task invocation is a tuple that includes both a task and bindings for that task's object parameters and tag parameters.
An active task invocation is a task invocation that satisfies all of the task specification’s guards and can therefore safely be invoked by the runtime. The runtime maintains the task queue of all active task invocations and executes task invocations from this task queue.
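The parameter-set bookkeeping described above can be sketched in Java. This is a simplified hypothetical model (the class and method names are our own, guards ignore tag bindings, and real parameter sets are maintained per task parameter by the C runtime):

```java
import java.util.*;
import java.util.function.Predicate;

public class ParamSetSketch {
    static class Obj { Set<String> flags = new HashSet<>(); }

    // One parameter set per task parameter: the objects whose flags satisfy the guard.
    static class ParamSet {
        Predicate<Set<String>> guard;
        Set<Obj> members = new HashSet<>();
        ParamSet(Predicate<Set<String>> guard) { this.guard = guard; }
    }

    // Called after a task exit changes an object's flags: update the object's
    // membership in every parameter set its class can belong to (a precomputed list).
    static void update(Obj o, List<ParamSet> candidateSets) {
        for (ParamSet ps : candidateSets) {
            if (ps.guard.test(o.flags)) ps.members.add(o);
            else ps.members.remove(o);
        }
    }

    public static void main(String[] args) {
        // Parameter set for readRequest's Socket parameter: guard "IO_pending".
        ParamSet readRequestParam = new ParamSet(fs -> fs.contains("IO_pending"));
        Obj socket = new Obj();
        update(socket, List.of(readRequestParam));
        System.out.println(readRequestParam.members.contains(socket)); // false
        socket.flags.add("IO_pending");
        update(socket, List.of(readRequestParam));
        System.out.println(readRequestParam.members.contains(socket)); // true
    }
}
```

Because only the flag-changing object's candidate sets are revisited, a flag update costs time proportional to the precomputed list for that class rather than to the size of the heap.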
Our implementation maintains a conservative approximation of the task queue — our implementation's task queue may contain a number of non-active task invocations in addition to all of the active task invocations. When an object is added to a parameter set, the implementation generates all active task invocations that bind that object to the corresponding parameter and then adds these active task invocations to the task queue. When an object is removed from a parameter set, our implementation does not remove task invocations from the task queue. Instead, before the implementation executes a task invocation in the queue, the implementation verifies that the task invocation is still active.

Iterators. We next describe how our implementation efficiently generates all active task invocations. Note that tag bindings restrict how parameter objects can be grouped together into a task invocation, and therefore, a naive implementation can needlessly explore many task invocations that do not satisfy tag guards. For example, the sendPage task in a web server may require both a WebRequest object and a Socket object tagged with the same connection instance as parameters. An efficient implementation must prune the search space of possible task invocations to avoid the overhead of exploring many task invocations that do not satisfy the tag guards.

Our implementation searches the parameter binding space using a sequence of iterators. It uses two iterator types: object instance iterators and tag instance iterators. Object instance iterators iterate over the objects in the corresponding parameter set that are compatible with all tag variable bindings made by previous iterators. In general, we expect that relatively few objects will be bound to a given tag instance and relatively few tag instances will be bound to a given object.
Our implementation uses this expectation to optimize the object iterators: if the parameter has a tag guard with a tag variable that was bound by a previous tag iterator, the implementation optimizes the object iterator to only iterate over the objects bound to that tag instance. Tag iterators iterate over tag instances that are bound to an object. Tag iterators bind the tag variables in tag guards to tag instances. As described above, our iterators use the constraints provided by the tag guards to prune the search space.

Note that the order of the iterators can affect the size of the search space that the implementation explores to generate all active task invocations. Our implementation precomputes iterator orderings for each parameter of each task. The implementation uses the following ordering priority:

1. Tag iterators for tags bound to parameter objects that have already been iterated over have the highest priority. We expect that the set of iterated tag instances will be small and, therefore, tag bindings will substantially prune subsequent object iterations for parameters bound to the same tag variable.
2. Object iterators for parameters with tags that are bound by previous tag iterators.
3. Object iterators for parameters with tags that have not yet been iterated over.
4. Remaining object iterators have the lowest priority.

Task Execution Semantics. Tasks may fail either as a result of software errors, hardware failures, or user errors. If a task fails, it may leave data structures in inconsistent
states. Further computation using these inconsistent data structures will likely have unpredictable and potentially catastrophic results. To avoid this problem, tasks in Bristlecone have transactional semantics — if a task fails, the Bristlecone runtime aborts the task's transaction. Recall that a potential issue with the use of transactions in traditional programming languages is that after the system recovers to the previous point, the system may simply re-execute the same deterministic fault and that fault will cause the system to fail repeatedly in the same way. Bristlecone addresses this issue by using the flexibility provided by the task-based language to avoid re-executing the same failure. The Bristlecone runtime records the combination of task and parameter assignments that caused the failure and uses this record to avoid re-executing the failed combination of task and parameter assignments. Instead, the runtime executes other tasks to avoid retriggering the same underlying fault.

4.2 Error Detection

Errors can cause the computation to produce incorrect results and corrupt data structures, potentially eventually causing the software system to perform unacceptably. Bristlecone uses runtime checks to detect errors, enabling the software system to adapt its execution. The Bristlecone runtime uses error detection routines to trigger recovery actions.

Bristlecone uses checks to detect many software errors. For example, the Bristlecone compiler generates array bounds checks. These checks verify that the software system does not read or write past the end of arrays. The Bristlecone compiler also generates the necessary type checks for array operations and cast operations. These checks ensure that the dynamic types of objects do not violate type safety. The runtime uses hardware page protection to perform null pointer checks; this is implemented by catching the segmentation fault signal from the operating system.
These checks ensure that the software system does not attempt to dereference a null pointer or write values to the fields of a null pointer. The runtime also uses hardware exceptions to detect arithmetic errors, including division by zero. Native library routines also signal errors to the runtime. For example, if a software system attempts to send data over a closed network connection, the runtime will signal an error. Software errors can also cause a program to loop. Looping can prevent the software system from providing services. It is straightforward to support developer-provided task time-outs that the runtime can use to detect looping tasks.

Bristlecone includes a runtime assertion mechanism to ensure that the execution is consistent with respect to specified properties. The developer can simply write imperative code to check properties or can use the assertion mechanism to call external consistency checking code. This mechanism is intended to be used to ensure data structure consistency or to use techniques such as design by contract to detect higher-level errors. The mechanism can be used in conjunction with JML assertions [8], data structure consistency specification languages [6,7], or other runtime checkable specifications.
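All of these detection mechanisms funnel into the same recovery path: a failed check becomes a task failure. A minimal Java sketch of that funneling (hypothetical; the real runtime catches some of these conditions via hardware traps in C rather than language-level exceptions):

```java
public class DetectSketch {
    // Convert a low-level runtime check firing (bounds, null, cast, arithmetic,
    // failed assertion) into a uniform task-failure result for the recovery code.
    static String execute(Runnable taskBody) {
        try {
            taskBody.run();
            return "ok";
        } catch (RuntimeException e) {
            return "failed: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        int zero = 0;
        System.out.println(execute(() -> { int x = 1 / zero; }));               // failed: ArithmeticException
        System.out.println(execute(() -> { int[] a = new int[2]; a[5] = 7; })); // failed: ArrayIndexOutOfBoundsException
        System.out.println(execute(() -> {}));                                  // ok
    }
}
```

The uniform "failed" result is what lets the runtime treat an array-bounds violation, a null dereference, and a failed assertion identically: abort the transaction and consult the failure record.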
4.3 Error Recovery

Bristlecone was designed to support reasoning about failures using the high-level task abstraction. In Bristlecone, a task either successfully completes execution or does not execute at all. The Bristlecone runtime uses a straightforward checkpointing-based transaction approach to implement this failure abstraction. Because a task can only access the part of the heap that is reachable from the task's parameters, it suffices to create a snapshot of all objects reachable from the task's parameters. While the current prototype implementation uses a naive checkpointing-based approach, it is conceptually straightforward for future Bristlecone implementations to leverage the large body of work on efficiently implementing software or hardware transactional memory. A second issue with the current implementation is transactionalizing I/O. One solution is to use a transactional I/O API that delays the effects of I/O operations until a task commits.

If Bristlecone detects an error, it simply fails the entire task and uses the stored checkpoint to roll back the state affected by the failed task. This recovery strategy greatly simplifies reasoning about the state of the software system after a failure. Restoring state from the previous checkpoint ensures that a failure does not leave partially updated data structures in inconsistent states.

Many software errors are deterministic. If Bristlecone re-executes a failed task on the same parameters in the same state, it is likely that the task will fail again due to the same error. Bristlecone addresses this issue by maintaining a record of failures. For each failure, this record contains the combination of the failed task and the parameter assignments that failed. Bristlecone uses this record to avoid re-executing the same failures by checking reference equality of the task's parameters.
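The snapshot-and-rollback scheme can be sketched in Java. This is a deliberately simplified illustration: reachability is a linked-list walk and only one int field per object is saved, whereas the real runtime snapshots every field of every object reachable from the task's parameters:

```java
import java.util.*;

public class CheckpointSketch {
    static class Node {
        int value;
        Node next;
        Node(int v, Node n) { value = v; next = n; }
    }

    // Snapshot everything reachable from a task's parameter (here, a list);
    // IdentityHashMap keys on object identity, matching the runtime's view of the heap.
    static Map<Node, int[]> snapshot(Node param) {
        Map<Node, int[]> saved = new IdentityHashMap<>();
        for (Node n = param; n != null; n = n.next) saved.put(n, new int[]{n.value});
        return saved;
    }

    // On task failure, restore every saved object's fields from the checkpoint.
    static void rollback(Map<Node, int[]> saved) {
        for (Map.Entry<Node, int[]> e : saved.entrySet()) e.getKey().value = e.getValue()[0];
    }

    public static void main(String[] args) {
        Node param = new Node(1, new Node(2, null));
        Map<Node, int[]> cp = snapshot(param);
        param.value = 99;          // the task partially updates the heap...
        param.next.value = 98;
        rollback(cp);              // ...then fails, so the checkpoint is restored
        System.out.println(param.value + " " + param.next.value); // prints: 1 2
    }
}
```

Restoring the checkpoint is what guarantees the all-or-nothing task semantics: after a failure, no reachable object retains a partial update.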
The Bristlecone runtime then uses the object flags to determine which tasks can be executed even though part of the computation has failed. To better handle non-deterministic failures, the approach can be extended to automatically retry failed task executions a few times. We note that after a failure, a failed object can remain in the task queue and never be garbage collected. We expect that in practice, software systems will be mostly correct; failures will therefore be rare, and only small amounts of memory will be leaked due to failures.

4.4 Debugging and Error Logging

While it is desirable for deployed Bristlecone software systems to make every effort to avoid failures, during the development phase this behavior can mask failures and therefore complicate the debugging process. To facilitate debugging, Bristlecone can be configured to fail-fast. The fail-fast mode ensures that developers will notice software errors during the development process. Moreover, it would be straightforward to have the runtime record the state of the objects that caused the task failure by using the stored checkpoints. This information could help with debugging many software errors. Furthermore, both developers and system administrators often want to be aware of failures in deployed systems so that the underlying software faults, if any, can be fixed. Bristlecone contains a logging mechanism that records both the task that failed and the type of error. This log ensures that developers and system administrators are aware of
failures in Bristlecone software systems and gives the developers a starting point for diagnosing the cause of the failure. In some cases, developers may wish to create a custom framework to communicate failure data. It would be possible to provide an API that applications could use to query the runtime system about failures.
5 Experience

We next discuss our experiences using Bristlecone to develop three robust software systems: a web crawler, a web server, and a multi-room chat server.

5.1 Methodology

We have implemented the Bristlecone compiler. Our implementation consists of approximately 22,400 lines of Java code and C code for the Bristlecone compiler and runtime system. The Bristlecone compiler generates C code that runs on both Linux and Mac OS X. The Bristlecone runtime uses precise stop-and-copy garbage collection. The source code for our compiler and runtime is available at http://demsky.eecs.uci.edu/bristlecone/. We ran the benchmarks on a MacBook with a 2 GHz Intel Core Duo processor, 1 GB of RAM, and Mac OS X version 10.4.8.

For each benchmark, we developed two versions: a Bristlecone version and a Java version. We designed the Java versions to tolerate faults by isolating components of the computation using threads. Without the use of threads to provide fault tolerance, the Java versions would have halted with the first failure. Our evaluation was designed to evaluate how robust each version of the benchmark applications was to the large class of faults that cause the faulty thread or task to perform an illegal operation. This fault class includes faults that cause null pointer dereferences, out-of-bounds array index errors, failed assertions, failed data structure consistency checks, library usage errors, and arithmetic exceptions. Our evaluation simulated the effects of this fault class by randomly injecting halting failures. We used the Bristlecone compiler to automatically insert failure injection code after each instruction. We used the Java frontend of our compiler framework to compile and instrument the Java versions. The failure injection code takes three parameters at runtime: the number of instructions to execute before considering injecting a failure, the probability that a failure will be injected, and the total number of failures to inject.
For each benchmark, we selected the number of failures and then set the failure probability to ensure that the normal execution of the benchmark would reach the set number of failures.

5.2 Web Crawler

The web crawler takes an initial Uniform Resource Locator (URL) as input, visits the web page referenced by the URL, extracts the hyperlinks from the page, and then repeats this process to visit all of the URLs transitively reachable from the initial URL.
The Bristlecone version contains four tasks. The Startup task creates a Query object to store the initial URL that was specified on the command line and creates a QueryList object to store the list of URLs that the web crawler has extracted. The requestQuery task takes a newly created Query object as input, contacts the web server specified by the Query object, and then requests the URL specified by the Query object. The readResponse task reads the data that is currently available on the connection and then checks if the task has received the complete web page. The processPage task extracts URLs from the web page, checks the QueryList object to see if the crawler has seen this URL before, and then creates a Query object if the URL has not been seen before. The Java version uses a pool of three threads to crawl web pages. Each thread dequeues a URL from a global list of pages to visit, downloads the corresponding web page, extracts URLs from the web page, and then stores any URLs it has not seen before into the global list of pages to visit. We evaluated the robustness of the web crawler by developing both a workload and a failure injection strategy. Our workload consisted of a set of 100 web pages that each contain 3 hyperlinks to other web pages in the set. We used randomized failure injection to inject failures into the executions of the web crawlers. We injected 3 failures into each execution with each instruction having a 1 in 426,000 chance of failing. We performed 100 trials of the experiment on each of the two versions. For each trial, we measured how many web pages the crawler downloaded. Figure 6 presents the results of the web crawler experiments. Without the injected failures, both versions download 100 web pages. With the injected failures, on average the Bristlecone version downloaded 91 out of 100 web pages and the Java version downloaded 6 out of 100 web pages. 
While most of the injected failures in the Bristlecone version only affect crawling a single web page, failures that are injected into either the startup task or the processing of the initial web page can affect crawling many web pages. Such failures prevent the Bristlecone version from discovering the URLs of any further pages and significantly lower the Bristlecone version's average number of crawled pages.

                                 Java   Bristlecone
Web Pages Crawled (out of 100)     6        91

Fig. 6. Summary of Web Crawler Benchmark Results
5.3 Web Server

The web server benchmark contains features that are intended to model an e-commerce server. The web server maintains an inventory of merchandise and supports requests to perform commercial transactions on this inventory, including adding new items, selling items, and printing the inventory. The Bristlecone version contains six tasks. The StartUp task creates a ServerSocket object to accept incoming connections, creates a Logger object to log the connections, and creates an Inventory object to keep track of the current inventory of merchandise. The AcceptConnection task processes incoming connections and creates a WebSocket object to manage each connection. The
ProcessRequest task reads the data that is currently available from the incoming connection and then checks if the task has received the complete request. When the complete request is available, the ProcessRequest task parses the request to determine whether the request is an e-commerce transaction or a simple file request. The Transaction task processes e-commerce transaction requests. It first inspects the request to determine whether the request is to add new items to the inventory, to make a purchase, or to display inventory and then performs the requested operation. For example, after receiving a purchase request the task looks up the price of the item in the Inventory object, verifies that the item is available, and if so, decrements the inventory count for the item and adds the price of the item to the sales figure. The SendFile task processes file requests. It opens the requested file, reads the file’s contents, and writes the file’s contents to the socket. The LogRequest task logs all of the requests to the log file. The Java version of the web server uses a thread to monitor for incoming connections. When a new connection arrives, the server spawns a separate connection thread for that incoming connection. The server uses a global object to store the inventory values. We used this design to isolate failures in connection threads to that specific request as much as possible. Note that failures can potentially corrupt the shared state. Note that unlike the Bristlecone version of the web server, a failure in a connection thread will prevent the server from performing any further operations for that connection including logging the request. We evaluated the robustness of both versions of the web server by developing both a workload and a failure injection strategy. Our workload simulated web traffic to the server. Our workload consisted of a sequence of 4,400 transaction requests. 
Our failure injection strategy utilized the failure injection code described in the previous section. We used failure injection to randomly inject 50 failures into the execution with a probability of injecting a failure after a given instruction of 1 in 2,100,000. We performed 200 trials on each of the two versions. For each trial we recorded whether the final inventory request was served, whether the final inventory was consistent, how many requests each version failed to serve, and how many requests each version failed to log.

Figure 7 summarizes the results of the fault injection experiments with the web server. The Java version failed to serve the inventory request in 4.5% of the trials while the Bristlecone version failed to serve the inventory request in 1.5% of the trials, representing a three-fold reduction in the number of failures to serve inventory requests. More importantly, while the Java version served correct inventory responses only 68.6% of the time, the Bristlecone version served the correct inventory response 100% of the time.
                                         Java   Bristlecone
Failures to Serve Inventory Responses    4.5%       1.5%
Correct Inventory Responses             68.6%       100%
Failures to Serve Request                3.8%       2.2%
Failures to Log Request                  3.9%       2.6%

Fig. 7. Summary of Web Server Benchmark Results
The Java version failed to serve 3.8% of the web requests and the Bristlecone version failed to serve 2.2% of the web requests, representing a 42% reduction in the failure rate. The Java version failed to log 3.9% of the web requests and the Bristlecone version failed to log 2.6% of the web requests, representing a 33% reduction in the failure rate.

5.4 Chat Server

The multi-room chat server benchmark accepts incoming connections, asks the user to create a new room or select an existing room, and then allows users to chat with other users in the same chat room. The Bristlecone version contains six tasks. The StartUp task creates a ServerSocket object to accept incoming connections and a RoomObject to manage the chat rooms. The AcceptConnection task processes incoming chat connections. It creates a ChatSocket object to manage this connection and then sends a message to ask the user to select a chat room. The ReadRequest task reads the user's chat room selection. It reads the currently available data from the incoming connection and checks if the chat server has received the complete chat room selection. When the complete room request has been received, the ProcessRoom task processes the request. If the requested room does not exist, it creates the requested chat room. It then adds the user to the requested chat room. The chat server stores the mapping of chat room names to the set of chat room participants and, for each room, maintains a list of participants in the corresponding room. The Message task processes incoming chat messages and stores these messages in Message objects. The SendMessage task then reads these Message objects, parses the messages, and then sends the messages to all of the participants in the chat room. Note that a problematic message or other error condition that causes the SendMessage task to fail will not prevent the server from processing future messages from the same connection.
The Java version of the chat server uses a thread to monitor for incoming connections. When a new connection arrives, the server spawns a separate connection thread for that incoming connection. The server uses a global object to store the set of chat rooms. Unless a failure corrupts the room list, this design isolates failures in connection threads to the specific connection. Note that unlike the Bristlecone version of the chat server, a single failure in a connection thread will prevent the server from relaying any further messages from that connection. We evaluated the robustness of both versions by developing both a workload and a failure injection strategy. Our workload simulated multiple users chatting on the server. Our workload sent a total of 800 messages. Our failure injection strategy utilized the failure injection code described in the previous section. We used failure injection to randomly inject 10 failures into the execution with a probability of injecting a failure after a given instruction of 1 in 270,000. We performed 100 trials on each of the two versions. For each trial we recorded how many messages were successfully transmitted. In the presence of the injected failures, the Java version failed to deliver 39.9% of the messages and the Bristlecone version failed to deliver 19.3% of the messages, representing a factor of two reduction in the failure rate.
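The isolation property described for the Java version can be modeled in a few lines of plain Java (our sketch, not the benchmark's code; sockets and threads are omitted and connections are reduced to message consumers). A failure while delivering to one connection is caught per recipient, so the remaining participants still receive the message; as the text notes, though, the failing connection itself stays broken.

```java
import java.util.*;
import java.util.function.Consumer;

// Simplified model of the Java chat server's shared room registry.
// Each "connection" is a message consumer; a failure in one consumer
// is caught per recipient, so a bad connection cannot stop delivery
// to the others in the same room.
class RoomRegistry {
    private final Map<String, List<Consumer<String>>> rooms = new HashMap<>();

    void join(String room, Consumer<String> conn) {
        rooms.computeIfAbsent(room, r -> new ArrayList<>()).add(conn);
    }

    // Returns the number of participants that actually received the message.
    int broadcast(String room, String msg) {
        int delivered = 0;
        for (Consumer<String> conn : rooms.getOrDefault(room, List.of())) {
            try {
                conn.accept(msg);
                delivered++;
            } catch (RuntimeException e) {
                // Isolate the failure to this single connection.
            }
        }
        return delivered;
    }
}
```

The corresponding weakness described above is also visible here: once a consumer starts throwing, nothing in this design repairs it, so that participant receives no further messages.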
B. Demsky and A. Dash
5.5 Experiences Writing Bristlecone Applications

We have developed Bristlecone and Java versions of three different benchmark applications. In general, we found writing Bristlecone applications to be straightforward. Typically, writing the Bristlecone version of an application simply requires reorganizing the application's code. The Bristlecone versions of the benchmarks were approximately the same size as the Java versions: the Bristlecone version of the web crawler contained 20% fewer lines of code than the Java version, the Bristlecone version of the web server contained 2% more lines of code, and the Bristlecone version of the chat server contained 5% more lines of code. The Bristlecone version of the web crawler was shorter because it did not require an auxiliary data structure to store queries. One potential concern with Bristlecone is that developers may make mistakes writing the high-level task specifications that Bristlecone requires. In our experience, task declarations were in general simpler than the lower-level imperative code and therefore easy to write correctly. In addition, we have developed an analysis that examines the task specification to extract state transition diagrams for each class [13]. Developers can use these state transition diagrams to quickly verify visually that their task specifications have the desired behaviors.

5.6 Performance

Although Bristlecone uses standard compilation techniques for the bodies of methods and tasks, it incurs extra overhead supporting transactions and task invocation. Our current runtime implements transactions using a combination of checkpointing and single-threaded execution. We have measured the current implementation's checkpointing and task invocation overhead at 4.7 microseconds per task invocation on a 3 GHz Pentium-D machine for a microbenchmark.
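As an illustration of how such a per-invocation figure is obtained, here is a microbenchmark skeleton in plain Java (ours, not the paper's harness). It times N invocations of a trivial task body wrapped with a simulated checkpoint step and reports the amortized cost per invocation; absolute numbers depend entirely on the machine and say nothing about the paper's 4.7 microsecond measurement.

```java
// Illustrative microbenchmark skeleton: amortized cost of a checkpointed
// task invocation. The checkpoint is simulated here by copying the task's
// state before each invocation; the rollback branch is never taken.
class InvocationBench {
    static long amortizedNanos(int n) {
        int[] state = new int[1024];
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            int[] checkpoint = state.clone();     // simulated checkpoint
            state[i % state.length]++;            // trivial task body
            if (state[0] < 0) state = checkpoint; // rollback path (never taken)
        }
        return (System.nanoTime() - start) / n;
    }
}
```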
Researchers have developed efficient hardware and software transactional memory implementations [14,15,16,17,18,19,20,21] that could be used to lower the transaction overhead. Static task scheduling could also be used to schedule a sequence of task invocations ahead of time, further reducing the task invocation overhead.

5.7 Discussion

Our experience indicates that software systems developed using Bristlecone can recover from many otherwise fatal failures. The Bristlecone versions of all three benchmarks were able to recover from many more injected failures and provided a higher quality of service than the hand-designed Java versions. Note that these results only hold for software faults that can be automatically detected. The results can be generalized to include faults that cause the application to silently perform an incorrect action, provided the developer supplies Bristlecone with a runtime-checkable correctness specification that detects the error. Examples of such specifications include runtime assertions and data structure consistency specifications.
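A runtime-checkable data structure consistency specification of the kind mentioned above can be as simple as the following sketch (our example): a linked list caches its size, and the specification requires the cached size to match the actual chain length. Evaluating the check at runtime turns silent corruption into a detectable fault.

```java
// Sketch of a runtime-checkable consistency specification.
// Invariant: the cached size field equals the actual chain length.
class CheckedList {
    static class Node { int val; Node next; Node(int v) { val = v; } }
    Node head;
    int size;

    void push(int v) { Node n = new Node(v); n.next = head; head = n; size++; }

    // The consistency specification, evaluated at runtime.
    boolean consistent() {
        int count = 0;
        for (Node n = head; n != null; n = n.next) count++;
        return count == size;
    }
}
```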
6 Related Work

We survey related work in testing, static analysis, exception mechanisms, fault tolerance, programming languages, and software architectures.

6.1 Approaches to Reliable Software

The standard approach to dealing with software failures is to work hard to find and eliminate software faults. Approaches such as extensive testing [22], static analysis [23,24,25], software model checking [26], error correction codes [27], and software isolation mechanisms [28] are all designed, in part, to eliminate as many potential errors as possible. We expect that Bristlecone will complement these techniques: Bristlecone will enable software systems to recover from the software errors that the other techniques miss. Many programming languages, including Java, provide an exception handling mechanism [29]. Writing exception handlers requires developers to reason about which parts of the computation are affected by a failure and how to recover the computation from it; note that the failed operation may leave critical data structures in inconsistent, partially updated states. Fault tolerance researchers have developed many methods to address software failures. Recovery blocks allow a developer to provide multiple implementations of an algorithm and an acceptance test for these implementations [30]. This technique requires the developer to expend the effort to develop multiple implementations and acceptance tests. Furthermore, the recovery block technique may fail if the algorithms share a common defect or if there is an error in the acceptance test. Backward recovery uses a combination of checkpointing and acceptance tests (or error detection) to prevent a software system from entering an incorrect state [31,32,33,34]. Unfortunately, it can be difficult to handle deterministic failures using backward recovery, as the same software error will likely cause the software system to fail repeatedly.
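The recovery-block pattern just described can be sketched in a few lines of Java (names and structure are ours, not from [30]): run the primary implementation, apply the acceptance test to its result, and fall back to the alternate if the primary throws or its result is rejected.

```java
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

// Sketch of the recovery-block pattern: primary implementation,
// alternate implementation, and an acceptance test on the result.
class RecoveryBlock {
    static int run(IntUnaryOperator primary, IntUnaryOperator alternate,
                   IntPredicate acceptance, int input) {
        try {
            int r = primary.applyAsInt(input);
            if (acceptance.test(r)) return r;
        } catch (RuntimeException e) {
            // fall through to the alternate implementation
        }
        int r = alternate.applyAsInt(input);
        if (!acceptance.test(r)) throw new IllegalStateException("both versions rejected");
        return r;
    }
}
```

The pattern's limitations noted above are visible in the sketch: if both versions share a defect, or the acceptance predicate itself is wrong, a bad result can still be accepted.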
Forward recovery uses multiple copies of a computation to recover from transient errors [35]. Forward recovery is designed to handle intermittent failures; it cannot help with deterministic errors that affect all copies of the computation. Databases utilize transactions to ensure that the database is never left in a half-updated state by a partially completed sequence of operations [3]. In N-version programming, the developer constructs a software system out of multiple, independent implementations and a decision algorithm that decides which result to use in the event of a disagreement [36]. However, N-version programming may be prohibitively expensive: it requires developers to perform the difficult task of implementing multiple versions that are independent enough not to share failure modes yet similar enough to be comparable. The Recovery-Oriented Computing project has explored integrating an undo operation into software systems [37] and constructing systems out of a set of individually rebootable components [38]. Failure-oblivious computing is designed to address memory errors in C programs [39]. It detects erroneous memory operations and discards
illegal write operations and manufactures values for invalid read operations. DieHard handles similar memory errors by using replication and randomization of the memory layout [40]. Randomization probabilistically ensures that illegal memory operations can only damage data structures in one of the replicas. Specification-based data structure repair automatically generates repair algorithms from declarative specifications [7] and imperative consistency checking code [41]. This technique enables software systems to recover from data structure consistency errors. Researchers have used meta-languages to decompose numerical computations into parallelizable tasks [42]. This technique is applicable to parallelizable numerical computations that compute many subproblems and then combine the subproblem results into an overall result. If one of the subcomputations executes slowly, this approach can ignore that subcomputation. Bristlecone is designed to handle a broader class of software systems, including servers, control systems, and office applications, and can provide stronger correctness guarantees.

6.2 Related Languages

A key component of Bristlecone is decoupling unrelated conceptual operations and tracking the data dependences between these operations. Bristlecone's approach shares common elements with many parallel programming paradigms [43]. Dataflow computation was one of the earliest computational models to track data dependences between operations so that the operations can be parallelized [44]. Note that dataflow languages are not designed to handle failures: a failure in a dataflow program will likely cause an operation to fail to place a value in a queue, which would likely cause the application to fail catastrophically because operations that operate on multiple queues would pair the wrong values for the rest of the computation.
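The queue-misalignment failure mode just described can be demonstrated with a toy model (ours, not from [44]): an operation pairs values positionally from two queues, so a single value lost to a failure misaligns every subsequent pair for the rest of the computation.

```java
import java.util.*;

// Toy model of a dataflow operation that pairs values positionally
// from two input queues. Losing one value from one queue shifts all
// later pairings, which is the catastrophic behavior described above.
class PairingOp {
    static List<String> pairAll(Queue<Integer> keys, Queue<Integer> vals) {
        List<String> out = new ArrayList<>();
        while (!keys.isEmpty() && !vals.isEmpty())
            out.add(keys.poll() + ":" + vals.poll());
        return out;
    }
}
```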
Bristlecone ensures that failures cannot cause the wrong parameter objects to be paired together or prevent a task from operating on parameter objects that were not affected by the error. Tuple-space languages, such as Linda [45], decouple computations to enable parallelization. The threads of execution communicate through a set of primitives that manipulate a global tuple space. While these systems were not designed to address software errors (errors in these systems can permanently halt the execution of threads), Bristlecone implements a similar technique to decouple the execution of its tasks. The orchestration language Orc [46] specifies how work flows between tasks. Orc is designed to decouple operations and expose parallelism. Note that if an operation fails, any work (and any corresponding data) flowing through the task may be lost. Since the goal of Orc is not failure recovery, it was not designed to contain mechanisms for recovering data from failed tasks. Therefore, errors can cause critical information to disappear, eventually causing the software system to fail. Bristlecone uses flags to keep track of the conceptual states (or roles) that objects are in, enabling software systems to recover data from software errors and continue to execute successfully. Actors communicate through messages [47,48]. Actors were originally designed as a concurrent programming paradigm. Failures may cause actors to drop messages and corrupt or lose their state. Bristlecone's objects persist across task failures and can still
be used by other tasks. Moreover, state corruption in actors can cause actors to permanently crash. Since Bristlecone's tasks are stateless, a previous failure of a task does not affect future invocations of that task on different inputs. Argus is a distributed programming language that organizes processes under guardians and isolates a process failure to the guardian under which it executes [49]. Inconsistencies could potentially cause the enclosing guardian to shut down. Argus supports failure recovery through an exception handling mechanism. This approach is complementary to Bristlecone: a developer can write exception handlers for anticipated failures, and Bristlecone can be used to recover from unexpected failures. Oz is a concurrent, functional language that organizes computations as a set of tasks [50,51]. Tasks are created and destroyed by the program. A task becomes reducible (executable) once the constraint store satisfies the task's guard. Task reducibility is monotonic: once a task is reducible, it is always reducible. Task activation in Bristlecone is not monotonic: the developer can temporarily disable a task when other tasks have placed objects into states that are incompatible with the task or when the effect of the task is no longer desirable. Non-monotonicity makes it straightforward for a Bristlecone application to use multiple implementations of the same functionality for redundancy. Moreover, since task creation is controlled by the program in Oz, it is more difficult to reason statically about tasks. Concurrent Prolog is a logic-based language that uses unification to prove a goal [52,53]. The proof corresponds to the execution of the program. Concurrent Prolog's guarded notation is similar to Bristlecone's flag expressions, but Concurrent Prolog's evaluation strategy starts from an end goal and reasons backwards. Concurrent Prolog programs may be able to recover from some failures by finding a different execution that reaches the same end goal.
The downside is that if a failure prevents the program from completely achieving its end goal, the program will be unable to make partial progress. Bristlecone works forward and can therefore make progress even if a failure prevents the system from completely achieving its goal. Erlang has been used to implement robust systems using a set of supervisors and a hierarchy of increasingly simple implementations of the same functionality [54]. The supervisors monitor the computation for errors. If an error is detected, the system falls back to a simpler implementation in the hierarchy. Ericsson has taken this approach in its telephone switches. Bristlecone is complementary to the supervisor approach: while the supervisor approach gives the developer complete control over the recovery process, its downside is that it requires the developer to manually develop multiple implementations of the same functionality. Bristlecone requires minimal development effort and could potentially make recovery cost-effective for a larger set of applications. Furthermore, while a shared but minor fault could cause the entire Erlang implementation hierarchy to fail, in many cases Bristlecone may be able to execute around the fault and still provide nearly complete functionality. Several research projects use typestate-based approaches to automatically check that an API is used correctly [55,56]. Puntigam proposes tokens as a synchronization mechanism for object-oriented languages [57]. Bristlecone flags are similar to these mechanisms with one significant difference: Bristlecone uses flags to determine the
execution of a program, while these mechanisms only check (or synchronize) the actions of traditional imperative programs.

6.3 Related Software Architectures

The staged event-driven architecture (SEDA) pushes events through stages [58]. Note that this architecture was designed for high-performance computation and not fault tolerance. An error in a stage can prevent relaying the event and cause information to be lost. Stages also have local state; corruption of this state will therefore cause the stage to shut down until it is rebooted. It also appears difficult to specify that an application should either execute one sequence of operations or a second sequence, but not both.
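To make the flag mechanism contrasted with typestate above concrete, here is a minimal Java model (our notation; Bristlecone's actual task syntax is not shown in this paper chunk): a task is enabled only on objects whose flag set satisfies its guard, and, unlike Oz's monotonic reducibility, a later task can clear a flag and thereby disable the first task again.

```java
import java.util.*;

// Minimal model of flag-guarded task activation. An object carries a
// set of string flags; a task may fire on it only while the task's
// guard flags are all present. Clearing a flag disables the task again
// (non-monotonic activation).
class Flagged {
    final Set<String> flags = new HashSet<>();
}

class Scheduler {
    // True iff the task's guard holds, i.e. the task may be invoked on obj.
    static boolean enabled(Flagged obj, Set<String> guard) {
        return obj.flags.containsAll(guard);
    }
}
```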
7 Conclusion

We have successfully developed several robust software systems using Bristlecone. Bristlecone software systems consist of a set of interacting tasks, with each task implementing one of the conceptual operations in the software system. The developer specifies how these tasks interact using task specifications. Bristlecone uses transactions to recover data structures from task failures. Bristlecone then uses the task specifications to reason about how to continue execution in the presence of a failed task. The key results in this paper include the Bristlecone language, the Bristlecone compiler and runtime, and our experience using the Bristlecone language. Our experience indicates that the task-based approach used in Bristlecone can effectively enable software systems to recover from otherwise fatal errors. Bristlecone promises to increase the robustness of software systems and to decrease the cost of developing many classes of robust software systems.

Acknowledgments. We would like to thank the anonymous referees for their insightful feedback on our paper. This work was funded in part by NSF Grant CCF-0725350 and NSF Grant CNS-0720854.
References

1. Haugk, G., Lax, F., Royer, R., Williams, J.: The 5ESS(TM) switching system: Maintenance capabilities. AT&T Technical Journal 64(6, part 2), 1385–1416 (1985)
2. Mourad, S., Andrews, D.: On the reliability of the IBM MVS/XA operating system. IEEE Transactions on Software Engineering (September 1987)
3. Gray, J., Reuter, A.: Transaction Processing: Concepts and Techniques. Morgan Kaufmann, San Francisco (1993)
4. Baker, W.O., Ross, I.M., Mayo, J.S., Stanzione, D.C.: Bell Labs innovations in recent decades. Bell Labs Technical Journal 5(1), 3–16 (2000)
5. Harris, T.: Exceptions and side-effects in atomic blocks. Science of Computer Programming 58(3), 325–343 (2005)
6. Demsky, B., Cadar, C., Roy, D., Rinard, M.C.: Efficient specification-assisted error localization. In: Proceedings of the Second International Workshop on Dynamic Analysis (2004)
7. Demsky, B., Rinard, M.: Data structure repair using goal-directed reasoning. In: Proceedings of the 2005 International Conference on Software Engineering (May 2005)
8. Leavens, G.T., Leino, K.R.M., Poll, E., Ruby, C., Jacobs, B.: JML: notations and tools supporting detailed design in Java. In: OOPSLA 2000 Companion, pp. 105–106 (2000)
9. Meyer, B.: Applying Design by Contract. Computer 23(10), 40–51 (1992)
10. Demsky, B., Ernst, M.D., Guo, P.J., McCamant, S., Perkins, J.H., Rinard, M.: Inference and enforcement of data structure consistency specifications. In: Proceedings of the 2006 International Symposium on Software Testing and Analysis (2006)
11. Burdy, L., Cheon, Y., Cok, D., Ernst, M., Kiniry, J., Leavens, G.T., Leino, K.R.M., Poll, E.: An overview of JML tools and applications. International Journal on Software Tools for Technology Transfer 7(3), 212–232 (2005)
12. Ernst, M.D., Czeisler, A., Griswold, W.G., Notkin, D.: Quickly detecting relevant program invariants. In: Proceedings of the 22nd International Conference on Software Engineering (June 2000)
13. Demsky, B., Sundaramurthy, S.: Static analysis of task interactions in Bristlecone for program understanding. Technical Report UCI-ISR-07-7, Institute for Software Research, University of California, Irvine (October 2007)
14. Shavit, N., Touitou, D.: Software transactional memory. In: Proceedings of the 14th ACM Symposium on Principles of Distributed Computing (August 1995)
15. Ananian, C.S., Asanović, K., Kuszmaul, B.C., Leiserson, C.E., Lie, S.: Unbounded transactional memory. In: 11th International Symposium on High Performance Computer Architecture (February 2005)
16. Harris, T., Plesko, M., Shinnar, A., Tarditi, D.: Optimizing memory transactions. In: Proceedings of the 2006 Conference on Programming Language Design and Implementation (June 2006)
17. Spear, M.F., Marathe, V.J., Scherer, W.N., Scott, M.L.: Conflict detection and validation strategies for software transactional memory. In: Proceedings of the Twentieth International Symposium on Distributed Computing (2006)
18. Harris, T., Plesko, M., Shinnar, A., Tarditi, D.: Optimizing memory transactions. In: Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 14–25. ACM Press, New York (2006)
19. Herlihy, M., Moss, J.E.B.: Transactional memory: Architectural support for lock-free data structures. In: Proceedings of the Twentieth Annual International Symposium on Computer Architecture (1993)
20. Kumar, S., Chu, M., Hughes, C.J., Kundu, P., Nguyen, A.: Hybrid transactional memory. In: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2006)
21. Hammond, L., Wong, V., Chen, M., Hertzberg, B., Carlstrom, B., Prabhu, M., Wijaya, H., Kozyrakis, C., Olukotun, K.: Transactional memory coherence and consistency (TCC). In: Proceedings of the 11th Intl. Symposium on Computer Architecture (June 2004)
22. Boyapati, C., Khurshid, S., Marinov, D.: Korat: Automated testing based on Java predicates (2002)
23. Ghiya, R., Hendren, L.J.: Is it a tree, a DAG, or a cyclic graph? A shape analysis for heap-directed pointers in C. In: Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (1996)
24. Wies, T., Kuncak, V., Lam, P., Podelski, A., Rinard, M.: Field constraint analysis. In: Proceedings of the International Conference on Verification, Model Checking, and Abstract Interpretation (2006)
25. Sagiv, M., Reps, T., Wilhelm, R.: Parametric shape analysis via 3-valued logic. In: Symposium on Principles of Programming Languages, pp. 105–118 (1999)
26. Corbett, J.C., Dwyer, M.B., Hatcliff, J., Laubach, S., Pasareanu, C.S., Robby, Zheng, H.: Bandera: Extracting finite-state models from Java source code. In: Proceedings of the 2000 International Conference on Software Engineering (2000)
27. Shirvani, P.P., Saxena, N.R., McCluskey, E.J.: Software-implemented EDAC protection against SEUs. IEEE Transactions on Reliability 49(3), 273–284 (2000)
28. Accetta, M., Baron, R., Bolosky, W., Golub, D., Rashid, R., Tevanian, A., Young, M.: Mach: A new kernel foundation for UNIX development. In: Proceedings of the USENIX Summer Conference (1986)
29. Goodenough, J.B.: Structured exception handling. In: POPL 1975: Proceedings of the 2nd ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (1975)
30. Anderson, T., Kerr, R.: Recovery blocks in action: A system supporting high reliability. In: Proceedings of the 2nd International Conference on Software Engineering, pp. 447–457 (1976)
31. Zhang, Y., Wong, D., Zheng, W.: User-level checkpoint and recovery for LAM/MPI. ACM SIGOPS Operating Systems Review 39(3), 72–81 (2005)
32. Plank, J.S., Beck, M., Kingsley, G., Li, K.: Libckpt: Transparent checkpointing under Unix. In: Usenix Winter Technical Conference, pp. 213–223 (January 1995)
33. Chandy, K.M., Ramamoorthy, C.: Rollback and recovery strategies. IEEE Transactions on Computers C-21(2), 137–146 (1972)
34. Young, J.W.: A first order approximation to the optimum checkpoint interval. Communications of the ACM 17(9), 530–531 (1974)
35. Huang, K., Wu, J., Fernandez, E.B.: A generalized forward recovery checkpointing scheme. In: Proceedings of the 1998 Annual IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems (April 1998)
36. Avizienis, A.: The methodology of N-version programming (1995)
37. Patterson, D., Brown, A., Broadwell, P., Candea, G., Chen, M., Cutler, J., Enriquez, P., Fox, A., Kıcıman, E., Merzbacher, M., Oppenheimer, D., Sastry, N., Tetzlaff, W., Traupman, J., Treuhaft, N.: Recovery-oriented computing (ROC): Motivation, definition, techniques, and case studies. Technical Report UCB//CSD-02-1175, UC Berkeley Computer Science (March 15, 2002)
38. Candea, G., Fox, A.: Recursive restartability: Turning the reboot sledgehammer into a scalpel. In: HotOS-VIII, pp. 110–115 (May 2001)
39. Rinard, M., Cadar, C., Dumitran, D., Roy, D.M., Leu, T., Beebee Jr., W.S.: Enhancing server availability and security through failure-oblivious computing. In: Proceedings of the 6th Symposium on Operating Systems Design and Implementation (December 2004)
40. Berger, E., Zorn, B.: DieHard: Probabilistic memory safety for unsafe languages. In: Proceedings of the ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation (June 2006)
41. Khurshid, S., García, I., Suen, Y.L.: Repairing structurally complex data. In: Proceedings of the 12th International SPIN Workshop on Model Checking of Software (August 2005)
42. Rinard, M.: Probabilistic accuracy bounds for fault-tolerant computations that discard tasks. In: Proceedings of the 20th ACM International Conference on Supercomputing (2006)
43. Benton, N., Cardelli, L., Fournet, C.: Modern concurrency abstractions for C#. In: Proceedings of the 16th European Conference on Object-Oriented Programming (2002)
44. Johnston, W.M., Hanna, J.R.P., Millar, R.J.: Advances in dataflow programming languages. ACM Computing Surveys 36(1) (2004)
45. Gelernter, D.: Generative communication in Linda. ACM Transactions on Programming Languages and Systems 7(1), 80–112 (1985)
46. Cook, W.R., Patwardhan, S., Misra, J.: Workflow patterns in Orc. In: Proceedings of the 2006 International Conference on Coordination Models and Languages (2006)
47. Hewitt, C., Baker, H.G.: Actors and continuous functionals. Technical report, Massachusetts Institute of Technology, Cambridge, MA, USA (1978)
48. Agha, G., Mason, I.A., Smith, S.F., Talcott, C.L.: A foundation for actor computation. Journal of Functional Programming 7(1), 1–72 (1997)
49. Liskov, B., Day, M., Herlihy, M., Johnson, P., Leavens, G., Scheifler, R., Weihl, W.: Argus reference manual. Technical Report MIT-LCS-TR-400, Massachusetts Institute of Technology (November 1987)
50. Smolka, G.: The Oz programming model. In: Proceedings of the European Workshop on Logics in Artificial Intelligence, p. 251. Springer, London (1996)
51. Mehl, M.: The Oz Virtual Machine - Records, Transients, and Deep Guards. PhD thesis, Technische Fakultät der Universität des Saarlandes (1999)
52. Shapiro, E.: The family of concurrent logic programming languages. ACM Computing Surveys 21(3), 413–510 (1989)
53. Shapiro, E.: Concurrent Prolog: A progress report. Computer 19(8), 44–58 (1986)
54. Armstrong, J.: Making Reliable Distributed Systems in the Presence of Software Errors. PhD thesis, Swedish Institute of Computer Science (November 2003)
55. DeLine, R., Fahndrich, M.: Typestates for objects. In: Proceedings of the 18th European Conference on Object-Oriented Programming (2004)
56. Bierhoff, K., Aldrich, J.: Modular typestate checking of aliased objects. In: Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications, pp. 301–320 (2007)
57. Puntigam, F.: Internal and external token-based synchronization in object-oriented languages. In: Modular Programming Languages, Proceedings of the 7th Joint Modular Languages Conference, pp. 251–270 (2006)
58. Welsh, M., Culler, D.E., Brewer, E.A.: SEDA: An architecture for well-conditioned, scalable internet services. In: Proceedings of the Eighteenth Symposium on Operating Systems Principles (October 2001)
Session-Based Distributed Programming in Java

Raymond Hu¹, Nobuko Yoshida¹, and Kohei Honda²
¹ Imperial College London
² Queen Mary, University of London
Abstract. This paper demonstrates the impact of integrating session types and object-oriented programming, through their implementation in Java. Session types provide high-level abstraction for structuring a series of interactions in a concise syntax, and ensure type-safe communications between distributed peers. We present the first full implementation of a language and runtime for session-based distributed programming featuring asynchronous message passing, delegation, and session subtyping and interleaving, combined with class downloading and failure handling. The compilation-runtime framework of our language effectively maps session abstraction onto underlying transports and guarantees communication safety through static and dynamic session type checking. We have implemented two alternative mechanisms for performing distributed session delegation and prove their correctness. Benchmark results show session abstraction can be realised with low runtime overhead.
1 Introduction
Communication in object-oriented programming. Communication is becoming a fundamental element of software development. Web applications increasingly combine numerous distributed services; an off-the-shelf CPU will soon host hundreds of cores per chip; corporate integration builds complex systems that communicate using standardised business protocols; and sensor networks will place a large number of processing units per square meter. A frequent pattern in communication-based programming involves processes interacting via some structured sequence of communications, which as a whole form a natural unit of conversation. In addition to basic message passing, a conversation may involve repeated exchanges or branch into one of multiple paths. Structured conversations of this nature are ubiquitous, arising naturally in server-client programming, parallel algorithms, business protocols, Web services, and application-level network protocols such as SMTP and FTP. Objects and object-orientation are a powerful abstraction for sequential and shared variable concurrent programming. However, objects do not provide sufficient support for high-level abstraction of distributed communications, even with a variety of communication API supplements. Remote Method Invocation (RMI), for example, cannot directly capture arbitrary conversation structures; interaction is limited to a series of separate send-receive exchanges. More flexible interaction structures can, on the other hand, be expressed through lower-level
J. Vitek (Ed.): ECOOP 2008, LNCS 5142, pp. 516–541, 2008. © Springer-Verlag Berlin Heidelberg 2008
(TCP) socket programming, but communication safety is lost: raw byte data communicated through sockets is inherently untyped, and conversation structure is not explicitly specified. Consequently, programming errors in communication cannot be statically detected with the same level of robustness as standard type checking protects object type integrity. The study of session types has explored a type theory for structured conversations in the context of process calculi [27,12,13] and a wide variety of formal systems and programming languages. A session is a conversation instance conducted over, logically speaking, a private channel, isolating it from interference; a session type is a specification of the structure and message types of a conversation as a complete unit. Unlike method call, which implicitly builds a synchronous, sequential thread of control, communication in distributed applications is often interleaved with other operations and concurrent conversations. Sessions provide a high-level programming abstraction for such communications-based applications, grouping multiple interactions into a logical unit of conversation and guaranteeing their communication safety through types.

Challenge of session-based programming. This paper demonstrates the impact of integrating session types into object-oriented programming in Java. Preceding works include theoretical studies of session types in object-oriented core calculi [10,8], and the implementation of a systems-level object-oriented language with session types for shared memory concurrency [11]. We further these works by presenting the first full implementation of a language and runtime for session-based distributed programming featuring asynchronous message passing, delegation, and session subtyping and interleaving, combined with class downloading and failure handling. The following summarises the central features of the proposed compilation-runtime framework.

1. Integration of object-oriented and session programming disciplines. We extend Java with concise and clear syntax for session types and structured communication operations. Session-based distributed programming involves specifying the intended interaction protocols using session types and implementing these protocols using the session operations. The session implementations are then verified against the protocol specifications. This methodology uses session types to describe interfaces for conversation in the way Java interfaces describe interfaces for method-call interaction.
2. Ensuring communication safety for distributed applications. Communication safety is guaranteed through a combination of static and dynamic validations. Static validation ensures that each session implementation conforms to a locally declared protocol specification; runtime validation at session initiation checks that the communicating parties implement compatible protocols.
3. Supporting session abstraction over concrete transports. Our compilation-runtime framework maps application-level session operations, including delegation, to runtime communication primitives, which can be implemented over a range of concrete transports; our current implementation uses TCP. Benchmark results show session abstraction can be realised over the underlying transport with low runtime overhead.
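The runtime compatibility check at session initiation amounts to a duality test: each send on one side must be matched by a receive of the same message type on the other. The following sketch uses our own toy notation for protocols (not SJ syntax), with actions written as strings like "!int" (send an int) and "?int" (receive an int).

```java
// Toy duality check between two peers' protocols, each given as a
// sequence of actions: '!' prefixes a send, '?' a receive, followed by
// the message type name. Protocols are compatible when each action on
// one side is the dual of the corresponding action on the other.
class Duality {
    static boolean compatible(String[] p, String[] q) {
        if (p.length != q.length) return false;
        for (int i = 0; i < p.length; i++)
            if (!p[i].equals(dual(q[i]))) return false;
        return true;
    }

    // The dual of a send is a receive of the same type, and vice versa.
    static String dual(String action) {
        char op = action.charAt(0) == '!' ? '?' : '!';
        return op + action.substring(1);
    }
}
```

A real session-type compatibility check must also handle branching, iteration, and subtyping, which this linear sketch omits.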
R. Hu, N. Yoshida, and K. Honda
A key technical contribution of our work is the implementation of distributed session delegation: transparent, type-safe endpoint mobility is a defining feature that raises session abstraction above the underlying transport. We have designed and implemented two alternative mechanisms for performing delegation, and proved their correctness. We also demonstrate how the integration of session types and objects can support extended features such as eager remote class loading and eager class verification. Paper summary. Section 2 illustrates the key features of session programming by example. Section 3 describes the design elements of our compilation-runtime framework. Section 4 discusses the implementation of session delegation and its correctness. Section 5 presents benchmark results. Section 6 discusses related work, and Section 7 concludes. The compiler and runtime, example applications and omitted details are available at [26].
2   Session-Based Programming
This section illustrates the central ideas of programming in our session-based extension of Java, called SJ for short, by working through an example: an online ticket ordering system for a travel agency. This example comes from a Web service use case in WS-CDL-Primer 1.0 [6], capturing a collaboration pattern typical of many business protocols [3,28]. Figure 1 depicts the interaction between the three parties involved: a client (Customer), the travel agency (Agency) and a travel service (Service). Customer and Service are initially unknown to each other but later communicate directly through the use of session delegation. Delegation in SJ enables dynamic mobility of sessions whilst preserving communication safety. The overall scenario of this conversation is as follows.

1. Customer begins an order session s with Agency, then requests and receives the price for the desired journey. This exchange may be repeated an arbitrary number of times for different journeys at the initiative of Customer.

2. Customer either accepts an offer from Agency or decides that none of the received quotes are satisfactory (these two possible paths are illustrated separately as adjacent flows in the diagram).

3. If an offer is accepted, Agency opens a second session s′ with Service and delegates to Service, through s′, the interactions with Customer remaining for s. The particular travel service contacted by Agency is likely to depend on the journey chosen by Customer, but this logic is external to the present example.

4. Customer then sends a delivery address (unaware that he/she is now talking to Service), and Service replies with the dispatch date for the purchased tickets. The transaction is now complete.

5. If no quotes were suitable, Customer cancels the transaction and the session terminates.

The rest of this section describes how this application can be programmed in SJ.
Roughly speaking, session programming consists of two steps: specifying the intended interaction protocols using session types, and implementing these protocols using session operations.
Session-Based Distributed Programming in Java
[Sequence diagram omitted: interactions between Customer, Agency and Service.]
Fig. 1. A ticket ordering system for a travel agency
Protocol specification. In SJ, session types are called protocols, which are declared using the protocol keyword. The protocols for the order session (between Customer and Agency) are specified below as placeOrder, which describes the interactions from Customer's side, and acceptOrder, from Agency's side.¹

  protocol placeOrder {
    begin.        // Commence session.
    ![            // Can iterate:
      !<String>.  //   send a String,
      ?(Double)   //   receive a Double.
    ]*.
    !{            // Select one of:
      ACCEPT: !<Address>.?(Date),
      REJECT:
    }
  }

Order protocol: Customer side.
  protocol acceptOrder {
    begin.
    ?[
      ?(String).
      !<Double>
    ]*.
    ?{
      ACCEPT: ?(Address).!<Date>,
      REJECT:
    }
  }

Order protocol: Agency side.
We first look at placeOrder: the first part says that Customer may repeat, as many times as desired (expressed by ![...]*), the sequence of sending a String (!<String>) and receiving a Double (?(Double)). Customer then selects (!{...}) one of the two options, ACCEPT and REJECT. If ACCEPT is chosen, Customer sends an Address and receives a Date, then the session terminates; if REJECT, the session terminates immediately. The acceptOrder protocol is dual to placeOrder, given by inverting
¹ SJ also supports an alternative syntax for protocols (session types) that replaces symbols such as '!' and '?' with English keywords [26].
the input '?' and the output '!' symbols in placeOrder, thus guaranteeing a precise correspondence between the actions of each protocol.

Session sockets. After declaring the protocols for the intended interactions, the next step is to create session sockets for initiating sessions and performing session operations. There are three main entities:

– Session server socket, of class SJServerSocket, which listens for session requests, accepting those compatible with the specified protocol;
– Session server-address, of class SJServerAddress, which specifies the address of a session server socket and the type of session it accepts; and
– Session socket, of class SJSocket, which represents one endpoint of a session channel, through which the communication actions within a session are performed. Clients use session sockets to request sessions with a server.

SJ uses the terminology of standard socket programming for familiarity. The session sockets and session server sockets correspond to their standard socket equivalents, but are enhanced by their associated session types. Client sockets are bound to a session server-address at creation, and can only make requests to that server. Session server sockets accept a request if the type of the server is compatible with that of the requesting client; the server then creates a fresh session socket (the opposing endpoint to the client socket) for the new session. Once the session is established, messages sent through one socket are received at the opposing endpoint. Static type checking ensures that the sent messages respect the type of the session; together with the server validation, this guarantees communication safety. The occurrences of a session socket in an SJ program clearly delineate the flow of a conversation, interleaved with other commands.

Session server sockets.
Parties that offer session services, like Agency, use a session server socket to accept session requests:

  SJServerSocket ss_ac = SJServerSocketImpl.create(acceptOrder, port);
After opening a server socket, the server party can accept a session request by

  s_ac = ss_ac.accept();
where s_ac is an uninitialised (or null) SJSocket variable. The accept operation blocks until a session request is received: the server then validates that the protocol requested by the client is compatible with that offered by the server (see § 3 for details) and returns a new session socket, i.e. the server-side endpoint.

Session server-address and session sockets. A session server-address in the current SJ implementation identifies a server by its IP address and TCP port. At the Customer, we set:

  SJServerAddress c_ca = SJServerAddress.create(placeOrder, host, port);
A server-address is typed with the session type seen from the client side, in this case placeOrder. Server-addresses can be communicated to other parties, allowing them to request sessions with the same server. Customer uses c_ca to create a session socket:
SJSocket s_ca = SJSocketImpl.create(c_ca);
and request a session with Agency:

  s_ca.request();
Assuming the server socket identified by c_ca is open, request blocks until Agency performs the corresponding accept. The requesting and accepting sides then exchange session types, independently validate compatibility, and, if successful, the session between Customer and Agency is established. If the types are incompatible, an exception is raised at both parties (see 'Session failure' below).

Session communication (1): send and receive. After the session has been successfully established, the session socket s_ca belonging to Customer (respectively s_ac for Agency) is used to perform the actual session operations according to the protocol placeOrder. Static session type checking ensures that this contract is obeyed, modulo session subtyping (see § 3.2 later). The basic message-passing operations, performed by send and receive, asynchronously communicate typed objects. The opening exchange of placeOrder directs Customer to send the details of the desired journey (!<String>) and receive a price quote (?(Double)).

  s_ca.send("London to Paris, Eurostar"); // !<String>.
  Double cost = s_ca.receive();           // ?(Double)
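The initiation handshake just described — exchange session types, then validate independently on each side — can be sketched over plain TCP. This is a hypothetical illustration, not the SJ runtime's actual wire protocol: each side sends its own protocol string and checks the peer's against the expected dual of its own (here precomputed), raising an exception on mismatch.

```java
import java.io.*;
import java.net.*;

// Hypothetical sketch of session initiation over TCP (not the SJ runtime).
public final class Handshake {
    // Exchange protocol strings over an established connection and
    // validate compatibility; throws if the peer's protocol is wrong.
    static void validate(Socket s, String mine, String expectedPeer)
            throws IOException {
        DataOutputStream out = new DataOutputStream(s.getOutputStream());
        DataInputStream in = new DataInputStream(s.getInputStream());
        out.writeUTF(mine);   // Both sides write first, then read,
        out.flush();          // so neither blocks the other.
        if (!in.readUTF().equals(expectedPeer))
            throw new IOException("incompatible session protocols");
    }

    public static void main(String[] args) throws Exception {
        String place  = "![!(String).?(Double)]*"; // placeOrder-like type
        String accept = "?[?(String).!(Double)]*"; // its dual, acceptOrder-like
        try (ServerSocket server = new ServerSocket(0)) {
            Thread agency = new Thread(() -> {
                try (Socket s = server.accept()) {
                    validate(s, accept, place); // Agency expects the dual.
                } catch (IOException e) { throw new RuntimeException(e); }
            });
            agency.start();
            try (Socket s = new Socket("localhost", server.getLocalPort())) {
                validate(s, place, accept);     // Customer likewise.
            }
            agency.join();
            System.out.println("session established");
        }
    }
}
```

Both endpoints write before reading, so the exchange cannot deadlock; a failed check surfaces as an exception on each side, mirroring the behaviour described above.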
In this instance, the compiler infers the expected receive type from the placeOrder protocol; explicit receive-casts are also permitted.

Session communication (2): iteration. Iteration is abstracted by the two mutually dual types written ![...]* and ?[...]* [10,8]. Like regular expressions, [...]* expresses that the interactions in [...] may be iterated zero or more times; the ! prefix indicates that this party controls the iteration, while its dual party, of type ?[...]*, follows this decision. These types are implemented using the outwhile and inwhile [10,8] operations, which together can be considered a distributed version of the standard while-loop. The opening exchange of placeOrder, ![!<String>.?(Double)]*, and its dual type at Agency can be implemented as follows.

  boolean decided = false;
  ... // Set journey details.
  s_ca.outwhile(!decided) {
    s_ca.send(journeyDetails);
    Double cost = s_ca.receive();
    ... // Set decided to true, or
    ... // change details and retry.
  }
  s_ac.inwhile() {
    String journeyDetails = s_ac.receive();
    ... // Calculate the cost.
    s_ac.send(price);
  }
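The condition-communication that outwhile and inwhile perform can be mimicked in plain Java by prefixing each iteration with a control flag on the wire. The sketch below uses a hypothetical encoding (a boolean before every message, false to end the loop) and omits the Double replies for brevity; it is not SJ's actual wire format.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch of outwhile/inwhile synchronisation over a byte stream.
public final class Iteration {
    // Customer-like side: drives the loop, sending one String per iteration.
    static byte[] outwhileSend(List<String> journeys) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (String j : journeys) {
            out.writeBoolean(true);  // "one more iteration"
            out.writeUTF(j);
        }
        out.writeBoolean(false);     // "loop finished"
        return buf.toByteArray();
    }

    // Agency-like side: follows the peer's decision at each iteration.
    static List<String> inwhileReceive(byte[] wire) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
        List<String> received = new ArrayList<>();
        while (in.readBoolean())     // Loop condition decided by the peer.
            received.add(in.readUTF());
        return received;
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = outwhileSend(List.of("London to Paris", "Paris to Nice"));
        System.out.println(inwhileReceive(wire));
    }
}
```

The passive side has no loop condition of its own, exactly as in the inwhile code above: it simply follows the flags embedded in the stream by the controlling side.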
Like the standard while-statement, the outwhile operation evaluates the boolean condition for iteration (!decided) to determine whether the loop continues or
terminates. The key difference is that this decision is implicitly communicated to the session peer (in this case, from Customer to Agency), synchronising the control flow between the two parties. Agency is programmed with the dual behaviour: inwhile does not specify a loop-condition because this decision is made by Customer and communicated to Agency at each iteration. These explicit constructs for iterative interaction can greatly improve code readability and eliminate subtle errors, in comparison to ad hoc synchronisation over untyped I/O.

Session communication (3): branching. A session may branch down one of multiple conversation paths into a sub-conversation. In placeOrder, Customer's type reads !{ACCEPT: !<Address>.?(Date), REJECT: }, where ! signifies the selecting side. Hence, Customer can select ACCEPT, proceeding into a sub-conversation with two communications (send an Address, receive a Date); otherwise, selecting REJECT immediately terminates the session. The branch types are implemented using outbranch and inbranch. This pair of operations can be considered a distributed switch-statement; alternatively, one may view outbranch as being similar to method invocation, with inbranch likened to an object waiting with one or more methods. We illustrate these constructs through the next part of the programs for Customer and Agency.

  if (want to place an order) {
    s_ca.outbranch(ACCEPT) {
      s_ca.send(address);
      Date dispatchDate = s_ca.receive();
    }
  } else { // Don't want to order.
    s_ca.outbranch(REJECT) { }
  }
  s_ac.inbranch() {
    case ACCEPT: { ... }
    case REJECT: { }
  }
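The "distributed switch-statement" reading of outbranch/inbranch can be illustrated in plain Java: the selecting side sends a label, and the offering side dispatches on it. The encoding below (a label string followed by an optional payload) is hypothetical, not SJ's actual wire format.

```java
import java.io.*;

// Hypothetical sketch of outbranch/inbranch as a distributed switch.
public final class Branching {
    // Selecting side: send the chosen label, then any branch-specific payload.
    static byte[] outbranch(String label, String payload) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(label);
        if (label.equals("ACCEPT"))
            out.writeUTF(payload); // e.g. the delivery address
        return buf.toByteArray();
    }

    // Offering side: read the peer's label and dispatch, like a switch.
    static String inbranch(byte[] wire) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
        switch (in.readUTF()) {
            case "ACCEPT": return "dispatch to: " + in.readUTF();
            case "REJECT": return "order cancelled";
            default: throw new IOException("unknown branch label");
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(inbranch(outbranch("ACCEPT", "10 Downing St")));
        System.out.println(inbranch(outbranch("REJECT", "")));
    }
}
```

As with iteration, only one side makes the decision; the other side must be prepared for every label its protocol declares, which is what static checking of the inbranch cases enforces in SJ.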
The condition of the if-statement in Customer (whether or not Customer wishes to purchase tickets) determines which branch will be selected at runtime. The body of ACCEPT in Agency is completed under 'Session delegation' below.

Session failure. Sessions are implemented within session-try constructs:

  try (s_ac, ...) {
    ... // Implementation of session 's_ac' and others.
  } catch (SJIncompatibleSessionException ise) {
    ... // One of the above sessions could not be initiated.
  } catch (SJIOException ioe) {
    ... // I/O error on any of the above sessions.
  } finally { ... } // Optional.
The session-try refines the standard Java try-statement to ensure that multiple sessions, which may freely interleave, are consistently completed within the specified scope. Sessions may fail at initiation due to incompatibility, and later at any point due to I/O or other errors. Failure is signalled by propagating terminal exceptions: the failure of one session, or any other exception that causes the
flow of control to leave the session-try block, will cause the failure of all other ongoing sessions within the same scope. This does not affect a party that has successfully completed its side of a session, which will asynchronously leave its session-try scope. Nested session-try statements offer programmers the choice of failing the outer session if an inner session fails, or of consuming the inner exception and continuing normally.

Session delegation. If Customer is happy with one of Agency's quotes, it will select ACCEPT. This causes Agency to open a second session with Service, over which Agency delegates to Service the remainder of the conversation with Customer, as specified in acceptOrder. After the delegation, Agency relinquishes the session and Service will complete it. Since this ensures that the contract of the original order session will be fulfilled, Agency need not perform any further action for the delegated session; indeed, Agency's work is finished after the delegation. At the application level, the delegation is exposed only to Agency and Service; Customer proceeds to interact with Service unaware that Agency has left the session, as evidenced by the absence of any such action in placeOrder. The session between Agency and Service is specified by the following mutually dual protocols:

  protocol delegateOrderSession {
    begin.!