This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Welcome to Earth l Visitors may also be interested in laking a look at the Moon.
I 2 3 4
XML Prolog: Every XML document mu st ha ve a prolog as its first line. The prolog mu st at least specify the version of XML in use (w hi ch is c urrent ly 1.0). For example:
A third attribute may be used to state whether the documen t stand s a lone or is dependent. on external definitions.
Ellcodillgs: The prolog may also spec ify the e ncoding (UTF-8 is the default and was explained in Section 4.3.2). The te nn encoding refers to the set of codes used to represe nt characters - ASCII being the best known example. Note that in the XML prolog, ASCII is specified as us-ascii. Other possible encodings include ISO-8859- 1 (or Latin- I ), an e ight-bit. encoding whose fi rst. 128 values are ASC n , the rest are used to represent the characters in western european languages. Other eight-bi t encodin gs are ava ilable for representing other alph abets, for examp le, greek or cyrilli c.
Figure 4.11
Illustration of the use of a namespace in the Person structure Smilh London
1934
XML namespaces 0 Traditionally, namcspaces provide a means for sco ping nam es. An XML namespace is a set of names for a coll ection of element types and attr ibutes, that is referenced by a URL. An XML nam es pace may be used by any other XML document by referring 1.0 il s URL. Any e lement that makes use of an XML namespace can specify Ihal names pace as an attribute call ed xmlns, whose va lue is a URL refe rrin g to the fiJe containing the namcspace definitions, For example; xmlns:pers = ..!Jttp:l/www.cdk4.netlpersoll .. The name after xmllls, in thi s case pers can be lIsed as a prefi x to refer La th e e lements in a particular namespace, as shown in Figure 4.11. The pers prefix is bound to hrrp:// www.cdk4.l1et/persol1 for the persoll e lement A namespace applies w ithin the context of the enclosing pair of start and end tags unless overridden by an enclosed nam es pace dec laration . An XML document may be defined in term s of several different namespaces, each of which would be referenced by a unique prefix. The namespace co nvention allows an application to make use of multipl e sets of external definitions in different namespaces without the ri sk of nam e clashes. XML schemas 0 An XML schema [www.w3.org VllIj defines the elements and attributes that can appea r in a doc ument, how the elements are nested and the order and number of e lements, whether an e le ment is empty or can include text. For eac h element , it define s the Iype and de fault va lue. Figure 4,12 gives an exampl e of a sc hema that
Figure 4.12
An XML schema for the Person structure
T: Requ es t ti cket for serv ice S
l all rh(C)I
KCT
C requests the ticket- grantin g serve r T to suppl y a ti cket for communication with another server S.
'
I ricker(C .Tj IKT' S. II
4. T --> C: Service ticket
T checks the ti cket. If it is valid T generates a new random sess ion key K CS and relUrns it w ith a ticket for S (encrypted in the server' s sec ret key KS ).
C is th en read y to iss ue requ est m essages to the se rver. S :
C.
Issue a sen'er request with a lide ,
5. C --> S: Service reque st
I aur"(C) IKCS . I ricker(C.5) IKS reqllest. 1/
.
C se nds th e li cke t 10 S wilh a new ly generated auth enti cator for C and a reques t. The request would be enc ry pted in Kcs if secrecy of the data is required.
For the client to be sure of the server' s authenti c ity. S should return the nonce f/ to C . (To reduce the number of messages required , thi s could be includ ed in th e messages (hat contain th e serve r's repl y to (h e request): D.
Authenticate server (optional)
6. S --> C: Se rv er auth el1licati o n
(Optional ): S send s the non ce to C, encrypted in Kcs.
Application of Kerberos 0 Ke rbe ros was developed for use in Projec t Ath ena at MIT - a c e => lEe
A cOJlsislelJ! global sIGle is one thai corres ponds to a cOllsistelll c ut. We may characterize Ih e exec ution of a di stributed sys tem as a se ries of tran sition s between global states of the sys te m:
In each tran sition, prec ise ly o ne even t occurs at so me single process in the sys tem. This event is either the sending o f a message, th e rece ipt of a message. or an internal event. If two event s happe ned simultaneo usly. we may nonetheless deem them to have occurred in a definite order - say ordered accordi ng to process identifi e rs. (Eve nts that occur simultan eously must be concurrent: ne ither happened-before the other. ) A system evolves in thi s way through consistent globa l states. A rull is a total o rderin g of a ll th e eve nt s in a global hi story that is consiste nt with each local hi story 's ordering, -')j (i = 1,2, ... , N) . A lilleari:atioll or COIIsiSle11l rull is an ordering of the eve nt s in a globa l hi story that is consistent with thi s happened-before re lation -') on H . Note that a linearization is also a run . No t all run s pass throug h consistent g lobal stat es. but a ll lineari zation s pass onl y through consistent global stat es. We say that a stat e 5' is reachable from a stale 5 if there is a linearizat ion that passes throug h S and then 5'. Sometimes we may alte r th e o rderin g of conc urrent eve nt s within a linearizati o n. and derive a run that still passes through onl y co nsistent global f'i tates. Fo r example, if IWO success ive even ts in a lineari za ti on are the rece ipt of messages by two processes, Ihen we may swap Ihe o rder of these two eveJ1l s.
11 .5.2 Global state predicates, stability, safety and live ness Detectin g a co ndition such as deadl oc k or termination amounts to evaluating a global stale predicate. A glo bal stal e predi cate is a functi on that maps from the set of global
states of processes in th e system ~o to 1True, False I. O ne of the lIseful c haracte ri sti cs of th e predi cat es associated with the stat e of an o bjec t be ing garbage. of the system bei ng deadlocked or th e system be ing te rminated is th at they are a ll stable: once the system e nte rs a stat e in whi ch the predicate is Tl'lle , it re mains True in a ll future states reac hable from that state. By contrast, when we monitor o r debug an app li cati o n we are often interes ted in non -s tabl e predicates. s uch as that in a LII" example of va ri ables who se difference is supposed to be bounded. Even if the app licat ion reac hes a state in which the bound obtains. it need not stay in that state. We al so note here two furth er notions relevant to glo ba l state predicates: safe ty and li vc ncss. Suppose th ere is an unde sirable property Ct th at is a predi cate or the system 's glo bal stat e - for example, a could be the propert y of be ing dead locked. Le t
So be the original state of the system. SafelY wit h respec t to 0. is the assertion that 0. evaluates 10 Fahe for all slales S reac hab le from SO' Conversel y. let P be a desirable property of a system 's global stal e - for exa mpl e. the prope rl y of reachin g termination. Li relless with res pect to P is the property th at, for any lineari zation L start ing in the stat e SO' P evaluates 10 True for so me slate SL reac hable from So'
11 .5.3 Th e 'snapshot' algorithm of Chandy and Lamport Chandy and Lamport [1 985] desc ribe a 'snapshot' algorithm for dete rminin g g lobal state s of di slributed sys tems. whi ch we no w prese nt. The goa l of the algo rithm is to record a se t of process and channel states (a 'snapshot' ) for a se t of processes Pi (i = 1.2, .'" N) such Ihal, evcllthough the combination of recorded Slates may never have occ urred at th e sam e lime, Ihe recorded global state is consisteill. We shall see th at Ihe slate thai the snapshot algor ilhm reco rds has conven ient properti es for evaluatin g stable global pred icates. The algorithm records state loca ll y at processes; it does nOI g ive a method for ga the rin g the global state at one sile. An obvio us method for gathe rin g th e slate is for all processes 10 send Ihe Slate they recorded to a designat ed collector process. but we shall not add ress thi s iss ue further here. The algorithm ass umes th at: ne ither channels nor processes fa il ; communi cation is reliable so thaI eve ry message se nt is eve ntually received intact, exac tl y once; channels are unidirectional and pro vide FIFO-ordered message de li ve ry; the graph of processes and c hann els is strongl y connec ted (there is a path between any IwO processes); any process may initiate a global snapshot at any time; the processes may continue the ir exec uti on and send and receive nomMI messages whil e the snapshot takes place. For eac h process Pj' lei the incoming chanflels be those at Pi over which othe r processes send it messages; similarl y. P i ' S outgoing challllels are those on whic h it send s messages to oth er processes. The esse ntial idea of the al gorithm is as follow s. Each process records its state and al so for eac h incoming chan ne l a set of messages sent 10 it. The process record s, for each channe l, any messages that arr ived after it recorded its state and before the sende r recorded its own state. This arrangement allows us 10 reco rd the stat es of processes at different tim es but to account for the diffe rentials between process states in lerms of messages tran smitted but not yet receiv ed. If process Pi has sent a message J1/ to process pl' bu t P j has not received it, then we account for m as be longing to th e state of the c hann el between them. The algori thm proceeds through use of spec ial marker messages. wh ich are di stinct from any othe r messages the processes send, and whic h the processes may send and receive whil e they proceed with their nomlal exec ution. The marker has a dual rol e: as a prompt fo r the receive r 10 save its own state, if it has not already done so; and as a means of determinin g which messages to include in the channel slale.
Figure 11.10 Chandy and Lamport's 'snapshot' algorithm Marker receiving rule Jor process Pi On P i'S receipt of a marker message over channel c if(Pi has no t yet recorded its state) it records its process state now; records the state of c as the empty set; turn s on recording of messages arriving over o th er inco ming c hannels; else Pi records the Sl.ate of c as the sel of messages it has received over c since it saved its state. elld if
Marker sending rule for process Pi After Pi has recorded its state, for each o utgo ing c hanne l c: Pi sends one marker message over c (before it sends any o th er message ove r c) .
The a lgorithm is de fined th roug h two rules. the marker re ceiviJ/~ rule and the marker sending rule (Figu re 11.1 0), The marker sending rule obli gales processes to send a marker after th ey ha ve recorded th eir state, but before they send any other me ssages. The marker receivi ng ru le obligates a process [ha t has not recorded its state to do so. In th at case, th is is the first marke r th at it has received. It notes whi ch messages subsequenLl y arri ve on the o th er incom ing channe ls. When a process that has already saved its state receives a marker (o n another channe l), it records th e state of that chann el as the set of messages it rece ived o n it since it saved its stare. Any process may begin the algorithm at any time. It ac ts as th oug h it has received a marker (over a non·ex istent channel) and foll ows the marke r receivin g rule. T hu s it record s its state and begi ns to record messages arrivin g over a ll its incomin g channels. Several processes may init iate recordin g co ncurrentl y in thi s way (as lo ng as the markers they use ean be di stin gu ished). We illustra te th e algorithm for a system of two processes, P I and P2 connected by two uni direct ional channels, C I and ('2' The two processes trade in ' w idget s' . Process PI send s orders For widgets o ve r ('2 to P2' enclosin g payment at the rate of $ 10 per widget. Some time later. process P2 sends w idge ts a lo ng channel c 1 to P I' The
Figure 11.11 Two processes and their initial states
I$1000 I I(none) I account
widgets
~~ account
widgets
Figure 11 .12 The execution of the processes in Figure 11 .11 1. Global state
So
8
Cz
(empty)
C1
(empty)
4
2. Global state 51
8 C2 4
3. Global state 52
8 4
4. Global state 53
C1
(empty)
C2
(Order 10, $100) , M
c,
(five Widgets)
8 C2 4
(Order 10, $1 00), M
c,
(Order 10, $100) (empty)
·8 '8 ·8 '8 P2
P2
P2
P2
1M = marker message)
processes have the in iti a l sta tes s hown in Fi g ure 11 . 11 . Process p ,., has already rece ived an o rder fo r fi ve widge ts. whic h it will sho rtl y di spatch to p '" Fi gure I 1.12 s hows an exec uti o n o f th e system whil e the state is recorded. Process PI records its siale in the ac tual g lo ba l slate So > when PI's stat e is . Fo ll ow in g the marke r se ndin g ru le, process P I th en emit s a marker message over its o utgo ing c han ne l c 2 before it se nd s the nex t appl icati o n- leve l message: (Orde r 10 , $ 100) ove r channel ('2 ' The system enlers actual global sta le 51' Before P2 rece ives th e marker. it e mit s an a pp lication message (fi ve w id gets) over c i in response to PI 'S prev io us orde r. yie lding a new ac tual g lobal state S 2' Now process jJ I rece ives P2's message (five widgets), and P2 receives the marker. Following the marker rece iv in g ru le, P2 reco rds its state as and Ihal o f channe l ('2 as the empt y sequence, Followi ng the marker sending rul e, it sends a mark er message over (' I . Whe n process P I rece ives P2's marker message, it record s the state of channel C I as th e s ing le mess age (fi ve widge ts) that it rece ived after it first reco rded its slale. The fi nal actual g loba l stal e is S 3' The tin al record ed siale is 1',: ; 1'2: : C I: « fi ve wicigels»; ('2: < >. NOle Ihm Ihis slalC differs from all Ihe global slares Ihrough which the sys tem ac tu a ll y passed.
Termination of the snapshot algorithm 0 We assume Ihal a process Ihal has received a marke r message record s its s tate w ith in a finite time and se nds marker messages ove r each o ut go in g chan nel w ithin a fin ite time (eve n when it no longer need s to send app l ieal iOil messages ove r th ese channe ls). Ir there is a path o f cOIllmunicatio n channe ls and processes from a process Pi to a process Pj ( j *- i), then it is c lea r on II~ ese ass umpti ons th ai Pj will record its state a finite tim e afte r Pi recorded its stale. S mce we are assuming th e g ra ph of processes and channe ls to be strong ly connected , it foll ows
Figure 11 .13 Reachability between states in the snapshot algorithm actual execution eO, el '"
- ---+.
Sinit
recording beg ins
~ ... e'R_'
pre-snap: e'o,e'"
recording
ends Ssnap
Sfinal
•
~ post-snap:
e'R,e'R,""
thai all processes w ill have reco rded th eir stales le: ,,¥(S) = Fal se I end while output "defillitely $":
11.6.2 Evaluating possibly
To evaluate possibly $, the monit or process Illust trave rse the latti ce of reac hable states, starting from the in it ia l stat e (s~), .\. g... s~). The a lgorithm is show n in F igure 11 . 16. T he a lgo rithm ass umes that th e exec uti on is infinit e. It may easi ly be adapted fo r a fi nite execu tion . The monitor process may di scover Ihe sel of co nsistent Siaies in level L + I reac habl e from a give n consistent state in level L by the fo ll owing meth od. Let 5 = (sl' s2' ... , sN) be a consiste nt state. The n a consistent state in the nex t level reachab le from S is or lhe form S' = (Sl' .1'2 ' ... 5;, ... , sN)' wh ich differs from S onl y by containin g th e nex t state (after a single event ) of some process Pi' The monitor ca n fin d a ll sti ch states by travers ing the que ues of state messages Qi (i = 1, 2, . . ., N). The state S' is reachab le frolll S if and onl y if: for j = 1.2 .... ,N,j"';:
\I(s)U ] ~\I(s;)Ij ]
Thi s cond ition comes fro m cond ition CGS above and from the fact that 5 was already a consistent globa l state. A give n stat e may in ge neral be rei.l ched from severa l states at th e prev ious leve l. so th e moni to r process shoul d take care to eva luate th e consistency of each state on ly once.
Figure 11 .17
Eva luating definitefy ljJ
Level 0
F F
2 3 4
"
/
"
F/
F/
5
/
"
F= 1(5) = False); T = 1(5) = True)
T
"
?/
11 .6.3 Eval uating definitely To eva luate definitely $, the mo nitor process aga in tra ve rses the lattice of reachab le states a leve l at a lime. startin g from th e initi a l Slale (.II? s~, .. "' s~). The a lgorith m (show n in Fi g ure 11.16) aga in ass um es that the exec uti on is in finite but m ay eas il y be adapted for a finite exec uti o n. It ma intains the set Slates . whi ch contains those stat es at the c urrent leve l that may be reac hed on a lin eari zati on from the init ial state by traversin g o nl y stat es for whi ch 4> ev aluates to False. As lo ng as suc h a linearizati on exi sts, we may no t assert definirely $ : th e exec uti o n co ul d have taken thi s lineari zati on. and
11 15 j )[i l for hw. IS
where Sj
j = 1,2, .... N . j#i
the last slate that th e
. m OllllOr
.
process has rece ive
df
rom process P j'
11.6.4 Evaluating possibly
and definitely in synchronous systems
The algori thm s we have given so far work in an asy nchronous system: we have made no timin g ass umptions. But the price paid for this is that the monitor may examine a consistent g lobal state 5 ::: CSt> 05 2, .. " sN ) for whi ch any two loca l states si and s· occurred an arbi trarily long tim e apart in the ac tu al execu ti on of the system. Ou~ requirement, by cont rast, is to co nsider only those global states that the ac tual executi on coul d in princip le have trave rsed. In a synchronous system, suppose th at the processes keep the ir physical c locks internall y synchroni zed within a kn ow n bound, and that the observed processes provide physical timestamps as we ll as vector timestamps in the ir state messages. T hen the 1110nilOr process need consider onl y those consistent global stal es whose local states could possibl y have ex isted simultaneously, given the approximate synchroni zation of the c locks. With good eno ugh c lock synchroni zati on, these will number many less than all globally consistent slates . We now give an algo rithm to exploit synchroni zed c loc ks in thi s way. We ass ume that eac h obse rved process Pi (i = 1, 2, . .. , N) and the monitor process, which we sha ll call Po- keeps a ph ys ical clock C j (i = 0, I, ... , N ). These are synchroni zed to within a known bound D > 0; that is, at th e same real time:
IC/I)-CP)I < Dfori,j = 0, I , ... , N The obse rved processes send both their vec tor lime and physical time with their state messages to the monitor process. The monitor process now app lies a condition that not onl y tests for co nsisten cy of a global state S = (S I ' .\"2 ' .. . , sN)' but also tests wheth er each pair o f stal es could have happened at the same real time, given the physical cloc k val ues. In oth er words, for i, j = I, 2, ... , N: V(S jlliJ " V (sjlli J and Sj and Sj co uld have occu rred at the same real time. The first c lause is th e condition that we used earli e r. For the second cla use, note that Pi is in the state si from the time it first noti fies th e monitor process, Cj(s), to some later local time L i(si)' say, when the nex t stat e transition occurs at Pi ' For si and Sj to have obta ined aI the sa me real tim e we thu s have. allowing for th e bound on clock synchroni zati on: Cj(S;l- D $ C j (s j ) $ L j(sj) + D - or vice l'Crsa (swapping i alld j). The monitor process mu st calcu late a value for Li(s) , whi ch is meas ured against p,.' s clock. If the monitor process has received a state message for Pi's next state sf' the n Li(s,.) is Ci(sj). Otherwise. the moni tor process estimates Li(si) as Co - max + D , whe re Co is the monitor's c urren t local clock value, and max is th e maximum tran smiss ion time for a SLale message.
11 .7 Summary Th is chapter began by desc ri bin g the importance of accurate timekeeping for d istributed system s. It th en described algori thm s for sy nchroni zi ng c loc ks despite the drift between th em and th e variabilit y of message de lays be twee n computers. The degree of sy nchroni zat ion acc uracy that is prac ticall y obtain able fulfil s many requirements bu t is none theless not suffi cient lO dete rm ine th e ord ering of an arbitrary pair o f eve nt s occ urrin g at different computers. T he happened· before re lati on is a part ial orde r o n event s th at re flec ts a flow of info llna li o n betwee n them - w ith in a process, or v ia messages between processes. Some algo rithms require eve nt s to be orde red in happened-before ord er. for example s uccess ive upda tes m ade at separa te co pies of data. Lampo rt c loc ks are co unte rs th at are upd ated in accordance w ith the happened-before re lation ship between eve nt s. Vector clocks are an imp rovement on Lamport c loc ks. becau se it is poss ib le to determ in e by examining the ir vec tor timcstamps wheth er two eve nt s arc o rdered by happened-before o r are concurrent. We introdu ced the concepts o f eve nt s, loca l and g loba l hi sto ri es. cut s, loca l and g loba l states, runs, cons istent states, linea ri zation s (consistent run s), and reachab ili ty. A cons istent state or run is o ne that is in accord w ith the happened-before re latio n. We we nt o n to co ns ide r the probl em of recording a consistent global state by observ in g a sys tem 's exec utio n. Our objec tive was to eva lu ate a predicate on thi s state, An im pona nt class of predi cate s are the stable predicates. We descri bed the snapshot algorithm of Chand y and Lamport, wh ich capture s a cons istent g lobal state and all ows us to make assert ions about whethe r a stabl e pred icate holds in the actual exec ution . We went on to give Marz ullo and Ne iger's algorith m for de ri ving assertion s about whether a pred icate he ld or may ha ve held in the actual run . The algorithm employs a monitor process to co llec t states. Th e monitor exam in es vec tor timestamps to ex tract consistent g lobal states. and it constructs and exam ines the latt ice of a ll cons isten t g lobal states. This algorithm in vo lves grea t computati o nal compl ex ity but is va luable for understanding and can be of so me prac tica l bene fit in rea l systems where re lati ve ly few eve nts change the g lo ba l pred icate's va lue. The a lgo rithm has a mo re efficient va ri ant in sy nchronous syste ms, where c locks may be sy nchron ized .
EXERCISES 11.1
Why is computer clock synchroni za tion necessary? Desc ribe the des ig n requi reme nt s page 434 fo r a system to sy nchroni ze the clocks in a di stributed syste m.
11.2
A clock is readin g 10:27:54.0 (hr:min :sec) when it is di sco ve red to be 4 second s fast. Exp la in why it is undes irabl e to set it back to the ri ght time at that po int and show (numeri call y) how it sho ul d be adjus ted so as to be correc t aft er 8 second s has e lapsed. page
11. 3
438
A sc heme for implementin g at-most-once re liable message de li very uses synchroni zed cloc ks to reject dup li cate messages. Processes place the ir local clock va lue (a ' timestamp') in the messages they send . Eac h rece ive r keeps a tab le g ivin g, for eac h
sending process, the largest message timestamp it has seen. Assume thaL clocks are synchroni zed to within 100 ms, and that messages can arrive at mo st 50 I11 S after transmi ssion. (i)
When maya process ignore a message bearing a tim estamp T, if it has recorded the last message recei ved from that process as having timestam p T' ?
(ii)
When Jllay a receiv er remove a timestamp 175 ,000 (ms) from its table'? (Hint: lise th e receiver's local c lock va lue.)
(iii ) Should the clocks be internally sync hronized or externally synchronized? page 439 11.4
A client attempts to sy nchronize with a time server. It record s th e round-trip times and timesta mps returned by th e server in th e table below. Which of these times should it Lise to set its clock? To what time sho uld it set it? Estimate the accuracy ort he setring with respect to the server's clock. l fit is known that the time between sending and receiving a message in the system co ncerned is at least 8 J11 S, do you r answers change? Round-trip (ms)
Time (hr:min:sec)
22
/0'54:23.674
25
10:54:25.450
20
10.54:28.342 page 439
11.5
In the system of Exercise 11.4 it is required to sy nchroni ze a fil e server's clock to within page 439
± 1 milli second. Di scllss this in relation to Cristian's algorithm.
I 1.6
What reconfiguralions would yo u expect to occur in th e NTP synchroni zation subn et? page 442
11. 7
An NTP server B receives server A's message at 16:34: 23.480 bearing a timestamp 16:34 :13.430 and replies to it. A receives the message at 16:34:15.725 , bearing 8's timestamp 16: 34:25. 7. Estimate the offset between B and A and th e accuracy of the estimate. page 443
11. 8
Discuss th e factors to be taken into account when deciding to which NTP server a client should synchroni ze its c lock. page 444
I 1.9
Di sc Liss how it is possible to compensate for clock drift between sync hroni zat ion points by observing the drift rate over time. Di sc Liss any limitations to your method. page 445
I I . I ()
By considering a chain of zero or more messages connectin g eve nts e and e' and usin g page 446 induction, show that e --> e' => L(e) < L(e').
11.1 I
Show that
VPl e' => V( e) < V( e') .
page 447 page 448
11 . 13 Using the result of Exerci se 11.11 , show that if events e and e' are concurre11l the n ne ithe r V(e) '; Ve e') nor V(e')'; V(e) . I-Ience show that if Vee) < Vee') th en e -> e'. page 448 11.14 Two processes P and Q are co nnec ted in a ring lIsing two chann e ls. a nd th ey co nstant ly rotate a message m. At an yo ne tim e, th ere is o nl y o ne copy o f 1/1 in th e system. Eac h process's stat e consists o rth e numbe r of tim es it has received Ill, and P sends 111 first. At a cert ain point. P has the message and its state is 101 . Immed iate ly after se nd ing m , P initiates th e snapshot a lgorithm . Ex plai n th e operation of the a lgorithm in thi s case, givin g th e poss ibl e global state(s) reported by it. page 453
7-
0
P2
-
0
- -
0
-
- -0 -
Time ,
--
I 1.1:5 The fi g ure above shows event s occ urrin g fo r eac h of two processes , PI and P2' Arrows between processes deno te message trans miss ion. Draw and labe l th e latt ice of co nsistent states (PI state, p ] state), beginning \v ith th e initi al state (0,0). page 460 11 . 16 Jones is flmnin g a coll ec ti o n of processes PI'P] , " " P/I/' Eac h process Pi co nt ai ns a vari able vi' She w ishes to dete rmin e w heth er a ll th e va ria bles vI' \I ~, ... , " N we re eve r equ a l ill the co urse o rlh e exec ution. (i)
Jo nes proeesses run in a sy nc hronous sys tem. She uses a monito r process to determine wheth er th e va riabl es were eve r equal. Wh en should th e appli cati on processes commun icate w ith th e mo nito r process, and what sho uld the ir messages co nt ain?
(ii )
Ex pla in th e stateill ent possibiv (vI ~ 1'2 ~ whether thi s statement is tru e of her executi o n?
V
N
). How ca n Jones determ ine
pC/ge 461
COORDINATION AND AGREEMENT 12.1 12.2 12.3 12.4 12.5 12.6
Introduction Distributed mutual exclusion Elections Multicast communication Consensus and related problems Summary
In this chapter, we introduce some topics and algorithms related to the issue of how processes coordinate their actions and agree on shared values in distributed systems, despite failures. The chapter begins with algorithms to achieve mutual exclusion among a collection of processes, so as to coordinate their accesses to shared resources. It goes on to examine how an election can be implemented in a distributed system. That is, it describes how a group of processes can agree on a new coordinator of their activities after the previous coordinator has failed . The second half examines the related problems of multicast communication , consensus, byzantine agreement and interactive consistency. In multicast, the issue is how to agree on such matters as the order in which messages are to be delive red . Consensus and the other problems generalize from this: how can any collection of processes agree on some value, no matter what the domain of the values in question? We encounter a fundamental result in the theory of distributed systems: that under certain conditions - including surprisingly benign failure conditions - it is impossible to guarantee that processes wi ll reach consensus.
467
12.1 Introduction Thi s chaplc r introduces a collection of a lgori thm s whose goa ls vary but which share an aim that is fundamental in di stributed system s: for a set of processes to coordinate the ir act ions or to agree on one or more va lues. For example. in the case a complex piece of machinery sllc h as a spacesh ip. it is essential that the com pute rs controllin g it agree on such condit ions as whether the spaces hip' s mi ss ion is proceeding or has been abo n ed. Furthermore. the comp uters mus t coordi nate the ir actio ns co rrectly wi th re spect to shared resources (the spaces hi p' s se nso rs and actuators ). Th e compu ters mu st be able to do so even whe re there is no fixed master-slave relation sh ip between th e component s (which wou ld mak e coord ination particularl y simple). The reason for avoiding fi xed master-slave relationships is that we often requ ire our system s to keep working correctly eve n if fai lures occ ur, so we need to avoid single points o f failure. such as fi xed masters. An important di stin ct io n for us. as in Chapter II. w ill be whethe r the di stributed sys tem under study is asy nchrono us or sy nchronou s. In an asynchronou s sys tem we can make no timi ng ass umpti ons. Til a sy nchrono us system. we shall ass ume that the re are bo unds on the maximum message tran smission de lay. on the time to exec ute eac h step of a process. and on cloc k drift rates. The sy nchrono us assumpti ons al low LI S to lise timeollts to detect process crashes. Another importan t aim or th e chapter while disc uss ing algorit hms is to consider failures. and how todea l with them when des ig ning a lgor ithms. Section 2.3.2 iJ1lroduced a failure model. whi ch we shall use in thi s chap te r. Coping wi th fa ilures is a subtl e business, so we begin by conside ri ng some a lgorithms lhal to lerate no failure s and progress throu g h be ni gn failures until we consider how to to lerate arbitrary failures. We enco unter a fundamen ta l res ult in the theo ry or dis tri buted sys tems. Even un der s urpri sin gly benign failure cond itions, it is imposs ible 10 gua rantee in an asynchron ous system that a co ll ec tion of processes can agree 011 a shared val ue - for exa mple , for all of a spaceship' s control ling processes to ag ree 'miss ion proceed' or 'mi ss ion abort'. Secti on 12.2 examin es the problem of di strib uted mutual exc lusion. This is the ex tension to di stribut ed sys tems of the fam iliar problem of avoidi ng race condit ions in kernels and multi-threaded app li ca ti ons, Since mu ch of wha! occurs in d istributed system s is re source sharing, thi s is an important problem to so lve. Next. Sec tio n 12.3 introduces a related but more general iss ue of how to 'elect' one of a collection or processes to perform a spec ia l role. Fo r exa mple, in Chapter I j we saw how processes sy nchroni zed the ir clocks to a des ignated time se rver. If thi s server fai ls and several surviv ing servers ca n fulfil that rol e. then for the sake of consistency it is necessary to choose just o ne se rv er to tak e over. Multi cas t communication is th e subject of Section 12.4. As Sec tion 4.5.1 ex plained, mu lticast is a ve ry usel'u l commun ica ti on paradigm. w ith applications from locatin g resources to coo rdi nat ing the update s to repli cated data. Section 12.4 examines mu lti cast re liability and ordering semanti cs, and g ive s algorithms to achie ve the var iation s. Mul ticast de li ve ry is esse nti all y a problem of ag ree ment betwee n processes: the recipient s agree on which me ssages they w ill rece ive, and in whi ch order they will receive them. Secti on 12.5 d isc usses the problem of agreement more ge nerall y. primaril y in th e form s known as conse nsus and byzantine agree ment.
or
Figure 12.1
A network partition
The treatm e nt fo ll owed in thi s c hapte r in vo lves sta tin g th e ass umpti ons and th e goa ls to be met, a nd g iv in g an informa l acco unt of why rhe a lgorithms presen ted are co rrec t. T here is insuffi c ient space \0 prov ide R.~ + I in a hold-back queue (Figure 12.1 1) - such queues are often used to meet message delivery g uarantees. II req uests mi ss ing messages by sendin g negative acknow ledge ments - to the origina l sender or to a process q from whi ch it has received an ack nowledge me nt with RZ no less th an th e req uired sequence number. The hold-back qu eue is not strict ly necessary fo r re li abi lit y bu t it simplifies the protocol by e nabling LI S to use seq uence numbe rs to represen t sets of de li vered messages. It al so provides us with a guarantee of de li very o rder (see Section 11.4.3). The integrity propert y fo ll ows from th e det ection of dup li cates and the underlyin g propert ies of IP multi cast (which uses checks um s to ex punge corrupt ed messages). The va lid it y propert y ho lds because IP multicast has that property. For agreeme nt we require, fi rst. that a process can a lways detect mi ssin g messages. That in turn means that it w ill always receive a furth er message that enables it to detect the o miss ion. As thi s simplified protocol stands. we g uarantee detect ion of miss in g messages on ly in the case where co rrect processes multi cast messages indefinitely. Second, the agree ment property requires that th ere is always an ava ilable copy of any message needed by a process that did not rece ive it. We therefore ass ume that processes retain copies of th e messages they have de li ve red - inde finite ly. in thi s simp li fied protocol.
R;>.
R'\
Ne ither of the ass umpti o ns we made to ensure agreement is practi cal (see Exercise I 1. 14). However. ag ree ment is practi ca ll y ad dresse d in th e protoco ls from which ours is deri ved: the Psync protocol IPeterson el al. 1989], T rans protoco l [Me lli ar-Smith el al. 1990] and scalab le re li able mult icast protocol [Floyd el al. 1997]. Psy nc and Trans a lso prov ide fU11he r de li ve ry ordering g uarantees. Un iform properties 0 The de finiti o n of agreement g iven above re fers onl y to the behavio ur of correct processes - processes th at never fa il. Consider what would happen in the algo rith m of Fig ure 12. 10 if a process is not correct and crashed after it had R(Il livered a message. Since any process that R-deli\'ers the message must first BlJIulficast it. it fo ll ows th at all correc t processes will still e ventuall y de li ver the message. Any prope rt y thm ho ld s wheth er o r not processes are correct is called a uniform propert y. We de fine uniform agreement as fo ll ows: J
Uniform agreemen t: If a process, whether it is co rrect or fails, de li vers message m, the n all correc t processes in group(lII ) will eventu all y de li ver 111. Uni for m agreement a ll ows a process to crash a fter it has de li ve red a message, while still ensuring that all co rrec t processes will de li ve r th e mess age. We have arg ued th at the a lgorithm of Fig ure 12. 10 sati sfies thi s propert y, which is stronger than the no n-uni fo rm ag ree ment propert y de fined above . Uni fo rm ag reement is useful in applicat io ns where a process may take an acti on th at prod uces an o bse rvable inconsistency befo re it crashes. Fo r exa mpl e, consider th at the processes arc serve rs that manage cop ies of a ba nk acco unt, and th at upd ates to the acco unt are sent using re li able multicast to the group of se rvers. If the multi cast does not sati sfy un ifo nn ag reeme nt. the n a c li e nt that accesses a server just before it crashes may observe an update that no o the r se rver will process. It is interesting to no te th at if we reverse the lines ' R-deliver nt ' and ' tf( q:;:. p ) tliell B-nllliticast(g, 11/ ); end if in Fig ure 12. 10. (hen the result ant algo ri thm does not sati sfy unifo rm ag ree ment. Just as th ere is a uni fo rm versio n of agreement , there are also unifo rm versions of any multi cast propert y, incl udi ng vali d ity and integ rity and the ordering properties that we are abo ut to de fine.
12.4 .3 Ordered multicast T he basic multi cast algorith m of Sectio n 12.4. 1 de li vers messages to processes in an arb itrary orde r, d ue to arbitrary de lays in the underlying one-to-one send operatio ns. T his lac k of an o rderi ng g uarantee is no t satisfactory fo r lTIany applications. For exa m ple, in it nuc lear power plant it Ill ay be impo rtant that events signifying threats to safety conditi o ns and e vent s sig ni fy ing acti ons by co ntro l unit s are observed in the same orde r by all processes in th e system. Th e comm o n ordering req uirement s are tota l ordering, cau sal orderin g. FI.FO o rdering and th e hybri ds total-cau sa l and to tal-FI FO. To simpl ify o ur d iscussion, we de fine these orde rin gs under the assumpti on that any process be longs to at most one g roup. We sha ll later d isc Li ss th e implicati ons of a ll o win g g roups to o verl ap.
FIFO ordering: If a correct process issues mu/ticasr(g, 11/) and th en multicast(g, m' ), the n eve ry co rrect proce ss that de li vers m' w ill de live r III before m ' .
Figure 12.12 Total , FIFO and causal ordering of multicast messages
F,
F3
F2
j
Time
C, ~
C3
Notice the consistent ordering of totally ordered messages T, and h the FIFO-related messages F, and F2 and the causally related messages C, and C3 - and the otherwise arbitrary delivery ordering of messages
Causalorderillg: If I1Il1lticasr(;: . m) ~ IIIl1/ficas/(g, m'). where ---7 is the happened-be fore relation induced only by messages sent between th e members of g. then any correct process thaI delivers m' wi ll deli ver m before m' . Toralordering: I f a correct process delivers message
11/
before it delivers m', then
any olher correct process that deli vers 11/' w ill deliver m before m',
Causal ordering implies FlFO orderi ng, since any IwO l11ulticasts by the same process are related by happened-before . Note that FIFO ordering and ca usal orderin g are on ly partial orderings: not all messages are sent by the same process, in general ; simi larl y. some mu lt icasls are concurrent (not ordered by happened-before), Figure 12. 12 illustrates the orderin gs for th e case of three processes . Close inspection of the figure shows th at th e totall y ordered messages are deli vered in the opposite order 10 the ph ys ical lime at which they we re sent. In fact, the definition of total
Figure 12.13 Display from bulletin board program Bulletin board: os.interesting Item
From
Subject
23 24
A.Hanl on
Mach
G.Joseph
Mi c rokerne ls
25
A. Hanlon
Re: Mi croke rn els
26
T.L'H e ureu x
RPC perform ance
27
M .W alker
Re: Mach
end
o rdering all ows message deli very to be o rdered arbitraril y. as lo ng as the o rder is the same at diffe rent processes. Since total ordering is no t necessa ril y also a FIFO o r cau sa l ordering, we defin e the hybrid o f FIFO-lOwl order in g as one fo r whi ch message de li very obeys bo th FIFO and lo ta l ordering: s imilarl y. under ('a llsa/-lOla/ ordering message de li very obeys both ca usa l and total ordering. The definiti o ns o f o rdered multicast do not ass ume or impl y re liability. For exampl e, the read er sho uld chec k Ihal . und er lotal orderi ng. if correc t process p de li ve rs message m and th en del ivers m'. then a correc t process q can de li ve r m wi thollt al so de li veri ng 111' or an y oth er message ordered after IJI. We can also form hybrids of ord ered and re liable protoco ls. A re li ab le totall y orde red multi cast is o fte n referred to in the lit erature as;m aramie multicast. S imilarl y. we may fo rm re li ab le FIFO multicas t. re li able cau sal multi cast and re liable ve rs ions of the hybrid o rdered multi casts. Ordering the de live ry of mu lti cas t messages . as we shall sec, can be expe ns ive in term s of deli very lat ency and bandwid th consumpti o n. Th e orde rin g se manti cs that we have desc ri bed ma y de lay the deli ve ry of messages unn ecessaril y. That is. at the appli cation leve l, a message may be de layed fo r ano ther message that it does not in fact depend upon. For thi s reason. some ha ve proposed multi cast systems thai use the applica tion- spec ific message semanti cs alone to de termine the order o f message delivery [Cheriton and Skeen 1993. Pedone and Schi per 19991. The example of the bulletin board 0 To make multicas t deli very semantics more concrete. cons ide r an appli cation in which use rs post messages to bulletin boards. Each use r runs a bulletin-board application process. Eve ry topi c o f di scuss ion has its own process group . When a use r posts a message to a bullet in board ) the application multi casts the user 's postin g to the corres ponding g roup. Each use r' s process is a member of th e group for the top ic in whi ch he or she is inte res ted. so that th e user w ill receive just the postings conce rnin g thai topic. Re liable multi cast is req uired if every user is 10 receive every posting eve ntuall y. The users al so have orderin g req uireme nt s. Fi g ure 12. 13 sho ws the pos tings as th ey appear to a partic ular use r. At a minimum, FIFO orde rin g is des irable, s ince then eve ry posting from a gi ve n use r - ' A. Han lon ', say - wi ll be rece ived in the same order. and users can ta lk consistent ly about A. Hanl o n·s second postin g.
L
Note that the mess;]ge whose subj ec ts are 'Re: Mi croke rn e ls' (25) and 'Re: Mach' (27) appem afle r the messages to whi c h they re fer. A cau sall y orde red multi cast is needed to guarant ee thi s relat ionship. Ot herwi se. arbi trary message de lays could mean that. say. a message 'Re: Mach' co uld appear before the ori gina l message about Mach. If the muhicast de li very was totall y ordered. the n the numbe ring in th e left·hand column would be consistent be twee n use rs. Use rs cou ld refer unambiguo usly. for example . to 'message 24'. In practi ce . the USENET bulletin board system imple ment s ne ither causal nor total orderin g. The comm un icmion costs o r achi evin g these orderings on a large scal e outwe ighs the ir advantages. Implementing FIFO ordering 0 FIFO-ordered Illulticast (with opera ti on s FO -IIIII /timst and FO ·deli l·er) is achieved with sequence numbers. mu ch as we would achie ve it 1'01' on e~ t o~o n e communica tion. We shall conside r only nOJ1~o ve rlappin g groups. The read er should ve rify th at the reliable multi cast protocol that we de fined on top of IP mult icast in Secti on 12.4.2 a lso guarantees FIFO ord ering, but we shall sho w how to construct a FIFO~orde red multi cast on top of any gi ven basic multi cast. We use the va riabl es S~ and R (~ he ld at process p fro m the re li ab le multi cast protocol o f Sec tion 12.4.2: S ~: is a co unt S q ~ ot ho w man y mess;]ges p has se nt to g and . for eac h q. R I! is the sequc nce number of the latest message p has de live red from process q that was se nt 10 group g. For p to FO~ /}l/I lfic{lSf ..1 message to group g , it piggy backs the value S~ onto the message . B ~lIl11/f;CClSfS the message to g and th en increment s S~ by I. Upo n receipt of a message from q bearing the sequ ence num be r S. p chec ks wliethe r S = R~ + I. If so, thi s message is th e nex t one ex pected from the sende r q and p FO -delh'ers it, setting R ~ :=S. If S > R ~ + I. it places the message in the hold ~bac k que ue until the intervenin g messages ha ve bee n de livered and S = R~ + I. Since all messages from a give n se nder are de li ve red in the same seque nce. and since a message 's de li very is de layed until its sequence number has been reached. th e condition for FIFO ordering is clearl y satisfi ed. But th is is so on ly under th e ass umpt ion that groups are n o n ~ o ve rl appin g. No te that we can use an y impl ementation of B-l1Iu lr;cCls( in this protocol. Moreover. if we lise a rc liable R-lIlI1ltic{l sr primitive instead of B-l1Il1lricasr, then we obtain a re li able F1FO mu lt icast. Implementing total ordering 0 The ba sic approach to implementing total ordering is to ass ign totall y ordered ident ifiers to mult icast messages so that each process makes the sam e orderin g decision based upon th ese identifi ers. The de li very al gorithm is very sim ila r to th e one we described for FIFO ordering; the diffe rence is Ih at processes keep group ~s pec ifi c sequence num bers rath er than process~s pec ifi c sequence numbers. We onl y consider how to tota ll y order messages se nt to non·ove rlapping groups. We ca lilhe multi cast operat ions TO ~ I1II/'{icaSf and TO -dc/i re/'. We di scllss two main methods for assigning ide ntifi ers to messages. The first of these is for a process ca lled a seqll el/ ce/' to ass ign the m (Figure 12. 14). A proce ss wishin g to TO -IIlII/ricasf a message m 10 group g attaches a unique identifi e r id(m ) to it. The messages for g are sent to th e sequ encer for g . seq llellcer (g ). as we ll as to the membe rs of g. (The sequencer may be chose n to be a member of g.) The process seq llellcer(g) maintains a g roup~spec ifi c sequence number .'II! ' whi ch it uses to ass ign inc re as in g an d con sec uti ve seque nce numbe rs to th e messages that it B-delil'c rs. It
d
Figure 12.14 Total ordering using a sequencer 1. A lgorithm for group member fJ 011 illiliali:arion: r" := 0:
.' T o TO -I1I II /ticasllllcssage /J/1O groujJ.Ii'
B-flllllticasf(g u {sequcllca(g )} . write time stamp of the committed objecl.
Timestamp ordering write rule: By combining rules I and 2 we have the following rule for decidin g whether to accept a wrire ope ration requested by transaction Tc on object D: ~
maximum read timestamp on D & & Tc > write tim es tamp o n committed vers ion
if (1'('
or D)
perform wriTC operation on tentati ve vers ion of D with write times tamp Tc e lse /* write is too late */ Abort tran sacti on T( If a tentative ve rs ion with writ e tim es lamp Tc already e xists, the wriTe operation is addre ssed to it. o therw ise a ne w tentati ve ve rs ion is creat ed and given wrile timestamp T". Note that any I1'ri ff! that ' arrives too late ' is aboned - it is too late in the sense thai a tran saction with a late r tim es tamp has already read or written the object. Figure 13.30 illustrates the action of a II'rife operation by tran saction T3 in cases where T3 ~ ma ximum rcad timestamp on the object (the read tim cstamps are not shown ). In cases (a) to (e) T3 > write timestamp on the conlIlljtted vers ion of the object and a te ntati ve ve rsion w ith write time stamp T3 is insened at the appropriate place in the list of tentati ve vers ions ordered by their tran saction tim estamps. In case (d ), T3 < writ e timestamp on th e committed vers ion of the object and the transaction is abort ed.
Timestamp ordering read rule: By using rul e 3 we ha ve the following rul e for dec iding whethe r to accept immediat ely. to wait or to reject a read operation requested by tran sacti o n T(, on objec t D: if ( T(' > write time stamp on committed ve rs ion of D ) I let Dsc tccled be th e version of D with the maximum write timestamp ::; Tc if (D .-.clected is committed) pe rfonn read operation on the vers io n D selected e lse WaiT until the tran sacti o n that made ve rs ion D sc lccled commits or aborts th en reappl y th e read rul e I else Abort tran saction Tc
Figure 13.30 Write operations and timestamps
(b) T3 write
(a) T3 write
Before
T2
After
T2
T3
..
Before
T1
T2
After
T1
T2
Time
T3
..
Time
(d) T3 write
(e) T3 write
Before
T1
T4
After
T1
T3
T4
..
Time
Key: Committed
Tentative
Before
T4
After
T4
Transaction aborts
..
Time
object produced by transaction ~ (with write timestamp n) T1 < T2< T3< T4
Note: If tran saction Tc has already written it s own vers ion of til e object, thi s w ill be used. A read operation that a rri ves too earl y wait s for the earli e r transacti on to comple te. If the earli e r transaction com mit s, th e n Tc wi ll read from it s committed version . If it aborts, then Tc wi ll repeal the read rul e (and selec t the previous ve rs ion). This rule prevents dirty reads. A read ope ration that 'arrives too late ' is abol1ed - it is 100 la le in the se nse th a t a transaction with a late r timesta mp has already writte n th e objec t. Figure 13.3 1 illustra tes th e timestamp ordering read rule. It includes four cases labelled (a) to (d), each of whi c h illustrates th e ac tion of a read o perat ion by tra nsacti on T3. In each case, a version w hose write timestam p is less than or eq ual 10 T3 is selected. If s uc h a version ex ists , it is indi cated w ith a line . In cases (a) a nd (b) the read operation is directed to a committed vers io n - in (a) it is the onl y vers ion. w he reas in (b) the re is a ten tati ve ve rs ion be lo ng in g lO a late r tran sac tion. In case (c) the read operation is directed lO a tentative ve rs ion and mu st wa it until the transaction that mad e it commit s or a borts. In case (d) the re is no suitable version to read and tran sacti on T3 is abort ed.
Figure 13.31
Read operations and tim estamps (b) TI read
(a) T3 read
read proceeds
IT2
T2
1
Selected _ _ _ _ _ _-... Time (c) T3 read
I
T4
read proceeds
Selected _ _ _ _ _ _ _ _ ~ Time
~
(eI) TJ read
read waits
Tran sacti on aborts
1
__-,S:..ce-,Ie.cct-=-e~ d _ _ __ ~ Time
Key: Committed
Tentative
___ _ _ _ _
~
Time
object produced by transaction T; (with write timestamp T;) T, < T2< T3< T4
When a coord inator receives a request to commit a tran saction, it will a lways be ab le to do so because all the operati ons of transac tions are checked for consistency wi th th ose of earli er tran sac ti ons before being carri ed out. The committed versions of each
object mu st be created in tim estamp order. There fore, a coo rdinator sometimes needs to wa it for earlier tran sactions to compl ete before writing a ll the committed vers ions oflhe objects accessed by a parti cular transac ti on, but the re is no need for the clie nt to wa it. Tn orde r to make a transacti on recoverabl e aft.e r a se rver crash, th e tentative versions of objects and the fact th at the tran sacti on has committed mu st be written to permanent storage before acknow ledging the clie nt 's request to commit the tran sacti on. Note th at thi s timestamp ordering algorithm is a stri ct one - it ensures strict exec uti ons of tran sacti ons (see Section 13.2). The timestamp ordering read ru le de lays a tran sac ti on's read operation on any object until a ll tran sacti ons th at had prev iously writte n th at object have comm itted or aborted. The arrangeme nt to commit ve rsions in order ensures that the exec uti on of a tran sacti on ' s write operaLion on any object is de layed until all tran sactions that had previously written that objec t have committed or abort ed. In Figure 13.32, we return to our illustration concerning the two concurrent banking lransaclions T and U inlrod uced in Figure 13.7. The columns headed A, Band C refer to in fornlat ion about accounts with those nam es. Each acco unt has an entry RTS that records the max imum read Limes tamps and an e ntry WTS that records the write timestamp of each vers ion - with tim estamps of com mitted versions in bold. lnitia ll y, a ll accoun ts have committed versions wrillen by tran saction S, and the set of read
Figure 13.32 Tirnestarnps in tran sactions Tand U Timestamps and versions of objects
U
T
B
A
C
RTS
WTS
RTS
WTS
RTS
WTS
II
S
II
S
II
S
openTl'ollsacrion
IT!
bal = b.gefBalal/ ce() opel1Trallsacfion
S,T
b.sefB alallce( bal"' I .1) bal = b.gefBalall ce(J
wair for T
a.wir//(/ra w( ball !0)
S ,T T
COif/mil
bal = b.geIBalallce(J b.s to the server of the objec t at which tran saction U is blocked. If U is sharing a lock , probes are sent to all the holders of the lock. Sometimes furth er transactions may start sharing the loc k later on, in which case probes can be sent to them 100. D elee/i on: De tection consists of receiving probes and deciding whether deadlock has occ urred and whether to for ward the probes . For example, when a server of an object receives a probe < T -) U > (indicatin g that T is wa iting for a transaction U that holds a local object), it checks to see whether U is al so wa iting. If it is. the transaction it waits for (for example, V) is added to the probe (making it < T --> U --> V » . and if the new transaction (\I) is waiting for another object elsewhere, the probe is forwa rd ed.
In thi s way, paths through the global wa it-for graph are built one edge at a tim e. Before forwarding a probe, the server check s to see whether the tran saction (for it has just added has ca used the probe to contain a cycle (for example, example, < T --> U --> If --> T ». If thi s is the case, it has found a cycle in th e graph and deadl ock has been detected.
n
Resolfllioll: When a cycle is detected, a tran saction in the cycle is aborted to break
the deadlock. In Olll' exampl e, the following steps describe how deadlock detection is initiated and the probes that are forwarded durin g the corresponding detec tion phase.
L..
Server X initiates detection by sendin g probe < W
~
U > 10 the se rver of B (Server
Y).
Server Y receives probe < W ~ U >, notes thai B is held by V and appends V to the probe to produce < W ~ U ~ V>. Ii notes that V is wa iting for C at server Z. This probe is forwarded to server Z. Server Z receives probe < W --> U --> V> and notes C is held by Wand appends tv to the probe to produce < tv --> U --> V --> tv >. This path contains a cycle. The server detects a deadlock. One of the transactions in the cycle must be aborted to break the deadlock. The transaction to be aborted can be chosen according LO transaction priorities, whi ch are described shortl y. Figure 14.15 shows the progress of the probe messages from the initiation by the server of A La the deadlock detection by the server of C. Probes are shown as heavy arrows. objects as circles and transaction coordinators as rectangl es. Each probe is shown as go ing directl y from one object to another. In reality, before a se rver transmits a probe to another server, it consults the coordinaror of the last transaction in the path to find out whether the latter is wa iting for another object elsewhere. For example, before the serve r of B transmits the probe it consults the coordinator of V to find out that V is waiting for C. In most of the edge-chasing algorithms, the servers of objects send probes to transaction coordinators, whi ch then forward them (if the transaction is waiting) to the se rver of the object the transaction is waiting for. In our exam ple, the server of B transmits the probe to the coordinator of V, which then forward s it to th e server of C. This shows th at when a probe is forwarded , two messages are required. The above algorithm should find any deadlock that occurs, provided that waiting transactions do not abort and there are no failures such as lost messages or servers crashing. To understand this, co nsider a deadlock cycle in which the last transaction, W. starts wa iting and completes the cycle. When W starts waiting for an object, the server initiates a probe that goes to the server of the object held by eac h transaction that W is waiting for. The recipients ex tend and forward the probes to the servers of objects req uested by all wa iling transactions they find. Thus every transaction that W waits for directly or indirectly will be added to th e probe unl ess a deadlock is detected. When th ere is a deadlock, W is wa iting for itsel f indirectl y. Therefore, the probe will return to th e object that tv holds. It mi ght appear that large numbers of messages are sent in order to detect deadlock. In th e above example. we see two probe messages to detect a cycle involving three transactions. Each of the probe messages is in gene ral two messages (from object to coordinator and then from coord inator to object). A probe that detects a cycle involving N transactions will be forwarded by (N - I) transaction coordinators via (N - I) servers of objects, requiring 2(N - I) messages. Fortunately, the majorit y of deadlocks in volve cyc les containing only two transactions, and there is no need for undue concern about the number of messages involved. This observation has been made from st udies of databases. It can also be argued by considering the probability of eo nlli ct ing access to objects. See Bernstein et al. [1987]. Transaction priorities 0 In the above algorithm , every transaction involved in a deadlock cycle can cause deadlock detection to be initiated. The effect of seve ral transactions in a
Figure 14.16 Two probes initiated (a) initial situation
(b) detection initiated at object (c) detection initiated at object requested by W requested by T Waits fo~ T /""
Waitsr T
V
~
( U
wJwaits for
V
U
~'.~
j/
L- W
cycle initiatin g deadlock det ec ti o n is that detec tio n may happen at several diffe rent servers in the cycle with the result that more than one tran sac ti on in th e cyc le is aborted. In Figure 14.16(a), co nside r tran saction s T , U, V and W , where U is wa iling for W and V is waiting for T. At about the same time. T req uests th e object he ld by U and W requ ests the object held by V. Two separate probes < T --7 U > and < W --> II > are initiated by the servers of these objec ts and are circul ated until deadlock is de tec ted by
each of two different servers. See in Fi gure 14.16(b), where the cycle is < T --> U --7 W --> II --7 T >. and (c), whe re the cyc le is < W --7 II --7 T --7 U --7 W >. In order to e nsure thai o nl y one tran saction in a cycle is aborted, transaction s are g iven priorities in such a way th at all tran sact io ns are to tall y ordered. Timestamps for example, may be used as priorities. When a dead lock cycle is fo und , the transaction with th e lowest priorit y is aborted. Even if several d ifferent servers detect the same cyc le, they wi ll all reach the same dec ision as to which tran sac tion is to be aborted. We writ e T > U to indicate that T has hi gher priorit y than U. [n th e above exa mpl e , ass ume T > U> V> W. The n the transaction W will be abort ed when either of the cycles < T --7 U --> W --> II --> T > or < W --> II --7 T --> U --> W > is de tec ted. It might appear that tran saction priorit ies could al so be used to red uce the number of situations th at cali se dead lock de tec ti on to be in itiated , by using the rule that detection is in it iated only when a hi g her-priority transac tion start s to wa it for a lower-priority one. In our example in Fig ure 14 . 16, as T > U the initialin g probe < T ~ U > wou ld be sent. but as W < V the initiatin g probe < W --> \I> wou ld not be sent. If we assume th at wh en a tran saction starts waiting for anot he r tran sact ion it is equall y lik ely that th e wail ing tran saction has hi gher or lower prior ity than the waited -for tran saction, th en the use of thi s rule is likely to redu ce the number of probe messages by about hal f. Transaction priorities could a lso be used to reduce the number of probes th at are forwarded. The general idea is that probes sho uld lravel ' dow nhill ' - that is, from tran sactions w ith higher priorities (Q tran saction s w ith lower priorities. To do thi s, servers use the rule that they do not for ward any probe to a ho lder that has hi gher prio rity th an th e initiator. The arg ument for doing thi s is that if the hol der is wa iting fo r ano th er
Figure 14.17 Probes travel downhill (a) Vstores probe when Ustarts wai ting
w
(b) Probe is forwarded when Vstarts waiting
w U--"
V
Walts ( ;/A'" V--" W for C
\
j/
probe queue
U
~ v ~altsfor
probe queue
U--" V
8
transaction then it mu st have initiated detecti on by sending a probe when it started
wait ing. However. there is a pitfall associated w ith th ese apparent improvement s. In our example in Figure 14. 15 transacti ons U. \I and Ware exec uted in an orde r in whi ch U is wai ting for Vand V is wait ing for W when W start s waiting for U. Without priority rules, detection is initiated when W starts wait in g by sendin g a probe < W ---7 U>. Under the priority rul e. thi s probe will not be se nt , because W < U and deadlock w ill not be detected. The problem is that the orde r in which tran saction s start wait in g can determine whether or not dead lock w ill be det ec ted. The above pitfa ll can be avoided by using a sc he me in whi ch coord inators save copies of all th e probes received on behalf of each tran saction in a probe queue. When a transaction starts wait ing for an object, it forward s the probes in its que ue to the server of the object, which propagates the probes on downhill routes. In our exampl e in Figure 14.15 , when U start s wait in g for V, the coord inato r of V w ill save the probe < U ---7 V >. See Figure 14. 17(a). Then when V start s wa it ing for W, the coordinato r of W wi ll store < V ---7 W > and V wi ll forward its probe que ue < U -7 V > to W. See Figure 14. 17(b), in which W' s probe queue has < U -7 V> and < V ---7 W>. When W starts wait in g ror A it w ill forward its probe queue < U ---7 V ---7 W > to th e server o r A. whi ch also notes the new dependency W ---7 U and combines it with the informatio n in the probe received to determine that U -7 V -7 W -7 U. Deadlock is detected . When an algorithm requ ires probes to be stored in probe que ues, it al so req uires arran ge ment s to pass on probes to new holders and to d iscard probes that refe r to tran sactions that have been com mitted o r aborted. If re levant probes are di scarded , undetected deadlocks may occur, and if o utd ated probes arc retained , fal se deadlock s may be de tected. This add s much to the complexi ty of any edge-chasin g algorithm. Readers who arc interested in the detail s of such a lgorithms should see Sin ha and Natarajan 1985) and Choudhary ef al. [1989). who present a lgorit hm s for use w ith excl usive locks. But they wi ll see that Choudhary ef al. sho wed that Sinha and Natarajan' s algori thm is incorrect and fail s to detect a ll deadlock s and may eve n report fal se deadlocks. Kshemkal yani and Singha l 199 11 corrected th e algorithm of C houd hary ef al. (w hi ch fail s to detect all deadlock s and may report fal se deadlock s) and
r
r
provide a proof o f co rrectness for the correc ted algorith m. In a subsequent paper, Kshe mkal yani and Singhal r19941 argue th at di stributed deadloc ks are not very we ll unde rstood becau se there is no g lobal state or tim e in a di stri buted system. In fact, any cycle that has bee n collec ted may cont ain secti ons recorded at diffe rent times. In addit ion, sites may hear about dead loc ks but may not hear that they have been resolved until after random de lays. The paper desc ribes di stributed deadl oc ks in terllls of the conte nt s of di stri buted memory, usin g cau sal re lation ships bet ween event s at different sites.
14.6 Transaction recovery The atomi c propert y of tran sacti ons requi res th at the e ffects o f all committed tran sacti ons and non e of the e ffects of incomplete or aborted transactions are renected in th e objects th ey accessed. This prope rl y can be desc ri bed in term s of two aspects: durabi lit y and fa ilure atomi c ity. Durability req ui res that objects are sav ed in perman ent storage and will be avai lable inde finite ly thereafter. T he re fore, an ac knowledgment of a c lient 's comm it reques t implies th ai all the e ffec ts of the tra nsacti on hav e been recorded in perman ent storage as we ll as in the server's (vo lati le) obje.c ts. Fai lure atom icit y requires th at e ffects of tran sacti ons are atomi c eve n when the se rver c rashes. Recove ry is concerned wit h e nsurin g that a se rver's objects are durabl e and that the service prov ides failure atomic ity. Although fil e se rve rs and database servers maintain data in perman ent storage, ot he r kinds of se rvers of recoverable objec ts need not do so except for recovery purposes. In thi s chapter, we ass ume that when a serve r is running il kee ps all of its objects in its vo latile me mory and reco rds its committed objects in a recovery./lle or fi les. The re fore, recove ry ·consists of restori ng the se rver with Ihe lates t com mitted ve rsions o f its objects from permanent storage. Databases need to dea l wit h large volumes of data. They ge nerall y hold the objec ts in stable storage on di sk wit h a cache in volat ile memory. The two requirement s for durab il it y and for failure at omic it y are not reall y ind ependent of one another and can be dea lt with by a single mec han ism - th e recovelY manager. The task of a recovery manage r is: to save objec ts in permanent storage (in a recove ry fil e) for committed tra nsacti ons; to res tore the server's objec ts aft er a cras h;
to reorgan ize the recove ry fi le to improve (he pcrfonnance of recovery; to reclaim sto rage space (in the recovery fi le). In some cases, we require the recovery manage r to be res ilient to med ia failures fa ilures of it s recovery file so that some of Ihe daw on the di sk is lost, e ithe r by being corrupted durin g a crash, by random decay or by a penn ane nt failure. In such cases, we need anot he r copy of the recovery fil e. T hi s can be in stable storage, whi ch is im plement ed so as to be ve ry unli ke ly to fai l by using mirrored d isks or copies al a differe nt location .
Intentions list 0 Any se rver thai prov ides tran sacti o ns needs to keep track of the objec ts accessed by c lie nts' transactio ns. Reca ll fro m C hap ter 13 th at whe n a clie nt ope ns a transact io n. th e se rver first conlacted provides a new tra nsaction idc llI ifier and returns it to the cl ient. Each subsequent cl ient req ues t w it hin a tran sactio n li p to and in clu d ing the commit o r ahort request in c ludes the tran saction ident ifie r as an arg ume nt. During the progress of a tra nsaction. the upd ate operations are app lied to a private set of ten tative versions of the objec ts be longing to the tran saction. At eac h serve r. an illfellfiolls list is reco rded fo r all o f its current ly acti ve transac ti ons - an in tent io ns 1is! of a pa rt iClI la r tran sac tion cont ai ns a 1is! of the references and the va lu es of all the objects that are alte red by th at tran saction. When a tran sac tion is commi tted, tha tlransac tion' s intenti o ns li st is used to idelll ify th e o bjects it affected. The com m ilted version of each objec t is replaced by th e tentat ive version made by th at transacti on, and the new va lue is wri tten to the se rve r' s recovery fil e. Wh en a tran saction abort s. the se rve r uses the intention s list to de lete all the tcnt ative ve rsions of objects made by th at transacti o n. Reca ll al so that a d istribu ted transac ti olllll ust carry ou t 3n ato mi c comm it pro tocol be fore it can be cOlll lll iLied o r abo rt ed. O ur di sc llss io n of recovery is based o n the twophase comm it protocol, in whic h all th e pa rt ic ipan ts invo lved in a tran sact io n first say wheth er thcy are prepa red to co mm it and the n. later o n if a ll the participants agree. they a ll carry ou t the actua l cOlllm it actio ns, If the pa rt ic ipants cannot ag ree to co mm it. they mll st abort the transaction. At the point when a pa rti c ipant says it is prepared to comm it a tran saction. its recovery manager m ust have saved both its intent ions list for that transac tion and the objects in that inten ti ons list in its recovery f ile. so th at it w ill be ab le to ca rry o ut th e cOlll m itment late r on. even if it crashes in the interim. When a ll the pa n ic ipa ms invo lved in a tran saction agree to com lllit it. the coord inator in fOlms th e c lien t and then se nds mess ag es to the parti cipa nt s to commit their pa rt of the Iransacti 0 n. O nce the cli ent has bee n info rmed th at a tr Po
---> P I B ---> P2 C ---> PO" A
B ---> PO'
C ---> PO"
\Iersioll store
Po
Po'
Po"
P,
100
200
300
1180
P3
P.J
278
242
CII('ckpoillf
14.6.2 Shadow versions Th e logg ing techniqu e record s transacti on statu s entri es, in tenti o ns lists and objec ts all in the Silme fil e - the log. The sliad()ll' \'ers;oJ/s techniqu e is an alternative way to organize J recove ry fil e. It uses a map to locate ve rsions of th e se rve r' s objects in a fit e call ed a l'usioll stor e. The map assoc iates the identifi ers of the se rve r' s o bjec ts with th e positions of their current versions in the ve rsion siore. Th e version s wrillc n by each transacti on are shadows o f the prev iolls COllll11illed ve rsions. The transaction status e ntri es and intentions lists are dealt with se parately. S hadow ve rsions are desc ribed first. Wh en a tran saction is prepared to cOl1lmit. any of th e objects chan ged by the tran sacti on arc appended to the version store . leav in g the co rresponding committed ve rsions un changed . These new as ye t te ntati ve ve rsions are ca ll ed shadoll' ve rsions. Whe n a transaction commit s. a new map is made by copying th e o ld map and entering the pos itions of the shad ow ve rsions. To complete th e commit process. the new map replaces the o ld map. To restore th e objec ts when a serve r is replaced afle r a crash, its recove ry manage r read s the map and uses the information in Ih e Illap to locate the objects in the version store. This techniq ue is illuswned with th e same exa mpl e invo lvin g transaction s T and U in Fig ure 14.20. The first column in the table shows the map before tran sactions Tand U. when th e balances of the acco llnt s A. Band Ca re $ 100. $200 and $300. respec tive ly. Th e seco nd column s how s th e map aft er transaction T ha s commi ll ed. The version store contains a check poi nt. fo ll owed by th e versions of A and B al P I and P2 made by tran saction T. It al so co ntain s the shadow ve rsions of Band C made by transaction U. at P3 and P4. The map mu st alway s be wri tten to a we ll -known place (for example. at the start of the vers ion store or a separate fil e) so [hat it ca n be found when the system needs to be recove red. The sw it ch from th e o ld map 10 the ne w map mu st be pe rformed in a single atomic step. To achi eve thi s it is essential thai stabl e storage is used fo r the map - so that there is g uarant eed to be a valid map even when a fil e writ e ope rat ion fail s. The shadow versions method provides faster recovery th an logg ing because th e posit ions of th e c urrent committed objec ts are reco rded ill the map. whe reas recovery from a log requires searchin g throu ghout the log for objec ts. Logg in g should be faster than shadow versions
during the no rmal activ it y o r th e sys tem. This is because loggi ng requires onl y a seque nce or append operations to the sa me fil e. whereas shadow vers ions requires an additional stabl e storage write ( in vo lvi ng two unrelated di sk blocks). Shadow ve rs ions o n their ow n arc not sufficient for a serve r that handles distri bu ted tran sactions. Transaction statu s entries and inte ntio ns li sts a re saved in a file ca ll ed the transac tion statu s fil e. Each inte nti o ns list represe nt s the part of the map th at w ill be a lte red by a transaction wile n it commits. The tran sac tion stalu s fi le may. for eX Jlllp le. be o rga ni zed as a log. The figure be low s hows the map and the tran sacti on statu s ril e for our c urre nt example when T ha s co mmitted and U is prepared to commit.
Transaction starl/!1)ilc (stable storage) Map
T
T
u
A --->P ,
prepared
commi tted
prepared
B ---> P2
A ---> P,
B ---> P3
C ---> Po"
B ---> P2
C---> P4
There is a chance that a serve r Illay c ras h between the time when a COll1lllilled statu s is written to the tran saction statu s fil e and the time when th e map is updated - in whi ch case the client will no t ha ve been acknow ledged. The recovery manager mu st allow for th is poss ibil it y when the se rve r is re placed a ft er a cras h. for exa mpl e by chec kin g whether the map inc ludes th e effects o f the last committed transacti o n in the transaction statu s file. If it does no t. then til e lalle r should be marked as abo rt ed.
14.6.3 The need for transaction status and intentions list entries in a recovery file It is poss ible to desig n a simple recove ry fi le that doe s not in clude e lllries for tran sacti o n Slat LI S ite ms and iJ1l eJ1li ons lists. This SOli of recovery fil e may be suitab le when a ll tran saction s are directed to a sin g le server. The use of tran saction statu s items and intentions lists in the recovery fil e is esse nti a l for a serve r th at is intend ed 10 partic ipate ill d istributed transac tion s. Thi s approach can a lso be useful for se rvers of nOI1 d istribut ed tran sac ti ons for vari o us reasons, including th e following:
Some recovery managers are des igned to wri te th e objects to the recovery file earl y - under the ass llmption that tran sactions normall y comlllit. If transactions use a large number of big objec ts. the need to write them contiguously to the recovery fi le may compli cate th e design of a se rver. Whe n objec ts are refere nced from intentions lists. they call be found wherever th ey are. In ti me stamp ordering concurre ncy co ntrol. a se rver sometim es knows [hat a tran sact ion will eve ntuall y be ab le to co mmit and ac kn ow ledges the cli ent - at thi s tim e the objects are written to th e recove ry fil e (see Chapter 13) 10 ensu re their permanence. Ho weve r. the tran sac tion may ha ve to wa it to commit until earli er transactions ha ve committed. In such situations. the co rresponding transac ti on status ent ries in the reco very fi le wi ll be wailing 10 comlllit and then commilled to e ns ure tim estamp ordering of committed tran sac ti ons ill th e recovery fil e. On
Figure 14.21
Log with entries relati ng 10 two-phase co mmit protocol
Tran s:T
Coord ' r:T
prepared
part ' pant list: .. .
• Tran s:T
Trans:U
committed prepared
• Part ' pant: U Tran s: U
Trans:U
Coord ' r: " . uncertain committed
intentions list
intention s list
recove ry. an y wa ilin g ~lo -co ml1lit tran sac ti ons can be a ll owed 10 commit. becau se th e ones th ey were wa iting fo r ha ve e ithe r just cOlllmitted o r if not ha ve to be
aborted du e
10
fa ilure o f the se rve r.
14.6.4 Recovery of th e two-phase commit protocol In a di stributed transaction . eac h se rve r kee ps its ow n recove ry fil e. The recove ry management described in the prev ious sec ti on mll s t be ex tended to deal w ith an y
transac tion s that are perfo rming the two-phase comlllit protoco l at th e tim e when a se rver rail s. The recove ry manage rs use I wO new status values: dOlle. I/ I/cerraill. These statu s valu es are shown in Fi g ure 14.6. A coordinato r uses L'Ol1lmilled to indica te that th e outcome o f the vote is Yes and dOlle to indi ca te th 0
Thi s exec ution does not match a common-se nse specifica ti on for the behaviour of bank accounts: client 2 should hi.l ve read a balance of $ 1 for x, given that it read the balance of $2 for y . since y's balance was updated after that of x. The anomalous behaviour in the replicated case could not ha ve occ urred if th e bank accounts had been implement ed by a single se rver. We ca n construct systems that manage replica ted objects without the anomalous behav iour produced by the naive protocol in our example. First. we need to understand what count s as co rrec t beha viour for a replicated system.
Linearizability and sequential consistency 0 There are various COlTec tn ess crit eria for replicated objects. The most strictly COlTec t sys tems are lillcari:ablc and thi s propert y is called lilleari:abiliry. In order to understand lineari za bi Iity , consider a repl icated service implementation with two clients. Let the sequence of read and update operation s that Each operation 0ij in th ese client i performs in some exec ution be 0iO' oil' oil" sequences is specified by the operation type and the arguments and return va lues as th ey occurred at run time. We assume that every operation is sy nchron ous. That is. c lient s wait for one operation to compl ete be fore req ues ting the nex!. A single serv er managing a single copy of the objects would seriali ze the operations of the client s. In the case of an exec ution with only c lient 1 and client 2, th is interleaving of th e operations co uld be 020, 021' 10,° 22 11 , () 12' ... . say. We defin e our correctness criteria for replicated objects by referring to a \'irtllol interl eav in g of the clients' operations. which does not necessarily ph ys icall y occ ur at any partic ular replica manage r but which establi shes the correc tness of the execlltion. A replicated shared object service is said to be linea ri zable if /or al/y ex(!('wi oll there is some interl eaving of the seri es 01' ope ration s iss ued by all the clients that sati sfies the following two criteria:
°
,°
The interleaved sequence of operations meets th e spec ificati on of a (s ingle ) correc t copy of the objects. The order of operations in the interleav ing is co nsiste nt with the real tim es at which the operations occurred in the actual exec ution . This de finiti on captures the idea th at for any set of c lient operations there is a virtual canoni cal exec ution - the interl eaved operations that the definition refers to - against a virtual single image of the shared objects. And each client sees a view of the shared objects that is consistent with that single image: th at is, th e res ults of the client' s operations make se nse as th ey occur within the interl eav in g. The service that gave ri se to the exec lltion of the bank account client s in the ex ampl e is not lineari zabl e. Even ignoring the real time at whi ch the operations took place, there is no interleaving of the two clients ' operations that would sati sfy any correct bank account specification: for auditing purposes, if one account update occurred after another, th en the first update should be observed if th e seco nd has been observed. No te Ihat linearizability conce rn s only fhe inte rle aving of individual operations and is not intended to be tran sacti onal. A lineari zable exec ution may break appl icarionspecific notions of consistency if co nc urrency control is nO[ applied. The real -tim e requireme nt inlinea ri zabilit y is de sirable in an ideal world, because it captures our notion that c lie nt s should receiv e up· to-date information. But , eq ually , the prese nce of realtime in th e de finition raises the iss ue of linea ri zabilil y's practicaliLy,
because we cannot always synchronize clocks to the required degree of acc uracy. A weake r correctn ess conditi on is sequential consisTellcy, whi ch ca ptures an essential requ irement concerni ng the orde r in which req uests are processed without appealing to real ti me. The de fi niti on kee ps the first criterio n fro m th e defi niti on for linearizab il ity but modifies th e seco nd as fo ll ows: A re plicated shared object se rv ice is said to be sequ enti ally consisten t if Jor allY execlllioll the re is some interleav in g of the se ries of operations iss ued by a ll the c li ents whi ch sati sfies the fo ll ow ing two cri teria: The interl eaved sequence of operati ons mee ts the s pec ificati o n of a (s in gle) correc t copy o f the objec ts. The o rder of operation s in the interleav in g is consiste nt w ith the prog ram order in whi ch eac h individua l cl ient exec uted th em. Note th at absolute time does no t appear in this de finiti o n. No r does an y o ther IOtal o rder on all operation s. The onl y notion of o rdering that is re levant is th e order of e vents at each separate c lient - the program order. T he interleavin g of ope rat ions ca n shuffl e the sequence of ope rat ions from a set o f client s in any order, as lo ng as each cl ient' s o rder is not violated and the resuh of each operati on is consistent. in terms of the objec ts' spec ificati o n, with th e operatio ns th at preceded it. T hi s is simila r to shufnin g together several packs of cards so that they are intermingled in slI ch a way as to preserve the o ri gi nal o rder o f eac h pack. Every linearizable se rvice is al so sequenti all y consistent , since real-t ime order re flects each c lient' s program order. The converse does no t hold. An example execlItion fo r a service th at is sequenti a ll y consistent but no t lineariza ble follows:
Client I:
C li ent 2:
se! Ba ICillce 8 (x, I ) geI Balall ee"C)' ) --> 0 ge lBalall ceA(x) --> 0
Thi s execllti on is possible un de r a na ive replicati o n strategy even if ne ither of th e comp uters A o r B fails but if the updore of x that client I made at B has not reached A when client 2 reads it. The rea l-time cri terion fo r Iineari za bilil y is not sati sfied. since geI BalalleeA(x)-->O occu rs late r th an seI Ba lallceE(x, I ); but the follow in g interl eav in g sati sfi es both criteria for seque nti a l consistency: ge t Balance A (y) ---7 O. ge IBalalleeA( x) --> 0, se IBalall eee(x, I ) , ,b = p ->b . 1:
} } Program Reader: maill( ) ( strllct
shared *p:
mcthersefup( ); p = (SII"IICI shared *)METHERBA S E:
while(TRUE)(
1* read theiields alice evel), second *1
prilll!("(/ = %d. b = %dln". p ->a. p - >h):
sleep{1 ):
} }
Thi s lype of implement.ation is nonnall y only suiled to a co ll ec tion of homogeneous com puters, with common data and paging format s. Middleh'ore: Some languages such as Orca [8al et 01. 1990J , and middl eware such as Linda [Carriero and Gelernter 1989] and its derivatives JavaSpaces [Bishop and Warren 2003] and TSpaces lWyc koff (I al. 1998], support forms of DSM without any hardware or paging support, in a platfonn-ncutra l way. In thi s type of imple mentation , sharin g is implement ed by communication between instances of the user-level s upponlayer in clients and servers. Processes make call s to thi s layer when they access data ite ms in DSM. The instances of thi s layer at the different computers access local data items and communicate as necessary to maintain consistency.
Thi s c hapter concentrates on the usc of software to implement DSM on standard complHcrs. Even with hardware support , high-level software techniques may be used to minimize th e amount of communi cation between components of a DSM impl ementation.
The page-bused ~pproac h has the advan tage of im posing no particular structure o n the DSM. whic h appears as a seque nce of bytes . In prin ci ple . it e nab les prog ram s des igned for a shared-memo ry multiprocessor to run o n compute rs w ithout shared me mory. wi th little or no adap tation. Microke rne ls s uc h as Mach and Chorus provi de native suppo rt for DSM (and o the r memory abs tr;Jctions - th e Mach virtual memory fac ilities are described in www.cdk-l-.n!.;t/lll11l:h) . Page-based DSM is more usually implemented largely at user level to lake ad van tage of the fl ex ibilit y that that provides . The impl ement ation utili zes kernel support for use r-leve l page fau lt hand lers. UN IX and some va ri ant s of Wind ows provide thi s fac ilit y. Mi cro processors with 64-bit add ress spaces w ide n the scope ror page-based DSM by re lax ing constra ints on address space man age ment [Bart oli (!{ al. 1993] . The exampl e in Fig ure 18.2 is of two C prog ram s. Neada and Writ('/". whi ch communi ca te via the page- based DSM pro vided by the Melher sys tem [Minnich and Farber 19891. Writer updat es two fie lds in a stru cture ove rlai d upon the begin nin g o r th e Methe r DSM seg me nt (beg inning at add ress METtJENBA SE) and Reader pe ri odi call y print s out the va lues it reads from th ese fi e lds. The two prog ram s contain no spec ia l o perations: th ey are comp iled into machine instructions th at access a co mm on range of virtual memo ry addresses (s tarting at METtJERIJASE). Meth er ra n ove r co nventional Su n workstati on and network hardware. The midd lewa rc approac h is quite differe nt to th e usc or s pec ia lized hardware and pag ing in th at it is not intended to utili ze ex isting shared-mcmo ry codc. Its signifi cance is that it enables LI S to devc lop higher- leve l abstractions o f shared objec t ~ . rath er th an shared memory loca ti ons.
18.2 Design and implementation issues This sec tion d iscllsses des ign and implementation opt ions conce rnin g the main fea tures that characte ri ze a DSM syste m. These are the stru cture of data he ld in DSM: the sy nchroniza tion model lIsed to access DSM co nsistent ly at th e appli ca ti o n leve l: the DSM consistency mode l. wh ich governs the co nsistency o f data val ues accessed fro m d iffe rent co mput ers: the updat e options for co mmuni ca tin g wri tt en va lues betwee n comp ute rs: the gra nularity of sharin g in a DSM imp le me ntati o n: and the problem of thrashin g.
18.2. 1 Structure In Chapt er 15. we considered systems th at replicate a co ll ec ti on o f objects sllc h as diari es and fil es. Th ose systems enab le c li ent programs to pe rfo rm o perat io ns upo n th e objec ts as tho ugh the re was on ly o ne copy of eac h o bjec t. but in realit y they may be accessing d ifferent phys ica l re plicas. T he systems mak e guarantees abo ut the ex te nt to whic h the replicas of th e objec ts are all owed to dive rge. A DSM sys tem is j ust slic h a re plicati o n syste m. Each app licatio n process is prese nt ed with so me abstraction of a coll ec tion of objec ts. but in thi s case the 'coll ec tion ' looks more or less like memory. That is. the objects can be addressed in some fa shion o r o the r. Diffe rent app roac hes to DSM va ry in what th ey co nsider to be an
'object' and in how objects are ad dressed. We consider three approaches, wh ich view DSM as be in g composed res pectively o r co nti guo us bytes. language- le ve l objects or immutable data ite ms.
Byte-oriented 0 Thi s type of DSM is accessed as o rdinary virtual memory - a co nt iguous array of bytes. It is the view illustrat ed above by the Meth er system. Tt is also the view of man y other DSM systems. incl uding Ivy. whic h we di sc uss in Sec tion 18.3. It all ows application s (a nd lang uage implementations) to impose what ever data structures th ey want on th e shared memory. The shared objec ts are directl y address ible memory locat io ns (i n practice. th e shared locations may be multi-byte word s rather than individua l bytes). The only opera ti ons upon those objects are read (or LOAD) and wrile (or STO RE). If x and ya re two me mory locations. the n wc denote instances of these opera ti ons as foll ows: R(x)a - a read operatio n that reads the va lue a from loca ti o n x. W(x)h - 11 wriTe operati on that stores value h at locat io n x. An exampl e exec uti on is W(x) I. R(x)2 . This process wri tes the value I to location .r and then reads the va lue 2 from it. Some othe r process must have wrillen the value 2 to that locat io n mennwhi le.
Object-oriented 0 The shared me mo ry is struc tured as a collec tion of lang uage- level objec ts with hi gher-l eve l semanti cs than si mpl e read/wriTe va riabl es, such as stacks and di cti o nar ies. The cont ent s of the shared memory are chan ged o nl y by in vocations upon these objects and neve r by d irec t access to th e ir me mbe r variabl es. An advantage of viewi ng memory in thi s way is that object semanti cs can be utili zed when e nforcin g consiste ncy. Orca views DSM as a collection of shared objects and automarically se riali zes operations upon an y g ive n object. Immutable data 0 Here DSM is viewed as a collecti on of immutable data items that processes can rcud , add to and re move from. Exa mples inc lude Agora [Bi siani and Forin 19RRI and . more significa ntl y. Lin da and its derivat ives. TSpaces and Ja vaSpaces. Linda-type systems provide the programmer with collect ions of tupl es called a fUple space (see Section 16.3. 1). Tu ples consist of a sequence of one or more typed data
fie lds such as . and Th e set o f names de fin ed within th e 'ypes secti on o f n W S DL definiti o n is ca lled its rarge l I101lll'.\pace. The message secti o n o rlhe abstrac t pa n co ntain s a desc ripti o n oflh e set of messages exc hanged. For the doc ume nt style of inte racti on, these mess ages w ill be used direc tl y. For the re4u es l ~ repl y style of interacti on. th ere are two mess ages ror each ope rati on. whi ch are lIsed 10 desc ribe the opera ti ons in the illll'I/OCC sec tion. The concre te part spec ifi es how and whe re tile se rvice may be contacted . The inhe rent modul arit y of a WSDL definition a ll ows its component s 10 be combin ed in diffe rent ways. fo r exampl e . the same interface Ill ay be used with different bindings or locati o ns. The types may be de fined inside the types e leme nt or they may be defined in a separate doc ument refe renced by a URI from th e types c lement . In the latt er case. th e type de finiti o ns can be re fe renced from severa l different WSDL doc um ent s.
Messages or operations 0 In web services. a ll that the c li e nt and the serve r need, is to have a comlllon idea nbo llt th e mess ages to be exc han ged. For a service based on th e exc hange of a small number of diffe rent types of doc Ulll enl. WSDL just desc ribes th e types of th e diffcre nt mcssages to be exc han ged . Whe n 1.1 c li ent se nd s one of th ese messages to a web se rvice. the latter dec ides what o pe rati o n to pe rform and what ty pe of mess age to send back to th e c l iellt , o n the bas is o f the type of the message it rece ived. In o ur Ja va exa mpl e. two messages will be de fined fo r eac h of th e ope rati ons in th e interface - o ne for the request and one for th e re pl y. For ex ampl e. Fig ure 19. 11 shows the request and rep ly messages fo r th e o pe rati o n lIell'Shape operati o n whi ch has a sin gle input argume nt of type G /'{IfJh;('aIO/~i('('f and a sin gle o utput argum ent of type illf. But for services th at support several diffe rent o pe rat ions, it is mo rc effecti ve to spec ify th e mess ages exc han ged as requests for o perati o ns with arg um ent s and the ir
corresponding replies. allowing the service to dispatch each req uest to the appropri ate o peration . However. in WSDL an o pe rati o n is a constru ct for re latin g reques t and re pl y messages. rathe r than the de finiti on of a n o perati o n in a se rvice illlerface.
Interlace 0 Th e co ll ec tion of operati ons be lo ng in g to a web se rvice arc g rouped toge the r in an XML e lement named illfelf ace (sometimes ca ll ed jJorr7)pe). Eac h o perati o n mll st spec ify the message exchange patte rn be tween c li e nt and serve r. The avail able opti ons inc lude th ose show n in Fig ure 19. 12. Th e first o ne, I ll-O ll f is the comlll onl y used RR form o f c l ient-se rver co mmllni cat io n. In thi s patt ern. the repl y message may be re placed with a fault message. I II·Oll ly is for one-way mess ages with M ayhe semanti cs and Ow -
Figure 19.12 Message excllange patterns for WSDL operations Messages sem by
Name
Client
Server
Request Request
Reply
In-Onl y Robu sl In-Onl y
Req/lesl
OUI - In
Reply
In-O ul
Delivery
F aulf messag e
may replace Reply no fault message guara nteed
may be sent
Reqllest
may replace Reply
OUI -Onl y
Request
no fault message
Robu st O UI -Onl y
Rf'q ll csf
g uaran teed
ma y send fau lt
Ollly is for oncwa), messages fro m server to client ; fault messages canIlot be sent with e ith er. RohIfSf· II1-01lIy and Robllst Oll f·Only are th e corres ponding messages with guaralllccd de li ve ry: and fault messages may be exc han ged. 0111-111 is a req uest-response interacti o n initi ated by th e se rve r. Ret urning to our Java exampl e. cneh of the opera ti ons is de fined to ha ve an I II -OII( patte rn . The operation lie IrS/lOpe is show n in Fig ure 19. 13. us ing the messages defined in Fi g ure 19. 1 I. This definition. together w ith de fin iti o ns of the fo ur othe r operations will be enclosed in an XM L i11l(,ljcl('e c lement. An operati on m ay a lso spec ify the fa ult messagcs that can be scnt. Bur if fo r cxampl e, an ope rati o n has two argumcnts. say an integer and a strin g. then the re is no need to definc a new datatype. s ince these types are defined for XML sc hcm,;ls. However it wi ll be necessary to de fine a message that has these two parts. Thi s message ca n then be lI sed as an inpul or o utput in the definition for operation. Inheritance: Any WSDL interfa ce may ex tend o ne o r more other WSDL interfaces. Thi s is a s impl e 1'0 1111 or inheritance in which an int erface s upports the operations of an y int erfaces it extend s in add iti o n to th ose it defin es itse lf. Rec urs ive de finiti o n of int erfaccs is not allow ed ; 111m is. if int erface B ex tend s int erface A. the n int erface A ca nn ot ex tend int erface B.
Figure 19.13 WSDL operation newShape operation name = "newShape"
pattern = In-Out input message = "tns:ShapeLisl_newShape" output message = "tns:ShapeLisCnewShapeResponse"
tns - target names pace xsd - XML schema definitions The names operation, pattern , input and output are defined in the XML schema for WSDL
Figure 19.14 SOAP binding and service definitions
binding name : "ShapeListBinding" type: "tns:ShapeList" soap:binding transport : URI lor sqrem,qs lor soap/http style: rp c operation name= " newShape"
service name : "MyShapeListService"
endpoint name: "ShapeListPort" binding: "tn s:ShapeListBinding"
soap:address location: service URI
input soap:body encoding , names pace the service URI is:
output
''http://localhost:8080/ShapeList-jaxrpc/ShapeList''
soap.·body encoding , namespace soap:operalion soapAction
Concrete part 0 The remaining (concrete) part of a WSDL doc ument consists of the billdiJlg (the choice of protocols) and th e service (the cho ice of end point or server
address). The two are related, since the fa nn of address depends on th e type of prolOcoi in usc. For exa mpl e, a SOA P e nd po int will lI SC a UR I whereas a CORBA endpoint would use a CORBA-specific objec t identifi er. Binding: The hillding secti on in a WSDL doc um ent says whi ch message fo rmats and form of ex terna l data representati on arc to be used. Fo r exampl e. web se rvices freque ntly use SOAP. HTIP and M1ME. But they shou ld eventually be able to use, for example, GlOP (Secti on 20. 1) to access instances of CORBA objects. Bind ings may be assoc iated wi th partic ul ar operat ions or interfaces or they may be le ft free fo r lise by a varielY of different web services. Fi g ure 19. 14 shows an exa mpl e of a hinding enc los in g a soap:hinding th at specifies the URL of a particul ar protocol for tran smitting SOAP envelopes: th e HTIP bi nding for SOAP. Opt io nal attribut es of this element may also spec ify the foll ow ing:
the message exc hange palte rn , whi ch may be e ither
fpC
(requ est-repl y) or
documellf exchange. The default value is document;
the XML sc hema for the message formats. The de fault is the SOA P em 'elope; the XML sc hema for the ex ternal data representati o n. The default is the SOA P encod ing of XML.
Figure 19. 14 also shows th e detail s orlhe bindings fo r one of the opera ti ons (Jll'H'Sltape) . specifying tha i bOlh the inpur and the ourf}ur message should (ravel in a soap body, using
a parti cular encodin g style and in additi on that the operation should be transmitted as a soap Aetion. Servi ce: Eac h sel'l'ice e le ment in a WSDL doc ument specifies the name of the service and one or more ef/dpoints (or pon s) whe re an instance of the se rvice may be co ntacted. Each o f the endpoint e lements refers to the name o f th e binding in use and , in the case of a SOAP binding. uses a soap:address element to specify the URI of the se rvice locati on. Documentation 0 Both human and mac hine readable information may be inserted in a doci/mentation e lement at most po ints within a WSDL doc ume nt. This infonnation may be removed before WSDL is used for aut omati c process ing. fo r exampl e, by stub compilers.
WSDL use 0 Compl ete WSDL doc um ents can be accessed via their URI s by clients and servers. either directl y or indirectly via a direc tory se rvice such as UDDI. Tools are available for generatin g WSDL definitions from informati on prov ided via a graphi cal use r interface, remov ing th e need for users to be in vol ved in the complex deta ils and structure of WS DL. For examp le, th e Web Services Desc ripti on Language for Java toolk it allows the creati on, re prese nt ation, and manipulati on of WSDL docum ents describing se rvices rWSDL4J 20031. WSDL definitions ca n also be generated from interface definition s written in oth er languages such as Java JAX- RPC di sc ussed earli er in Secti on 19.2. 1.
19.4 A directory service for use with web services The re are many ways in whi ch clients can obtain service desc ripti ons, for exa mple anyone provi ding a hi gher-le vel web service like th e Travel Age nt service di scussed in Secti on \9. 1 would alm os t certainl y make a we b page adve rti sing the service and potenti al cl ients would come ac ross the web page when searching fo r services o f th at type. However. an y organi zati on th at plans to base its appl ications on web se rvices will find it more conveni ent to use a directory se rvice to make these services avail able to c lients. Thi s is the purpose of the Uni versal Directory and Discove ry Service (UDDI ) rB e li wood el 01. 20031, whi ch prov ides both a nam e servi ce and a directory service (see Secti on 9.3 . That is, WSDL se rvice desc ripti ons may be looked up by name (a whit e pages service) or by allribute (a ye llow pages servi ce). They may also be accessed directl y via the ir URLs, whi ch is conve nie nt for developers who are designin g cl ient program s th at use the service. C lients may use th e ye llow pages approach to look up a parti cular category of service such as travel agent or bookse lle r, or th ey may use the white pages approach 10 look up a service with refere nce to the organi zati on that provides it.
Data structures 0 The data structures supportin g UDDI are designed to allo w all the above styles of access and can incorporate an y amount of human-readable infonnation . The dat a is organi zed in te rm s of the four structures shown in Figure 19. 15, each of whi ch can be accessed indi viduall y by means of an identifier call ed a key (apart from (M odel, whi ch can be accessed by a URL):
Figure 19.15
The main UDDI data structures
businessServices
businessEntity human readable information
businessServices
about th e publisher
key
businessServices human readable information about a fam ily of services
key
binding Template binding Template binding Template ~ informati on - - - -
about the ~ RL ~ service Interfaces key URL
tModel tModel tModel
service descripti ons
businessEllIily: desc ri bes the organi zati on that prov ides these web se rvices, g ivin g its name, address and ac ti vities etc.: bus;nessSen';ces: stores informati on aboLit 11 set of instances of a web service . such as its name and a desc ripti o n of its pu rpose. for example, trave l agent or bookse tl er: bindingTemplare: ho lds th e add ress o f a we b se rvice instance and refe rences to service descriptions: {Model: ho lds se rvice desc ripti ons, usua ll y W SDL doc um ent s, stored o ut side the database and accessed by mean s of URLs.
Lookup 0 UDD! pro vides an API for look ing up services based on two selS of query o peration s:
the gel_xxx sel of operati ons inc ludes gel_BlIsinessDerail, gel_Serl'iceDerail, gel_binding Detail and gel_IModeIDerai/: they retrieve an emit y correspondin g to a give n key; th e jlIUCX.:rX set of operati ons includes find_business. find_ serl'ice, filld _binding and jlnd_IModel: they retrieve the set of entit ies that matches a parti cul ar set of search criteri a. providing a summary of names, descriptions, keys and URLs. Thus, clients in possession of a parti cul ar key may use a gel_.ux operati on to retrieve the correspondin g entit y directl y. Other clients Illay use browsing 10 assist with searches - startin g with a large set of res ults and grad uall y narrowing it down . For example . th ey may start by lIsing th e /ind_bllsiness operari on in orde r to get a li st containing a summary of informati on on matchin g prov iders . From this summary. the use r may use the/ind_sen'ice operati on to narrow the searc h by matching the sort of service required. In both cases. they will find Ihe key of a suitable billdillgTellll'lale and thereby find the URL for retrieving th e WSDL doc um en t fo r a suitable service.
In add iti on. UOO I prov ides a notify/subsc ribe interface by whi ch clie nt s reg ister inte res t in a part icular set o f cl1Iit ies in a U DOI reg istry and get ch