Table of Contents

Preface

Chapter 1: The Concurrent Computing Landscape
    1.1 The Essence of Concurrent Programming
    1.2 Hardware Architectures
        1.2.1 Processors and Caches
        1.2.2 Shared-Memory Multiprocessors
        1.2.3 Distributed-Memory Multicomputers and Networks
    1.3 Applications and Programming Styles
    1.4 Iterative Parallelism: Matrix Multiplication
    1.5 Recursive Parallelism: Adaptive Quadrature
    1.6 Producers and Consumers: Unix Pipes
    1.7 Clients and Servers: File Systems
    1.8 Peers: Distributed Matrix Multiplication
    1.9 Summary of Programming Notation
        1.9.1 Declarations
        1.9.2 Sequential Statements
        1.9.3 Concurrent Statements, Processes, and Procedures
        1.9.4 Comments
    Historical Notes
    References
    Exercises

Part 1: Shared-Variable Programming

Chapter 2: Processes and Synchronization
    2.1 States, Actions, Histories, and Properties
    2.2 Parallelization: Finding Patterns in a File
    2.3 Synchronization: The Maximum of an Array
    2.4 Atomic Actions and Await Statements
        2.4.1 Fine-Grained Atomicity
        2.4.2 Specifying Synchronization: The Await Statement
    2.5 Producer/Consumer Synchronization
    2.6 A Synopsis of Axiomatic Semantics
        2.6.1 Formal Logical Systems
        2.6.2 A Programming Logic
        2.6.3 Semantics of Concurrent Execution
    2.7 Techniques for Avoiding Interference
        2.7.1 Disjoint Variables
        2.7.2 Weakened Assertions
        2.7.3 Global Invariants
        2.7.4 Synchronization
        2.7.5 An Example: The Array Copy Problem Revisited
    2.8 Safety and Liveness Properties
        2.8.1 Proving Safety Properties
        2.8.2 Scheduling Policies and Fairness
    Historical Notes
    References
    Exercises

Chapter 3: Locks and Barriers
    3.1 The Critical Section Problem
    3.2 Critical Sections: Spin Locks
        3.2.1 Test and Set
        3.2.2 Test and Test and Set
        3.2.3 Implementing Await Statements
    3.3 Critical Sections: Fair Solutions
        3.3.1 The Tie-Breaker Algorithm
        3.3.2 The Ticket Algorithm
        3.3.3 The Bakery Algorithm
    3.4 Barrier Synchronization
        3.4.1 Shared Counter
        3.4.2 Flags and Coordinators
        3.4.3 Symmetric Barriers
    3.5 Data Parallel Algorithms
        3.5.1 Parallel Prefix Computations
        3.5.2 Operations on Linked Lists
        3.5.3 Grid Computations: Jacobi Iteration
        3.5.4 Synchronous Multiprocessors
    3.6 Parallel Computing with a Bag of Tasks
        3.6.1 Matrix Multiplication
        3.6.2 Adaptive Quadrature
    Historical Notes
    References
    Exercises

Chapter 4: Semaphores
    4.1 Syntax and Semantics
    4.2 Basic Problems and Techniques
        4.2.1 Critical Sections: Mutual Exclusion
        4.2.2 Barriers: Signaling Events
        4.2.3 Producers and Consumers: Split Binary Semaphores
        4.2.4 Bounded Buffers: Resource Counting
    4.3 The Dining Philosophers
    4.4 Readers and Writers
        4.4.1 Readers/Writers as an Exclusion Problem
        4.4.2 Readers/Writers Using Condition Synchronization
        4.4.3 The Technique of Passing the Baton
        4.4.4 Alternative Scheduling Policies
    4.5 Resource Allocation and Scheduling
        4.5.1 Problem Definition and General Solution Pattern
        4.5.2 Shortest-Job-Next Allocation
    4.6 Case Study: Pthreads
        4.6.1 Thread Creation
        4.6.2 Semaphores
        4.6.3 Example: A Simple Producer and Consumer
    Historical Notes
    References
    Exercises

Chapter 5: Monitors
    5.1 Syntax and Semantics
        5.1.1 Mutual Exclusion
        5.1.2 Condition Variables
        5.1.3 Signaling Disciplines
        5.1.4 Additional Operations on Condition Variables
    5.2 Synchronization Techniques
        5.2.1 Bounded Buffers: Basic Condition Synchronization
        5.2.2 Readers and Writers: Broadcast Signal
        5.2.3 Shortest-Job-Next Allocation: Priority Wait
        5.2.4 Interval Timer: Covering Conditions
        5.2.5 The Sleeping Barber: Rendezvous
    5.3 Disk Scheduling: Program Structures
        5.3.1 Using a Separate Monitor
        5.3.2 Using an Intermediary
        5.3.3 Using a Nested Monitor
    5.4 Case Study: Java
        5.4.1 The Threads Class
        5.4.2 Synchronized Methods
        5.4.3 Parallel Readers/Writers
        5.4.4 Exclusive Readers/Writers
        5.4.5 True Readers/Writers
    5.5 Case Study: Pthreads
        5.5.1 Locks and Condition Variables
        5.5.2 Example: Summing the Elements of a Matrix
    Historical Notes
    References
    Exercises

Chapter 6: Implementations
    6.1 A Single-Processor Kernel
    6.2 A Multiprocessor Kernel
    6.3 Implementing Semaphores in a Kernel
    6.4 Implementing Monitors in a Kernel
    6.5 Implementing Monitors Using Semaphores
    Historical Notes
    References
    Exercises

Part 2: Distributed Programming

Chapter 7: Message Passing
    7.1 Asynchronous Message Passing
    7.2 Filters: A Sorting Network
    7.3 Clients and Servers
        7.3.1 Active Monitors
        7.3.2 A Self-Scheduling Disk Server
        7.3.3 File Servers: Conversational Continuity
    7.4 Interacting Peers: Exchanging Values
    7.5 Synchronous Message Passing
    7.6 Case Study: CSP
        7.6.1 Communication Statements
        7.6.2 Guarded Communication
        7.6.3 Example: The Sieve of Eratosthenes
        7.6.4 Occam and Modern CSP
    7.7 Case Study: Linda
        7.7.1 Tuple Space and Process Interaction
        7.7.2 Example: Prime Numbers with a Bag of Tasks
    7.8 Case Study: MPI
        7.8.1 Basic Functions
        7.8.2 Global Communication and Synchronization
    7.9 Case Study: Java
        7.9.1 Networks and Sockets
        7.9.2 Example: A Remote File Reader
    Historical Notes
    References
    Exercises

Chapter 8: RPC and Rendezvous
    8.1 Remote Procedure Call
        8.1.1 Synchronization in Modules
        8.1.2 A Time Server
        8.1.3 Caches in a Distributed File System
        8.1.4 A Sorting Network of Merge Filters
        8.1.5 Interacting Peers: Exchanging Values
    8.2 Rendezvous
        8.2.1 Input Statements
        8.2.2 Client/Server Examples
        8.2.3 A Sorting Network of Merge Filters
        8.2.4 Interacting Peers: Exchanging Values
    8.3 A Multiple Primitives Notation
        8.3.1 Invoking and Servicing Operations
        8.3.2 Examples
    8.4 Readers/Writers Revisited
        8.4.1 Encapsulated Access
        8.4.2 Replicated Files
    8.5 Case Study: Java
        8.5.1 Remote Method Invocation
        8.5.2 Example: A Remote Database
    8.6 Case Study: Ada
        8.6.1 Tasks
        8.6.2 Rendezvous
        8.6.3 Protected Types
        8.6.4 Example: The Dining Philosophers
    8.7 Case Study: SR
        8.7.1 Resources and Globals
        8.7.2 Communication and Synchronization
        8.7.3 Example: Critical Section Simulation
    Historical Notes
    References
    Exercises

Chapter 9: Paradigms for Process Interaction
    9.1 Manager/Workers (Distributed Bag of Tasks)
        9.1.1 Sparse Matrix Multiplication
        9.1.2 Adaptive Quadrature Revisited
    9.2 Heartbeat Algorithms
        9.2.1 Image Processing: Region Labeling
        9.2.2 Cellular Automata: The Game of Life
    9.3 Pipeline Algorithms
        9.3.1 A Distributed Matrix Multiplication Pipeline
        9.3.2 Matrix Multiplication by Blocks
    9.4 Probe/Echo Algorithms
        9.4.1 Broadcast in a Network
        9.4.2 Computing the Topology of a Network
    9.5 Broadcast Algorithms
        9.5.1 Logical Clocks and Event Ordering
        9.5.2 Distributed Semaphores
    9.6 Token-Passing Algorithms
        9.6.1 Distributed Mutual Exclusion
        9.6.2 Termination Detection in a Ring
        9.6.3 Termination Detection in a Graph
    9.7 Replicated Servers
        9.7.1 Distributed Dining Philosophers
        9.7.2 Decentralized Dining Philosophers
    Historical Notes
    References
    Exercises

Chapter 10: Implementations
    10.1 Asynchronous Message Passing
        10.1.1 Shared-Memory Kernel
        10.1.2 Distributed Kernel
    10.2 Synchronous Message Passing
        10.2.1 Direct Communication Using Asynchronous Messages
        10.2.2 Guarded Communication Using a Clearinghouse
    10.3 RPC and Rendezvous
        10.3.1 RPC in a Kernel
        10.3.2 Rendezvous Using Asynchronous Message Passing
        10.3.3 Multiple Primitives in a Kernel
    10.4 Distributed Shared Memory
        10.4.1 Implementation Overview
        10.4.2 Page Consistency Protocols
    Historical Notes
    References
    Exercises

Part 3: Parallel Programming

Chapter 11: Scientific Computing
    11.1 Grid Computations
        11.1.1 Laplace's Equation
        11.1.2 Sequential Jacobi Iteration
        11.1.3 Jacobi Iteration Using Shared Variables
        11.1.4 Jacobi Iteration Using Message Passing
        11.1.5 Red/Black Successive Over-Relaxation (SOR)
        11.1.6 Multigrid Methods
    11.2 Particle Computations
        11.2.1 The Gravitational N-Body Problem
        11.2.2 Shared-Variable Program
        11.2.3 Message-Passing Programs
        11.2.4 Approximate Methods
    11.3 Matrix Computations
        11.3.1 Gaussian Elimination
        11.3.2 LU Decomposition
        11.3.3 Shared-Variable Program
        11.3.4 Message-Passing Program
    Historical Notes
    References
    Exercises

Chapter 12: Languages, Compilers, Libraries, and Tools
    12.1 Parallel Programming Libraries
        12.1.1 Case Study: Pthreads
        12.1.2 Case Study: MPI
        12.1.3 Case Study: OpenMP
    12.2 Parallelizing Compilers
        12.2.1 Dependence Analysis
        12.2.2 Program Transformations
    12.3 Languages and Models
        12.3.1 Imperative Languages
        12.3.2 Coordination Languages
        12.3.3 Data Parallel Languages
        12.3.4 Functional Languages
        12.3.5 Abstract Models
        12.3.6 Case Study: High-Performance Fortran (HPF)
    12.4 Parallel Programming Tools
        12.4.1 Performance Measurement and Visualization
        12.4.2 Metacomputers and Metacomputing
        12.4.3 Case Study: The Globus Toolkit
    Historical Notes
    References
    Exercises

Glossary

Index
The Concurrent Computing Landscape
Imagine the following scenario: Several cars want to drive from point A to point B. They can compete for space on the same road and end up either following each other or competing for positions (and having accidents!). Or they could drive in parallel lanes, thus arriving at about the same time without getting in each other's way. Or they could travel different routes, using separate roads.

This scenario captures the essence of concurrent computing: There are multiple tasks that need to be done (cars moving). Each can execute one at a time on a single processor (road), execute in parallel on multiple processors (lanes in a road), or execute on distributed processors (separate roads). However, tasks often need to synchronize to avoid collisions or to stop at traffic lights or stop signs.

This book is an "atlas" of concurrent computing. It examines the kinds of cars (processes), the routes they might wish to travel (applications), the patterns of roads (hardware), and the rules of the road (communication and synchronization). So load up the tank and get started.

In this chapter we describe the legend of the concurrent programming map. Section 1.1 introduces fundamental concepts. Sections 1.2 and 1.3 describe the kinds of hardware and applications that make concurrent programming both interesting and challenging. Sections 1.4 through 1.8 describe and illustrate five recurring programming styles: iterative parallelism, recursive parallelism, producers and consumers, clients and servers, and interacting peers. The last section defines the programming notation that is used in the examples and the remainder of the book. Later chapters cover applications and programming techniques in detail. The chapters are organized into three parts: shared-variable programming,
distributed (message-based) programming, and parallel programming. The introduction to each part and each chapter gives a road map, summarizing where we have been and where we are going.
1.1 The Essence of Concurrent Programming

A concurrent program contains two or more processes that work together to perform a task. Each process is a sequential program, namely a sequence of statements that are executed one after another. Whereas a sequential program has a single thread of control, a concurrent program has multiple threads of control.

The processes in a concurrent program work together by communicating with each other. Communication is programmed using shared variables or message passing. When shared variables are used, one process writes into a variable that is read by another. When message passing is used, one process sends a message that is received by another.

Whatever the form of communication, processes also need to synchronize with each other. There are two basic kinds of synchronization: mutual exclusion and condition synchronization. Mutual exclusion is the problem of ensuring that critical sections of statements do not execute at the same time. Condition synchronization is the problem of delaying a process until a given condition is true. As an example, communication between a producer process and a consumer process is often implemented using a shared memory buffer. The producer writes into the buffer; the consumer reads from the buffer. Mutual exclusion is required to ensure that the producer and consumer do not access the buffer at the same time, and hence that a partially written message is not read prematurely. Condition synchronization is used to ensure that a message is not read by the consumer until after it has been written by the producer.

The history of concurrent programming has followed the same stages as other experimental areas of computer science. The topic arose due to opportunities presented by hardware developments and has evolved in response to technological changes. Over time, the initial, ad hoc approaches have coalesced into a collection of core principles and general programming techniques.

Concurrent programming originated in the 1960s within the context of operating systems. The motivation was the invention of hardware units called channels or device controllers. These operate independently of a controlling processor and allow an I/O operation to be carried out concurrently with continued execution of program instructions by the central processor. A channel communicates with the central processor by means of an interrupt: a hardware signal that says "stop what you are doing and start executing a different sequence of instructions."
The programming challenge, and indeed the intellectual challenge, that resulted from the introduction of channels was that now parts of a program could execute in an unpredictable order. Hence, if one part of a program is updating the value of a variable, an interrupt might occur and lead to another part of the program trying to change the value of the variable. This specific problem came to be known as the critical section problem, which we cover in detail in Chapter 3.

Shortly after the introduction of channels, hardware designers conceived of multiprocessor machines. For a couple decades these were too expensive to be widely used, but now all large machines have multiple processors. Indeed, the largest machines have hundreds of processors and are often called massively parallel processors, or MPPs. Even most personal computers will soon have a few processors. Multiprocessor machines allow different application programs to execute at the same time on different processors. They also allow a single application program to execute faster if it can be rewritten to use multiple processors. But how does one synchronize the activity of concurrent processes? How can one use multiple processors to make an application run faster?

To summarize, channels and multiprocessors provide both opportunities and challenges. When writing a concurrent program one has to make decisions about what kinds of processes to employ, how many to use, and how they should interact. These decisions are affected by the application and by the underlying hardware on which the program will run. Whatever choices are made, the key to developing a correct program is to ensure that process interaction is properly synchronized.

This book examines all aspects of concurrent programming. However, we focus on imperative programs with explicit concurrency, communication, and synchronization. In particular, the programmer has to specify the actions of each process and how they communicate and synchronize. This contrasts with declarative programs (e.g., functional or logic programs), in which concurrency is implicit and there is no reading and writing of a program state. In declarative programs, independent parts of the program may execute in parallel; they communicate and synchronize implicitly when one part depends on the results produced by another. Although the declarative approach is interesting and important (see Chapter 12 for more information), the imperative approach is much more widely used. In addition, to implement a declarative program on a traditional machine, one has to write an imperative program.

We also focus on concurrent programs in which process execution is asynchronous, namely each process executes at its own rate. Such programs can be executed by interleaving the processes on a single processor or by executing the processes in parallel on a multiple instruction stream, multiple data stream (MIMD) multiprocessor. This class of machines includes shared-memory
multiprocessors, distributed-memory multicomputers, and networks of workstations, as described in the next section. Although we focus on asynchronous multiprocessing, we describe synchronous multiprocessing (SIMD machines) in Chapter 3 and the associated data-parallel programming style in Chapters 3 and 12.
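To make the producer/consumer buffer described earlier in this section concrete, the following sketch uses POSIX threads (Pthreads, a case study in Chapters 4 and 5) rather than the notation introduced later in this chapter. It is an illustration only; its details (the one-slot buffer, the full flag, and the fixed number of items) are assumptions chosen for brevity. The mutex provides mutual exclusion for the buffer, and the condition variable together with the full flag provides condition synchronization.

    /* One-slot producer/consumer buffer using POSIX threads. */
    #include <pthread.h>
    #include <stdio.h>

    static int buffer;          /* the shared one-slot buffer */
    static int full = 0;        /* condition: does the buffer hold an unread message? */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t changed = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg) {
        (void)arg;
        for (int i = 1; i <= 5; i++) {
            pthread_mutex_lock(&lock);              /* mutual exclusion */
            while (full)                            /* condition synchronization: */
                pthread_cond_wait(&changed, &lock); /*   wait until the buffer is empty */
            buffer = i;                             /* deposit the message */
            full = 1;
            pthread_cond_signal(&changed);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        for (int i = 1; i <= 5; i++) {
            pthread_mutex_lock(&lock);
            while (!full)                           /* wait until a message is present */
                pthread_cond_wait(&changed, &lock);
            printf("consumed %d\n", buffer);        /* fetch the message */
            full = 0;
            pthread_cond_signal(&changed);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

The while loops around pthread_cond_wait re-check the waited-for condition after every wakeup, which is the standard way to express condition synchronization with this library.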
1.2 Hardware Architectures

This section summarizes key attributes of modern computer architectures. The next section describes concurrent programming applications and how they make use of these architectures. We first describe single processors and cache memories. Then we examine shared-memory multiprocessors. Finally, we describe distributed-memory machines, which include multicomputers and networks of machines.
1.2.1 Processors and Caches

A modern single-processor machine contains several components: a central processing unit (CPU), primary memory, one or more levels of cache memory, secondary (disk) storage, and a variety of peripheral devices (display, keyboard, mouse, modem, CD, printer, etc.). The key components with respect to the execution of programs are the CPU, cache, and memory. The relation between these is depicted in Figure 1.1.

Figure 1.1 Processor, cache, and memory in a modern machine.

The CPU fetches instructions from memory and decodes and executes them. It contains a control unit, arithmetic/logic unit (ALU), and registers. The control unit issues signals that control the actions of the ALU, memory system, and external devices. The ALU implements the arithmetic and logical instructions defined by the processor's instruction set. The registers contain instructions, data, and the machine state (including the program counter).

A cache is a small, fast memory unit that is used to speed up program execution. It contains the contents of those memory locations that have recently been used by the processor. The rationale for cache memory is that most programs exhibit temporal locality, meaning that once they reference a memory location they are likely to do so again in the near future. For example, instructions inside loops are fetched and executed many times.

When a program attempts to read a memory location, the processor first looks in the cache. If the data is there (a cache hit), then it is read from the cache. If the data is not there (a cache miss), then the data is read from primary memory into both the processor and cache. Similarly, when data is written by a program, it is placed in the local cache and in the primary memory (or possibly just in the primary memory). In a write-through cache, data is placed in memory immediately; in a write-back cache, data is placed in memory later. The key point is that after a write, the contents of the primary memory might temporarily be inconsistent with the contents of the cache.

To increase the transfer rate (bandwidth) between caches and primary memory, an entry in a cache typically contains multiple, contiguous words of memory. These entries are called cache lines or blocks. Whenever there is a cache miss, the entire contents of a cache line are transferred between the memory and cache. This is effective because most programs exhibit spatial locality, meaning that when they reference one memory word they soon reference nearby memory words.

Modern processors typically contain two levels of cache. The level 1 cache is close to the processor; the level 2 cache is between the level 1 cache and the primary memory. The level 1 cache is smaller and faster than the level 2 cache, and it is often organized differently. For example, the level 1 cache is often direct mapped and the level 2 cache is often set associative. (In a direct-mapped cache, each memory address maps into one cache entry. In a set-associative cache, each memory address maps into a set of cache entries; the set size is typically two or four. Thus, if two memory addresses map into the same location, only the most recently referenced location can be in a direct-mapped cache, whereas both locations can be in a set-associative cache. On the other hand, a direct-mapped cache is faster because it is easier to decide if a word is in the cache.) Moreover, the level 1 cache often contains separate caches for instructions and data, whereas the level 2 cache is commonly unified, meaning that it contains both instructions and data.

To illustrate the speed differences between the levels of the memory hierarchy, registers can be accessed in one clock cycle because they are small and internal to the CPU. The contents of the level 1 cache can typically be accessed in one or two clock cycles. In contrast, it takes on the order of 20 clock cycles to access the level 2 cache and perhaps 50 to 100 clock cycles to access primary memory. There are similar size differences in the memory hierarchy: a CPU has a few dozen registers, a level 1 cache contains a few dozen kilobytes, a level 2 cache contains on the order of one megabyte, and primary memory contains tens to hundreds of megabytes.
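As a concrete illustration of spatial locality and cache lines, the following C fragment (an addition for illustration, not part of the original text; the array size is arbitrary) sums the same two-dimensional array in two different orders. The row-order loop visits consecutive memory words, so every word of each cache line that is fetched gets used; the column-order loop touches one word of a line and then jumps a whole row ahead, so on machines like those described above it typically runs several times slower even though it performs the same arithmetic.

    #include <stddef.h>

    #define N 1024
    static double a[N][N];   /* C stores each row contiguously (row-major order) */

    /* Good spatial locality: consecutive iterations read adjacent words,
       so each cache line brought into the cache is used completely. */
    double sum_by_rows(void) {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Poor spatial locality: consecutive iterations are N*sizeof(double)
       bytes apart, so nearly every access touches a different cache line. */
    double sum_by_columns(void) {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }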
1.2.2 Shared-Memory Multiprocessors In a shared-rnenzory m~tltiprocessor,the processors and memory modules are connected by means of an intercorznecziorz ~zehuork,as shown in Figure 1.2. The processors share the primary memory, but each processor has its own cache memory. On a small ~nultiprocessorwith from two to 30 or so processors, the inrerconnection network is implemented by a memory bus or possibly a crossbar switch. Such a multiprocessor is called a UMA 111achine because there is a uniform memory access time between eveiy processor and every memoly location. UMA machines are also called synzrtzetric multiprocessors (SMPs). Large shared-memory multiprocessors-those having tens or l~undredsof processors-have a hierarchically organized memory. In particular, the interconnection network is a a-ee-structured collection of switches and memoiies. Consequently, each processor has some memory that is close by and other memory that is farther away. This organization avoids the congestion that would occur if there were only a single shared bus or switch. Because it also leads to nonuniform mernory access times, such a ~nultiprocessoris called a NUMA machine. On both UMA and NUMA machines, each processor has its own cache. IF two processors reference different memory locations, the coctents of these locations can safely be in the caches of the two processors. However, problems can arise when two processors reference the same memory location at about the
I
I
Interconnection network
Figure 1.2
Structure of Shared-Memory Multiprocessors.
1.2 Hardware Architectures
7
same time. If both p~.ocessorsjusc read the same location, each can load a copy of the data into its local cache. But if one processor writes into the location, there is a coche corzsistencj~problem: The cache of the other processor no longer contains the correct value. Hence, either the other cache needs to be updated with the new value or the cache entry needs to be invalidated. Every rnultiproce~ssor has to implement a cache consistenc)~protocol in hardware. One inethod is lo have each cache "snoop" on the memory address bus, looking for references to locations in its cache. Writing also Jeads to a ~nenzaryconsisfe~zcyproblem: When is primary memory aclually updated? For example, if one processor writes into a memory location and anott~ei-processor later reads from that memory location, is the second processor guaranteed to see the new value? There are several different memory consistency models. Seg~tenfialconsiste~zcyis the strongest model; it guarantees that memoly updates will appear to occur in some sequential order, and that every processor will see the same order. Pi.ocessor corr.sistel7cy is a weaker model; it ensures that tlne writes by each processor wjll occur in inernory in the order they are issued by the processor, but writes issued by different p1.0cessors might be seein in different orders by other processors. Relecr.se co~zsistency is an even weaker model; it ensures only that the primary memory is updated at programmer-specified synchronizatjon points. The memory consistency problem presents tradeoffs between ease of programming and implen~entationoverhead. The programmer intuitively expects sequentia1 consistency, because a program reads and writes variables without regard to where the values are actually stored inside the machine. In particular, when a process assigns to a variable, the pr0gramme.r expects the result of that assignment to be seen immediately by every other process in the program. On the other hand, sequential consistency is costly to.implement and it makes a machine slower. This is because on each write, the'hardware has to invalidate or update other caches and to update primary memory; moi.eover, these actions have to be atoinic (indivisible). Consequently, multiprocessors typically implement a weaker ~nernoryconsistency model and leave to prograrnrners rhe task of inserting memory synchronization instructions. Compilers and Iib1,aries often take care of this so the application progrmlner does not have to do so. As noted, cache lines often contain multiple words that are transferred to and from lnernory as a unit. Suppose variables .K and y occupy one word each and that they are stored jn adjacent memory words that (nap into the same cache line. Suppose some process is executing on processor I of a nlultiprocessor and reads and writes va~iablex. Finally, suppose another process is executing on processor 2 and reads and writes variable y. Tllen whenever processor I accesses variable s,the caclne line on that processor will also contain a copy of variable y; a similar process will occur with processor 2.
8
Chapter I
The Concurrent Computing Landscape The above scenario causes what is called fcrlse sharing: The processes do not actually share variables x and y , but the cache hardware treats the two variables as a unit. Consequently, when processor 1 updates x . the caclie line containing both x and y in processor 2 has to be invalidated or updated. Similarly, when processor 2 updates y, the cache line containing .x and y in processor I lias to be invalidated or updated. Cache invalidations or updates slow down the memory system, so the program will run much more slowly than would be the case if the two variables were in separate cache lines. The key point is at the programmer would correctly think that the processes do not share variables when in fact the memory system has introduced sharing and overhead to deal with it. To avoid false sharing, the programmer has to ensure that variables that are written by di.Kerent processes are not in adjacent memory locations. One way to ensure this is to use padcling-that is to cleclare dummy variables that take up space and separate actual variabIes from each other. Th1.s js an instance of a time/space tradeof€: waste some space in order to reduce execution tirne. Tu summarize, rnultjprocessors employ caches to improve performance. However, the presence of a memory hierarchy introduces cache and memory consistency problems and the possibility of false sharing. Consequently, to get maximal performance on a given machine, the programmes has to know about the cl~al-acteristicsof the memory system and has to write programs to account for them. These issues will be revisited at relevant points in the remainder of the text.
1.2.3 Distributed-Memory Multicomputers and Networks In a clistribi.itecl-~rte~~zor~ nzultipmcessor, there is again an interconnection network, but each processor has its own private memory. As shown in Figure 1.3, the location of the interconnection network and memory mocJ.u-les are reversed relarive to their location in a shared-memory rnultiprocesso~~. The interconnection
/
Interconnection network
Cache
Figure 1.3
1
Cache
Structure of distributed-memory machines.
1.2 Hardware Architectures
9
network supports nzessuge ynssirzg rather than memory reading and writing. Hence, the processors cornrnunicale with each other by sending and receiving messages. Each processor has its own cache, but because memory is not shared there are no cache or memory consistency problems. A nzznlricor.rtputer-is a distributed-memory ~nultiprocessorin which the processors and network are physically close to each other (in the same cabinet). For this reason, a multicomputer is often called a fightill coupbed mnchirze. A multicomputer is used by one or at most a few applications at a time; each application uses a dedicated set of processors. The interconnection network provides a highspeed and high-bandwidth communication path between the processors. It is typically organized as a mesh or hypercube. (Hypercube machines were among the earliest examples of multicomputers.) A rtenv0r.k .system is a distributed-memory tnultiprocessor in which nodes are connected by a local area communicatio~inetwork such as an Ethernet or by a long-haul network such as the Internet. Consequently, network systems are called loosely cotiplccl multiprocessors. Again processors communicate by sending messages, but these take longer to deliver tkan in a rnulticornputer and there is Inore network contention. On rhe other hand, a network system is built out of co~nmodityworkstatiol~sand networks whereas a mulricoinputer often has custom components, especially for the interconnection network. A network system that consists of a collection of workstations is often called a nehvu~k,of wor+kssarions(NOW) or a clwter of ~vorksrations(COW). These workstarions execute a single application, separate applications, or a mixture of the two. A currently popular way to build an inexpensive distributedmemory multiprocessor is to construct whac is called a Beovvulf rnnchine. A Beowulf machine is built out of basic hardware and free software. such as Pentium processor chips, network interface cards, disks, and the Linux operating system. (The name Beowulf refers to an Old Engljsh epic poem, the first masterpiece of English literature.) Hybrid combinations of distributed- and shared-memory rnultjprocessors also exist. The nodes of the distributed-memory machine might be sbaredmemory multiprocessors rather than single processors. Or the interconnection network inight support both message passing and rnechanis~nsfor direct access to remote memory (this is done in today's most powerful machines). The most general combination is a machine that supports what is called distrib~itedslzured rnernory-namely, a distribured implementation of the shared-memory abstraction. This makes the machines easier to program for many applications, but again raises the issues of cache and memory consisrency. (Chapter 10 describes distiib~~ted shared memory and how to implement it entirely in software.)
10
Chapter 1
The Concurrent Computing Landscape
1.3 Applications and Programming Styles Concurrent program.ming provides a way to organize software that contains relatively independent parts. It also provides a way to make use of lnultiple processors. There are three broad, overlapping classes of applications-multithreaded systems, distributed systems, and parallel computations-and three corresponding kinds of concursent programs. Recall that a process is a sequential program that, when executed, has its own thread of control. Every concurrent program contains multiple processes, so every concul-rent program has multiple theads. However, the term multithreaded usually means that a program contains more processes (threads) than there are processors to execute the threads. Consequently, the processes take turns executing on the processors. A mul~ithreadedsoftware system manages multiple independent activities such as the following: window systems on personal computers or workstations; time-shared and multiprocessor operating systems; and real-time systems that control power plants, spacecraft, and so on. These systems axe written as rnultithreaded programs because it is much easier to organize the code and data structures as a collection of processes than as a single, huge sequential program. In addition, each process can be scheduled and executed independently. For example, when the user clicks the mouse on a persona! computer, a signal is sent to the process that manages the window into which the mouse cursently points. That process (thead) can now execute and respond to rhe mouse click. In addition, applications in other windows can continue to execute in the background. The second broad class of applications is distributed computing, in which components execute on machines connected by a local or global communicaeion network. Hence the processes communicate by exchanging messages. Examples are as follows: file servers in a network; database systems for banlung, airline reservations, and so on; Web servers on the Internet; enterprise systems that integrate components of a business; and fault-tolerant systems that keep executing despite component failures. Distribu.ted systems are written to off-load processing (as in file servers), to provide access to remote data (as in databases and the Web), to integrate and manage
1.3 Applications and Programming Styles
11
data that is inherently distributed (as in enterprise systems), or ro increase reliability (as in fault-tolerant systems). many distributed systems are orgallized as client/servel-systems. For example, a file server provides data files for processes executing on client machines. The components i n a distributed system are often themselves multitheaded programs. Parallel cornpuling is the third broad class of appljcations. The goal is to solve a given problem faster or equivalently to solve a larger problem in the same amount of time. Examples of parallel computations are as follows: scientific computations that model and simulate phenomena such as the g1obaI climate, evolution of the solar system, or effects of a new drug; graphics and image processing, including the creation of special effec~sin movies; and large combinatorial or optimization problems, such as scheduling an airline business or modeling the economy. These k.inds of computations are large and compute-intensive. They are executed on parallel processors to achieve high performance, and there usually are as many processes (threads) as processors. Pal.allel computations are wi-iwen as dcim parallel progr~wzs-in which each process does the same thing on its part of the data-or as tu-sk parallel y7-ogruins in which different processes carry oul: different tasks. We examine these as well as other applications in this book. Most importantly, we show how to program them. The processes (threads) in a ~nultithreaded program jnteract using shared variables. The processes i n a distributed system cornmunicate by exchanging messages or by invoking remote operations. The processes in a parallet computation interact using either shared variables or message passing, depending on the hardware on which the program will execute. Parr 1 in this book shows how to write programs that use shared variables for communication and syncl~ronizatiom.Part 2 covers message passing and remote operations. Part 3 exa~njnesparallel programming in detail, with an emphasis on scientific computing. Although there are Inany concurrent programming applications, they employ only a small number of solution patterns, or paradigms. In particular, there are five basic paradigms: (1) iterative parallelisln, (2) recursive parallelism, (3) producers and consumers (pipelines), (4) clients and servers, and (5) interacting peers. Applications are programmed using one or more of these. Iterarive parallelism is used when a program has several, often identical processes, each of which contains one or more loops. Hence, each process is an iterative program. The processes in the program work together to solve a single problem; they communicate and synchronize using shared variables or message
12
Chapter 1
The Concurrent Computing Landscape
passing. Iterative parallelism occurs ]nost freqi~enclyin scientific computations that execute on multiple processors. Recltrsive par.allelism can be used when a program has one or more recursive procedures and the procedure calls are independent, meaning that each worlcs on diffexent parts of the shared data. Recursion is often used in imperative languages, especially to implement divide-and-conquer or backtracking algorithms. Recursion is also the fundamental programming paradigm in symbolic, logic, and functional languages. Recursive parallelism is used to solve many co~nbinatorialproblems, such as sorting, scheduling (e.g., traveling salesperson), and game playing (e.g., chess). Prod~~cers and corzsl.rmers are communicating processes. They are often organized inro a pipeline through which inforination flows. Each process in a pipeline is a.flcer that consumes the output of its predecessor and produces ourput for its successor. Filters occur at the appljcation (shell) level in operating systems such as Unjx, within operating systelns themselves, and within application programs whenever one process produces output that is consumed (read) by another. Clierzls and senjers are the dominant interaction pattern in distributed systems, from local networks to the World Wide Web. A client process requests a service and waits to receive a reply. A server waits for requests from clients then acts upon them. A server can be implemented by a single process that handles one client request at a time, or it can be multithreaded to service client requests concurrently. Clients and servers are the concurrent programming generalization of procedures and procedure calls: a server is like a procedure, and clients call the server. However, when the client code and server code reside on different maclnines, a conventional procedure call-jump to subroutine and later return from subroutine-cannot be used. Instead one has to use a remote procedure call or a rerzdezvous, as described in Chapter 8. Interacting peers is the final interaction paradigm. It occurs in distributed programs when there are several processes that execute basically the same code and that exchange messages to accomplish a task. Interacting peers are used to implement distributed parallel programs, especially those with iterative parallelism. They are also used to implement decentralized decision making in distributed systems. We describe several applications and communication patterns in Chapter 9. The next five sections give examples that illustrate the use of each of these patterns. The examples also introduce the programming notation that we will be using throughout the text. (The notation is sumn~arizedin Section 1.9.) Many more examples are described in later chapters-either in the text or in the exercises.
1.4 Iterative Parallelism: Matrix Multiplication
13
1.4 Iterative Parallelism: Matrix Multiplication

An iterative sequential program is one that uses for or while loops to examine data and compute results. An iterative parallel program contains two or more iterative processes. Each process computes results for a subset of the data, then the results are combined.

As a simple example, consider the following problem from scientific computing. Given matrices a and b, assume that each matrix has n rows and columns, and that each has been initialized. The goal is to compute the matrix product of a and b, storing the result in the n x n matrix c. This requires computing n^2 inner products, one for each pair of rows and columns.

The matrices are shared variables, which we will declare as follows:

    double a[n,n], b[n,n], c[n,n];
Assuming n has already been declared and initialized, this allocates storage for three arrays of double-precision floating point numbers. By default, the indices for the rows and columns range from o to n- 1. After initializing a and b, we can compute the matrix product by the following sequential program: for [i = 0 to n-11 { f o r [j = 0 to n-11 ( # compute inner product of a [i,* ] and b [ * , j] c [ i , j ] = 0.0;
for [k = 0 to n-11 c [ i , j ] = c[i, j l c a [ i , k l * b l k , j l ;
1
I
Above, the outer loops (with indices i and j) iterate over each row and column. The inner loop (with index k) computes the inner product of row i of matrix a and column j of matrix b, and then stores the result in c[i,j]. The line that begins with a sharp character is a comment.

Matrix multiplication is an example of what is called an embarrassingly parallel application, because there are a multitude of operations that can be executed in parallel. Two operations can be executed in parallel if they are independent. Assume that the read set of an operation contains the variables it reads but does not alter and that the write set of an operation contains the variables that it alters (and possibly also reads). Two operations are independent if the write set of each is disjoint from both the read and write sets of the other. Informally, it is always safe for two processes to read variables that do not change. However, it is generally not safe for two processes to write into the same variable, or for one
process to read a variable that the other writes. (We will examine this topic in detail in Chapter 2.)

For matrix multiplication, the computations of inner products are independent operations. In particular, lines 4 through 6 in the above program initialize an element of c and then compute the value of that element. The innermost loop in the program reads a row of a and a column of b, and reads and writes one element of c. Hence, the read set for an inner product is a row of a and a column of b, and the write set is an element of c.

Since the write sets for pairs of inner products are disjoint, we could compute all of them in parallel. Alternatively, we could compute rows of results in parallel, columns of results in parallel, or blocks of rows or columns in parallel. Below we show how to program these parallel computations.

First consider computing rows of c in parallel. This can be programmed as follows using the co (concurrent) statement:

    co [i = 0 to n-1] {    # compute rows in parallel
        for [j = 0 to n-1] {
            c[i,j] = 0.0;
            for [k = 0 to n-1]
                c[i,j] = c[i,j] + a[i,k]*b[k,j];
        }
    }
The only syntactic difference between this program and the sequential program is that co is used in place of for in the outermost loop. However, there is an important semantic difference: The co statement specifies that its body should be executed concurrently, at least conceptually if not actually, depending on the number of processors, for each value of index variable i.

A different way to parallelize matrix multiplication is to compute the columns of c in parallel. This can be programmed as

    co [j = 0 to n-1] {    # compute columns in parallel
        for [i = 0 to n-1] {
            c[i,j] = 0.0;
            for [k = 0 to n-1]
                c[i,j] = c[i,j] + a[i,k]*b[k,j];
        }
    }

Here the outer two loops have been interchanged, meaning that the loop on i in the previous program has been interchanged with the loop on j. It is safe to interchange two loops as long as the bodies are independent and hence compute the same results, as they do here. (We examine this and related kinds of program transformations in Chapter 12.)
We can also compute all inner products in parallel. This can be programmed in several ways. First, we could use a single co statement with two indices:

    co [i = 0 to n-1, j = 0 to n-1] {    # all rows and
        c[i,j] = 0.0;                    # all columns
        for [k = 0 to n-1]
            c[i,j] = c[i,j] + a[i,k]*b[k,j];
    }
The body of the above co statement is executed concurrently for each combination of values of i and j; hence the program specifies n^2 processes. (Again, whether or not the processes execute in parallel depends on the underlying implementation.) A second way to compute all inner products in parallel is to use nested co statements:

    co [i = 0 to n-1] {        # rows in parallel then
        co [j = 0 to n-1] {    # columns in parallel
            c[i,j] = 0.0;
            for [k = 0 to n-1]
                c[i,j] = c[i,j] + a[i,k]*b[k,j];
        }
    }
This specifies one process for each row (outer co statement) and then one process for each column (inner co statement). A third way to write the program would be to interchange the first two lines of the last program. The effect of all three programs is the same: Execute the inner loop for all n^2 combinations of i and j. The difference between the three programs is how the processes are specified, and hence when they are created.

Notice that in all the concurrent programs above, all we have done is to replace instances of for by co. However, we have done so only for index variables i and j. What about the innermost loop on index variable k? Can this for statement also be replaced by a co statement? The answer is "no," because the body of the inner loop both reads and writes variable c[i,j]. It is possible to compute an inner product (the for loop on k) using binary parallelism, but this is impractical on most machines (see the exercises at the end of this chapter).

Another way to specify any of the above parallel computations is to use a process declaration instead of a co statement. In essence, a process is a co statement that is executed "in the background." For example, the first concurrent program above, the one that computes rows of results in parallel, could be specified by the following program:
    process row[i = 0 to n-1] {    # rows in parallel
        for [j = 0 to n-1] {
            c[i,j] = 0.0;
            for [k = 0 to n-1]
                c[i,j] = c[i,j] + a[i,k]*b[k,j];
        }
    }
This declares an array of processes, row[1], row[2], etc., one for each value of index variable i. The n processes are created and start executing when this declaration is encountered. If there are statements following the process declaration, they are executed concurrently with the processes, whereas any statements following a co statement are not executed until the co statement terminates. Process declarations cannot be nested within other declarations or statements, whereas co statements may be nested. (Process declarations and co statements are defined in detail in Section 1.9.)

The above programs employ a process per element, row, or column of the result matrix. Suppose that there are fewer than n processors, which will usually be the case, especially if n is large. There is still a natural way to make full use of all the processors: Divide the result matrix into strips of rows or columns and use one worker process per strip. In particular, each worker computes the results for the elements in its strip. For example, suppose there are P processors and that n is a multiple of P (namely, P evenly divides n). Then if we use strips of rows, the worker processes can be programmed as follows:

    process worker[w = 1 to P] {       # strips in parallel
        int first = (w-1) * n/P;       # first row of strip
        int last = first + n/P - 1;    # last row of strip
        for [i = first to last] {
            for [j = 0 to n-1] {
                c[i,j] = 0.0;
                for [k = 0 to n-1]
                    c[i,j] = c[i,j] + a[i,k]*b[k,j];
            }
        }
    }
The difference between this program and the previous one is that the n rows have been divided into P strips of n/P rows each. Hence, the extra lines in the above program are the bookkeeping needed to determine the first and last rows of each strip and a loop (on i) to compute the inner products for those rows.

To summarize, the essential requirement for being able to parallelize a program is to have independent computations, namely, computations that have
disjoint write sets. For matrix multiplication, the inner products are independent computations, because each writes (and reads) a different element c[i,j] of the result matrix. We can thus compute all inner products in parallel, all rows in parallel, all columns in parallel, or strips of rows in parallel. Finally, the parallel programs can be written using co statements or process declarations.
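For readers who want to run the strip-based program on real hardware, it maps almost directly onto C with the Pthreads library (used later in this book). The following is only a sketch under assumptions made here, not in the text: the fixed matrix size N, the worker count P, and the all-ones initialization are illustrative choices.

    /* A rough C/Pthreads rendering of the strip-based worker program. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    #define P 4                         /* assume P evenly divides N */

    double a[N][N], b[N][N], c[N][N];

    void *worker(void *arg) {
        long w = (long)arg;             /* worker index 0..P-1 */
        int first = w * (N/P);          /* first row of this strip */
        int last = first + N/P - 1;     /* last row of this strip */
        for (int i = first; i <= last; i++)
            for (int j = 0; j < N; j++) {
                c[i][j] = 0.0;
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
        return NULL;
    }

    int main(void) {
        pthread_t tid[P];
        for (int i = 0; i < N; i++)     /* initialize a and b to all ones */
            for (int j = 0; j < N; j++) {
                a[i][j] = 1.0; b[i][j] = 1.0;
            }
        for (long w = 0; w < P; w++)    /* create one worker per strip */
            pthread_create(&tid[w], NULL, worker, (void *)w);
        for (int w = 0; w < P; w++)     /* wait for all strips */
            pthread_join(tid[w], NULL);
        printf("c[0][0] = %g\n", c[0][0]);   /* prints N for all-ones input */
        return 0;
    }

Compiled with cc -pthread, the program should print N, since every inner product of two all-ones vectors of length N equals N. The join loop plays the role of waiting for a co statement to terminate.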
1.5 Recursive Parallelism: Adaptive Quadrature

A recursive program is one that contains procedures that call themselves, either directly or indirectly. Recursion is the dual of iteration, in the sense that iterative programs can be converted to recursive programs and vice versa. However, each programming style has its place, as some problems are naturally iterative and some are naturally recursive.

Many recursive procedures call themselves more than once in the body of a procedure. For example, quicksort is a common sorting algorithm. Quicksort partitions an array into two parts and then calls itself twice: first to sort the left partition and then to sort the right partition. Many algorithms on trees or graphs have a similar structure.

A recursive program can be implemented using concurrency whenever it has multiple, independent recursive calls. Two calls of a procedure (or function) are independent if their write sets are disjoint. This will be the case if (1) the procedure does not reference global variables or only reads them, and (2) reference and result arguments, if any, are distinct variables. For example, if a procedure does not reference global variables and has only value parameters, then every call of the procedure will be independent. (It is fine for a procedure to read and write local variables, because each instance of the procedure will have its own private copy of local variables.) The quicksort algorithm can be programmed to meet these requirements. Another interesting example follows.

The quadrature problem addresses the issue of approximating the integral of a continuous function. Suppose f(x) is such a function. As illustrated in Figure 1.4, the integral of f(x) from a to b is the area between f(x) and the x axis from x equals a to x equals b.
Figure 1.4   The quadrature problem.
There are two basic ways to approximate the value of an integral. One is to divide the interval from a to b into a fixed number of subintervals, and then to approximate the area of each subinterval by using something like the trapezoidal rule or Simpson's rule.

    double fleft = f(a), fright, area = 0.0;
    double width = (b-a) / INTERVALS;
    for [x = (a + width) to b by width] {
        fright = f(x);
        area = area + (fleft + fright) * width / 2;
        fleft = fright;
    }
Each iteration uses the trapezoidal rule to compute a small area and adds it to the total area. Variable width is the width of each trapezoid. The intervals move from left to right, so on each iteration the right-hand value becomes the new left-hand value.

The second way to approximate an integral is to use the divide-and-conquer paradigm and a variable number of subintervals. In particular, first compute the midpoint m between a and b. Then approximate the area of three regions under the curve defined by function f(): the one from a to m, the one from m to b, and the one from a to b. If the sum of the two smaller areas is within some acceptable tolerance EPSILON of the larger area, then the approximation is considered good enough. If not, the larger problem, from a to b, is divided into two subproblems, from a to m and from m to b, and the process is repeated. This approach is called adaptive quadrature, because the algorithm adapts to the shape of the curve. It can be programmed as follows:

    double quad(double left, right, fleft, fright, lrarea) {
        double mid = (left + right) / 2;
        double fmid = f(mid);
        double larea = (fleft+fmid) * (mid-left) / 2;
        double rarea = (fmid+fright) * (right-mid) / 2;
        if (abs((larea+rarea) - lrarea) > EPSILON) {
            # recurse to integrate both halves
            larea = quad(left, mid, fleft, fmid, larea);
            rarea = quad(mid, right, fmid, fright, rarea);
        }
        return (larea + rarea);
    }
The integral of f(x) from a to b is approximated by calling

    area = quad(a, b, f(a), f(b), (f(a)+f(b))*(b-a)/2);
The function again uses the trapezoidal rule. The values of f() at the endpoints of an interval and the approximate area of that interval are passed to each call of quad to avoid computing them more than once.

The iterative program cannot be parallelized because the loop body both reads and writes the value of area. However, the recursive program has independent calls of quad, assuming that function f() has no side effects. In particular, the arguments to quad are passed by value and the body does not assign to any global variables. Thus, we can use a co statement as follows to specify that the recursive calls should be executed in parallel:

    co larea = quad(left, mid, fleft, fmid, larea);
    // rarea = quad(mid, right, fmid, fright, rarea);
    oc
This is the only change we need to make to the recursive program. Because a co statement does not terminate until both calls have completed, the values of larea and rarea will have been computed before quad returns their sum.

The co statements in the matrix multiplication programs contain lists of statements that are executed for each value of the quantifier variable (i or j). The above co statement contains two function calls; these are separated by //. The first form of co is used to express iterative parallelism; the second form is used to express recursive parallelism.

To summarize, a program with multiple recursive calls can readily be turned into a parallel recursive program whenever the calls are independent. However, there is a practical problem: there may be too much concurrency. Each co statement above creates two processes, one for each function call. If the depth of recursion is large, this will lead to a large number of processes, perhaps too many to be executed in parallel. A solution to this problem is to prune the recursion tree when there are enough processes, namely, to switch from using concurrent recursive calls to sequential recursive calls. This topic is explored in the exercises and later in this book.
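As one concrete illustration of pruning, here is a small C/Pthreads sketch of adaptive quadrature that recurses in parallel only up to a fixed depth and then falls back to sequential recursion. The integrand f, the tolerance EPSILON, the cutoff MAXDEPTH, and the integration bounds are assumptions made for this sketch, not values taken from the text.

    /* Adaptive quadrature with pruned parallel recursion.
     * Compile with: cc -pthread quad.c -lm */
    #include <pthread.h>
    #include <math.h>
    #include <stdio.h>

    #define EPSILON 1e-6
    #define MAXDEPTH 3          /* beyond this depth, recurse sequentially */

    static double f(double x) { return sin(x) * exp(x); }

    struct task { double l, r, fl, fr, area, result; int depth; };

    static double quad(double l, double r, double fl, double fr,
                       double area, int depth);

    static void *quad_thread(void *arg) {   /* wrapper run in a new thread */
        struct task *t = arg;
        t->result = quad(t->l, t->r, t->fl, t->fr, t->area, t->depth);
        return NULL;
    }

    static double quad(double l, double r, double fl, double fr,
                       double area, int depth) {
        double m = (l + r) / 2, fm = f(m);
        double larea = (fl + fm) * (m - l) / 2;
        double rarea = (fm + fr) * (r - m) / 2;
        if (fabs((larea + rarea) - area) <= EPSILON)
            return larea + rarea;           /* good enough: stop refining */
        if (depth >= MAXDEPTH) {            /* prune: sequential recursion */
            larea = quad(l, m, fl, fm, larea, depth + 1);
            rarea = quad(m, r, fm, fr, rarea, depth + 1);
        } else {                            /* parallel recursion */
            pthread_t tid;
            struct task left = { l, m, fl, fm, larea, 0.0, depth + 1 };
            pthread_create(&tid, NULL, quad_thread, &left);
            rarea = quad(m, r, fm, fr, rarea, depth + 1);
            pthread_join(tid, NULL);
            larea = left.result;
        }
        return larea + rarea;
    }

    int main(void) {
        double a = 0.0, b = 1.0;
        double area = quad(a, b, f(a), f(b), (f(a) + f(b)) * (b - a) / 2, 0);
        printf("integral = %.8f\n", area);
        return 0;
    }

Using a depth bound rather than a shared process counter is just one pruning policy; it has the advantage of needing no synchronization on a global variable, but either approach serves to limit the amount of concurrency.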
1.6 Producers and Consumers: Unix Pipes

A producer process computes and outputs a stream of results. A consumer process inputs and analyzes a stream of values. Many programs are producers and/or consumers of one form or another. The combination becomes especially
interesting when producers and consumers are connected by a pipeline: a sequence of processes in which each consumes the output of its predecessor and produces output for its successor. A classic example is Unix pipelines, which we examine here. Other examples are given in later chapters.

A Unix application process typically reads from what is called its standard input file, stdin, and writes to what is called its standard output file, stdout. Usually, stdin is the keyboard of the terminal from which an application is invoked and stdout is the display of that terminal. However, one of the powerful features introduced by Unix was the fact that the standard input and output "devices" can also be bound to different kinds of files. In particular, stdin and/or stdout can be bound to a data file or to a special kind of "file" called a pipe.

A pipe is a buffer (FIFO queue) between a producer and a consumer process. It contains a bounded sequence of characters. New values are appended when a producer process writes to the pipe. Values are removed when a consumer process reads from the pipe.

A Unix application program merely reads from stdin, without concern for where the input is actually from. If stdin is bound to a keyboard, input is the characters typed at the keyboard. If stdin is bound to a file, input is the sequence of characters in the file. If stdin is bound to a pipe, input is the sequence of characters written to that pipe. Similarly, an application merely writes to stdout, without concern for where the output actually goes.

Pipelines in Unix are typically specified using one of the Unix command languages, such as csh (C shell). As a specific example, the printed pages for this book were produced by a csh command similar to the following:

    sed -f Script $* | tbl | eqn | groff Macros -

This pipeline contains four commands: (1) sed, a stream editor; (2) tbl, a table processor; (3) eqn, an equation processor; and (4) groff, a program that produces Postscript output from troff source files. A vertical bar, the C shell character for a pipe, separates each pair of commands.
Figure 1.5   A pipeline of processes.
Figure 1.5 illustrates the structure of the above pipeline. Each of the commands is a filter process. The input to the sed filter is a file of editing commands (Script) and the command-line arguments ($*), which for this book are the appropriate source files for the text. The output from sed is passed to tbl, which passes its output to eqn, which passes its output to groff. The groff filter reads a file of Macros for the book and then reads and processes its standard input. The groff filter sends its output to the printer in the author's office.

Each pipe in Figure 1.5 is implemented by a bounded buffer: a synchronized, FIFO queue of values. A producer process waits (if necessary) until there is room in the buffer, then appends a new line to the end of the buffer. A consumer process waits (if necessary) until the buffer contains a line, then removes the first one. In Part 1 we show how to implement bounded buffers using shared variables and various synchronization primitives (flags, semaphores, and monitors). In Part 2 we introduce communication channels and the message passing primitives send and receive. We then show how to program filters using them and how to implement channels and message passing using buffers.
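To make the producer/consumer relationship concrete at the system-call level, the following C sketch connects two processes with a Unix pipe: the parent writes a few lines and the child reads them. The particular messages and buffer size are illustrative assumptions; pipe, fork, read, and write are the standard Unix primitives.

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void) {
        int fd[2];
        if (pipe(fd) == -1) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {                      /* child: the consumer */
            close(fd[1]);                    /* close unused write end */
            char buf[64];
            ssize_t n;
            while ((n = read(fd[0], buf, sizeof buf - 1)) > 0) {
                buf[n] = '\0';
                printf("consumer got: %s", buf);
            }
            close(fd[0]);
            return 0;
        }
        /* parent: the producer */
        close(fd[0]);                        /* close unused read end */
        for (int i = 1; i <= 3; i++) {
            char line[32];
            int len = snprintf(line, sizeof line, "line %d\n", i);
            write(fd[1], line, len);
        }
        close(fd[1]);                        /* signals end of stream */
        wait(NULL);
        return 0;
    }

Because the pipe is a bounded buffer, a producer that gets far ahead of the consumer eventually blocks in write until the consumer catches up, which is exactly the synchronization described above.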
1.7 Clients and Servers: File Systems

Between a producer and a consumer, there is a one-way flow of information. This kind of interprocess relationship often occurs in concurrent programs, but it has no analog in sequential programs. This is because there is a single thread of control in a sequential program, whereas producers and consumers are independent processes with their own threads of control and their own rates of progress.

The client/server relationship is another common pattern in concurrent programs; indeed, it is the most common pattern. A client process requests a service, then waits for the request to be handled. A server process repeatedly waits for a request, handles it, then sends a reply. As illustrated in Figure 1.6, there is a two-way flow of information: from the client to the server and then back. The relationship between a client and a server is the concurrent programming analog of the relationship between the caller of a subroutine and the subroutine itself. Moreover, like a subroutine that can be called from many places, a server typically has many clients.
Figure 1.6   Clients and servers.
Each client request has to be handled as an independent unit, but multiple requests might be handled concurrently, just as multiple calls of the same procedure might be active at the same time.

Client/server interactions occur within operating systems, object-oriented systems, networks, databases, and many other programs. A common example in all these applications is reading and writing a data file. In particular, assume there is a file server module that provides two operations on a file: read and write. When a client process wants to access the file, it calls the read or write operation in the appropriate file server module.

On a single-processor or other shared-memory system, a file server would typically be implemented by a collection of subroutines (for read, write, etc.) and data structures that represent files (e.g., file descriptors). Hence, the interaction between a client process and a file would typically be implemented by subroutine calls. However, if a file is shared, it is probably important that it be written to by at most one client process at a time. On the other hand, a shared file can safely be read concurrently by multiple clients. This kind of problem is an instance of what is called the readers/writers problem, a classic concurrent programming problem that is defined and solved in Chapter 4 and revisited in later chapters.

In a distributed system, clients and servers typically reside on different machines. For example, consider a query on the World Wide Web, such as the query that arises when a user opens a new URL within a Web browser. The Web browser is a client process that executes on the user's machine. The URL indirectly specifies another machine on which the Web page resides. The Web page itself is accessed by a server process that executes on the other machine. This server process may already exist or it may be created; in either case it reads the Web page specified by the URL and returns it to the client's machine. In fact, as the URL is being translated, additional server processes could well be visited or created at intermediate machines along the way.

Clients and servers are programmed in one of two ways, depending on whether they execute on the same or on separate machines. In both cases clients are processes. On a shared-memory machine, a server is usually implemented by a collection of subroutines; these subroutines are programmed using mutual exclusion and condition synchronization to protect critical sections and to ensure that the subroutines are executed in appropriate orders. On a distributed-memory or network machine, a server is implemented by one or more processes that usually execute on a different machine than the clients. In both cases, a server is often a multithreaded program, with one thread per client.

Parts 1 and 2 present numerous applications of clients and servers, including file systems, databases, memory allocators, disk scheduling, and two more classic problems: the dining philosophers and the sleeping barber.
Figure 1.7   Matrix multiplication using message passing: (a) coordinator/worker interaction; (b) a circular pipeline.
Part 1 shows how to implement servers as subroutines, using semaphores or monitors for synchronization. Part 2 shows how to implement servers as processes that communicate with clients using message passing, remote procedure call, or rendezvous.
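As a small foretaste of the shared-memory case, the C/Pthreads sketch below implements a trivial "server" as a pair of subroutines protected by a mutual-exclusion lock, with several client threads calling them. The balance variable and the client count are illustrative assumptions; real servers of course provide richer operations, such as read and write on files.

    #include <pthread.h>
    #include <stdio.h>

    /* shared state owned by the "server" module */
    static int balance = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* server operations: each body is a critical section */
    void deposit(int amount) {
        pthread_mutex_lock(&lock);
        balance += amount;
        pthread_mutex_unlock(&lock);
    }

    int read_balance(void) {
        pthread_mutex_lock(&lock);
        int b = balance;
        pthread_mutex_unlock(&lock);
        return b;
    }

    /* each client makes a series of requests (calls) */
    void *client(void *arg) {
        for (int i = 0; i < 1000; i++)
            deposit(1);
        return NULL;
    }

    int main(void) {
        pthread_t tid[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&tid[i], NULL, client, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(tid[i], NULL);
        printf("final balance = %d\n", read_balance());  /* 4000 */
        return 0;
    }

Without the lock the increments could interleave and lose updates; the mutual exclusion around each operation is what makes the subroutine collection behave like a single server handling one request at a time.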
1.8 Peers: Distributed Matrix Multiplication

We earlier showed how to implement parallel matrix multiplication using processes that share variables. Here we present two ways to solve the problem using processes that communicate by means of message passing. (Chapter 9 presents additional, more sophisticated algorithms.) The first program employs a coordinator process and an array of independent worker processes. In the second program, the workers are peer processes that interact by means of a circular pipeline. Figure 1.7 illustrates the structure of these process interaction patterns. They occur frequently in distributed parallel computations, as we shall see in Part 2.

On a distributed-memory machine, each processor can access only its own local memory. This means that a program cannot use global variables; instead every variable has to be local to some process or procedure and can be accessed only by that process or procedure. Consequently, processes have to use message passing to communicate with each other.

Suppose as before that we want to compute the product of matrices a and b, storing the result in matrix c. Assume that a, b, and c are n x n matrices. Also assume for simplicity that there are n processors. We can then use an array of n worker processes, place one worker on each processor, and have each worker compute one row of the result matrix c. A program for the workers follows:

    process worker[i = 0 to n-1] {
        double a[n];       # row i of matrix a
        double b[n,n];     # all of matrix b
        double c[n];       # row i of matrix c
        receive initial values for vector a and matrix b;
        for [j = 0 to n-1] {
            c[j] = 0.0;
            for [k = 0 to n-1]
                c[j] = c[j] + a[k] * b[k,j];
        }
        send result vector c to the coordinator process;
    }

Worker process i computes row i of result matrix c. To do so, it needs row i of source matrix a and all of source matrix b. Each worker first receives these values from a separate coordinator process. The worker then computes its row of results and sends them back to the coordinator. (Alternatively, the source matrices might be produced by a prior computation, and the result matrix might be input to a subsequent computation; this would be an example of a distributed pipeline.)

The coordinator process initiates the computation and gathers and prints the results. In particular, the coordinator first sends each worker the appropriate row of a and all of b. Then the coordinator waits to receive a row of c from every worker. An outline of the coordinator follows:

    process coordinator {
        double a[n,n];     # source matrix a
        double b[n,n];     # source matrix b
        double c[n,n];     # result matrix c
        initialize a and b;
        for [i = 0 to n-1] {
            send row i of a to worker[i];
            send all of b to worker[i];
        }
        for [i = 0 to n-1]
            receive row i of c from worker[i];
        print the results, which are now in matrix c;
    }
The send and receive statements used by the coordinator and workers are message-passing primitives. A send statement packages up a message and transmits it to another process; a receive statement waits for a message from another process, then stores it in local variables. We describe message passing in detail in Chapter 7 and use it to program numerous applications in Parts 2 and 3.
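For comparison with a real message-passing library, here is a rough C/MPI sketch of the same row-wise decomposition. It departs from the text in two ways worth noting: it uses P processes, each computing a strip of rows rather than a single row, and it uses the MPI collectives MPI_Bcast, MPI_Scatter, and MPI_Gather in place of the explicit send and receive loops. The size N and the all-ones initialization are assumptions of this sketch; run with, for example, mpirun -np 4, where the process count divides N.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 8

    int main(int argc, char **argv) {
        int rank, P;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        int rows = N / P;                    /* rows per process */
        static double a[N][N], b[N][N], c[N][N];
        double (*astrip)[N] = malloc(rows * sizeof *astrip);
        double (*cstrip)[N] = malloc(rows * sizeof *cstrip);

        if (rank == 0)                       /* coordinator initializes */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    a[i][j] = 1.0; b[i][j] = 1.0;
                }

        /* every process gets all of b and one strip of a */
        MPI_Bcast(b, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Scatter(a, rows * N, MPI_DOUBLE,
                    astrip, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        for (int i = 0; i < rows; i++)       /* compute the local strip of c */
            for (int j = 0; j < N; j++) {
                cstrip[i][j] = 0.0;
                for (int k = 0; k < N; k++)
                    cstrip[i][j] += astrip[i][k] * b[k][j];
            }

        MPI_Gather(cstrip, rows * N, MPI_DOUBLE,
                   c, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("c[0][0] = %g\n", c[0][0]);   /* prints N */

        free(astrip); free(cstrip);
        MPI_Finalize();
        return 0;
    }

Rank 0 plays the coordinator's role of initializing the data and gathering the results, while every rank, including rank 0, also acts as a worker.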
Suppose as above that each worker has one row of a and is to compute one row of c. However, now assume that each worker has only one column of b at a time instead of the entire matrix. Initially, worker i has column i of matrix b. With just this much of the source data, worker i can compute only the result for c[i,i]. In order for worker i to compute all of row i of matrix c, it has to
acquire all columns of matrix b. It can do so if we use a circular pipeline, as illustrated in Figure 1.7 (b), and circulate the columns among the worker processes. In particular, we have each worker execute a series of rounds; in each round it sends its column of b to the next worker and receives a different column of b from the previous worker. The program follows:

    process worker[i = 0 to n-1] {
        double a[n];          # row i of matrix a
        double b[n];          # one column of matrix b
        double c[n];          # row i of matrix c
        double sum = 0.0;     # storage for inner products
        int nextCol = i;      # next column of results
        receive row i of matrix a and column i of matrix b;
        # compute c[i,i] = a[i,*] x b[*,i]
        for [k = 0 to n-1]
            sum = sum + a[k] * b[k];
        c[nextCol] = sum;
        # circulate columns and compute rest of c[i,*]
        for [j = 1 to n-1] {
            send my column of b to the next worker;
            receive a new column of b from the previous worker;
            sum = 0.0;
            for [k = 0 to n-1]
                sum = sum + a[k] * b[k];
            if (nextCol == 0)
                nextCol = n-1;
            else
                nextCol = nextCol-1;
            c[nextCol] = sum;
        }
        send result vector c to the coordinator process;
    }
Above, the next worker is the one with the next higher index, and the previous worker is the one with the next lower index. (For worker n-1, the next worker is 0; for worker 0, the previous worker is n-1.) The columns of matrix b are passed circularly among the workers so that each worker eventually sees every column. Variable nextCol keeps track of where to place each inner product in vector c. As in the first computation, we assume that a coordinator process sends rows of a and columns of b to the workers and then receives rows of c from the workers.

The second program employs an interprocess relationship that we call interacting peers, or simply peers. Each worker executes the same algorithm and
communicates with other workers in order to compute its part of the desired result. We will see further examples of interacting peers in Parts 2 and 3. In some cases, as here, each worker communicates with just two neighbors; in other cases each worker communicates with all the others.

In the first program above, the values in matrix b are replicated, with each worker process having its own copy. In the second program, each process has one row of a and one column of b at any point in time. This reduces the memory requirement per process, but the second program will take longer to execute than the first. This is because on each iteration of the second program, every worker process has to send a message to one neighbor and receive a message from another neighbor. The two programs illustrate the classic time/space tradeoff in computing. Section 9.3 presents additional algorithms for distributed matrix multiplication that illustrate other time/space tradeoffs.
1.9 Summary of Programming Notation

The previous five sections presented examples of recurring patterns in concurrent programming: iterative parallelism, recursive parallelism, producers and consumers, clients and servers, and interacting peers. We will see numerous instances of these patterns in the rest of the book. The examples also introduced the programming notation that we will employ. This section summarizes the notation, although it should be pretty much self-evident from the examples.

Recall that a concurrent program contains one or more processes, and that each process is a sequential program. Our programming language thus contains mechanisms for sequential as well as concurrent programming. The sequential programming notation is based on the core aspects of C, C++, and Java. The concurrent programming notation uses the co statement and process declarations to specify parts of a program that are to execute concurrently. These were introduced earlier and are defined below. Later chapters will define mechanisms for synchronization and interprocess communication.
1.9.1 Declarations

A variable declaration specifies a data type and then lists the names of one or more variables of that type. A variable can be initialized when it is declared. Examples of declarations are

    int i, j = 3;
    double sum = 0.0;
An array is declared by appending the size of each dimension of the array to the name of the array. The default subscript range for each dimension is from 0 to one less than the size of the dimension. Alternatively, the lower and upper bounds of a range can be declared explicitly. Arrays can also be initialized when they are declared. Examples of array declarations are

    int a[n];                         # same as "int a[0:n-1];"
    int b[1:n];                       # array of n integers, b[1] ... b[n]
    int c[1:n] = ([n] 0);             # vector of zeroes
    double c[n,n] = ([n] ([n] 1.0));  # matrix of ones
Each declaration is followed by a descriptive comment, which begins with the sharp character (see Section 1.9.4). The last declaration says that c is a matrix of double-precision numbers. The subscript ranges are 0 to n-1 for each dimension. The initial value of each element of c is 1.0.
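Since standard C has no explicit lower bounds and no aggregate "all ones" initializer, a plain-C rendering of roughly the same declarations needs a fixed size and an initialization loop. The sketch below is only an approximation; the size N is an assumption made here.

    #include <stdio.h>

    #define N 4

    int main(void) {
        int i = 0, j = 3;            /* int i, j = 3;            */
        double sum = 0.0;            /* double sum = 0.0;        */
        int a[N];                    /* int a[n];  indices 0..N-1 */
        int zeroes[N] = {0};         /* int c[1:n] = ([n] 0), shifted to 0..N-1 */
        double c[N][N];              /* double c[n,n] = ([n] ([n] 1.0)); */

        for (int r = 0; r < N; r++)  /* C needs a loop for the matrix of ones */
            for (int s = 0; s < N; s++)
                c[r][s] = 1.0;

        a[i] = j;                    /* use the variables so the sketch compiles cleanly */
        printf("a[0]=%d zeroes[0]=%d c[%d][%d]=%g sum=%g\n",
               a[0], zeroes[0], N-1, N-1, c[N-1][N-1], sum);
        return 0;
    }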
1.9.2 Sequential Statements

An assignment statement has a target variable (left-hand side), an equals sign, and an expression (right-hand side). We will also use shorthand forms for increment, decrement, and similar assignments. Examples are

    a[n] = a[n] + 1;     # same as "a[n]++;"
    x = (y+z) * f(x);    # f(x) is a function call
The control statements we will use are if, while, and for. A simple if statement has the form

    if (condition)
        statement;

where condition is a Boolean-valued expression (one that is true or false) and statement is some single statement. If more than one statement is to be executed conditionally, the statements are surrounded by braces, as in

    if (condition) {
        statement1;
        ...
        statementN;
    }

In subsequent chapters we will often use S to stand for such a list of statements. An if/then/else statement has the form
    if (condition)
        statement1;
    else
        statement2;

Again, the statements could be statement lists surrounded by braces. A while statement has the general form

    while (condition) {
        statement1;
        ...
        statementN;
    }
If condition is true, the enclosed statements are executed, then the while statement repeats. The while statement terminates when condition is false. If there is only a single statement, we omit the braces.

The if and while statements are identical to those in C, C++, and Java. However, we will write for statements in a more compact way. The general form of our for statement is

    for [quantifier1, ..., quantifierM] {
        statement1;
        ...
        statementN;
    }
A quantifier introduces a new index variable, gives its initial value, and specifies the range of values of the index variable. Brackets are used around quantifiers to indicate that there is a range of values, as in an array declaration. As a specific example, assume that a[n] is an array of integers. Then the following statement initializes each element a[i] to i:

    for [i = 0 to n-1]
        a[i] = i;
Here, i is the new quantifier variable; it does not have to be declared earlier in the program. The scope of i is the body of the for statement. The initial value of i is 0 and it takes on all values, in order, from 0 to n-1. Assume m[n,n] is an integer matrix. An example of a for statement with two quantifiers is

    for [i = 0 to n-1, j = 0 to n-1]
        m[i,j] = 0;
An equivalent program using nested for statements is

    for [i = 0 to n-1]
        for [j = 0 to n-1]
            m[i,j] = 0;

Both programs initialize the n^2 values of matrix m to 0. Two more examples of quantifiers are

    [i = 1 to n by 2]         # odd values from 1 to n
    [i = 0 to n-1 st i != x]  # every value except i == x
The operator st in the second quantifier stands for "such that."

We write for statements using the above syntax for several reasons. First, it emphasizes that our for statements are different than those in C, C++, and Java. Second, the notation suggests that they are used with arrays, which have brackets, rather than parentheses, around subscripts. Third, our notation simplifies programs because we do not have to declare the index variable. (How many times have you forgotten to do so?) Fourth, it is often convenient to have more than one index variable and hence more than one quantifier. Finally, we use the same kinds of quantifiers in co statements and process declarations.
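As a point of comparison, the quantifier loops above can be written in ordinary C with explicit index declarations; the st clause becomes a guard inside the loop body. The array size N and the excluded value X in this sketch are illustrative assumptions.

    #include <stdio.h>

    #define N 6
    #define X 3

    int main(void) {
        int a[N], m[N][N];

        /* for [i = 0 to n-1] a[i] = i; */
        for (int i = 0; i < N; i++)
            a[i] = i;

        /* for [i = 0 to n-1, j = 0 to n-1] m[i,j] = 0; */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = 0;

        /* [i = 1 to n by 2]  -- odd values from 1 to n */
        for (int i = 1; i <= N; i += 2)
            printf("odd: %d\n", i);

        /* [i = 0 to n-1 st i != x]  -- every value except i == x */
        for (int i = 0; i < N; i++) {
            if (i == X) continue;
            printf("i != x: %d\n", i);
        }

        printf("a[N-1]=%d m[0][0]=%d\n", a[N-1], m[0][0]);
        return 0;
    }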
1.9.3 Concurrent Statements, Processes, and Procedures

By default, statements execute sequentially, one at a time, one after the other. The co (concurrent) statement specifies that two or more statements can execute in parallel. One form of co has two or more arms, as in

    co statement1;
    // ...
    // statementN;
    oc
Each arm contains a statement (or statement list). The arms are separated by the parallelism operator //. The above statement means the following: Start executing each of the statements in parallel, then wait for them to terminate. The co statement thus terminates after all the statements have been executed.

The second form of co statement uses one or more quantifiers as a shorthand way to express that a set of statements is to be executed in parallel for every combination of values of the quantifier variables. For example, the trivial co statement that follows initializes arrays a[n] and b[n] to zeros:

    co [i = 0 to n-1] {
        a[i] = 0.0; b[i] = 0.0;
    }
This creates n processes, one for each value of i. The scope of the quantifier variable is the process declaration, and each process has a different value of i.

The two forms of co can also be mixed. For example, one arm might have a quantifier (within brackets) and another arm might not.

A process declaration is essentially an abbreviation for a co statement with one arm and/or one quantifier. It begins with the keyword process. The body of a process is enclosed in braces; it contains declarations of local variables, if any, and a list of statements. As a simple example, the following declares a process foo that sums the values from 1 to 10 then stores the result in global variable x:

    process foo {
        int sum = 0;
        for [i = 1 to 10]
            sum += i;
        x = sum;
    }
A process declaration occurs at the syntactic level of a procedure declaration; it is not a statement, whereas co is. Moreover, processes execute in the background, whereas the code that contains a co statement waits for the processes created by co to terminate before executing the next statement.

As a second simple example, the following process writes the values 1 to n to the standard output file:

    process bar1 {
        for [i = 1 to n]
            write(i);      # same as "printf("%d\n", i);"
    }

An array of processes is declared by appending a quantifier (in brackets) to the name of the process, as in

    process bar2 [i = 1 to n] {
        write(i);
    }
Both bar1 and bar2 write the values 1 to n to standard output. However, the order in which bar2 writes the values is nondeterministic. This is because bar2 is an array of n distinct processes, and processes execute in an arbitrary
order. In fact, there are n! different orders in which the array of processes could write the values (n factorial is the number of permutations of n values).

We will declare and call procedures and functions in essentially the same way as in C. Two simple examples are

    int addOne(int v) {    # an integer function
        return (v + 1);
    }

    main() {               # a "void" procedure
        int n, sum = 0;
        read(n);           # read an integer from stdin
        for [i = 1 to n]
            sum = sum + addOne(i);
        write("the final value is", sum);
    }
If the input value n is 5, this program writes the line:

    the final value is 20
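To observe the nondeterministic ordering discussed above on a real system, bar2 can be approximated with one Pthreads thread per value. This is a sketch, not the text's notation: the thread count and the use of printf are choices made here, and repeated runs typically interleave the output differently.

    #include <pthread.h>
    #include <stdio.h>

    #define N 5

    void *bar2(void *arg) {
        long i = (long)arg;
        printf("%ld\n", i);          /* plays the role of write(i) */
        return NULL;
    }

    int main(void) {
        pthread_t tid[N];
        for (long i = 1; i <= N; i++)
            pthread_create(&tid[i-1], NULL, bar2, (void *)i);
        for (int i = 0; i < N; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

Any of the N! orderings of the output lines is a legal result; which one appears depends on how the scheduler happens to run the threads.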
1.9.4 Comments

We will write comments in one of two ways. One-line comments begin with the sharp character, #, and terminate with the end of the line. Multiple-line comments begin with /* and end with */. We use # to introduce one-line comments because the C++/Java one-line comment symbol, //, has long been used in concurrent programming as the separator between arms of concurrent statements.

An assertion is a predicate that specifies a condition that is assumed to be true at a certain point in a program. (Assertions are described in detail in Chapter 2.) Assertions can be thought of as very precise comments, so we will specify them by lines that begin with two sharp characters, as in

    ## x > 0

This comment asserts that x is positive.
Historical Notes

As noted in the text, concurrent programming originated in the 1960s after the introduction of independent device controllers (channels). Operating systems were the first software systems to be organized as multithreaded concurrent programs. The research and initial prototypes leading to modern operating systems
occurred in the late 1960s and early 1970s. The first textbooks on operating systems appeared in the early 1970s.

The introduction of computer networks in the 1970s led to the development of distributed systems. The invention of the Ethernet in the late 1970s greatly accelerated the pace of activity. Almost immediately there was a plethora of new languages, algorithms, and applications. Subsequent hardware developments spurred further activity. For example, once workstations and local area networks became relatively cheap, people developed client/server computing; more recently the Internet has spawned Java, Web browsers, and a plethora of new applications.

Multiprocessors were invented in the 1970s, the most notable machines being the Illiac SIMD multiprocessors developed at the University of Illinois. Early machines were costly and specialized, however, so for many years high-performance scientific computing was done on vector processors. This started to change in the mid-1980s with the invention of hypercube machines at the California Institute of Technology and their commercial introduction by Intel. Then Thinking Machines introduced the massively parallel Connection Machine. Also, Cray Research and other makers of vector processors started to produce multiprocessor versions of their machines. For several years, numerous companies (and machines) came, and almost as quickly went. However, the set of players and machines is now fairly stable, and high-performance computing is pretty much synonymous with massively parallel processing.

The Historical Notes in subsequent chapters describe developments that are covered in the chapters themselves. Described below are several general references on hardware, operating systems, distributed systems, and parallel computing. These are the books the author has consulted most often while writing this text. (If there is a topic you wish to learn more about, you can also try a good Internet search engine.)

The now classic textbook on computer architecture is Hennessy and Patterson [1996]. If you want to learn how caches, interconnection networks, and multiprocessors work, start there. Hwang [1993] describes high-performance computer architectures. Almasi and Gottlieb [1994] survey parallel computing applications, software, and architectures. The most widely used textbooks on operating systems are Tanenbaum [1992] and Silberschatz, Peterson, and Galvin [1998]. Tanenbaum [1995] also gives an excellent overview of distributed operating systems. Mullender [1993] contains a superb collection of chapters on all aspects of distributed systems, including reliability and fault tolerance, transaction processing, file systems, real-time systems, and security. Two additional textbooks, Bacon [1998] and Bernstein and Lewis [1993], cover multithreaded concurrent systems, including distributed database systems.
Dozens of books on parallel computing have been written in the last several years. Two competing books, Kumar et al. [1994] and Quinn [1994], describe and analyze parallel algorithms for solving numerous numeric and combinatorial problems. Both books contain some coverage of hardware and software, but they emphasize the design and analysis of parallel algorithms.

Another four books emphasize software aspects of parallel computing. Brinch Hansen [1995] examines several interesting problems in computational science and shows how to solve them using a small number of recurring programming paradigms; the problem descriptions are notably clear. Foster [1995] examines concepts and tools for parallel programming, with an emphasis on distributed-memory machines; the book includes excellent descriptions of High Performance Fortran (HPF) and the Message Passing Interface (MPI), two of the most important current tools. Wilson [1995] describes four important programming models (data parallelism, shared variables, message passing, and generative communication) and shows how to use each to solve science and engineering problems; Appendix B of Wilson's book gives an excellent short history of major events in parallel computing. A recent book by Wilkinson and Allen [1999] describes dozens of applications and shows how to write parallel programs to solve them; the book emphasizes the use of message passing, but includes some coverage of shared-variable computing.

One additional book, edited by Foster and Kesselman [1999], presents a vision for a promising new approach to distributed, high-performance computing using what is called a computational grid. (We describe the approach in Chapter 12.) The book starts with a comprehensive introduction to computational grids and then covers applications, programming tools, services, and infrastructure. The chapters are written by the experts who are working on making computational grids a reality.
References

Almasi, G. S., and A. Gottlieb. 1994. Highly Parallel Computing, 2nd ed. Menlo Park, CA: Benjamin/Cummings.

Bacon, J. 1998. Concurrent Systems: Operating Systems, Database and Distributed Systems: An Integrated Approach, 2nd ed. Reading, MA: Addison-Wesley.

Bernstein, A. J., and P. M. Lewis. 1993. Concurrency in Programming and Database Systems. Boston, MA: Jones and Bartlett.

Brinch Hansen, P. 1995. Studies in Computational Science. Englewood Cliffs, NJ: Prentice-Hall.

Foster, I. 1995. Designing and Building Parallel Programs. Reading, MA: Addison-Wesley.

Foster, I., and C. Kesselman, eds. 1999. The Grid: Blueprint for a New Computing Infrastructure. San Francisco, CA: Morgan Kaufmann.

Hennessy, J. L., and D. A. Patterson. 1996. Computer Architecture: A Quantitative Approach, 2nd ed. San Francisco, CA: Morgan Kaufmann.

Hwang, K. 1993. Advanced Computer Architecture: Parallelism, Scalability, Programmability. New York, NY: McGraw-Hill.

Kumar, V., A. Grama, A. Gupta, and G. Karypis. 1994. Introduction to Parallel Computing: Design and Analysis of Algorithms. Menlo Park, CA: Benjamin/Cummings.

Mullender, S., ed. 1993. Distributed Systems, 2nd ed. Reading, MA: ACM Press and Addison-Wesley.

Quinn, M. J. 1994. Parallel Computing: Theory and Practice. New York, NY: McGraw-Hill.

Silberschatz, A., J. Peterson, and P. Galvin. 1998. Operating System Concepts, 5th ed. Reading, MA: Addison-Wesley.

Tanenbaum, A. S. 1992. Modern Operating Systems. Englewood Cliffs, NJ: Prentice-Hall.

Tanenbaum, A. S. 1995. Distributed Operating Systems. Englewood Cliffs, NJ: Prentice-Hall.

Wilkinson, B., and M. Allen. 1999. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Englewood Cliffs, NJ: Prentice-Hall.

Wilson, G. V. 1995. Practical Parallel Programming. Cambridge, MA: MIT Press.
Exercises

1.1 Determine the characteristics of the multiple-processor machines to which you have access. How many processors does each machine have and what are their clock speeds? How big are the caches and how are they organized? What is the size of the primary memory? What is the access time? What memory consistency protocol is used? How is the interconnection network organized? What is the remote memory access or message transfer time?
1.2 Many problems can be solved more efficiently using a concurrent program rather than a sequential program, assuming appropriate hardware. Consider the programs you have written in previous classes, and pick two that could be rewritten as concurrent programs. One program should be iterative and the other recursive. Then (a) write concise descriptions of the two problems, and (b) develop pseudocode for concurrent programs that solve the problems.
1.3 Consider the matrix multiplication problem described in Section 1.4.

(a) Write a sequential program to solve the problem. The size of the matrix n should be a command-line argument. Initialize each element of matrices a and b to 1.0. (Hence, the value of each element of the result matrix c will be n.)

(b) Write a parallel program to solve the problem. Compute strips of results in parallel using P worker processes. The matrix size n and the number of workers P should be command-line arguments. Again initialize each element of a and b to 1.0.

(c) Compare the performance of your programs. Experiment with different values for n and P. Prepare a graph of the results and explain what you observe.
(d) Modify your programs to multiply rectangular matrices. The size of matrix a should be p x q, and the size of matrix b should be q x r. Hence, the size of the result will be p x r. Repeat part (c) for the new programs.

1.4 In the matrix multiplication programs, computing an inner product involves multiplying pairs of elements and adding the results. The multiplications could all be done in parallel. Pairs of products could also be added in parallel.

(a) Construct a binary expression tree to illustrate how this would work. Assume the vectors have length n, and for simplicity assume that n is a power of two. The leaves of the tree should be vector elements (a row of matrix a and a column of matrix b). The other nodes of the tree should be multiplication or addition operators.

(b) Many modern processors are what are called super-scalar processors. This means that they can issue (start executing) two or more instructions at the same time. Consider a machine that can issue two instructions at a time. Write assembly-level pseudocode to implement the binary expression tree you constructed in part (a). Assume there are instructions to load, add, and multiply registers, and there are as many registers as you need. Maximize the number of pairs of instructions that you can issue at the same time.

(c) Assume that a load takes 1 clock cycle, an add takes 1 clock cycle, and a multiply takes 8 clock cycles. What is the execution time of your program?
1.5 In the first parallel matrix multiplication program in Section 1.4, the first line has a co statement for rows, and the second line has a for statement for columns. Suppose the co is replaced by for and the for is replaced by co, but that otherwise the program is the same. In particular, the program computes rows sequentially but computes columns in parallel for each row.

(a) Will the program be correct? Namely, will it compute the same results?

(b) Will the program execute as efficiently? Explain your answer.

1.6 The points on a unit circle centered at the origin are defined by the function f(x) = sqrt(1 - x^2). Recall that the area of a circle is πr^2, where r is the radius. Use the adaptive quadrature program in Section 1.5 to approximate the value of π by computing the area of the upper-right quadrant of a unit circle and then multiplying the result by 4. (Another function to try is the integral from 0.0 to 1.0 of f(x) = 4/(1 + x^2).)
1.7 Consider the quadrature problem described in Section 1.5.

(a) Write four programs to solve the problem: a sequential iterative program that uses a fixed number of intervals, a sequential recursive program that uses adaptive quadrature, a parallel iterative program that uses a fixed number of intervals, and a recursive parallel program that uses adaptive quadrature. Integrate a function that has an interesting shape, such as sin(x)*exp(x). Your programs should have command-line arguments for the range of values for x, the number of intervals (for fixed quadrature), and the value of EPSILON (for adaptive quadrature).

(b) Experiment with your programs for different values of the command-line arguments. What are their execution times? How accurate are the answers? How fast do the programs converge? Explain what you observe.

1.8 The parallel adaptive quadrature program in Section 1.5 will create a large number of processes, usually way more than the number of processors.

(a) Modify the program so that it creates approximately T processes, where T is a command-line argument that specifies a threshold. In particular, use a global variable to keep a count of the number of processes that you have created. (Assume for this problem that you can safely increment a global variable.) Within the body of quad, if over T processes have already been created, then use sequential recursion. Otherwise use parallel recursion and add two to the global counter.

(b) Implement and test the parallel program in the text and your answer to (a). Pick an interesting function to integrate, then experiment with different values of T and EPSILON. Compare the performance of the two programs.
1.9 Write a sequential recursive program to implement the quicksort algorithm for sorting an array of n values. Then modify your program to use recursive parallelism. Be careful to ensure that the parallel calls are independent. Implement both programs and compare their performance.

1.10 Gather information on Unix pipes. How are they implemented on your system? What is the maximum size of a pipe? How are read and write synchronized? See if you can construct an experiment that causes a write to a pipe to block because the pipe is full. See if you can construct a concurrent program that deadlocks (in other words, all the processes are waiting for each other).

1.11 Most computing facilities now employ server machines for electronic mail, Web pages, files, and so on. What kinds of server machines are used at your computing facility? Pick one of the servers (if there is one) and find out how the software is organized. What are the processes (threads) that it runs? How are they scheduled? What are the clients? Is there a thread per client request, or a fixed number of server threads, or what? What data do the server threads share? When and why do the threads synchronize with each other?

1.12 Consider the two programs for distributed matrix multiplication in Section 1.8.

(a) How many messages are sent (and received) by each program? What are the sizes of the messages? Don't forget that there is also a coordinator process in the second program.

(b) Modify the programs to use P worker processes, where P is a factor of n. In particular, each worker should compute n/P rows (or columns) of results rather than a single row of results.

(c) How many messages are sent by your program for part (b)? What are the sizes of the messages?

1.13 The transpose of matrix M is a matrix T such that T[i,j] = M[j,i], for all i and j.
(a) Write a parallel program using shared variables that computes the transpose of an n x n matrix M. Use P worker processes. For simplicity, assume that n is a multiple of P.

(b) Write a parallel program using message passing to compute the transpose of square matrix M. Again use P workers and assume n is a multiple of P. Figure out how to distribute the data and gather the results.

(c) Modify your programs to handle the case when n is not a multiple of P.

(d) Experiment with your programs for various values of n and P. What is their performance?
1.14 The grep family of Unix commands finds patterns in files. Write a simplified version of grep that has two arguments: a string and a filename. The program should print to stdout all lines in the file that contain the string.

(a) Modify your program so that it searches two files, one after the other. (Add a third argument to the program for the second filename.)

(b) Now change your program so that it searches the two files in parallel. Both processes should write to stdout.

(c) Experiment with your programs for (a) and (b). Their output should differ, at least some of the time. Moreover, the output of the concurrent program should not always be the same. See if you can observe these phenomena.
Part 1: Shared-Variable Programming
Sequential programs often employ shared variables as a matter of convenience (for global data structures, for example), but they can be written without them. In fact, many would argue that they should be written without them. Concurrent programs, on the other hand, absolutely depend on the use of shared components. This is because the only way processes can work together to solve a problem is to communicate. And the only way they can communicate is if one process writes into something that the other process reads. That something can be a shared variable or it can be a shared communication channel. Thus, communication is programmed by writing and reading shared variables or by sending and receiving messages.

Communication gives rise to the need for synchronization. There are two basic kinds: mutual exclusion and condition synchronization. Mutual exclusion occurs whenever two processes need to take turns accessing shared objects, such as the records in an airline reservation system. Condition synchronization occurs whenever one process needs to wait for another, for example, when a consumer process needs to wait for data from a producer process.

Part 1 shows how to write concurrent programs in which processes communicate and synchronize by means of shared variables. We examine a variety of multithreaded and parallel applications. Shared-variable programs are most commonly executed on shared-memory machines, because every variable can be accessed directly by every processor. However, the shared-variable programming model can also be used on distributed-memory machines if it is supported by a software implementation of what is called a distributed shared memory; we examine how this is done in Chapter 10.

Chapter 2 introduces fundamental concepts of processes and synchronization by means of a series of small examples. The first half of the chapter
describes ways to parallelize programs, illustrates the need for synchronization, defines atomic actions, and introduces the await statement for programming synchronization. The second half of the chapter examines the semantics of concurrent programs and introduces several key concepts-such as interference, global invariants, safety properties, and fairness-that we will encounter over and over in later chapters.

Chapter 3 examines two basic kinds of synchronization-locks and barriers-and shows how to implement them using the kinds of instructions that are found on every processor. Locks are used to solve the classic critical section problem, which occurs in most concurrent programs. Barriers are a fundamental synchronization technique in parallel programs. The last two sections of the chapter examine and illustrate applications of two important models for parallel computing: data parallel and bag of tasks.

Chapter 4 describes semaphores, which simplify programming mutual exclusion and condition synchronization (signaling). The chapter introduces two more classic concurrent programming problems-the dining philosophers and readers/writers-as well as several others. At the end of the chapter we describe and illustrate the use of the POSIX threads (Pthreads) library, which supports threads and semaphores on shared-memory machines.

Chapter 5 describes monitors, a higher-level programming mechanism that was first proposed in the 1970s, then fell somewhat out of favor, but has once again become popular because it is supported by the Java programming language. Monitors are also used to organize and synchronize the code in operating systems and other multithreaded software systems. The chapter illustrates the use of monitors by means of several interesting examples, including communication buffers, readers/writers, timers, the sleeping barber (another classic problem), and disk scheduling. Section 5.4 describes Java's threads and synchronized methods and shows various ways to protect shared data in Java. Section 5.5 shows how to program monitors using the Pthreads library.

Chapter 6 shows how to implement processes, semaphores, and monitors on single processors and on shared-memory multiprocessors. The basis for the implementations is a kernel of data structures and primitive routines; these are the core for any implementation of a concurrent programming language or subroutine library. At the end of the chapter, we show how to implement monitors using semaphores.
Processes and Synchronization
Concurrent programs are inherently more complex than sequential programs. In many respects, they are to sequential programs what chess is to checkers or bridge is to pinochle: Each is interesting, but the former is more intellectually intriguing than the latter. This chapter explores the "game" of concurrent programming, looking closely at its rules, playing pieces, and strategies. The rules are formal tools that help one to understand and develop correct programs; the playing pieces are language mechanisms for describing concurrent computations; the strategies are useful programming techniques.

The previous chapter introduced processes and synchronization and gave several examples of their use. This chapter examines processes and synchronization in detail. Section 2.1 takes a first look at the semantics (meaning) of concurrent programs and introduces five fundamental concepts: program state, atomic action, history, safety property, and liveness property. Sections 2.2 and 2.3 explore these concepts using two simple examples-finding a pattern in a file and finding the maximum value in an array; these sections also explore ways to parallelize programs and show the need for atomic actions and synchronization. Section 2.4 defines atomic actions and introduces the await statement as a means for expressing atomic actions and synchronization. Section 2.5 shows how to program the synchronization that arises in producer/consumer programs. Section 2.6 presents a synopsis of the axiomatic semantics of sequential and concurrent programs. The fundamental new problem introduced by concurrency is the possibility of interference. Section 2.7 describes four methods for avoiding interference: disjoint variables, weakened assertions, global invariants, and synchronization. Finally, Section 2.8 shows how to prove safety properties and defines scheduling policies and fairness.
Many of the concepts introduced in this chapter are quite detailed, so they can be hard to grasp on first reading. But please persevere, study the examples, and refer back to this chapter as needed. The concepts are important because they provide the basis for developing and understanding concurrent programs. Using a disciplined approach is important for sequential programs; it is imperative for concurrent programs, because the order in which processes execute is nondeterministic. In any event, on with the game!
2.1 States, Actions, Histories, and Properties

The state of a concurrent program consists of the values of the program variables at a point in time. The variables include those explicitly declared by the programmer as well as implicit variables-such as the program counter for each process-that contain hidden state information. A concurrent program begins execution in some initial state. Each process in the program executes independently, and as it executes it examines and alters the program state.

A process executes a sequence of statements. A statement, in turn, is implemented by a sequence of one or more atomic actions, which are actions that indivisibly examine or change the program state. Examples of atomic actions are uninterruptible machine instructions that load and store memory words. The execution of a concurrent program results in an interleaving of the sequences of atomic actions executed by each process. A particular execution of a concurrent program can be viewed as a history:

    s0 -> s1 -> ... -> sn

In a history, s0 is the initial state, the other si are subsequent states, and the transitions are made by atomic actions that alter the state. (A history is also called a trace of the sequence of states.) Even parallel execution can be modeled as a linear history, because the effect of executing a set of atomic actions in parallel is equivalent to executing them in some serial order. In particular, the state change caused by an atomic action is indivisible and hence cannot be affected by atomic actions executed at about the same time.

Each execution of a concurrent program produces a history. For all but the most trivial programs, the number of possible histories is enormous (see below for details). This is because the next atomic action in any one of the processes could be the next one in a history. Hence, there are many ways in which the actions can be interleaved, even if a program is always begun in the same initial state. In addition, each process will most likely contain conditional statements and hence will take different actions as the state changes.

The role of synchronization is to constrain the possible histories of a concurrent program to those histories that are desirable. Mutual exclusion is concerned with combining atomic actions that are implemented directly by hardware
into sequences of actions called critical sections that appear to be atomic-i.e., that cannot be interleaved with actions in other processes that reference the same variables. Condition synchronization is concerned with delaying an action until the state satisfies a Boolean condition. Both forms of synchronization can cause processes to be delayed, and hence they restrict the set of atomic actions that can be executed next.

A property of a program is an attribute that is true of every possible history of that program and hence of all executions of the program. There are two kinds of properties: safety and liveness. A safety property is one in which the program never enters a bad state-i.e., a state in which some variables have undesirable values. A liveness property is one in which the program eventually enters a good state-i.e., a state in which variables have desirable values.

Partial correctness is an example of a safety property. A program is partially correct if the final state is correct, assuming that the program terminates. If a program fails to terminate, it may never produce the correct answer, but there is no history in which the program has terminated without producing the correct answer. Termination is an example of a liveness property. A program terminates if every loop and procedure call terminates-hence, if the length of every history is finite. Total correctness is a property that combines partial correctness and termination: A program is totally correct if it always terminates with a correct answer.

Mutual exclusion is an example of a safety property in a concurrent program. The bad state in this case would be one in which two processes are executing actions in different critical sections at the same time. Eventual entry to a critical section is an example of a liveness property in a concurrent program. The good state for each process is one in which it is executing within its critical section.

Given a program and a desired property, how might one go about demonstrating that the program satisfies the property? A common approach is testing or debugging, which can be characterized as "run the program and see what happens." This corresponds to enumerating some of the possible histories of a program and verifying that they are acceptable. The shortcoming of testing is that each test considers just one execution history, and a limited number of tests are unlikely to demonstrate the absence of bad histories.

A second approach is to use operational reasoning, which can be characterized as "exhaustive case analysis." In this approach, all possible execution histories of a program are enumerated by considering all the ways the atomic actions of each process might be interleaved. Unfortunately, the number of histories in a concurrent program is generally enormous (hence the approach is "exhaustive"). For example, suppose a concurrent program contains n processes and that each executes a sequence of m atomic actions. Then the number of different histories
of the program is (n*m)! / (m!)^n. In a program containing only three processes, each of which executes only two atomic actions, this is a total of 90 different histories! (The numerator in the formula is the number of permutations of the n*m actions. But each process executes a sequence of actions, so there is only one legal order of the m actions in each process; the denominator rules out all the illegal orders. The formula is the same as the number of ways to shuffle n decks of m cards each, assuming that the cards in each deck remain in the same order relative to each other.)

A third approach is to employ assertional reasoning, which can be characterized as "abstract analysis." In this approach, formulas of predicate logic called assertions are used to characterize sets of states-for example, all states in which x > 0. Atomic actions are then viewed as predicate transformers, because they change the state from satisfying one predicate to satisfying another predicate. The virtue of the assertional approach is that it leads to a compact representation of states and state transformations. More importantly, it leads to a way to develop and analyze programs in which the work involved is directly proportional to the number of atomic actions in the program.

We will use the assertional approach as a tool for constructing and understanding solutions to a variety of nontrivial problems. We will also use operational reasoning to guide the development of algorithms. Finally, many of the programs in the text have been tested since that helps increase confidence in the correctness of a program. One always has to be wary of testing alone, however, because it can reveal only the presence of errors, not their absence. Moreover, concurrent programs are extremely difficult to test and debug since (1) it is difficult to stop all processes at once in order to examine their state, and (2) each execution will in general produce a different history.
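For a quick sanity check on this formula, here is a small C program (our own illustration, not part of the text) that evaluates (n*m)! / (m!)^n with exact integer arithmetic; for n = 3 and m = 2 it prints the 90 quoted above.

/* Number of histories of n processes that each execute m atomic
   actions: (n*m)! / (m!)^n.  Exact only while the factorials fit
   in 64 bits, which is fine for the tiny cases discussed here. */
#include <stdio.h>

static unsigned long long factorial(unsigned int k) {
    unsigned long long f = 1;
    for (unsigned int i = 2; i <= k; i++)
        f *= i;
    return f;
}

static unsigned long long histories(unsigned int n, unsigned int m) {
    unsigned long long h = factorial(n * m);
    for (unsigned int i = 0; i < n; i++)
        h /= factorial(m);               /* divide by (m!)^n */
    return h;
}

int main(void) {
    printf("%llu\n", histories(3, 2));   /* prints 90 */
    return 0;
}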
2.2 Parallelization: Finding Patterns in a File

Chapter 1 examined several kinds of applications and showed how they could be solved using concurrent programs. Here, we examine a single, simple problem and look in detail at ways to parallelize it.

Consider the problem of finding all instances of a pattern in filename. The pattern is a string; filename is the name of a file. This problem is readily solved in Unix by using one of the grep-style commands at the Unix command level, such as

    grep pattern filename
Executing this command creates a single process. The process executes something similar to the following sequential program:
string line;
read a line of input from stdin into line;
while (!EOF) {          # EOF is end of file
  look for pattern in line;
  if (pattern is in line)
    write line;
  read next line of input;
}

We now want to consider two basic questions: Can the above program be parallelized? If so, how?

The fundamental requirement for being able to parallelize any program is that it contains independent parts, as described in Section 1.4. Two parts are dependent on each other if one produces results that the other needs; this can only happen if they read and write shared variables. Hence, two parts of a program are independent if they do not read and write the same variables. More precisely:
(2.1) Independence of Parallel Processes. Let the read set of a part of a program be the variables it reads but does not alter. Let the write set of a part be the variables it writes into (and possibly also reads). Two parts of a program are independent if the write set of each part is disjoint from both the read and write sets of the other part.
A variable is any value that is read or written atomically. This includes simple variables-such as integers-that are stored in single words of storage as well as individual elements of arrays or structures (records). From the above definition, two parts are independent if both only read shared variables, or if each part reads different variables than the ones written into by the other part. Occasionally, it is possible that two parts can safely execute in parallel even if they write into common variables. However, this is feasible only if the order in which the writes occur does not matter. For example, when two (or more) processes are periodically updating a graphics display, it may be fine for the updates to occur in any order.

Returning to the problem of finding a pattern in a file, what are the independent parts, and hence, what can be parallelized? The program starts by reading the first input line; this has to be done before anything else. Then the program enters a loop that looks for the pattern, outputs the line if the pattern is found, then reads a new line. We cannot output a line before looking for a pattern in that line, so the first two lines in the loop body cannot be executed in parallel with each other. However, we can read the next line of input while looking for a pattern in the previous line and possibly printing that line. Hence, consider the following, concurrent version of the above program:
string line;
read a line of input from stdin into line;
while (!EOF) {
  co look for pattern in line;
     if (pattern is in line)
       write line;
  // read next line of input into line;
  oc;
}
Note that the first arm of co is a sequence of statements. Are the two processes in this program independent? The answer is no, because the first process reads line and the second process writes into it. Thus, if the second process runs faster than the first, it will overwrite the line before it is examined by the first process.

As noted, parts of a program can be executed concurrently only if they read
and write different variables. Suppose the second process reads into a different variable than the one that is examined by the first process. In particular, consider the following program:

string line1, line2;
read a line of input from stdin into line1;
while (!EOF) {
  co look for pattern in line1;
     if (pattern is in line1)
       write line1;
  // read next line of input into line2;
  oc;
}
Now, the two processes are working on different lines, which are stored in variables line1 and line2. Hence, the processes can execute concurrently. But is the above program correct? Obviously not, because the first process continuously looks at line1, while the second process continuously reads into line2, which is never examined.

The solution is relatively straightforward: Swap the roles of the lines at the end of each loop iteration, so that the first process always examines the last line read and the second process always reads into a different variable from the one being examined. The following program accomplishes this:

string line1, line2;
read a line of input from stdin into line1;
while (!EOF) {
  co look for pattern in line1;
     if (pattern is in line1)
       write line1;
  // read next line of input into line2;
  oc;
  line1 = line2;
}
In particular, at the end of each loop iteration-and after each process has finished-copy the contents of line2 into line1. The processes within the co statement are now independent, but their actions are coupled because of the last statement in the loop, which copies line2 into line1.

The above concurrent program is correct, but it is quite inefficient. First, the last line in the loop copies the contents of line2 into line1. This is a sequential action not present in the first program, and it in general requires copying dozens of characters; thus, it is pure overhead. Second, the loop body contains a co statement, which means that on each iteration of the while loop, two processes will be created, executed, then destroyed. It is possible to do the "copying" much more efficiently by using an array of two lines, having each process index into a different line in the array, and then merely swapping array indices in the last line. However, process creation overhead would still dominate, because it takes much longer to create and destroy processes than to call procedures and much, much longer than to execute straight-line code (see Chapter 6 for details).

So, we come to our final question for this section. Is there another way to parallelize the program that avoids having a co statement inside the loop? As you no doubt guessed, the answer is yes. In particular, instead of having a co statement inside the while loop, we can put while loops inside each arm of the co statement. Figure 2.1 contains an outline of this approach. The program is in fact an instance of the producer/consumer pattern introduced in Section 1.6. Here, the first process is the producer and the second process is the consumer. They communicate by means of the shared buffer. Note that the declarations of line1 and line2 are now local to the processes because the lines are no longer shared.

We call the style of the program in Figure 2.1 "while inside co" as opposed to the "co inside while" style of the earlier programs in this section. The advantage of the "while inside co" style is that processes are created only once, rather than on each loop iteration. The down side is that we have to use two buffers, and we have to program the required synchronization. The statements that precede and follow access to the shared buffer indicate the kind of synchronization that is required. We will show how to program this synchronization in Section 2.5, but first we need to examine the topics of synchronization in general and atomic actions in particular.
string buffer;        # contains one line of input
bool done = false;    # used to signal termination

co  # process 1: find patterns
   string line1;
   while (true) {
      wait for buffer to be full or done to be true;
      if (done) break;
      line1 = buffer;
      signal that buffer is empty;
      look for pattern in line1;
      if (pattern is in line1)
         write line1;
   }
// # process 2: read new lines
   string line2;
   while (true) {
      read next line of input into line2;
      if (EOF) { done = true; break; }
      wait for buffer to be empty;
      buffer = line2;
      signal that buffer is full;
   }
oc;
Figure 2.1   Finding patterns in a file.
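The outline in Figure 2.1 can be fleshed out in many ways. The sketch below is one possible rendering in C using the Pthreads library (which the text does not introduce until Chapter 4); a mutex and two condition variables stand in for the informal "wait" and "signal" lines, strstr does the pattern matching, and names such as full, cond_full, and MAXLINE are our own, not the book's.

/* A possible C/Pthreads version of Figure 2.1: one thread finds
   patterns while the other reads lines; a one-line shared buffer
   plus a mutex and two condition variables coordinate them. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAXLINE 1024

static char buffer[MAXLINE];      /* shared: one line of input */
static bool full = false;         /* buffer holds an unconsumed line */
static bool done = false;         /* producer has reached end of file */
static const char *pattern;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t cond_empty = PTHREAD_COND_INITIALIZER;

static void *find_patterns(void *arg) {      /* process 1 */
    (void)arg;
    char line1[MAXLINE];
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!full && !done)               /* wait for full buffer or done */
            pthread_cond_wait(&cond_full, &lock);
        if (!full) {                         /* done, nothing left to consume */
            pthread_mutex_unlock(&lock);
            break;
        }
        strcpy(line1, buffer);
        full = false;                        /* signal that buffer is empty */
        pthread_cond_signal(&cond_empty);
        pthread_mutex_unlock(&lock);
        if (strstr(line1, pattern) != NULL)  /* look for pattern in line1 */
            fputs(line1, stdout);
    }
    return NULL;
}

static void *read_new_lines(void *arg) {     /* process 2 */
    (void)arg;
    char line2[MAXLINE];
    while (fgets(line2, MAXLINE, stdin) != NULL) {
        pthread_mutex_lock(&lock);
        while (full)                         /* wait for buffer to be empty */
            pthread_cond_wait(&cond_empty, &lock);
        strcpy(buffer, line2);
        full = true;                         /* signal that buffer is full */
        pthread_cond_signal(&cond_full);
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);               /* EOF reached */
    done = true;
    pthread_cond_signal(&cond_full);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(int argc, char *argv[]) {
    pthread_t t1, t2;
    pattern = (argc > 1) ? argv[1] : "the";  /* hypothetical default pattern */
    pthread_create(&t1, NULL, find_patterns, NULL);
    pthread_create(&t2, NULL, read_new_lines, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

This version consumes any line still in the buffer before honoring done, which is one reasonable way to resolve a termination detail that the outline leaves implicit.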
2.3 Synchronization: The Maximum of an Array

Now consider a different problem, one that requires synchronization between processes. The specific problem is to find the maximum element of array a[n]. We will assume that n is positive and that all elements of a are positive integers.

Finding the maximal element in array a is an example of an accumulation (or reduction) problem. In this case, we are accumulating the maximum seen so far, or equivalently reducing all values to their maximum. Let m be the variable that is to be assigned the maximal value. The goal of the program can then be expressed in predicate logic as
(∀ j: 0 <= j < n: m >= a[j]) ∧
(∃ j: 0 <= j < n: m == a[j])

The first line says that when the program terminates, the value of m is to be at
least as large as every value in array a. The second line says that m is to be equivalent to some value in array a. To solve the problem, we can use the following sequential program:

int m = 0;
for [i = 0 to n-1] {
  if (a[i] > m)
    m = a[i];
}
This program iteratively looks at all values in array a; if one is found that is larger than the current maximum, it is assigned to m. Since we assume that all values in a are positive, it is safe to initialize m to 0.

Now consider ways to parallelize the above program. Suppose we fully parallelize the loop by examining every array element in parallel:

int m = 0;
co [i = 0 to n-1]
   if (a[i] > m)
      m = a[i];
This program is incorrect because the processes are not independent: each one both reads and writes variable m. In particular, suppose that each process executes at the same rate, and hence that each compares its a[i] to m at the same time. All processes will see that the comparison is true (because all elements of a are positive and the initial value of m is 0). Hence, all processes will try to update m. The memory hardware will cause the updates to be done in some serial order-because writing into a memory word is atomic-and the final value of m will be the value of a[i] assigned by the last process that gets to update m.

In the above program, reading and writing m are separate actions. One way to deal with having too much parallelism is to use synchronization to combine separate actions into a single atomic action. The following program does so:

int m = 0;
co [i = 0 to n-1]
   <if (a[i] > m) m = a[i];>
The angle brackets in the above code specify that each if statement is to be executed as an atomic action-namely, that each if statement both examines the current value of m and conditionally updates it as a single, indivisible action. (We describe the angle-bracket notation in detail in the next section.)

Unfortunately, the last program is almost the same as the sequential program. In the sequential program, the elements of a are examined in a fixed
order-from a[0] to a[n-1]. In the last program the elements of a are examined in an arbitrary order-because processes execute in an arbitrary order-but they are still examined one at a time due to the synchronization.

The key issues in this application are to ensure that updates of m are atomic and that m is indeed the maximum. Suppose that we execute the comparisons in parallel but the updates one at a time, as in the following program:

int m = 0;
co [i = 0 to n-1]
   if (a[i] > m)
      <m = a[i];>
Is this version correct? No, because this program is actually the same as the first concurrent program: Every process could compare its value of a to m and then update the value of m. Although this program specifies that updating m is atomic, the memory hardware of a machine will in fact ensure that.

So, what is the best way to solve this problem? The answer is to combine the last two programs. It is safe to do comparisons in parallel, because they are read-only actions. But it is necessary to ensure that when the program terminates, m is indeed the maximum. The following program accomplishes this:

int m = 0;
co [i = 0 to n-1]
   if (a[i] > m)          # check the value of m
      <if (a[i] > m)      # recheck the value of m
         m = a[i];>
The idea is first to do the comparison, and then if it is true, to double check before doing the update. This may seem like a lot of wasted effort, but in fact often it is not. Once some process has updated m, it is probable that half the other processes will find their value of a to be less than the new value of m; hence, those processes will not execute the body of the if statement. After a further update, even fewer processes would find the first check to be true. Thus, if the checks themselves occur somewhat at random rather than concurrently, it is increasingly likely that processes will not have to do the second check.

This specific problem is not one that is going to benefit from being solved by a concurrent program, unless the program is executed on a SIMD machine, which is built to execute fine-grained programs efficiently. However, this section has made three key points. First, synchronization is required to get correct answers whenever processes both read and write shared variables. Second, angle brackets are used to specify actions that are to be atomic; we explore this topic in detail in the next section and later show how to implement atomic actions, which are in fact instances of critical sections. Third, the technique of double checking
before updating a shared variable is quite useful-as we shall see in later examples-especially if it is possible that the first check is false and hence that the second check is not needed.
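To make the double-check idiom concrete, here is a possible C/Pthreads sketch (ours, not the text's) in which a mutex plays the role of the angle brackets: each worker first compares a[i] to m without the lock and acquires the lock, and rechecks, only when an update appears necessary. Strictly speaking, the unlocked first read of m is a data race under the C standard's memory model; a production program would read m atomically, but the plain read mirrors the machine model assumed in the text.

/* Parallel maximum with the check / lock / recheck idiom:
   one thread per element, a mutex makes the recheck-and-update atomic. */
#include <pthread.h>
#include <stdio.h>

#define N 8

static int a[N] = {3, 17, 4, 9, 12, 5, 17, 1};   /* all positive, as assumed */
static int m = 0;
static pthread_mutex_t m_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int i = *(int *)arg;
    if (a[i] > m) {                  /* check the value of m (unlocked read) */
        pthread_mutex_lock(&m_lock);
        if (a[i] > m)                /* recheck the value of m */
            m = a[i];
        pthread_mutex_unlock(&m_lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[N];
    int idx[N];
    for (int i = 0; i < N; i++) {
        idx[i] = i;
        pthread_create(&t[i], NULL, worker, &idx[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    printf("maximum = %d\n", m);     /* prints 17 for the data above */
    return 0;
}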
2.4 Atomic Actions and Await Statements

As mentioned earlier, we can view execution of a concurrent program as an interleaving of the atomic actions executed by individual processes. When processes interact, not all interleavings are likely to be acceptable. The role of synchronization is to prevent undesirable interleavings. This is done by combining fine-grained atomic actions into coarse-grained (composite) actions or by delaying process execution until the program state satisfies some predicate. The first form of synchronization is called mutual exclusion; the second, condition synchronization. This section examines aspects of atomic actions and presents a notation for specifying synchronization.
2.4.1 Fine-Grained Atomicity

Recall that an atomic action makes an indivisible state transformation. This means that any intermediate state that might exist in the implementation of the action must not be visible to other processes. A fine-grained atomic action is one that is implemented directly by the hardware on which a concurrent program executes.

In a sequential program, assignment statements appear to be atomic since no intermediate state is visible to the program (except possibly if there is a machine-detected fault). However, this is not generally the case in concurrent programs since an assignment statement might be implemented by a sequence of fine-grained machine instructions. For example, consider the following program, and assume that the fine-grained atomic actions are reading and writing the variables:

int y = 0, z = 0;
co x = y+z;  //  y = 1; z = 2;  oc;
If x = y+z is implemented by loading a register with y and then adding z to it, the final value of x could be 0, 1, 2, or 3. This is because we could see the initial values for y and z, their final values, or some combination, depending on how far the second process has executed. A further peculiarity of the above program is that the final value of x could be 2, even though one could never stop the program and see a state in which y+z is 2.
We assume that machines have the following, realistic characteristics:

- Values of the basic types (e.g., int) are stored in memory elements (e.g., words) that are read and written as atomic actions.

- Values are manipulated by loading them into registers, operating on them there, then storing the results back into memory.

- Each process has its own set of registers. This is realized either by having distinct sets of registers or by saving and restoring register values whenever a different process is executed. (This is called a context switch since the registers constitute the execution context of a process.)

- Any intermediate results that occur when a complex expression is evaluated are stored in registers or in memory private to the executing process-e.g.,
on a private stack.

With this machine model, if an expression e in one process does not reference a variable altered by another process, expression evaluation will appear to be atomic, even if it requires executing several fine-grained atomic actions. This is because (1) none of the values on which e depends could possibly change while e is being evaluated, and (2) no other process can see any temporary values that might be created while the expression is being evaluated. Similarly, if an assignment x = e in one process does not reference any variable altered by another process-for example, it references only local variables-then execution of the assignment will appear to be atomic.

Unfortunately, most statements in concurrent programs that reference shared variables do not meet the above disjointness requirement. However, a weaker requirement is often met.

(2.2) At-Most-Once Property. A critical reference in an expression is a reference to a variable that is changed by another process. Assume that any critical reference is to a simple variable that is stored in a memory element that is read and written atomically. An assignment statement x = e satisfies the at-most-once property if either (1) e contains at most one critical reference and x is not read by another process, or (2) e contains no critical references, in which case x may be read by other processes.
This is called the At-Most-Once Property because there can be at most one shared variable, and it can be referenced at most one time. A similar definition applies to expressions that are not in assignment statements. Such an expression satisfies the At-Most-Once Property if it contains no more than one critical reference.
If an assignment statement meets the requirements of the At-Most-Once Property, then execution of the assignment statement will appear to be atomic. This is because the one shared variable in the statement will be read or written just once. For example, if e contains no critical references and x is a simple variable that is read by other processes, they will not be able to tell whether the expression is evaluated atomically. Similarly, if e contains just one critical reference, the process executing the assignment will not be able to tell how that variable is updated; it will just see some legitimate value.

A few examples will help clarify the definition. Both assignments in the following program satisfy the property:
int x = 0, y = 0;
co x = x+1;  //  y = y+1;  oc;
There are no critical references in either process, so the final values of x and y are both 1. Both assignments in the following program also satisfy the property:

int x = 0, y = 0;
co x = y+1;  //  y = y+1;  oc;
The first process references y (one critical reference), but x is not read by the second process, and the second process has no critical references. The final value of x is either 1 or 2, and the final value of y is 1. The first process will see y either before or after it is incremented, but in a concurrent program it can never know which value it will see, because execution order is nondeterministic.

As a final example, neither assignment below satisfies the At-Most-Once Property:

int x = 0, y = 0;
co x = y+1;  //  y = x+1;  oc;
The expression in each process contains a critical reference, and each process assigns to a variable read by the other. Indeed, the final values of x and y could be 1 and 2, 2 and 1, or even 1 and 1 (if the processes read x and y before either assigns to them). However, since each assignment refers only once to only one variable altered by another process, the final values will be those that actually existed in some state. This contrasts with the earlier example in which y+z referred to two variables altered by another process.
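To see this nondeterminism in a real program, the sketch below (ours, not from the text) runs the two assignments as threads using C11 atomics, so each read and write of x and y is a single indivisible memory operation as the machine model assumes; repeated runs can print x and y as 1 and 2, 2 and 1, or 1 and 1.

/* Two threads executing x = y+1 and y = x+1 with atomic reads and
   writes of the shared variables; the outcome depends on interleaving. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x = 0, y = 0;

static void *p1(void *arg) {      /* x = y+1 */
    (void)arg;
    atomic_store(&x, atomic_load(&y) + 1);
    return NULL;
}

static void *p2(void *arg) {      /* y = x+1 */
    (void)arg;
    atomic_store(&y, atomic_load(&x) + 1);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %d, y = %d\n", atomic_load(&x), atomic_load(&y));
    return 0;
}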
2.4.2 Specifying Synchronization: The Await Statement

If an expression or assignment statement does not satisfy the At-Most-Once Property, we often need to have it executed atomically. More generally, we often need to execute sequences of statements as a single atomic action. In both cases, we need to use a synchronization mechanism to construct a coarse-grained atomic action, which is a sequence of fine-grained atomic actions that appears to be indivisible.

As a concrete example, suppose a database contains two values x and y, and that at all times x and y are to be the same in the sense that no process examining the database is ever to see a state in which x and y differ. Then, if a process alters x, it must also alter y as part of the same atomic action.

As a second example, suppose one process inserts elements on a queue represented as a linked list. Another process removes elements from the list, assuming there are elements on the list. One variable points to the head of the list and another points to the tail of the list. Inserting and removing elements requires manipulating two values; e.g., to insert an element, we have to change the link of the previous element so it points to the new element, and we have to change the tail variable so it points to the new element. If the list contains just one element, simultaneous insertion and removal can conflict, leaving the list in an unstable state. Thus, insertion and removal must be atomic actions. Furthermore, if the list is empty, we need to delay execution of a remove operation until an element has been inserted.

We will specify atomic actions by means of angle brackets < and >. For example, <e> indicates that expression e is to be evaluated atomically. We will specify synchronization by means of the await statement:

(2.3)    <await (B) S;>

Boolean expression B specifies a delay condition; S is a sequence of sequential statements that is guaranteed to terminate (e.g., a sequence of assignment statements). An await statement is enclosed in angle brackets to indicate that it is executed as an atomic action. In particular, B is guaranteed to be true when execution of S begins, and no internal state in S is visible to other processes. For example,

<await (s > 0) s = s-1;>
delays until s is positive, then decrements s. The value of s is guaranteed to be positive before s is decremented.

The await statement is a very powerful statement since it can be used to specify arbitrary, coarse-grained atomic actions. This makes it convenient for
expressing synchronization-and we will therefore use await to develop initial solutions to synchronization problems. This expressive power also makes await very expensive to implement in its most general form. However, as we shall see in this and the next several chapters, there are many special cases of await that can be implemented efficiently. For example, the last await statement above is an example of the P operation on semaphore s, a topic in Chapter 4.

The general form of the await statement specifies both mutual exclusion and condition synchronization. To specify only mutual exclusion, we will abbreviate an await statement as follows:

<S;>
For example, the following increments x and y atomically:

<x = x+1; y = y+1;>
The internal state-in which x has been incremented but y has not-is, by definition, not visible to other processes that reference x or y.

If S is a single assignment statement and meets the requirements of the At-Most-Once Property (2.2)-or if S is implemented by a single machine instruction-then S will be executed atomically; thus, <S;> has the same effect as S.

To specify only condition synchronization, we will abbreviate await as

<await (B);>
For example, the following delays the executing process until count > 0:

<await (count > 0);>

If B meets the requirements of the At-Most-Once Property, as in this example, then <await (B);> can be implemented as

while (not B);
This is an instance of what is called a spin loop. In particular, the while statement has an empty body, so it just spins until the condition B becomes true.

An unconditional atomic action is one that does not contain a delay condition B. Such an action can execute immediately, subject of course to the requirement that it execute atomically. Hardware-implemented (fine-grained) actions, expressions in angle brackets, and await statements in which the guard is the constant true or is omitted are all unconditional atomic actions.

A conditional atomic action is an await statement with a guard B. Such an action cannot execute until B is true. If B is false, it can only become true as
the result of actions taken by other processes. Thus a process waiting to execute a conditional atomic action could wait for an arbitrarily long time.
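For instance, when the delay condition reads a single shared variable, which is exactly the At-Most-Once situation described above, the spin loop can be written directly in C11 with atomics; this is our own illustration, not the book's notation.

/* Busy-wait rendering of <await (count > 0);> when count is a single
   shared variable that is read and written atomically. */
#include <stdatomic.h>

static atomic_int count = 0;

void await_positive(void) {            /* <await (count > 0);> */
    while (atomic_load(&count) <= 0)
        ;                              /* spin: empty loop body */
}

void deposit(void) {                   /* another process makes the condition true */
    atomic_fetch_add(&count, 1);
}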
2.5 Producer/Consumer Synchronization

The last solution in Section 2.2 to the problem of finding patterns in a file employs a producer process and a consumer process. In particular, the producer repeatedly reads input lines, determines those that contain the desired pattern, and passes them on to the consumer process. The consumer then outputs the lines it receives from the producer. Communication between the producer and consumer is conducted by means of a shared buffer. We left unspecified how to synchronize access to the buffer. We are now in a position to explain how to do so.

Here we solve a somewhat simpler producer/consumer problem: copying all elements of an array from the producer to the consumer. We leave to the reader the task of adapting this solution to the specific problem at the end of Section 2.2 (see the exercises at the end of this chapter).

Two processes are given: Producer and Consumer. The Producer has a local array a[n] of integers; the Consumer has a local array b[n] of integers. We assume that array a has been initialized. The goal is to copy the contents of a into b. Because the arrays are not shared, the processes have to use shared variables to communicate with each other. Let buf be a single shared integer that will serve as a communication buffer.

The Producer and Consumer have to alternate access to buf. To begin, the Producer deposits the first element of a in buf, then the Consumer fetches it, then the Producer deposits the second element of a, and so on. Let shared variables p and c count the number of items that have been deposited and fetched, respectively. Initially, these values are both zero. The synchronization requirements between the producer and consumer can then be expressed by the following predicate:
PC: c <= p <= c+1

The precondition simplifies to the predicate true, which characterizes all states. Thus, this triple says that no matter what the starting state, when we assign 1 to x we get a state that satisfies x == 1.

The more common way to view assignment is by "going forward." In particular, start with a predicate that characterizes what is true of the current state,
Composition Rule:       {P} S1 {Q},  {Q} S2 {R}
                        ------------------------------
                        {P} S1; S2 {R}

If Statement Rule:      {P ∧ B} S {Q},  (P ∧ ¬B) ⇒ Q
                        ------------------------------
                        {P} if (B) S; {Q}

While Statement Rule:   {I ∧ B} S {I}
                        ------------------------------
                        {I} while (B) S; {I ∧ ¬B}

Rule of Consequence:    P' ⇒ P,  {P} S {Q},  Q ⇒ Q'
                        ------------------------------
                        {P'} S {Q'}

Figure 2.3   Inference rules in programming logic PL.
and then produce the predicate that is true of the state after the assignment takes place. For example, if we start in a state in which x == 0 and add 1 to x, then x will be 1 in the resulting state. This is captured by the triple:

{x == 0}  x = x+1;  {x == 1}
The Assignment Axiom describes how the state changes. The inference rules in a programming logic such as PL allow theorems resulting from instances of the Assignment Axiom to be combined. In particular, inference rules are used to characterize the effects of statement composition (statement lists) and of control statements such as if and while. They are also used to modify the predicates in triples.

Figure 2.3 gives four of the most important inference rules. The Composition Rule allows one to glue together the triples for two statements when the statements are executed one after the other. The first hypothesis in the If Statement Rule characterizes the effect of executing S when B is true; the second hypothesis characterizes what is true when B is false; the conclusion combines the two cases. As a simple example of the use of these two rules, the following program sets m to the maximum of x and y:

m = x;
if (y > m) m = y;
Whatever the initial state, the first assignment produces a state in which m == x. After the if statement has executed, m is equal to x and at least as large as y, or it is equal to y and greater than x.

The While Statement Rule requires a loop invariant I. This is a predicate that is true before and after each iteration of the loop. If I and loop condition B are true before the loop body S is executed, then execution of S must again make I true. Thus, when the loop terminates, I will still be true, but now B will be false. As an example, the following program searches array a for the first occurrence of the value of x. Assuming that x occurs in a, the loop terminates with variable i set to the index of the first occurrence.

i = 1;
{i == 1 ∧ (∀ j: 1 <= j < i: a[j] != x)}

When p == c+1, the buffer is full. Suppose the initial contents of a[n] is some collection of values A[n]. (The A[i] are called logical variables; they are simply placeholders for whatever the values actually are.) The goal is to prove that, upon termination of the above program, the contents of b[n] are the same as A[n], the values in array a. This goal can be proved by using the following global invariant:
PC: c <= p <= c+1 ∧ a[0:n-1] == A[0:n-1] ∧ (p == c+1) ⇒ (buf == A[p-1])

Since the processes alternate access to buf, at all times p is equal to c, or p is one more than c. Array a is not altered, so a[i] is always equal to A[i]. Finally, when the buffer is full (i.e., when p == c+1), it contains A[p-1].

Predicate PC is true initially, because both p and c are initially zero. It is maintained by every assignment statement, as illustrated by the proof outline in
Figure 2.4. In the figure, IP is an invariant for the loop in the Producer process, and IC is an invariant for the loop in the Consumer process. Predicates IP and IC are related to predicate PC as indicated.

Figure 2.4 is another example of a proof outline, because there is an assertion before and after every statement, and each triple in the proof outline is true. The triples in each process follow directly from the assignment statements in each process. Assuming each process continually gets a chance to execute, the
int buf, p = 0, c = 0;
{PC: c <= p <= c+1 ∧ a[0:n-1] == A[0:n-1] ∧ (p == c+1) ⇒ (buf == A[p-1])}

process Producer {
   int a[n];
   {IP: PC ∧ p <= n}
   while (p < n) {
      {IP ∧ p < n}
      <await (p == c);>
      {IP ∧ p < n ∧ p == c}
      buf = a[p];
      {IP ∧ p < n ∧ p == c ∧ buf == A[p]}
      p = p+1;
      {IP}
   }
   {IP ∧ p == n}
}

process Consumer {
   int b[n];
   {IC: PC ∧ c <= n ∧ b[0:c-1] == A[0:c-1]}
   while (c < n) {
      {IC ∧ c < n}
      <await (p > c);>
      {IC ∧ c < n ∧ p > c}
      b[c] = buf;
      {IC ∧ c < n ∧ p > c ∧ b[c] == A[c]}
      c = c+1;
      {IC}
   }
   {IC ∧ c == n}
}

Figure 2.4   Proof outline for the array copy program.
await statements terminate since first one guard is true, then the other, and so on. Hence, each process terminates after n iterations. When the program terminates, the postconditions of both processes are true. Hence the final program state satisfies the predicate:

PC ∧ p == n ∧ IC ∧ c == n
Consequently, array b contains a copy of array a.

The assertions in the two processes do not interfere with each other. Most of them are a combination of the global invariant PC and a local predicate. Hence they meet the requirements for noninterference described at the start of the section on Global Invariants. The four exceptions are the assertions that specify relations between the values of shared variables p and c. These are not interfered with because of the await statements in the program.

The role of the await statements in the array copy program is to ensure that the producer and consumer processes alternate access to the buffer. This plays two roles with respect to avoiding interference. First, it ensures that the processes cannot access buf at the same time; this is an instance of mutual exclusion. Second, it ensures that the producer does not overwrite items (overflow) and that the consumer does not read an item twice (underflow); this is an instance of condition synchronization.

To summarize, this example-even though it is simple-illustrates all four techniques for avoiding interference. First, many of the statements and many parts of the assertions in each process are disjoint. Second, we use weakened assertions about the values of the shared variables; for example, we say that buf == A[p-1], but only if p == c+1. Third, we use the global invariant PC to express the relationship between the values of the shared variables; even though each variable changes as the program executes, this relationship does not change. Finally, we use synchronization-expressed using await statements-to ensure the mutual exclusion and condition synchronization required for this program.
2.8 Safety and Liveness Properties

Recall from Section 2.1 that a property of a program is an attribute that is true of every possible history of that program. Every interesting property can be formulated as safety or liveness. A safety property asserts that nothing bad happens during execution; a liveness property asserts that something good eventually happens. In sequential programs, the key safety property is that the final state is correct, and the key liveness property is termination. These properties are equally important for concurrent programs. In addition, there are other interesting safety and liveness properties that apply to concurrent programs.
Two important safety properties in concurrent programs are mutual exclusion and absence of deadlock. For mutual exclusion, the bad thing is having more than one process executing critical sections of statements at the same time. For deadlock, the bad thing is having all processes waiting for conditions that will never occur.

Examples of liveness properties of concurrent programs are that a process will eventually get to enter a critical section, that a request for service will eventually be honored, or that a message will eventually reach its destination. Liveness properties are affected by scheduling policies, which determine which eligible atomic actions are next to execute.

In this section we describe two methods for proving safety properties. Then we describe different kinds of processor scheduling policies and how they affect liveness properties.
2.8.1 Proving Safety Properties

Every action a program takes is based on its state. If a program fails to satisfy a safety property, there must be some "bad" state that fails to satisfy the property. For example, if the mutual exclusion property fails to hold, there must be some state in which two (or more) processes are simultaneously in their critical sections. Or if processes deadlock, there must be some state in which deadlock occurs.

These observations lead to a simple method for proving that a program satisfies a safety property. Let BAD be a predicate that characterizes a bad program state. Then a program satisfies the associated safety property if BAD is false in every state in every possible history of the program. Given program S, to show that BAD is not true in any state requires showing that it is not true in the initial state, the second state, and so on, where the state is changed as a result of executing atomic actions.

Alternatively, and more powerfully, if a program is never to be in a BAD state, then it must always be in a GOOD state, where GOOD is equivalent to ¬BAD. Hence, an effective way to ensure a safety property is to specify BAD, then negate BAD to yield GOOD, then ensure that GOOD is a global invariant-a predicate that is true in every program state. Synchronization can be used-as we have seen and will see many more times in later chapters-to ensure that a predicate is a global invariant.

The above is a general method for proving safety properties. There is a related, but somewhat more specialized method that is also very useful. Consider the following program fragment:
co  # process 1
    ...;  {pre(S1)}  S1;  ...
//  # process 2
    ...;  {pre(S2)}  S2;  ...
oc
There are two statements, one per process, and two associated preconditions (predicates that are true before each statement is executed). Assume that the predicates are not interfered with. Now, suppose that the conjunction of the preconditions is false:

pre(S1) ∧ pre(S2) == false
This means that the two processes cannot be at these statements at the same time! This is because a predicate that is false characterizes no program state (the empty set of states, if you will). This method is called exclusion of configurations, because it excludes the program configuration in which the first process is in state pre(S1) at the same time that the second process is in state pre(S2).

As an example, consider the proof outline of the array copy program in Figure 2.4. The await statement in each process can cause delay. The processes would deadlock if they were both delayed and neither could proceed. The Producer process is delayed if it is at its await statement and the delay condition is false; in that state, the following predicate is true:

IP ∧ p < n ∧ p != c
Hence, p > c when the Producer is delayed. Similarly, the Consumer process is delayed if it is at its await statement and the delay condition is false; that state satisfies

IC ∧ c < n ∧ p <= c
Because p > c and p <= c cannot be true at the same time, the processes cannot simultaneously be in these states. Hence, deadlock cannot occur.
2.8.2 Scheduling Policies and Fairness

Most liveness properties depend on fairness, which is concerned with guaranteeing that processes get the chance to proceed, regardless of what other processes do. Each process executes a sequence of atomic actions. An atomic action in a process is eligible if it is the next atomic action in the process that could be executed. When there are several processes, there are several eligible atomic
actions. A scheduling policy determines which one will be executed next. This section defines three degrees of fairness that a scheduling policy might provide.

Recall that an unconditional atomic action is one that does not have a delay condition. Consider the following simple program, in which the processes execute unconditional atomic actions:

bool continue = true;
co while (continue);
// continue = false;
oc
Suppose a scheduling policy assigns a processor to a process until that process either terminates or delays. If there is only one processor, the above program will not terminate if the first process is executed first. However, the program will terminate if the second process eventually gets a chance to execute. This is captured by the following definition.
(2.6) Unconditional Fairness. A scheduling policy is unconditionally fair if every unconditional atomic action that is eligible is executed eventually.

For the above program, round-robin would be an unconditionally fair scheduling policy on a single processor, and parallel execution would be an unconditionally fair policy on a multiprocessor.

When a program contains conditional atomic actions-await statements with Boolean conditions B-we need to make stronger assumptions to guarantee that processes will make progress. This is because a conditional atomic action cannot be executed until B is true.
(2.7) Weak Fairness. A scheduling policy is weakly fair if (1) it is unconditionally fair, and (2) every conditional atomic action that is eligible is executed eventually, assuming that its condition becomes true and then remains true until it is seen by the process executing the conditional atomic action.

In short, if <await (B) S;> is eligible and B becomes true, then B remains true at least until after the conditional atomic action has been executed.

Round-robin and time slicing are weakly fair scheduling policies if every process gets a chance to execute. This is because any delayed process will eventually see that its delay condition is true. Weak fairness is not, however, sufficient to ensure that any eligible await statement eventually executes. This is because the condition might change value-from false to true and back to false-while a process is delayed. In this case, we need a stronger scheduling policy.
(2.8) Strong Fairness. A scheduling policy is strongly fair if (1) it is unconditionally fair, and (2) every conditional atomic action that is eligible is executed eventually, assuming that its condition is infinitely often true.
A condition is infinitely often true if it is true an infinite number of times in every execution history of a (nonterminating) program. To be strongly fair, a scheduling policy cannot happen only to select an action when the condition is false; it must sometimes select the action when the condition is true.

To see the difference between weak and strong fairness, consider

bool continue = true, try = false;
co while (continue) { try = true; try = false; }
// <await (try) continue = false;>
oc
With a strongly fair policy, this program will eventually terminate, because try is infinitely often true. However, with a weakly fair policy, the program might not terminate, because try is also infinitely often false.

Unfortunately, it is impossible to devise a processor scheduling policy that is both practical and strongly fair. Consider the above program again. On a single processor, a scheduler that alternates the actions of the two processes would be strongly fair since the second process would see a state in which try is true; however, such a scheduler is impractical to implement. Round-robin and time slicing are practical, but they are not strongly fair in general because processes execute in unpredictable orders. A multiprocessor scheduler that executes the processes in parallel is also practical, but it too is not strongly fair. This is because the second process might always examine try when it is false. This is unlikely, of course, but it is theoretically possible.

To further clarify the different kinds of scheduling policies, consider again the array copy program in Figures 2.2 and 2.4. As noted earlier, that program is deadlock-free. Thus, the program will terminate as long as each process continues to get a chance to make progress. Each process will make progress as long as the scheduling policy is weakly fair. This is because, when one process makes the delay condition of the other true, that condition remains true until the other process continues and changes shared variables.

Both await statements in the array copy program have the form <await (B);>, and B refers to only one variable altered by the other process. Consequently, both await statements can be implemented by busy-waiting loops. For example, <await (p == c);> in the Producer can be implemented by

while (p != c);
The program will terminate if the scheduling policy is unconditionally fair, because now there are no conditional atomic actions and the processes alternate access to the shared buffer. It is not generally the case, however, that an unconditionally fair scheduling policy will ensure termination of a busy-waiting loop. This is because an unconditionally fair policy might always schedule the atomic action that examines the loop condition when the condition is true, as shown in the example program above.

If all busy-waiting loops in a program spin forever, a program is said to suffer from livelock-the program is alive, but the processes are not going anywhere. Livelock is the busy-waiting analog of deadlock, and absence of livelock, like absence of deadlock, is a safety property; the bad state is one in which every process is spinning and none of the delay conditions is true. On the other hand, progress for any one of the processes is a liveness property; the good thing being that the spin loop of an individual process eventually terminates.
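For concreteness, here is a possible C11 sketch (ours, not the book's Figure 2.2) of the array copy program with both await statements replaced by the busy-waiting loops just described. The counters p and c are atomic so that each read and write of them is indivisible, and the default sequentially consistent atomic operations also order the accesses to buf correctly.

/* Array copy with a one-element buffer and busy waiting:
   the Producer spins until p == c, the Consumer until p > c. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 5

static int a[N] = {10, 20, 30, 40, 50};   /* Producer's data */
static int b[N];                          /* Consumer's copy */
static int buf;                           /* shared one-element buffer */
static atomic_int p = 0, c = 0;           /* items deposited / fetched */

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++) {
        while (atomic_load(&p) != atomic_load(&c))
            ;                             /* <await (p == c);> */
        buf = a[i];
        atomic_fetch_add(&p, 1);
    }
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++) {
        while (atomic_load(&p) <= atomic_load(&c))
            ;                             /* <await (p > c);> */
        b[i] = buf;
        atomic_fetch_add(&c, 1);
    }
    return NULL;
}

int main(void) {
    pthread_t tp, tc;
    pthread_create(&tp, NULL, producer, NULL);
    pthread_create(&tc, NULL, consumer, NULL);
    pthread_join(tp, NULL);
    pthread_join(tc, NULL);
    for (int i = 0; i < N; i++)
        printf("%d ", b[i]);              /* prints 10 20 30 40 50 */
    printf("\n");
    return 0;
}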
Historical Notes

One of the first, and most influential, papers on concurrent programming was by Edsger Dijkstra [1965]. That paper introduced the critical section problem and the parbegin statement, which later became called the cobegin statement. Our co statement is a generalization of the cobegin statement. Another Dijkstra paper [1968] introduced the producer/consumer and dining philosophers problems, as well as others that we examine in the next few chapters.

Art Bernstein [1966] was the first to specify the conditions that are sufficient to ensure that two processes are independent and hence can be executed in parallel. Bernstein's conditions, as they are still called, are expressed in terms of the input and output sets of each process; the input set contains variables read by a process, and the output set contains variables written by a process. Bernstein's three conditions for independence of two processes are that the intersections of the (input, output), (output, input), and (output, output) sets are all empty. Bernstein's conditions also provide the basis for the data dependency analysis done by parallelizing compilers, a topic we describe in Chapter 12. Our definition of independence (2.1) employs read and write sets for each process, and a variable is in exactly one set for each process. This leads to a simpler, but equivalent, condition.

Most operating systems texts show how implementing assignment statements using registers and fine-grained atomic actions leads to the critical section problem in concurrent programs. Modern hardware ensures that there is some base level of atomicity, usually a word, for reading and writing memory. Lamport [1977b] contains an interesting discussion of how to implement atomic reads and writes of "words" if only single bytes can be read and written atomically; his
solution does not require using a locking mechanism for mutual exclusion. (See Peterson [1983] for more recent results.)

The angle bracket notation for specifying coarse-grained atomic actions was also invented by Leslie Lamport. However, it was popularized by Dijkstra [1977]. Owicki and Gries [1976a] used await B then S end to specify conditional atomic actions. The specific notation used in this book combines angle brackets and a variant of await. The terms unconditional and conditional atomic action are due to Fred Schneider [Schneider and Andrews 1986].

There is a wealth of material on logical systems for proving properties of sequential and concurrent programs. Schneider [1997] provides excellent coverage of all the topics summarized in Sections 2.6 through 2.8, as well as many more, and gives extensive historical notes and references. The actual content of these sections was condensed and revised from Chapters 1 and 2 of my prior book [Andrews 1991]. That material, in turn, was based on earlier work [Schneider and Andrews 1986]. A brief history of formal logical systems and their application to program verification is given below; for more information and references, consult Schneider [1997] or Andrews [1991].

Formal logic is concerned with the formalization and analysis of systematic reasoning methods. Its origins go back to the ancient Greeks, and for the next two millennia logic was of interest mostly to philosophers. However, the discovery of non-Euclidean geometries in the nineteenth century spurred renewed, widespread interest among mathematicians. This led to a systematic study of mathematics itself as a formal logical system and hence gave rise to the field of mathematical logic, which is also called metamathematics. Indeed, between 1910 and 1913, Alfred North Whitehead and Bertrand Russell published the voluminous Principia Mathematica, which presented what was claimed to be a system for deriving all of mathematics from logic. Shortly thereafter, David Hilbert set out to prove rigorously that the system presented in Principia Mathematica was both sound and complete. However, in 1931 Kurt Gödel demonstrated that there were valid statements that did not have a proof in that system or in any similar axiomatic system. Gödel's incompleteness result placed limitations on formal logical systems, but it certainly did not stop interest in or work on logic. Quite the contrary. Proving properties of programs is just one of many applications. Formal logic is covered in all standard textbooks on mathematical logic or metamathematics. A lighthearted, entertaining introduction to the topic can be found in Douglas Hofstadter's Pulitzer Prize-winning book Gödel, Escher, Bach: An Eternal Golden Braid [Hofstadter 1979], which explains Gödel's result and also describes how logic is related to computability and artificial intelligence.

Robert Floyd [1967] is generally credited with being the first to propose a technique for proving that programs are correct. His method involves associating
a predicate with each arc in a flowchart in such a way that, if the arc is traversed, the predicate is true. Inspired by Floyd's work, Hoare [1969] developed the first formal logic for proving partial correctness properties of sequential programs. Hoare introduced the concept of a triple, the interpretation for triples, and axioms and inference rules for sequential statements (in his case, a subset of Algol). Any logical system for sequential programming that is based on this style has since come to be called a "Hoare Logic." Programming logic PL is an example of a Hoare Logic.

The first work on proving properties of concurrent programs was also based on a flowchart representation [Ashcroft and Manna 1971]. At about the same time, Hoare [1972] extended his partial correctness logic for sequential programs to include inference rules for concurrency and synchronization, with synchronization specified by conditional critical regions (which are similar to await statements). However, that logic was incomplete since assertions in one process could not reference variables in another, and global invariants could not reference local variables.

Susan Owicki, in a dissertation supervised by David Gries, was the first to develop a complete logic for proving partial correctness properties of concurrent programs [Owicki 1975; Owicki and Gries 1976a, 1976b]. The logic covered cobegin statements and synchronization by means of shared variables, semaphores, or conditional critical regions. That work described the at-most-once property (2.2), introduced and formalized the concept of interference freedom, and illustrated several of the techniques for avoiding interference described in Section 2.7. On a personal note, the author had the good fortune to be at Cornell while this research was being done.

The Owicki-Gries work addresses only three safety properties: partial correctness, mutual exclusion, and absence of deadlock. Leslie Lamport [1977a] independently developed an idea similar to interference freedom, monotone assertions, as part of a general method for proving both safety and liveness properties. His paper also introduced the terms safety and liveness. The method in Section 2.8 for proving a safety property using a global invariant is based on Lamport's method. The method of exclusion of configurations is due to Schneider [Schneider and Andrews 1986]; it is in turn a generalization of the Owicki-Gries method.

Francez [1986] contains a thorough discussion of fairness and its relation to termination, synchronization, and guard evaluation. The terminology for unconditionally fair, weakly fair, and strongly fair scheduling policies comes from that book, which contains an extensive bibliography. These scheduling policies were first defined and formalized in Lehman et al. [1981], although with somewhat different terminology.
Programming logic (PL) is a logic for proving safety properties. One way to construct formal proofs of liveness properties is to extend PL with two temporal operators: henceforth and eventually. These enable one to make assertions about sequences of states, and hence about the future. Such a temporal logic was first introduced in Pnueli [1977]. Owicki and Lamport [1982] show how to use temporal logic and invariants to prove liveness properties of concurrent programs. Again, see Schneider [1997] for an overview of temporal logic and how to use it to prove liveness properties.
References

Andrews, G. R. 1991. Concurrent Programming: Principles and Practice. Menlo Park, CA: Benjamin/Cummings.

Ashcroft, E., and Z. Manna. 1971. Formalization of properties of parallel programs. Machine Intelligence 6: 17-41.

Bernstein, A. J. 1966. Analysis of programs for parallel processing. IEEE Trans. on Computers EC-15, 5 (October): 757-62.

Dijkstra, E. W. 1965. Solution of a problem in concurrent programming control. Comm. ACM 8, 9 (September): 569.

Dijkstra, E. W. 1968. Cooperating sequential processes. In F. Genuys, ed., Programming Languages. New York: Academic Press, pp. 43-112.

Dijkstra, E. W. 1977. On two beautiful solutions designed by Martin Rem. EWD 629. Reprinted in E. W. Dijkstra, Selected Writings on Computing: A Personal Perspective. New York: Springer-Verlag, 1982, pp. 313-18.

Floyd, R. W. 1967. Assigning meanings to programs. Proc. Amer. Math. Society Symp. in Applied Mathematics 19: 19-31.

Francez, N. 1986. Fairness. New York: Springer-Verlag.

Hoare, C. A. R. 1969. An axiomatic basis for computer programming. Comm. ACM 12, 10 (October): 576-80, 583.

Hoare, C. A. R. 1972. Towards a theory of parallel programming. In C. A. R. Hoare and R. H. Perrott, eds., Operating Systems Techniques. New York: Academic Press.

Hofstadter, D. J. 1979. Gödel, Escher, Bach: An Eternal Golden Braid. New York: Vintage Books.

Lamport, L. 1977a. Proving the correctness of multiprocess programs. IEEE Trans. on Software Engr. SE-3, 2 (March): 125-43.
Lamport, L. 1977b. Concurrent reading and writing. Comm. ACM 20, 11 (November): 806-11.

Lehman, D., A. Pnueli, and J. Stavi. 1981. Impartiality, justice, and fairness: The ethics of concurrent termination. Proc. Eighth Colloq. on Automata, Languages, and Programming, Lecture Notes in Computer Science Vol. 115. New York: Springer-Verlag, 264-77.

Owicki, S. S. 1975. Axiomatic proof techniques for parallel programs. TR 75-251. Doctoral dissertation, Ithaca, NY: Cornell University.

Owicki, S. S., and D. Gries. 1976a. An axiomatic proof technique for parallel programs. Acta Informatica 6: 319-40.

Owicki, S. S., and D. Gries. 1976b. Verifying properties of parallel programs: An axiomatic approach. Comm. ACM 19, 5 (May): 279-85.

Owicki, S., and L. Lamport. 1982. Proving liveness properties of concurrent programs. ACM Trans. on Prog. Languages and Systems 4, 3 (July): 455-95.

Peterson, G. L. 1983. Concurrent reading while writing. ACM Trans. on Prog. Languages and Systems 5, 1 (January): 46-55.

Pnueli, A. 1977. The temporal logic of programs. Proc. 18th Symp. on the Foundations of Computer Science, November, 46-57.

Schneider, F. B. 1997. On Concurrent Programming. New York: Springer.

Schneider, F. B., and G. R. Andrews. 1986. Concepts for concurrent programming. In Current Trends in Concurrency, Lecture Notes in Computer Science Vol. 224. New York: Springer-Verlag, 669-716.
Exercises

2.1 Consider the outline of the program in Figure 2.1 that prints all the lines in a file that contain pattern.

(a) Develop the missing code for synchronizing access to buffer. Use the await statement to program the synchronization code.

(b) Extend your program so that it reads two files and prints all the lines that contain pattern. Identify the independent activities and use a separate process for each. Show all synchronization code that is required.

2.2 Consider the solution to the array copy problem in Figure 2.2. Modify the code so that p is local to the producer process and c is local to the consumer. Hence, those variables cannot be used to synchronize access to buf. Instead, use two
new Boolean-valued shared variables, empty and full, to synchronize the two processes. Initially, empty is true and full is false. Give the new code for the producer and consumer processes. Use await statements to program the synchronization.

2.3 The Unix tee command is invoked by executing

    tee filename

The command reads the standard input and writes it to both the standard output and to file filename. In short, it produces two copies of the input.

(a) Write a sequential program to implement this command.

(b) Parallelize your sequential program to use three processes: one to read from standard input, one to write to standard output, and one to write to file filename. Use the "co inside while" style of program.

(c) Change your answer to (b) so that it uses the "while inside co" style. In particular, create the processes once. Use double buffering so that you can read and write in parallel. Use the await statement to synchronize access to the buffers.

2.4 Consider a simplified version of the Unix diff command for comparing two text files. The new command is invoked as

    differ filename1 filename2

It examines the two files and prints out all lines in the two files that are different. In particular, for each pair of lines that differ, the command writes two lines to standard output:

    lineNumber: line from file 1
    lineNumber: line from file 2

If one file is longer than the other, the command also writes one line of output for each extra line in the longer file.

(a) Write a sequential program to implement differ.

(b) Parallelize your sequential program to use three processes: two to read the files, and one to write to standard output. Use the "co inside while" style of program.

(c) Change your answer to (b) so that it uses the "while inside co" style. In particular, create the processes once. Use double buffering for each file, so that you can read and write in parallel. Use the await statement to synchronize access to the buffers.
2.5 Given integer arrays a[1:m] and b[1:n], assume that each array is sorted in ascending order and that the values in each array are distinct.

(a) Develop a sequential program to compute the number of different values that appear in both a and b.

(b) Identify the independent operations in your sequential program and then modify the program to execute the independent operations in parallel. Store the answer in a shared variable. Use the co statement to specify concurrency and use the await statement to specify any synchronization that might be required.

2.6 Assume you have a tree that is represented using a linked structure. In particular, each node of the tree is a structure (record) that has three fields: a value and pointers to the left and right subtrees. Assume a null pointer is represented by the constant null. Write a recursive parallel program to compute the sum of the values of all nodes in the tree. The total execution time of the computation should be on the order of the height of the tree.

2.7 Assume that the integer array a[1:n] has been initialized.

(a) Write an iterative parallel program to compute the sum of the elements of a using PR processes. Each process should work on a strip of the array. Assume that PR is a factor of n.

(b) Write a recursive parallel program to compute the sum of the elements of the array. Use a divide-and-conquer strategy to cut the size of the problem in half for each recursive step. Stop recursing when the problem size is less than or equal to some threshold T. Use the sequential iterative algorithm for the base case.

2.8 A queue is often represented using a linked list. Assume that two variables, head and tail, point to the first and last elements of the list. Each element contains a data field and a link to the next element. Assume that a null link is represented by the constant null.

(a) Write routines to (1) search the list for the first element (if any) that contains a given data value, (2) insert a new element at the end of the list, and (3) delete the element from the front of the list. The search and delete routines should return null if they cannot succeed.

(b) Now assume that several processes access the linked list. Identify the read and write sets of each routine, as defined in (2.1). Which combinations of routines can be executed in parallel? Which combinations of routines must execute one at a time (i.e., atomically)?

(c) Add synchronization code to the three routines to enforce the synchronization you identified in your answer to (b). Make your atomic actions as small as
possible, and do not delay a routine unnecessarily. Use the await statement to program the synchronization code.

2.9 Consider the code fragment in Section 2.6 that sets m to the maximum of x and y. The triple for the if statement in that code fragment uses the If Statement Rule from Figure 2.3. What are the predicates P, Q, and B in this application of the If Statement Rule?

2.10 Consider the following program:

    int x = 0, y = 0;
    co x = x + 1; x = x + 2;
    // x = x + 2; y = y - x;
    oc

(a) Suppose each assignment statement is implemented by a single machine instruction and hence is atomic. How many possible histories are there? What are the possible final values of x and y?

(b) Suppose each assignment statement is implemented by three atomic actions that load a register, add or subtract a value from that register, then store the result. How many possible histories are there now? What are the possible final values of x and y?

2.11 Consider the following program:

    int u = 0, v = 1, w = 2, x;
    co x = u + v + w;
    // u = 3;
    // v = 4;
    // w = 5;
    oc

Assume that the atomic actions are reading and writing individual variables.

(a) What are the possible final values of x, assuming that expression u + v + w is evaluated left to right?

(b) What are the possible final values of x if expression u + v + w can be evaluated in any order?

2.12 Consider the following program:

    int x = 2, y = 3;
    co ⟨x = x + y;⟩
    // ⟨y = x * y;⟩
    oc

(a) What are the possible final values of x and y?
(b) Suppose the angle brackets are removed and each assignment statement is now implemented by three atomic actions: read a variable, add or multiply, and write to a variable. Now what are the possible final values of x and y?

2.13 Consider the following three statements:

    S1: x = x + y;
    S2: y = x - y;
    S3: x = x - y;

Assume that x is initially 2 and that y is initially 5. For each of the following, what are the possible final values of x and y? Explain your answers.

(a) S1; S2; S3;

2.14 Consider the following program:

    int x = 1, y = 1;
    co ⟨x = x + y;⟩
    // y = 0;
    // x = x - y;
    oc

(a) Does the program meet the requirements of the At-Most-Once Property (2.2)? Explain.

(b) What are the final values of x and y? Explain your answer.

2.15 Consider the following program:

    int x = 0, y = 10;
    co while (x != y) x = x + 1;
    // while (x != y) y = y - 1;
    oc

(a) Does the program meet the requirements of the At-Most-Once Property (2.2)? Explain.

(b) Will the program terminate? Always? Sometimes? Never? Explain your answer.
2.16 Consider the following program:

    int x = 0;
    co ⟨await (x != 0) x = x - 2;⟩
    // ⟨await (x != 0) x = x - 3;⟩
    // ⟨await (x == 0) x = x + 5;⟩
    oc

Develop a proof outline that demonstrates that the final value of x is 0. Use the technique of weakened assertions. Identify which assertions are critical assertions, as defined in (2.5), and show that they are not interfered with.

2.17 Consider the following program:

    co ⟨await (x >= 3) x = x - 3;⟩
    // ⟨await (x >= 2) x = x - 2;⟩
    // ⟨await (x == 1) x = x + 5;⟩
    oc

For what initial values of x does the program terminate, assuming scheduling is weakly fair? What are the corresponding final values? Explain your answer.

2.18 Consider the following program:

    co ⟨await (x > 0) x = x - 1;⟩
    // ⟨await (x < 0) x = x + 2;⟩
    // ⟨await (x == 0) x = x - 1;⟩
    oc

For what initial values of x does the program terminate, assuming scheduling is weakly fair? What are the corresponding final values? Explain your answer.

2.19 Consider the following program:

    int x = 10, y = 0;
    co while (x != y) { x = x - 1; y = y + 1; }
    // ⟨await (x == y);⟩ x = 8; y = 2;
    oc

Explain what it takes for the program to terminate. When the program does terminate, what are the final values for x and y?

2.20 Let a[1:m] and b[1:n] be integer arrays, m > 0 and n > 0. Write predicates to express the following properties.

(a) All elements of a are less than all elements of b.

(b) Either a or b contains a single zero, but not both.
(c) It is not the case that both a and b contain zeros.

(d) The values in b are the same as the values in a, except they are in the reverse order. (Assume for this part that m == n.)

(e) Every element of a is an element of b.

(f) Some element of a is larger than some element of b, and vice versa.

2.21 The if-then-else statement has the form

    if (B) S1; else S2;

If B is true, S1 is executed; otherwise, S2 is executed. Give an inference rule for this statement. Look at the If Statement Rule for ideas.

2.22 Consider the following for statement:

    for [i = 1 to n]
      S;

Rewrite the statement using a while loop plus explicit assignment statements to the quantifier variable i. Then use the Assignment Axiom and While Statement Rule (Figure 2.3) to develop an inference rule for such a for statement.

2.23 The repeat statement

    repeat
      S;
    until (B);

repeatedly executes statement S until Boolean expression B is true at the end of some iteration.

(a) Develop an inference rule for repeat.

(b) Using your answer to (a), develop a proof outline that demonstrates that repeat is equivalent to

    S;
    while (!B) S;

2.24 Consider the following precondition and assignment statement.
I > I
For each of the following triples, show whether the above statement interferes with the triple.

(a) {x >= 0} ⟨x = x + 5;⟩ {x >= 5}

(b) {x >= 0} ⟨x = x + 5;⟩ {x >= 0}

(c) {x >= 10} ⟨x = x + 5;⟩ {x >= 11}

(d) {x >= 10} ⟨x = x + 5;⟩ {x >= 12}

(g) {y is odd} ⟨y = y + 1;⟩ {y is even}

(h) {x is a multiple of 3} y = x; {y is a multiple of 3}
2.25 Consider the following program:

    int x = v1, y = v2;
    x = x + y; y = x - y; x = x - y;

Add assertions before and after each statement to characterize the effects of this program. In particular, what are the final values of x and y?

2.26 Consider the following program:

    int x, y;
    co x = x - 1; x = x + 1;
    // y = y + 1; y = y - 1;
    oc

Show that {x == y} S {x == y} is a theorem, where S is the co statement. Show all noninterference proofs in detail.

2.27 Consider the following program:

    int x = 0;
    co ⟨x = x + 2;⟩
    // ⟨x = x + 3;⟩
    // ⟨x = x + 4;⟩
    oc

Prove that {x == 0} S {x == 9} is a theorem, where S is the co statement. Use the technique of weakened assertions.

2.28 (a) Write a parallel program that sets Boolean variable allzero to true if integer array a[1:n] contains all zeros; otherwise, the program should set allzero to false. Use a co statement to examine all array elements in parallel.
(b) Develop a proof outline that shows that your solution is correct. Show that the processes are interference-free.

2.29 (a) Develop a program to find the maximum value in integer array a[1:n] by searching even and odd subscripts of a in parallel.

(b) Develop a proof outline that shows that your solution is correct. Show that the processes are interference-free.

2.30 You are given three integer-valued functions: f(i), g(j), and h(k). The domain of each function is the nonnegative integers. The range of each function is increasing; for example, f(i) < f(i+1) for all i. There is at least one value common to the ranges of the three functions. (This has been called the earliest common meeting time problem, with the ranges of the functions being the times at which three people can meet.)

(a) Write a concurrent program to set i, j, and k to the smallest integers such that f(i) == g(j) == h(k). Use co to do comparisons in parallel. (This program uses fine-grained parallelism.)

(b) Develop a proof outline that shows that your solution is correct. Show that the processes are interference-free.

2.31 Assume that the triples {P1} S1 {Q1} and {P2} S2 {Q2} are both true and that they are interference-free. Assume that S1 contains an await statement ⟨await (B) T;⟩. Let S1' be S1 with the await statement replaced by

    while (!B);
    T;

Answer the following as independent questions.

(a) Will {P1} S1' {Q1} still be true? Carefully explain your answer. Is it true always? Sometimes? Never?

(b) Will {P1} S1' {Q1} and {P2} S2 {Q2} still be interference-free? Again, carefully explain your answer. Will they be interference-free always? Sometimes? Never?

2.32 Consider the following concurrent program:

    int s = 1;
    process foo[i = 1 to 2] {
      while (true) {
        ⟨await (s > 0) s = s - 1;⟩
        Si;
        ⟨s = s + 1;⟩
      }
    }
Processes and Synchronization
Chapter 2
Assurne si is a statemen1 1is1 diat cloes not modify shared variable s. (a) Develop proof outlines for Lhe two processes. Detnonstrate thal the proofs of (he processes are interference-ll-ee. Then use t l ~ cproof outlj.ties and the method ol' exclusio~iof configurations it) Seclion 2.8 to show rhat s, and s, cannot execule a1 the same tj111eand tliat the program is deadlock-free. (Mint: You will need to introduce ausiliary variables to the program and proof outlines. These v;iriables should keep [rack of the location of each process.) (13)
What scheduling policy is required to ellsure that a process delayed at its first
await sralcment w j 11 evel~tuallybe able to pprocecd?
Explain.
2.33 Consider the following program: int x = 10, c
=
true;
co (await x == 0); c = false; / / while (c) ( x = x - 1); OC
(a) Wi l l the progl'aln terminate i.f scheduling is weakJg fair? Explain. (b) Will lhe program ter~niaateif scheduling is strongly fair? Explain. (c) Add h e follocving as ;I third arm while ( c ) [if
i
( X
< 0)
(X
oT the co statement: = 10);)
Repeat parts (a) and (b) for rhis three-process prograln. 2.34 The 8-queens problem is concerned with placing 8 queens on a chess board in such a way that no one queer) car1 attack another. One can atlack another jf both are in the same row or colurnn or are on the same diagonal. to LIK 8-queens Write a parallel recursive program to generate all 92 solurio~~s proble~n.(Hint: Use a recursive procedure to try queen placements and a second procedure to check whelher a giver) placerrlent is acceptable.)
2.35 The stable marriage problem is the following. Let Man [ 1 :n] and WomanCl :n] be al-rays ol processes. Each man ranks the woinen h o ~ n1 to n, and each woman ~-ailks(he Inen frorn 1 to n. (A tanking is a per~nuralionof tlie integers from 1 to n.) A pc~irirzg is a one-to-one correspondence of men and women. A pairing i s ,stczhle if, for Lwo rnen Man [ i l and Man [ j 1 and their paired women woman [pl and woman I s ] , bofli of the following condjtions are satisfied: I.
Man [ i ] ranks woman [ p J higher than Woman [ql , or Woman [ql ranks Man [ j I higher than Man [i]; and
2.
.
Man [ j 1 ranks woman [ql 11igIie1- than woman [p] or Woman [ p ] ranks Man [i) higher than Man [ j I .
Pur differently, a pairing is ~rnscablei f a ma) and wornan would boll1 prefer. each other lo I-heil-currenl pair. A solution to the stable marriage pro\>lemi s a set ot'n pairings, all of which are stable. (a) Give a predicate ihar speci hes rhe goal for the stable marriage problem. I
(b) Write a parallel program to solvc the slnble marriage problem. Be sure ro explain your bolution strategy. Also give pre- ant1 pos~conditionsFor each process and invariants for each loop.
Locks and Barriers
Recall chat concunrnt programs e~nploytwo basic kinds of sy~~chronizarion: murual exclusjorl and condition synchronization. This chapter examines two importan1 problems-cri tical sections and barriers-chat illustrate how to program these kinds of syuchronization. The critical section problem i s concerned with implementing atomic actions in software: the problem arises in mosl concurrenr programs. A barrier is a synchronization point that all processes must reach before any process js allowed ro proceed; barciers a e needed in Inany parallel programs. Mutual exclusion is typically implernenled by means of lock.^ lhal protect critical sectiolis of code. Section 3.1 cleGnes rlie critical section problem and presents a coarse-grained soli~tionthat uses the a w a i t scatemel~tto implemenl a lock. Seclion 3.2 develops fine-grained st~lutionsusing whac arc called spin locks. Section 3.3 presents three fail- solutions: the tie-breaker algorjrhm, the ticket algorithm, and Lhe balteiy olgoritbm. The val.ious solurions illustra~edifferent ways to approach the problem and have different performance and fairness attributes. The solutions to the critical section proble~nare also impo~t;int, because they call be used to implement await staleinents and hence arbitrary ato~rlicactions. We show how to do this at the end oPSectio11 3.2. The last half of (he chapter introduces three rechniqucs for parallel computing: barrier synchraniza~ion,data parallel algorilhms, and what is called a bag or tasks. As noletl earlicr. many problems can be solved by parallel iteralive algorithms i n whjch several identical processek repealed ly man jpulate n shared array. This kill(! of algorithm i s called a clcitu purdlel nigorithm since Lhe shared clata is manipulated in parallel. In such an algorithm, each iteration typically depends on the resihrs of the previous iteration. Hence. at the end of an iteration, faster processes need Lo wait for the slower ones befa-e beginning the nexl iteration. This
94
Chapter 3
Locks and Barriers
point is called a Dul-~-iet,.Section 3.3 describes various kind of synch~.onizalio~) ways lo irnplemenl bar~iersynchronizatioi~and djscusses the performance kadeoffs bctwccn Llieln. Seclion 3.5 gives several exarnplcs of data parallel algoritlilns h a t use barriers ancl also briefly describes thc dehigll of synclironous mulrip~-ocessors(SIMD machines), tvhicli ate especially si~itedto implementing data parallel algorillims. This is because SIMD ~nacllinesexecute instructions ill lock step on evcxy processol-: thence, they provide barriers aulo~naticallyafter every tilachine inslructiou. Section 3.6 presents anothex useful technique for parallel computing callecl a htrg c?j'tuslts (or a work farm). Tliia approach can be used to jmplemenl rccursive p;~rallelismand to implement iterative parallelistn whe~ithere is a fixed numbcr of' independent tasks. An iinportant itttlib~tcOF t h e bag-of-tasks paradig~nis that i t Facilit;ltes load balancing-namely, cnsuring Lhal each processor does about the same amoiuit of work. The bag-of-lash paradigin employs locks to itnpleinenc the bag and bacrial--like synchroilization to detect when a co~nputatiou is done. Tlie programs in d ~ i schaprer employ bus! wuiting, which is w) implelnenlation of hynclu-onization in which a process I-epeatcdly checks a condition until it becolnes truc. The virtue of busy wailing synchronization is that we can implement j r using only (he machine instr-uctions available on modern processors. Altlioi~ghbusy waiting is inefficient when processes share a processor (and hence cheir execution is interleaved), it is efficje~~I when each process exec11tes on i1,s own pi-occsso~: Large rnullipl-ocessors have beer) used for many years to support high-perfo~.mancescicnlilic compu1;ltions. Small-scale (two to four CPU) multiin worlcslations and even personal computers. processors are beco~ningcornmo~~ The operating system Icernels for- muldprocessors employ busy waiting synclironizi~lio~i, as we shall see later in Section 6.2. Moreover, hardware itself c~nploys btrsy waiting synclu-onizarion-for example. to synchronize clnh transl'ers on inelnory busses and local net.works. Soffwwe libraries for multiprocessor machines include routines for locks and somctirnes for barriecs. Those library routines are implemented using the techniques described in this chapter. Two examples of such libraries arc Ptlu-eads. = in [i] and l a s t [j] == i) s k i p ;
I
1
critical section: i n [ i ] = 0;
/*
exit protocol. * /
~loncriticalsection:
1 Figure 3.7
The a-process tie-breaker algorithm.
3.3.2 The Ticket Algorithm Thc n-proccss tie-breakel- algori~hmis cjuite cocnplex and consequently is hard to undersrand. This i s in part because i t war, not obvious how tu generali7,e the twoprocess algorithrn Lo n processes. Here we develop an n-process solution to the critical scc~io~l problern Lhat is much easier to u~~dersland.The solution also ilJustrates how inlcger counters can be used to order processes. The algorithrn is called a ticket cllgorirhm since it is based orr drawing tickets (numbers) ancl then waiting ~ i ~ r n s . Sume stores-such as ice crcaln slol-cs itncl bake~.ies--employ [he follo~vi~lg method io ensure that cuslorners are serviced in orcler of arrival. Upori entering the store. a custorner draw7sa number that is one larger than the nuinber held by any other custorner. The custouler then waits until all custo~nersholding smaller numbers have beer) serviced. This algorilhtn is implemc~.rtedby a number dispenser and hy a display indicating which customer is being served. If the store has one employee behind the service counter, customers are served one at a time in heir order of arrival. We can use this idea to implement a fair critical section protocol. Let number and next be integers h a t are inirjally 1. and let t u r n [l:n] be an array of integers, each OF which is jni~ially0. To enler its critical sec~ion,process cs [i] first sets t u r n l i ] to rhe current value of number and then
3.3 Critical Sections: Fair Solutions
JOY
int number = 1, next = 1, turn[l:n] = ([nl 0 ) ; # # predicate TICKJT is a global invariant (see t e x t ) process CS [i = 1 to n] { while (true) { (turn[il = number; number (await ( t u r n E i 1 == next) ;)
= number +
1;)
CI-ilicalsection; (next = next
+ 1;)
noncritical section;
1
3
Figure 3.8 The ticket algorithm: Coarse-grained solution.
increments number. These are a single atomic action ro ensure that customers unicli~enu~nberrs.Process cs [il then waits until the value of next i s equal to the number i t drew. Upon co~npletingits critical section, cs [i] inc~-e~nents next, again as an atomic action. This protocol results in the algorjtl~mshown in Figure 3.8. Since number is read and incremented as an atomic action and next is incremented as M R L O I ~ I C action, Lhe I-i)llowingpredicate is a global invariant: draw
TICKET:
next > 0 A ('d i: 1 c= i c = n: (CS [i] in its crilical section) 3 (turn [il == next) A ( t u r n [ i ] > 0 ) + ( V j: 1 c = j c = n , j ! = i : t u r n [ i ] != turn[j]) )
I I
I
i
!
The last line says that nonzero values of t u r n are unique. Hence, at most one turn [il is equal to next. Ilence, at m o s ~one process can be in i l s critical section. Absence of deadlock and absence of un~iecessarydelay also I'ollow from the fact that nonzero values in t u r n are unique. Finally, i C schecluling i s weakly fair, the algorithm ensures eventual entry, because once a delay condition becomes true, i t I-emains tme. Unlike the tie-breaker algorithm, the ticket algorithm has a po~entialshorlcorning that is colnmon in algorithms that employ incrementing counLe1.s: the valLies of number and next are unbounded. IF the ticket algorithm runs for a very long time. increnienting a counter will eventually cause arithmetic overflow. This is extremely unlikely to be a problem in practice. however. The algorithm jn Figure 3.8 contains tkrec coarse-grained alomic ttclions. It is easy to implement the await statctnent using a busy-waiting loop siuce the Boolean expression references only one sharcd variable. The lasl alomic action,
Q; !! ,
j
,
f',),! : ,,,
.
?
i
110
Chapter 3
Locks and Barriers
which increments next, can be i~nplelnentedusing regular load and store instructions, because a1 most one process at a t.ime can execute the exit protocol. UnFortunarely, it is hard in general to implemenl Lhe first aro~nicaction, which read? number and theo increments it. Some machines bave instructions that return thc old value of a variable and increment or decrement it as a single indivisible operation. This kind of insuuc[ion does exactly what i s required for the ticket algorithm. As a specific example, Fetch-sod-Add is an instruction with the following el-'l'ect: FA(var, i n c r ) :
(int t m p = var; var
=
var + incr; return(trnp);)
Figure 3.9 gives the ticket dgol-ithm i~nplerncntedusing FA. On ~nachinesthat d o not have Fetch-and-Add or a cornparable instruction, we have to use another approach. 'The key requirement in the tickel algorith~rlis that every process draw a unique number. If R machine has an atomic incle~na~lt the first step in the eouy protocol instruction. we might considel- i~nple~nenting by turn[i] = number; (number = number + 1;)
This cnsures that number is increrllented correctly, but i l does not ensure tliar processes draw uriique numbers. Tn pardcular, every process could execute the first assignment ahove at about the same ~ . i m eand draw 1 . h ~same number! Thus ir is essential that both assignments he executed as a single atomic action. We h;l\le already seen two other ways to solve rhe criticd section probjejn: spin locks arld Ihe tie-breaker algorithm. Eilher of heae could be used wilhin rlie
int number = 1, next = 1, turn (1:nl
-
=
(
In1 0
;
process C S l i 1 to n] { while (true) ( t u r n [ i ] = FA(nurnber,l); / * entry p r o t o c o l * / while ( t u r n [ i ] ! = next) s k i p ;
critical section; next = n e x t + 1;
/ * exit protocol
noncritical section; 1
1 Figure 3.9
The ticket algorithm: Fine-grained solution.
*/
3.3 Critical Sections: Fair Solutions
Ill
ticket algorithttl to make r~umberdrawing atomic. I n par~icular,suppose CSenter i s a critical section entry prolocol and CSexit is the coi-responding exit protocol. Then we could replace the Fetcli-and-Add statement in Figure 3.9 by
(3.10) CScntes; turn[i]
= number; number = numbercl;
CSexit;
Although this might seem like a curious approach, in practice it would actually work quile well, especially if an instruction like Test-and-Set is available Lo j~nplemenrCSenter and CSexit. With Test-and-Set, processes m i g h ~not draw numbers in exactly the order they attempt to-and theorclically a process could spin forever-bul with very high p~obabilityevery process would draw a ~iumber. and most would be drawn in order. This i s because die crirical section within (3.10) is very short, and hence a process is not likely to delay in CSenter. The major source of delay in the ticket algorithm is waiting for t u r n [i] to be equal to next.
3.3.3 The Bakery A Igorithm The ticket algoritlim cam be ilnplemented directly on machines that have an instruction like Felcli-and-Add. If such an instruction is not available, we can e algorithm using (3.10). But (liar sitnulace the number drawing part of d ~ licket might not be fair. requires ~~silig anoLher critical section protocol, and the sol~~tion H e ~ ewe present a ticket-like algorithm--callecl the bukely algoi.ithrn-thal is fair and that does not require any special machine instruclions. The algorjllim is consequently more cort~plexuhan the ticket algorithm in Figure 3.9. In the ticket algorit)\m, each customer clraws a unique nu~nbel.and then waits for its number to be eqi~alto next. TJzcbakery algorithm takes a different approach. In particular, when customers enter the store, they first look around at all o~hercustomers acid draw a number one larger than any otliel: All customers tnust then wait for their number to be called. As in the ticket algorithm, the customer with he s~nallestnutnber is the one that gets serviced nexl. The difference is that custolners check with each other rather than wirh a central next counte~.lo decide on the order of service. As in [lie tickel algorithm, let t u r n [ l : n ] be an axray of integers, each OF which is initially rero. To enter its critical section, process cs [i] first sets t u r n [i] to one more than rhe maximum of all the curxenl values in t u r n . Then CS [i] waits until turn[i] is the smallest o f Ihe notizero values of t u r n . Thus the bakery algorill~mkeeps the following predicate invariant:
112
Chapter 3
Locks and Barriers
BAKERY: ( C S [i]
( V i : 1 0 )
> 0) A ( V j: 1 < = j < = n, j != i: turn[jl == 0 v turnri] < turntjl) )
+
Upon cornpleling its critical section, c s [i] resets turnti] to zwo. Figure 3.1 0 conrains a coarse-grained bakery algorit11111meeting Lliese specifications. The first alomic action guarantees thal nonzero values oT turn axe unique. The f o r statement ensures thal Lhe consequenl in predicate BAKERY is t~wewhen P [ i ] is executing its cri~icalsection. The algorithm satisfies Lhe inuti~al exclusior~ property because not all 0.1' t u r n [i] ! = 0, turn [ j 1 I = 0, and BAKERY call be true at once. Deadlock cannot result since tlonzero vaJues o f t u r n are ilnique, and as usual we assume that every process eventually exits ils critical geclion. Processes are not delayed unnecessarily sirice turn[i] is zero when cs [i1 is outside its critical section. Finally, the bakery algorithm ensul-es eventual entry if scheduling is weakly fair since once a clclay conditic)~~ becolnes true, it remains true. (The values of turn in the bakery algorithm can gel arbibarily large. I-Iowevel; the turn [i] continue to get larger only i f rhcrc i s nL\vuy.s at least one process trying 10 get into its crilical section. This is no1 likely to be a pi-aclical problem.) The bakery algo~ithmin Figure 3.10 cannot he irnplc~nented&I-eclly on contelnporary machines. The assign~nenrlo turn[i] requires computing Lhc lnaximum. of n values. and [he await statenienl references a shared variable (turn[j 1 ) twice. Thcse ;~ctjonscould be inlpleinenred ato~nicallyby using anotI1e1-critical section p~.orocolsuch as Lhe tie-breaker algorithm, but thal would be qi~jleinefficient. FOI-tunately, there is a sirnplcr approach.
int turnl1:nl = ([nl 0 ) ; # # predicate BAKERY is a global invariant
--
see text
process CS [i = 1 to n] ( while (true) ( (turn[i] = max(turn[l:nJ) + 1;) for [j = 1 to n st j ! = i l (await ( t u r n l j ] == 0 or turn[il < turn[j]);)
cri~icalsectio~l; turnril = 0;
>
noncritical section;
1 Figure 3.10
The bakery algorithm: Coarse-grained solution.
3.3 Critical Sections: Fair Solutions
113
When n processes need to synchronize, it is oAen usef~~l first to develop a rwo-process solution and then lo generalize that solution. This was the case earlier for [he tie-breaker algorithm and is again usel:i~l here since i~ helps illustrate the problelns that Ilave to be solved. Thus, consider tbe followjng enlry protocol lor process cS 1: t u r n l = turn2
+
w h i l e (turn2 I =
1; 0 and t u r n l > turn21 skip;
The corresponding entry prolocol for process cs2 is t u r n 2 = t u r n l + lj w h i l e ( t u r n l ! = 0 and t u r n 2 > t u r n l ) s k i p ;
Each process ahove sets its value of t u r n by an opti~nizedversion of (3.1 O), and [he await state~nenlsare terltativel y implelnented by a busy-waiting loop. The problem with the above "solulioi~"is that tlejthel- the assignlncnt statelnenls nor the w h i l e loop guards will be evaluated atomically. Consequently, the processes could start thejl- entty prolocols at about the same time, and both could sel t u r n l and t u r n 2 to 1. If this happens, both processes could be in their CI-itical section at the same time. Tlie two-process tie-breaker algorithm iri Figure 3.6 suggests a parrial solution to Lhe above problem: II both t u r n l and turn2 arc 1, let one of the processes proceed and have the other delay. For example, lel the lower-nurrtbered process proceed by strengthening the second conjunct in the delay loop in cs2 to turn2 >= t u r n l .
Unfortunately, it is still possible for both processes to enter their- critical section. For example. suppose csl reads turn2 81ld ge1.s back 0. Then suppose cs2 stark its ently protocol, sees that t u r n l is still 0, sets turn2 lo 1, and then enters ils critical section. At lhis point, c s l can corltinue its elltry protocol, set t u r n l to 1, and then proceed into its critical section since both t u r n 1 and t u r n 2 are 1 and C S lakes ~ precedence in [his case. This kind of sjrr~atiorli s called a race corzdition since cs2 "raced by" csl and hence csl missed seeing that C S was ~ changing t u r n 2 . To avoid this race cond,i.tion,we can have each process set its value of t u r n Lo 1 (or ally nonzero value) a t the start ol' the entry protocol. Then it examines the other's value of t u r n ant1 resets its own. In particular, the entry protocol for process CSZ is t u r n l = 1; turnl = turn2 + 1; while (turn2 ! = 0 and t u r n l > t u r n 2 ) s k i p ;
Sim ilarl y, the entry prolocol for process cs2 is
114
Chapter 3
Locks and Barriers turn2 = 1; turn2 = turnl + 1; while (turnl != 0 and turn2 >= turnl) s k i p ;
One process cannot now exit jts while loop uoril the orber llas finished setling i d value of turn iP it is in the m.idst of doi~lgso. The sol.litio~gives csl precedence over cs2 in case both processes have thc same (nonzero) value foi- turn. The encry protocols above arc not quite symmclric, because Lhe delay conditions in the. second loop are slightly different. However, we can rewrite the~vjo a s)ltnmetsic form as follows. Let (a,b) and ( c ,d) be pairs of integers, and define the glraler than relation between such pairs as follows: (a,b) > ( e , d ) == true == f a l s e
if a
z c
or if a == c alld b
>
d
otherwise
Then we can I-ewrite turnl > t u r n 2 in C S as ~ (turnl,1 ) > (turn2,2) and turn2 >= turn1 in C S as ~ (turn2,a) > (turnl,1). The virtue of a s)~mmetricspecification i s that it i s rlow easy to generalize the two-process balcery algorithm to ao n-pcoccss algorithl~i,as shown in Figure 3.11. Each process first indicates chat it wanis Lo enter by setling its value of t u r n Lo 1. Then it computes the maxi~numof all the turn[i] and adds onc co the result. Finally, (he process employs a f o r loop as in the coarsc-grained solution so that it delays uotil it has precedence over all other proccsscs. Note that the rnaxirnu~~~ is approximated by reading each eletnenl of turn and selecting the lurgesl. It i s no! colnputed atolnically and hence is not necessa~ilyaccurale. However, if two or more processes compute the samc value, tliey arc 01-deretl as descr-ibed a bovc.
int turn[l:n] = ([n] 0 ) ; process CS[i = 1 to nl { while (true) { turn[i] = 1; turnti] = max(turn[l:n] ) + 1; for [ j = 1 to n st j ! = i] while (turn[j] != 0 and (turn[i],i) > (turn[ j] j) ) skip;
.
crilical section; turn[il = 0;
noncritical seclion;
I 1 Figure 3.1 1
Bakery algorithm: Fine-grained solution.
3.4 Barrier Synchronization
115
3.4 Barrier Synchronization Many problems can be solved using i~erativealgorithms that successively compute better approxiniatiolis lo an answer, cerlninating whet] either- the firlal atiswer has been compi~tedor-in the case of many numerical algorithms-wheo (lie linal answer has converged. Such an algoriihm Lyyically nmiipu1ate.s an array of values, and each iteration perlom~s(.he satne computalion on all array elemenls. Hence, we can often use mulliplc processes to compute disjoint paw of the sulution in parallel. Wc have already seen a feu/ examples aod will see many rnore in che next section and ill later cl~apters. A key atlribute of most parallel it.eralive algorilhrns is that each ilel-ation typically depends on tbe results of the previous ileration. One way lo structure such an algorithm is to i~nplementthc hody o f each iteration using o n e or more co statements. Ignoring tenninatiou, and assuming thcre are n parallel casks on each ilel-ation, this approach l ~ a the s general fornl while (true) { co [i = 1 to n]
code to implemetlt task i; OC
}
Unfol-mnately. the above approach is quire inefficient since co spawns n processes on each iteration. 11 is rn~lclimore cosrly lo create and destroy processes than to i~nplenentprocess syncluonization. Thus an alternative st~-ucru~.c wilI result in a more efficient algorithm. I n particular, create the processes once al the begini~ingof the computation, Lhen have tliero syncl~ronizeat the erld of eac 1 iteration: process ~ o r k e r r i= 1 to nl { while (true) { code to implement task i; wait for all n hsks EO complete;
1
1
This is called barrier syr~ch.t~)t~izutio~~, because the delay poini ac the et>d of each itcration represenls a barrier that all processes 11:ive Lo arrive at bel'ore any are allowed to pass. Barriers can be needed both a1 the ends of loops, as above, atid 3~intermediate stages, as we shall see. Below we develop several implementations of barricr synchronization. Each ernldoys a dieerenl pmcess ince~actiontechnique. We a l s o describe when each Irind of' barrier- is appropriate lo usc.
116
Chapter 3
Locks and Barriers
3.4.1 Shared Counter The simplest way lo speciCy the ~.equiremenlsfor a barrier js to employ a shal-ed inlegel; count,which is inilially zero. Assu~nediere are n worker processes that ~)eedco meet a1 a barrier. When a process arrives a1 the barrier, it increments count; when count: i s n, all processes can proceed. This specification leads Lo thc following code outline: (3.1 1 )
i n t count
= 0;
process Worker [i = 1 to nl { while (true) C code to jmplemenl task i; (count = count + 1;) (await (count = = n);)
1
1
We CiIll implelrient ehe await sraternent by a busy-waiting loop. I F we also have all indivisible incre~nerit inslruction. such as the Ferch-and-Add instruclion dclinecl in Section 0.3, \ve can implement lhe above barrier by FA(count. 1 ) ;
while ( c o u n t ! = n ) skip;
The above code is not fully adequate, lirrwever. The difficulty is lhar count must be 0 at the start o f e x h iteratioti, which means that count needs to be reset to o each Lime all processes have passed rhe barrier. Moreover, count has to he reset before any process again tries LO incrernc~~t count. It is possible to solve this reset proble~ilby employing two counters. one Iha~ counts up LO n and another that counts down to 0, with the roles of thc cou11Lei-s be; ng switched arler each stage. However. there are addi lional, pragmatic problems with using s l i a ~ ~counters. d First, Lhey have to be incrementecl andlor decremented as atomic actions. Second, when a process is delayed in (3.1 I ) , it is continuously examining count. 111 the, worst case, n-1 processes might be d Lo delayed waiting for the lasr process to arrive aL the barrier. 'L'hjs c o ~ ~ llead severe tnelnory contenliotl, except on mul~iprocessorswilh colierent caches. Rut even then, the value of count i s continuously cllanging, so evely cache needs to be updated. Thus it is appropriate to ilnplerner~ta barrier using counters only if Ihe target machine has atornic increment instructjons, coherent caches, and e f f cienr cache update. Moreover, n should be relatively small.
3.4 Barrier Synchronization
117
3.4-2 Flags and Coordinators One way co avoid the memory conlention prob1e.m i s lo distribute the implelaentation of count by using n variables that sum lo h e same value. I n particular, let a r r i v e [ l : n ] be an array o[: integers inilializccl to zeros. Then replace the incremenl of count in (3.11) by arrive 1 i] = 1. Wit11 this change. the €01lowing predicate is a global invariant: UFPE CcFW
count == ( a r r i v e [ l l +
... +
M IEI
arrive In] )
BIBLIOTMLA
Memory contention would be avoided as long as elements of arrive are stored i1-1 different cache lines (to avoid contenrion whdn they are written by the processes). With the above change, the remaining problems are irnplemenling rhe await statement in (3.1 1) and rcsetlit~gthe elernenls of arrive at t,he end of be each iteration. Using the above relalion, Lhe a w a i t stalemen1 could obvio~~sly i~nplementcdas (await ( (arrive [l] +
. ..
+ arrive l n l ) = = n);)
However. this reinlrtduces rrlernory contention, and, moreovel; it i s inefficient s.ince the sum of the arrive [i] is conlinually being computed by every wailing Worker. We can solve both l l ~ ememory contenrion and rebel PI-ohlemsby using :un additional set of sl~ared\~aluesarid by employing an additional process. Coordinator. Instead of having each Worker sum a11d lest the values of arrive, let each worker wait for a single value lo become true. In parlicular, let continue [I:n ] be another array of integers, initialized to zeros. After setting arrive [ i] Lo 1,Worker [ i l delays by wailiiig tbr continue [i] to be set 10 1.
(3.12) a r r i v e [ i l = 1; (await (continue[ i l == 1);)
The coordinator pmcess waits for all elenlenrs of
a r r i v e Lo become 1,then
sets all elen~entsof continue to 1: (3.13) for [i = 1 ta n] (await (arrive[il = = for [i = 1 to n] continueti] = 1;
1);)
The await statements in (3.12) and (3.13) can be implemented by w h i l e loops since each references a single shared variable. Ttle coordinator can uxe a for statement 10wait Ifor each elernen1 of arrive to be set; Jnoreover. because all arrive 1-lags n i ~ ~ be s t set bel'ote any Worker is allowed 1.0 coutinue, he Coordinator can test the arriveri] in ally 01-cler. Finally: inemory
118
Chapter 3
Locks and Barriers
con~entioriis not a problem since [Ire processes wai.t For different variables to be set and Il~csevariables could be slored in different cache lirles. Variables arrive and c o n t i n u e in (3.12) and (3.13) are examples of what is callecl a ,@OR variable-a variable that is raised by one process to signal that a synchronization condirio~~ is true. The rerxraining problem i s augmenting (3.1 2) and (3.13) with code to clear the Rags by resetting hem to o in preparalion for the next i~eralion.Here ~ w general o principles apply. (3.J4) Flag Syncl~ror)izationPrinciples. (a) The process that waits for a synchronization Rag io be set js the one that sl~ouldclear that flag. (b) A flag should not be set unlil it is known that i t is clear.
The firs1 princjple ensules rhat a flag is not cleared before it has been seen to be set. Thus in (3.12) Worker [ i l ~Iiouldclear c o n t i n u e [ i ]and in (3.13) c o o r d i n a t o r should clear all elernerlts of a r r i v e . The second principle ensures that another process does not again set the sarnc flag before it is cleared, which could lead to deadlock if rhc first process later wails for the flag ro be set again. In (3.13) this means C o o r d i n a t o r should clear arrive [i) hefore setting cont i n u e [i]. The C o o r d i n a t o r can do [his by executing another for statement aftel the iirsl one irr (3.13). Alternatively, C o o r d i n a t o r can clear arrive [i] immediately after it has waitec! for it to be set. Adding flag-clexrjng code, we get Lhe coordinalor barrier shown in Figure 3.12. Although Fjgul-e 3.12 implements barrier sytichro~iization in a way lhat avoids Iiielnory contentjon, the solution has two undesi~.ableattributes. First, it requiscs an extra process. Busy-waiting synchronization is ineflicient ~ ~ n l eeach ss process exea~teson its own processor, so the c o o r d i n a t o r shoilld execute on its own processor. However, it would probably be bertcr to use that processor for another worker process. The seconcl shortco~ning01using a coordinator is that the execution time of eacl~iteration of c o o r a i n a t or-and hence each instance of barrier synchronization--is proportional to the number of Worker processes. In iterative algorithms. each Worker often executes identical code. Henee, each is likely to arrive at the barrier ar about the same titne, assuming every Worker is executed on its own processor. Thus, all a r r i v e flags will get set at about rhe same lime. However, Coordinator cycles through tlxe Aags, wailing for each one lo be set ill turn. \Ve can overcome both problerns by combin,ing the actions of the coordinator ancl workers so that each worker is also a coordinator. In particular, we can organize the worker's into a tree, as shown in Figure 3.13. Workers can then send arrival signals up (he tree and continue signals back down the bee. In pattic~~ku, a worker node firs1 waits for its children to arrive, then tells its parent node thal il too has arrived. When the root node learns rhat its children have arrived,
3.4 Barrier Synchronization
int aurive[l:n] = ([n] 0).
119
continue[l:n] = ([nl 0);
process Workerti = 1 to nl ( while ( t r u e ) { code to impletnenl task i i arrive [il = 1; (await (continue [i] == 1);) continue[ij = 0 ; 1
1 process Coordinator ( w h i l e (true) { for [i = 1 to n] f (await (arrive[i] -= 1);) arrive [il = 0; 1 for [i = 1 to n] continue[il = 1;
Figure 3.12
Barrier synchronization
using a coordinator process.
it knows that all other wo~kershave also arrived. Hence the cool can tell its children to continue; they in turll can tell their childl-en to continue, and so on. 'The specific actions of each kind of worker process are listed i n Fjgure 3.14. The await statements can, of course, be implemented by spin loops. The implementatio~iin Figure 3.14 is called a combining tree h u r r i ~ r .This is because each process combines the results of its children, the11passes thern on to its parent. This barrier uses tlie same number of variables as the cenl~alized
Worker
Worker
f
/"- -. 2
Worker
Figure 3.13
fZ
2
Worker
Tree-structured barrier.
f
' Worker
'i
Worker
120
Chapter 3
Locks and Barriers
leatnode L;
interiornode
arrive[^] = 1; (await (continue[L] continue[ll = 0 ;
I:
==
1);)
(await (arrivefleft] = = I);) arrive[leftl = 0;
(await (arrive[right] == 1 );) arrive[right] = 0; arriveC11 = 1; (await (continue (I] == 1) ;} continue [II = 0 ; continue[left] = 1; continue[right] = 1;
toot node
Figure 3.14
R:
(await (arrive[left] == 1) ;) arrivefleft] = 0; (await (arrive [right] = = 1) ; ) arrive[rightJ = 0; continueCleft] = 1; continueCright1 = 1;
Barrier synchronization using a combining tree.
coordinator. but i t is much inore efljcient for large n, because tJle height of the tree is proportional to log2n. We can make the co~nbiningtree barrier even Inore eflicient by having the root node broadcast a single message that tells all other nodes to continue. In pvlicular, the root sets one continue flag, and all other nodes wait l o r i t to be set. This continue flag can larcr be cleared it1 either of Lwo ways. One is to use double buffering-that is, use Lwo continue flags and allenlare betwee11them. The other way is to alternate the sense of the continue Hag-that is, on oddnumbered rounds Lo wait for. it to be set to I. and on even-numberecl rounds to wait for it to be set to 0.
3.4.3 Symmetric Barriers In the coinbjning-tree barrier: processes play cliifercnt roles. I n p ~ ~ l i c u l a rhose , in the lree execule more actions than rbose at the leaves or roo(. Moreover, the root node needs lo wait for arrival signals to propagate up the tree. If every process is executing the same algorithm and every process is executing on a diffe~entprocessor, (hell all processes should anive at cbe barrier at about the satne time. Thus if all processes take the exact saine sequence of actions when at interior nodes
3.4 Barrier Synchronization
121
they reach a barrier, then all might be able to proceed th~.oughthe barrier at the same rale. This section presenrs two sjurnn.etric b~lrl.ier.s.These are especially suitable I'or shared-mernol-)I mul~igrocessors with nonuniform melnol-y access time.
A symmetric n-process barrier is co~lstrucredfrom pairs o l siulple, twoprocess barriers. To construct a two-process barrie,~.,we c o ~ ~ use l d tlie coordin;~torlworker technique. Hou,e\ler, the actio~isof the two processes would tlien be different. Instead, we can consoucc a Fully syin.~netricbarrier as follows. Let each process have a flag that it sets when i r arrives at the barrier. It then waits for the. other process to set its ,flag and finally clears the othes's flag. IF w [ i ] is one process and w [ j 1 is the other, the sy rn~nerrictwo-process barrier is then i cnplenlet~tedas follows: (3.15) / * b a r r i e r code f o r worker process W [ i ] * / (await (arriveti] == 0);) / * key line - - see t e x t * / arriveci] = 1 ; (await (arrive[j] == 1); ) arrivelj] = 0 ;
/ * b a r r i e r code for worker process W l j J * / (await (arrive[j] == 0);) / * key line - - see text arrive[j] = 1; (await (arriveli] = = 1);) arrive [ i l = 0;
*/
The last three lines in each process serve rhe roles described above, The cxis;elice of the lirst lirle in each process may at first seem odd. as it ju.st waits for a process's own flag to be cleared. However, it is needed to guard against h e possible situation in which a process races back to thc barrier and sets ils own flag before the other pt.ocessj-om the previous rcse of the bat.rier cleared the flag. In short, a11 four lines are needed in order ro follow tlie Flag Synchronization Principles (3.14). The question, now is how Lo combine instances o f two-process barriers to construct an n-process barrier. In parliculal; we need Lo devise an intcrcoonection scheme so that each process eventually learns chat all othcrs have an-ived. The best we can, do is to use some sort of binary inlerconnectio~l,which will have a size proportional to log2 n. Let Worker [ l : n l be Lhe array of processes. If n is a power of 3. we cotlld combine them as shown in Figur-e 3.15. This kind o f barrier is called a bzrtterJy Bnl.~.ierdue Lo the shape oC the iuterconneclion pattern, which is similar As shown, a to Lhe buttelfly i ntercon neclion pa1cel.n for tlie Fourier ~ral~sflom~. bulterfly barrier has log2 n stages. Each worker synchronizes with a difl'erent
122
Chapter 3
Locks and Barriers 1
Stage 1
u
Stage 2
'
Stage 3
Figure 3.15
----
Workers
2
3
4
u
5
6
7
u
8
u
M
,
Butterfly barrier for eight processes.
other worker at each stiige. In particul.ar, i n stage s a worker synchronizes with a worker at distance 2"' away. When every worker has passed through all the stages, a1.l worker3 must have a~rivedat the barrier and hence all can l~soceed. 'This is because eve~yworker has hrectly or indirectly synchronized with every orher one. (When n is not a power of 2, a buttelfly barrier call be constructed by using the next power of 2 greater than n and having existing worker processes subslitute for the missing orles at each slage. This is [lot vely efficient, however.) A different inlerconnec.tion pattern is shown in Figure 3.16. This pattern is becrer because it can be used for any value of n, not just those that are powers 01' 2. Again there are several stages, and in stage s a worker s y n c h r o n i ~ ~with s one ac disrance away. Howevel; in each two-process barrier a process sets the arrival Rag of a workel- to its right (modulo n) and waits for, then clears. its own arrival Bag. This slruclure is callecl a di.~semi~zntio~~ burrier, because i t is based on a tecliniclue for disscmi~~ating inforination to n processes in log, n rounds. I-lere, each workei dissernitlates notice of its arrival at the barrier. A critical aspect of correctly imple~neilting an n process burrierintlepelldent of which interconnectiou pattern is used-is to avoid race conditions cha~can result horn using multiple i~stancesof (he basic two-process barrjer.
Workers
1
2
3
4
Stage 1
Stage 2 Stage 3
Figure 3.16
Dissemination barrier for six processes.
5
6
3.4 Barrier Synchronization
123
Corlsider the butterfly ba~rierpattern in Figure 3.15. Assume tl~ereis o ~ l yone array o.f flag variables and that rhcy art: usecl as shown i n !3.15). Suppose Lhat process 1 ardves a1 its first stage and sets its flag arrive [I]. Further suppose process 2 is slow, so has nut yet arrived. Now suppose processes 3 and 4 arrive a1 the f rst stage or tlie barrier, set and wait for each other's flags, clew them, and proceed to the second stage. In the second stage. process 3 wants to synchronize with proccss 1. so i t waits fol- arrive [ 1) to be 1. it crlready is, so process 3 clears a r r i v e [ I ] alld merrily proceeds lo stage 3, even though process 1 had set arrive r 11 for process 2. The net effects are hat some processes will exit the barrier before they should and thal some processes will w a i l forever to gel ro tlie next stage. The s;une proble~ncan occur wilh the dissemination barrier in Figure 3.16. The above synchi-onizationanomaly results h ~ using n only one set of tlags per process. One way to solve [lie problem is lo use a di.f€erent set of flags for each stage of an n-process ban-ier. A better way is to have the flags take On rnore values . I F the flags are integers-as above-tbcy can be used as incre~nencing co~1ntel.sthat record the number of barrier stages each process has execukd. The inilial value oteach Hag is 0. Any time worker i arrives at a new barrier stagewhether i l is the firsr stage of a new bnlrier or an jrlterrnediate stage of a harrierit incremenls arrive [ i I . Then worker i determines lhe appropriak pwtner j for the current snge and waits for the value of arrive [jl LO be nl Lccist arrive [ i]. The actual code follows: # b a r r i e r code f o r worker process i for [s = 1 to num-stages) { arrive[i] = arrive[i] + 1; # determine neighbor j for stage s while (arrivetj] < arriveti]) skip; 1
In this way, worker i i s assured thar worker j has also proceeded at leasc as far. This approach of using incrementing coullters and stages removes die above race condi~ion,removes the need For a process to wait Ior its own flag to be reser-the first line in (3.15)-and removes the neod for a process lo reset another's l-lag-the last line in (3.15). Tlius every stage of every barrier is just three lines of code. 'I'he only downside is that the counters always increase, so they could tl~eoretically overflow. However. this i.s extlunzely unlikely to occur. To sumn~arize[he topic of barriers, there a-e many possible choices. A counter balder is h e simplest and is reasonable for a small number o f processes when (here is an alomic Fetch-and-Add instruction. A symmetric barrier provides tlie mosr general and efficient choice fol- a shared-memory ~uachine,
124
Chapter 3
Locks and Barriers
becai~seeach process will execute the sane code, ideally a t about the same rate. (On a distributed-mernory machine, a tree-structured barrier is often more efficien t hecause it results in fewer interprocess communications.) W hatcvcr the str-ucture of the barrier, however, the crjtical codilig problem i s to avojd race conditjons. This is acllieved either by using n~ulljplcflags-one per stage for each pair of processes-or by using increine~ltingcounters as above.
3.5 Data Parallel Algorithms In dnta ~>a.rcrllel algori~hmuseveral processes execute the same code and work on different parts of shared data. Balrjers are used to sj~ncluonizethe execution phases of the pi-ocesses. This kind o£ algorithm is most closely associated with synchronoiis mulriprocessors-i.e., single instruction stream, multiple data suearn (SIMD) machines Cllat supporc fine-grained conlputations and ba~rierspncl~ronization in hardware. However, dala parallel algorithms are also exlsemelg usefi~lon asynchronous multiprocessors as long as the granularity of the processes is large enough ro more than cornpcnsate for barrier synchronization overhead. This sect ion develops data parallel solutions for three problems: partial sums of an array, finding the end of a linltecl list, and Jacobi ilerarion for approximating the solution to a partial difl'erenlial equation. These illuslrare the basic techlliques that arise in data parallel algoril.hnls and illushate uses of barrier syncluonization. At the end of the seclion, we describe SlMD mul~iprocessorsand show how thcy remove Inany sources o f inlerference and hence remove the need tor programming barriers. An entire chaptel- later in Lhe text (Chapter 1 I ), shows how to implement data parallel algorjtl~rnsefficiently on shared-memory multiprocessors and d istri buled machines.
3.5.1 Parallel Prefix Computations 1c is freq~~ently useful Lo apply an operation to all elenlent-s of an array. For example. to compute Lhe average of an array of values a [nl, we first need to suln all the elements, then divide by n. Or we might wan1 to know the averages for all prefixes a [ o : i ] of the array, which requires computing rhe sums of all prefixes. Because of the importance of this kind of compulation, Lhe APL language provides spccial operators called reduce and scatz. Massive1y para1 lel S IMD machines, such as the Co~lnectionMachine, provjde reduction operators in hardw v e for combining values in messages. In 11iis section, we sliow how to compute in parallel the sums of all pref xes or' an array. This is thus called a pamilel p r e b c o n ~ p ~ i f n ~ i oThe n . basjc
3.5 Data Parallel Algorithms
125
algorithm can be used for any associative binary operator, such as addition, multiplication, logic operators. or maxirnum. Consequently, pal'allel prefix colni n many applications, jncluding image PI-ocessing,matrix putations are use.l'~~I computations, and parsing a regular language. (See tlze exercises at the end of [his chapter.) Suppose we are given away a [n] and are lo compute sumrn], where sum[i] is LO be the sum] of the first i eleinenls of a. The obvious way to solve this problem sequentially is to itera1.e ac.ross Lhe rwo arrays: sum[O] = a [ O l ; f o r [i = 1 to n-11 sum[i] = sum[i-11 + a [ i l ; I n particular, each iteralion adds a t i ] LO Lhe already co~nputedsuln of Lhe previous i-1 elernenls. Now consider how we might parallelize this approach. If our task were ld as l'oJlows. First, add merely to find the sum of all elemcnch, we c o ~ ~ PI-oceed pairs oE elements i n parallel-for example. add a [O ] and a [I] i n parallel with adding other pairs. Second, combine the re.sults of the first step. again in pamllel-for example, add the sum of a [ 0 I and a [ I I to [he sum of a 1 2 I slid a [ 3 1 in parallel with computi-ng ol.her parlial sums. 'If we co~~tinue this process, in each step we w o ~ ~ ldouble d the number of elements thai have been summed. Tllus in log, n sreps we would have co~npuledthe SLIITI o f all elements. This i s the best wc can do if we have ro combine all elements two at a time. 'To conipuce the sii~iisof all prefixes in parallel, we can adapt this techniclue ol' doubling the nun~beror elements that have been added. First, ser all the surn[i] to a [i]. Then, in parallel add sumti-11 to sumti]. for all i > = l . I n particular, add elements that ar-e di.stance I away. Now double the distance and add surnri-23 to aum[i]; in this case ( ~ o J - ~ IiI > = 2. It' YOU continue to double the distance, then afler log, n 1rounds you will have cornyuted all partial sums. As a specific example. the following table illusrrales the steps of die algorithm for x six-elerncnt array:
r
1
r
initial values oC a
1
2
3
4
5
sum after disPaocc 1
1
3
5
7
9 1 1
sum after distance 2
1
3
6
I0
14
18
sum after distance 4
1
3
6
10
15
21
6
Figure 3.17 gives at1 implementation of chis algorithm. Each process first initializes one element of sum. Then jt repeatecllg co~rlputesparlial sums. I n the
126
Chapter 3
Locks and Barriers i n t a [ n l , sum[nJ, o l d [ n ] ;
process Sum[i = 0 to n-l] { int d = 1; sumri] = a [ i ] ; / * initialize elements of sum * / barrierli); ## S U M : surnli.3 = (a [i-d+l] + . . . t. a[i3 ) while (d < n ) Z oldlil = sum[i] ; /* save o l d value * / barrier (i) ; if ((i-d) >= 0) sum[i] = old[i-dl + sum[il; barrier (i); d = d+d; / * double the distance * / 1 }
Figure 3.?7 Computing all partial sums of an array,
algorithm, barrier ( i ) i s a call lo a procedur-e thal implements a barriersynchronization point; argument i is the identity of the calli~lgprocess. 'The procedur.e call retums after all n pl-ocesses have called barrier. The body of the procedure would use one of the 4gori~limsi n the PI-evioussection. (For [his probleln, the barr~erscan be optimized hince only pairs of processes need lo synclironize in each stcp.) The barrier points in Figlure 3.17 are needed to avoid interference. For examplc. all elclnents of sum nced to be initjaljzed bel'ore any process examines theln. Also, each procoss llecds t o ~nakea copy of the old value i n sum[i he.fore i t updates that value. Loop invariant SUM specifies how rnucll of the preiix ol'a each process has sulnrned on each iteralion. As mentioned, we call modify this algorithm to use any associative binary operator. A.11 that we need to change is [.heoperator in the slateme~~l that modifies sum. Since we have written the expression in the combining step as o l d [i-dl + sum[il, the binary operator need not be cormnutat.ive. We can also adapt he algorithm in Figure 3.17 to use fewer than n processes. In this case, each process would be responsible for computing the parlial sums of a slice of the an-ny.
3.5 Data Parallel Algorithms
127
3.5-2Operations on Linked Lists When working wi~hlinked data structures such as trees, programmers often use balanced shucrures such as binary trees in order to be able to search for wlcl insert items in logarilhmic time. When data parallel algorithms a e used. however; even with linear lists marly operations can be i~nplementedin logarithmic time. Hcre we show how to find the end of a serially linked Jist. The same kind of algorithm can be used for other operations on serially linked lists-for example, computing dl parrial sums of data values, inserting an element in a priori(.y lisl, or matching up elements of two lists. Suppose we have a linked lisl of up to n elements. The links are storcd in array linkCn1,and the data values are stored in array data [n]. The head ol' the list is pointed to by another variable, head. If element i is part of thc list, then either head == i or l i n k [ j 1 == i for some j between 0 and n- 1. The l i n k field of the last element on the list is a null poinler, which we will represen[ by n u l l . We also assume that (he l i n k fields of eletnents no1 on the Jist are null pointel-s and that the list is already inirialized. Tbe following is a sample list:
FbRwibwR head
The problem is to find the end of the list. The standard sequential algorithm list element head and follows links until finding a lull link; the last eletnetlt visited is rhe end of the list. The execution ~ i m eof the sequential algorilhin is thus proyorlional to the length of the list. I-lowever, we can find the end of the list in t i ~ proportional ~e co rhe logarithm 01the length of tlie list by using a data para.llel algorjtlun and the technique of doubling thal was introduced i n rhe previous sectioo. We assign a process Find 10 each I.ist elemenl. Let end [n] be a slza-ed iuray of integers. [f elerneat i is a part of the lisr, lhe goal of F i n d ti] is to set e n d [ i ] to h e index of the end of the Last element on the lisl; otherwise F i n d [i 3 sh0~11dSet ena [i ] Lo n u l l . 7'0 avoid special cases, we will assume that the list contajas at leas1 lwo elements. lnilially each process sets ena[il to l i n k [ i ] - h a t is, to the index o F the next element on the list (if any). Thus end initially reproduces the paltern of links i n the lisr. Then (lie processes execute a series of I-ounds. 111 each round, a process looks a1 end [end[i]1 . If both i l and end [i] are not null pointers, thcn [he process sets end [ i 1 to end [end [ i ] I . Thus aRer the first roilnd end [i] will poiot to a list eletnent two links away (if there is one). After two
starts at
128
Chapter 3
Locks and Barriers int link [n], end ln] ;
process FindCi = 0 to n-11 ( int new, d = 1; end[i] = link[i] ; / * initialize elements of end * / barrier ( i) ; # # FIND: end[i] == i n d e x of end of the list ## at most ad-' links away from node i while (d < n) { new = null; / * see if end[il should be updated * / if (end[i] ! = null and end[end[i]] ! = null) new = end[end[il]; barrier(i) ; i f (new ! = null) / * update end[il * / end [i] = new; barrier(i); d = d + d ; / * double the distance * /
1 1 Finding the end of a serially linked list.
Figure 3.18
rounds, end[i] will poin~lo a list element four links away (again, if tliel-e is one). After logz n 1rounds, every process will I~avefound the end of the list. Figure 3.18 givcs an irnplementarion of this dgorithm. Since Lhe programming tecliniclue is [lie came as i n the parallel prefix computation, the stnlcture of the algorithlu is also the same. Again. barrier (i) indicates a call to a proccdure Ihar implements barrier syncluonizalion for process i. Loop invariant FhVD specifies what end[i] points to before and nfLer e;icll ireraticln. If the end of [he list i s fewer Lhan 2d-1links away from eletnent i, then end[il will not change on Fu~tlierirerations. To illustrate the execution of this algorithm, consider a six-element list linked together as follows:
r
A t the start of the while loop in Find, the end pointers contain these links. Aflcr [he first ireration of t(lc loop,end will contain the following Irnks:
3.5 Data Parallel Algorithms
129
Notice tlial the end links for the last two elements have no1 cha~~ged since they are already correct. Alrer the second round, the end links will be as follows:
After the third and final I-ouod,the end links will have their final values:
As with the parallel prefix compulat~on,we can adapl this algorithm t o use fewer lhan n processes. 111 pilrticular, each would rhe~ibe responsible for computing the values for a subset ofthe end links.
3.5.3 Grid Computations: Jacobi Iteration Many problems in image processing or scient16c modeling can be solved using what are called grid or mesh cont/~ut~~rio~is. The basic idea is to employ a matrix of points tha~superimposes a grid or tuesh on a spatial ~.cgion. I n an image processillg probletn, rhe rnau-ix is initialized to pixel valuex, and the goal is l o do solnethjng like find sets of neighboring pixels having the same intensity. Scientific ~uodelilzgd e n involves appl-oxilnatjng solulions to partjal z1il'~erential ecluations, in which case the edges of Lhe ~llatrixare initialized LO boundary conditions. and the goal is to comnpute an approxi~nationfor the value ol'eilch interior point. (This col-responds to linding a steady state solution to an equation.) I n ejtber case, the basic outline of a grid co~npuracionis initialize Lhe ~n:~lrix; (not yet terminated) ( compute a new value for each poiat; check for terrni nation: 3
while
On each ireration, the new v a l ~ ~ of e s points can typically be computed in par-allel. As a specific example, here we present a si~nplesolution LO Laplace's equation in two dimensions: = 0. (This is a parrial differential equarion; see Section 1 1.1 for details.) Let g r i d [ncl.n+ l] be a matrix of points. The edges
130
Chapter 3
Locks and Barriers
of grid-the left and right columns. and the lop an.d bottotn rows-xepresent the boundary OF a two-dirnensionaI region. The n x n interior elements of g r i d corlrspontl t o a mesh tliat is superi~nposedon the region. The goal is to compute the steady state values of interioi points. For Laplacc's equation, we can use a finite difference ~netl~od such as Jacobi i.terntion. In parl.icular, on each ileration we compute a new value for each interior point by taking the average of the previous values of jts four closest neighbors. Figure 3.19 presents a grid computation that salves Laplace's ecluatio~l Once again we use barriers Lo synchronize steps of the using Jacobi itc~a~io~l. comp~~ration. Tn this case, thei-e are two main steps per ilera(ion: update newgrid and check for convergence, then move the contents of newgrid into grid. Two matrices are used so i l ~ new t values for grid points depend only on old values. We can tenninale the cornpi~tationeilher afier a f xed nuln ber of' iterations or wl-ren values in newgrid al-e all within solne tolerance E P S I L O N of those in grid. If differences ilre used. they cao be checked i n parallel, but the 1-esults need to be combined. This can be done using a parallel prefix compulatiorl; we leave the details 1.0 the reader (see the exercises al the end of this chapter). The ;~lgorirhmin Figure 3.19 is correcl, but it is simplislic in several respects. First, ic copies newgrid back into grid on encli iteration. It would be far more efficient lo "unroll" the Joop once ancl have each iteration first update poinfs going from grid to newgrid and then back from newgrid to g r i d . Seconcl, i t wollld be better sti.11 lo use an algo~ilhrniliac converges Faster, such as red-black successive over relaxation (SOR). Third. the algorithm in Figure 3.1,9 i s roo line grain for asyncli~onousmultiprocessors. For those, it would he far
real grid[n+l,n+ll, newgrid[n+l,n+ll; boo1 converged = false; process Grid[i = 1 to n, j = 1 to nl { w h i l e (not converged) { newgridli, j] = (grid[i-1,jl grid[i, j-11
+ grid[i+l,j l +
+ grid[i,j+l])
check for convergence as described in the text; barrier(i); gridli, j] = newgridfi,j l ; barrier ( i ); 1
1
Figure 3.19
Grid computation for solving Laplace's equation.
/ 4;
3.5 Data Parallel Algorithms
131
better to partition the grid into blocks and to assign o~icprocess (and pl.occssor) to each block. All these points are considered in dehil in Chapter I I. where we show how to solve grid cornputalions efficiently on both shared-memory m i ~ l ~ i plucessors and disrri buted machines.
3.5.4 Synchronous Multiprocessors On an asynchronous multiprocessor, each processor cxeculcs n separate process, and the processes execute at potenrially differelit rates. Such multiprocessors are called MIMD machii~es,because they have multiple inslruclion streams and ~nultiple data strearns-that is, they have multiple independell( processors. Tliis is Llie execution model we have assumed. Aithough MIMD macllines are the most widely used and flexible ~nultiprocessors, synch-o~iousmult.iprocessors (SlMD ~nachines)have recencl y heen avail able, e.g., the Connecfon Machine in the early 1990s and Maspar machines in the mid to late 1990s. An SIMD machine has multiple dala streams but only a single inslluction stearn. Tn particular, every processor executes exaclly lhe same sequence of instructiot~s,a i d they do so in lock .rtep. TI~Ismake3 S l M D niachines especially suited to executing data pa-allel algorithms. For example. on an SlMD machine the algorithm in Figure 3.17 ror computing all parlial sunis of an array simplifies to i n t a [n], sum[n]; process Surn[i = 0 to n-l] { sumCil = a [ i ] ; / * initialize elements
while ( d
n) { if ((i-d) > = 0) / * update sum * / sum[i] = sumti-dl + sum[il; d = d + d ; / * double t h e distance
of s u m * /
c
*/
We do not need to program b a ~ ~ i e rsince s every process cxccutcs the same at the same time; hence every instruction and indeed every memory instructio~~s reference is lollowed by an itnplicit barrier. In addition, we do not need 10 use extra val-iables lo hold old values. I n the assignment lo sum[i], evely process(or) fetches the old values from sum before any assigns new valua. Hence, par4l)el assignment statements appeal- to bc atomic on an SlMD machine, which reduces some sclurces ul'inlerfcrence. It is technologically muck easi.er to construct an SlMD machine with a massive number of processors than it is to constnlct a massively pzu-allel MIMP
132
Chapter 3
Locks and Barriers
n~acliine.This makes SIMD machines atlracrive [or large problertis that can be solved using data parallel algorithms. On the other hand. SmlD m?ch'lnes are special purpose in the sense that the entire machine executes one program a1 a time. (This is the main reason they have fa1len out of favor.) Moreover, i l is a cha.llenge for the programmer to keep the processors busy doing useful work. Jn the above algorithm, for example, fewer and fewer processors upclate s u m [ i] on each iteration, yct each has to evaluate the condition in the i f statement and then, if the condition is false, has to delay until all other processes have updated sum. In general, the executiot~time of' an i f staten~etltis the cola1 of the execution time lor e v e n branch, even if Ihe branch is not iaken. For example, Ulc lime to exccule an i f / t h e n / e l s e slatcment on each processor is the suin of the limes to evaluale the condition. executc the then part, and execute the else part.
3.6 Parallel Computing with a Bag of Tasks Many iterative problcrns can be solved using a data parallel y~-ogram~i-~ing scyle,
as shown in Lhe previous scclion. Recursive problems Lhal arise from (he divide and conquer paradigm can be parallelized by executing ~.ecul-sive calls in parallel rnther than sequentially, as was shown in Section 1.5. This section presenls a dil'l'erenl way to implemenl parallel computations using whai is called a hag nftcrslts. A task is an independent unit of work. Tasks are placed in a bag rhal is shared by two or more worker processes. Each worker execules the following basic code: while (true)
get a task from the bag; i f (no more tasks) break;
# exit the while loop
execute the task. possibly generating new orles;
I This approach can be used to implement recursive parallelism, in which case the tasks are the recursive calls. It call also be used Lo solve iterative problems havi11g a fixed number of independent taslts. The bag-of-tasks paradigm has severd usefill attributes. First, it is quite easy to use. One has only Lo define the representation for a task, impletr~entthe bag, program the code Lo execute a task. and figure out how to detect tzl-mination. Second. programs thal use a bag of tasks are scnfable, meaning that one can use any number of processors merely by varying the number of workers. (However, the perl'ormance 0.1' the program might no1 scale.) Finally, the paradigm makes i t
3.6 Parallel Computing with a Bag of Tasks
133
easy Lo i~nplementload ba1nncin.g. II: tasks lake different amounts oC lime to execute, s o w workern will probably execute more than others. However, as long as tsl~ereare enough more tasks chan workers (e.g., two or Ltlree times as many), the total arnounl of cornpi~tationexecuted by each worker should be approxirnately the same. Below we show how to use 21 bag o l tasks to implement matrix multiplicalion and adaptive quadratwe. I n matrix mulliplication there is a fixed number of
tasks. With adaptive quadrature, tasks are created dynamically. Botb exa~nples use critical sections to prolect access to the bag and use balrier-like synchronization to detect krm.ination.
3.6.1 Matrix Mu1fiplicafion Consider again the problem of multiplying two n x n matrices a and b. Th.is requires computing n2inner produc~s,one for each combination of a row of a and a column of b. Each inner product is an independent computation, so each could be execrlled in parallel. Ho\ve\ler. suppose we are going Lo execute the program on a machine wilh PR processors. Then we woi~ldlike to employ PR worlter processes, one per processor. To balance the colnputational load. each should compute about the same number of inner products. In Section 1.4 we statically assigned parts of the compulation t o each worker. Here, we will use a bag o t tasks. and have each worker grab a task whenever it is needed. 1I' PR is much smallel- than n, a good size for a task would be one or a few rows of the result matrix c. (This leads co reasonable data locality for matrices a and c. assuming matrices are sto~.edin row-~najurordel-.) For simplicity, we wjll use single ~'ows. Initially, the bag contains n iasks, one per row. Since these can be ordered in any way we wjsh, we can represent the bag by simply counti~ig rows : int nextRow = 0 ;
A worker gets a task out of die bag by executing h e atomic action: ( row = nextRow; nextRow++; )
tvl~erer o w is a local variable. The bag is empty ic r o w i s at least n. Tlie i~tomic action above is another instance of drawing a ticket. I1 can be implemented usj~ig a Fetch-and-Add instruction., it one js available, 01.by usirbg locks to protecr rl~e critical seclion. Figure 3.20 contains an outline of the full progran. LVe assume that the malrices have been initialized. Workers compute inner products in the usual way. The program terminates when each worker has broken out of its while loop. [f
134
Chapter 3
Locks and Barriers
int n e x t R o w = 0; # the bag oE tasks double a [ n , n l , b [ n , n l , c [ n , n l ; process Worker[w = 1 to PI ( int row; double sum; # for inner products
while (true) I # get a task ( r o w = n e x t R o w ; nextRow++; ) if (TOW >= n) break; coinpure jnner products lor c [row,* I ; 1 1 Figure 3.20
Matrix multiplication using a bag of tasks.
we wanr to deiect this, cve could employ a shared counler, done, ll'lal is initially zero and ihal is incremeoted atonlically before a worker executes break. Then if we want Lo have the last worker print the results, we could add the following code lo h e end of each worker: if (done == n) print rnalrix c ;
This i~scsdone like a counter barrier.
3.6.2 Adapfive Quadrature Recall that the quadrature problem is to approximate the integral ol' a function f ( x ) fi'on~a t o b. With adaptive quadrature, o w first computes the midpoint m between a and b. Then one approximaces three areas: from a co m, From m to b, and froin a to b. If tile sum oP be two smnl.ler areas is within some acceptable lolcrance of the larger area, tllen the approximation is considered good enough. if not, the larger problem is divided into ~ w osubproblems and the process is
repeated. We can use the bag-of-tasks paradigm as followb to implemenl adaptive quadrature. Here, a task is an interval 10 examine; i t js defined by the end points of' (lie interval, lhe value of the function at those endpoints, and the approxitnatiun of the area for that interval. lnidally there i s j u s t one task, for the entire interval a to b.
Historical Notes
135
A worker repeatedly ta.kes a task from the bag and executes it. In contrast to the matrix m~~ltjylication program, howevt?r, the bag may (temporarily) be empty and a worker might thus have to delay bel'o1.e getting task. Moreover, executing one task will usually lead to producing two smaller casks, which a worker needs to put back into the bag. Finally, it is tnol-e difficult Lo determine when all the work has bcen done because it js not sufficient simply to wail until the bag is empty. (Indeed, the bag will be empty rl~e~ninutethe first task is taken.) Rather, a11 rhc work i s done only when borh (1) the bag is elnpry itnd (2) each wol-ker is waiting to get another task. Figure 3.21 contains a program for adaptive quadrature using a bag of tasks. The hag is <epre.sentedby a queue and a counter. Arlolher counter keeps track oT the number of idle workers. All h e work has been finished when size is 0 i ~ n d i d l e is n. Note thac the progl-am contains several atomic aclions. These are needed to protect the critical sections that access the shared v;uiables. All but one of the arornic actions are unconditional. so they can be protectecl using locks. has to be implementcct using the rnore elaboHowever, the one await state~ne~it rate pro~ocolin Section 3.2 01-usjng a morc powcrl'ul sync11rouit;ition mechanism such as semaphores 01-mot~ilors. The prograrr) in Figure 3.21 (makes excessive use o f the bag of hsks. In particular, when a worker decides to generate two tasks, it pilts both of them i n the bag, then loops around and takes a new task out of thc bag (perhaps even one that it just pur in). Tostcacl, wc could have a worker put one Lilsk jli the bag ood keep the other. Once the bag contains enough work to lead lo a balanced computational load, a f~lrlheroptimjzation would be to havc a workel- execute a rssk fully, using sequerltial recursion, rather than putling Inore Lasks into the bag.
Historical Notes 'The critical section problern was first clescribed by Edsger Dijkstra [J965). Because t.he problem is lundarue~ilal,it h:is bcen studied by scores ol' people who have published 1itel;~llg~ hundrecls of papers on the topic. This chapter h:~s presented four of the lnost i~nl>ortantsolulions. (Raynal [I9861 has written an enlire book on lnuti~alexclusiol~algorithms.) AJtIioi~ghdevisjog busy-wai ting solutions was early on mostly an academic exercise-sioce busy waiting is illefficient on a single processor-the advent o f multiprocessors has spurre.d renewed inlel.cst in such solutjons. Indeed, all m~~ltiprocessors now provide instructions that support at least one busy-waiting solution. The exercises at the end of chis chapter describe most of these inslructions. Dijkstra's paper [I9651 presenl-edthe li rst n-process software solution. It is an exlension of the lirst two-process soluriotl designed by Dutch mathematician T. Dekker (see the exercises). However, Dijkstra's original formulation of the
136
Chapter 3
Locks and Barriers type task = (double left, right, fleft, fright, lrarea); queue bag (task); # the bag of tasks int size; # number of tasks in bag int idle = 0; # number of idle workers double total = 0 . 0 ; # the total. area
compute ;ipproximal-e area from a to b; inserl task ( a , b, f (a), f ( b ) , area) in the bag; count = 1; process Workertw = 1 to PR] { double l e f t , right, f lef t , f r i g h t , l r a r e a ; double m i d , fmid, larea, rarea; while (true) ( # check for termination
( idle++; i f ( i d l e == n && size == 0) break; ) # get a task from the bag ( await (size > 0)
remove n task from the bag; size--; idle--; ) mid = (left+right) / 2 ; fmid = f (mid); larea = (fleft+fmid) * (mid-left) / 2; rarea = (fmid+fright) * (right-mid) / 2; if (abs((larea+rarea) - lrarea) > EPSILON) ( ( put (left, mid, fleft, fmid, larea) inthebag; put (mid, right, fmid, fright, rarea) in the bag; size = size c 2 ; ) 1 else ( t o t a l = total + lrarea; )
1 if
( w == 1) # worker 1 prints the result grintf("the total is % f \ n M ,total);
1 Figure 3.21
Adaptive quadrature using
a bag of tasks.
problem (lid nor require (he eventual entry properly (3.5). Donald Kllulh [I9661 \\/cis the f r s ~ LO publisll a soluliun that also ensures event.ua1 entl-y. The tie-breaker algorithm was discovered by Gary Pelerson [I9811; it is often callcd Peterson's algorilhm as a result. This algorjthrn is particularly simple fo~.two processes, unlike the earlier solutio~lsof Dekker, Dijkstra, and others.
Historical Notes
137
l'eterson's algal-illim also generalizes readily lo an n-process solution, as shown in Figure. 3.7, That solution requires that a process go through all n-1 stages, even if no other process is tryi.ng co enter its critical section. Block and Woo (1990J present a variation hat requires only m stages if just m processes are contending for entry (see the exercises). The bakery algorithm was devised by Leslie Lamport 119741. (Figure 3.1 1 contains an improved version that appeared in Lamport 1.19791.) I n addition lo being more intuitive than earlier critical sectiorl solulions, the bakery algorilh~n allows processes to enter in essentially FJFO order. It also has thc interesting property that it is tolerant of some he~dwarc:failures. First, if one process reads turnti] while another process is setting it in the entry protocol, the read can relurn any value between 1 and the val~icbeing wr-ilten. Sccond, a process cs [i] may fail at any Lime, ass~~niing it im.mediately sets t u r n [i] to 0. However, the value of t u r n ti] cat] become unbounded in Lamport's solu~ionil' there is always a( I.east one process it) its critical seclion. The tickel algorith!n i s the sirnplest of tke n-process solulions to the crilici~l ,secrin~-lprohlen~;it also allows processes to enter in lhe order in which they draw number^. The cicket algorirlim can it1 fact be viewed as an optimization of the bake131 algorithm. However. it requires a Fetch-and-Add j n s ~ u c l i o n ,so~aehing that few ~nachincsprovide (see Alnlasi ancl Goldieb [I 9941 for machines that (lo). Hany Jordan [I9781 was one of the firs[ lo recognjzc the importance of barrier syl~ct~ronization in parallel iterati vc algorithms; he i? creditetl wit11 coining the term bar~ier. Altliough barriers axe uol neatly as commonplace its crjtical sections. they loo have been studied by dore~lsof people who have cleveloped several dil'l'erent i mple~ncniations. The cumhining tree barrier iu Figures 3.13 a11d 3.14 is sirnilax to one devised hy Yew, Tzeng, ant1 Lawric 11 9871. The butterBy barrier was devised by Brooks [1986]. Hensgen, Finkel, arrd Manber [I9881 developed the dissemination barrier; their payer :ilso describes a tourn;iment barrier, which js similar in structure to a combining tree. Gupta [I9891 describes a "fuzzy barrier," ujhich includes a rcgion of statements a 131-ocesscan execute while i t waits for a barrier; his paper also describes a hardware implementation or
fuzzy barriers. As described in seve.ral places i n the chapter, busy-wailing jmplemenrations of locks and baxxiers can lead to m.etnory and interconnection-networlc conrenlion. Thus it is iinport,mt for delayed processes lo spill on the contents of local memory, e.g., cached copies of variables. Chapter 8 of Wennessy and Pdtterson's a~cbitecturebook [I9961 discusses synchronization issues in multiprocessors, includes an intexesling historical perspeclive, and provides an exlensive set of I-el'erences. An excellent paper by Mellor-Cmrrrtny 2nd Scot1 [I9911 arlalyzes [lie performance of several lock and barrier protocols, including most of those described
138
Chapter 3
Locks and Barriers
in Lhe iexc. That paper also presents a new, list-based lock protocol and a scalable, tree-based ban-ier with only local spinning. A recent paper IMcKenney 1996Il describes clickrent kinds of locking patterns in parallel programs and gives guidelines for selecting which locking primitives to use. Another recent paper [Savage et al. 19971 describes a tool called Eraser that can detect data races due l o synchronization mistakes in Jock-based rnultitlueaded programs. On ~nultiprogramsned(time-sliced) systems; performance can su t'l'er greatly i f n process is preempted while holding a lock. One way to overcolne this problem is to use noizblockbzg-or lock fi-ce-algorithms inslead of using mutual exclusion locks. An irnpleinentation o i a data structure is nonblocking if some process will cornplele an operation in a finite number of steps, regardless of the execution speed of other processes. I-lerljhy and Wing [I9901 int~oducesthis concept and presents a nonblocking implementatjon of concurrent queues. 111 I-lerlihy and Wing's implementation, Lhe insert operation is also wait flee--i.e., every inserr operation will complere in a finite number of steps, again regardless of execution speeds. Herlihy [I9901 presents a general methodology for i1np1e.inenting highly concurrent data stnlctures, and Herlilzy 119911 contains a comptehensive discussion of wait-free synchronizatjon. A recent paper by Michael and Scott [I9981 evaluates the performance of nortblocking implementations of several data slruclures; they also considel- the alternative of lock-based impletnenta(ions lliat u e modified to be preemption safe. They cot~cluderhar nonblocking implemen~ations are si~perjorfor those data structures for which clwy exist (e.g., rli~eues),and lhat preemptjon-safe locks outpel-form normal locks on inultiprogrammed syscems. Data parallel algorithms are no st closely associated wirh ~nassivelyparallel machines since those m;lcliines pcmit thousands of dala elelmerits ro be operaled on in p:rallel. Many of the earliest examples of data parallel algori(1ims were designed for Lhe NYU Ul tricornputer [Schwartz 19801; al I have the characterjstic attribute that execution tjrt~eis logarithmic jn the size of the data. The Ultracomputel. is a MIMD machine: and i t s designers realized the importance of cfficienl critical section and barrier synchronization protocols. Consequently. they implein an early hardware prototype [Gottlieb et cnented a Replace-and-Add operi~tio~l al. I9K31; 111is jnstluction adds a value to a memory word and hen relurns the I-esull. I-lowever. more recent versions of the Uluacomputer provide a Fetch-andAdd operation instead [Al~nasi& Gotllieb 19941, which returns che value of a memory word bel'ore addjog ro it. This change was made because Fetch-anclAdcl, n~lilceReplace-aad-Add, generalizes 1.0 any binary combining opcratorI'or example, Fetch-and-Max or Fetch-and-And. This general kind of instruction i s called a FeLch-and-0 operation. The introduction of the Connection Machine in the micl-1980s spun-etl renewed interest it1 data pal-allel slgoritl~rns. That tnachine war designed by
References
139
Daniel Hillis [I9851 as part of his doctoral djsse~faiionat MIT. (Precioi~sfew dissertations liave been as influential!) The Conneclion Machine was ;I SINID machine, so barrier synchronization was au tolnalically provided after every machine instr.uct.ion. The Con~lectionMachine also had ~liousandsof processing elements; tli.is enables a data parallel algorillim to liave a very fine granulai.ityfor example, a processor can be assignecl t o every element of an array. I-lillis and Steele [I9861 give an overview d the oligiiial Col)r~ecfonMacliine and describe severi~linteresting data parallel algorithms. was introcluced by CarThe bag-of-tasks paradigm for parallel cornpuli~~g r i e r ~ .Gebtlter, and Leichter r19861. Thar papei- shows hoihl to ilnplernent thc task bag in the Linda prograrnmin,g notation, wliich we describe in Chapter 7. They called [lie associated proprnmtning n~oclclr-qlicrrtcd woi-keys. because any uuiiiber of workers can share the bag. Soine now call this the work j i ~ r n ln.zodel. because the tasks in the bag are farmed out to [he wol-ltel-x. We prefer the phrasc "bag of tasks'' because il characterizes the essence OF the approach: a sharecl bag or tasks.
References Alruasj. G. S., and A. Gobtlieb. 1994. l-lighly P.rrallel Conrprrling. 2nd ed. Menlo Park, CA: Benjarnilz/Conmni~igs.
of' Peterson's Block, K., and T.-K. Mroo. 1990. A mol'e cfticicnl ge~~eralization Proc~ssirlg L c ! t ~ ~35s (August): ~nutiialexclusion algorithm. Jn.ji>nn~rtio~~
219-22. Brooks, E. D.. 1 11. 1986. The bi~tterflybartier. 1111. Jn~~~r~lill c?f Purnllel Prog. 15, 4 (Aug~ltit):295-307. Can-iel-o,N., D. Gelei-nler, and 1. Leichter. 1986. Distributed data slnlcrures in Linda. Th.i~.teenthACIV Synzp. on Principles of l'rog. Lnuzg.~.. Jani~aql,pp. 23642. Dijkstra, E. W. 1965. Solution of a problem in conculrent progra~~~rning control. Comm. Act11 8 , 9 (September): 569. Gottlieb, A,, B. D. Lubachevsky, and L. Rudolph. 1983. Basic techniques for the efficient coordination of very large ilumbers of cooperating sequential pro. and Sjtslrms 5 , 2 (April): 164-89. cessors. ACM Pans,on P ~ o g Lnn.guage.r Gupta. R. 1 989. The fuzzy barrier: a lnechanism for high speed sy lichronizalion of processors. Tizird Inr. Conj: oop~ Arc:hitecturul Shipport f i r PIVQ. LangLlages and Operatilzg Systems, April, pp. 54-63,
140
Cha~ter3
Locks and Barriers
I-leanessy, J . L.. and D. A. Patterson. 1996. Cor~~pw/c.t.ilrc~hifectura: A Qua~iriralive Approt~ch,2nd ed. San Francisco: Morgan Kau fmann.
D., R. Finkel. and U. Manber. 1988. Two algorithms for barrier synclironi z a t i o ~ ,In/.Jo~lrnc~l of Pc~ru.lle1frog. 17, 1 (January): 1-1 7.
Hensgen.
I-Ie~.lihy,M. P. 1990. A methodology for irnpleme~iri~~g highly concuo-ent data slruclures. Puoc. Second ACM Symp. o,z Principles & Practice ~f Parallel Pmg., March, pp. 197-206. Hcrlihy, M. P. 1991. Wait-free synchronization. ACM fic~rzs. 011 Prog. Lunpuage.es ar1.d Sy.c/ems I I , 1 (January): 124-49.
M.P., and 1. M. Wing. 1990. Linearizability: a correctness condition and Systenzs 12, 3 for cotlcilrrent objects. ACM X-tr~ir.on P m g . Lnng~~ages (July): 463-92.
Herlihy,
l-iillis, W. D. 1 985. Tlze Conneclion Muchine. Cambridge, MA: MIT Press. I-Iillis,W. D., and G . L. Sleele., Jr. 1986. Data parallel algorithms. Conzm. ACM 29, 12 (December): Il70-83. Jordan, H. F. 1978. A special purpose architecture for finite element analysis. Proc. I978 I ~ z f .Conj: on Parallel Processir~g, pp. 263-6.
problem in concurrent prograinming control. Cnnsm.ACM 9, 5 (May): 321-2.
Knutll, D. E. 1966. Additional comlnents on a
L?unporl, L. 1974. A new solution of Dijkstra's concurrent progl-ammjng problen~.Con7nl. ACM 1 7, R (August): 453-5.
Lalnporl, L. 1979. A new ;~pproachto proving the correctness of multiproccss programs. ACM Trans. ow Prog. La)iguages und Sysrons 1 , I (July): 84-97. Lamport, L. 1987. A fast mutual exclusion algoritlinl. ACM T m s . on Conzp~.rterSystems 5, 1 (February): 1 -I I. McKenney, P. E. 1996. Selecting locluter'.S,ys~ern.c 9. 1 (February): 21-65. JL.
Mich;lel, M. M., and M . L. Scotl. 1998. Nonblocking algorilhlns and preernption-sd'c locking 011 ~nultiprogrammed sharccl rnemory mu1tiprocessors. Journal of PCII-trllel ancl Di.strilm.tcd Computer 5 l,1-26. Pctcrson, G. L. 1981. Myths aboul the r n u a ~ a lexclusion probletn. Info1nzation Processing Lef1er.s 12. 3 (June): 1 15-61.
Exercises
Raynal, bI. 1986. Algoraithrns for
~M~rllial Exclusion.
141
Cambridge, MA: MIT
Press.
M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. 1997. Eraser: a dynamic data race detector for multitlueaded progratns. ACiW. Trczns. on Computer Syslems 15,4 (November): 391-41 1.
Savage, S.,
Schwartz, J . T. 1980. UItracomputers. ACM Traizs. on Prog. Lutzguages and Systems 2 , 4 (October): 484-52 1. Yew, P.-C.. N.-F. Tzeng, u l d D. H. Lawlje. 1987. Dishibuting hot-spot addressing in large-scale multiprocessors. IEEE Pans. on Computers C-36, 4
(April): 388-95.
Exercises 3.1 Following is Dekker's algol-llhm, the first solution to the critical section problem for two processes: boo1 enterl = false, enter2 = f a l s e ; int turn = 1; process P1 I while ( t r u e )
{
enterl = true; w h i l e (enter2) i f ( t u r n == 2 ) { enterl = false; while (turn == 2) skip; enterl = true;
1 crjtical section; enterl = false; t u r n = 2; noricrilical section; 1
1 p r o c e s s P2 { while (true) { e n t e r 2 = true; w h i l e (enterl) if (turn == 1) { enter2 = f a l s e ; w h i l e (turn == 1) skip; enter2 = true;
1
142
Chapter 3
Locks and Barriers
ci-itical section : enter2 = false; turn
=
1;
noncrirical section;
1
I
Explain clearly how the program ensures mutual exclusion. avoids deadlock. ovoids unneccssary delay, ant1 el~sureseventual enlsy. For the eventual enu-y properly, how many times can one process that wanls to enter its critical secljon be bypassed by the oLher before the first gets in? Explair~. 3.2 Suppose a computer ha$ atomic dec~crnentDEC and increment I N C instruclions (hat alho return the value of the sign bil of thc rcsulc. In particular. Lhe decrement ~nsiruclionhas the following effect: DEC ( v a r , sign) : ( var = var - 1; if (var >= 01 sign = 0; else sign = 1;
)
INC i s si~uilar.the only difference being that it adds 1 to v a r .
Using DEc andloi. INc, develop a solution to rhe critical scctiorl pcoble~nTo]- n proccsscs. Do not wony abouc the e\~entualentry property. l>esc~.ibcclearly how your solulion works ancl why i t is correct. 3.3 Suppose a computes has an ;~to~nic swap instruction, detinecl as l'ollows; Swap(var1, var2) : ( tmp = v a r l ; v a r l = var2; var2 = tmp; )
Irr the above, tmp is a n internal register. (a) Using Swap. develop a solution to the critical section problen~lor n processes. Do not worry about the evenlual e n 0 property. Describe clearly how
your solutio~iworks 2und why
is corrcct.
(b) l\/lodify your answer to (a) so that it will perl'osm well on a mul~iprocessor system with caches. Explain what changes you make (jf any) and why. ( c ) Using Swap, develop a fail. solulion lo the critical sec~ionpi-oblem-namely, one that also ensures eventual enlry lo each waicing process. The key is lo order the processes, where first-come, first-served is the most obvious ordering. Note that you cannot j u s t use your answer to (a) lo imple~nelltthe alomic action in the ticket algorithm, because you cannot get a fai.r solution using unfair componenls. You may assume chat each process has a unique identily, say, the integers from 1 to n. Explain your solution, and give convincing arguments why it is correct and fair.
Exercises
143
3.4 Suppose a computer has an aroo~icCompare-nad-Swap instruction, which has the following ei'fect: CSW(a, b, c ) : ( if (a == C ) ( c = b ; return ( 0 ) ; 1
else ( a = c;
return (1); 1 )
Para~netersa,b, and c are sjrl.iplevariables, such as integers. Using csw, develop a solutioc~to the critical secrion problem for n processes. Do not won-y about the eventual entry property. Describe clearly Ilow your sojution works and why it is colrec t. 3.5 Some RISC (reduced instruction set computer) machines provide the following two jnstructions: LL(register, variable) # load locked ( register = variable; location = &variable; ) SC (variable, value) # store conditional ( if ( l o c a t i o n = = &variable) ( variable = value; return (1); ) else return (0); )
The (LL) (load locked) instructiotl atomically loads variable into register and saves the address of the variable in a special register. Thc location register is shared by all processors, and il is changed onlj. as a result 0.E executing LL instructions. The sc (store condilional) instruction atomically checks LO see i f the address of variable is the same as the address cu~senclystored in location. If i t is, sc scores value in variable and returns a l: orherwise sc returns a O.
(a) Using these instructions, develop a solution to the critical section problem. Do not worry about the eventual entry property. (b) Are these instructions powerful enough to develop a fair solution to Lhe critical section problem? If so, give one. If not, explajn why not.
3.6 Consider the load Locked (LL) and store condilional (SC)instructions defined in the previous exercise. These instructions can be used to implemenl an a~omic Fetch-and-Add (FA) instruction. Show how to do so. (Hint:l'ou will need to use a spin loop.)
144
Chapter 3
Locks and Barriers
3.7 Considel the following critical section protocol [Lampor1 1987): i n t lock = 0; process CS[i
= 1 to n] { while ( t r u e ) ( (await (lock = = 0 )); lock = i; Delay; while ( l o c k ! = i ) ( (await (lock == 0)); lock = i; Delay;
1
crilical section; lock
= 0;
noncl-ilical section; 1 1
(a) Suppose the Delay code is deleted, Does the prolocnl ensu~.emutual cxclusion? Does it avoid deadlock? Does the protocol avoid unnecessary clelay? Does it ensure eventual enuy'? Cal~eFullyexplain each of your answers. (b) Suppose elie pl-ocesseh execute with true concurrency on a multiprocessor. Suppose the Delay code spins for long enough to e~is~lre that every process i tha~ cvaits for lock to be o has r i ~ n eto execule lhe assignment stalemenl Lhac sets lock to i. Does the p~.otocolnow ensul-e ~ n u r ~ ~ exclusion, al avoicl deadlock, avoid unnecessary delay, and ensure eevetual entry? Again, carefully explain each of your answers.
3.8 Suppose your machine has the followi~lgatomic instl-uclion: flip ( l o c k ) ( lock = (lock + 1 ) % 2; return (lock); )
# flip the lock # return the new value
Someone suggests the followjng solulion lo Ihe critical section problem for t ~ ) o processes: i n t lock = 0;
# shared variable
process CS[i = 1 to 21 { while ( t r u e ) { while (flig(1ock) != 1) while (lock 1 = 0) skip;
critical section; lock
= 0;
noncritical section; 1 1
Exercises
145
(a) Explain why this solulion will not work-in other words, give an execution order that results in both processes being in their critical sections at the same time. (b) Suppose Ulnt the first line in the body o.1 f l i p is changed to do addition n~odulo3 rather than modulo 2. Will h e solution now work for two processes? Explain your answer.
3.9 Consider the following variation on the n-process tie-breaker algori~hm1Block and Woo 19901: int in = 0, last[l:n];
# shared variables
process CS [I 1 to nl ( int stage; while (true) { (in = in + 1;): stage = 1; last[stagel = i; ( a w a i t (last[stagel ! = i or in oddleven transposilion sort). Assume here are n processes P [I n] :and that n i s evcn. In [his kind of sorting method, each process cxecutes a series of rou~ids. On odd-uu~nheredrounds, odd-numbered processes P lodd] exchange valucs with P [odd+l] if the values are out of order. On even-nurnbered rounds, evennumbered processes P [even] excbange values with P [ even+1] . again if the values are out of order. (P [ 11 and P [n] do nothing on even numbered rounds.) (a) Uelei.~nioehow many rounds have Lo be executed in the worst case ro sort n nu~nbcts. 'The11 write a data parallel algorithm to sort inleger array a [l:n] into ascendirlg order. (b) Modify your answer ro (a) lo ~eenliinateas soon as the array has been sorled (e.g.. it might i~litiallybe in ascending order).
(c) Modify yyour answer
LO (a)
co use k processes; asslime n is a muttiple of k.
3.21 Assume there are n processes P[1:n] and that P[1] has some local value v that it wants to broadcast to all the others. In particular, the goal is to store v in every entry of array a[1:n]. The obvious sequential algorithm requires linear time. Write a data parallel algorithm to store v in a in logarithmic time.

3.22 Assume P[1:n] is an array of processes and b[1:n] is a shared Boolean array.

(a) Write a data parallel algorithm to count the number of b[i] that are true.
(b) Suppose the answer to (a) is count, which will be between 0 and n. Write a data parallel algorithm that assigns a unique integer index between 1 and count to each P[i] for which b[i] is true.

3.23 Suppose you are given two serially linked lists. Write a data parallel algorithm that matches corresponding elements. In particular, when the algorithm terminates, the ith elements on each list should point to each other. (If one list is longer than the other, extra elements on the longer list should have null pointers.) Define the data structures you need. Do not modify the original lists; instead store the answers in additional arrays.

3.24 Suppose that you are given a serially linked list and that the elements are linked together in ascending order of their data fields. The standard sequential algorithm for inserting a new element in the proper place takes linear time (on average, half the list has to be searched). Write a data parallel algorithm to insert a new element into the list in logarithmic time.
3.25 Consider a simple language for expression evaluation with the following syntax:

    expression ::= operand | expression operator operand
    operand    ::= identifier | number
    operator   ::= + | *

An identifier is as usual a sequence of letters or digits, beginning with a letter. A number is a sequence of digits. The operators are + and *. An array of characters ch[1:n] has been given. Each character is a letter, a digit, a blank, +, or *. The sequence of characters from ch[1] to ch[n] represents a sentence in the above expression language.

Write a data parallel algorithm that determines for each character in ch[1:n] the token (nonterminal) to which it belongs. Assume you have n processes, one per character. The result for each character should be one of ID, NUMBER, PLUS, TIMES, or BLANK. (Hints: A regular language can be parsed by a finite-state automaton, which can be represented by a transition matrix. The rows of the matrix are indexed by states, the columns by characters; the value of an entry is the new state the automaton would enter, given the current state and next character. The composition of state transition functions is associative, which makes it amenable to a parallel prefix computation.)
3.26 The following region-labeling problem arises in image processing. Given is integer array image[1:n,1:n]. The value of each entry is the intensity of a pixel. The neighbors of a pixel are the four pixels to the left, right, above, and below it. Two pixels belong to the same region if they are neighbors and they have the same value. Thus, a region is a maximal set of pixels that are connected and that all have the same value.
The problem is to find all regions and assign every pixel in each region a unique label. In particular, let label[1:n,1:n] be a second matrix, and assume that the initial value of label[i,j] is n*i + j. The final value of label[i,j] is to be the largest of the initial labels in the region to which pixel [i,j] belongs. Write a data parallel grid computation to compute the final values of label. The computation should terminate when no label changes value.
3.27 Using a grid computation to solve the region-labeling problem of the previous exercise requires worst-case execution time of O(n²). This can happen if there is a region that "snakes" around the image. Even for simple images, the grid computation requires O(n) execution time.

The region-labeling problem can be solved as follows in time O(log n). First, for each pixel, determine whether it is on the boundary of a region and, if so, which of its neighbors are also on the boundary. Second, have each boundary pixel create pointers to its neighbors that are on the boundary; this produces doubly linked lists connecting all pixels that are on the boundary of a region. Third, using the lists, propagate the largest label of any of the boundary pixels to the others that are on the boundary. (The pixel with the largest label for any region will be on its boundary.) Finally, use a parallel prefix computation to propagate the label for each region to pixels in the interior of the region. Write a data parallel program that implements this algorithm. Analyze its execution time, which should be O(log n).
3.28 Consider the problem of generating all prime numbers up to some limit L using the bag-of-tasks paradigm. One way to solve this problem is to mimic the sieve of Eratosthenes, in which you have an array of L integers and repeatedly cross out multiples of primes: 2, 3, 5, and so on. In this case the bag would contain the next prime to use. This approach is easy to program, but it uses lots of storage because you need to have an array of L integers. A second way to solve the problem is to check all odd numbers, one after the other. In this case, the bag would contain all odd numbers. (They do not have to be stored, because you can generate the next odd number from the current one.) For each candidate, see whether it is prime; if it is, add it to a growing list of known primes. The list of primes is used to check future candidates. This second approach requires far less space than the first approach, but it has a tricky synchronization problem.

Write parallel programs that implement each of these approaches. Use w worker processes. At the end of each program, print the last 10 primes that you have found and the execution time for the computational part of the program.
Compare the time and space requirements of these two programs for various values of L and w.
3.29 Consider the problem of determining the number of words in a dictionary that contain unique letters, namely, the number of words in which no letter appears more than once. Treat upper- and lowercase versions of a letter as the same letter. (Most Unix systems contain one or more online dictionaries, for example in /usr/dict/words.)

Write a parallel program to solve this problem. Use the bag-of-tasks paradigm and w worker processes. At the end of the program, print the number of words that contain unique letters, and also print all those that are the longest. You may read the dictionary file into shared variables before starting the parallel computation.
3.30 A concurrent queue is a queue in which insert and delete operations can execute in parallel. Assume the queue is stored in array queue[1:n]. Two variables front and rear point to the first full element and the next empty slot, respectively. The delete operation delays until there is an element in queue[front], then removes it and increments front (modulo n). The insert operation delays until there is an empty slot, then puts a new element in queue[rear] and increments rear (modulo n).

Design algorithms for queue insertion and deletion that maximize parallelism. In particular, except at critical points where they access shared variables, inserts and deletes ought to be able to proceed in parallel with each other and with themselves. You will need additional variables. You may also assume the Fetch-and-Add instruction is available.
Chapter 4: Semaphores

As we saw in the last chapter, most busy-waiting protocols are quite complex. Also, there is no clear distinction between variables that are used for synchronization and those that are used for computing results. Consequently, one has to be very careful when designing or using busy-waiting protocols. A further deficiency of busy waiting is that it is inefficient in most multithreaded programs. Except in parallel programs in which the number of processes matches the number of processors, there are usually more processes than processors. Hence, a processor executing a spinning process can usually be more productively employed executing another process.

Because synchronization is fundamental to concurrent programs, it is desirable to have special tools that aid in the design of correct synchronization protocols and that can be used to block processes that must be delayed. Semaphores were the first and remain one of the most important synchronization tools. They make it easy to protect critical sections and can be used in a disciplined way to implement signaling and scheduling. Consequently, they are included in all threads and parallel programming libraries of which the author is aware. Moreover, semaphores can be implemented in more than one way: using busy-waiting techniques from the previous chapter or using a kernel, as described in Chapter 6.

The concept of a semaphore, and indeed the very term, is motivated by one of the ways in which railroad traffic is synchronized to avoid train collisions. A railroad semaphore is a signal flag that indicates whether the track ahead is clear or is occupied by another train. As a train proceeds, semaphores are set and cleared; they remain set long enough to allow another train time to stop if necessary. Thus railroad semaphores can be viewed as mechanisms that signal conditions in order to ensure mutually exclusive occupancy of critical sections of track.
Semaphores in concurrent programs are similar: They provide a basic signaling mechanism and are used to implement mutual exclusion and condition synchronization. This chapter defines the syntax and semantics of semaphores and then illustrates how to use them to solve synchronization problems. For comparison purposes, we reexamine some of the problems considered in previous chapters, including critical sections, producers and consumers, and barriers. In addition, we introduce several interesting new problems: bounded buffers, dining philosophers, readers and writers, and shortest-job-next resource allocation. Along the way, we introduce three useful programming techniques: (1) changing variables, (2) split binary semaphores, and (3) passing the baton, a general technique that can be used to control the order in which processes execute.
4.1 Syntax and Semantics

A semaphore is a special kind of shared variable that is manipulated only by two atomic operations, P and V. Think of a semaphore as an instance of a semaphore class and of P and V as the two methods of the class, with the additional attribute that the methods are atomic.

The value of a semaphore is a nonnegative integer. The V operation is used to signal the occurrence of an event, so it increments the value of a semaphore. The P operation is used to delay a process until an event has occurred, so it waits until the value of a semaphore is positive and then decrements the value.¹ The power of semaphores results from the fact that P operations might have to delay. A semaphore is declared as follows:

    sem s;
The default initial value is zero. Alternatively, a semaphore can be initialized to any nonnegative value, as in

    sem lock = 1;

Arrays of semaphores can also be declared, and optionally initialized, in the usual fashion, as in

    sem forks[5] = ([5] 1);

¹ The letters P and V are mnemonic for Dutch words, as described in the historical notes at the end of the chapter. Think of P as standing for "pass," and think of the upward shape of V as signifying increment. Some authors use wait and signal instead of P and V, but we will reserve the use of the terms wait and signal for the next chapter on monitors.
If there were no initialization clause in the above declaration, the initial value of each semaphore in the array would be zero.
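To make the role of P and V concrete, here is a minimal sketch, in the notation used throughout this book, of how a semaphore initialized to 1 can serve as a mutual exclusion lock. The process name CS and the placeholder statements are assumptions introduced for this sketch, not code from the text.

    sem mutex = 1;                 # 1 means the critical section is free

    process CS[i = 1 to n] {
      while (true) {
        P(mutex);                  # delays until mutex is positive, then decrements it
        critical section;
        V(mutex);                  # increments mutex, possibly letting a waiter proceed
        noncritical section;
      }
    }

At most one process at a time can be between the P and the V, because the value of mutex is never negative and starts at 1.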
    sem arrive1 = 0, arrive2 = 0;

    process Worker1 {
      ...
      V(arrive1);                  /* signal arrival */
      P(arrive2);                  /* wait for other process */
      ...
    }

    process Worker2 {
      ...
      V(arrive2);                  /* signal arrival */
      P(arrive1);                  /* wait for other process */
      ...
    }

Figure 4.2   Barrier synchronization using semaphores.
We can use the two-process barrier shown in Figure 4.2 to implement an n-process butterfly barrier having the structure shown in Figure 3.15. Or we can use the same idea to implement an n-process dissemination barrier having the structure shown in Figure 3.16. In both cases, we would employ an array of arrive semaphores. At each stage, a process i first signals its arrival by executing V(arrive[i]), then waits for another process by executing P on that process's instance of arrive. Unlike the situation with flag variables, only one array of arrive semaphores is required. This is because V operations are "remembered," whereas the value of a flag variable might be overwritten.

Alternatively, we can use semaphores as signal flags to implement n-process barrier synchronization using a central coordinator process (Figure 3.12) or a combining tree (Figure 3.14). Again, because V operations are remembered, we would need to use fewer semaphores than flag variables; for example, only one semaphore would be needed for the Coordinator in Figure 3.12.
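As a concrete illustration of the preceding paragraph, the following sketch shows one way a worker in a dissemination barrier might use a single array of arrive semaphores. The variables stages, d, and left, and the choice of partner (the process 2^(s-1) positions to the left at stage s), are assumptions introduced here for illustration, not notation taken from the text.

    sem arrive[1:n] = ([n] 0);     # one signaling semaphore per process

    # code executed by Worker[i] for one barrier episode;
    # stages is assumed to be the number of rounds, ceil(log2(n))
    int d = 1, left;
    for [s = 1 to stages] {
      V(arrive[i]);                # announce my arrival at stage s
      left = i - d;                # partner d = 2^(s-1) positions to the left
      if (left < 1) left = left + n;   # indices wrap around
      P(arrive[left]);             # wait for that partner's arrival
      d = 2 * d;
    }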
4.2.3 Producers and Consumers: Split Binary Semaphores

This section reexamines the producers/consumers problem introduced in Section 1.6 and revisited in Section 2.5. There we assumed that there was one producer and one consumer; here we consider the general situation in which there are multiple producers and consumers. The solution illustrates another use of semaphores as signaling flags. It also introduces the important concept of a split binary semaphore, which provides another way to protect critical sections of code.
In the producers/consumers problem, producers send messages that are received by consumers. The processes communicate using a single shared buffer, which is manipulated by two operations: deposit and fetch. Producers insert messages into the buffer by calling deposit; consumers receive messages by calling fetch. To ensure that messages are not overwritten and that they are received only once, execution of deposit and fetch must alternate, with deposit executed first.

The way to program the required alternation property is again to use signaling semaphores. Such semaphores can be used either to indicate when processes reach critical execution points or to indicate changes to the state of shared variables. Here, the critical execution points are starting and ending deposit and fetch operations. The corresponding changes to the shared buffer are its becoming empty or full. Because there might be multiple producers or consumers, it is simpler to associate a semaphore with each of the two states of the buffer rather than with the execution points of the processes.

Let empty and full be two semaphores indicating whether the buffer is empty or full. Initially, the buffer is empty, so the initial value of empty is 1 (i.e., the "make the buffer empty" event has already occurred). The initial value of full is 0. When a producer wants to execute deposit, it must first wait for the buffer to be empty; after a producer deposits an item, the buffer becomes full. Similarly, when a consumer wants to execute fetch, it must first wait for the buffer to be full, and then it makes the buffer empty. A process waits for an event by executing P on the appropriate semaphore and signals an event by executing V, so we get the solution shown in Figure 4.3.

In Figure 4.3, empty and full are both binary semaphores. Together they form what is called a split binary semaphore, because at most one of empty or full is 1 at a time. The term split binary semaphore comes from the fact that empty and full can be viewed as a single binary semaphore that has been split into two binary semaphores. In general, a split binary semaphore can be formed from any number of binary semaphores.

Split binary semaphores are important because they can be used as follows to implement mutual exclusion. Suppose that one of the binary semaphores has an initial value of 1 (hence the others are initially 0). Further suppose that, in the processes that use the semaphores, every execution path starts with a P operation on one of the semaphores and ends with a V operation on one of the semaphores. Then all statements between any P and the next V execute with mutual exclusion. In particular, whenever any process is between a P and a V, the semaphores are all 0, and hence no other process can complete a P until the first process executes a V.

The solution to the producers/consumers problem in Figure 4.3 illustrates this use of split binary semaphores. Each Producer alternately executes
    typeT buf;                     /* a buffer of some type T */
    sem empty = 1, full = 0;

    process Producer[i = 1 to M] {
      while (true) {
        ...
        /* produce data, then deposit it in the buffer */
        P(empty);
        buf = data;
        V(full);
      }
    }

    process Consumer[j = 1 to N] {
      while (true) {
        /* fetch result, then consume it */
        P(full);
        result = buf;
        V(empty);
        ...
      }
    }

Figure 4.3   Producers and consumers using semaphores.
P(empty) then V(full), and each Consumer alternately executes P(full) then V(empty). In Section 4.4, we will use this property of split binary semaphores to construct a general method for implementing await statements.
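The following sketch, written in the book's notation, illustrates the mutual exclusion use of a split binary semaphore described above; the semaphore names and the two-process structure are assumptions chosen for this example rather than code from the text.

    sem idle = 1, busy = 0;        # split binary semaphore: idle + busy is always 0 or 1

    process One {
      while (true) {
        P(idle);                   # every execution path begins with a P ...
        # update shared variables; no other process is between a P and a V now
        V(busy);                   # ... and ends with a V on one of the semaphores
      }
    }

    process Two {
      while (true) {
        P(busy);
        # update shared variables with mutual exclusion
        V(idle);
      }
    }

Because both semaphores are 0 whenever a process is between its P and its V, the updates in One and Two can never overlap; this is the same discipline that Figure 4.3 uses to alternate deposits and fetches.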
4.2.4 Bounded Buffers: Resource Counting

The last example showed how to synchronize access to a single communication buffer. If data are produced at approximately the same rate at which they are consumed, a process would not generally have to wait very long to access the single buffer. Commonly, however, producer and consumer execution is bursty. For example, a producer might produce several items in quick succession, then do more computation before producing another set of items. In such cases, a buffer capacity larger than one can significantly increase performance by reducing the number of times processes block. (This is an example of the classic time/space tradeoff in computing.)
Here we develop a solution to what is called the bounded buffer problem. In particular, a bounded buffer is a multislot communication buffer. The solution builds upon the solution in the previous section. It also illustrates the use of general semaphores as resource counters.

Assume for now that there is just one producer and one consumer. The producer deposits messages in a shared buffer; the consumer fetches them. The buffer contains a queue of messages that have been deposited but not yet fetched. This queue can be represented by a linked list or by an array. We will use the array representation here, because it is simpler to program. In particular, let the buffer be represented by buf[n], where n is greater than 1. Let front be the index of the message at the front of the queue, and let rear be the index of the first empty slot past the message at the rear of the queue. Initially, front and rear are set to the same value, say 0. With this representation for the buffer, the producer deposits a message with value data by executing

    buf[rear] = data; rear = (rear+1) % n;

Similarly, the consumer fetches a message into its local variable result by executing

    result = buf[front]; front = (front+1) % n;

The modulo operator (%) is used to ensure that the values of front and rear are always between 0 and n-1. The queue of buffered messages is thus stored in slots from buf[front] up to but not including buf[rear], with buf treated as a circular array in which buf[0] follows buf[n-1]. As an example, one possible configuration of buf is shown below:

[Diagram: the circular array buf, with rear indexing the first empty slot and front indexing the oldest message; the slots from front up to rear are shaded (full) and the remaining slots are blank (empty).]
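The synchronization this section is building toward can be sketched as follows, using general semaphores as resource counters. This is a minimal sketch consistent with the declarations that survive below (sem empty = n, full = 0) and with the single producer and single consumer assumed here; it is not a reproduction of the book's own figure.

    typeT buf[n];                  /* an array of some type T */
    int front = 0, rear = 0;
    sem empty = n, full = 0;       /* counts of empty and full slots */

    process Producer {
      while (true) {
        # produce message data
        P(empty);                  # wait for an empty slot
        buf[rear] = data; rear = (rear+1) % n;
        V(full);                   # signal that another slot is full
      }
    }

    process Consumer {
      while (true) {
        P(full);                   # wait for a stored message
        result = buf[front]; front = (front+1) % n;
        V(empty);                  # signal that another slot is empty
        # consume result
      }
    }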
The shaded slots are full; the blank ones are empty. When there is a single buffer, as in the producers/consumers problem, execution of deposit and fetch must alternate. When there are multiple buffers, deposit can execute whenever there is an empty slot, and fetch can execute whenever there is a stored message. In fact, deposit and fetch can execute concurrently if there is both an empty slot and a stored message, because deposit and fetch will then access different slots, and hence they will not interfere with each other. However, the synchronization requirements are identical for both a single-slot and a bounded buffer. In particular, the
    typeT buf[n];                  /* an array of some type T */
    int front = 0, rear = 0;
    sem empty = n, full = 0;

    cond pos;                      # signaled when s > 0

    procedure Psem() {
      if (s == 0) wait(pos);
      else s = s - 1;
    }

    procedure Vsem() {
      if (empty(pos)) s = s + 1;
      else signal(pos);
    }

Figure 5.3   FIFO semaphore using passing the condition.
complementary actions are decrementing s in procedure Psem and incrementing s in procedure Vsem. In Sections 5.2 and 5.3 we will see additional examples that use this same technique to solve scheduling problems.

Figures 5.2 and 5.3 illustrate that condition variables are similar to the P and V operations on semaphores. The wait operation, like P, delays a process, and the signal operation, like V, awakens a process. However, there are two important differences. First, wait always delays a process until a later signal is executed, whereas a P operation causes a process to delay only if the semaphore's value is currently 0. Second, signal has no effect if no process is delayed on the condition variable, whereas a V operation either awakens a delayed process or increments the value of the semaphore; in short, the fact that signal has been executed is not remembered. These differences cause condition synchronization to be programmed differently than with semaphores.

In the remainder of this chapter, we will assume that monitors use the Signal and Continue discipline. Although Signal and Wait was the first discipline proposed for monitors, SC is the signaling discipline used within the Unix operating system, the Java programming language, and the Pthreads library. The reasons for preferring Signal and Continue are that (1) it is compatible with priority-based process scheduling, and (2) it has simpler formal semantics. (The Historical Notes section discusses these issues.)
Broadcast signal is the final operation on condition variables. It is used if more than one delayed process could possibly proceed or if the signaler does not know which delayed processes might be able to proceed (because they themselves need to recheck delay conditions). This operation has the form

    signal_all(cv);

Execution of signal_all awakens all processes delayed on cv. For the Signal and Continue discipline, its effect is the same as executing

    while (!empty(cv)) signal(cv);
Each awakened process resumes execution in the monitor at some time in the future, subject to the usual mutual exclusion constraint. Like signal, signal_all has no effect if no process is delayed on cv. Also, the signaling process continues execution in the monitor.

The signal_all operation is well defined when monitors use the Signal and Continue discipline, because the signaler always continues executing next in the monitor. However, it is not a well-defined operation with the Signal and Wait discipline, because it is not possible to transfer control to more than one other process and to give each mutually exclusive access to the monitor. This is another reason why Unix, Java, Pthreads, and this book use the Signal and Continue discipline.
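As a small illustration of when signal_all is appropriate, consider a monitor that manages a pool of identical units: releasing several units at once may allow several waiters, each needing a different amount, to proceed, and the releaser cannot know which of them to awaken. The monitor name, variables, and constant MAXUNITS below are assumptions invented for this sketch.

    monitor Pool {
      int avail = MAXUNITS;        # number of free units (MAXUNITS is the pool size)
      cond units;                  # signaled when avail increases

      procedure acquire(int amount) {
        while (avail < amount) wait(units);   # recheck: another waiter may run first
        avail = avail - amount;
      }

      procedure release(int amount) {
        avail = avail + amount;
        signal_all(units);         # any number of waiters might now be able to proceed
      }
    }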
5.2 Synchronization Techniques

This section develops monitor-based solutions to five problems: (1) bounded buffers, (2) readers/writers, (3) shortest-job-next scheduling, (4) interval timers, and (5) the sleeping barber. Each is an interesting problem in its own right and also illustrates a programming technique that is useful with monitors, as indicated by the subtitles in the section headings.
5.2.1 Bounded Buffers: Basic Condition Synchronization

Consider again the bounded buffer problem introduced in Section 4.2. A producer process and a consumer process communicate by sharing a buffer having n slots. The buffer contains a queue of messages. The producer sends a message to the consumer by depositing the message at the end of the queue. The consumer receives a message by fetching the one at the front of the queue. Synchronization is required so that a message is not deposited if the queue is full and a message is not fetched if the queue is empty.
    monitor Bounded_Buffer {
      typeT buf[n];                # an array of some type T
      int front = 0,               # index of first full slot
          rear = 0,                # index of first empty slot
          count = 0;               # number of full slots
      ## rear == (front + count) % n
      cond not_full,               # signaled when count < n
           not_empty;              # signaled when count > 0

      procedure deposit(typeT data) {
        while (count == n) wait(not_full);
        buf[rear] = data; rear = (rear+1) % n;
        count++;
        signal(not_empty);
      }

      procedure fetch(typeT &result) {
        while (count == 0) wait(not_empty);
        result = buf[front]; front = (front+1) % n;
        count--;
        signal(not_full);
      }
    }

Figure 5.4   Monitor implementation of a bounded buffer.
Figure 5.4 contains a monitor that implements a bounded buffer. We again represent the message queue using an array buf and two integer variables front and rear; these point to the first full slot and the first empty slot, respectively. Integer variable count keeps track of the number of messages in the buffer. The two operations on the buffer are deposit and fetch, so these are the monitor procedures. Mutual exclusion is implicit, so semaphores are not needed to protect critical sections. Condition synchronization is implemented using two condition variables, as shown.

In Figure 5.4, both wait statements are enclosed in loops. This is always a safe way to ensure that the desired condition is true before the permanent variables are accessed. It is also necessary if there are multiple producers and consumers. (Recall that we are using Signal and Continue.) When a process executes signal, it merely gives a hint that the signaled condition is now true. Because the signaler and possibly other processes may execute in the monitor before the process awakened by signal, the awaited condition may no longer be true when the awakened process resumes execution. For example, a producer could be delayed waiting for an empty slot, then a consumer could fetch a message and awaken the delayed producer. However, before
the producer gets a turn to execute, another producer could enter deposit and fill the empty slot. An analogous situation could occur with consumers. Thus, in general, it is necessary to recheck the delay condition.

The signal statements in deposit and fetch are executed unconditionally, since in both cases the signaled condition is true at the point of the signal. In fact, as long as wait statements are enclosed in loops that recheck the awaited condition, signal statements can be executed at any time, since they merely give hints to delayed processes. A program will execute more efficiently, however, if signal is executed only if it is certain, or at least likely, that some delayed process could proceed.
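For concreteness, here is a hedged sketch of how producer and consumer processes would use the monitor of Figure 5.4; the process bodies and the placeholder comments are illustrative assumptions, not code from the text.

    process Producer[i = 1 to M] {
      while (true) {
        # produce message data
        Bounded_Buffer.deposit(data);    # may delay inside the monitor on not_full
      }
    }

    process Consumer[j = 1 to N] {
      while (true) {
        Bounded_Buffer.fetch(result);    # may delay inside the monitor on not_empty
        # consume result
      }
    }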
5.2.2 Readers and Writers: Broadcast Signal

The readers/writers problem was introduced in Section 4.4. Recall that reader processes query a database and writer processes examine and alter it. Readers may access the database concurrently, but writers require exclusive access. Although the database is shared, we cannot encapsulate it by a monitor, because readers could not then access it concurrently, since all code within a monitor executes with mutual exclusion. Instead, we use a monitor merely to arbitrate access to the database. The database itself is global to the readers and writers; for example, it could be stored in a shared memory table or an external file. As we shall see, this same basic structure is often employed in monitor-based programs.

For the readers/writers problem, the arbitration monitor grants permission to access the database. To do so, it requires that processes inform it when they want access and when they have finished. There are two kinds of processes and two actions per process, so the monitor has four procedures: (1) request_read, (2) release_read, (3) request_write, and (4) release_write. These procedures are used in the obvious ways. For example, a reader calls request_read before reading the database and calls release_read after reading the database.

To synchronize access to the database, we need to record how many processes are reading and how many are writing. As before, let nr be the number of readers, and let nw be the number of writers. These are the permanent variables of the monitor; for proper synchronization, they must satisfy the monitor invariant

    RW: (nr == 0 ∨ nw == 0) ∧ nw <= 1

Initially, nr and nw are 0. Each variable is incremented in the appropriate request procedure and decremented in the appropriate release procedure.
    monitor RW_Controller {
      int nr = 0, nw = 0;          ## (nr == 0 ∨ nw == 0) ∧ nw <= 1
      cond oktoread;               # signaled when nw == 0
      cond oktowrite;              # signaled when nr == 0 and nw == 0

      procedure request_read() {
        while (nw > 0) wait(oktoread);
        nr = nr + 1;
      }
      procedure release_read() {
        nr = nr - 1;
        if (nr == 0) signal(oktowrite);     # awaken one writer
      }
      procedure request_write() {
        while (nr > 0 || nw > 0) wait(oktowrite);
        nw = nw + 1;
      }
      procedure release_write() {
        nw = nw - 1;
        signal(oktowrite);                  # awaken one writer and
        signal_all(oktoread);               # all readers
      }
    }

Figure 5.5   Readers/writers solution using monitors.
Figure 5.5 contains a monitor that meets this specification. While loops and wait statements are used to ensure that RW is invariant. At the start of request_read, a reader needs to delay until nw is zero; oktoread is the condition variable on which readers delay. Similarly, writers need to delay at the start of request_write until both nr and nw are zero; oktowrite is the condition variable on which they delay. In release_read a writer is signaled when nr is zero. Since writers recheck their delay condition, the solution would still be correct if writers were always signaled. However, the solution would then be less efficient, since a signaled writer would have to go right back to sleep if nr were still positive. On the other hand, at the end of release_write, we know that both nr and nw are zero. Hence any delayed process could proceed. The solution in Figure 5.5 does not arbitrate between readers and writers. Instead it awakens all delayed processes and lets the underlying process scheduling policy determine which executes first and hence which gets to access the database. If a
writer goes first, the awakened readers go back to sleep; if a reader goes first, the awakened writer goes back to sleep.
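A hedged sketch of how client processes would use this arbitration monitor follows; the database access steps are placeholders, and the process structure is an assumption introduced for illustration.

    process Reader[i = 1 to M] {
      while (true) {
        RW_Controller.request_read();
        # read the database (outside the monitor, so reads can overlap)
        RW_Controller.release_read();
      }
    }

    process Writer[j = 1 to N] {
      while (true) {
        RW_Controller.request_write();
        # examine and alter the database with exclusive access
        RW_Controller.release_write();
      }
    }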
5.2.3 Shortest-Job-Next Allocation: Priority Wait

A condition variable is by default a FIFO queue, so when a process executes wait(cv), it delays at the end of the queue. The priority wait statement wait(cv, rank) delays processes in ascending order of rank. It is used to implement non-FIFO scheduling policies. Here we revisit the shortest-job-next allocation problem, which was introduced in Section 4.5. A shortest-job-next allocator has two operations

To refine this outline into an actual solution, we employ the same basic synchronization as in the sleeping barber solution (Figure 5.10). However, we add

[Figure 5.14: Disk scheduler as an intermediary. (diagram)]

    monitor Disk_Interface {
      permanent variables for status, scheduling, and data transfer

      procedure use_disk(int cyl, transfer and result parameters) {
        wait for turn to use driver
        store transfer parameters in permanent variables
        wait for transfer to be completed
        retrieve results from permanent variables
      }

      procedure get_next_request(someType &results) {
        select next request
        wait for transfer parameters to be stored
        set results to transfer parameters
      }

      procedure finished_transfer(someType results) {
        store results in permanent variables
        wait for results to be retrieved by client
      }
    }
Figure 5.15
Outline of disk interface monitor.
scheduling as in the Disk_Scheduler monitor (Figure 5.13) and add parameter passing as in a bounded buffer (Figure 5.4). Essentially, the monitor invariant for Disk_Interface becomes the conjunction of the barbershop invariant BARBER

path expressions [Campbell & Kolstad 1980], a high-level
mechanism that allows one to specify the order in which procedures execute and obviates the need for condition variables (see Exercise 5.26). The Mesa language provides mechanisms that give the programmer control over the granularity of exclusion. More recently, as shown in Section 5.4, Java has taken this approach of allowing the programmer to decide which methods are synchronized and which can execute concurrently.
References

Andrews, G. R., and J. R. McGraw. 1977. Language features for process interaction. ACM Conference on Language Design for Reliable Software, SIGPLAN Notices 12, 3 (March): 114-27.

Brinch Hansen, P. 1972. Structured multiprogramming. Comm. ACM 15, 7 (July): 574-78.

Brinch Hansen, P. 1973. Operating System Principles. Englewood Cliffs, NJ: Prentice-Hall.

Brinch Hansen, P. 1975. The programming language Concurrent Pascal. IEEE Trans. on Software Engr. SE-1, 2 (June): 199-206.

Brinch Hansen, P. 1977. The Architecture of Concurrent Programs. Englewood Cliffs, NJ: Prentice-Hall.

Brinch Hansen, P. 1993. Monitors and Concurrent Pascal: A personal history. History of Programming Languages Conference (HOPL-II), ACM SIGPLAN Notices 28, 3 (March): 1-70.

Bustard, D., J. Elder, and J. Welsh. 1988. Concurrent Program Structures. New York: Prentice-Hall International.

Campbell, R. H., and R. B. Kolstad. 1980. An overview of Path Pascal's design and Path Pascal user manual. SIGPLAN Notices 15, 9 (September): 13-24.

Dijkstra, E. W. 1968. Cooperating sequential processes. In F. Genuys, ed., Programming Languages. New York: Academic Press, pp. 43-112.

Dijkstra, E. W. 1971. Hierarchical ordering of sequential processes. Acta Informatica 1, 115-38.

Gehani, N. H., and A. D. McGettrick. 1988. Concurrent Programming. Reading, MA: Addison-Wesley.
Geist, R., and S. Daniel. 1987. A continuum of disk scheduling algorithms.

The choice of locking protocol to use will be affected by which special instructions are available. For example, if there is a Fetch-and-Add instruction, we can use the simple and fair ticket algorithm in Figure 3.9. Because we assume that traps are handled by the processor on which the trap occurs and that each processor has its own interval timer, the trap and timer interrupt handlers are essentially the same as those found in the single-processor kernel. There are only two differences: executing now needs to be an array, with one entry per processor, and Timer_Handler needs to lock and unlock the ready list.

The code for the three kernel primitives is also essentially the same. Again, the differences are that executing is an array and that the free, ready, and waiting lists need to be accessed in critical sections. The greatest changes are to the code for dispatcher. Before, we had one processor and assumed it always had a process to execute. Now there could be fewer processes than processors, and hence some processors could be idle. When a new process is forked (or awakened after an I/O interrupt), it needs to be assigned to an idle processor, if there is one. This functionality can be provided in one of three ways:
* Have each processor, when idle, execute a special process that periodically examines the ready list until it finds a ready process.

* Have a processor executing fork search for an idle processor and assign the new process to it.

* Use a separate dispatcher process that executes on its own processor

A library runs on top of an operating system, so it is entered by means of normal procedure calls and contains software signal handlers rather than hardware interrupt handlers. A semaphore descriptor contains the value of one semaphore; it is initialized by invoking createsem. The Psem and Vsem primitives implement the P and V operations. We assume here that all semaphores are general semaphores. We first show how to add semaphores to the single-processor kernel of Section 6.1 and then show how to change the resulting kernel to support multiple processors, as in Section 6.2.
Recall that in a single-processor kernel, one process at a time is executing, and all others are ready to execute or are waiting for their children to quit. As before, the index of the descriptor of the executing process is stored in variable executing, and the descriptors for all ready processes are stored on the ready list. When semaphores are added to the kernel, there is a fourth possible process state: blocked on a semaphore. In particular, a process is blocked if it is waiting to complete a P operation. To keep track of blocked processes, each semaphore descriptor contains a linked list of the descriptors of processes blocked on that semaphore. On a single processor, exactly one process is executing, and its descriptor is on no list; all other process descriptors are on the ready list, the waiting list, or a blocked list of some semaphore.

For each semaphore declaration in a concurrent program, one call to the createsem primitive is generated; the semaphore's initial value is passed as an argument. The createsem primitive finds an empty semaphore descriptor, initializes the semaphore's value and blocked list, and returns a "name" for the descriptor. This name is typically either the descriptor's address or an index into a table that contains the address.

After a semaphore is created, it is used by invoking the Psem and Vsem primitives, which are the kernel routines for the P and V programming primitives. Both have a single argument that is the name of a semaphore descriptor. The Psem primitive checks the value in the descriptor. If the value is positive, it is decremented; otherwise, the descriptor of the executing process is inserted on the semaphore's blocked list. Similarly, Vsem checks the semaphore descriptor's blocked list. If it is empty, the semaphore's value is incremented; otherwise one process descriptor is removed from the blocked list and inserted on the ready list. It is common for each blocked list to be implemented as a FIFO queue, since this ensures that the semaphore operations are fair.

Figure 6.5 gives outlines of these primitives. They are added to the routines in the single-processor kernel given earlier in Figure 6.2. Again, the dispatcher procedure is called at the end of each primitive; its actions are the same as before.

For simplicity, the implementation of the semaphore primitives in Figure 6.5 does not reuse semaphore descriptors. This would be sufficient if all semaphores are created once, but in general this is not the case. Thus it is usually necessary to reuse semaphore descriptors as well as process descriptors. One approach is for the kernel to provide an additional destroysem primitive; this would then be invoked by a process when it no longer needs a semaphore. An alternative is to record in the descriptor of each process the names of all semaphores that the process created, and when the process invokes quit, to
    procedure createsem(int initialValue, int *name) {
      get an empty semaphore descriptor;
      initialize the descriptor;
      set name to the name (index) of the descriptor;
      dispatcher();
    }

    procedure Psem(name) {
      find semaphore descriptor of name;
      if (value > 0)
        value = value - 1;
      else {
        insert descriptor of executing at end of blocked list;
        executing = 0;             # indicate executing is blocked
      }
      dispatcher();
    }

    procedure Vsem(name) {
      find semaphore descriptor of name;
      if (blocked list empty)
        value = value + 1;
      else {
        remove process descriptor from front of blocked list;
        insert the descriptor at end of ready list;
      }
      dispatcher();
    }
Semaphore primitives for a single-processor kernel.
desh-oy h e semi~phoresit created. With either dpproach, it is imperalive that a semaphore not he u ~ e dafter it has hecrl dcslroyed. Detecting misuse requircs that each descriptor- have a irniq~~c ilanle that i s validated each time psem or. vsem is calletl. This can bc irnyhcnicnteci by lerling the name of a descriptor bc a combination of an incicx-which is used 10 locate the descriptor-and a unique
i;erlucnce nurnkcr. We ~ ~ 1 exte~ld 11 thc slnglc-processor lmplemenlatlon of the semaphorc primitivev in Figure 6.5 into onc for- a rnullipincessar i n the s~lrneway as described in Secrio~i6 2 a ~ i dshown in F i p r e h.4 Again, the critical requircment is lo lock 1;11arcd ilala struclurcs. bul only I'or as long as absolutely rcquircd. Hence, there shol~ Id he a separule lock for each seniaphnre dewriptor. A ~emaphoredescriptor IS locked in Psem and v s e m ji~stbefotc 11 is accessed; the lock is releascd as
6.4 Implementing Monitors in a Kernel
279
soon as the descriptor is no longer ~leeded, Locks ai-e again acquircd and reIeased h y meanr of n busy-waitins solution to thc cr~lic;~l sectian problcm. The is~;ilesthat wcrc discussed a[ the end of Scclion 6.2 also arise i n a muItiptocewor kcrr~clthat implements scrnaphores. In particulal; a process might need to cxccttle on a rpecific proccssol; il may be itnportant to execute a process on thc stline placessor that it 1 x 1 executed on, or it niay be important to coschedule processes o n the camc processor. To support this I'unctionality-or to avoid contention for a shared ready I15t-cach processor might wcll have its own ready list Tn this casc, whcn a procesq is awakcned in the vsem primitive. the process needs to be put an the appropriate ready Iisl. Thus either thc vsem priiniLive needs to Iock a possibly remote ready list, or i l needs to inform another processor and let that processor put Lhe unblockctl prncexs on itc rcady lisl. The first dpproach requires remote lucking; thc sccond requires using soniething like interprocccsor intcrrupls lo send a messagc from one processor to anolher.
6.4 Implementing Monitors in a Kernel Mon~torscan also be readily implemented in a kernel; this scction qhowq how to do so. Thcy can also be sirnulatcd using semaphores; we show how to do that in the next section. The locks and condition variables in a hbrary such as Pthreads or a language such as Java are also implemented in a manner similar to thc kernel described here. We assume the monitor semantics defined in Chapter 5. In pariiculal; pro cedures execute with mutual exclusion, and condition synchronization uses thc Signal and Conrinuc diwipline. We also assume that a monitor's permanent variables are to red in memory accessible to all processes that call the monitor's procedures. Code thai implements the procedures can be storcd in shared memory, or copies of the code can be s~oredin local mcmory on each processor that cxcculeh processe5 that use tlle monitor. Finally, we assume the permanent varial~lesare initialized before the procedurcs are called. This can be accomplished hy. ~'(JI. exarnplc. allocating and 1nitia11zin.gpermanent variables before creating any processes rhal will access thcm. (AlrernaliveIy, init~alizationcode could be executed on t11c iirsl call of a monitor prucedure. However. this is le.;r; efficient since every call woulcl have to check to scc if il were the first.) To itnplement monitor\, we need to add primitives to our kernel for monitor entry, monitor exit, ancl each of the operalions on condition variablex. PI-imitives are dlso needcd ro create descriptors for each monitor and cach condition variable (unles~thcse are crealed when thc kernel itself ic initialized); these are not shown here, becauxe they are analogous to Lhe createsem PI-imilivein Figurc 6 5.
280
Chapter 6
Implementations
Each monitor descriptor mame contain< a lock mLock and an ctltrg qucue of descriptorh ot processes waiting to enter (or reenter) thc mo~~itoi-, The lock is used 10 ensure mrrtual excluxinn. When the lock il; cet, exactly o ~ l cproces\ is executing ~n the rnoilitot; othcrwisc, no process is executing in the monitor. 'fhc descriptor for a condi~ionvariable conlains Ihe head of a queue of descriptors of processes wailing on lhat condition variable Thur; every proccss descriptor--excepI perhaps [hose of executing processes-is linked to cithcr thc ~ e a d ylist. a monitor entry queue. or a condition variable qucuc. Condi~ionvariahle descr~ptorsare contmonly stored adjacent to the descriptor oi' llrc monllor in which Ole cunclition variables are declared. This is done to avoid cxccssive I'ragmentation of kernel storage and to allow the run-time idcnlity of a condition variable simply to be an offset from the start uf the appropriate monilor descriptor. The rnonitor entry primitive enter (mame) findt he descriptor f(,r tnonitor mame, thcn cither scls Ihe monitor lock and aIlows lhe executing procesl; to p~ocecdor blocks rhe process on the monitor entry queue. To enahle thc dcscripLor lo be I'ound quickly, the run-time identtty of m a m e is typically thc adclrcss or the non nit or descriptor. The monitor exit primilive e x i t (mame ) eithel- moves one process from the entry q u e u e to the ready list or clcass thc monitor lock The w a i t ( c v ) statement i s implemented by invoki~lgthe kernel priiiiilive w a i t (mame, cmame ) , and the signal ( cv) slatcinenc is implcmenled by invoking the kerncl prirriitive signal ( m a m e , c N a m e ) . In holh pri rnitives, mame is the "name" of the moililor wilhin which the pri rnitive i s invoked-this could e~tlierbe the index or adclre$s of Ihe monitor's der;criptor-and cName I S [he index or address of the clescriptor of the appropriate conditio~ivariable. Cxecution of wait delays the executing process on the spccificd condirion va1i:zble queue and the11eithcr awakens soxrlc process on Ihc monilor entry q u e u e or clears (he monitor lock. Execution of s i g n a l checks thc condition variable queue. JI' it is empty, the prilnitive simply returns; otherw~sc.Lhc dcscr~ptol-at the tront of the condition variable qucuc is movcd to the cnd or the rnoniIo~-cnlry clueue. Figure 6.6 gives code outlines for lhese primitives. Ac: in previous kcrncls. he pri~ni~ives are entered as a result of a supervisor call, executing pointq to Ihe dei;criplor of the executirlg process, and executing is set to 0 when he executing procesx blockx. Since a process that calls wait cxits thc mon~tor.Ihe w a i t primitive sirnply cdII5 the exit primitive after bloclcinp the execuling procew. The last action of the enter, exit, and signal prilnilives is to ci~llthe dixpalcher. It is straighlforward to implement the orhcr operarions on conclitio~lvariablec. For example, irnplernenting empty ( cv) mere1y ~nvolveslesling whether cv's delay qiicue is empty. In fact, if the delay y ueue is directly accessible to proccsws, il is not neccssasy lo use a kernel primili\,e lo ~rnple~nent empty. This is because the execuling process alrei~dyhas the rnonitor locked, so the contents
6.4 Implementing Monitors in a Kernel
procedure enter(int m a m e )
281
{
find dewriplor for inonitor m a m e ; if (rnLock = = 1) {
insert tlcscriplor ol' executing
at end of entry queuc;
executing = 0;
1 else m L o c k = 1; dispatches(};
# acquire exclusive a c c e s s to rnName
3 procedure exittint m a m e ) {
fncl descriplo~for monitor mame; if (enlry queue nor empty ) move pmccss from front of entry rlueue In reiir of' ready Ilr;t; else mZlock = 0; dispatcher ( ) ;
# clear the lock
1 procedure wait (int xnNWe;
i n t c N ~ M ~{ )
find descriptor for condition i~ariable c ~ m e ; insert descriptor of executing aL encl of delay quc~rcof cName ; executing = 0 ; e x i t (mame) ;
1 procedure signalIint mName;
int cName) {
find descriptor for monitor mame; find descriptor- for condition variable c N a m e ; i f ( delay queue not crnpty )
move proceu from front of delay queue lo rear of cotry queue; dispatcher ( ) ;
1
Figure 6.6
Monitor kernel primilives.
of the condition queue cannot be changed by anothcr process. By iinplernenting empty withoi~tlocking. we would twoid he overhead or a supervisor call and return.
We can also m a l e the implernentatio~i of signal more el'ticient that) is shown in Figure 6.6. In particuIar, wc could inodrry s i g n a l so that il ulwtlys moves a descriptor from the front o l the appropriate delay queuc to the end (>I'ihe appropriate entry qucue. Then, s i g n a l in a progratrl would be ~ranxlaiedinto
codc that rcsts t l ~ cdc1ay queue and invokes msignal in t l ~ ckernel only if [lie delay qucuc is cnlpty. By malti~~g thcsc changes. [he overhead 01' kernel entry and cxit is avoided u,hcn signal has 110 cffcct, Indcpenclcnt of how signal is 1ml7le1nented,signal-all is implemenlecl by a kernel pr~mit~ve that moves all descriptorl; l'rom the specified delay queue to the end of the ready list. Priority w a i t is irnplernented analogously to nonpl-iority wait. The only difference is that the descriptor of the executing proccss nceds to bc inserled at he appropriale place on he delay queue. To keep lhat queue ordered, the rank of cach waiting proceas nccds lo bc rccordcd: Ihc Iogical place Irr store the rank is in proccss descriptors. This also lnakcs implementation of minrank lrivial . In fact, minrank-like empty-can be ilnplemenled will~ou! entering the kernel, ac; long ah llie minimum rank can be read direclly by Lhe executing process. Thi5 lcernel can he exlended to one for a multiprucer;~orusing the techniques cler;crjbed in Section 6.2. Again, the key requirement is to protect kcrnel data structui-es froin bcing accesscd simultancausly by processes executing on diffcrcnt proccwors. A n d again, onc has to worry about avoiding memory conlenlion, exploiting caches, co-scheduling processes. and balancing processor load.
On 3 single processor, monitor%can in some cases be implemented even more elliciently withour using a kernel. 11 there are no nested monitor calls-or ~f nested calls are all open calls-and all monitor procedure< are short and guaranteed In tern~inate, it ir; both pnxsible and reasonable to implement mutual esclucion by inhibiting interrupts. This i~ done ac follows. On c n t q lo a monitor, thc cxccuting proccss inhibits all interrupts. Whci~it rctums from a monitor proccdutc, il enables inrcrrupts. 11' the procecs lzas LO wail within a monitor, i t bluuks on a condilion variable queue. and the fact that the process was executing wilh intcirupts inhibited is rccordcd. (This is oIZen encoded in a processor status rcpi;lcr thal li saved when a process block\.) When a wzuting proceqs 1s awakened iis ii rexull rrl' signal, il i \ i u o ~ e dfroin the condition variable queue to the ready lict; the iipnallng process conti tlues to execute Finally, whcnevcr a ready proceqs i s drspatclied, it wcsumcs cxcccltjon with intcrrupts inlzibiled or enabled, depending on wl~cthcrintcrrupls wcrc inhibited or enabled at the time the process blocked. (Ncwl y created proccsscs bcgin execution wtlh inlerrupts enabled.} This implemen(a1lon does away with kernel primi tjves-they become either in-line code 01 regular subroutines-and it does away with monitor descriptors. Since intcrrupts arc inhibilcd mshilc a process is cxecuti~~g in a rnonilor, il c;innot he lbroed LO relinquish the processor. T h u ~the process has exclusive access to Ihe mon~lorilntil il wails or I-e1u1.n.s. Assunling monilor procedures terminate, eventually the process will wail or return. If the process waits, when it is awakened and resumes execution, interrupts are again inhibited. Consequently, the procevs again has exclusive control of the monitor in which it waited. Nesled
I
6.5 lmplement~ngMon~torsUsing Semaphores
283
monitor. calls cannot he allowed, however. IT (hey were, anolher pi-ocess might slast cxccuting in the monitor from w h i d ~the nested call was made while a process is waiting in a second monitor. Monilnrs are implcmcntcd in essentially this way i n the UNIX operating system. I n tact, on entry to a monilor procedure In UNIX, intcrrugts are inhibited only frurn tho.;e devices that could cause some other process lo call !hc sarnc monitor before the inte~ruptedprocess wailta or rctul-na from thc moniror. 111 general, kowevcr, a kernel implementation is needecl since not 311 monitor-based programs mcct thc rcquireme~~ts for this speciali~edimplementation. In par(icular, only i n a "trusted" program such as w operal~npsystem is 11likely lhal all rnonilor procedures can he guaranteed to terminate. Also, the speciali~edimplemctltatioti only workc on a single procehsar. On a multiprocessol, lock!. 01' some form are s t i l l rcquircd to cnsur-e mutual exclusion of processes executi~lgun di f'l'erenl pl-ocessors.
6.5 lmpiementing Monitors Using Semaphores The previous seclion showed how to implement monitors using n kernel. Here we show how to i~nplementthem us~ilgsetnaphorcs. Thc rcasonq for clo111g so arc that ( I ) a ~oftw~1t.e library rniehl support semaphores but no( monitors, or (2) a la~lgiragcruight provide semaphores but not monilors. In ally evenl, the solution illustrates another cxarnple of the use of semaplzores. Again we assume he monitor semantics defined in Section 5.1. Hence, our concerns are ilnplementine mutual exclusion of monitor praccdures and condition sy nchroniaation belween monitor procedures. In particular, we need to develop (1) entry code that is cxcculed aftcr a process calls a rnonitor procedure but before the procesh begins execuling the procedure body. (2) cxit code that is executed just before a process returns from a procedure body, and (3) code h a t implements wait, signal arid the nthcr operations on condition variables. For simplicity, we assume thcrc is olzly one condition variable; the code we develop below can i-e;idily be generalized to having Inore condrtio~ivariables by uring arr-dys of deldy queue:, and counleis. To implement rnonitor exclusion, h e use one entry se~naphorcfor each monitor. Let e be (he xernaphore associated with norl lit or M. Since e i s to be useti for mutual exclusion, i t s initial value is 1,and ils valuc is always o or I. The purpose of the entry p~otocolol each piacedu~-ein M i s to acqi~irccxclusivc acccss to M: as usual with semaphores, this is implemented by ~ ( e )Similarly. . [he exit protocol of cach procedure releases exclusive access, so i t ii; implemenled by v ( e ) .
,
I 4
Execution of wait (cv) releases monilor exclr~sionand delays the executing proccss on condition variable cv. The process resumes execlit ion in the monitor after i t has been signaled and after il can regain exclusive cont~olol' lhe monitor. Because a conditioi~variahle is essentially a queue of del,~yedprocewes, we can LIxxurne there is a queue data type that iriiplernents a FIFO qucue of processes and we can use an m a y delays of this type. We can also use an intcgcr counter, nc, to count the number of delayed proceqses. Inirially delayQ 1s einpty and nc is zcro, because there are no waiting processes. Whcn a process executes w a i t ( cv), i t increments nc then adds its clexcripror lo delayQ. The procesr; next releases the monilor lock by executing V ( e ) ;tnd lhcn blocks itself on a private semaphore, one that only it waits an. After the process is awaiccned. it waits to reenter the motiitor by executing F ( e1 . When rt process exccutes signal ( c v ) , it firrt checks nc to see if there are any waiting processes. If nol, the signal has no effect. If thcre are waiting processec, the slgnaler decrements nc. removes the oldesl waihng process T~.ornthe front of delay^, then signals its private semaphore.' Figure 6.7 contains the code for impIementing monitor rynchroni7atio1-1 using scinaphor-cs. Mutual exclusion is ensured since semaphore e is initially I
and cllery cxecution path executes alternate P (el and V ( e ) operations. Also, nc i~ pos~livewhen ar Icasl one process is delayed or is about to delay. Give11the above rcprcscntation for condition variables, it is strs~ightlbrward to I mplernent the empty primitive. In particular, empty rcturils true if nc i < zero. and i t returns false if nc is positive. It is also s!raiglltforw:~rdl o irnplerricili signal-all. Rcrnove each waiting process limn delayQ and qignal its pribate scrnaphorc. then 5ct nc to 7et-o. To implement priority wait, i t is sufficient tc) male delayQ n priorily qucuc ordcrcd by delay ranks. Tlri.; requires adding a iieId Tor the dclay rank to each queue clement. With this addition, i t is trivial to ill-~plcnilcntminrank: just rcturn the delay rank of the process ,it the front of delayQ.
Historical Notes Thc f o r k , join,and q u i t primitives were in~n~cluced by Dennih and Van Horn 119661. Variants of fork,j o i n , and q u i t are provided by most operating systcms; lor cxamplc. UNIX [Ritchie & Thompson 19741 provldes similar I OHCtnisht think thot it would bc sufficient to usc a se~n:tphn~.eI b r the dclay queue. becalrze 11 scn~aptrorcis csscntially a ql~cuco f blocketl pruceises. 'l'liis w o ~ ~ lobvjatc ri (Re need Tor an explicil dr:lay cluerle and an arrnv nf ~)rivirte scrnaphorcs Hrrwever, thcxe is a whtle prrlblrm: Allhr~ughP anrl v oprralirms arc a r [ m ~ i SCLIIIEIICCS ~, of tlieln are ncn. In partict~la~:a proceix cruld be prsenlple~tin wait arler execulinp v ( e ) and b e h e hluckinp. Ertrciac: 6-12 explorec Iliir prtlhlern.
Historical Notes
sharcri variables:
monitor entry: wait ( c v ) :
= I; int nc = 0; queue delayQ; s e m private[Nl; senr e
285
# one copy p e r monitor # one copy per condition # variable cv # one entry per process
P (e); nc = n c + l ; insert rnyid on delayQ; V ( e ) ; P(privateTmykdJ}; P(e);
signal (cv): if (nc r 0) nc = nc-1; rclnove otherid from delayQ; V(private [ o t h e r i d ] ) : 1
monitor exit: Figure 6.7
v ( e );
Implementing rnonilors using semaphores.
system calls namcci fork, wait, a ~ exit. ~ d Similar primilives have also bccn included in several programming languages. xuch as PLIl. Me.s&~. ancl Moclttlu-3. Many operating systerrl tcxtq describe implerncnlalions of singlc-procer~or kernels. Blc and Shaw 11988 1 and Hull [I 983 1 contain parlicularly good descriptions. Those books also dcscribe the other r u ~ ~ ~ t i oiln n : ,operating syhlem inusr support-quch as a file syslem and lncrnory 111;inagement-ancl how they rclatc to Ole kerneI. Thompson 1 1 9781 descrihcc the implementation of the UNlX kcnicl: Hull [ I 9831 describe7 a UNIX-compa~ibiecys~cmcalled Tunis. Unfortunatcly. opet-;ding systems tcxls do not descri bc rnul~iprocessurkmnels in any dctail. However, an exccllcnt report on cxpcricnce with qome of Lhe early mulljprocessorr; devclopcd at Canegie-McIlon University appcars in ;I survey paper by Jones and S c h w a r ~11 9801; the Mu1liprt)ceh~osCocking Principle (6.1) come5 fram thal paper. A few multiprocessor operating syslems arc discussed i n Hwang 119'$3] and Altr~asiand Go~tlieh[1994]. Tucker and Gupva 1 19XgI descl-i bc process control and scheduling i s s ~ ~ fore r shared-memory multiprocessors with uniform mernory access lrme Scott et :tI. I19901 cliscusr kcrncl issues f'or nonuniform memory access (NUMA) multiproccssors. including thc use al' multiplc rcady lists. In general. Ihc proceed~ngsof the i'ollowing rhtce conferences are cxcellent sources for nluch o f the best recent work on language and software issues related to multiprocessors: Architecti~ral Support for Programming Languages and Operating Systems (ASPI,OS), Symposium on
286
Chapter 6
lmplementations
Ol'eraling S y slems Principles (SOSP). and Principles and Practice or P:irnllel Program~ning(PPoPP). Several papers ; ~ n d hooks descr~bekernel implcrncntal~onsof mon~tor-5. Wirth [ 19771 descr~besthe Moclula l c ~ n c l .Holl el al. 1 19781 describe both single and rnultiplc proccisor kcr11eIs lor CSPlk. Holl [I9831 describer; kernels for Concurr-eni buclid. Joseph et al. [ 19841 present the design and implcmcnta~ionol' a comple~eoperallng system for a shared-memory multiprocc~sor;lhey use moni(ors wilhin the npel-;~tingsystern and ~ l i o whow to iniplernenl them ~1r;inga kernel. Thompson [ I 9781 dncl Holt [I 9831 describc thc rmplcmcntation ol*UNIX. The UNIX ker11el imple~nents mutual exclusion by performing context switches only when uher processes block or exit thc lcernel and by inhihiting exlernal inlerrupts at critical points. The UNIX cqujvalenl of a condition variable is called an event-an arbitrary integer that is typically he address oi'a descriptor such as a proccss or file descriptor. Within the kernel, process blocks by cxccilllng sleep ( e 1. It i i awakened when mother proccss execu les wakeup(e). The wakeup pl-imitiiie has Sign'il and Contir~ucsernanlicc. 11 i h ;ilr;o a broadcast primitive; namely, wakeup ( e ) :wakens all plocesses blocked on event e. UNIX has no equivalent of the signal primithe lo awaken just one process. Thus. if more than one process could bc wailing for an evenl, each has to check if thc condition r t was waiting [or is slill tn~eand go hack to sleep if it is not.
Hoare's cIassic paper on monitors [ I 9741 deycribes llow to implcrnent them using hernnphores. However. he assumed the Signal and Urgclll Wait discipline r;ilhei. rhan Signal and Continue or Signal. and Wait.
References Almasi, G. S . , and A. GoLtlieb. 1994. Highly Parallel Computing, 2nd ed. Menlo Park, CA: Renjam inlcummings.
Bic. L., and A.C. Shaw. 1988. The Logical Design qf'Operufing ,~ysf~~rn,s, 2nd ed. Englewvod Clif'Fs. NJ: Prentice-Hall. Dcnnis, J . B., and E. C. V~11Hnrn. 1966. Programming semantics for mulcipropil111111ed cornpiilalions. Cornm. RCM 9. 3 (March); 143-55.
Heal-e. C.A.R. 1 974. Monitors: An operating sysccln skucluring concept. Uomm. ACM 1 7, I 0 {Ocmbcr);549-57. Halt. R. C. 1983. Concurr.cnl E~r,lid TIIP L/A/lX Sy.\ti>na, and Tunis. Reading, MA: Addison-Wesley.
References
287
Holl, K. C., G . S . Grah;im, E. D. Lamwska, and M. A. Scott. 1978. Strurfure~I Concurren t Programming wirh Operating Svstem Applicntinns. Reading. MA: Addison-Wesley.
Hwang, K. 1993. Advanced Computer Archilecture: Purulleiism, Scnlahiiily, Prigrammahility. New York; McGraw-Hill. Jones, A. K.. aiid P.Schwas~.1980. Experience using multiprocessor sy steinsa slatus report. ACM Complkting S U I V P ~12, S 2 (June): 121-65. Joseph, M., V. R. Prasad, and N. Natarajan. 1984. A Mulripmressor Oper~iting Sy.7t.m. New York: Prentice-Hall Internalional.
Ritchie, D. M., and K. Thompson. 1974. The UNTX timesharing systcm. L'omm. ACM 17, 7 (July): 365-75. Scott, M. L.,T. J. LeBlanc, and B. D. Marsh. 1990. Multi-model parallel programming in Psyche. Pro-oc. Second ACM S y ~ n p on . Princip1~s6: Prucrice elf PurnEI~lProg., March. pp. 70-78. Thompson, K. 1978. UNIX implcmenlalion. The Bell System Tcrhrrirn/ Jourrtul 57, 6, part 2 (July-August): 193146.
Tucker, A.. and A. Gupta. 1989. Process control and schcdulii~gishues lor multiprogram~ned %hared-memory multiprocessors. Pmc. Tw~lfTh ACM Symp. on Operalrag Systems Principles, December, pp. 1551-66.
Wirth, N. 1977. Design and i~nple~nentation of Modula. Sqflnrc~~.e-Pmoticc and Exi>eriencc 7. 67-84.
Exercises 6.1 In tbe rnultiproccssor kcmel described in Seclion 6.2, a processor exccutcs the I d l e process when it finds the ready list empty (Figure 6.3). On soinc machines. there ir a bit in the processor status word that, if scr. causcs a processor to do nothing until it is intermpted. Such machines also provide interprocessor interrupts-i.e., one processor can interrupt a scconcl, which causes the ~ecorldproceqsor to enter a kcrnel CPU interrupt handler.
Modify the multiprocessor kernel in Figure 6.4 so that a processor sets ils idle bit if it finds the rcady list cmpty. Hence another processor will need to awaken it whcn thcrc is a process for it to execute. 6.2 Suppose process dispatching is handled by one master processol. In particular, all the master does is execute a dispatcher process. Other processors executc
regular processes and kernel routines.
288
Chapter 6
Irnplernentations
Des~gnthe d~spatcherprocess, and modify the multiproceccor kernel in Figure 6.4 as appropriate. Define any dalu xtructu~csyou izecd. Kcmcmbcr lhat an idlc processor rnusr not block inhide Ihe kernel sincc that prcvcntq olhcr processorc f'rom enlerlng thc Iccrncl. 6.3 I n a multiprocess~rsystem, assume each processor has its own rcndy list and that it execules oiily lhose processes on its ready list. As discussed in the text: this I-ilises[he issuc of load balancing since ii new process has to be assigned lo some processor. There ;Ire numerous Inad-balancing schemes-e.g., assign to a ratldom processor. a\\ign to a '-neighbo~,"or keep ready lists roughly cqual in length. Pick some scheme. juctify yoi~rC ~ O ~ C Ci~nd . modify thc multiprocer s, where B i s a Uovlean expressloll (the guard) and s is a slate~nentlist (the command). The above do btatcment lnops forever. Wc will explair~otl~eiuser; of guarclcd coinmand~as they arise. As a second cxample, the followiilg server process t ~ s c sEuclid'r; algcl~~ilhrn compute the greatest common divtsor of two positivc Integers x and y : process GCD int; id, x, y ; do true - > Client[*]?args(id, x, y ) ; # input a # r e p e a t the following until x = = y d o x > y - > x = x - y ;
I1 x i y - > y = y - x ; od
7.6 Case Study: CSP
Client [ i d 1 ! result (x);
323
# return t h e r e s u l t
od 1
waits to I-eceive input on its args port from any one of' an array or' client process. The client that succecds in malcliing with tlio input xtiltement also sends its idcntjty. G c D then cornputcs lhe answer and vcnda 11 bclck to thc client's r e s u l t porl. In the inner do loop, if x r y thcil y i s subtractcd rrnni x: if x < y then x is subtractcd horn y. The loop tcrmrnaies when neither or*these cot~diliunsis rue-narncly, when x == y. A client client [i] communicates with GCD by execut~ng GCD
H e ~ ethe port nalncs are ]lot actually necdcd; however, thcy l ~ c l pindicate thc role ot' each cl~nnncl.
7.6.2 Guarded Cornmunication Input and outpul statements enable processes to communicaie. By themsclvcs. however. they are somewhat limited, becausc hvlh are blocking statctncnis. Olien a process wishes to communicate with morc than one other processperhaps ovcr dll'rerenl port>-yet does no[ know the order in which he nther processes inlghl wish Lo co~nn-~unica~e w ~ t hit. For example, consider exiend~ngthe copy process above so that up to, say. 10 charac~ersare bu rl'erecf. IT more than 1 hut fewer than 10 characterc are br~ffered,Copy can either itlput another character from W e s t or o u t p ~ another ~t character to E a s t . However, Copy cannot know which of west or East will next reach a matching statement. Guarded complzunication stutrmenfs support noiidetcrrninistic comrnunicalion. These corn bine aspccts of condi tinnal and cnrnrn~uiicationxtalemenss and have the fonn
Here B i s a Roolean expression, c is a cotnmur~icatioristatemcnl. and s \i a stalement list. The Roolean expression can bc omitrcd. in which case i l has [he implicit value true. 'Ibgetlier, B ancl c comprise whnl is called the guni-ri. A guard s/acr.eetJr if a is lruc ancl cxeculing c would no[ cause a delay-~.e., home other process is waiting 31 a matching communicalicm statement. A guatd.fails if B is Talse. A guard h1ork.c if B i~ true hut c cannot yet be executed.
324
Chapter 7
Message Passing
Guarded communication statemenls appear within i f and do slaleinents. For exatnple, consider if B,; [ I B,; fi
C,
- > S,;
C,
- > S2;
This statement has two arms-each of which i s a guarded communication statement. You can cxecute it as follows. Fir% evaluate the Boolean expressions in the guards. lr both guards fail, then the i f statement terminates with no effect; if at lcast one guard succeeds, then choose one of then1 (nondeterminislically); if bolh guards block, Lhen wait until one of the guards succeeds. Second, aCter choosing a guard that succeeds, execute the communication statement c in the cllosen guard. Third, execute the corresponding statement s. At this point the i f statement terminates. A do statement irj executed in a sitnilar way. The difference is that the above election process is repeated until all guards fail. The examples below will (hopefully) make all t h i s clear. As a simple tllustration of guarded communication, we can reprogram the previous ver~ionof the copy process as follows: process Copy char c; do West?c
F
Eaat!c; od
1
Here we have ~ntlvedthe input statement inlo lhc guard of the do stale~iien(. Since the guard does not contain a Boolean expression. it will never (ail. and hcnce the do statement will loop forever. As a second example, me can extend the above process 50 rhat i t buffers up to two characters ~nsteadnf ,ius1 one. Initially, copy has lo i11p11tone character fi-om West. But once il holds one ch;ifi~ctel; it could either output that chatacter to East or inpi11ailother character frotn west. It has no way of knowing which choice to make. so it can use gu'lrded comtnunication as follows: process Copy { char cl, c2; Weat?cl; da W e s t ? c 2 - > East!cl; cl = c2; [ I East!cl - > West?cl; od 1
7-6 Case Study: CSP
325
At the slart of evely iteration of do, one character is buffered in cl. The first arm of do waits for a second character from West, output< the first character to East, then assigns c2 to cl. thus getting the process back lo the starting stt~tcfor thc loop. The second arm of do outputs the first character to E a s t , thcn inputs another character from West. If both guards of do succeed-i.e., if West is ready to send another character and E a s t ix ready to accept one-either one can be chosen. Sincc neither guard ever fails, the do loop never lerrninates. We can readily generalize the above prosram to a bounded buffer by implementing a queuc inside the Copy process. For example, the following buffers up
to aa characters: process Copy { char bufferll0l; int front = 0, rear = 0, count = 0; do count c 10; West?buffer[reas] - > c o u n t = count+l; rear = ( r e a r + l ) mod 10; [ j count > 0; East!buffer[front] - r count = count-1; front = (front+l) mod 10; od
3
This version of copy again e~nploysa do slaterncnl with two arms. l ~ u now l the guards conlain Boolean expressions as bell as ccornmut~lcationsraternents. The guard in he firs1 arm succeed'; if Lhere is I+oomIn the bufrer and West i s r ~ a d yto oulput a character; the guard jn the second arm succccds if thc buffer contains a character and East i5 ready to input it. Again the do loop never tcrrnrnatcs, in thic case bccause at least onc of the B~trleanconditions is always true. Another example wtll il1u';tratc lhe usc c ~ frnulliple ports in a hervcr process. In pa~ticulal-.consider the resource alloc;~torprogrammed in Figurc 7.7 using asynchronous message passing. The Allocator: procexs can he programmecl in CSP as lollows: process Allocator ( int avail = MAXUNITS; s e t units = initial values; int index, unitid; do avail > 0 ; Client[*l?acquire(index) - > avail--; remove(units, unitid}; client [index] !reply( u n i t f d ) ; [ ] Client [ * 3 ?release [index, unitid) - > avail++; insert (units,u n i t i d ) : od
1
326
Chapter 7
Message Passing
T h i h prucesx i x rn~lrhinore concise than the A l l o c a t o r in Figure 7.7. In particular, we do not need to merge acquire and release messages on Lhe same chani~cl;instead we use diffcrcnt ports and a da stalcmerlt with one ann i'or each port. Moreover, we can delay receiving an a c w i r e message vntil there are available ~tnils:this does away with ille need to save pendins requests. However. these dilferences are due lo Ihe use of guarded communic;ition; they are not due to thc diffei.enccs belween asynchronous and synchronous inessage pascing. In fact, il i ~ possible ; to uL;e guarded communication with asynchrotlous nlescagc p;l\xing, as we shall see in Section K 3. AS a final example, suppose wc have two processes that wan( ro exchange thc vnliica of two local variablcs-thc problem hvc cxainined al the end of Section 7.5. Thcrc wc had to usc an asymmclric solu~ionLo avoid de;ldluck. By using guartlccl conlmunicalion stalemenls, we can program a xyrnlnetric exchange a
PZ?valueZ; 1 3 P2?value2 - > PZ!valuel;
Ei 1 process P2 I i n t valuel, value2 = 2; i f Pl!value2 - > Pl?valwel; [ I Pl?valuel - > Pl!valueZ;
fi
1
This einploys the nondetermi nil;tic choice that occurs when there is more than nrie pair of rrlatching guards. Thc cornrnnnication stateinenl in lhe fil'st guard in P I matches the communication statcrnent in the fecond guard in ~ 2 t~nd , the olher pair ol' comtnunicalion slatemenls also match. Hence. either pair can be ckascn. The soludon is symmetric, hut i( is still no1 ne21rly as simple as a sym metric sol ~ ~ t i c tusing n asynchronous message passing.
7.6.3 Example: The Sieve of Eratosthenes Onc of lhc cxamples in Hoare's 1978 paper on CSP was w interesting parallel program [or generaling primes. We develop Ihe xolution here 130th as a complete exainple 01' il CSP prognun and a< another example of the use of a pipelinc of procesws to paraIlelize ;I secluential algorithm.
7.6 Case Study: CSP
327
The xieve of Eratosthenes-named aftcr the Greek mathematician who dcveloped ~t-is a classic algorithm for deter~niningwhich numbers in a given range are prime. Suppose we want to generate thc primec; betwccn 2 and n. First, writc down a list with all the numbers:
Starting wtth the first uncrossed-out number in the list. 2, go through the lisl and cross out multiples of that number. If n is odd. this yieIds the list:
At thrc pomt, crossed-out numbers are not prime; tlie other nurnbcrs are st111candrdates for being prime. Now mtne to the next trncroised-out n ~ ~ ~ n ihn ethc r li~l. 3, and repeal the abovc process by cross~ngoul rnultlplcs of 3 . TF we continue this process un tll cvcry numbei has been eon5idercd. the trn~rocwd-outnurnherc in he final list will be all (he prirncs between 2 and n. In eccence the sieve catchex the prlmcs and lets lnultiples of (he prirncs fall Ihmugh. Now coilsider how we might pal-allelize th15 algm irhm. Olic poh\ihility is lu as5ign a d~ffttlentprocess lo each nur~lberm tlie list and lience to have each i n parallel cross orrt multiples ol its number. Hornever, this approach has rwo problems First, sincc processes can commun~c;iteonly by euchanglng mcss;iges, wc would have to givc euch procecs a privale copy of all the ~ ~ u m b e r and c , we would Irake 10 u s e anorhcx proces.: to c o i ~ ~ b i nthe e resultr. Sccond. wc woulrI u s e way Loo Inany pl-occqses. We only nced as many procccses as there are prime\, but wc do not know in advance urhrch numberh are primc! Wc can ollercorne both problems by parallel~z~ng the sicve ol Eratosthcnes u h i n ~I? pipeline of' filter processes. Each filter receives a strcan-r ol numhu s from its preilecehxor dnd 5ends a stream of nutnher\ to it? Fuccesxor Thc first number ~h&it ;I filter reccrvcs 1s (lie next lasgcsl prime; it P R S ~ C Son t o its ';ucccssi>r all n u m her.; that arc tlol multiples uf 11ic lir\t. Tlic aclunl program is sllown In F~gurc7.15. The f i r q l process. sieve 11 1. \ends all the odd nu~nhcrfto s i e v e [ a ] . h v e ~ yother process re~eivesa stsc;un of nurl~bclsl'rorn its pre cce%or The fir-st nulnber p that placei\ Sieve [i] receives IS the it11 prime. Each sieve [i] subsec~uentlypasscs 011 ,111 othcr numher5 it rcceives (hilt are not multiples nf itc pri~nep. Thc tntsl number L or Srcvi. proccsscs niusl be Iargc cnough to guarantcc lhal ;LII pnlncs up lo n are gcnoraled. FOI example, thcrc are 2 5 prlmcs less ban 100; the percentage ilecreaser; for increasing valucs ol'n Proccss sieve113 terminate< when it runs out of odd nu~nbersto \end to sieve [ Z I . Every other proccs5 eventually bloclca w;iltir-lg Ibr more ~npulfrom
P
328
Chapter 7
Message Passing process Sieve [l] { int p = 2 ; for [i = 3 to n by 23 Sieve [ 2 ] ! i; # pass odd numbers to S i e v e [ a ] 1 process S i e v e [i = 2 to L3 ( int p, next; Sieve [i-11?p; da Sieweci-l]?next - > if (nexk m o d p ) ! = 0 - > Sieve[i+ll !next;
# p is a p r i m e # receive next candidate # if it might be prime, # pass it on
fi
od
3 Figure 7.15
Sieve of Eratosthenes in CSP.
Its pl-t3deces?or. Whcn thc program stops, thc values ut p i n thc procesr;es are thc primes. Wc could casily modify the progl-am to terminatc nonnally and to prinl the prirncs in ascending order by using u sentinel to nark the end of thc list of nurnhcrs. When a process sces (he sentinel (a zcro for example), it prints its valuc oCp, sends the sentincl to i t s successor. and then quits.
7.6.4 Occam and Modern CSP The Communicaling Secluerltial Processes notatinn described above has been widely used in papers but CSP ilrelf has ncvcr been implcmcnted. However, scvcral languages based on CSP have bccn implemented and used. Occam is [he beyt known nf 11ie.ce. It extcnds the CSP notation to use global channels rather than dilcct naming, provides procedures and Iunctionx, ancl allows nested parallelism. Hoarc and others went on to develop a forrnal languitge that is used to ciesc ti be and reason i~houtconcnrrcnl communicating syslcins. The formal language waq influenced ;by ideac in 111c original CSP notation-and it too is named CSP-but 11 is not a language that is used to write applicalion programs. Rather i t i s uscd to model their behavior. Eelow we rurn tnarize the attrihuter of Occarn and modcrn CSP and give a few examples to illustrate h e flavor of cach. The HistoricaI Notes itt the end ol'
this chapter dcscribc source:, of detailcd information, including a I~bral->I of J a k a pac kagcs tkdt h u ~ ~ cthe ~ r Occ.ctnl/CSP t programming n~odel.
Occam contains a vcry small number of mechanisms (hence she narnc. which
comes from Occam's razor) and hils a unique syntax. The fi~stkersion of Occarr~ was dcsiglled in the mid-1980s; Occam 2 appeared in 1987; a proposal is now underway for an Occam 3. AIthough Occam is a language in it? own right, ii was developed in concert with the trunsputer, an early muIticornputer, ancl wac; essenlially rhe mach~nelanguagc of the Lransputer. The basic units of an Occam program are dcdaratiuns and three pritnilivc "processes": arsignment, input, and output. An assignrnenl process I S cimply an assignment statement. The inprll and output procexscs are himilar to the input and output commands of CSP. However, channcls have names and are global to processes; moreover, each channel must havc exactly one sendcr and one receiver. Pnniitive processer; are combined into conventional processes using what Occam calls constructors. These i n ~ l u d execluentinl construclors, a parallel constructor similar to the co stalemelit, xnd a guarded cnmmunic;ition ~tatement.In Occam's distinctive syntax, each primitive proccss, constructor, and dcclnration occupies a line by itsell': declaratiuns have a trailing colon; and the l;ingi~i~gc irnpoxcs all indzntatjun c u n v e ~ ~ l i u r ~ .
An Occam program contains a static 1111rnber01' processes and stlicic cornriiunication pathh. Procedures and fuiictions provide the only form of modulasiza~ion.They are esqentialIy puarncteri~edprocesses and call share only chdn11eIs and constants. Occam does not support recursion or any form ol' dynamic creation or naming. This makes m;my algorithms harder- to program. However, jr ensures that an Occam compiler can determine cxaclly how many processes a prograin conlains and how they co~ninunicalewith each other. This, and the fact that drfferent constructors cannot sharc variables. makes it possible for a compiler lo assign processes and data to processors on a distributed-mcrnoiy rnachlne such as a transpuler. In mosi languilge accept slale~nent;additiorlal starernenls;
or or
... when B, = > accept statement; additional sraternenls:
end select;
Each line i s called an alteunntive, Each B~ is a Boolean expression; the when clauses are oplional. An alternative is said lo be open if B, is true 01' the when clause is omitted. This form of selective wait delays the executing process until the accept statement in some open alternative can be execulcd-i.e., there is a pending in\location of [he entry named in the accept statement. Since each guard B, precedes an accept statement. the guard cannot reference the parameters of an entry call. Also, Ada does not provide scheduling expressions. As we discussed in Sectiorl 8.2 and will see in the exa~nplcsin the next two sections, this complicares solving many synchronization and scheduling problems. A selective wait statement can contain an optional e l s e alternative, which i s aelecred if no other alternative can be. Tn place of an accept stalement, the progralnmer can also use a delay statement or a terminate allernalive. An open alternarivz with a delay slatemen[ is selccled i T the delay ~nte~.\)al has tl.;lnspirecl; this provicles a tilneour mechanism. The terminate alternative i s selzcled if all rhe tabks that rendezvous with this one have lerlninated or are hemsel elves wailing at a terminate altel-native(see the cxnniple i n Figure 8.19). A conditional entry call is used il one task wants to poll anolhcr. Tt has lie form select enlq call; additional statements; else ctarements: end select;
The entry call is selected if il call be executed immedialely; otherwise, the e l s e al renative is selected. A timed entry call is used il' a calling task wants to wait at most a certain inlerval of lime. Its I'orni is similar ro ~liatof a conditional entry call: select enlry call; ;~dditionalstatements; delay statement; additional stalemenls; end s e l e c t ;
or
-
-
8.6 Case Study: Ada
401
[n ~ l i i scnac: the entry call i s selected if i~ can be exccu~.ctlbefoi-e r l ~ cdelay inlcrv;11 expires. Ada 83-ancl hence Ada 95-pl-ovides a Feiv additional mech;i~lisnjs I'nr concurrent prognlmming. Taslts can sllarc variables; howr\le~; clle)~ cannor a s u m e that ~liesevarinb!cs ai-e updated except at spncIi~-oniz;~tion points (e.g., rendezvous slalerne~~tc).The a b o r t slatenienl allows one task LO ter~nirlart
I I
nothe her. l'here is a ~nech:inismfor setling the priority of u task. Finally, t1iei.e are sa-c;~Ileclattributes that enable one Lo tlel.enni~lewl'lellie~.a task is callable or h : ~ ter~ninatetlor lo determine rhe numbel-of pending invocalions o f an enlry. L
8.6.3 Protected Types Ada 95 enhances the concurrent progl-a~nmingmechanisms of Ada 83 in scvcral w21y.s. The two inost significant are prolected types, which support spnclironi7ed access to .sll;lled diilkl, and the r e F e u e sti~teinent.which supports scheduling ;~ndsynchrunizi~tion[ha( depends on t l ~ carguments of calls. A prolected type encapsul~ttessharcd data arid svnchroni-ccs access lo i ~ . Eilc1-1inslance o f a protected type is si~nilal-LO a n~onilol.. The ~ p c c i l i c ; ~ t i opart ~i o f a pl-olected type has the l i ) r ~ n
Y
D
I
d L-
,h 1C
1
protected type Name is function. procedure, or ent1.y declaralions; private variable declara~ions; end Name;
I
,II
nis re 1). he
T11e body has rhe form protected body Name is Ciunclion! proceclnre. or entry bodies: end
Name;
I
se
ii n
Prntectetl fi~nctions111-ovide rcad-only acces5 to the private va~.ii~bles: hence, such a Furlction can be called simullaneously by mulliple laslts. P1,oteclecl proccdu~-cs pvtav\E\e exc\uii~ereadlw\.itt access to the p.c'\\fi\(c vaY'\a\o\es. P~atecLeAc\-\ttr'\esaye like protected procedures except h a t they also have a when c\iluse that spcciiies a Boo\ean synch~.onizaLioncondition. A\ most one task al a lime can be cxccnling a protected procedure or entry. A call o f a pro~ected enlry delays until the syncl11-onizationco~lditionis true urr~lthe citller can have cxclusivc access condition callnor. however, deycnd to the private variables; the s)lt)ch~.o~lizalion on rhe parameters of a call. 1
402
Chapter 8
RPC and Rendezvous
C;II Is of prorected procedures and enlrics are sel-viccd i n FJFO order, subject Lo synchronization conclitions on entries. The requeue state~nerllcan bc used wirllin rhe body ol' ;i 111-oleclcdprocedure or entry to defer completion o f tlic call that is being serviced. (It can also be used with.in the body o-f an accept slatement.) This slalell?ei?r has the form requeue Opname; where Opname i s the 11a111eof a11entry or prolectecl procedure [hat either lias n o patanictcrs or has thc same parameters as the opctarior~ being serviced. The ef"l'ec( of requeue i s to place the call being serviced on Lhc clueue for Opname. juhl 21s iTLhc calling tad< had directly called Opname. A s an example of the use OF a protected type ant1 requeue, Figure 8.17 conr:iins Ada code Tor ail N - ~ s lcounter i barrier. We assume that N i s a global constant. An instance of the barrier is declared and used as followh:
protected type Barrier i s procedure Arrive; private entry Go; - - used to delay early arrivals count : Integer := 0; - - number who have arrived time-to-leave : Boolean := False; end B a r r i e r ;
protected body Barrier is procedure Arrive is begin
count
: = count+l;
i f count
c
N then
-- wait for others to arrive else count := count-1; time-to-leave : - True; end if; requeue Go;
end ; entry Go when time-to-leave is begin count : = count-1; if count = 0 then time-to-leave := False; end i f ;
end ; end Barrier;
Figure 8.17
Barrier synchronization in Ada.
8.6 Case Study: Ada
B
:
Barrier;
403
- - declaration of a barrier
-.. B. Arrive ;
--
or "call B-Arrive;"
The lirsl N-1tasks to arrive at [he bavrirr incrcrncllt c o u n t and then delay by bcinp rcqucuecl on privacc ellrry GO. The last task Lo arl-ive ;~t tlie barrier sets time-to-leave lo True; this allows the req~~eued calls that are waiting 011 GO 10 awaken-one at a linie. Each task decrements c o u n t bel'o1.c: ~ c t l ~ r n i r CI.OI~ ig the barrier: (he lasl one resets time-to-leave so (lie harrier can be used agaiii. l task dclayrcl on G O w i l l (The s e ~ n a n t i cOF ~ prolecled Lypzx ensure t h ~ every execute before ilny further call o f Arrive irl serviced.) The i.ea(ler lnighl iirlcl i~ inslruclive Lo co~np;ll-e ~ l i i x1x11.rier10 the one i n Figi11.e 5.18, which is progl-anl~nedu ~ i n g the Pthreads 1ibra1.y.
8.6.4 Example: The Dining Philosophers This scction presents a colnplete Acla pl-ogaln i r Lhe d i n i ~ l gpliilosopllers ~probas lein (Section 4.3). l'hc prograin illusrl-ates thc ucc 01' Laxlts and rcndezvo~~s well as several general features oC Ada. For convenie~lcein the prograrn, wc assume Lhe existence 01' two 1-'unctions, left ( i ) ilncl right (i) lllal rcturn rlic indices of tlie leF1 and right neigllhor of pli ilosoghe~.i. Figure 8.18 conlains llie main p~~ocedu~.c Dining-Philosophers. Beibl-e lhc p~.ocedurcarc with and use clauscs. The with clause says his procedure depends on the objec~sin package Ada.Text-10. The use clause makes the names of the objecls expol.Led by thal package direclly viqible (i.e.. so they tlo not havc Lo be qua1iiiecl by the package name). 'The lnain procedure first declarcs ID Lo be ;III in~egerbetween 1 and 5 . The specilicatiun of Llic Waiter task clecI;u-es two entries--pickup iui(l putdown tIi;it are called by ;I pliilo~oplierw l i e ~ i t wnnls to pick LIP or put clow11 its forks. 'The body o f w a i t e r is scpari~~ely con~piled;it i s givcn in Iziglt~-c 8.19. Philosopher has bccn speciticd as a lnslc lypc so (hat wc can tlecliu-e ;on an-ay DP ol' 5 such tasks. The instiuices oI' llie live philosophers are cl.eated when the declaralion ol' al-rl~yDP is elahoralccl. Each philoxopher first waits l o accept a call or its inilialization entry i n i t , then i t executes rounds iterai~rions.Variable rounds is dec\a~edg1ob.d to \he bodies or the phi\osophers s o wch can rcad it. The body o f tlie tnain procedure a1 tlic erld of Figure 8.18 initialiscs rounds by rcnding an input value. 'Then it passes each philosopher its intlcx by callillg DP(j) .init(]). Figure 8.19 contains Lhc body or the waiter task. 11 is more ccmplicatrd than the waiter process i n Figul-e 8.6, because the when conditions in the Ada
404
Chapter 8
RPC and Rendezvous w i t h Ada.Text-10; use Ada.Text-10; procedure Dining-Philosophers is subtype ID is Integer range 1..5;
task waiter is -- Waiter spec entry Pickup(1 : in ID); entry Putdown(1 : in ID);
end task body Waiter is separate;
task type Philosopher is entry init(who : ID); end;
--
Philosopher spec
DP : array(1D) of Philosopher; rounds : Integer:
- - the philosophers - - number of rounds
-- Philosopher body myid : ID; begin accept init(who); myid := who; end; for j in 1. .rounds loop - - "think" Waiter. Pickup (myid); - - p i c k forks up task body Philosopher is
--
"eat"
Waiter.Putdown(myid); end loop ; end Philosopher;
--
put forks down
begin - - read in rounds, then start the philosophers Get(r0unds); for j in I D loop DP(j) .init(j); end loop; end Dining-Philosophers;
Figure 8.18
Dining philosophers in Ada: Main program.
select stacemen1 canno1 reference entry paralneters. Each waiter repeatedly accepls calls of Pickup and Putdown. When Pickup is accepted, lhe waiter checks LO see if either of the calling philosopher's neighbors is eating. If not, then philosopher i can eat now. I-Iowever. if a neighbor is eating, the call of Pickup has lo be requcued so ihar the philosopher does not get reawakened too
8.6 Case Study: Ada
I
405
separate (Dining-Philosophers) task body Waiter is entry Wait (ID); - - used to requeue philosophers eating : array (ID) of Boolean; - - who is eating want : array (ID) of Boolean; - - who wants to eat go : array(1D) of Boolean; - - who can go now begin for j in ID loop - - initialize the arrays eating(j) : = False; want(j) : = False; end loop; loop -- basic server loop select accept Pickup(i : in ID) do - - DP(i) needs f o r k s if not(eating(left(i)) or eating(right(i1)) then eating(i) : = True; else want ( i) := True; requeue Wait ( i ) ; end if; end ;
I
1 I
or accept Putdown(i : in ID) do eating(i) := F a l s e ;
-- DP(i) is done
end; - - check neighbors to see if they can eat now if want(left(i1) and not eating(left(left(i))) then accept Wait(left(i)); eating(1eft ( i ) ) := True; want (left(i)) : = False; end if; if want(right(i)) and not eating(right(right(i))) then accept Wait(right(i)); eating(right(i)) := True; want(right(i)) := False; end if;
1 or
!I
terminate; end s e l e c t ; end loop;
--
quit when philosophers have quit
end Waiter;
Figure 8.19
Dining ph\\osophers In Ada: Waiter task
I I
406
Chapter 8
RPC and Rendezvous
early. A local away of five entries, w a i t ( I D )is, uscd to delay philosophers who have lo wait: each philosophci- is requeued on a distinct element of rile ai'ray. After eating, a philosopl~ercalls Putdown. When a waiter accepts rhis call, i\checks ifeit\ler neighbor wants to eat and whether rh:u neighbor may do so. If SO. the waiter ;\ccepls a delayed call of W a i t to awaken a philosopher whose call oi P i c k u p had been rcqueuec\. T\le accept statemen\ that services Putdown could sun.ouncI the entire select aiternative-i.c., the end of accept c ~ u l dhe ;~l'terthe two i f statements. We have placed i t earlier; however. because cllerc is no need lo clclay Ihc philosopher who callecl Putdown.
8.7 Case Study: SR Synchronizing Resources (SR) was developed in thc 1980s. The First v e r s i n ~of~ lhc languagti i nrroduced rhc rendez\,ous mechanisms described in Seclion 8.2. The language rvolvz(1 to PI-ouiclethe ~nullipleprimilivcs clescribed in Section 8.3. SR supports bo~hshared-variable and dist~.ibu~ed programming, and it: can bt: used Lo implement di~cctlyalmost all the pi-og,-alns in this bcx~k. SK pl-ograms ciul execute on .sliarcd-tnemory inultiprocesso~~s and netwocks ot wol-kslillions,as wcll w on single-PI-ocessor~nucliincs. Although SR conlains il \~:u-ietyof mechanisms, they are based on a cmall nunibes ol'orlhogotial conccplb. Also, tlre sequential and con.curl.cnt ~nechanisms arc inlcgrated so Lhar similar tl~ingsare clone i n similn~.ways. Sections 8.2 and 8.3 inlroduced ant1 illustr;ued rnany aapecls of SR-wilhout actually saying so. This section su~n~na~.izes acldirional aspecls: program stl-ucture,dynamic creation and place~t~ent, and addilional slatelnenrs. As an example, we present a program t l l a t ai~nulares[he esccutiun of processes entcring and exiling cr.ilic~11 sections.
8.7.1 Resources and Globals A n SR program is co~npriscdul' resources and globals. A r.r.rourc:e declnn~tiorr specihes a pi~l-crrifor a n~oduleand has a str~~cture quite similar LO that ol'amodule: # specification part import clauses operation ant1 type decla~-a~.ioos body name (t'ol'rnals) # body variable and other local cIeclar:~tions
resource name
7 8.7 Case Study: SR
407
initialization code pl-ocedures and processes finalization code end name A yesource contains ilnpol-t clauses if i r makes use of the declarations expor.tec1 from other resources or from g.lobalh. Declarations and irlirializalion code in the body can be intermixed; this supports dynamic arrays and perri~its[lie programmer to conrrol the order i n which variables are initiali-ced and processes are created. Tf there is no specification part, i t can be omitted. 'Hie specification and
body can also be compiled separately. Resource insrances are created dynamically by means of the create starenient. For example, executing reap := create name
(~CIU~I~S)
passes ihe actuals (by value) to an new instance of resource name and [hen executes tbac resource's initialization code. When llie initialization code tenninates, a resource capuDili!y is returned and assigned lo variable rcap. This variable can subsequenlly be used to invoke operations expovred by the resource or to destroy Lhe instance. Resources can be destroyed dynamically be means of the destroy 6l;iternenl. Execi~tion of destroy stops ally activiry in tlie named resource, execules the finalization code (if any), and rhen frees storage allocated to the resource. By default, components or an SR program reside in one addrcss space. The create sratelnent can alho be used to creake additional address spaces, which are called virtual machines: vmcap : = create v m 0 on machine
This ,+,tatementcreates the virtual machine on dlc indicared hoat machine, then retul.nS a capability for it. Subsecluent resource creation statements can use "on vmcap" l o place a new resource in that address space. Thus ST?: ~~nlike Ada, gives the programmer complete control over how resources are napped to rnachines ancl this mapping can depend on input Lo the program. A global con-~poncrrti s used to declare types: variables, operations, and procedures that are shared by resources. It is essentially a single instance of a resource. One copy of a global is stored on every virtual machine rlial needs it. In particular, when a resource is created, any globals rhal il ilnports are created implicitly if they have not yet been created.
I
!
I
i I
408
Chapter 8
RPC and Rendezvous
A n SR progl-nlrl contains one distinguishcti 111ainI.esour.ce. Execution ol' a progi-aln bcgins with implicit crealion of one inslance of this resource. The iuitiitlixalian code in the main resource is lhen executed; it oCten cl-c:ites instances of other resources. An SR progl-arn teoninates cvhen evcry process has tenninatecl or is bloclced. or when a stop statement is executed. A1 (hat point, the tun-lime syscode ( i f any) in the nzain I-esou~.ce arlcl then the finillLe~nexecutes the linal~~ation i ~ a ~ i ocode n (if any) in globals. This provides a way for thc pl~ogritnlrllerro rcgajn control t o print rcsulls or timing inilol-malion. As a simple example, the following SR program writes lwo lines of output: resource silly() write("He1lo w o r l d . " ) final write("Goodbye w o r l d . " ) end end
'The resource is cl.ealcd au~ornatically. 11 WI-ilesa line, then rermiuales. Now [lie finalization code is cxccutcd; i t writes ;I second linc. The effect is the same as if f i n a l and the first end were deleted Fro~nthe program.
8.7.2 Communication and Synchronization The distinguishiog attribuie of SR is ils variety of communic>~tionand sy11chi.onization mech:u~isms. Processes in [lie same resource can share val-i21bles,as call resources in he same address space (rhl-oughthe use of globals). Processes can also communicate and synchroni~eusing all the pritnitives described in Scctiorl 8.3: semaphores, asynchronous lnessage passing, RPC, a~lclrende~voils.Thus SR call be used lo iniplemcnr concurrent programs I'or sllarcd mernory multip~.ocessors as well as for distribuled systems. Operaliol~sare declzued in op declarations, which ha\le rile form giver1 earlier in this chapter. Such declarations ciin appear in 1,esclusce specifications. in resoul-ce bodies, and cvcn u r i t l i j ~ processes. l An opcrntion declrued within a process is callecl a Ivcc~loperufion. The declal-~ngpl-occhs can pass a capability lor a local ope]-ation 10 another process, which can then inyoke the opcnttio~r. 'l'his supports conve~~sational cunli~~uily (Sect-io~i7.3). A n operalion is irlvoked using synchronous call or ;tsyncIironous send. To specify which opel.ation to invoke, an invocalion stateinent uses an operalion capabili~yor :I held of a resource capability. Mriihin illc resource thal dec1iu-e~it. rlle nalne of an operation is in fact a capability, s o an i~~\~oci~tion state~nentc;ui
8.7Case Study: SR
409
use it di~.ectly. Resource and operation capabili~iescan be passed belwecn resources, so communicalion paths can vary dynamically. An operation is serviced either by a procedure (proc) or by Input slatetnents (in). A new process i s created 10 service each rernote call of a proc: calls from within the same address space are opli~nizedso that the caller iiself execulcs the procedure body. All processes in a resourcc execute cot1cu1-rently,at leasr conceptually. The inpuc statement supports rendezvous. 11 has the I ~ O I - I ~ shown I in Section 8.2 and call have boll1 syochronization and rcheduliag expressions t h a ~depend on paratmeters. The input statement can also cont;~iilan optional else clai~se,which is selccted if no other guard succeeds. SR al\o containa scveral ~~iechanisms that are abbreviaiions for common uses of operaiions. A process declaration is an c~bbrevjationfor an op c1ecl~1l.aliol) and a proc to sel-vice invoc:~tiol~sof thc operalion. One instance 01' lhc p~,occ\sis clcacecl by an i~nplicilsend when t1.1e resource is c~.eated.(The IYOgrninlner can also cleclare alrays of processes.) A procedure clec1;~ralioni\ an abbreviation for an op decl:~ra~ionand a proc 10 service iil\!oc;~t.ionsof the operation. Two acldiriotial abbt.eviations are the receive statement and scmapho~~es. A receive abbi-cviates an inpul stateriienl that services one operation e~lrl(hat merely stores the argulnenrs in local variables. A selnaphore cleclaratjon (sem) ;ibb~.evialesrhe declnr-ntion OIa par;~rnelcrless operation. TIic P stalcmcnr is a special case of receive. and the v st.atemcnr is a special case of send. SR provides a f e ~additional ! statemenls that havc proven to bc usel'i~l.'The reply statemenl is a variation of rile return staternenr; it returns values, but Lhe replying pcocehs continues execulion. The forward slate~nentcan bc used 1.0 pass an invocaiio~lon to in CSenter(id) by id - > write("userW, id, "in its CS atM, age()) ni receive CSexit()
od end end resource main ( ) import CS var numusers, rounds: int getarg(1, numusers); getarg(2, rounds) process user(i : = 1 to numusers) fa j : = 1 to rounds - > call CSenter (i) # enter critical section nap(int(randorn(100)) # delay up to 100 msec send CSexitO # exit critical section nap(int(random(1000))) # delay up to 1 second a£ end end
Figure 8.20 An SR program to sirnulate critical sections.
in CSenter(id) by id - z write("useru, id, Lin its CS at", age()) ni 'This i s S l i ' s rcndorvous mechiunism. If there is rnorc than one invocation of C s e n t e r , thc onc thal has the s~nallestvalue for pal.alncter i d i s selecled, and a message is then printed. N c x l the arbitrator uses a receive slalelnenl (o wail [or. i ~ ni~~vocntion or Csexit. I n thia program we could have put the arbitrator process and its operations within the main resource. However, by placing lhern in it glohal component, they couJd also be used by other resources in a liugel- program.
Historical Notes
41 1
The main rcsource reacls Lwo cornmand-line afgurnenrs, then ir creales numusers instances of h e user process. Eacli process executes a (or-all loo11 (fa) in which it calls the csenter operation 10 get perniihsion Lo enler its crirical seclio~l,passing its ~ndexi as an argurucnl. We simulate the duration of critical and noncritical sections of' code by having each user process "nap" for a randorn number of milliseconds. After napping the process j.nvokes the csexit operation. The c s e n t e r operation milst be invoked by a synchronous call statement because the user process has Lo wail to get pennission. This ir\ enfol'ced by means of tl~coperation restriction (call} i n the declaration o f the Csenter operncion. flow eve^; since a user PI-ocessdoes not need to delay when leaving its critical secl.ion, i t i~lvolcesIhe csexit operation by means ol' the asynchi-onous send slaterncnt. The program c~nploysseveral of SR's predrfined f~rnctjonrs. l'he w r i t e cltalelnerlt prin1.s a line of ourpui, and getarg I-eadsa com~nand-lineargument. s number or milliscconcls Ihe The age lirnclion in tlie w r i t e stalclnetlt ~ c l u ~ nIhe progranl has been executing. The nap f~ui~tjon cause5 a procah LO "nap'' I'or ~ h c number- of milliseconds specifiecf by its argun1elit. 'l'he random f ~ ~ n c t ~r rot unr n h a psct~do-randomreal number bccween 0 and irs at.gurnent. We also use the i n t type-convetsion f~~nccion to conrerl t hc result from random 1.0 an integer. ;IS rccllrired by nap.
Historical Notes Both remote procedure call (RPC) and rcnclezvous origina~eclin ~ h clate 1970s. Early researcb o n ihc semantics, use, uld i~nplementationor RPC occurred in tlie operaling syst.elns comniunity and conrinues to be a n inierest of thal group. a1 Xcrox's Pdlo Alto Rzsearch Bruce Nelson did many of the early experi~~ienls Center (PARC) and wrore an excellcnl. disse~.tationon the topic [Nelson 19811. While purst~itlgliis Ph.D., Nelson also collaborated with Andrew Birrell ro prod ~ ~ what c e is now viewed as :i classic on how lo impleme~ltRI'C efficiently in an operating syslern kernel [Bin-elland Nelsoli 19841. Dow~ihe road from PARC at Stanli)rd University, Alfred Spctor [I9821 wrore a clisserlalior~on the selnantics and implementation of RPC. Per Brinch Hansen [I 9781 conceived the basic idea of RPC (allhough he did nor call it thac) and designed the fii-sl progra~nniiaglanguage babed 01.1 the concepl. His language is called Distributed Processes (Dl?).Processes jn DP call export procedures. When a procedure is called by another process, i t is execu~ed by a new threi~dof conlrol. A process can also hilve one "background" thread, which is executcd first and may continue ro loop. Threads in a process cxecutc
412
Chapter 8
RPC a n d Rendezvous
They synchronize using slwred variables and the when statement. which is similar lo an a w a i t statemen1 (Chapter 2). RPC has been incli~dedin sevci.;tl otlie~languages, such as Cedar, Eden, Emcrald, and Lynx. Additional languages bascd on 17PC include Argus, Aeolus, and A\l;~lon. The 1;1llei- rhree 1:tnguiiges combine RPC with what are called ~ltonlictl-crizsc~cl.ions.A transaction is a group of operations (procet1~11-es calls). It is atornic i f i r is both i~ldivisibleand recoverable. 11' a \ransaclion cnrrrrwits, all the operations appear to have been execulal exactly once each arid as an i~idivisible unit. If a tl-ansaction a h n r ~ .i~t ,has no visible effect. Atomic fransi~ctionaoriginated in the database community; they are used Lo program fault-tolerant distributed applications. S t a ~ ~ l oand s Gi fforcl [1990] present an interesting generalization of RPC called remote evaluation (REV). With RPC: a server module provides a fixecl set o f predefined secvices. With REV, a clienl can include a program aa an argument in a remole call; when the server reccjves the call, i r cxecutcs the program and then returns results. This allows a server to provide a11unli~nitcdset of ser\lices. Stamos and Gifford's papel- describes how REV can simplify rile design OF Inany disli.ibu~edsysrerns and describes Ihe deveJoperh' experiencc with n prolotype i~nplemenl;ition. Java apylets provide a similar kind of' lunctionali~y,allhough most corr~monlyon the client side; in parlicul~.an applet ib usually returned by a server and executed on rhe client's machine. Rendezvous was developed simultaneously and inclcpendcntly in 1978 by Jean-Raymond Abrial of' the Ada dcsig~iteam and by this book's ailtho]. uihe~i developing SR. The lerrn rendezvou~was coined by the Ada designers, many of whom are French. Concurrent C is another language based on rentlezvorrs [Gcbairi & Roome 1F)SG, 19891. I t exl.ends C wich processes, rent.lczvous using an accept stalement?and guarcled communication using a select scaternent. *. I 11e select statement is like that in Ada, but the accept slalemcnt ic Inore powet.ll~l.In particular, Concurrent C borrows Lwo ideas from SR: synchronization expressio~~s can re,Ference parameters and an accept statemen1 can contain a schduling expi-cssion (by clause). Concurrent C a150 alloc\/s opcralions to be invoked by c a l l or send,as in SR. Laler, Gehani and Roome [I9881 cleveloped Concurrenl C++, which co~ubinesConcurrent C with C++. Several languages include multiple primidves. SR is the besl known (see below for I-efelcnces). StarMod is an extel~sjonof Modula that supporls asynchronous message passing, RPC, rendezvous, and dynilnlic process creation. Lynx suppol-ts RPC rind rendezvous. A novel aspect of Lynx is thal it supports dynamic program reconliguralion and PI-oteclio~l with whar are ci~lled link^. The wtensive survey paper by Bal, Steiner, and Tanenbau~n[ 19891 conrai~u informaliou on and references to all the languages ~nentioneclabove. The antho)c\~itli murual exclusion.
Historical Notes
4 13
ogy by Gehnni and McGettrick [ 19881 contni~~s repl-il~tsOF key papers on several ol-'the lungi~ages(Adit, Argus, ConcurrenL C. UP, and Sli). compnralivt surveys. and assessments of Ada. Mosl distributed operating syslems implelnent iilc caches OII client workstations. The Iilc system outli~jediu Fipul-e 8.2 is essentially identical t o thal in Alnoeba. -Frc)grikln~ I I C I'ollowing aigori~h~ns.
(a) The centralizetl dining pliilo~ophersin Figul-e 8.6, (b) Ttre time server in Figure 8.7 ( c ) The shurt.esr,iob-11ex1allocat.or in Figure 8.8
5.8 Conhider the lollou~ingspecification 1301'a [xogl'arn Lo find the ~nirli~nclrn o l a set oF integers. C;iven is ikn array of pi-ocesses Min [ 1 :n I . Initially. each process has one integer value. The processes repeatedly il?rcracl,with each one trying to give :tno(he~the niinirnurn of the set o f valucs i t has seen. I f n process gives away i t s rnijiimum value, il terminates. Eventually. one process will be left, and i[ will know Lhe niinimutn of the or-iginal set. (a) Develop a p~'0gl-illllto solve this l~mblem using only the RPC primitives clelined i n Scctio~l8. I .
(b) Deve1o.p a progl-am Lo solve (his l>roblem using only the rendezvous 131-imirives del-incd in Sectio~i8.2.
418
Chapter 8
RPC and Rendezvous
(c) Develop a program Lo solve this problem using the multiple primitives clefined in Section 8.3. Your program should be as simple as you can rnake il. 8.9 The readerslwriters algorithm in Figure 8.1 3 gives readers preference. (a) Change the input starement in w r i t e r Lo give writers preference. (b) C h a n ~ ethe input staten~enlin writer so that readers and writers alternate turns when both want Lo access the database.
8.10 The Fileserver [nodule in Figure 8.15 uses c a l l to update remote copies. Suppose the c a l l of remote-write is I-eplaced by asynchronous sena. Will the solu~ionstill work? I F so: explain why. I F nol, explain what car) go wrong. 8.1 1 Suppose processes communicate using only the KPC mechanisms defined in Seclion 8.1. Processes within a lnodule synchl-onize using semaphores. Reprogn~m each of the following algoritl~rns. (a) The BoundedBuf f e r module in Figure 8.5.
(b) The Table module in Figure 8.6.
(c) The S m - ~ l l o c a t o rlnodule in Figure 8.8. (d) The KeadersWri ters module in Figure 8.13. (e) The Fileserver module i n Figure 8.15.
8.12 Develop a scrver process thal irnplen~enlsa reusable barrier for n worker processes. The server has one operation, arrive ( ) . A worker caJls arrive ( ) when it arrives at a barrier; the call terminares when all n processes have arriveti. Use the rendezvous primitives of Section 8.2 ro prograrn the server and workers. Assunie the existence of a function ?opname-as defined in Section 8.3-that relunls the number of pending calls of ogname.
8.13 Figure 7.15 presented an algorithm for checking prirnality using a sieve ol filter processes; i t was programmed using CSP's synchronous message passing. Figill-e 7.16 presenred a second algorithxn using a manager and a bag of tasks: it was programmed using Linda's tuple space. (a) Reprogram the algorith~nin Figure 7.15 using the rnultil~lepi-i~nitjvesdefined io Section 8.3. (b) Reprogram the algorithm in Figure 7. I 6 using the multiple primitives delined
in Section 8.3. (c) Conlpare the performance o i yoiu. answers to (a) and (b). I-low many mcssages get sent to check the primality of all odd numbers from 3 ro n? Count a
4 19
Exercises
sendfreceive pair as one messagc. C0unl a c a l l as two Inessages. even if tlo value is returned. 8.14 Tlre .Snvinjis Accounl Pmblen,. A savings accounc is shared by several pcoplc.
Each person may dcposit or withdraw l'unds (I-om [he account.. Thc currenr balance in Lhe accoutlt is die silln (SF all deposils lo date minus the sum of all withcll-awals to dace. The balance must never become negative.
Using ehr tnultiple primiti\~esnotalion, tlevelop a servcr to solve this prob.letn and sl~owthe client interl'ace lo Lhe server. Thc server exports two ol,ec.atio~ls: deposit (amount)and withdraw(amount). Assume (hat amount is positive and rhal withdraw milst delay until Lhere are suflicien~runds. 8.15 Two kinds of processes, A'S and ~ ' s ,ollter a I-ooni. An A process cannot leavc until it laeers two B processes, and a B process cllnnot leave until it ~rleetsone A
process. Each kind of process leaves the rooni--wil-houI meeting any orher pronu~iibel-of other processes. cebses-once it has mcl thc reqi~i~.ed
(a) Develop a server process lo implement this synchronization. Show c hA 1111er~ fact of rl~eA and B processes Lo [lie server. Usc [he mulljple primitives nolalion defined in Section 8.3. '
(b) Modify your answer to (a) so t11iI1 he first of tbc two B processes [hat meets an A process does not leave the Inom until alkr Lhe A process meets a second B process. 8.16 Supposc a computer center has two printers, A and B, that arc siinila~but not iclenrica). Three kinds of client processes use the printers: those rllat must use A. tltosc rhnt must use B, ;und those that can use eillier A or B. Using Lhe mulliplc primitives notation, develop code 111~1each kind of client cxecutes lo rquesc and release a printcr, and develop a server process to ;~llocatr Lhe prjr~cers. Your solutjon should be [air, assunli~lga clielil using a printer eventually releases it. 8.17 The 1 1. Sincc rllerc they must li~iisli going is only one (rack. cars cannot pass each other-it.,
420
Chapter 8
RPC and Rendezvous
around the track i n the order ~ I Jwhich they started. Again. a car can go around the tracks only when it is full.
8.18 Stable Murviage Proble~n. Let Man and woman each be arrays of n processes. Each man ranks the women from 1 to n, and each woman ranks the men from 3. to n. A pairing is a one-to-one cooespondence of men and women. A pairing is sral~leif, for rwo men rn, and m, and their paired wotnen w, and w,, both of h e following conditions are satisfied: ranks w, I~igherthan w,
01-w,
ranks m, higher lhan m,; ancl
m, ranks w, higher than w,
01. w,
ranks m, higher th:ul m,.
m,
Expressed differently, a pairing is ut~stableif w man and woman would both prefer each other to their ciurenr pairing. A solution to the stable marriage problem is a set of n pairings. all ol'which are stable. (a) Using the multiple prirnjtives notation, write a program to solve the srable marriage problem. (bj The srable roon-unates problem is a generalization of the stable marriage problem. In particular, there are 2 n people. Each person has a ranked lisl of preferences for a roommare. A solurjon to ihe roommates problem is a set of n pairings, all of which are stable in the same sense as lor the marriage problem. Using the multiple pritnitives notation, write n program Lo solve Lhe stable I-oom-
Inales problem.
8.19 The Fileserver module in Figure 8.15 uses one lock per copy of the file. Modify h e program to use weighted voting, as defined at the end of Section 8.4.
8.20 Figure 8.15 shows how to implement replicated files using the multiple primitives notation defined in Section 8.3.

(a) Write a Java program to solve the same problem. Use RMI and synchronized methods. Experiment with your program, placing different file servers on different machines in a network. Write a short report describing your program and experiments and what you have learned.

(b) Write an Ada program to solve the same problem. You will have to do research to figure out how to distribute the file servers onto different machines. Experiment with your program. Write a short report describing your program and experiments and what you have learned.

(c) Write an SR program to solve the same problem. Experiment with your program, placing different file servers on different machines in a network. Write a short report describing your program and experiments and what you have learned.
8.21 Experiment with the Java program for a remote database in Figure 8.16. Try running the program and seeing what happens. Modify the program to have multiple clients. Modify the program to have a more realistic database (at least with operations that take longer). Write a brief report describing what you have tried and what you have learned.

8.22 Figure 8.16 contains a Java program that implements a simple remote database. Rewrite the program in Ada or in SR, then experiment with your program. For example, add multiple clients and make the database more realistic (at least with operations that take longer). Write a brief report describing how you implemented the program in Ada or SR, what experiments you tried, and what you learned.
8.23 Figures 8.18 and 8.19 contain an Ada program that implements a simulation of the dining philosophers problem.

(a) Get the program to run, then experiment with it. For example, have the philosophers sleep for random periods when they are thinking and eating, and try different numbers of rounds. Write a brief report describing what you tried and what you learned.

(b) Rewrite the program in Java or in SR, then experiment with your program. For example, have the philosophers sleep for random periods when they are thinking and eating, and try different numbers of rounds. Write a brief report describing how you implemented the program in Java or SR, what experiments you tried, and what you learned.
8.24 Figure 8.20 contains an SR program that simulates a solution to the critical section problem.

(a) Get the program to run, then experiment with it. For example, modify the two delay intervals or change the scheduling priority. Write a brief report describing what you tried and what you learned.

(b) Rewrite the program in Java or Ada, then experiment with your program. For example, modify the two delay intervals or change the scheduling priority. Write a brief report describing how you implemented the program in Java or Ada, what experiments you tried, and what you learned.
8.25 Exercise 7.26 describes several parallel and distributed programming projects. Pick one of those or something similar, then design and implement a solution using Java, Ada, SR, or a subroutine library that supports RPC or rendezvous. When you have finished, write a report describing your problem and solution, and demonstrate how your program works.
Paradigms for Process Interaction
As we have seen several times, there are three basic process-interaction patterns: producer/consumer, client/server, and interacting peers. Chapter 7 showed how to program these patterns using message passing; Chapter 8 showed how to program them using RPC and rendezvous. The three basic patterns can also be combined in numerous ways. This chapter describes several of these larger process-interaction patterns and illustrates their use. Each pattern is a paradigm (model) for process interaction. In particular, each paradigm has a unique structure that can be used to solve many different problems. The paradigms we describe are as follows:

manager/workers, which represent a distributed implementation of a bag of tasks;

heartbeat algorithms, in which processes periodically exchange information using a send then receive interaction;

pipeline algorithms, in which information flows from one process to another using a receive then send interaction;

probes (sends) and echoes (receives), which disseminate and gather information in trees and graphs;

broadcast algorithms, in which a message is sent to all processes, and which are used here to implement decentralized synchronization such as distributed semaphores;

token-passing algorithms, which are another approach to decentralized decision making; and

replicated server processes, which manage multiple instances of resources such as files.
The first three paradigms are commonly used in parallel computations; the other four arise in distributed systems. We illustrate how the paradigms can be used to solve a variety of problems, including sparse matrix multiplication, image processing, distributed matrix multiplication, computing the topology of a network, distributed mutual exclusion, distributed termination detection, and decentralized dining philosophers. We later use the three parallel computing paradigms to solve the scientific computing problems in Chapter 11. Additional applications are described in the exercises, including sorting and the traveling salesman problem.
9.1 Manager/Workers (Distributed Bag of Tasks)

We introduced the bag-of-tasks paradigm in Section 3.6 and showed how to implement it using shared variables for communication and synchronization. Recall the basic idea: several worker processes share a bag that contains independent tasks. Each worker repeatedly removes a task from the bag, executes it, and possibly generates one or more new tasks that it puts into the bag. The benefits of this approach to implementing a parallel computation are that it is easy to vary the number of workers and it is relatively easy to ensure that each does about the same amount of work.

Here we show how to implement the bag-of-tasks paradigm using message passing rather than shared variables. This is done by employing a manager process to implement the bag: hand out tasks, collect results, and detect termination. The workers get tasks and deposit results by communicating with the manager. Thus the manager is essentially a server process and the workers are clients. The first example below shows how to multiply sparse matrices (matrices for which most entries are zero). The second example revisits the quadrature problem and uses a combination of static and adaptive integration intervals. In both examples, the total number of tasks is fixed, but the amount of work per task is variable.
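Before turning to the sparse-matrix example, the shape of the paradigm itself can be sketched in ordinary Python, a minimal illustration only, not the book's notation: the thread and queue names (task_q, result_q, NUM_WORKERS) are our inventions, and a real distributed version would replace the queues with channels or rendezvous.

# Minimal manager/workers (bag of tasks) sketch using Python threads and queues.
import threading, queue

NUM_WORKERS = 4
task_q = queue.Queue()     # the "bag": the manager puts tasks here
result_q = queue.Queue()   # workers return results here

def worker():
    while True:
        task = task_q.get()
        if task is None:           # sentinel: no more work
            break
        result_q.put(task * task)  # stand-in for the real computation

def manager(tasks):
    for t in tasks:                # hand out all tasks
        task_q.put(t)
    for _ in range(NUM_WORKERS):   # then tell each worker to stop
        task_q.put(None)
    return [result_q.get() for _ in tasks]   # collect one result per task

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads: t.start()
print(manager(range(10)))          # squares of 0..9, in arbitrary order
for t in threads: t.join()

The sketch shows the two bookkeeping duties the text attributes to the manager: handing out tasks and knowing when all results have come back.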
9.1.1 Sparse Matrix Multiplication

Assume that A and B are n × n matrices. The goal, as before, is to compute the matrix product A × B = C. This requires computing n² inner products. Each inner product is the sum (plus reduction) of the pairwise products of two vectors of length n. A matrix is said to be dense if most entries are nonzero and sparse if most entries are zero. If A and B are dense matrices, then C will also be dense
(unless there is considerable cancellation in the inner products). On the other hand, if either A or B is sparse, then C will also be sparse. This is because each zero in A or B will lead to a zero in n of the vector products and hence will not contribute to the result of n of the inner products. For example, if some row of A contains all zeros, then the corresponding row of C will also contain all zeros. Sparse matrices arise in many contexts, such as numerical approximations to partial differential equations and large systems of linear equations. A tridiagonal matrix is an example of a sparse matrix; it has zeros everywhere except on the main diagonal and the two diagonals immediately above and below the main diagonal.

When we know that matrices are sparse, we can save space by storing only the nonzero entries, and we can save time when multiplying matrices by ignoring entries that are zero. We will represent sparse matrix A as follows, by storing information about the rows of A:

int lengthA[n];
pair *elementsA[n];
The value of lengthA[i] is the number of nonzero entries in row i of A. Variable elementsA[i] points to a list of the nonzero entries in row i. Each entry is represented by a pair (record): an integer column index and a double-precision data value. Thus if lengthA[i] is 3, there are three pairs in the list elementsA[i]. These pairs are ordered by increasing values of the column indexes. As a concrete example, consider a six-by-six matrix containing four nonzero elements: one in row 0, column 3; one in row 3, column 1; one in row 3, column 4; and one in row 5, column 0.

We will represent matrix C in the same way as A. However, to facilitate matrix multiplication, we will represent matrix B by columns rather than rows. In particular, lengthB indicates the number of nonzero elements in each column of B, and elementsB contains pairs of row index and data value. Computing the matrix product of A and B requires, as usual, examining the n² pairs of rows and columns.
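To make the representation concrete, here is one way the six-by-six example just described might be stored in Python; only the positions of the nonzeros are given in the text, so the numeric data values below are illustrative, not from the book.

# Hypothetical contents for the six-by-six example: nonzeros at
# (0,3), (3,1), (3,4), and (5,0); the data values are made up.
n = 6
lengthA   = [1, 0, 0, 2, 0, 1]
elementsA = [
    [(3, 2.5)],              # row 0: one (column, value) pair
    [], [],                  # rows 1 and 2: empty
    [(1, -1.0), (4, 7.0)],   # row 3: two pairs, ordered by column
    [],                      # row 4: empty
    [(0, 4.0)],              # row 5: one pair
]
# Matrix B would be stored the same way, but by columns: lengthB[j] counts
# the nonzeros in column j, and elementsB[j] holds (row, value) pairs
# ordered by row index.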
module Manager
  type pair = (int index, double value);
  op getTask(result int row, len; result pair [*] elems);
  op putResult(int row, len; pair [*] elems);
body Manager
  int lengthA[n], lengthC[n];
  pair *elementsA[n], *elementsC[n];
  # matrix A is assumed to be initialized
  int nextRow = 0, tasksDone = 0;

  process manager {
    while (nextRow < n or tasksDone < n) {
      # more tasks to do or more results needed
      in getTask(row, len, elems) ->
            row = nextRow;  len = lengthA[nextRow];
            copy pairs in *elementsA[nextRow] to elems;
            nextRow++;
      [] putResult(row, len, elems) ->
            lengthC[row] = len;
            copy pairs in elems to *elementsC[row];
            tasksDone++;
      ni
    }
  }
end Manager

Figure 9.1 (a)   Sparse matrix multiplication: Manager process.
With sparse matrices, the most natural size for a task is one row of the result matrix C, because an entire row is represented by a single vector of (column, value) pairs. Thus there are as many tasks as there are rows of A. (An obvious optimization would be to skip over rows of A that are entirely zeros, namely, those for which lengthA[i] is 0, because the corresponding rows of C would also be all zeros. This is not likely to matter in realistic applications, however.)

Figure 9.1 contains code to implement sparse matrix multiplication using a manager and several worker processes. We assume that matrix A is already initialized in the manager and that each worker contains an initialized copy of matrix B. The processes interact using the rendezvous primitives of Chapter 8, because that leads to the simplest program. To use asynchronous message passing, the manager would be programmed in the style of an active monitor (Section 7.3), and the call in the workers would be replaced by send then receive.
process Worker[w = 1 to numWorkers] {
  int lengthB[n];  pair *elementsB[n];   # assumed to be initialized
  int row, lengthA, lengthC;
  pair *elementsA, *elementsC;
  int r, c, na, nb;                      # used in computing
  double sum;                            #   inner products
  while (true) {
    # get a row of A, then compute a row of C
    call getTask(row, lengthA, elementsA);
    lengthC = 0;
    for [i = 0 to n-1]
      INNER_PRODUCT(i);                  # see body of text
    send putResult(row, lengthC, elementsC);
  }
}
Figure 9.1 (b)   Sparse matrix multiplication: Worker processes.
The manager process in Figure 9.1 (a) implements the bag and accumulates results. It services two operations: getTask and putResult. Integer nextRow identifies the next task, namely, the next row. Integer tasksDone counts the number of results that have been returned. When a task is handed out, nextRow is incremented; when a result is returned, tasksDone is incremented. When both values are equal to n, all tasks have been completed.

The code for the worker processes is shown in Figure 9.1 (b). Each worker repeatedly gets a new task, performs it, then sends the result to the manager. A task consists of computing a row of the result matrix C, so a worker executes a for loop to compute n inner products, one for every column of B. However, the code for computing an inner product of two sparse vectors differs significantly from the loop for computing an inner product of two dense vectors. In particular, the meaning of INNER_PRODUCT(i) in the worker code is

sum = 0.0;  na = 1;  nb = 1;
c = elementsA[na]->index;        # column in row of A
r = elementsB[i][nb]->index;     # row in column of B
while (na <= lengthA and nb <= lengthB) {
  if (r == c) {
    sum += elementsA[na]->value * elementsB[i][nb]->value;
    na++;  nb++;
    c = elementsA[na]->index;  r = elementsB[i][nb]->index;
  }
  else if (r < c) {
    nb++;  r = elementsB[i][nb]->index;
  }
  else {    # r > c
    na++;  c = elementsA[na]->index;
  }
}
if (sum != 0.0) {    # extend row of C
  elementsC[lengthC] = pair(i, sum);
  lengthC++;
}
The basic idea is to scan through the sparse representations of the row of A and column i of B. (The above code assumes that the row and the column each contain at least one nonzero element.) The inner product will be nonzero only if there are pairs of values that have the same column index c in A and row index r in B. The while loop above finds all these pairs and adds up their products. If the sum is nonzero when the loop terminates, then a new pair is added to the vector representing the row of C. (We assume that space for the elements has already been allocated.) After all n inner products have been computed, a worker sends the row of C to the manager, then calls the manager to get another task.
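The same merge can be written compactly in Python. This sketch assumes each sparse vector is a list of (index, value) pairs sorted by index, as in the representation above; unlike the code in the text, it also handles empty vectors, and the function name sparse_dot is ours.

def sparse_dot(row_a, col_b):
    """Inner product of two sparse vectors given as sorted (index, value) lists."""
    na, nb, total = 0, 0, 0.0
    while na < len(row_a) and nb < len(col_b):
        ca, va = row_a[na]
        rb, vb = col_b[nb]
        if ca == rb:          # matching indexes contribute to the sum
            total += va * vb
            na += 1; nb += 1
        elif ca < rb:         # advance whichever index is smaller
            na += 1
        else:
            nb += 1
    return total

# Example: row 3 of A against a sparse column with nonzeros at rows 1 and 2.
print(sparse_dot([(1, -1.0), (4, 7.0)], [(1, 3.0), (2, 5.0)]))   # -3.0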
9.1.2 Adaptive Quadrature Revisited

The quadrature problem was introduced in Section 1.5, where we presented static (iterative) and dynamic (recursive) algorithms for approximating the integral of a function f(x) from a to b. Section 3.6 showed how to implement adaptive quadrature using a shared bag of tasks. Here we use a manager to implement a distributed bag of tasks. However, we use a combination of the static and dynamic algorithms rather than the "pure" adaptive quadrature algorithm used in Figure 3.21. In particular, we divide the interval from a to b into a fixed number of subintervals and then use the adaptive quadrature algorithm within each subinterval. This combines the simplicity of an iterative algorithm with the greater accuracy of an adaptive algorithm. In particular, by using a static number of tasks, the manager and workers are simpler and there are far fewer interactions between them.

Figure 9.2 contains the code for the manager and workers. Because the manager is essentially a server process, we again use rendezvous for interaction
module Manager
  op getTask(result double left, right);
  op putResult(double area);
body Manager
  process manager {
    double a, b;               # interval to integrate
    int numIntervals;          # number of intervals to use
    double width = (b-a)/numIntervals;
    double x = a, totalArea = 0.0;
    int tasksDone = 0;
    while (tasksDone < numIntervals) {
      in getTask(left, right) st x < b ->
            left = x;  x += width;  right = x;
      [] putResult(area) ->
            totalArea += area;
            tasksDone++;
      ni
    }
    print the result totalArea;
  }
end Manager

double f() { ... }        # function to integrate
double quad(...) { ... }  # adaptive quad function

process Worker[w = 1 to numWorkers] {
  double left, right, area = 0.0;
  double fleft, fright, lrarea;
  while (true) {
    call getTask(left, right);
    fleft = f(left);  fright = f(right);
    lrarea = (fleft + fright) * (right - left) / 2;
    # calculate area recursively as shown in Section 1.5
    area = quad(left, right, fleft, fright, lrarea);
    send putResult(area);
  }
}

Figure 9.2   Adaptive quadrature using manager/workers paradigm.
between the manager and workers. Thus the manager has the same structure as that shown in Figure 9.1 (a) and again exports two operations, getTask and putResult. However, the operations have different parameters, because now a task is defined by the endpoints of an interval, left and right, and a result consists of the area under f(x) over that interval. We assume the values of a, b, and numIntervals are given, perhaps as command-line arguments. From these values, the manager computes the width of each interval. The manager then loops, accepting calls of getTask and putResult, until it has received one result for each interval (and hence task). Notice the use of a synchronization expression "st x < b" in the arm for getTask in the input statement; this prevents getTask from handing out another task when the bag is empty.

The worker processes in Figure 9.2 share or have their own copies of the code for functions f and quad, where quad is the recursive function given in Section 1.5. Each worker repeatedly gets a task from the manager, calculates the arguments needed by quad, calls quad to approximate the area under f(x) from left to right, then sends the result to the manager.

When the program in Figure 9.2 terminates, every worker will be blocked at its call of getTask. This is often harmless, as here, but at times it can indicate deadlock. We leave to the reader modifying the program so that the workers terminate normally. (Hint: Make getTask a function that returns true or false.)

In this program the amount of work per task varies, depending on how rapidly the function f varies. Thus, if there are only about as many tasks as workers, the computational load will almost certainly be unbalanced. On the other hand, if there are too many tasks, there will be unnecessary interactions between the manager and workers, and hence unnecessary overhead. The ideal is to have just enough tasks so that, on average, each worker will be responsible for about the same total amount of work. A reasonable heuristic is to have from two to three times as many tasks as workers, which here means that numIntervals should be two to three times larger than numWorkers.
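For readers without Section 1.5 at hand, the recursive quad function the workers call can be sketched as follows. The parameter names mirror Figure 9.2, but the tolerance EPSILON, the trapezoidal rule, and passing f as an argument are our assumptions; this is an illustration, not the book's exact code.

EPSILON = 1e-6   # assumed accuracy threshold

def quad(left, right, fleft, fright, lrarea, f):
    """Adaptively approximate the integral of f over [left, right]."""
    mid = (left + right) / 2
    fmid = f(mid)
    larea = (fleft + fmid) * (mid - left) / 2    # trapezoid on left half
    rarea = (fmid + fright) * (right - mid) / 2  # trapezoid on right half
    if abs((larea + rarea) - lrarea) > EPSILON:
        # not accurate enough: recurse on both halves
        larea = quad(left, mid, fleft, fmid, larea, f)
        rarea = quad(mid, right, fmid, fright, rarea, f)
    return larea + rarea

# Example: integrate x**2 over [0, 1]; the exact answer is 1/3.
f = lambda x: x * x
print(quad(0.0, 1.0, f(0.0), f(1.0), (f(0.0) + f(1.0)) / 2, f))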
9.2 Heartbeat Algorithms

The bag-of-tasks paradigm is useful for solving problems that have a fixed number of independent tasks or that result from use of the divide-and-conquer paradigm. The heartbeat paradigm is useful for many data-parallel iterative applications. In particular, it can be used when the data is divided among workers, each is responsible for updating a particular part, and new data values depend on values held by workers or their immediate neighbors. Applications include
grid computations, which arise in image processing or solving partial differential equations, and cellular automata, which arise in simulations of phenomena such as forest fires or biological growth.

Suppose we have an array of data. Each worker is responsible for a part of the data and has the following code outline:

process Worker[i = 1 to numWorkers] {
  declarations of local variables;
  initialize local variables;
  while (not done) {
    send values to neighbors;
    receive values from neighbors;
    update local values;
  }
}
We call this type of process interaction a heartbeat algorithm because the actions of each worker are like the beating of a heart: expand, sending information out; contract, gathering new information; then process the information and repeat.

If the data is a two-dimensional grid, it could be divided into strips or blocks. With strips there would be a vector of workers, and each would have two neighbors (except perhaps for the workers at the ends). With blocks there would be a matrix of workers, and each would have from two to eight neighbors, depending on whether the block is on the corner, edge, or interior of the array of data, and depending on how many neighboring values are needed to update local values. Three-dimensional arrays of data could be divided similarly into planes, rectangular prisms, or cubes.

The send/receive interaction pattern in a heartbeat algorithm produces a "fuzzy" barrier among the workers. Recall that a barrier is a synchronization point that all workers must reach before any can proceed. In an iterative computation, a barrier ensures that every worker finishes one iteration before starting the next. Above, the message exchange ensures that a worker does not begin a new update phase until its neighbors have completed the previous update phase. Workers that are not neighbors can get more than one iteration apart, but neighbors cannot. A true barrier is not required because workers share data only with their neighbors.

Below we develop heartbeat algorithms for two problems: region labeling, an example of image processing; and the Game of Life, an example of cellular automata. Chapter 11 and the exercises describe additional applications.
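A strip decomposition with the expand/contract pattern can be mimicked with Python threads and queues; this is only a model of the interaction, with the names (NUM_WORKERS, STEPS, chan) and the smoothing update chosen for the example, not taken from the book.

# Heartbeat sketch: three workers each own a strip of a 1-D array and
# repeatedly exchange boundary values with their neighbors, then update.
import threading, queue

NUM_WORKERS, STEPS = 3, 5
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]   # the strips
chan = [[queue.Queue() for _ in range(NUM_WORKERS)] for _ in range(NUM_WORKERS)]

def worker(i):
    strip = data[i]
    for _ in range(STEPS):
        # expand: send boundary values to the left and right neighbors
        if i > 0:                chan[i - 1][i].put(strip[0])
        if i < NUM_WORKERS - 1:  chan[i + 1][i].put(strip[-1])
        # contract: receive neighbors' boundaries (reuse own value at the ends)
        left  = chan[i][i - 1].get() if i > 0 else strip[0]
        right = chan[i][i + 1].get() if i < NUM_WORKERS - 1 else strip[-1]
        # update: simple smoothing using local and neighboring values
        padded = [left] + strip + [right]
        strip = [(padded[k - 1] + padded[k] + padded[k + 1]) / 3
                 for k in range(1, len(padded) - 1)]
    data[i] = strip

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
print(data)

Because each worker sends before it receives, neighbors can never get more than one iteration apart, which is exactly the "fuzzy barrier" described above.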
9.2.1 Image Processing: Region Labeling

An image is a representation of a picture; it typically consists of a matrix of numbers. Each element of an image is called a pixel (picture element) and has a value that is its light intensity or color. There are dozens of image-processing operations, and each can be ...

... performance depends on how the MPI library is implemented on a given machine.
9.2.2 Cellular Automata: The Game of Life

Many biological or physical systems can be modeled as a collection of bodies that repeatedly interact and evolve over time. Some systems, especially simple ones, can be modeled using what are called cellular automata. (We examine a more complex system, gravitational interaction, in Chapter 11.) The idea is to divide the biological or physical problem space into a collection of cells. Each cell is a finite state machine. After the cells are initialized, all make one state transition, then all make a second, and so on. Each transition is based upon the current state of the cell as well as the states of its neighbors.

Here we use cellular automata to model what is called the Game of Life. A two-dimensional board of cells is given. Each cell either contains an organism (it is alive) or it is empty (it is dead). For this problem, each interior cell has eight neighbors, located above, below, left, right, and along the four diagonals. Cells in the corners have three neighbors; those on the edges have five.
The Game of Life is played as follows. First, the board is initialized. Second, every cell examines its state and the state of its neighbors, then makes a state transition according to the following rules:

A live cell with zero or one live neighbors dies from loneliness.

A live cell with two or three live neighbors survives for another generation.

A live cell with four or more live neighbors dies due to overpopulation.

A dead cell with exactly three live neighbors becomes alive.
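The four rules reduce to a small function of a cell's current state and its live-neighbor count; the helper below is our own sketch in Python, not part of Figure 9.4.

def next_state(alive, live_neighbors):
    """Apply the Game of Life rules to one cell."""
    if alive:
        # survives only with two or three live neighbors
        return live_neighbors in (2, 3)
    else:
        # a dead cell with exactly three live neighbors becomes alive
        return live_neighbors == 3

# Example: a live cell with one live neighbor dies of loneliness.
print(next_state(True, 1))    # False
print(next_state(False, 3))   # True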
This process is repeated for some number of generations (steps).

Figure 9.4 outlines a program for simulating the Game of Life. The processes interact using the heartbeat paradigm. On each iteration a cell sends a message to each of its neighbors and receives a message from each. It then updates its local state according to the above rules. As usual with a heartbeat algorithm, the processes do not execute in lockstep, but neighbors never get an iteration ahead of each other. For simplicity, we have programmed each cell as a process, although the board could be divided into strips or blocks of cells. We have also ignored the special cases of edge and corner cells. Each process cell[i,j] receives
chan exchange[1:n,1:n](int row, column, state);

process cell[i = 1 to n, j = 1 to n] {
  int state;                            # initialize to dead or alive
  declarations of other variables;
  for [k = 1 to numGenerations] {
    # exchange state with 8 neighbors
    for [p = i-1 to i+1, q = j-1 to j+1]
      if (p != i or q != j)
        send exchange[p,q](i, j, state);
    for [p = 1 to 8] {
      receive exchange[i,j](row, column, value);
      save value of neighbor's state;
    }
    update local state using rules in text;
  }
}
The Game of Life.
1 I
9.3 Pipeline Algorithms
437
.
lnessages From element exchange [ i j I of the mattix of corn~nunicacionchannels, and jt sends messages to neighboring eIemenrs of exchange. (Recall that our channels are buffered and that send is nonblocking.) The reader might find it instructive to imple~nentthis program and to display the srates of the cells on a graphical display.
UFFE ,Ocm
9.3 Pipeline Algorithms Recall that a filter process receives data from an input pol-t, processes it, and sends ~esultsto an outpul port. A pipeline is a linear collection or filter processes. We have already seen the concept several rimes. includil~gUnix pipes (Section 1.6): a sorring network (Section 7.2), and as a way to circulate values among processes (Section 7.4). The paradigm i s also useful in par.nllel compucing, as we show here. We always use some number of wo~.kerprocesses to solve a parallel computing pl-oblem. Sornetitncs wc can program the workers as fillers and connect then1 together into a parallel computing pipeline. There are three basic strucllrres for such a pipeline, as shown in Figure 9.5: ( I ) open, (2) closed, and (3) circular. The I V , L o MJr, are worker processes. In an opeiz pipeline. the illput source and oulput destinatioil are nut specified. Such a pipeline can be "dropped dowo"
1 (b) closed
c
(c) circular
Figure 9.5
Pipeline structures for parallel computing.
438
Chapter 9
Paradigms for Process Interaction
nnywlierc that ir will f i t . A closed pipeline is an open pipeline (.hilt is connected (o a Coo~-clincr,of. process, which produces the input ceded by the lirsl worker and consumes the resulcs producecl by Lhe last ~orlter. Thr. U n i s command "grep pattern f k l e I wc" is an example of an open pipeline Lhat call be put in various placcs; when it is actually executed on a shell co~n~nnnd Jine. il becomes par1 o f a closed pipeJine, wirh the human user being he cool-dinator: A pipelinc i s c i w u h r il' Lhe ends are closed on cach other; i n this case data circulates :imong c.he worke~x. In Sccdun 1.8 we inLrocluced L\vo distributed i~npleme~ltations of matrix multiplication a x b = c, where all of a, b, and c are dense n x n mal~ices. Thc Lirst solutiorl simply divided Lhc work lip among n workers, one per row or a iuld C , but each process needed to store all of b. The second solution a J s o used n wo~.kess,but each needed Lo store only one column of b. That solution of b circulating among ;~cluallye~nploycda circulnc pipeline, with [lie coli~~nns the wo~kers. 1-lcre we examine two addilio~lal diswibured implcrnenlations of dense nrsltrix multiplication. The firs1 e~nploysa cJosed pipeline; the second employs a mesh of cisculal- pipcliues. Each has intevest i ng properties relative to lhe priorsnlu~ions,and each illustrates a pattcrn that has otlier applications.
9.3.1 A Distributed Matrix Multiplication Pipeline For simplicity, we will again assume tl~ateach matrix has size n x n, and we will employ n worker processes. Each worker will produce one row of c. I-lowevec initially the c\/orlcers do not have any pzwt of a or b. Instead, we will connect t t ~ eworkers in a closed pipcline through which [hey acquire dl data items and pass all resu.lrs. In particular. the coordi~~ator process sends every row of a and every column of b down the pipeline to the first worker; eventually. he coordinator I-eceivesevery row of c fi-om I-he last worker. Figure 9.6 (a) contains the acrjons of the caordinacor process. As indicated, (.he coortlinator sends rows a [ 0, * I ro a [n-1, *I lo channel vector [ 0 I , which is the input cli:unnel for worker 0. Then rhe coordinator sends colu~nnsb[*, 0 1 lo b [ * ,n- 11 to worker 0 . Finally, the coo~.dinato~receives the I-owsof I-esults(it will get Ihese from worker n-1). However, [he results arrive in (lie order c [n-1,*I to c (0,*I 1-01reasons explai~lcdbelow. phases. First: il. reccives rowh of Each wot.lie~pi-occss h;rs ~lirceexeci~~ion a , keeping the (i~.sl one il receives arld passing the otllcrs on. This phase distrihulcs the rows OF a among the workers, with Worker [i] saving ~ h cvalue oC a [i,* I . Second, workers rcceive columns of b. immediately pass then1 on r o
the next worker, then compute one inner product. This phase is repeated n times by each worker, after which time it will have computed c[i,*]. Third, each worker sends its row of c to the next worker, and then receives and passes on rows of c from previous workers in the pipeline. The last worker sends its row of c and the others it receives to the coordinator. These are sent to the coordinator in the order c[n-1,*] to c[0,*], because that is the order in which the last worker sees them; this also decreases communication delays and means that the last worker does not need local storage for all of c. Figure 9.6 (b) shows the actions of the worker processes. The comments indicate the three phases. The extra details are the bookkeeping needed to distinguish between the last worker and the others in the pipeline.

This solution has several interesting properties. First, messages chase each other down the pipeline: first the rows of a, then the columns of b, and finally the rows of c. There is essentially no delay between the time a worker receives a message and the time it passes it along, so messages keep flowing continuously. When a worker is computing an inner product, it has already passed along the column it is using, and hence another worker can get it, pass it along, and start computing its own inner product. Second, it takes a total of n message-passing times for the first worker to receive all the rows of a and pass them along. It takes another n-1 message-passing times to fill the pipeline, namely, to get every worker its row of a. However, once the pipeline is full, inner products get computed about as fast as messages can flow. This is because, as observed above, the columns of b
chan vector[n](double v[n]);     # messages to workers
chan result(double v[n]);        # rows of c to coordinator

process Coordinator {
  double a[n,n], b[n,n], c[n,n];
  initialize a and b;
  for [i = 0 to n-1]             # send all rows of a
    send vector[0](a[i,*]);
  for [i = 0 to n-1]             # send all columns of b
    send vector[0](b[*,i]);
  for [i = n-1 to 0]             # receive rows of c
    receive result(c[i,*]);      #   in reverse order
}

Figure 9.6 (a)   Matrix multiplication pipeline: Coordinator process.
process Worker[w = 0 to n-1] {
  double a[n], b[n], c[n];       # my row or column of each
  double temp[n];                # used to pass vectors on
  double total;                  # used to compute inner products
  # receive rows of a; keep first and pass others on
  receive vector[w](a);
  for [i = w+1 to n-1] {
    receive vector[w](temp);
    send vector[w+1](temp);
  }
  # get columns and compute inner products
  for [j = 0 to n-1] {
    receive vector[w](b);        # get a column of b
    if (w < n-1)                 # if not last worker, pass it on
      send vector[w+1](b);
    total = 0.0;
    for [k = 0 to n-1]           # compute one inner product
      total += a[k] * b[k];
    c[j] = total;                # put total into c
  }
  # send my row of c to next worker or coordinator
  if (w < n-1)
    send vector[w+1](c);
  else
    send result(c);
  # receive and pass on earlier rows of c
  for [i = 0 to w-1] {
    receive vector[w](temp);
    if (w < n-1)
      send vector[w+1](temp);
    else
      send result(temp);
  }
}

Figure 9.6 (b)   Matrix multiplication pipeline: Worker processes.
immediately follow the rows of a, and they get passed along as soon as they are received. If it takes longer to compute an inner product than to send and receive a message, then the computation time will start to dominate once the pipeline is full. We leave to the reader the interesting challenges of devising analytical performance equations and conducting timing experiments.

Another interesting property of the solution is that it is trivial to vary the number of columns of b. All we need to change are the upper bounds on the loops that deal with columns. In fact, the same code could be used to multiply a by any stream of vectors, producing a stream of vectors as results. For example, a could represent a set of coefficients of linear equations, and the stream of vectors could be different combinations of values for the variables.

The pipeline can also be "shrunk" to use fewer workers. The only change needed is to have each worker store a strip of the rows of a. We can still have the columns of b and rows of c pass through the pipeline in the same way, or we could send fewer, but longer, messages. Finally, the closed pipeline used in Figure 9.6 can be opened up and the workers can be placed into another pipeline. For example, instead of having a coordinator process that produces the vectors and consumes the results, the vectors could be produced by other processes, such as another matrix multiplication pipeline, and the results could be consumed by yet another process. To make the worker pipeline fully general, however, all vectors would have to be passed along the entire pipeline (even rows of a) so that they would come out the end and hence would be available to some other process.
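The idea of an open pipeline that can be dropped into a larger computation is easy to mimic with Python generators, which compose like Unix filters. The stage names below (keep_matching, to_upper) are invented for the example; the point is only that each stage reads from its input and writes to its output, and a "coordinator" closes the pipeline by supplying the source and consuming the result.

# An "open" pipeline of filters built from Python generators.
def keep_matching(lines, pattern):
    for line in lines:                 # like grep: pass through matching lines
        if pattern in line:
            yield line

def to_upper(lines):
    for line in lines:                 # a second filter stage
        yield line.upper()

# "Closing" the pipeline: a coordinator supplies input and consumes output.
source = ["alpha", "beta", "alphabet"]
for out in to_upper(keep_matching(source, "alpha")):
    print(out)                         # ALPHA, ALPHABET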
9.3.2 Matrix Multiplication by Blocks

The performance of the previous algorithm is determined by the length of the pipeline and the time it takes to send and receive messages. The communication network on some high-performance machines is organized as a two-dimensional mesh or in an arrangement called a hypercube. These kinds of interconnection networks allow messages between different pairs of neighboring processors to be in transit at the same time. Moreover, they reduce the distance between processors relative to a linear arrangement, which reduces message transmission time.

An efficient way to multiply matrices on meshes and hypercubes is to divide the matrices into rectangular blocks and to employ one worker process per block. The workers and data are then laid out on the processors as a two-dimensional grid. Each worker has four neighbors: above and below, and to the left and right of it. The workers on the top and bottom of the grid are considered to be neighbors, as are the workers on the left and right of the grid.
Again the problem is to compute the matrix product of two n × n matrices a and b, storing the result in matrix c. To simplify the code, we will use one worker per matrix element and index the rows and columns from 1 to n. (At the end of the section, we describe how to use blocks of values.) Let Worker[1:n,1:n] be the matrix of worker processes. Matrices a and b are distributed initially so that each Worker[i,j] has the corresponding elements of a and b. To compute c[i,j], Worker[i,j] needs to multiply every element in row i of a by the corresponding element in column j of b and sum the results. But the order in which the worker performs these multiplications does not matter! The challenge is to find a way to circulate the values among the workers so that each gets every pair of values that it needs.

First consider Worker[1,1]. To compute c[1,1], it needs to get every element of row 1 of a and column 1 of b. Initially it has a[1,1] and b[1,1], so it can multiply them. If we then shift row 1 of a to the left one column and shift column 1 of b up one row, Worker[1,1] will have a[1,2] and b[2,1], which it can multiply and add to c[1,1]. If we repeat this n-2 more times, shifting the elements of a left and the elements of b up, Worker[1,1] will see
all the values it needs. Unfortunately, this multiply and shift sequence will work correctly only for workers handling elements on the diagonal. Other workers will see every value they need, but not in the right combinations. However, it is possible to rearrange a and b before we start the multiply and shift sequence. In particular, first shift row i of a circularly left i columns and column j of b circularly up j rows. (It is not at all obvious why this particular initial placement works; people came up with it by examining small matrices and then generalizing.) The following display illustrates the result of the initial rearrangement of the values of a and b for a 4 × 4 matrix (indexes wrap around from 4 back to 1, so Worker[i,j] ends up holding a[i, i+j] and b[i+j, j]):

a[1,2] b[2,1]   a[1,3] b[3,2]   a[1,4] b[4,3]   a[1,1] b[1,4]
a[2,3] b[3,1]   a[2,4] b[4,2]   a[2,1] b[1,3]   a[2,2] b[2,4]
a[3,4] b[4,1]   a[3,1] b[1,2]   a[3,2] b[2,3]   a[3,3] b[3,4]
a[4,1] b[1,1]   a[4,2] b[2,2]   a[4,3] b[3,3]   a[4,4] b[4,4]
After this initial rearrangement of values, each worker has two values, which it stores in local variables aij and bij. Each worker next initializes variable cij to aij*bij and then executes n-1 shift/compute phases. In each phase, values in aij are sent left one column; values in bij are sent up one row; and
chan left[1:n,1:n](double);   # for circulating a left
chan up[1:n,1:n](double);     # for circulating b up

process Worker[i = 1 to n, j = 1 to n] {
  double aij, bij, cij;
  int LEFTI, UPJ, LEFT1, UP1;
  initialize above values;
  # shift values in aij circularly left i columns
  send left[i,LEFTI](aij);  receive left[i,j](aij);
  # shift values in bij circularly up j rows
  send up[UPJ,j](bij);  receive up[i,j](bij);
  cij = aij * bij;
  for [k = 1 to n-1] {
    # shift aij left 1, bij up 1, then multiply and add
    send left[i,LEFT1](aij);  receive left[i,j](aij);
    send up[UP1,j](bij);  receive up[i,j](bij);
    cij = cij + aij*bij;
  }
}

Figure 9.7   Matrix multiplication by blocks.
new values are received, multiplied together, and added to cij. When the workers terminate, the matrix product is stored in the cij variables, one element per worker.

Figure 9.7 contains the code for this matrix multiplication algorithm. The workers share n² channels for circulating values left and another n² channels for circulating values up. These are used to form 2n intersecting circular pipelines. Workers in the same row are connected in a circular pipeline through which values flow to the left; workers in the same column are connected in a circular pipeline through which values flow up. The constants LEFT1, UP1, LEFTI, and UPJ in each worker are initialized to the appropriate values and used in send statements to index the arrays of channels.

The program in Figure 9.7 is obviously inefficient (unless it were to be implemented directly in hardware). There are far too many processes and messages, and far too little computation per process. However, the algorithm can readily be generalized to use square or rectangular blocks. In particular, each worker can be assigned blocks of a and b. The workers first shift their block of a left i blocks of columns and their block of b up j blocks of rows. Each worker then initializes its block of the result matrix c to the inner products of its
new blocks of a and b. The workers then execute n-1 phases of: shift a left one block, shift b up one block, compute the new inner products, and add them to c. We leave the details to the reader (see the exercises at the end of this chapter).

An additional way to improve the efficiency of the code in Figure 9.7 is to execute both sends before either receive when shifting values. In particular, change the code from send/receive/send/receive to send/send/receive/receive. This decreases the likelihood that receive blocks. It also makes it possible to transmit messages in parallel if that is supported by the hardware interconnection network.
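The initial skew plus repeated shift-and-multiply can be checked with a small sequential simulation. The sketch below uses NumPy's roll to stand in for the circular message passing (one element per worker, as in Figure 9.7, with 0-based indexing); it is a model of the communication pattern under these assumptions, not a parallel program.

import numpy as np

def shift_multiply(a, b):
    """Sequentially simulate the shift/compute phases of Figure 9.7."""
    n = a.shape[0]
    a = a.copy(); b = b.copy()
    # initial rearrangement: row i of a left i columns, column j of b up j rows
    for i in range(n):
        a[i, :] = np.roll(a[i, :], -i)
    for j in range(n):
        b[:, j] = np.roll(b[:, j], -j)
    c = a * b                        # each "worker" multiplies its pair
    for _ in range(n - 1):           # n-1 shift/compute phases
        a = np.roll(a, -1, axis=1)   # every aij moves left one column
        b = np.roll(b, -1, axis=0)   # every bij moves up one row
        c += a * b
    return c

a = np.arange(16.0).reshape(4, 4)
b = np.arange(16.0, 32.0).reshape(4, 4)
print(np.allclose(shift_multiply(a, b), a @ b))   # True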
9.4 Probe/Echo Algorithms

Trees and graphs are used in many applications, such as Web searches, databases, games, and expert systems. They are especially important in distributed computing, since the structure of many distributed computations is a graph in which processes are nodes and communication channels are edges. Depth-first search (DFS) is one of the classic sequential programming paradigms for visiting all the nodes in a tree or graph. In a tree, the DFS strategy for each node is to visit the children of that node and then to return to the parent. This is called depth-first search since each search path reaches down to a leaf before the next path is traversed; for example, the path in the tree from the root to the leftmost leaf is traversed first. In a general graph, which may have cycles, the same approach is used, except we need to mark nodes as they are visited so that edges out of a node are traversed only once.

This section describes the probe/echo paradigm for distributed computations on graphs. A probe is a message sent by one node to its successor; an echo is a subsequent reply. Since processes execute concurrently, probes are sent in parallel to all successors. The probe/echo paradigm is thus the concurrent programming analog of DFS. We first illustrate the probe paradigm by showing how to broadcast information to all nodes in a network. We then add the echo paradigm by developing an algorithm for constructing the topology of a network.
9.4.1 Broadcast in a Network

Assume that we have a network of nodes (processors) connected by bidirectional communication channels. Each node can communicate directly only with its neighbors. Thus the network has the structure of an undirected graph. Suppose one source node s wants to broadcast a message to all other nodes. (More precisely, suppose a process executing on s wants to broadcast a message
to processes executing on all the other nodes.) For example, s might be the site of the network coordinator, who wants to broadcast new status information to all other sites. If every other node is a neighbor of s, broadcast would be trivial to implement: node s would simply send a message directly to every other node. However, in large networks, each node is likely to have only a small number of neighbors. Node s can send the message to its neighbors, but they would have to forward it to their neighbors, and so on. In short, we need a way to send a probe to all nodes.

Assume that node s has a local copy of the network topology. (We later show how to compute it.) The topology is represented by a symmetric Boolean matrix; entry topology[i,j] is true if nodes i and j are connected and false otherwise. An efficient way for s to broadcast a message is first to construct a spanning tree of the network, with itself as the root of the tree. A spanning tree of a graph is a tree whose nodes are all those in the graph and whose edges are a subset of those in the graph. Figure 9.8 contains an example, with node s on the left. The solid lines are the edges in a spanning tree; the dashed lines are the other edges in the graph.

Given spanning tree t, node s can broadcast a message m by sending m together with t to all its children in t. Upon receiving the message, every node examines t to determine its children in the spanning tree, then forwards both m and t to all of them. The spanning tree is sent along with m since nodes other than s would not otherwise know what spanning tree to use. The full algorithm is given in Figure 9.9. Since t is a spanning tree, eventually the message will reach every node; moreover, each node will receive it exactly once, from its parent in t. We use a separate Initiator process on node s to start the broadcast; this makes the Node processes on each node identical. The broadcast algorithm in Figure 9.9 assumes that the initiator node knows the entire topology, which it uses to compute a spanning tree that guides the broadcast.
Figure 9.8   A spanning tree of a network of nodes.
type graph = bool [n,n];
chan probe[n](graph spanningTree; message m);

process Node[p = 0 to n-1] {
  graph t;  message m;
  receive probe[p](t, m);
  for [q = 0 to n-1 st q is a child of p in t]
    send probe[q](t, m);
}

process Initiator {            # executed on source node S
  graph topology = network topology;
  graph t = spanning tree of topology;
  message m = message to broadcast;
  send probe[S](t, m);
}

Figure 9.9   Network broadcast using a spanning tree.
9.4 Probe/Echo Algorithms
447
I
chan probe [nl (message m) ;
process Node[p
=
1 to n]
(
boo1 l i n k s [n] = neighbors o f node P; int num = nil~nberof neighbors; message m; receive probe [pl (rn) ; # send m to all neighbors for [q = 0 to n-1 st links[gll send probe [gl (m) ; # receive nun-I redundant copies of m for [q = 1 to num-11 receive probe (p] (m);
1 process Initiator { # executed on source node S message rn = message tu broadcast; send probe [ S l (m); 1 Figure 9.10
'
Broadcast using neighbor sets,
The broaclcast algorilhm usi~iga spanning tree causes n-1 messages to be sent, one Ibr each parent/child edge in the spanning tree. Tlie algoritbln using neighbor sets causes two messages to be sent over every l i n k ill the nelwork, one j,o, each direcrion. The exact number depends on the topology of the elwo work, bul in genwal the number will be much larger than n-1. For example, il' the network topology is a tree rooted al lie initiator process. 2 ( n - 1 ) messages w i l l be sent; for n complete graph in which there is a link between every pair o f nodes, zn(n-1) nlessages will be s e o ~However, dle neighbor-set algorithm does not ~ node to know the ky~ologyor to eompule a spanning tree. In require t h initjator essence. 21 spanning [I-eeis consrrucled dynamically; it co~isistsof the links along which the first copies of rn are sent. Also, the tnessages arc shorter in the neighbor-sel algorithm since [he spanning lree need not be sent i n each msssikgc. BoLll bl-oaclcast algorith~nsassiltnc Lhac the Lopology of the network does not change. In par-cicula.,neither works con-ectly if there is a processor 01- con~lnunicution link failure while the algorithm is executing. If a node fa'ails, it cannot receive the message being broadcasl. Ifii link fails, it rnighl. or rnight not he possible Lo ccacli the nodes connected by the link. The H istol-ical Notes at [lie end of this chapler describe papers tl~ataddrcss the problem o f implementing n faultI.olerant bmadcasl.
448
Chapter 9
Paradigms for Process Interaction
9.4.2 Computing the Topology of a Network The efficient b~.oaclcastalgorithm in Fignre 9.9 required knowiiig tlie topology of the netwol-k. Here we show how lo conlpute il. Initially, every node knows its local Lopology-namely; ils links lo its neigl~hors. The goal is to gather all the local ropologies, because their union is the nel.work topology. The topology is g;~tlieredin two phases. First. each node sends a ptobe to ils ~leighhors,as llappened in Figure 9.10. I.,arer, each ode sends an echo cont:lining local topology information back to the node from which it received the first pr.obe. Eventual1y, tbe iniliating noclc has gatherecl all the echoes, and hence has gathered the topology. I t coulcl then, for example, cornpule spanning lree ant1 broadcasl the ~opologyback to lhe other nodes. For now assume that (he topology 0.1 the uetwork is acyclic. Since n nctis a n undirected graph, this Ineans (lie structu~'eis a tree. Let node s be the WOI-k root ni' this lrec. and l e ~s be tlie iniliator node. We can then gather [be topology 11s rollows. First s sends 21, probe to all its children. Wlie~ithese nodes receive a probe; (hey send i l lo all their children, and so 011. Thus probes propagate tk~.oughh e tree. Eventually they wil! rcacll lcaf ~iodcs.Since lliehe have no children. they begin the echo phase. In particular. each lea( sends an echo containing its neighbor sel to its parenl i n Lhe tree. Alfer receiving echoes from all ol'its children. a node combines them and its own neighbor set and cchocs Ulis inkornation ICI its parent. Eventually the root !lode will receive echoes 1'1-om all its children. The union of these will contain the entire l.opology since tlie initial probe will rei~cllevery node and every echo contains rbe neighbor set ol'lhe echoing nocle together with those of i t s descendants. for gathering [he network lopvlogy in a tree The full probejecho algoritl~~m is shown in Figure 9.11. The probe phase is essentially the broadcast algorithm from Figurc 9.10, escepl that probe messages indicate the identity of the sender: The eclio phase returns local topology information back up the rree. In rliis case, the algol.ithms for the nodes are nor fully symmetric since the instance ol' Node [PI execuf ng on node s needs Lo know to send its echo lo the Initiator. To colnpurc the topology of a network rhal contains cycles, we generalize (hc above alporill~mas follows. Atter receiving a probe, a node sends the probe lo ils other neighbors; then Lhe node waits I'or an eclio From those neighbors. 1-lowever. b e c a ~ ~ sofe cycles and because nodes execute concurrenlly, two neighbors might send cach olliel- probe5 al aboul Lhe same rime. Probes otller ban rhe fi~.srone can be echoed irnnledintely. In particular; if a node receives n subsecluenl l~-obewhile waiting for echoes, it ilnnlediately sends an echo containing a null topology (tl~isis suflicie~llsince tlie local linlts of thc node will be contained in tlie echo it will secid in response ro the first pi-obe). Evonlually a node will
449
9.4 PrabeIEcho Algorithms type graph = boo1 [n,n] ; chan probe [n]( int sender) ; chan echo[nl (graph topology) chan finalecho(graph topology)
# parts of the topology # Einal topology
process Node [p = 0 to n-11 ( boo1 links [n] = neighbors of node p; graph newtop, localtop = ([n*n] false); i n k parent; # node from whom probe is received localtog[p,O:n-11 = links; # initially my links receive probe lp] (parent); # send probe to other neighbors, who are p's children for [q = 0 to n-1 st (links[q] and q ! = parent) J send probe lql ( p ) ;
# receive echoes and union them into localtop for [q = 0 to n-1 st (links[gl and q ! = parent)] receive echo [pl (newtop); localtop = localtop or newtop; # logical or 1
if ( p
(
== S )
send finalecho(localtop);
# node S is root
else send echo [parent]( localtop) ;
1
process Initiator ( graph topology; aend probe[SI(S) # start probe at local node receive finalecho(topo1ogy);
1 Figure 9.11
Probelecha algorithm for gathering the topology of
a tree
receive an echo in response to every PI-obs. A1 i b i s point, it sends an echo to h e node from which it got the first probe; the echo co~ltainsthe union o,f the nodc's own set o f links logether with all the sets o f links ir received. The general p~.obelechoa l ~ o r i t h n lfor co~nputingthe nerworlc topology is shown io Fi&u~.c9.12. Bccnusc a nodc mighl rcccjve subsequenl probes while waiting for echocs, the two types of messages havc ro bc merged inlo one channel. ( I f Lhey came in on sepal-ale channels, a node would have to use empty and
450
Chapter 9
Paradigms for Process Interaction type graph = bool [ n , n ] ; type kind = (PROBE, ECHO); chan probe-echo[n] (kind k; int sender; graph topology); chan finalecho(graph topology); procees Node [p = 0 to 11-11 bool links [n] = neighbors otnode P; graph newtop, localtop = ( [n*n] false) ; int first, sender; kind k; int need-echo = nuurtberof neighbors - 1; localtop[p,O:n-I] = links; # initially my links
receive probe-echorpl (k, first, newtop); # get probe # send probe on to to all other neighbors for [q = 0 to n-1 st (links[ql and q != first)J send probe-echo [ql (PROBE, p, 0 ) ;
while (need-echo > 0 ) ( # receive echoes or redundant probes from neighbors receive probe-echo [gl (k, sender, newtop) ; if (k == PROBE) send probe-echo [sender](ECHO, p, 0 ); e l s e # k == ECHO { localtog = localtop or newtop; # logical or need-echo = need-echo-1; 1 1 if ( p = = S) send finalecho(loca1top); else send probe-echo[firstJ (ECHO, g , localtog); 1
process Initiator { graph to~ology; # network topology send probe-echo[source](PROBE, source, 0 ) ; receive finalecho(topo1ogy); 1
Figure 9.1 2
Probelecho algorithm for computing the topology of a graph.
9.5 Broadcast Algorithms
451
pollitlg 10decide which k.incl of message to I-eceive. Alrecnalivcly, we cot~lduse reudczvous, with probes and echoes being separate operations.) The correclncss of the algurich~nresults from the following facts. Sjllce the ~xeiwo~.kis cotinec~ed,every node eventually receives a probe. Deadlock is avoided since e\lc1-)/ probe i s echoed-the lirst one just beforc a Node process rerminales, the others while they are wailing lo receive echoes in 1.espo11selo d~ejr own probes. (This avoids leaving nlessages bul'l'ercd on Lhe probe-echo channels.) The lasl echo sen1 by a node conlains its local neiglibor set. Hence, the union of lhe neighbor sets evenlually reaches Node [s],which sends the topology lo Initiator. As with die algoritlin~i n Figure 9.10, (he links along which firs[ probes al'e sent fortn a dynamica.lly coinpuced spannin~tree. The ne~work topology is echoed back LIPlhis spanning tree; the echo iiom each node contains d ~ topology e of the subtree roored at that node.
9.5 Broadcast A lgorithms In tlie pi.evious seclion, we showed how lo broadcast information in a ~lelwork that has the strucrut.e of i\ graph. In most local area networks, processors share a common communication channel such as an Ethernet or 1o.ken ring. In Lhis case, each process01 is directly connected to every other one. In fact, such comm~uijcation networks often support a special network prirnirive called broadcast, which transmits a nlessage frorn olie processor to all othel-s. Whether suppol-ted by communicatiun ha]-dwae or not, message br.oadcas1 provides a useful programming technique. Let T [n] be an 'way of processes, and let ch [n] be an an-ay of channels, one per process. Then a process T [i] broadcasts a message m by executing broadcast ch(m);
Execution of broadcast places one copy ol m on each channel ch[i], incli~ding that oi T [i]. The effect is thus Ihe same as executing co [i = 1 to nl send c h [ i ](m) ;
PI-ocessesreceive both broattcasc and point-lo-point messages using tlie receive primitive. c in the Messages broadcast by the same process arc queued on ~ h channels ortlel- in which lhey are bl-oadcnst. However. broadcast is no1 alornic. In particular, messages b~.oadcastby two procchaes A anti B migh~be received by other
452
Chapter 9
Paradigms for Process Interaction
processes in different orders. (See the Historical Notes for papers that discuss how to impJe~neni atomic broadcast.) We can use broadcast to disse-minate informat ion-for example, to exchange processor state information in local area networks. We can also use it to solve many distributed synchroniznrio~~ problems. This section develops a broadcast algorithm that provides a distributed implemeotalion of semaphores. The basis for distributed scmapl~ores-and many oLher decen tralizcd synchronizalion plar-ocols-is a total ordering of co~n~nunicalion events. We hus begin by showing how to inlplernent logical clocks and how to use them to order events.
9.5.1 Logical Clocks and Event Ordering The actions of processes in a dislributed program can be divided into local aclions, such as reading and writing variables, and communication aclions, such as sending and receiving messages. Local actions have no direct effect on oll~er processes. Howe\ler, communication aclions affect the execution of other processes since they transmit information and are synchronized. Communication ;~crionsare thus the significant events in a distributed program. We i~sethe term eve~?.r below to refer to execution of send, broadcast,and receive slalements. If two p ~ ~ c e s s eAs and B are executing local actions; we have no way of knowjng the relative order in which rhe actions are execuled. However, if A sends (01-bl.oadcasts) a message to B. then Lhe send action i n A must happen before the col-respoildine receive ac~ionin B. [f B subsequently sends a inessage to process c, then the send action in B must happen before the receive action i n C. Moreover. since the receive xcliorl in B happens before the send aclioo i n B. there cvents: the se~ldby A is n total orde~jngbetween the four co~nn~unication happens beforc the receive by B, which happens before the send by B, which pens before the receive by c. H o p ~ ~ ~ nbefore . . s is thus a transitive relation between causally related events. Although there is a total ordering between causally related events, there is only a partial ordering between the entire collection of events in ;I dishibuted program. This is because unrelated sequences OF events-ror example, cornmunications betweel, differenr sets of processes-mighr occur before, after, or concurrently with each other. If there were a sj.ngle cencral clock, we could totally order cornrr~unication events by giving each a unique limestamp. In partici~lar.when a process sends a message, il could read the clock and append the clock value to the message. When a process receives a message, it could read the clock and record rhe time at
Assuming the granularity of the clock is such that it "ticks" between any send and the corresponding receive, an event that happens before another will thus have an earlier timestamp. Moreover, if processes have unique identities, then we could induce a total ordering, for example by using the smallest process identity to break ties if unrelated events in two processes happen to have the same timestamp.

Unfortunately, it is unrealistic to assume the existence of a single, central clock. In a local area network, for example, each processor has its own clock. If these were perfectly synchronized, then we could use the local clocks for timestamps. However, physical clocks are never perfectly synchronized. Clock synchronization algorithms exist for keeping clocks fairly close to each other (see the Historical Notes), but perfect synchronization is impossible. Thus we need a way to simulate physical clocks.

A logical clock is a simple integer counter that is incremented when events occur. We assume that each process has a logical clock that is initialized to zero. We also assume that every message contains a special field called a timestamp. The logical clocks are incremented according to the following rules.
Logical Clock Update Rules. Let A be a process and let lc be a logical clock in the process. A updates the value of lc as follows:

(1) When A sends or broadcasts a message, it sets the timestamp of the message to the current value of lc and then increments lc by 1.

(2) When A receives a message with timestamp ts, it sets lc to the maximum of lc and ts+1 and then increments lc by 1.
Since A increases lc after every event, every message sent by A will have a different, increasing timestamp. Since a receive event sets lc to be larger than the timestamp in the received message, the timestamp in any message subsequently sent by A will have a larger timestamp.

Using logical clocks, we can associate a clock value with each event as follows. For a send event, the clock value is the timestamp in the message, i.e., the local value of lc at the start of the send. For a receive event, the clock value is the value of lc after it is set to the maximum of lc and ts+1 but before it is incremented by the receiving process. The above rules for updating logical clocks ensure that if event a happens before event b, then the clock value associated with a will be smaller than that associated with b. This induces a partial ordering on the set of causally related events in a program. If each process has a unique identity, then we can get a total ordering between all events by using the smaller process identity as a tiebreaker in case two events happen to have the same timestamp.
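The update rules are easy to express in code. The following Go sketch is our own illustration (the type and method names are not from the text); it implements both rules for a single process and shows that a happens-before pair of events gets increasing clock values.

package main

import "fmt"

// LogicalClock implements the Logical Clock Update Rules for one process.
type LogicalClock struct{ lc int }

// Send returns the timestamp to attach to an outgoing message or
// broadcast, then increments the clock (rule 1).
func (c *LogicalClock) Send() int {
    ts := c.lc
    c.lc++
    return ts
}

// Receive folds in the timestamp ts of an incoming message (rule 2):
// lc becomes max(lc, ts+1) and is then incremented.
func (c *LogicalClock) Receive(ts int) {
    if ts+1 > c.lc {
        c.lc = ts + 1
    }
    c.lc++
}

func main() {
    var a, b LogicalClock
    tsA := a.Send()        // A sends with timestamp 0; A's clock becomes 1
    b.Receive(tsA)         // B's clock becomes max(0, 0+1) + 1 = 2
    tsB := b.Send()        // B sends with timestamp 2; B's clock becomes 3
    fmt.Println(tsA < tsB) // true: the send by A precedes the send by B
}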
9.5.2 Distributed Semaphores

Semaphores are normally implemented using shared variables. However, we could implement them in a message-based program using a server process (active monitor), as shown in Section 7.3. We can also implement them in a decentralized way without using a central coordinator. Here, we show how.

A semaphore s is usually represented by a nonnegative integer. Executing P(s) waits until s is positive, then decrements the value; executing V(s) increments the value. In other words, at all times the number of completed P operations is at most the number of completed V operations plus the initial value of s. Thus, to implement semaphores, we need a way to count P and V operations and a way to delay P operations. Moreover, the processes that "share" a semaphore need to cooperate so that they maintain the semaphore invariant s >= 0 even though the program state is distributed.

We can meet these requirements by having processes broadcast messages when they want to execute P and V operations and by having them examine the messages they receive to determine when to proceed. In particular, each process has a local message queue mq and a logical clock lc, which is updated according to the Logical Clock Update Rules. To simulate execution of a P or V operation, a process broadcasts a message to all the user processes, including itself. The message contains the sender's identity, a tag (POP or VOP) indicating which kind of operation, and a timestamp. The timestamp in every copy of the message is the current value of lc.

When a process receives a POP or VOP message, it stores the message in its message queue mq. This queue is kept sorted in increasing order of the timestamps in the messages; sender identities are used to break ties. Assume for the moment that every process receives all messages that have been broadcast in the same order and in increasing order of timestamps. Then every process would know exactly the order in which POP and VOP messages were sent, and each could count the number of corresponding P and V operations and maintain the semaphore invariant.

Unfortunately, broadcast is not an atomic operation. Messages broadcast by two different processes might be received by others in different orders. Moreover, a message with a smaller timestamp might be received after a message with a larger timestamp. However, different messages broadcast by one process will be received by the other processes in the order they were broadcast by the first process, and these messages will also have increasing timestamps. These properties follow from the facts that (1) execution of broadcast is the same as concurrent execution of send, which we assume provides ordered, reliable delivery, and (2) a process increases its logical clock after every communication event.
The fact that consecutive messages sent by every process have increasing timestamps gives us a way to make synchronization decisions. Suppose a process's message queue mq contains a message m with timestamp ts. Then, once the process has received a message with a larger timestamp from every other process, it is assured that it will never see a message with a smaller timestamp. At this point, message m is said to be fully acknowledged. Moreover, once m is fully acknowledged, then all other messages in front of it in mq will also be fully acknowledged since they all have smaller timestamps. Thus the part of mq containing fully acknowledged messages is a stable prefix: no new messages will ever be inserted into it.

Whenever a process receives a POP or VOP message, we will have it broadcast an acknowledgement (ACK) message. These are broadcast so that every process sees them. The ACK messages have timestamps as usual, but they are not stored in the message queues. They are used simply to determine when a regular message in mq has become fully acknowledged. (If we did not use ACK messages, a process could not determine that a message was fully acknowledged until it received a later POP or VOP message from every other process; this would slow the algorithm down and would lead to deadlock if some user did not want to execute P or V operations.)

To complete the implementation of distributed semaphores, each process uses a local variable s to represent the value of the semaphore. When a process gets an ACK message, it updates the stable prefix of its message queue mq. For every VOP message, the process increments s and deletes the VOP message. It then examines the POP messages in timestamp order. If s > 0, the process decrements s and deletes the POP message. In short, each process maintains the following predicate, which is its loop invariant:

   DSEM:  s >= 0  ∧  mq is ordered by timestamps in messages
The POP messages are processed in the order in which they appear in the stable prefix so that every process makes the same decision about the order in which P operations complete. Even though the processes might be at different stages in handling POP and VOP messages, each one will handle fully acknowledged messages in the same order.

The algorithm for distributed semaphores appears in Figure 9.13. The user processes are regular application processes. There is one helper process for each user, and the helpers interact with each other in order to implement the P and V operations. A user process initiates a P or V operation by communicating with its helper; in the case of a P operation, the user waits until its helper says it can proceed. Each helper broadcasts POP, VOP, and ACK messages to the other helpers and manages its local message queue as described above.
type kind = enum(reqP, reqV, VOP, POP, ACK);
chan semop[n](int sender; kind k; int timestamp);
chan go[n](int timestamp);

process User[i = 0 to n-1] {
  int lc = 0, ts;
  ...
  # ask my helper to do V(s)
  send semop[i](i, reqV, lc);  lc = lc+1;
  ...
  # ask my helper to do P(s), then wait for permission
  send semop[i](i, reqP, lc);  lc = lc+1;
  receive go[i](ts);  lc = max(lc, ts+1);  lc = lc+1;
  ...
}

process Helper[i = 0 to n-1] {
  queue mq = new queue(int, kind, int);   # message queue
  int lc = 0, s = 0;            # logical clock and semaphore
  int sender, ts;  kind k;      # values in received messages
  while (true) {                # loop invariant DSEM
    receive semop[i](sender, k, ts);
    lc = max(lc, ts+1);  lc = lc+1;
    if (k == reqP) {
      broadcast semop(i, POP, lc);  lc = lc+1;
    } else if (k == reqV) {
      broadcast semop(i, VOP, lc);  lc = lc+1;
    } else if (k == POP or k == VOP) {
      insert (sender, k, ts) at appropriate place in mq;
      broadcast semop(i, ACK, lc);  lc = lc+1;
    } else {    # k == ACK
      record that another ACK has been seen;
      for (all fully acknowledged VOP messages in mq) {
        remove the message from mq;  s = s+1;
      }
      for (all fully acknowledged POP messages in mq st s > 0) {
        remove the message from mq;  s = s-1;
        if (sender == i)    # my user's P request
          { send go[i](lc);  lc = lc+1; }
      }
    }
  }
}

Figure 9.13   Distributed semaphores using a broadcast algorithm.
All messages to helpers are sent or broadcast to the semop array of channels. As shown, every process maintains a logical clock, which it uses to place timestamps on messages.

We can use distributed semaphores to synchronize processes in a distributed program in essentially the same way we used regular semaphores in shared-variable programs (Chapter 4). For example, we can use them to solve mutual exclusion problems, such as locking files or database records. We can also use the same basic approach of broadcast messages and ordered queues to solve additional problems; the Historical Notes and the Exercises describe several applications.

When broadcast algorithms are used to make synchronization decisions, every process must participate in every decision. In particular, a process must hear from every other in order to determine when a message is fully acknowledged. This means that broadcast algorithms do not scale well to interactions among large numbers of processes. It also means that such algorithms must be modified to cope with failures.
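To make the stable-prefix test concrete, here is a small Go sketch of one way a helper could decide that a queued message is fully acknowledged. It is our own illustration rather than part of Figure 9.13: the helper simply remembers the largest timestamp it has received from every other helper (POP, VOP, or ACK), and a queued message is stable once all of those timestamps exceed its own.

package main

import "fmt"

// Msg is a POP or VOP message held in the queue mq.
type Msg struct {
    Sender    int
    Timestamp int
}

// fullyAcked reports whether message m is fully acknowledged at helper
// self: every other helper has already sent something with a larger
// timestamp, so no message with a smaller timestamp can still arrive.
func fullyAcked(m Msg, self int, maxSeen []int) bool {
    for id, ts := range maxSeen {
        if id == self {
            continue // we only wait on the other helpers
        }
        if ts <= m.Timestamp {
            return false
        }
    }
    return true
}

func main() {
    // Three helpers; helper 0 queued a POP with timestamp 4.
    maxSeen := []int{9, 6, 3} // largest timestamp received from each helper
    m := Msg{Sender: 0, Timestamp: 4}
    fmt.Println(fullyAcked(m, 0, maxSeen)) // false: helper 2 is only at 3
    maxSeen[2] = 5                         // an ACK from helper 2 arrives
    fmt.Println(fullyAcked(m, 0, maxSeen)) // true: the message is now stable
}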
9.6 Token-Passing Algorithms

This section describes token passing, yet another process interaction paradigm. A token is a special kind of message that can be used either to convey permission or to gather global state information. We illustrate the use of a token to convey permission by a simple, distributed solution to the critical section problem. Then we illustrate gathering state information by developing two algorithms for detecting when a distributed computation has terminated. The next section presents an additional example (see also the Historical Notes and the Exercises).
9.6.1 Distributed Mutual Exclusion
Although the critical section problem arises primarily in shared-variable programs, it also arises in distributed programs whenever there is a shared resource that at most one process at a time can use, for example a communication link to a satellite. Moreover, the critical section problem is often a component of a larger problem, such as ensuring consistency in a distributed file or database system.

One way to solve the critical section problem is to employ an active monitor that grants permission to access the critical section. For many problems, such as implementing locks on files, this is the simplest and most efficient approach. A second way to solve the problem is to use distributed semaphores, implemented as shown in the previous section.
Figure 9.14   A token ring of helper processes.
That approach yields a decentralized solution in which no one process has a special role, but it requires exchanging a large number of messages for each semaphore operation since each broadcast has to be acknowledged.

Here we solve the problem in a third way by using a token ring. The solution is decentralized and fair, as is a solution using distributed semaphores, but it requires the exchange of far fewer messages. Moreover, the basic approach can be generalized to solve other synchronization problems.

Let User[1:n] be a collection of application processes that contain critical and noncritical sections. As usual, we need to develop entry and exit protocols that these processes execute before and after their critical section. In addition, the protocols should ensure mutual exclusion, avoid deadlock and unnecessary delay, and ensure eventual entry (fairness).

Since the user processes have other work to do, we do not want them also to have to circulate the token. Thus we will employ a collection of additional processes, Helper[1:n], one per user process. These helper processes form a ring, as shown in Figure 9.14. One token circulates between the helpers, being passed from Helper[1] to Helper[2] and so on to Helper[n], which passes it back to Helper[1]. When Helper[i] receives the token, it checks to see whether its client User[i] wants to enter its critical section. If not, Helper[i] passes the token on. Otherwise, Helper[i] tells User[i] it may enter its critical section, then waits until User[i] exits; at this point Helper[i] passes the token on. Thus the helper processes cooperate to ensure that the following predicate is always true:

   DMUTEX:  User[i] is in its CS ⇒ Helper[i] has the token
            ∧  there is exactly one token
chan token[1:n](), enter[1:n](), go[1:n](), exit[1:n]();

process Helper[i = 1 to n] {
  while (true) {                  # loop invariant DMUTEX
    receive token[i]();           # wait for token
    if (not empty(enter[i])) {    # does user want in?
      receive enter[i]();         # accept enter msg
      send go[i]();               # give permission
      receive exit[i]();          # wait for exit
    }
    send token[i%n + 1]();        # pass token on
  }
}

process User[i = 1 to n] {
  while (true) {
    send enter[i]();              # entry protocol
    receive go[i]();
    critical section;
    send exit[i]();               # exit protocol
    noncritical section;
  }
}

Figure 9.15   Mutual exclusion with a token ring.
message. Tlie solucio~iin Figure 9.1 5 is fail--assuming as usual that processes event ~ ~ r d lexit y critical sections. Th.is is because the token corltinuously c.irculates, and when nelper[i] has it, user [i] is permitted to enter if i t wanrs lo do so. As progl-unmed, the token moves contiri~~ously between the helpers. This is in fact what happens in a physical tokewing network. In a software token ring, however, it is probably b e s ~to add some delay i n each helper so that the token moves more slowly arouncl the ring. (See Section 9.7 for anollier token-based exclusion algorilhm i n which tokens do 1101 circulate conlinuously.) exit
460
Chapter 9
Paradigms for Process Interaction
This algolithm assumes that failures do not occur and lhal Lhe token is not lost. Since conlrol is distribured, however. ii is possible to modify the algorithm to cope with failures. The Historical Notes describe algoritlms for regenerating a lost token and for using two tokens that circulate in opposite directions.
9.6.2 Termination Detection in a Ring

It is simple to detect when a sequential program has terminated. It is also simple to detect when a concurrent program has terminated on a single processor: every process is blocked or terminated, and no I/O operations are pending. However, it is not at all easy to detect when a distributed program has terminated. This is because the global state is not visible to any one processor. Moreover, even when all processors are idle, there may be messages in transit between processors.

There are several ways to detect when a distributed computation has terminated. This section develops a token-passing algorithm, assuming that all communication between processes goes around a ring. The next section generalizes the algorithm for a complete communication graph. Additional approaches are described in the Historical Notes and the Exercises.

Let T[1:n] be the processes (tasks) in some distributed computation, and let ch[1:n] be an array of communication channels. For now, assume that the processes form a ring and that all communication goes around the ring. In particular, each process T[i] receives messages only from its own channel ch[i] and sends messages only to the next channel ch[i%n + 1]. Thus T[1] sends messages only to T[2], T[2] sends only to T[3], and so on, with T[n] sending messages to T[1]. As usual we also assume that messages from every process are received by its neighbor in the ring in the order in which they were sent.

At any point in time, each process is active or idle. Initially, every process is active. It is idle if it has terminated or is delayed at a receive statement. (If a process is temporarily delayed while waiting for an I/O operation to terminate, we consider it still to be active since it has not terminated and will eventually be awakened.) After receiving a message, an idle process becomes active. Thus a distributed computation has terminated if the following two conditions hold:
   DTERM:  every process is idle  ∧  no messages are in transit
A message is in transit if it has been sent but not yet delivered to the destination channel. The second condition is necessary because when the message is delivered, it could awaken a delayed process.

Our task is to superimpose a termination-detection algorithm on an arbitrary distributed computation, subject only to the above assumption that the processes in the computation communicate in a ring.
Clearly termination is a property of the global state, which is the union of the states of the individual processes plus the contents of the message channels. Thus the processes have to communicate with each other in order to determine if the computation has terminated.

To detect termination, let there be one token, which is passed around in special messages that are not part of the computation proper. The process that holds the token passes it on when it becomes idle. (If a process has terminated its computation, it is idle but continues to participate in the termination-detection algorithm.) Processes pass the token using the same ring of communication channels that they use in the computation itself. When a process receives the token, it knows that the sender was idle at the time it sent the token. Moreover, when a process receives the token, it has to be idle since it is delayed receiving from its channel and will not become active again until it receives a regular message that is part of the distributed computation. Thus, upon receiving the token, a process sends the token to its neighbor, then waits to receive another message from its channel.

The question now is how to detect that the entire computation has terminated. When the token has made a complete circuit of the communication ring, we know that every process was idle at some point. But how can the holder of the token determine if all other processes are still idle and that there are no messages in transit?

Suppose one process, T[1] say, initially holds the token. When T[1] becomes idle, it initiates the termination-detection algorithm by passing the token to T[2]. After the token gets back to T[1], the computation has terminated if T[1] has been continuously idle since it first passed the token to T[2]. This is because the token goes around the same ring that regular messages do, and messages are delivered in the order in which they are sent. Thus, when the token gets back to T[1] there cannot be any regular messages either queued or in transit. In essence, the token has "flushed" the channels clean, pushing all regular messages ahead of it.

We can make the algorithm and its correctness more precise as follows. First, associate a color with every process: blue (cold) for idle and red (hot) for active. Initially all processes are active, so they are colored red. When a process receives the token, it is idle, so it colors itself blue, passes the token on, and waits to receive another message. If the process later receives a regular message, it colors itself red. Thus a process that is blue became idle, passed the token on, and has remained idle since passing the token.

Second, associate a value with the token indicating how many channels are empty if T[1] is still idle. Let token be this value.
Global invariant RING:  T[1] is blue ⇒
   ( T[1] ... T[token+1] are blue  ∧
     ch[2] ... ch[token%n + 1] are empty )

actions of T[1] when it first becomes idle:
   color[1] = blue;  token = 0;  send ch[2](token);

actions of T[2], ..., T[n] upon receiving a regular message:
   color[i] = red;

actions of T[2], ..., T[n] upon receiving the token:
   color[i] = blue;  token++;  send ch[i%n + 1](token);

actions of T[1] upon receiving the token:
   if (color[1] == blue)
     announce termination and halt;
   color[1] = blue;  token = 0;  send ch[2](token);

Figure 9.16   Termination detection in a ring.
When T[1] becomes idle, it colors itself blue, sets token to 0, and then sends the token to T[2]. When T[2] receives the token, it is idle and ch[2] might be empty. Hence, T[2] colors itself blue, increments token to 1, and sends the token to T[3]. Each process T[i] in turn colors itself blue and increments token before passing it on.

These token-passing rules are listed in Figure 9.16. As indicated, the rules ensure that predicate RING is a global invariant. The invariance of RING follows from the fact that if T[1] is blue, it has not sent any regular messages since sending the token, and hence there are no regular messages in any channel up to where the token resides. Moreover, all these processes have remained idle since they saw the token. Thus if T[1] is still blue when the token gets back to it, all processes are blue and all channels are empty. Hence T[1] can announce that the computation has terminated.
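The decision logic of Figure 9.16 can also be written down directly. The Go sketch below is our own: it encodes only what a process does when the token arrives, and the main function simply walks the token around an already idle ring, so no regular messages or channels are simulated.

package main

import "fmt"

const (
    red  = iota // active, or has received a regular message since passing the token
    blue        // idle since passing the token on
)

// onToken applies the rules of Figure 9.16 when process i (1..n) receives
// the token carrying value v. It returns the value to pass on and whether
// T[1] may announce termination.
func onToken(i, v int, color []int) (int, bool) {
    if i == 1 {
        if color[1] == blue {
            return v, true // T[1] stayed idle for a whole circuit
        }
        color[1] = blue
        return 0, false // restart the count
    }
    color[i] = blue
    return v + 1, false
}

func main() {
    const n = 4
    color := make([]int, n+1) // entries 1..n, all red initially
    color[1] = blue           // T[1] has become idle and starts the detection
    v, i := 0, 1              // token value 0, about to leave T[1]
    for {
        i = i%n + 1 // the token moves to the next process in the ring
        var done bool
        if v, done = onToken(i, v, color); done {
            fmt.Println("termination detected; token value =", v)
            return
        }
    }
}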
9.6.3 Termination Detection in a Graph

In the previous section, we assumed all communication goes around a ring. In general, the communication structure of a distributed computation will form an arbitrary directed graph. The nodes of the graph are the processes in the computation; the edges represent communication paths. There is an edge from one process to another if the first process sends to a channel from which the second receives.
Here we assume that the communication graph is complete, namely that there is one edge from every process to every other. As before, there are n processes T[1:n] and channels ch[1:n], and each process T[i] receives from its private input channel ch[i]. However, now any process can send messages to ch[i].
processor to every other. Il can readily be extended to arbitrary cc~mrnunicarion graphs and multiple channels (see the Exercises). Detecling Lernunation in a con~pletegraph is more clifficull than in a ring because incssages can arrive over any edge. For example, co~~sider the complete graph of tl~reeprocesses shown in Fig~1r.e0.17. Suppose the processes pass the token only J'roim T [I1 lo T [ 2 ] lo T [31 and back to ~ [ l l Suppose . T [I] holds rhe token and becomes idle: hence it passes the token 1.0 T 12 I . Whcn T 12 I becornes idle, i t in turn passes die toke11 to T [ 3 l . Bul betore T [31 ~.cceivcsthe token, it could send :I regular messapc lu T t 2 1 . Thus, when the Loken gets baclc to T I1I . it cannor conclude that the cu~nputationhas lerillinated even if it Ilas re~nainedco~~tjnuously idle. The key to tlic ring algorilhm i n Figurc 9.16 is [hat all cornmunicatio~igoes around the ring, and hence the token flushes out regular tnessages. ln particular, Ihe token traverses every edge OF Lhe ring. We can cxrend i h a ~algorirh~nto a complete graph by ensuring Illat the token 11-aversesevery edge of h e graph, which rrleans that i t visits every process rnultiple times. If svr?p process has remained continuously idle sii~ccir li~.stsaw the token, then we can concludc that the computation has terminated. As before, cach PI-occssis colored red or blue, with all processes inirially red. Whcrl a process receives n regular message, it colors irself red. Whcn a proccss rcceives thc token, it i s blocked waiting to reccive the ncxt message on its
Figure 9.17
A complete communicatjon graph.
464
Chapter 9
Paradigms for Process Interaction
inpul channel. Hence the process colors itself blue-if it is not alraady blueand passes tlie token on. (Again: i f a process termi~latesi t s regular co~nputation, i t contirlues to hand 1.e Lol<en messages.) Any conlplele directed graph contains a cycle [hat includes cvcly edge (some ~iodesmay need to be i~lcludedmore dian once). Let c be a cycle in tlie communication graph. :rnd lei nc be its length. Each process keeps Lrack of the 01-dcrin whjcli its oulgoing edgcs occur in c. Upon I-eceivh~g Lhe token along one edge i n c, a process sencls it oirl over che nexr edge in c. This ensures that the token traverses every edge in the communication graph. Also as becore, the token carries a value indicating the ~iu~nber o f times i n a row rhe token has been passed by idle processes and hence the number of cllantlels that might be empty. As the above exa~nplcillustt-ates, however, in a complete graph n process chat was idle rnjght become active again, even if T [ 11 remains idle. Thus, we need a different set of Loken-passing rules and a different global invariant in order to be able to conclude that the cornp~itationhas termi-
terminated. The token starts at any process and initially has value 0. When that process becomes idle for the first time, it colors itself blue and then passes the token along the first edge in cycle c. Upon receiving the token, a process takes the actions shown in Figure 9.18.
Global invariant GRAPH:  token has value V ⇒
   ( the last V channels in cycle c were empty  ∧
     the last V processes to receive the token were blue )

actions of T[i] upon receiving a regular message:
   color[i] = red;

actions of T[i] upon receiving the token:
   if (token == nc)
     announce termination and halt;
   if (color[i] == red)
     { color[i] = blue;  token = 0; }
   else
     token++;
   set j to index of channel for next edge in cycle c;
   send ch[j](token);

Figure 9.18   Termination detection in a complete graph.
If the process is red when it receives the token, and hence was active since last seeing it, the process colors itself blue and sets the value of token to 0 before passing it along the next edge in c. This effectively reinitiates the termination-detection algorithm. However, if the process is blue when it receives the token, and hence has been continuously idle since last seeing the token, the process increments the value of token before passing it on.

The token-passing rules ensure that predicate GRAPH is a global invariant. Once the value of token gets to nc, the length of cycle c, then the computation is known to have terminated. In particular, at that point the last nc channels the token has traversed were empty. Since a process only passes the token when it is idle, and since it only increases token if it has remained idle since last seeing the token, all channels are empty and all processes are idle.

In fact, the computation had actually terminated by the time the token started its last circuit around the graph. However, no process could possibly know this until the token has made another complete cycle around the graph to verify that all processes are still idle and that all channels are empty. Thus the token has to circulate a minimum of two times around the cycle after any activity in the computation proper: first to turn processes blue, and then again to verify that they have remained blue.
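To put rough numbers on this (our own illustration, not from the text): a complete graph on n processes has n*(n-1) directed edges, so any cycle c that covers every edge has length nc >= n*(n-1). For n = 5 that means nc >= 20, and after the last regular message the token must be passed at least 2*nc = 40 times before termination can be announced. The ring algorithm of the previous section needs only on the order of n token passes under the same conditions; the extra cost here is the price of allowing arbitrary point-to-point communication.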
9.7 Replicated Servers

The final process-interaction paradigm we describe is replicated servers. A server, as usual, is a process that manages some resource. A server might be replicated when there are multiple distinct instances of a resource; each server would then manage one of the instances. Replication can also be used to give clients the illusion that there is a single resource when in fact there are many. We saw an example of this earlier in Section 8.4, where we showed how to implement replicated files.

This section illustrates both uses of replicated servers by developing two additional solutions to the dining philosophers problem. As usual there are five philosophers and five forks, and each philosopher requires two forks in order to eat. This problem can be solved in three ways in a distributed program. Let PH be a philosopher process and let W be a waiter process. One approach is to have a single waiter process that manages all five forks, the centralized structure shown in Figure 9.19 (a). The second approach is to distribute the forks, with one waiter managing each fork, the distributed structure shown in Figure 9.19 (b). The third approach is to have one waiter per philosopher, the decentralized structure shown in Figure 9.19 (c). We presented a centralized solution earlier in Figure 8.6. Here, we develop distributed and decentralized solutions.
Figure 9.19   Solution structures for the dining philosophers: (a) centralized, (b) distributed, (c) decentralized.
9.7.1 Distributed Dining Philosophers

The centralized dining philosophers solution shown in Figure 8.6 is deadlock-free, but it is not fair. Moreover, the single waiter process could be a bottleneck because all philosophers need to interact with it. A distributed solution can be deadlock-free, fair, and not have a bottleneck, but at the expense of a more complicated client interface and more messages.

Figure 9.20 contains a distributed solution programmed using the multiple primitives notation of Section 8.3. (This leads to the shortest program; however, the program can readily be changed to use just message passing or just rendezvous.) There are five waiter processes; each manages one fork. In particular, each waiter repeatedly waits for a philosopher to get the fork and then to release it. Each philosopher interacts with two waiters to obtain the forks it needs. However, to avoid deadlock, the philosophers cannot all execute the identical program. Instead, the first four philosophers get their left fork then the right, whereas the last philosopher gets the right fork then the left. The solution is thus very similar to the one using semaphores in Figure 4.7.
module Waiter[5]
  op getforks(), relforks();
body
  process the_waiter {
    while (true) {
      receive getforks();
      receive relforks();
    }
  }
end Waiter

process Philosopher[i = 0 to 4] {
  int first = i, second = i+1;
  if (i == 4)
    { first = 0;  second = 4; }
  while (true) {
    call Waiter[first].getforks();
    call Waiter[second].getforks();
    eat;
    send Waiter[first].relforks();
    send Waiter[second].relforks();
    think;
  }
}

Figure 9.20   Distributed dining philosophers.
The distributed solution in Figure 9.20 is fair because the forks are requested one at a time and invocations of getforks are serviced in the order they are called. Thus each call of getforks is serviced eventually, assuming philosophers eventually release forks they have acquired.
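The asymmetric acquisition order is easy to mimic with plain locks standing in for the waiter processes. The Go sketch below is our own version, not the notation of Figure 9.20: mutexes play the role of the waiters, and the last philosopher takes the forks in the opposite order, which is exactly what prevents the circular wait.

package main

import (
    "fmt"
    "sync"
)

const n = 5

func main() {
    var fork [n]sync.Mutex // fork[i] stands in for Waiter[i]
    var wg sync.WaitGroup

    for i := 0; i < n; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            first, second := i, (i+1)%n
            if i == n-1 {
                first, second = second, first // last philosopher reverses the order
            }
            for meal := 0; meal < 3; meal++ {
                fork[first].Lock()  // get one fork ...
                fork[second].Lock() // ... then the other
                fmt.Printf("philosopher %d eats\n", i)
                fork[second].Unlock()
                fork[first].Unlock()
                // think
            }
        }(i)
    }
    wg.Wait()
}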
9.7.2 Decentralized Dining Philosophers

We now develop a decentralized solution that has one waiter per philosopher. The process interaction pattern is similar to that in the replicated file servers in Figures 8.14 and 8.15. The algorithm employed by the waiter processes is another example of token passing, with the tokens being the five forks. Our solution can be adapted to coordinate access to replicated files or to yield an efficient solution to the distributed mutual exclusion problem (see the Exercises).
module Waiter[t = 0 to 4]
  op getforks(), relforks();                # for philosophers
  op needL(), needR(), passL(), passR();    # for waiters
  op forks(bool,bool,bool,bool);            # for initialization
body
  op hungry(), eat();                 # local operations
  bool haveL, dirtyL, haveR, dirtyR;  # status of forks
  int left = (t-1) % 5;               # left neighbor
  int right = (t+1) % 5;              # right neighbor

  proc getforks() {
    send hungry();     # tell waiter philosopher is hungry
    receive eat();     # wait for permission to eat
  }

  process the_waiter {
    receive forks(haveL, dirtyL, haveR, dirtyR);
    while (true) {
      in hungry() ->
           # ask for forks I don't have
           if (!haveR) send Waiter[right].needL();
           if (!haveL) send Waiter[left].needR();
           # wait until I have both forks
           while (!haveL or !haveR)
             in passR() -> haveR = true;  dirtyR = false;
             [] passL() -> haveL = true;  dirtyL = false;
             [] needR() st dirtyR ->
                  haveR = false;  dirtyR = false;
                  send Waiter[right].passL();
                  send Waiter[right].needL();
             [] needL() st dirtyL ->
                  haveL = false;  dirtyL = false;
                  send Waiter[left].passR();
                  send Waiter[left].needR();
             ni
           # let philosopher eat, then wait for release
           send eat();  dirtyL = true;  dirtyR = true;
           receive relforks();
      [] needR() ->
           # neighbor needs my right fork (its left)
           haveR = false;  dirtyR = false;
           send Waiter[right].passL();
      [] needL() ->
           # neighbor needs my left fork (its right)
           haveL = false;  dirtyL = false;
           send Waiter[left].passR();
      ni
    }
  }
end Waiter

process Philosopher[i = 0 to 4] {
  while (true) {
    call Waiter[i].getforks();
    eat;
    call Waiter[i].relforks();
    think;
  }
}

process Main {   # initialize the forks held by waiters
  send Waiter[0].forks(true, true, true, true);
  send Waiter[1].forks(false, false, true, true);
  send Waiter[2].forks(false, false, true, true);
  send Waiter[3].forks(false, false, true, true);
  send Waiter[4].forks(false, false, false, false);
}

Figure 9.21   Decentralized dining philosophers.
Each fork is a token that is held by one of two waiters or is in transit between them. When a philosopher wants to eat, he asks his waiter to acquire two forks. If the waiter does not currently have both forks, the waiter interacts with neighboring waiters to get them. The waiter then retains control of the forks while the philosopher eats.

The key to a correct solution is to manage the forks in such a way that deadlock is avoided. Ideally, the solution should also be fair. For this problem, deadlock could result if a waiter needs two forks and cannot get them. A waiter certainly has to hold on to both forks while his philosopher is eating.
But when the philosopher is not eating, a waiter should be willing to give up his forks. However, we need to avoid passing a fork back and forth from one waiter to another without its being used.
The basic idea for avoiding deadlock is to have a waiter give up a fork that has been used, but to hold onto one that it just acquired. Specifically, when a philosopher starts eating, his waiter marks both forks as "dirty." When another waiter wants a fork, if it is dirty and not currently being used, the first waiter cleans the fork and gives it up. The second waiter holds onto the clean fork until it has been used. However, a dirty fork can be reused until it is needed by the other waiter.

This decentralized algorithm is given in Figure 9.21. (It is colloquially called the "hygienic philosophers" algorithm because of the way forks are cleaned.) The solution is programmed using the multiple primitives notation described in Section 8.3, because it is convenient to be able to use all of remote procedure call, rendezvous, and message passing.

When a philosopher wants to eat, he calls the getforks operation exported by his Waiter module. The getforks operation is implemented by a procedure to hide the fact that getting forks requires sending a hungry message and receiving an eat message. When a waiter process receives a hungry message, it checks the status of the two forks. If it has both, it lets the philosopher eat, then waits for the philosopher to release the forks.

If the waiter process does not have both forks, it has to acquire those it needs. The needL, needR, passL, and passR operations are used for this. In particular, when a philosopher is hungry and his waiter needs a fork, that waiter sends a need message to the waiter who has the fork. The other waiter accepts the need message when the fork is dirty and not being used, and then passes the fork to the first waiter. The needL and needR operations are invoked by asynchronous send rather than synchronous call, because deadlock can result if two waiters call each other's operations at the same time.

Four variables in each waiter are used to record the status of the forks: haveL, haveR, dirtyL, and dirtyR. These variables are initialized by having the Main process invoke the forks operation in the Waiter modules. Initially, waiter zero holds two dirty forks, waiters one to three each hold one dirty fork, and waiter four holds no fork. In order to avoid deadlock, it is imperative that the forks be distributed asymmetrically and that they all be dirty. For example, if every waiter initially has one fork and all philosophers want to eat, each waiter could give up the fork he has and then hold on to the one he gets. If any fork is initially clean, then the waiter that holds it will not give it up until after his philosopher has eaten; if the philosopher terminates or never wants to eat, another philosopher could wait forever to get the fork.

The program avoids starvation by having waiters give up forks that are dirty. In particular, if one waiter wants a fork that another holds, he will eventually get it.
If the fork is dirty and not in use, the second waiter immediately passes it to the first waiter. If the fork is dirty and in use, eventually the other philosopher will quit eating, and hence the other waiter will pass the fork to the first waiter. If the fork is clean, it is because the other philosopher is hungry, the other waiter just got the fork, or the other waiter is waiting to get a second fork. By similar reasoning, the other waiter will eventually get the second fork, because there is no state in which every waiter holds one clean fork and wants a second. (This is another reason why asymmetric initialization is imperative.)
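The fork-handling rule at the heart of the algorithm is small enough to isolate. The following Go fragment is our own restatement of just that rule (the names are ours, and the immediate re-request of a needed fork in Figure 9.21 is omitted):

package main

import "fmt"

// forkState is what a waiter records about one of its two forks.
type forkState struct {
    have, dirty bool
}

// surrender reports whether a waiter should hand this fork to the neighbor
// who asked for it: only a dirty fork that is not currently in use is given
// up; a clean fork is kept until the local philosopher has eaten with it.
func surrender(f forkState, philosopherEating bool) bool {
    return f.have && f.dirty && !philosopherEating
}

func main() {
    f := forkState{have: true, dirty: true}
    fmt.Println(surrender(f, false)) // true: dirty and idle, so pass it on
    f.dirty = false                  // the fork was just received clean
    fmt.Println(surrender(f, false)) // false: keep it until it has been used
}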
Historical Notes
All the paradigms described in this chapter were developed between the mid-1970s and the mid-1980s. During that decade, there was a plethora of activity refining, analyzing, and applying the various paradigms. Now the focus is more on using them. Indeed, many problems can be solved in more than one way, hence using more than one paradigm. We describe some of these below; additional examples are given in the Exercises and in Chapter 11.

The manager/worker paradigm was introduced by Gentleman [1981], who called it the administrator/worker paradigm. Carriero and Gelernter have called the same idea a distributed bag of tasks. They present solutions to several problems in Carriero et al. [1986] and Carriero and Gelernter [1989]; the solutions are programmed using their Linda primitives (Section 7.7). Finkel and Manber [1987] use a distributed bag of tasks to implement backtracking algorithms. The manager/worker paradigm is now commonly used in parallel computations, where the technique is sometimes called a work pool, processor farm, or work farm. Whatever the term, the concept is the same: several workers dynamically dividing up a collection of tasks.

Heartbeat algorithms are routinely used in distributed parallel computations, especially for grid computations (see Section 11.1). The author of this book created the term heartbeat algorithm in the late 1980s because that phrase seems to characterize the actions of each process: pump (send), contract (receive), prepare for the next cycle (compute), then repeat. Per Brinch Hansen [1995] calls it the cellular automata paradigm, although that term seems better to describe a kind of application than a programming style. In any event, people do not usually use any name to refer to the now canonical send/receive/compute programming style, or they just say that processes exchange information.

In Section 9.2 we presented heartbeat algorithms for the region-label problem from image processing and for the Game of Life, which was invented by mathematician John Conway in the 1960s. Image processing, cellular automata, and the related topic of genetic algorithms are covered in some detail by
Wilkinson and Allen [1999]. Brinch Hansen [1995] also describes cellular automata. Fox et al. [1985] cover numerous applications and algorithms, many of which are programmed using the heartbeat style.

The concept of a software pipeline goes back at least as far as the introduction of Unix in the early 1970s. The concept of a hardware pipeline goes back even further, to the early vector processors in the 1960s. However, the use of pipelines as a general parallel computing paradigm is more recent. Pipeline and ring (closed pipeline) algorithms have now been developed to solve many problems. Examples can be found in most books on parallel computing. A few good sources are Brinch Hansen [1995], Fox et al. [1985], Quinn [1994], and Wilkinson and Allen [1999]. Hardware pipelines are covered in detail in Hwang [1993]. The behavior of both software and hardware pipelines is similar, so their performance can be analyzed in similar ways.

The concept of a probe/echo paradigm was invented simultaneously by several people. The most comprehensive early work is offered by Chang [1982], who presents algorithms for several graph problems, including sorting, computing biconnected components, and knot (deadlock) detection. (Chang called them echo algorithms; we use the term probe/echo to indicate that there are two distinct phases.) Dijkstra and Scholten [1980] and Francez [1980] use the same paradigm, without giving it a name, to detect termination of a distributed program.

We used the network topology problem to illustrate the use of a probe/echo algorithm. The problem was first described by Lamport [1982] in a paper that showed how to derive a distributed algorithm in a systematic way by refining an algorithm that uses shared variables. That paper also showed how to deal with a dynamic network in which processors and links might fail and then recover. McCurley and Schneider [1986] systematically derive a heartbeat algorithm for solving the same problem.

Several people have investigated the problem of providing reliable or fault-tolerant broadcast, which is concerned with ensuring that every functioning and reachable processor receives the message being broadcast and that all agree upon the same value. For example, Schneider et al. [1984] present an algorithm for fault-tolerant broadcast in a tree, assuming that a failed processor stops executing and that failures are detectable. Lamport et al. [1982] show how to cope with failures that can result in arbitrary behavior, so-called Byzantine failures.

Logical clocks were developed by Lamport [1978] in a now classic paper on how to order events in distributed systems. (Marzullo and Owicki [1983] describe the problem of synchronizing physical clocks.) Schneider [1982] developed an implementation of distributed semaphores similar to the one in Figure 9.13; his paper also shows how to modify this kind of algorithm to deal with
failures. The same basic approach of broadcast messages and ordered queues can also be used to solve additional problems. For example, Lamport [1978] presents an algorithm for distributed mutual exclusion, and Schneider [1982] presents a distributed implementation of the guarded input/output commands of CSP described in Section 7.6. The algorithms in these papers do not assume that messages are totally ordered. However, for many problems it helps if broadcast is atomic, namely that every process sees messages that have been broadcast in exactly the same order. Two key papers on the use and implementation of reliable and atomic communication primitives are Birman and Joseph [1987] and Birman et al. [1991].

Section 9.6 showed how to use token passing to implement distributed mutual exclusion and termination detection, and Section 8.4 showed how to use tokens to synchronize access to replicated files. Chandy and Misra [1984] use token passing to achieve fair conflict resolution, and Chandy and Lamport [1985] use tokens to determine global states in distributed computations.

The token-passing solution to the distributed mutual exclusion problem given in Figure 9.15 was developed by LeLann [1977]. That paper also shows how to bypass some node on the ring if it should fail and how to regenerate the token if it should become lost. LeLann's method requires knowing maximum communication delays and process identities. Misra [1983] later developed an algorithm that overcomes these limitations by using two tokens that circulate around the ring in opposite directions.

The distributed mutual exclusion problem can also be solved using a broadcast algorithm. One way is to use distributed semaphores. However, this requires exchanging a large number of messages since every message has to be acknowledged by every process. More efficient broadcast-based algorithms have also been developed.

References

... termination detection. Journal of Parallel and Distributed Computing 8: 245-93.
Chandy, K. M., L. M. Haas, and J. Misra. 1983. Distributed deadlock detection. ACM Trans. on Computer Systems 1, 2 (May): 144-56.
Chandy, K. M., and L. Lamport. 1985. Distributed snapshots: Determining global states of distributed systems. ACM Trans. on Computer Systems 3, 1 (February): 63-75.

Chandy, K. M., and J. Misra. 1984. The drinking philosophers problem. ACM Trans. on Prog. Languages and Systems 6, 4 (October): 632-46.

Chang, E. J.-H. 1982. Echo algorithms: depth parallel operations on general graphs. IEEE Trans. on Software Engr. 8, 4 (July): 391-401.
Dijkstra, E. W., W. H. J. Feijen, and A. J. M. van Gasteren. 1983. Derivation of a termination detection algorithm for distributed computations. Information Processing Letters 16, 5 (June): 217-19.

Dijkstra, E. W., and C. S. Scholten. 1980. Termination detection for diffusing computations. Information Processing Letters 11, 1 (August): 1-4.

Finkel, R., and U. Manber. 1987. DIB-a distributed implementation of backtracking. ACM Trans. on Prog. Languages and Systems 9, 2 (April): 235-56.

Fox, G. C., M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. ...

Lamport, L., R. Shostak, and M. Pease. 1982. The Byzantine generals problem. ACM Trans. on Prog. Languages and Systems 4, 3 (July): 382-401.

LeLann, G. 1977. Distributed systems: Towards a formal approach. Proc. Information Processing 77, North-Holland, Amsterdam, pp. 155-60.
Maekawa, M., A. E. Oldehoeft, and R. R. Oldehoeft. 1987. Operating Systems: Advanced Concepts. Menlo Park, CA: Benjamin/Cummings.

Marzullo, K., and S. S. Owicki. 1983. Maintaining the time in a distributed system. Proc. Second ACM Symp. on Principles of Distr. Computing, August, pp. 295-305.

McCurley, E. K., and F. B. Schneider. 1986. Derivation of a distributed algorithm for finding paths in directed networks. Science of Computer Prog. 6, 1 (January): 1-9.
Misra, J. 1983. Detecting termination of distributed computations using markers. Proc. Second ACM Symp. on Principles of Distr. Computing, August, pp. 290-94.

Morgan, C. 1985. Global and logical time in distributed algorithms. Information Processing Letters 20, 4 (May): 189-94.

Mullender, S., ed. 1993. Distributed Systems, 2nd ed. Reading, MA: ACM Press and Addison-Wesley.
Quinn, M. J. 1994. Parallel Computing: Theory and Practice. New York: McGraw-Hill.

Rana, S. P. 1983. A distributed solution of the distributed termination problem. Information Processing Letters 17, 1 (July): 43-46.
Raynal, M. 1986. Algorithms for Mutual Exclusion. Cambridge, MA: MIT Press.

Raynal, M. 1988. Distributed Algorithms and Protocols. New York: Wiley.
Schneider, F. B. 1980. Ensuring consistency in a distributed database system by use of distributed semaphores. Proc. of Int. Symp. on Distributed Databases, March, pp. 183-89.

Schneider, F. B. 1982. Synchronization in distributed programs. ACM Trans. on Prog. Languages and Systems 4, 2 (April): 125-48.
Schneider, F. B., D. Gries, and R. D. Schlichting. 1984. Fault-tolerant broadcasts. Science of Computer Prog. 4: 1-15.

Schneider, F. B., and L. Lamport. 1985. Paradigms for distributed programs. In Distributed Systems: Methods and Tools for Specification, An Advanced Course. Lecture Notes in Computer Science, vol. 190. Berlin: Springer-Verlag, pp. 431-80.

Wilkinson, B., and M. Allen. 1999. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Englewood Cliffs, NJ: Prentice-Hall.
Exercises

9.1 Section 9.1 described one way to represent sparse matrices and showed how to multiply two sparse matrices using a distributed bag of tasks.
(a) Suppose matrix a is represented by rows as described. Develop code to compute the transpose of a.

(b) Implement and experiment with the program in Figure 9.1. First use input matrices that you have created by hand and for which you know the result. Then write a small program to generate large matrices, perhaps using a random number generator. Measure the performance of the program for the large matrices. Write a brief report explaining the tests you ran and the results you observe.
9.2 Section 3.6 presented a bag-of-tasks program for multiplying dense matrices.
(a) Construct a distributed version of that program. In particular, change it to use the manager/worker paradigm.

(b) Implement your answer to (a), and implement the program in Figure 9.1. Compare the performance of the two programs. In particular, generate some large, sparse matrices and multiply them together using both programs. (The programs will, of course, represent the matrices differently.) Write a brief report explaining your tests and comparing the time and space requirements of the two programs.
9.3 The adaptive quadrature program in Figure 9.2 uses a fixed number of tasks. (It divides the interval from a to b into a fixed number of subintervals.) The algorithm in Figure 3.21 is fully adaptive: it starts with just one task and generates as many as are required.
(a) Modify the program in Figure 9.2 to use the fully adaptive approach. This means that workers will never calculate an area using a recursive procedure. You will have to figure out how to detect termination!

(b) Modify your answer to (a) to use a threshold T that defines the maximum size of the bag of tasks. After this many tasks have been generated, when a worker gets a task from the bag, the manager should tell the worker to solve the task recursively and hence not to generate any more tasks.
(c) Implement the program in Figure 9.2 and your answers to (a) and (b), then compare the performance of the three programs. Conduct a set of experiments, then write a brief report describing the experiments you ran and the results you observed.

9.4 Quicksort is a recursive sorting method that partitions an array into smaller pieces and then combines them. Develop a program to implement quicksort using the manager/worker paradigm. Array a[1:n] of integers is local to the manager process. Use w worker processes to do the sorting. When your program terminates, the result should be stored in the administrator's array a. Do not use any shared variables. Explain your solution and justify your design choices.
9.5 The eight-queens problem is concerned with placing eight queens on a chessboard in such a way that none can attack another. Develop a program to generate all 92 solutions to the eight-queens problem using the manager/worker paradigm. Have an administrator process put eight initial queen placements in a shared bag. Use w worker processes to extend partial solutions; when a worker finds a complete solution, it should send it to the administrator. The program should compute all solutions and terminate. Do not use any shared variables.
9.6 Consider the problem of determining the number of words in a dictionary that contain unique letters, namely the number of words in which no letter appears more than once. Treat upper and lower case versions of a letter as the same letter. (Most Unix systems contain one or more online dictionaries, for example in /usr/dict/words.)
Write a distributed parallel program to solve this problem. Use the manager/workers paradigm and w worker processes. At the end of the program, the manager should print the number of words that contain unique letters, and also print all those that are the longest. If your workers have access to shared memory, you may have them share a copy of the dictionary file, but if they execute on separate machines, they should each have their own copy.
9.7 The Traveling Salesman Problem (TSP). This is a classic combinatorial problem, and one that is also practical, because it is the basis for things like scheduling planes and personnel at an airline company. Given are n cities and a symmetric matrix dist[1:n,1:n]. The value in dist[i,j] is the distance from city i to city j, e.g., the airline miles. A salesman starts in city 1 and wishes to visit every city exactly once, ending back in city 1. The problem is to determine a path that minimizes the distance the salesman must travel. The result is to be stored in a vector bestpath[1:n]. The value of bestpath is to be a permutation of integers 1 to n such that the sum of the distances between adjacent pairs of cities, plus the distance back to city 1, is minimized.

(a) Develop a distributed parallel program to solve this problem using the manager/workers paradigm. Make a reasonable choice for what constitutes a task; there should not be too many, nor too few. You should also discard tasks that cannot possibly lead to a better result than you currently have computed.
(b) An exact TSP solution has to consider every possible path, but there are n! of them. Consequently, people have developed a number of heuristics. One is called the nearest neighbor algorithm. Starting with city 1, first visit the city, say c, nearest to city 1. Now extend the partial tour by visiting the city nearest to c. Continue in this fashion until all cities have been visited, then return to city 1. Write a program to implement this algorithm. (A sequential sketch of this heuristic appears after this exercise.)
(c) Another TSP heuristic is called the nearest insertion algorithm. First find the pair of cities that are closest to each other. Next find the unvisited city nearest to either of these two cities and insert it between them. Continue to find the unvisited city with minimum distance to some city in the partial tour, and insert that city between a pair of cities already in the tour so that the insertion causes the minimum increase in the total length of the partial tour. Write a program to implement this algorithm.
(d) A third traveling salesman heuristic is to partition the plane of cities into strips, each of which contains some bounded number, B, of cities. Worker processes in parallel find minimal cost tours from one end of the strip to the other.
In odd-numbered strips the tours should go from the top to the bottom; in even-numbered strips they should go from the bottom to the top. Once tours have been found for all strips, they are connected together.
(e) Compare the performance and accuracy of your programs for parts (a) through (d). What are their execution times? How good or bad is the approximate solution that is generated? How much larger a problem can you solve using the approximate algorithms than the exact algorithm? Experiment with several tours of various sizes. Write a report explaining the tests you conducted and the results you observed.
(f) There are several additional heuristic algorithms and local optimization techniques for solving the traveling salesman problem. For example, there are techniques called cutting planes and simulated annealing. Start by finding good references on the TSP. Then pick one or more of the better algorithms, write a program to implement it, and conduct a series of experiments to see how well it performs (both in terms of execution time and how good a solution it generates).
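For part (b), the nearest neighbor heuristic is easy to prototype sequentially before parallelizing anything. A minimal sketch in C, assuming 0-based indexing (the tour starts at city 0) and a distance matrix passed as a flat row-major array; the function name and parameters are illustrative assumptions.

    #include <limits.h>

    /* Build a nearest-neighbor tour of n cities starting at city 0.
       dist is an n*n matrix in row-major order; tour[0..n-1] receives the result.
       Returns the total tour length, including the edge back to city 0. */
    int nearest_neighbor(int n, const int dist[], int tour[]) {
        int visited[n];                     /* C99 variable-length array */
        for (int i = 0; i < n; i++) visited[i] = 0;

        int current = 0, length = 0;
        tour[0] = 0;
        visited[0] = 1;

        for (int step = 1; step < n; step++) {
            int best = -1, bestDist = INT_MAX;
            for (int j = 0; j < n; j++) {
                if (!visited[j] && dist[current*n + j] < bestDist) {
                    best = j;
                    bestDist = dist[current*n + j];
                }
            }
            tour[step] = best;
            visited[best] = 1;
            length += bestDist;
            current = best;
        }
        return length + dist[current*n + 0];   /* close the tour */
    }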
9.8 Figure 9.3 contains a program for the region-labeling problem.
(a) Implement the program using the MPI communication library.
(b) Modify your answer to (a) to get rid of the separate coordinator process.
Hint: Use MPI's global communication primitives. (An example of one such primitive appears after this exercise.)
(c) Compare the performance of the two programs. Develop a set of sample images, then determine the performance of each program for those images and different numbers of workers. Write a report that explains and analyzes your results.
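The hint in part (b) refers to MPI's collective (global) communication operations, which can take the place of an explicit coordinator. As a minimal, self-contained illustration of one such primitive (not the region-labeling code itself), the program below combines one integer from every process so that every rank ends up with the total; the variable names are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int localCount = rank + 1;     /* stand-in for a per-worker result */
        int globalCount;
        /* combine the per-worker values; every rank receives the sum */
        MPI_Allreduce(&localCount, &globalCount, 1, MPI_INT,
                      MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d sees total %d\n", rank, globalCount);
        MPI_Finalize();
        return 0;
    }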
9.9 Section 9.2 describes the region-labeling problem. Consider a different image-processing problem called smoothing. Assume images are represented by a matrix of pixels as in Section 9.2. The goal now is to smooth an image by removing "spikes" and "rough edges." In particular, start with the input image, then modify the image by unlighting (setting to 0) all lit pixels that do not have at least d lit neighbors. Each pixel in the original image should be considered independently, so you will need to place the new image in a new matrix. Now use the new image and repeat this algorithm, unlighting all (new) pixels that do not have at least d neighbors. Keep going until there are no changes between the current and new image. (This means that you cannot know in advance how many times to repeat the smoothing algorithm.)
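One sequential smoothing pass over the whole image might look like the sketch below in C; the parallel program applies the same rule to each worker's strip after exchanging boundary rows with neighboring workers. The 8-neighbor definition, the fixed image size, and the treatment of the border as unlit padding are assumptions here, since the exercise defers those conventions to Section 9.2.

    #define N 512   /* assumed image size; plays the role of n */

    /* One smoothing pass: a pixel stays lit in nxt only if it is lit in cur and
       has at least d lit pixels among its 8 neighbors.  The border ring of nxt
       is assumed to be pre-set to 0.  Returns 1 if any pixel changed. */
    int smooth_step(int cur[N][N], int nxt[N][N], int d) {
        int changed = 0;
        for (int i = 1; i < N-1; i++) {
            for (int j = 1; j < N-1; j++) {
                int lit = 0;
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        if (di != 0 || dj != 0)
                            lit += cur[i+di][j+dj];
                nxt[i][j] = (cur[i][j] && lit >= d) ? 1 : 0;
                if (nxt[i][j] != cur[i][j])
                    changed = 1;
            }
        }
        return changed;
    }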
(a) Write a parallel program for solving this problem; use the heartbeat paradigm. Your program should use w worker processes. Divide the image into
w equal-sized strips and assign one worker process to each strip. Assume n is a multiple of w.
(b) Test your program for different images, different numbers of workers, and different values of d. Write a brief report explaining your tests and results.
9.10 Figure 9.4 contains a program for the Game of Life. Modify the program to use w worker processes, and have each worker manage either a strip or block of cells. Implement your program using a distributed programming language or a sequential language and a subroutine library such as MPI. Display the output on a graphical display device. Experiment with your program, then write a brief report explaining your experiments and the results you observe. Does the game ever converge?
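For reference, the per-cell update in the Game of Life depends only on a cell's state and its number of live neighbors. A minimal sketch of that rule in C, assuming the standard Conway rules (the program in Figure 9.4 defines the version actually to be used):

    /* Standard Conway rule: a live cell survives with 2 or 3 live neighbors;
       a dead cell becomes live with exactly 3. */
    int next_state(int alive, int liveNeighbors) {
        if (alive)
            return (liveNeighbors == 2 || liveNeighbors == 3);
        else
            return (liveNeighbors == 3);
    }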
9.11 The Game of Life has very simple organisms and rules. Devise a more complicated game that can be modeled using cellular automata, but do not make it too complicated! For example, model interactions between sharks and fish, between rabbits and foxes, or between coyotes and roadrunners. Or model something like the burning of trees and brush in a forest fire. Pick a problem to model and devise an initial set of rules; you will probably want to use randomness to make interactions probabilistic rather than fixed. Then implement your model as a cellular automaton, experiment with it, and modify the rules so that the outcome is not trivial. For example, you might strive for population balance. Write a report describing your game and the results you observe.
9.12 Section 9.4 describes a probe/echo algorithm for computing the topology of a network. This problem can also be solved using a heartbeat algorithm. Assume as in Section 9.4 that a process can communicate only with its neighbors, and that initially each process knows only about those neighbors. Design a heartbeat algorithm that has each process repeatedly exchange information with its neighbors. When the program terminates, every process should know the topology of the entire network. You will need to figure out what the processes should exchange and how they can tell when to terminate.
9.13 Suppose n^2 processes are arranged in a square grid. Each process can communicate only with the neighbors to the left and right, and above and below. (Processes on the corners have only two neighbors; others on the edges of the grid have three neighbors.) Every process has a local integer value v. Write a heartbeat algorithm to compute the sum of the n^2 values. When your program terminates, each process should know the sum.
9.14 Section 9.3 describes two algorithms for distributed matrix multiplication.
(a) Modify the algorithms so that each uses w worker processes, where w is much smaller than n. (Pick values of w and n that make the arithmetic easy.)
(b) Compare the performance, on paper, of your answers to (a). For given values of w and n, what is the total number of messages required by each program? Some of the messages can be in transit at the same time; for each program, what is the best case for the longest chain of messages that cannot be overlapped? What are the sizes of the messages in each program? What are the local storage requirements of each program?
(c) Implement your answers to (a) using a programming language or a subroutine library such as MPI. Compare the performance of the two programs for different size matrices and different numbers of worker processes. (An easy way to tell whether your programs are correct is to set both source matrices to all ones; then every value in the result matrix will be n.) Write a report describing your experiments and the results you observe.
9.15 Figure 9.7 contains an algorithm for multiplying matrices by blocks.
(a) Show the layout of the values of a and b after the initial rearrangement for n = 6 and for n = 8.
(b) Modify the program to use w worker processes, where w is an even power of two and a factor of n. For example, if n is 1024, then w can be 4, 16, 64, or 256.
Each worker is responsible for one block of the matrices. Give all the details for the code.
(c) Implement your answer to (b), for example using C and the MPI library. Conduct experiments to measure its performance for different values of n and w. Write a report describing your experiments and results. (You might want to initialize a and b to all ones, because then the final value of every element of c will be n.)
9.16 We can use a pipeline to sort n values as follows. (This is not a very efficient sorting algorithm, but what the heck, this is an exercise.) Given are w worker processes and a coordinator process. Assume that n is a multiple of w and that each worker process can store at most n/w + 1 values at a time. The processes form a closed pipeline. Initially, the coordinator has n unsorted values. It sends these one at a time to worker 1. Worker 1 keeps some of the values and sends others on to worker 2. Worker 2 keeps some of the values and sends others on to worker 3, and so on. Eventually, the workers send values back to the coordinator. (They can do this directly.)
(a) Develop code for the coordinator and the workers so that the coordinator gets back sorted values. Each process can insert a value into a list or remove a value from a list, but it may not use an internal sorting routine.
(b) How many messages are used by your algorithm? Give your answer as a function of n and w. Be sure to show how you arrived at the answer.
9.17 Given are n processes, each corresponding to a node in a connected graph. Each node can communicate only with its neighbors. A spanning tree of a graph is a tree that includes every node of the graph and a subset of the edges. Write a program to construct a spanning tree on the fly. Do not first compute the topology. Instead, construct the tree from the ground up by having the processes interact with their neighbors to decide which edges to put in the tree and which to leave out. You may assume processes have unique indexes.
9.18 Extend the probe/echo algorithm for computing the topology of a network (Figure 9.12) to handle a dynamic topology. In particular, communication links might fail during the computation and later recover. (Failure of a processor can be modeled by failure of all its links.) Assume that when a link fails, it silently throws away undelivered messages.
Define any additional primitives you need to detect when a failure or recovery has occurred, and explain briefly how you would implement them. You might also want to modify the receive primitive so that it returns an error code if the channel has failed. Your algorithm should terminate, assuming that eventually failures and recoveries quit happening for a long enough interval that every node can agree on the topology.
9.19 Given are n processes. Assume that the broadcast primitive sends a message from one process to all n processes and that broadcast is both reliable and totally ordered. That is, every process sees all messages that are broadcast and sees them in the same order.
(a) Using this broadcast primitive (and receive, of course), develop a fair solution to the distributed mutual exclusion problem. In particular, devise entry and exit protocols that each process executes before and after a critical section. Do not use additional helper processes; the n processes should communicate directly with each other.
(b) Discuss how one might implement atomic broadcast. What are the problems that have to be solved? How would you solve them?
9.20 Consider the following three processes, which communicate using asynchronous message passing:
chan chA(...), chB(...), chC(...);

process A {
    send chC(...); send chB(...);
    receive chA(...); send chC(...);
    receive chA(...);
}

process B {
    send chC(...); receive chB(...);
    receive chB(...); send chA(...);
}

process C {
    receive chC(...); receive chC(...);
    send chA(...); send chB(...);
    receive chC(...);
}
Assume that each process has a logical clock that is initially zero, and that it uses the logical clock update rules in Section 9.5 to add timestamps to messages and to update its clock when it receives messages.
What are the final values of the logical clocks in each process? Show your work; the answer may not be unique.
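The answer depends on the exact update rules in Section 9.5. One common formulation of Lamport-style logical clocks is sketched below in C, only as a working aid for tracing the executions by hand; check Section 9.5 for the variant actually used in the text, since the final clock values depend on it.

    /* A logical clock, one per process; a sketch of one common rule set. */
    typedef struct { int lc; } LogicalClock;

    /* Called when sending: the message carries the returned timestamp. */
    int on_send(LogicalClock *c) {
        int ts = c->lc;        /* timestamp attached to the outgoing message */
        c->lc = c->lc + 1;     /* advance the clock past the send event */
        return ts;
    }

    /* Called when a message with timestamp ts is received. */
    void on_receive(LogicalClock *c, int ts) {
        if (ts + 1 > c->lc)    /* lc = max(lc, ts + 1) */
            c->lc = ts + 1;
        c->lc = c->lc + 1;     /* advance past the receive event */
    }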
9.21 Consider the implementation of distributed semaphores in Figure 9.13.
(a) Assume there are four user processes. User one initiates a V operation at time 0 on its logical clock. Users two and three both initiate P operations at time 1 on their logical clocks. User four initiates a V operation at time 10 on its logical clock. Develop a trace of all the communication events that occur in the algorithm and show the clock values and timestamps associated with each. Also trace the contents of the message queues in each helper.
(b) In the algorithm, both user processes and helper processes maintain logical clocks, but only the helper processes interact with each other and have to make decisions based on a total ordering. Suppose the users did not have logical clocks and did not add timestamps to messages to their helpers. Would the algorithm still be correct? If so, clearly explain why. If not, develop an example that shows what can go wrong.
(c) Suppose the code were changed so that users broadcast to all helpers when they wanted to do a P or V operation. For example, when a user wants to do a V, it would execute

    broadcast semop(i, VOP, lc);
and then update its logical clock. (A user would still wait for permission from its helper after broadcasting a POP message.) This would simplify the program by getting rid of the first two arms of the if/then/else statement in the helpers. Unfortunately, it leads to an incorrect program. Develop an example that shows what can go wrong.
9.22 The solution to the distributed mutual exclusion problem in Figure 9.15 uses a token ring and a single token that circulates continuously. Assume instead that every Helper process can communicate with every other; i.e., the communication graph is complete. Design a solution that does not use circulating tokens. In particular, the Helper processes should be idle except when some process User[i] is trying to enter or exit its critical section. Your solution should be fair and deadlock-free. Each Helper should execute the same algorithm, and the regular processes User[i] should execute the same code as in Figure 9.15. (Hint: Associate a token with every pair of processes, namely, with every edge of the communication graph.)
9.23 Figure 9.16 presents a set of rules for termination detection in a ring. Variable token counts the number of idle processes. Is the value really needed, or do the colors suffice to detect termination? In other words, what exactly is the role of token?
9.24 Drinking Philosophers Problem [Chandy and Misra 1984]. Consider the following generalization of the dining philosophers problem. An undirected graph G is given. Philosophers are associated with nodes of the graph and can communicate only with neighbors. A bottle is associated with each edge of G. Each philosopher cycles between three states: tranquil, thirsty, and drinking. A tranquil philosopher may become thirsty. Before drinking, the philosopher must acquire the bottle associated with every edge connected to the philosopher's node. After drinking, a philosopher again becomes tranquil.
Design a solution to this problem that is fair and deadlock-free. Every philosopher should execute the same algorithm. Use tokens to represent the bottles. It is permissible for a tranquil philosopher to respond to requests from neighbors for any bottles the philosopher may hold.
9.25 You have been given a collection of processes that communicate using asynchronous message passing. Develop an algorithm to determine when a diffusing computation has terminated. Use a probe/echo algorithm as well as ideas from Section 9.6. (Hint: Keep counts of messages and signals.)
9.26 The distributed termination detection rules in Figure 9.18 assume that the graph is complete and that each process receives from exactly one channel. Solve the following exercises as separate problems.
(a) Extend the token-passing rules to handle an arbitrary connected graph.
(b) Extend the token-passing rules to handle the situation in which a process receives from multiple channels (one at a time, of course).
9.27 Consider the following variation on the rules in Figure 9.18 for termination detection in a complete graph. A process takes the same actions upon receiving a regular message. But a process now takes the following actions when it receives the token:

    if (color[i] == blue)
        token = blue;
    else
        token = red;
    color[i] = blue;
    set j to index of channel for next edge in cycle c;
    send ch[j](token);
In other words, the value of token is no longer modified or examined. Using this new rule, is there a way to detect termination? If so, explain when the computation is known to have terminated. If not, explain why the rules are insufficient.
9.28 Figure 8.15 shows how to implement replicated files using one lock per copy. In that solution, a process has to get every lock when it wants to update the file. Suppose instead that we use n tokens and that a token does not move until it has to. Initially, every file server has one token.
(a) Modify the code in Figure 8.15 to use tokens instead of locks. Implement read operations by reading the local copy and write operations by updating all copies. When a client opens a file for reading, its server needs to acquire one token; for writing, the server needs all n tokens. Your solution should be fair and deadlock-free.
(b) Modify your answer to (a) to use weighted voting with tokens. In particular, when a client opens a file for reading, its server has to acquire any readWeight tokens; for writing, the server needs writeWeight tokens. Assume that readWeight and writeWeight satisfy the conditions at the end of Section 8.4. Show in your code how you maintain timestamps on files and how you determine which copy is current when you open a file for reading.
(c) Compare your answers to (a) and (b). For the different client operations, how many messages do the servers have to exchange in the two solutions? Consider the best case and the worst case (and define what they are).
Implementations
This chapter describes ways to implement the various language mechanisms described in Chapters 7 and 8: asynchronous and synchronous message passing, RPC, and rendezvous. We first show how to implement asynchronous message passing using a kernel. We then use asynchronous messages to implement synchronous message passing and guarded communication. Next we show how to implement RPC using a kernel, rendezvous using asynchronous message passing, and finally rendezvous (and multiple primitives) in a kernel. The implementation of synchronous message passing is more complex than that of asynchronous message passing because both send and receive statements are blocking. Similarly, the implementation of rendezvous is more complex than that of RPC or asynchronous message passing because rendezvous has both two-way communication and two-way synchronization.
The starting point for the various implementations is the shared-memory kernel of Chapter 6. Thus, even though programs that use message passing, RPC, or rendezvous are usually written for distributed-memory machines, they can readily execute on shared-memory machines. It so happens that the same kind of relationship is true for shared-variable programs. Using what is called a distributed shared memory, it is possible to execute shared-variable programs on distributed-memory machines, even though they are usually written to execute on shared-memory machines. The last section of this chapter describes how to implement a distributed shared memory.
10.1 Asynchronous Message Passing
This section presents two implementations of asynchronous message passing. The first adds channels and message-passing primitives to the shared-memory kernel of Chapter 6. This implementation is suitable for a single processor or a shared-memory multiprocessor. The second implementation extends the shared-memory kernel to a distributed kernel that is suitable for a multicomputer or a networked collection of separate machines.
10.1.1 Shared-Memory Kernel
Each channel in a program is represented in a kernel by means of a channel descriptor. This contains the heads of a message list and a blocked list. The message list contains queued messages; the blocked list contains processes waiting to receive messages. At least one of these lists will always be empty. This is because a process is not blocked if there is an available message, and a message is not queued if there is a blocked process.
A descriptor is created by means of the kernel primitive createChan. This is called once for each chan declaration in a program, before any processes are created. An array of channels is created either by calling createChan once for each element or by parameterizing createChan with the array size and calling it just once. The createChan primitive returns the name (index or address) of the descriptor.
The send statement is implemented using the sendChan primitive. First, the sending process evaluates the expressions and collects the values together into a single message, stored typically on the sending process's execution stack. Then sendChan is called; its arguments are the channel name (returned by createChan) and the message itself. The sendChan primitive first finds the descriptor of the channel. If there is at least one process on the blocked list, the oldest process is removed from that list and the message is copied into the process's address space; that process's descriptor is then inserted on the ready list. If there is no blocked process, the message has to be saved on the descriptor's message list. This is necessary because send is nonblocking, and hence the sender has to be allowed to continue executing.
Space for the saved message can be allocated dynamically from a single buffer pool, or there can be a communication buffer associated with each channel. However, asynchronous message passing raises an important implementation issue: what if there is no more kernel space? The kernel has two choices: halt the program due to buffer overflow or block the sender until there is enough
buffer space. Halting the program is a drastic step since free space could soon become available, but it gives immediate feedback to the programmer that messages are being produced faster than they are being consumed, which usually indicates an error. On the other hand, blocking the sender violates the nonblocking semantics of send and complicates the kernel somewhat since there is an additional cause of blocking; then again, the writer of a concurrent program cannot assume anything about the rate or order in which a process executes. Operating system kernels block senders, and swap blocked processes out of memory if necessary, since they have to avoid crashing. However, halting the program is a reasonable choice for a high-level programming language.
The receive statement is implemented by the receiveChan primitive. Its arguments are the name of the channel and the address of a message buffer. The actions of receiveChan are the dual of those of sendChan. First the kernel finds the descriptor for the appropriate channel, then it checks the message list. If the message list is not empty, the first message is removed and copied into the receiver's message buffer. If the message list is empty, the receiver is inserted on the blocked list. After receiving a message, the receiver unpacks the message from the buffer into the appropriate variables.
A fourth primitive, emptyChan, is used to implement the function empty(ch). It simply finds the descriptor and checks whether the message list is empty. In fact, if the kernel data structures are not in a protected address space, the executing process could simply check for itself whether the message list is empty; a critical section is not required since the process only needs to examine the head of the message list.
Figure 10.1 contains outlines of these four primitives. These primitives are added to the single-processor kernel in Figure 6.2. The value of executing is the address of the descriptor of the currently executing process, and dispatcher is a procedure that schedules processes on the processor. The actions of sendChan and receiveChan are very similar to the actions of P and V in the semaphore kernel in Figure 6.5. The main difference is that a channel descriptor contains a message list, whereas a semaphore descriptor merely contains the value of the semaphore.
The kernel in Figure 10.1 can be turned into one for a shared-memory multiprocessor using the techniques described in Section 6.2. The main requirements are to store kernel data structures in memory accessible to all processors, and to use locks to protect critical sections of kernel code that access shared data.
int createChan(int msgSize) {
    get an empty channel descriptor and initialize it;
    set return value to the index or address of the descriptor;
    dispatcher();
}

proc sendChan(int chan; byte msg[*]) {
    find descriptor of channel chan;
    if (blocked list empty) {        # save message
        acquire buffer and copy msg into it;
        insert buffer at end of message list;
    } else {                         # give message to a receiver
        remove process from blocked list;
        copy msg into the process's address space;
        insert the process at end of ready list;
    }
    dispatcher();
}

proc receiveChan(int chan; result byte msg[*]) {
    find descriptor of channel chan;
    if (message list empty) {        # block receiver
        insert executing at end of blocked list;
        store address of msg in descriptor of executing;
        executing = 0;
    } else {                         # give receiver a stored message
        remove buffer from message list;
        copy contents of buffer into msg;
    }
    dispatcher();
}

bool emptyChan(int chan) {
    bool r = false;
    find descriptor of channel chan;
    if (message list empty)
        r = true;
    save r as the return value;
    dispatcher();
}
Figure 10.1   Asynchronous message passing in a single-processor kernel.
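As a concrete illustration of the data structure behind Figure 10.1, a channel descriptor might be declared as follows in C. This is only a sketch under simple assumptions (fixed-size message copies, singly linked FIFO queues); the type and field names are illustrative, not the kernel's actual declarations.

    #define MAXMSG 256   /* assumed maximum message size, in bytes */

    typedef struct buffer {
        struct buffer *next;
        int length;
        char data[MAXMSG];            /* copy of one queued message */
    } Buffer;

    typedef struct procdesc ProcDesc; /* process descriptor, defined elsewhere */

    /* At least one of the two lists is always empty: either messages are
       waiting for a receiver, or receivers are waiting for a message. */
    typedef struct {
        Buffer   *msgHead, *msgTail;          /* FIFO list of queued messages  */
        ProcDesc *blockedHead, *blockedTail;  /* FIFO list of blocked receivers */
    } ChanDesc;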
10.1.2 Distributed Kernel
We now show how to extend the shared-memory kernel to support distributed execution. The basic idea is to replicate the kernel, placing one copy on each machine, and to have the different kernels communicate with each other using network communication primitives.
Each channel is stored on a single machine in a distributed program. For now, assume that a channel can have any number of senders but that it has only one receiver. Then the logical place to put a channel's descriptor is on the machine on which the receiver executes. A process executing on that machine accesses the channel as in the shared-memory kernel. However, a process executing on another machine cannot access the channel directly. Instead, the kernels on the two machines need to interact. Below we describe how to change the shared-memory kernel and how to use the network to implement a distributed program.
Figure 10.2 illustrates the structure of a distributed kernel. Each machine's kernel contains descriptors for the channels and processes located on that machine. As before, each kernel has local interrupt handlers for supervisor calls (internal traps), timers, and input/output devices. The communication network is a special kind of input/output device. Thus each kernel has network interrupt handlers and contains routines that write to and read from the network.

Figure 10.2   Distributed kernel structure and interaction.

As a concrete example, an Ethernet is typically accessed as follows. An Ethernet controller has two independent parts, one for writing and one for reading. Each part has an associated interrupt handler in the kernel. A write interrupt is triggered when a write operation completes; the controller itself takes care of
network access arbitration. A read interrupt is triggered on a processor when a message for that processor arrives over the network.
When a kernel primitive, executing on behalf of an application process, needs to send a message to another machine, it calls kernel procedure netWrite. This procedure has three arguments: a destination processor, a message kind (see below), and the message itself. First, netWrite acquires a buffer, formats the message, and stores it in the buffer. Then, if the writing half of the network controller is free, an actual write is initiated; otherwise the buffer is inserted on a queue of write requests. In either case netWrite returns. Later, when a write interrupt occurs, the associated interrupt handler frees the buffer containing the message that was just written; if its write queue is not empty, the interrupt handler then initiates another network write.
Input from the network is typically handled in a reciprocal fashion. When a message arrives at a kernel, the network read interrupt handler is entered. It first saves the state of the executing process. Then it allocates a new buffer for the next network input message. Finally, the read handler unpacks the first field of the input message to determine the kind and then calls the appropriate kernel primitive.[1]
Figure 10.3 contains outlines for the network interface routines. These include the network interrupt handlers and the netWrite procedure. The netRead_handler services three kinds of messages: SEND, CREATE_CHAN, and CHAN_DONE. These are sent by one kernel and serviced by another, as described below. The kernel's dispatcher routine is called at the end of netWrite_handler to resume execution of the interrupted process. However, the dispatcher is not called at the end of netRead_handler since that routine calls a kernel primitive, depending on the kind of input message, and that primitive in turn calls the dispatcher.
For simplicity, we assume that network transmission is error-free and hence that messages do not need to be acknowledged or retransmitted. We also ignore the problem of running out of buffer space for outgoing or incoming messages; in practice, the kernels would employ what is called flow control to limit the number of buffered messages. The Historical Notes section cites literature that describes how to address these issues.
Since a channel can be stored either locally or remotely, a channel name now needs to have two fields: a machine number and an index or offset. The
[1] An alternative approach to handling network input is to employ a daemon process that executes outside the kernel.