SERIES ON SCALABLE COMPUTING - VOL 3
ANNUAL REVIEW OF SCALABLE COMPUTING Editor
Yuen Chung Kwong
SINGAPORE UNIVERSITY PRESS World Scientific
ANNUAL REVIEW OF SCALABLE COMPUTING
SERIES ON SCALABLE COMPUTING
Editor-in-Chief: Yuen Chung Kwong (National University of Singapore)

ANNUAL REVIEW OF SCALABLE COMPUTING
Published:
Vol. 1: ISBN 981-02-4119-4
Vol. 2: ISBN 981-02-4413-4
SERIES ON SCALABLE COMPUTING-VOL 3
ANNUAL REVIEW OF SCALABLE COMPUTING
Editor
Yuen Chung Kwong School of Computing, NUS
SINGAPORE UNIVERSITY PRESS NATIONAL UNIVERSITY OF SINGAPORE
World Scientific
Singapore • New Jersey • London • Hong Kong
Published by
Singapore University Press, Yusof Ishak House, National University of Singapore, 31 Lower Kent Ridge Road, Singapore 119078
and
World Scientific Publishing Co. Pte. Ltd., P O Box 128, Farrer Road, Singapore 912805
USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
ANNUAL REVIEW OF SCALABLE COMPUTING Series on Scalable Computing — Vol. 3 Copyright © 2001 by Singapore University Press and World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-02-4579-3
Printed in Singapore by Fulsland Offset Printing
EDITORIAL BOARD

Editor-in-Chief: Chung-Kwong Yuen, National University of Singapore

Coeditor-in-Chief: Kai Hwang, Dept of Computer Engineering, University of Southern California, University Park Campus, Los Angeles, California 90089, USA

Editors:
Amnon Barak, Dept of Computer Science, Hebrew University of Jerusalem, 91905 Jerusalem, Israel
Willy Zwaenepoel, Dept of Computer Science, Rice University, 6100 Main Street, Houston, TX 77005-1892, USA
Jack Dennis, Laboratory for Computer Science, MIT, Cambridge, MA 02139-4307, USA
Proposals to provide articles for forthcoming issues of Annual Review of Scalable Computing should be sent to: C K Yuen School of Computing National University of Singapore Kent Ridge, Singapore 119260 email:
[email protected]

For the Year 2002 volume, draft versions should be sent by August 2001; final electronic versions (in prescribed LaTeX format) by December 2001.
Preface

As the Annual Review goes into its third volume, I am able to feel increasingly positive that the series has found its own niche in the publishing environment for Scalable Computing. In the busy world of technology, research and development, few authors have the time to write lengthy review articles of the type we are interested in, but there are also rather few venues for such material. Once the Annual Review establishes a presence and an archive of articles, it can have the effect of encouraging authors to seriously consider writing such material, knowing that they do have a suitable outlet. The present volume presents four papers from authors in Europe and the USA, with an additional article of my own to bring the volume to the desired size. In view of the volume's reduced dependence on material from Asian authors and the articles' coverage of less abstract topics, in comparison with the previous two volumes, the Annual Review shows the potential of widening its appeal. With further determined efforts from our authors, editors and publishers, we can make the series a highly valuable resource in the field of Scalable Computing.
Chung-Kwong Yuen National University of Singapore January 2001
Contents

1  Anatomy of a Resource Management System for HPC Clusters
   1.1  Introduction
        1.1.1  Computing Center Software (CCS)
        1.1.2  Target Platforms
        1.1.3  Scope and Organisation of this Chapter
   1.2  CCS Architecture
        1.2.1  User Interface
        1.2.2  User Access Manager
        1.2.3  Scheduling and Partitioning
        1.2.4  Access and Job Control
        1.2.5  Performance Issues
        1.2.6  Fault Tolerance
        1.2.7  Modularity
   1.3  Resource and Service Description
        1.3.1  Graphical Representation
        1.3.2  Textual Representation
        1.3.3  Dynamic Attributes
        1.3.4  Internal Data Representation
        1.3.5  RSD Tools in CCS
   1.4  Site Management
        1.4.1  Center Resource Manager (CRM)
        1.4.2  Center Information Server (CIS)
   1.5  Related Work
   1.6  Summary
   1.7  Bibliography

2  On-line OCM-Based Tool Support for Parallel Applications
   2.1  Introduction
   2.2  OMIS as Basis for Building Tool Environment
   2.3  Adapting the OCM to MPI
        2.3.1  Handling Applications in MPI vs. PVM
        2.3.2  Starting-up MPI Applications
        2.3.3  Flow of Information on Application
        2.3.4  Detection of Library Calls
   2.4  Integrating the Performance Analyzer PATOP with the OCM
        2.4.1  PATOP's Starter
        2.4.2  Prerequisites for Integration of PATOP with the OCM
        2.4.3  Gathering Performance Data with the OCM
        2.4.4  New Extension to the OCM - PAEXT
        2.4.5  Modifications to the ULIBS Library
        2.4.6  Costs and Benefits of using the Performance Analysis Tool
   2.5  Adaptation of PATOP to MPI
        2.5.1  Changes in the Environment Structure
        2.5.2  Extensions to ULIBS
        2.5.3  MPI-Specific Enhancements in PATOP
        2.5.4  Monitoring Overhead Test
   2.6  Interoperability within the OCM-Based Environment
        2.6.1  Interoperability
        2.6.2  Interoperability Support in the OCM
        2.6.3  Interoperability in the OCM-Based Tool Environment
        2.6.4  Possible Benefits of DETOP and PATOP Cooperation
        2.6.5  Direct Interactions
   2.7  A Case Study
   2.8  Concluding Remarks
   2.9  Bibliography

3  Task Scheduling on NOWs using Lottery-Based Work Stealing
   3.1  Introduction
   3.2  The Cilk Programming Model and Work Stealing Scheduler
        3.2.1  Java Programming Language and the Cilk Programming Model
        3.2.2  Lottery Victim Selection Algorithm
   3.3  Architecture and Implementation of the Java Runtime System
        3.3.1  Architecture of the Java Runtime System
        3.3.2  Implementation of the Java Runtime System
   3.4  Performance Evaluation
        3.4.1  Applications
        3.4.2  Results and Discussion
   3.5  Conclusions
   3.6  Bibliography

4  Transaction Management in a Mobile Data Access System
   4.1  Introduction
   4.2  Multidatabase Characteristics
        4.2.1  Taxonomy of Global Information Sharing Systems
        4.2.2  MDBS and Node Autonomy
        4.2.3  Issues in Multidatabase Systems
        4.2.4  MDAS Characteristics
        4.2.5  MDAS Issues
   4.3  Concurrency Control and Recovery
        4.3.1  Multidatabase Transaction Processing: Basic Definitions
        4.3.2  Global Serialisability in Multidatabases
        4.3.3  Multidatabase Atomicity/Recoverability
        4.3.4  Multidatabase Deadlock
        4.3.5  MDAS Concurrency Control Issues
   4.4  Solutions to Transaction Management in Multidatabases
        4.4.1  Global Serializability under Complete Local Autonomy
        4.4.2  Solutions using Weaker Notions of Consistency
        4.4.3  Solutions Compromising Local Autonomy
        4.4.4  Using Knowledge of Component Databases
        4.4.5  Global Serializability Based on Transaction Semantics
        4.4.6  Solutions under MDAS
        4.4.7  Solutions to Global Atomicity and Recoverability
   4.5  Application Based and Advanced Transaction Management
        4.5.1  Unconventional Transactions Types
        4.5.2  Advanced Transaction Models
        4.5.3  Replication
        4.5.4  Replication Solutions in MDAS
   4.6  Experiments with V-Locking Algorithm
        4.6.1  Simulation Studies
        4.6.2  System Parameters
        4.6.3  Simulation Results
   4.7  Conclusion
   4.8  Bibliography

5  Architecture Inclusive Parallel Programming
   5.1  Introduction
        5.1.1  Architecture Independence - The Holy Grail
        5.1.2  Shared Memory Versus Distributed Systems
        5.1.3  Homogeneous Versus Heterogeneous Systems
        5.1.4  Architecture Independence Versus Inclusiveness
   5.2  Concurrency
        5.2.1  Threads and Processes
        5.2.2  Exclusion and Synchronization
        5.2.3  Atomicity
        5.2.4  Monitors and Semaphores
   5.3  Data Parallelism
        5.3.1  Vector Processors
        5.3.2  Hypercubes
        5.3.3  PRAM Algorithms
   5.4  Memory Consistency
        5.4.1  Shared Memory System
        5.4.2  Tuplespace
        5.4.3  Distributed Processing
        5.4.4  Distributed Shared Memory
   5.5  Tuple Locks
        5.5.1  The Need for Better Locks
        5.5.2  Tuple Locks
        5.5.3  Using Tuple Locks
        5.5.4  Bucket Location
        5.5.5  Homogeneous Systems
   5.6  Using Tuple Locks in Parallel Algorithms
        5.6.1  Gaussian Elimination
        5.6.2  Prime Numbers
        5.6.3  Fast Fourier Transform
        5.6.4  Heap Sort
   5.7  Tuples in Objects
        5.7.1  Objects and Buckets
        5.7.2  An Example
        5.7.3  Reflective Objects
   5.8  Towards Architecture Inclusive Parallel Programming
        5.8.1  Parallel Tasking
        5.8.2  Speculative Processing
        5.8.3  Efficient Implementation of Tuple Operations
        5.8.4  Tuple and Bucket Programming Styles
        5.8.5  Back Towards Architecture Independence
Chapter 1
Anatomy of a Resource Management System for HPC Clusters
AXEL KELLER
Paderborn Center for Parallel Computing (PC2), Fürstenallee 11, D-33102 Paderborn, Germany
[email protected]

ALEXANDER REINEFELD
Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB), Takustr. 7, D-14195 Berlin-Dahlem, Germany
[email protected]

1.1 Introduction
A resource management system is a portal to the underlying computing resources. It allows users and administrators to access and manage various computing resources like processors, memory, network, and permanent storage. With the current trend towards heterogeneous grid computing [17], it is important to separate the resource management software from the underlying hardware by introducing an abstraction layer between the hardware and the system management. This facilitates the management of distributed resources in grid computing environments as well as in local clusters with heterogeneous components. In Beowulf clusters [34], where multiple sequential jobs are concurrently executed in high-throughput mode, the resource management task may be as simple
as distributing the tasks among the compute nodes such that all processors are about equally loaded. When using clusters as dedicated high-performance computers, however, the resource management task is complicated by the fact that parallel applications should be mapped and scheduled according to their communication characteristics. Here, efficient resource management becomes more important and also more visible to the user than the underlying operating system. The resource management system is the first access point for launching an application. It is responsible for the management of all resources, including setup and cleanup of processes. The operating system comes into play only at runtime, when processes and communication facilities must be started. As a consequence, resource management systems have evolved from the early queuing systems towards complex distributed environments for the management of clusters and high-performance computers. Many of them support space-sharing (i.e., exclusive access) and time-sharing mode, and some of them additionally provide hooks for WAN metacomputer access, thereby allowing distributed applications to run in a grid computing environment.
1.1.1 Computing Center Software (CCS)
On the basis of our Computing Center Software CCS [24], [30] we describe the anatomy of a modern resource management system. CCS has been designed for user-friendly access and system administration of parallel high-performance computers and clusters. It supports a large number of hardware and software platforms and provides a homogeneous, vendor-independent user interface. For system administrators, CCS provides mechanisms for specifying, organizing and managing various high-performance systems that are operated in a computing service center. CCS originates from the transputer world, where massively parallel systems with up to 1024 processors had to be managed [30] by a single resource management software. Later, the design was changed to also support clusters and grid computing. With the fast innovation rate in hardware technology, we also saw the need to encapsulate the technical aspects and to provide a coherent interface to the user and the system administrator. Robustness, portability, extensibility, and the efficient support of space-sharing systems have been among the most important design criteria. CCS is based on three elements:

• a hierarchical structure of autonomous "domains", each of them managed by a dedicated CCS instance,

• a tool for specifying hardware and software components in a (grid-) computing environment,

• a site management layer which coordinates the local CCS domains and supports multi-site applications and grid computing.
Over the years, CCS has been re-implemented several times to improve its structure and implementation. In its current version V4.03 it comprises about 120,000 lines of code. While this may sound like a lot of code, it is necessary because CCS is in itself a distributed software. Its functional units have been kept modular to allow easy adaptation to future environments.
1.1.2 Target Platforms
CCS has been designed for a variety of hardware platforms ranging from massively parallel systems to heterogeneous clusters. CCS runs either on a frontend node or on the target machine itself. The software is distributed in itself. It runs on a variety of UNIX platforms including AIX, IRIX, Linux, and Solaris. CCS has been in daily use at the Paderborn computing center for almost a decade now. It provides access to three parallel Parsytec computers, which are all operated in space-sharing mode: a 1024-processor transputer system with a 32x32 grid of T805 links, a 64-processor PowerPC 604 system with a fat grid topology, and a 48-node PowerPC system with a mesh of Clos topology of HIC (IEEE Std. 1355-1995) interconnects. CCS is also used for the management of two PC clusters. Both have a fast SCI [21] network with a 2D torus topology. The larger of the two clusters is a Siemens hpcLine [22] with 192 Pentium II processors. It has all the typical features of a dedicated high-performance computer and is therefore operated in multi-user space-sharing mode. The hpcLine is also embedded in the EGrid testbed [14] and is accessible via the Globus software toolkit [16]. For the purpose of this paper we have taken our SCI clusters just as an example to demonstrate some additional capabilities of CCS. Of course, CCS can also be used for managing any other Beowulf cluster with any kind of network like FE, GbE, Myrinet, or Infiniband.
1.1.3 Scope and Organisation of this Chapter
CCS is used in this paper as an example to describe the concepts of modern resource management systems. The length of the chapter reflects the complexity of such a software package: each module is described from a user's and an implementer's point of view. The content of this chapter is as follows: In the remainder of this section, we present the architecture of a local CCS Domain and focus on scheduling and partitioning, access and job control, scalability, reliability, and modularity. In Section 1.3, we introduce the second key component of CCS, the Resource and Service Description (RSD) facility. Section 1.4 presents the site management tools of CCS. We conclude the paper with a review of related work and a brief summary.
1.2 CCS Architecture
A CCS Domain (Fig. 1.1) has six components, each containing several modules or daemons. They may be executed asynchronously on different hosts to improve the CCS response time.
Figure 1.1. Interaction between the CCS components.
• The User Interface (UI) provides a single access point to one or more systems via an X-window or ASCII interface.

• The Access Manager (AM) manages the user interfaces and is responsible for authentication, authorization, and accounting.

• The Queue Manager (QM) schedules the user requests onto the machine.

• The Machine Manager (MM) provides an interface to machine-specific features like system partitioning, job controlling, etc.

• The Domain Manager (DM) provides name services and watchdog facilities to keep the domain in a stable condition.

• The Operator Shell (OS) is the main interface for system administrators to control CCS, e.g. by connecting to the system daemons (Figure 1.2).
1.2.1 User Interface
CCS has no separate command window. Instead, the user commands are integrated in a standard UNIX shell like tcsh or bash. Hence, all common UNIX mechanisms for I/O re-direction, piping and shell scripts can be used. All job control signals (ctl-z, ctl-c, ...) are supported and forwarded to the application. The CCS user interface supports six commands:
Figure 3.2. Description of the lottery victim selection algorithm.

computation unfolds. As distinguished from Cilk-NOW, what comes across as novel in our runtime system is that when a worker becomes a thief it does not choose a victim uniformly at random. Instead, it incorporates a lottery scheduler [27], making use of the information about the level of the closure (thread) at the tail of each processor's ready deque. Lottery scheduling has been used successfully to provide efficient and responsive control over the relative execution rates of computations running on a uniprocessor. It has been shown to be efficient and fair even in systems that require rapid, dynamic control over scheduling at a time scale of milliseconds to seconds. Lottery scheduling implements proportional-share resource management, where the resource consumption rates of active computations are proportional to the relative shares they are allocated. In the proposed randomized victim selection algorithm, each processor is associated with a set of tickets, and the number of tickets associated with each processor is proportional to the level of the tail thread of its ready pool. For every thief processor, the victim processor is determined by holding a lottery. The victim is the processor with the winning ticket. For example, if the registry has multicasted a list of four processors with levels of their ready deque tail threads 12, 8, 7, and 3 respectively, there is a total of 12 + 8 + 7 + 3 = 30 tickets in the system, see Figure 3.2. Next, assume that a new processor has just joined the computation and has received the multicast message from the registry. Initially, this new processor has an empty ready pool, so it becomes a thief immediately. In order to select a victim the new processor holds a lottery based on the information in the multicast message. Assume that the 25th ticket is (randomly) selected. The list of processors is searched for the winner. For every processor the partial sum of tickets from the beginning of the list is computed. If the partial sum is greater than the number of the winning ticket then the current processor is the winner and the search is aborted; otherwise, the search continues with the next processor in the list. For our four-processor example, the winner is the third processor. Therefore, the new processor will try to steal work from the third processor in the list multicasted by the registry. Further, let us assume that another new processor joins the computation.
It will also hold a lottery based on the information in the same multicast message. It is likely that the winner will be the first or the second processor because of the great number of tickets representing them. In this way the selection algorithm probabilistically avoids congestion at the busiest nodes in the system while at the same time it allows work stealing from them.
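The victim selection just described can be sketched in a few lines of Java. This is an illustrative sketch rather than the code of the runtime system; the WorkerInfo record, its field names, and the direct use of java.util.Random are assumptions made for the example.

import java.util.List;
import java.util.Random;

// Each candidate worker holds as many tickets as the level of the thread
// at the tail of its ready deque (one entry per worker in the multicast message).
class WorkerInfo {
    String address;   // assumed field: network address of the worker
    int level;        // assumed field: level of the tail thread of its ready deque

    WorkerInfo(String address, int level) {
        this.address = address;
        this.level = level;
    }
}

class LotteryVictimSelector {
    private final Random random = new Random();

    // Returns the victim chosen by lottery, or null if no worker holds a ticket.
    WorkerInfo selectVictim(List<WorkerInfo> workers) {
        int totalTickets = 0;
        for (WorkerInfo w : workers) {
            totalTickets += w.level;
        }
        if (totalTickets == 0) {
            return null;
        }
        int winningTicket = random.nextInt(totalTickets);    // 0 .. totalTickets-1
        int partialSum = 0;
        for (WorkerInfo w : workers) {                        // linear O(n) scan
            partialSum += w.level;
            if (partialSum > winningTicket) {                 // first prefix sum past the ticket
                return w;
            }
        }
        return workers.get(workers.size() - 1);               // not reached
    }
}

Applied to the four-processor example above, a winning ticket of 25 gives partial sums 12 and 20, which do not exceed 25, and then 27, which does, so the third worker in the multicast list is selected as the victim.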
3.3 Architecture and Implementation of the Java Runtime System
This section presents our prototype runtime system and describes its core components and their interactions.
3.3.1 Architecture of the Java Runtime System
At the highest level, the runtime system implements the following functions:

Threads scheduling. The scheduler distributes tasks from a distributed task queue and manages load balancing through random work stealing (see Section 3.2).

Adaptive parallelism. The system makes use of processors which are idle when the parallel application starts or become idle during the lifetime of the job. When a given workstation is not being used, it joins the system. When the owner returns to work, that processor automatically leaves the computation. Thus, the set of workers shrinks and expands dynamically throughout the execution of a job.

Macroscheduling. A background daemon process runs on every processor in the network. It monitors the processor state to determine when the processor is idle so that it can start a worker on that machine.

The three main components of the runtime system are the registry, the workers, and the node managers. The registry is a super server providing the following services, each of which is implemented in a separate server:

• registering/deregistering of workers,

• updating the information about the workers currently involved in the computation, and

• multicasting the list of network addresses and ages of the workers.

Each worker consists of the following components:
• Master object synchronizing the access to the four pools of closures through guarded suspension and execution state variables.

• Compute server fetching jobs from the ready deque and executing them. If the ready deque is empty, the worker becomes a thief and triggers the Thief thread.

• Thief runnable object executed in a separate thread. This object implements the victim selection algorithm and the actual work stealing. A shortcoming of most distributed schedulers is the need for the workstations to share a common file system, such as NFS. The Thief incorporates a network classloader that allows the downloading of executable code on demand (a sketch of such a class loader is given after this list). The latter overcomes the requirement for the workstations to have a common file system and improves the scalability of the proposed runtime system.

• Victim server object. This server is contacted by the Thief clients of other workers in the course of their work hunt.

• Result server object. Results from stolen threads are returned to this server, which updates the corresponding closure in the waiting pool.

• Register client responsible for registering to and periodic updates with the registry.

• Listener, which listens continually for the datagrams multicasted by the registry. It writes the information received in a 1-bounded buffer. The information is read from the buffer and used by the victim selection algorithm which is invoked by the Thief thread.

• VictimSelection object implementing the victim selection algorithm. We use the library class java.util.Random to generate a stream of pseudo-random numbers. Each worker uses as a seed its unique ID assigned by the registry. A victim worker is selected by holding a lottery. First, a winning ticket is selected at random. Then, the list of workers is searched to locate the victim worker holding that ticket. This requires a random number generation and O(n) operations to traverse a worker list of length n, accumulating a running ticket sum until it reaches the winning value. In [27] various optimizations are suggested to reduce the average number of elements of the worker list to be examined. For example, ordering the workers by decreasing level can substantially reduce the average search length. Since those processors with the largest number of tickets will be selected more frequently, a simple "move to the front" heuristic can be very effective. For large n, a more efficient implementation is to use a tree of partial sums, with clients at the leaves. To locate a client holding a winning ticket, the tree is traversed starting at the root node, and ending with the winning ticket leaf node, requiring only O(lg n) operations.
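As announced in the Thief item above, the following is a minimal sketch of the network class loader idea: class files are fetched on demand from a code server instead of from a shared file system. The HTTP code-server URL and the class-file naming scheme are assumptions made for the illustration; the actual runtime system may transfer bytecode by other means.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

// When a class needed by a stolen closure is not available locally, its
// bytecode is downloaded on demand from a code server instead of being
// read from a shared file system such as NFS.
class NetworkClassLoader extends ClassLoader {
    private final String codeBase;   // e.g. "http://registry-host:8080/classes/" (assumed)

    NetworkClassLoader(String codeBase) {
        this.codeBase = codeBase;
    }

    protected Class<?> findClass(String name) throws ClassNotFoundException {
        try {
            URL url = new URL(codeBase + name.replace('.', '/') + ".class");
            InputStream in = url.openStream();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            in.close();
            byte[] bytecode = out.toByteArray();
            return defineClass(name, bytecode, 0, bytecode.length);   // turn the bytes into a Class
        } catch (Exception e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}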
Figure 3.3. Load distribution on a NOW running Solaris OS (average ready-queue length plotted against time of day).

Figure 3.4. The tree grown by the execution of the nqueens program.

Figure 3.5. Parallel speedup of the proposed approach compared with the ideal speedup, plotted against the number of processors.
Scheduling by lottery is probabilistically fair. The expected allocations of victims to thieves are proportional to the number of tickets the victims hold. Since the scheduling algorithm is randomized, the actual allocated proportions are not guaranteed to match the expected proportions exactly. However, the disparity between them decreases as the number of allocations increases.
3.3.2 Implementation of the Java Runtime System
For efficiency, all communication protocols, except the initial registering of the workers, are implemented over UDP/IP. Some of the protocols add reliability to UDP by incorporating sequence numbers, timeouts, adaptive algorithms for evaluating the next retransmission timeout, and retransmissions [15]. The application protocol used to register new workers with the registry is built over TCP/IP because of the reliability needed during connection establishment and connection termination. One of the assumptions of this research work is that there is a large number of idle CPU cycles. Figure 3.3 plots the average number of jobs in the ready queue of the machines comprising our network. A script was run for two weeks collecting the average load across the workstations at 15-minute intervals. The results were combined to produce an average load during a day. As can be seen from this plot, though more machines are idle at night, a significant number of idle CPU cycles exists at various time slots throughout the day. The results confirm that a network of workstations does indeed provide a valid environment for HPC. It is also possible to calculate the average daily load from Figure 3.3. By rough approximation, the average load of the workstations is around 0.25, indicating that about 75% of the CPU time of each workstation is wasted every day.
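The sketch below illustrates how reliability of this kind can be layered on top of UDP in Java: a sequence number identifies each request, a receive timeout triggers retransmission, and the timeout is increased on each retry. It is a generic illustration of the technique, not the protocol code of the runtime system; in particular, the simple doubling of the timeout stands in for the adaptive retransmission-timeout estimation of [15], and the message format is invented for the example.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;

// Send a request over UDP and wait for a reply carrying the same sequence
// number, retransmitting on timeout.
class ReliableUdpClient {
    private int nextSeq = 0;

    byte[] request(DatagramSocket socket, InetAddress host, int port, byte[] payload)
            throws Exception {
        int seq = nextSeq++;
        byte[] message = ByteBuffer.allocate(4 + payload.length).putInt(seq).put(payload).array();
        DatagramPacket out = new DatagramPacket(message, message.length, host, port);

        int timeoutMs = 500;                                  // fixed start value; a real system
        for (int attempt = 0; attempt < 5; attempt++) {       // would adapt this estimate [15]
            socket.send(out);                                 // (re)transmit the request
            socket.setSoTimeout(timeoutMs);
            try {
                byte[] buffer = new byte[64 * 1024];
                DatagramPacket in = new DatagramPacket(buffer, buffer.length);
                socket.receive(in);
                ByteBuffer reply = ByteBuffer.wrap(in.getData(), 0, in.getLength());
                if (reply.getInt() == seq) {                  // ignore stale or duplicate replies
                    byte[] body = new byte[reply.remaining()];
                    reply.get(body);
                    return body;
                }
            } catch (SocketTimeoutException e) {
                timeoutMs *= 2;                               // back off before retransmitting
            }
        }
        throw new Exception("no reply after repeated retransmissions");
    }
}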
3.4 Performance Evaluation
In this section we present experimental results on the performance of our prototype runtime system for scheduling of multithreaded Java applications on networks of workstations. All experiments were run, and all measurements taken, on a network of 15 workstations running Solaris OS. Subsection 3.4.1 shows the implementation of a sample application and Subsection 3.4.2 presents some experimental results and interpretations of these results.
3.4.1 Applications
Consider the following example taken from [5] and rewritten in Java. The Fibonacci function fib(n), for n ≥ 0, is defined as

    fib(n) = n                         if n < 2
    fib(n) = fib(n-1) + fib(n-2)       otherwise
This network is in the departmental lab of Computer Science and Engineering, Florida Atlantic University.
Another example that we consider is nqueens. The nqueens application is a classical example of searching using backtracking. The objective is to find a configuration of n queens on an n x n chess board such that no queen can capture another.

The following shows how the Fibonacci function is written as a Java program. The double recursive implementation of the Fibonacci function is a fully strict computation. The Java code, structured in the run methods of the two runnable objects, is given below.

class Fib implements Runnable {
    Continuation dest ;
    int n ;

    public Fib( Continuation k, int n ) {
        dest = k ;           // assumed field initialization (body elided in the original)
        this.n = n ;
    }

    public void run () {
        if ( n < 2 )
            dest.sendArg( n ) ;
        else {
            Continuation x = new Continuation() ;
            Continuation y = new Continuation() ;
            ClosureSum s    = new ClosureSum( dest, x, y ) ;
            ClosureFib fib1 = new ClosureFib( x, n-1 ) ;
            ClosureFib fib2 = new ClosureFib( y, n-2 ) ;
        }
        return ;
    }
}

class Sum implements Runnable {
    Continuation dest ;
    int x, y ;

    public Sum( Continuation k, int x, int y ) {
        dest = k ;           // assumed field initialization, as in Fib above
        this.x = x ;
        this.y = y ;
    }

    public void run () {
        dest.sendArg( x + y ) ;
    }
}

The Java code for nqueens is given in the Appendix. The nqueens problem is formulated as a tree search problem [9] and the solution is obtained by exploring this tree. The nodes of the tree are generated starting from the root, which is the empty vector corresponding to zero queens placed on the chess board. The code is structured in the run methods of the classes NQueens, Success, and Failure. On each iteration, a new configuration is constructed, called config in the code, as an extension of a previous safe configuration, thus spawning new parallel work. A configuration is safe if no queen threatens any other queen on the chess board. The
algorithm uses depth-first search to traverse the generated tree. On termination of the for loop of the NQueens.run() method, the variable count contains the number of NQueens closures pushed into the ready pool. This information is used to set the number of missing arguments of the Failure runnable object that is used in the backtracking stage if a dead end is reached. Since we use continuation-passing style for thread synchronization, after spawning one or more children the parent thread cannot then wait for its children to return. Rather, as illustrated in Figure 3.4, the parent thread (Q) additionally spawns two successor threads, namely Failure (F) and Success (S), to wait for the values returned from the children. (In the figure, Qi stands for an NQueens object which is executed in a separate thread. Si (Fi) stands for a successor thread of type Success (Failure). The edges creating successor threads are horizontal. Spawn edges are straight, shaded, and point downward. The edges created by sendArgument() are curved and point upward.) The communication between the child threads and the parent thread's successors is done through Continuation objects. We use two different successor threads because failure and success have different semantics. In order for a thread to return failure, all of its child threads should report failure, while to return success, it suffices for only one of its child threads to report success. It is important to note that nqueens spawns off parallel work which it later might find unnecessary. This "speculative work" can be aborted in our runtime system using the abort method of the Master object which synchronizes the access to the four pools of closures of each worker. Subsequently, the abort message is propagated to all workers currently involved in the computation. The latter allows the nqueens program to terminate as soon as one of its threads finds a solution.
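The join-counter mechanism behind this synchronization can be illustrated as follows. The names Closure, Continuation, setJoinCount and sendArgument follow the chapter, but the bodies are a simplified illustration of the idea rather than the actual runtime code; setTarget, storeArgument and the omitted ready-pool handling are assumptions made for the sketch.

// Simplified illustration of continuation-passing synchronization:
// a closure becomes ready only when all of its missing arguments have arrived.
abstract class Closure implements Runnable {
    private int joinCount;                        // number of missing arguments

    synchronized void setJoinCount(int n) {       // e.g. cFailure.setJoinCount(count)
        joinCount = n;
    }

    // Invoked (via a Continuation) when a child thread delivers one argument.
    synchronized void argumentArrived(Object value) {
        storeArgument(value);
        if (--joinCount == 0) {
            // All arguments are present: in the real system the closure would
            // now be pushed into the worker's ready pool (omitted in this sketch).
        }
    }

    abstract void storeArgument(Object value);    // assumed hook for keeping the value
}

class Continuation {
    private Closure target;                       // successor waiting on this continuation

    void setTarget(Closure c) { target = c; }     // assumed wiring step

    // Called by a child thread to return its result to the waiting successor.
    void sendArgument(Object value) {
        target.argumentArrived(value);
    }
}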
3.4.2 Results and Discussion
The performance of the runtime system was evaluated using the fibonacci and nqueens applications. Although neither application is a real-life workload, they generate workloads suitable for evaluating the performance of our system. fibonacci is not computationally intensive but spawns a large number of threads (in the millions), which makes it appropriate for evaluating the synchronization of the runtime system. nqueens features behaviour typical of most search algorithms employing backtracking. First, we present the serial slowdown incurred by the parallel scheduling overhead. The serial slowdown of an application is measured as the ratio of the single-processor execution time of the parallel code to the execution time of the best serial implementation of the same algorithm. The serial slowdown stems from the extra overhead that the distributed scheduler incurs by wrapping threads in closures, reflecting upon closures to find out threads' constructors, and work stealing. Serial slowdown figures for fibonacci and nqueens are 6.1 and 1.15, respectively. As expected, fibonacci incurs substantial slowdown because of its tiny grain size. The slowdown of nqueens is insignificant.
Figure 3.5 shows the parallel speedup of the fibonacci application. In all experiments all workstations have been started up at the same time and therefore have taken a fair share of the load. The speedup is measured as the ratio of the execution time of the parallel implementation running with one participant to the average execution time of the parallel implementation running with m participants, where m is the number of workstations involved. Tables 3.2 and 3.3 compare the performance of the classical work stealing algorithm, where victims are chosen uniformly at random, to the performance of the proposed work stealing algorithm, which makes use of the information about the levels of the tail closures in the ready pools of the workers. Tables 3.2 and 3.3 show that the lottery-based work stealing algorithm consistently outperforms the random work stealing algorithm for the fibonacci and nqueens applications, respectively. However, we need to run more experiments with applications spawning a range of different subcomputations in order to provide stronger evidence in support of that statement. In Tables 3.2 and 3.3, Columns 2 and 3 display the wall clock time in seconds for the classical work stealing algorithm and the lottery-based work stealing algorithm, respectively, for different numbers of processors involved.
3.5 Conclusions
We have devised and implemented a new victim selection algorithm. In the proposed victim selection algorithm, each processor is given a set of tickets whose number is proportional to the age of the oldest subcomputation in the ready pool of the processor. The victim processor is determined by holding a lottery, where the victim is the processor with the winning ticket. The experimental results have shown that the proposed work stealing algorithm outperforms the classical work stealing algorithm where the victims are selected uniformly at random. We have also designed and implemented a Java runtime system for parallel execution of strict multithreaded Java applications on networks of workstations employing the proposed lottery-based victim selection algorithm. The runtime system features:

• A distributed thread scheduler that efficiently manages load balancing through a variant of work stealing.

• Adaptive parallelism, which allows the utilization of idle CPU cycles without violating the autonomy of the workstations' users.

• A network class loader, which lifts the restriction that all workstations share a common file system and improves the scalability of the runtime system.

Our future plans involve adding fault-tolerance to the runtime system through distributed checkpointing so that the system could survive machine crashes. The challenge of this enterprise stems from the absence of a common file system shared by all workstations.
For the Internet-based version of our runtime system we plan to incorporate into the work-stealing algorithm information about the communication delays among the processors in the system. On a LAN, communication delays do not have a dramatic impact on the performance of the system since they are more or less uniform. However, on a WAN or internetwork they have to be taken into account in order to achieve efficient scheduling of the subcomputations. For the estimation of the communication delays between the processors of the network we are going to design and implement a distributed algorithm, where each processor in the system obtains its partial view of the delays in the system through its communications with the rest of the processors. We also plan to justify theoretically the performance of the proposed work stealing algorithm based on lottery victim selection.
3.6 Bibliography
[1] T. Anderson, D. Culler, and D. Patterson, "A case for NOW (networks of workstations)," IEEE Micro, 15(1), 1994.

[2] A. Appel, Compiling with Continuations. Cambridge University Press, New York, 1992.

[3] J.E. Baldeschwieler, R.D. Blumofe, and E.A. Brewer, "ATLAS: An infrastructure for global computing," In Proceedings of the 7th ACM SIGOPS European Workshop on System Support for Worldwide Applications, 1996.

[4] A. Baratloo, M. Karaul, Z. Kedem, and P. Wyckoff, "Charlotte: Metacomputing on the Web," In Proceedings of the 9th International Conference on Parallel and Distributed Computing, 1996.

[5] R. Blumofe, Executing Multithreaded Programs Efficiently. Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.

[6] R. Blumofe and C. Leiserson, "Scheduling multithreaded computations by work stealing," In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), Santa Fe, New Mexico, November 1994.

[7] R. Blumofe and D. Park, "Scheduling large scale parallel computations on networks of workstations," In Proceedings of the Third International Symposium on High Performance Distributed Computing (HPDC), pp. 96-105, San Francisco, California, August 1994.

[8] R. Blumofe and P. Lisiecki, "Adaptive and reliable parallel computing on networks of workstations," In Proceedings of the USENIX 1997 Annual Technical Conference on Unix and Advanced Computing Systems, Anaheim, California, January 6-10, 1997.
[9] G. Brassard and P. Bratley, Fundamentals of Algorithmics. Prentice-Hall, 1996.

[10] N. Carriero and D. Gelernter, "The S/Net's Linda kernel," ACM Transactions on Computer Systems, 4(2), pp. 110-129, 1986.

[11] B. Christiansen, P. Ionescu, M. Neary, K. Schauser, and D. Wu, "Javelin: Internet-based parallel computing using Java," Concurrency Theory and Practice, 1997.

[12] A. J. Ferrari, "JPVM: Network parallel computing in Java," Technical Report CS-97-29, Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA, http://www.cs.virginia.edu/jpvm/doc/jpvm-9729.ps.gz

[13] "Globus metacomputing infrastructure," http://www.mcs.anl.gov/globus

[14] P. Gray and V. Sunderman, "IceT: Distributed computing using Java," In Proceedings of ACM 1997 Workshop on Java for Science and Engineering, 1997.

[15] V. Jacobson, "Congestion avoidance and control," Computer Communication Review, 18(4), pp. 314-329, August 1988.

[16] JavaSoft Team, RMI: Java Remote Method Invocation - Distributed Computing for Java. Sun Microsystems, Inc., Palo Alto, CA, 1998.

[17] JavaSoft Team, The JavaSpaces specification. Sun Microsystems, Inc., Palo Alto, CA, 1999.

[18] JavaSoft Team, The JavaServer Pages 1.0 specification. Sun Microsystems, Inc., Palo Alto, CA, 1999.

[19] L.F. Lau, A.L. Ananda, G. Tan, and W.F. Wong, "JAVM: Internet-based parallel computing using Java," submitted for publication, 2000.

[20] P. Launay and J. Pazat, "A framework for parallel programming in Java," IRISA Internal Publications, (1154), 1997.

[21] D. Lea, Concurrent Programming in Java: Design Principles and Patterns. Addison-Wesley, 1998.

[22] Message Passing Interface Forum, MPI: A message passing interface. In Proceedings of Supercomputing '93, pp. 878-883, IEEE Computer Society, 1993.

[23] MPIJ 1.1. http://ccc.cs.byu.edu/OnlineDocs/docs/mpij/MPIJ.html
[24] L. Sarmenta, S. Hirano, and S. Ward, "Towards Bayanihan: Building an extensible framework for volunteer computing using Java," In Proceedings of ACM 1998 Workshop on Java for High Performance Network Computing, 1998.

[25] D. Skillicorn and D. Talia, "Models and languages for parallel computation," Computing Surveys, June 1998.

[26] V.S. Sunderam, "PVM: A framework for parallel distributed computing," Concurrency: Practice and Experience, 2(4), pp. 315-339, Dec. 1990.

[27] C. Waldspurger and W. Weihl, "Lottery scheduling: Flexible proportional-share resource management," In Proceedings of the First Symposium on Operating Systems Design and Implementation, Usenix Association, November 1994.
Appendix

public class NQueens implements Runnable {
    Continuation success ;
    Continuation failure ;
    private int n ;          // the total number of queens
    private int i ;          // already placed queens
    private int[] config ;   // the current configuration of queens on the chessboard

    public NQueens( Continuation s, Continuation f, int[] a, Integer nQueens, Integer placedQueens ) {
        ... // constructor body omitted
    }

    public void run() {
        int j = 0 ;
        if ( i == n ) {
            System.out.println( "Done" ) ;
            for ( j = 0; j < i; j++ )
                System.out.print( "" + config[j] + " " ) ;
            System.out.println( "" ) ;
            success.sendArgument( config ) ;
            return ;
        }

        Continuation x = new Continuation() ;   // success
        Continuation y = new Continuation() ;   // failure
        ClosureSuccess cSuccess = new ClosureSuccess( this.success, x ) ;
        ClosureFailure cFailure = new ClosureFailure( this.failure, y ) ;

        short count = 0 ;
        for ( j = 0; j < n; j++ ) {
            int[] newConfig = (int[]) config.clone() ;
            if ( safe( newConfig, i, j ) ) {
                count++ ;
                newConfig[i] = j ;
                ClosureNQueens q = new ClosureNQueens( x, y, newConfig, n, i+1 ) ;
            }
        }
        if ( count == 0 ) {
            failure.sendArgument( new Integer( 0 ) ) ;
        } else
            cFailure.setJoinCount( count ) ;
        return ;
    }

    boolean safe( int config[], int n, int j ) {
        int r = 0 ;
        int s = 0 ;
        for ( r = 0; r < i; r++ ) {
            s = config[r] ;
            if ( j == s || i - r == j - s || i - r == s - j ) {
                return false ;
            }
        }
        return true ;
    }
}

public class Failure extends Task {
    Continuation destination ;
    Integer fail ;
    ... // constructor and helper methods

    public void run() {
        // send failure notification to the failure successor of the parent thread
        destination.sendArgument( fail ) ;
    }
}

public class Success extends Task {
    Continuation destination ;
    int[] config ;
    ... // constructor and helper methods

    public void run() {
        Master.getWorker().abortReadyClosures() ;
        // send the successful configuration to the successor of the parent thread
        destination.sendArgument( config ) ;
    }
}
Chapter 4
Transaction Management in a Mobile Data Access System

K. SEGUN, A.R. HURSON, V. DESAI
Computer Science and Engineering Department, The Pennsylvania State University, University Park, PA 16802

A. SPINK
School of Information Sciences and Technology, The Pennsylvania State University, University Park, PA 16802

L.L. MILLER
Computer Science Department, Iowa State University, Ames, IA 50011
Abstract

Advances in wireless networking technology and portable computing devices have led to the emergence of the mobile computing paradigm. As a result, the traditional notion of timely and reliable access to global information sources in a distributed system or multidatabase system is rapidly changing. Users have become much more demanding in that they desire and sometimes even require access to information anytime, anywhere. The amount and the diversity of information that is accessible to a user are also growing at an exponential rate. Compounding the access to information is the wide variety of technologies with differing memory, network, power, and display requirements. Within the scope of distributed databases
and multidatabases, the issue of concurrency control as a means to provide timely and reliable access to the information sources has been studied in detail. In an MDAS environment, concurrent execution of transactions is more problematic due to the power and resource limitations of computing devices and the lower communication rate of the wireless communication medium. This article is intended to address the issue of concurrency control within a multidatabase and an MDAS environment. Similarities and differences of multidatabase and MDAS environments are discussed. Transaction processing and concurrency control issues are analyzed. Finally, a taxonomy for concurrency control algorithms for both multidatabase and MDAS environments is introduced.

Keywords: Multidatabase, Mobile computing environment, Mobile data access system, concurrency control, transaction processing.
4.1 Introduction
The need to maintain and manage a large amount of data efficiently has been the driving force for the emergence of database technology and, more recently, E-commerce. Initially, data was stored and managed centrally. As organizations became decentralized, the number of locations, and thus local databases, increased. The need for shared access to multiple databases was inevitable. Geographical distribution of data, demand for highly available systems and autonomy coupled with economical issues, availability of low cost computers, advances in distributed computing, and the demands of supply chain based E-commerce are among the pressing issues behind the transition towards distributed database technology. Design of distributed database management systems (DBMS) has had an impact on issues such as concurrency control, query processing/optimization and reliability. Traditionally, distributed DBMSs have been built in a top down fashion - building separate databases and distributing data among them [12]. This approach has the advantage that fixed standards can be set before the databases are built, simplifying the issues of data distribution, query processing, concurrency control and reliability. The local DBMSs are typically homogeneous with respect to the data model implemented and present the same functional interfaces at all levels. The global system has control over local data and processing. Solutions developed in a centralized environment can typically be extended to fit this model, resulting in a tightly coupled global information-sharing environment. This approach to designing a distributed database system is possible only if the design process is started from scratch. The issue is how to effectively distribute the data, having knowledge of resources like machine capacity, network overhead, and semantics of data. The natural extension to the distributed database system came in the form of applications of distributed systems in integrating preexisting databases to make the most use of the data available. Most organizations already have their major databases in place. It would be impractical to move this data into a common database, since it would not only be expensive, but the independence of managing individual databases also would be lost. The alternative is logical integration of
data, so as to provide a view of one logical database. This can be viewed as a bottom up approach to distributed database design. The databases themselves are loosely coupled and could potentially differ in data models, and transaction and query processing schemes used - heterogeneous databases or multidatabases [10]. A key feature is the autonomy that individual databases retain to serve their existing customer set. The goal of integrating the databases is to provide users with a uniform access pattern to data in several databases, without modifying the underlying databases and without requiring knowledge of the location or characteristics of the various DBMSs. Solutions from a centralized environment cannot be directly applied in such an environment, autonomy and heterogeneity being restricting factors. Unlike the distributed approach, this could be viewed as being motivated by efficient integration of data instead of efficient distribution of data. Multidatabase systems (MDBS) are not simply distributed implementations of centralized databases, but can be seen as much broader entities that present their own unique characteristics. This in turn raises several interesting research issues over and above those for centralized databases. Some of these issues, like query optimization and transaction management, are rooted in a centralized environment. Others, such as data distribution, have their roots in distributed databases. However, issues such as local autonomy are unique to the multidatabase problem.

An important emerging computing paradigm is mobile computing. Thanks to recent advances in computer and telecommunications technology, and the subsequent merging of both technologies, mobile computing, particularly as an important technological infrastructure for E-commerce, is now a reality. A Mobile Data Access System (MDAS) is a multidatabase system that is capable of accessing a large amount of data over a wireless medium. Such a system is realized by superimposing a wireless mobile computing environment on a multidatabase system [38]. Mobility raises a number of additional challenges for multidatabase system design. Current designs are not capable of resolving the difficulties that arise as a result of the inherent limitations of mobile computing: frequent disconnection, high error rates, low bandwidth, high bandwidth variability, limited computational power, and limited battery life. Wireless transmission media across wide-area telecommunication networks are an important element in the technological infrastructure of E-commerce [62]. Effective development of guided and wireless-media networks will enhance delivery of World Wide Web functionality over the Internet. Using mobile technologies will enable the purchase of E-commerce goods and services anywhere and anytime.

Multiple users, in general, could access a database, which implies multiple transactions occurring simultaneously. This is especially true in a distributed database system where various users at different sites could access independent databases. In data processing applications (e.g. banking, stock exchange) reducing access times and maintaining the availability, reliability, and integrity of data is essential. Effective E-commerce is based on the successful development and implementation of multidatabase systems. E-commerce based businesses need effective solutions to the integration of Internet front-end systems with diverse data in legacy distributed
databases. A key challenge for E-commerce is the need for real-time concurrent access to distributed databases containing accounting, marketing, inventory, sales, production systems, and vendor information. Concurrent access to data is a natural way to increase throughput and reduce response time. Database operations require extensive I/O operations and, in addition, a distributed environment has to cope with delays in the network. These characteristics motivate interleaving the execution of several transactions. Concurrent transaction processing raises the possibility of interference. It is safe to concurrently access data items as long as they are independent, but in the case of related data items, accesses should be coordinated - concurrency control being the activity of coordinating concurrent accesses to shared data [6]. In a multidatabase environment, transactions could span multiple databases. Concurrency control in such an environment should not only synchronize the subtransactions of a transaction at the respective sites, but also the transactions as a whole at the global level. In addition, the coordination of transactions in a multidatabase environment should enforce minimal changes to the local databases, with autonomy of the component databases being a distinctive feature.

In the MDAS environment, transactions tend to be long-lived. This is due to frequent disconnection, the limited bandwidth constraints experienced by the mobile user, and the mobility of users. The communication path tends to lengthen as users move from one administrative domain to another, even if the physical distances traversed are short. Concurrency control in such an environment should take into account the effects of disconnection, limited bandwidth, mobility and portability. Concurrency control must strive to reduce communication, reduce computation, and conserve the battery life of the mobile unit.

Computer software and hardware are failure prone. Incomplete transactions due to failure can lead to inconsistencies. Failures can even lead to loss of data. This stresses the need for a database to have effective recovery mechanisms or methods to maintain atomicity of transactions. In addition, in a distributed case, failure of some sites should not halt the execution of the whole system, availability being an important consideration. To handle these issues, proper transaction management schemes should be incorporated into a database system. It is the responsibility of transaction management schemes to ensure correctness under all circumstances. By correctness, a transaction should satisfy the following ACID properties [31]:

• Atomicity: Either all operations of a transaction happen or none happen. State changes in a transaction are atomic.

• Consistency: A transaction produces results consistent with the integrity requirements of the database.

• Isolation: In spite of the concurrent execution of transactions, each transaction believes it is executing in isolation. Intermediate results of a transaction should be hidden from other concurrently executing transactions.
• Durability: On successful completion of a transaction, the effects of the transaction should survive failures.

Extensive research has been done to maintain the ACID properties of transactions in centralized and tightly coupled distributed database environments [31]. The emergence of the need to access preexisting databases, as in a multidatabase environment, imposes new constraints and difficulties in maintaining the ACID properties. The difficulties stem essentially from the requirement of maintaining the autonomy of the local databases. This implies that the local sites have full control over the data at their respective sites. The consequence is that the local executions are outside the control of the global multidatabase system. Transaction processing in such a fully autonomous environment can give rise to large delays, frequent or unnecessary aborts, possible inconsistencies on failure and hidden deadlocks, just to name a few of the problems that are encountered. Inevitably, certain assumptions and tradeoffs have to be made, usually compromising the autonomy of the local databases, in order to maintain the goals of transaction processing in general.

In the next section, we provide a brief introduction to multidatabase and MDAS systems and their research issues. In Section 4.3, concurrency control and transaction processing issues in MDBS and MDAS are discussed and their differences from traditional distributed systems are addressed. Section 4.4 looks at the existing solutions for transaction management in both environments. Section 4.5 addresses application based and advanced transaction management. Section 4.6 discusses our experiments with the V-locking algorithm in an MDAS environment. Finally, Section 4.7 concludes this article.
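Before turning to multidatabase characteristics, the following sketch recalls, purely for reference, how the ACID properties listed above are typically obtained on a single local DBMS through the standard JDBC interface: either both updates commit or both are rolled back. This is a generic single-site illustration, not an MDAS mechanism; the connection URL and table names are placeholders, and the difficulty addressed in this chapter is precisely that no such single commit point exists across autonomous local databases.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// Generic single-site illustration of atomicity: both the debit and the
// credit take effect, or neither does.
class LocalTransferExample {
    void transfer(String url, int from, int to, int amount) throws SQLException {
        try (Connection con = DriverManager.getConnection(url)) {     // placeholder URL
            con.setAutoCommit(false);                                 // start a transaction
            try (Statement st = con.createStatement()) {
                st.executeUpdate("UPDATE account SET balance = balance - " + amount
                        + " WHERE id = " + from);
                st.executeUpdate("UPDATE account SET balance = balance + " + amount
                        + " WHERE id = " + to);
                con.commit();                                         // make both updates durable
            } catch (SQLException e) {
                con.rollback();                                       // undo any partial effects
                throw e;
            }
        }
    }
}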
4.2 Multidatabase Characteristics
A multidatabase system is a distributed system that acts as a front end to multiple local DBMSs. It provides a structured global system layer on top of existing local DBMSs. The global layer is responsible for providing full database functionality and interacts with the local DBMSs at their external user interface. The end user gets an illusion of a logically integrated database, hiding intricacies of different local DBMSs at the hardware and software levels. Thus, a multidatabase can be viewed as a database system formed by independent databases joined together with a goal of providing uniform access to the local DBMSs. A multidatabase system, in general, can be represented by the architecture shown in Figure 4.1. The primary objective of the multidatabase is to place as few restrictions on the local DBMSs as possible. Another goal (or maybe more of a consequence) of forming a multidatabase system is the recognition of the need for certain basic standards in the development of databases so as to simplify global information sharing in the future.

Heterogeneity is a term commonly used in a multidatabase environment. In general, heterogeneity can occur due to differences in hardware, operating systems, data models, communication protocols, to name a few. In a multidatabase environment, to make global information sharing a reality, heterogeneities in data models,
schema, query languages, query processing, and transaction management schemes have to be resolved.

[Figure 4.1. Multidatabase System. LDBS: Local Database Management System; MDBS: Multidatabase System.]
4.2.1 Taxonomy of Global Information Sharing Systems
There is a wide range of solutions for global information sharing in a distributed environment, with terms like distributed databases, federated databases, and multidatabases being among the most commonly used. The distinction arises from the degree of autonomy and the manner in which the global system integrates with the local DBMSs. In a tightly coupled system, global functions have access to low-level internal functions of the local DBMSs; in a loosely coupled system, the local DBMS allows global control through its external user interfaces only. The amount of control that a local DBMS retains over data at its site after joining the global system is the basis for the following taxonomy.

• Distributed Databases: A distributed database is the most tightly coupled global information sharing system. The global manager has control over transactions occurring both globally and locally. Such systems are typically designed in a top-down fashion, with global and local functions implemented simultaneously. Logically, distributed databases give the view of a centralized database, with data at multiple sites instead of a single one.

• Federated Databases: Federated database systems are more loosely coupled than distributed database systems. The participating DBMSs have significantly more control over the data at their respective sites. In a federated database system, each of the local DBMSs decides what part of the local data is shared.
They cooperate with the other local DBMSs in the federation for global operations. The DBMSs in the federation have typically been designed in a bottom-up manner, and on joining the federation they give up a certain amount of local freedom.

• Multidatabase Systems: Multidatabase systems are the most loosely coupled systems. Here, the local DBMS retains full control over the local data even after joining the global system. In a multidatabase system, it is the responsibility of the global system to extract information and resolve various aspects of heterogeneity for global processing.

This classification highlights independence and autonomy of the local databases as two important features of a multidatabase system. The relationship between multidatabases and autonomy merits more attention and is highlighted in the following discussion.
4.2.2 MDBS and Node Autonomy
Node autonomy is one of the key concepts in a distributed system [27]. A MDBS is a distributed system formed to allow uniform access to multiple local DBMSs, wherein local operations have priority and may be more frequent than global operations; thus, enforcing the autonomy of the underlying sites becomes important. Furthermore, a local site may contain legacy databases, and transforming this data to suit global needs would be too expensive, so economic constraints are also motivating factors in preserving local autonomy. Issues such as isolating some local data from global access magnify the need for site autonomy, since it allows the local DBA to restrict the information available to the global user. Autonomy can be a desirable feature from the global standpoint as well: in case of failure, it helps prevent the effects of a local failure from propagating throughout the system. Autonomy can come in different forms [27]:

• Design Autonomy: The local DBMS should not be made to change its software or hardware platform to join a multidatabase system. In short, the local DBMS should remain as is on becoming part of a global system; the global software can be looked upon as an add-on to the existing system. The primary reason for design autonomy is economics - an organization may have significant capital invested in existing hardware, software, and user training. This is especially relevant for systems designed in a bottom-up manner. Heterogeneity arises in distributed systems if they are allowed to retain their design autonomy.

• Communication Autonomy: The local DBMS has the freedom to decide what information it is willing to share globally. This can imply that a local DBMS may not inform the global user about transactions occurring locally, which makes the task of coordinating global transactions spanning multiple sites extremely difficult. Synchronization of global transactions
is not the responsibility of the local DBMS. Local databases in federated and distributed database systems do not retain their communication autonomy, since they provide information for global transaction coordination [10].

• Execution Autonomy: A local DBMS can execute transactions submitted at the local site in any manner it desires. This implies that global requirements for the execution of a transaction at a local site may not be honored by the local DBMS, which is free to unilaterally abort any transaction executing at that site. This is because the local DBMS does not treat a global transaction as a special transaction; it is like any other local transaction executing at that node.

In the next subsection, we examine some of the issues that arise due to autonomy and the inherent heterogeneity in the multidatabase environment.
4.2.3 Issues in Multidatabase Systems
The primary issues that are heavily influenced by local database autonomy are outlined in the following:

• Schema Integration: The local databases have their own schemas; the goal here is to create an integrated schema giving the logical view of an integrated database. Schema integration is difficult when component databases differ in naming, format and structure. Briefly, naming differences occur because of differing naming conventions, whereby semantically equivalent data can be named differently or semantically conflicting data can be named the same. Format differences include differences in data types, domain, scale, precision, and item combinations, whereas structural differences occur due to differences in data structures. Schema integration has been discussed extensively in [35,39].

• Query Languages and Processing: Query translation may be required, since the query languages used to access the local DBMSs may differ. Query processing and optimization is also difficult due to differences in data structures and processing power at each local DBMS. Requirements like converting data into a standard format and processing queries at nodes with more processing power are some of the issues that merit consideration. Not only do these factors increase the overhead of query processing, but they can also create additional problems like communication bottlenecks and hot spots at servers, especially when the information available about the local DBMSs is inadequate at the global level. Fragmentation of data and incomplete local information can make developing an accurate cost model difficult, increasing the complexity of query optimization [55].

• Transaction Processing: In a MDBS, the local DBMSs may use different concurrency control and recovery schemes. It could happen that one database follows the two-phase locking protocol while others
use a timestamp-based scheme to serialize accesses. Furthermore, to maintain atomicity and durability of global transactions, the local databases should support some atomic commitment protocol; a local DBMS joining a multidatabase environment may not have such a facility. The problem becomes even more severe if the local DBMS does not want to divulge, or does not have, any information regarding its local concurrency control and recovery schemes. This makes maintaining global consistency in a MDBS even more difficult [9]. In a MDBS where updates are frequent, loss of correctness and inconsistency of data are often unacceptable. Database technology is built around the notion that data will be stored reliably and will be available to multiple users. Thus, maintaining the ACID properties of transactions is of vital importance.
4.2.4 MDAS Characteristics
A mobile data access system (MDAS) is a multidatabase system that is capable of accessing a large amount of data over a wireless medium. The system is realized by superimposing a wireless mobile computing environment over a multidatabase system [38]. The general architecture of the MDAS can be represented as shown in Figure 4.2.
[Figure 4.2. General architecture of the MDAS: mobile support stations (MSS) connected by a wired network.]
G2, and at the second local site, the projection of the serialization order resulted in G2 -> G1, indicating a cycle in the global serialization graph. If we look at the local transaction history at LDBS2 we see:

LDBS2: rL2(c) wG1(c) rG2(d) wL2(d).
LDBS1: data items a, b        LDBS2: data items c, d
LH1: wG1(a) rG2(a)            Serialization graph at LDBS1: G1 -> G2
LH2: wG2(c) rG1(c)            Serialization graph at LDBS2: G2 -> G1
Figure 4.6. Global-Global Conflicts.

Although the transactions G1 and G2 were executed in the desired serialization order, G1 -> G2, an indirect conflict caused by the local transaction L2 reversed the serialization order at LDBS2. The global transaction manager has information about the global transactions only; the local transactions are, to it, phantom transactions. Another problem, not discussed in the previous section, is non-serializable schedules caused by direct conflicts between global transactions, as illustrated in Figure 4.6. The only way to serialize transactions under full autonomy and in the presence of indirect conflicts is to assume that global transactions conflict at every local site at which they execute concurrently. Non-serializable executions caused only by direct conflicts between global transactions are easier to deal with, since the global transaction manager has full knowledge of all global operations. Solutions under full autonomy are given below and summarized in Table 3.

Site Graph Method: The site graph method, proposed in [8], is a pessimistic approach to global concurrency control in multidatabase systems. A site graph is an undirected graph whose nodes are the sites containing the data items accessed by the global transactions and whose edges correspond to the sites spanned by a global transaction. Initially, there are no edges in the graph. When a global transaction issues its subtransactions, undirected edges are added between the nodes of the LDBSs that participate in the execution of that global transaction. As long as the site graph is acyclic, serializability of the global transactions is guaranteed; a cycle in the site graph indicates the possibility of a non-serializable schedule, and hence the global transaction has to be aborted. In our running example, when G1 executes at local sites 1 and 2, an edge is inserted between nodes 1 and 2. When G2 then submits its operations at sites 1 and 2, another edge is inserted between the two nodes, causing a cycle in the site graph; as a result, G2 is aborted. The site graph method is an example of a bottom-up concurrency control scheme because the site graph is a global data structure used by the GTM, which is the only unit responsible for global concurrency control in this approach. The site graph method preserves the autonomy of the local databases but suffers from low concurrency, since a cycle in the site graph does not necessarily imply a non-serializable global schedule. Throughput is also reduced, since only one transaction can execute concurrently over the same set of sites.
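As an illustration of the bookkeeping involved, the following is a minimal sketch (not from the original text; the class and function names are hypothetical, and each global transaction's sites are simply chained with a path of edges) of a site graph maintained with a union-find structure. A subtransaction whose new edges would close a cycle causes the global transaction to be rejected.

def creates_cycle(existing_edges, new_edges):
    """Union-find test: would adding new_edges to existing_edges close a cycle?"""
    parent = {}
    def find(s):
        parent.setdefault(s, s)
        while parent[s] != s:
            parent[s] = parent[parent[s]]          # path compression
            s = parent[s]
        return s
    for a, b in list(existing_edges) + list(new_edges):
        ra, rb = find(a), find(b)
        if ra == rb:
            return True
        parent[ra] = rb
    return False

class SiteGraph:
    def __init__(self):
        self.edges = {}                            # active global txn -> its edges

    def submit(self, txn, sites):
        """Link the sites spanned by txn with a chain of edges; reject on a cycle."""
        new = list(zip(sites, sites[1:]))
        held = [e for es in self.edges.values() for e in es]
        if creates_cycle(held, new):
            return False                           # abort the global transaction
        self.edges[txn] = new
        return True

    def finish(self, txn):                         # removing edges at commit is
        self.edges.pop(txn, None)                  # only safe under extra rules

g = SiteGraph()
print(g.submit("G1", ["LDBS1", "LDBS2"]))          # True
print(g.submit("G2", ["LDBS1", "LDBS2"]))          # False: would close a cycle

The finish method hints at the edge-removal subtlety discussed next: simply dropping a transaction's edges at commit time is not always safe.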
Another disadvantage stems from the method used to safely remove an edge from the site graph. At first sight, it seems that once a transaction commits, the edges corresponding to that transaction can be removed from the site graph. But conflicts can occur even after a transaction commits, resulting in non-serializable executions, as illustrated in [6]. Knowledge about the local concurrency control mechanisms, or weaker notions of correctness such as quasi serializability [16], can be used to develop a safe policy for edge removal. The site graph method has been used for concurrency control in the Amoco Distributed Database System (ADDS) [10].

Forced Conflict Method: In the forced conflict method [30], the global scheduler generates globally serializable schedules by assuming that the local databases generate locally serializable schedules: serializability among global transactions is ensured by using a special data item, called a ticket, at every local DBMS where they execute. As we have seen, global transactions that appear globally serializable may generate globally non-serializable schedules due to indirect conflicts caused by local transactions. If all concurrent global transactions that span the same sites are forced to conflict directly, the problem of indirect conflicts does not arise, since the induced conflicts serialize the global transactions. Each global subtransaction is modified to read and write the ticket data item at each local database it accesses. The ticket operation forces a direct conflict between global subtransactions, irrespective of the other operations they contain - the ticket acts as a (logical) timestamp and is stored as a regular data item in each LDBS (Figure 4.7).
LH1: rG1(t1) wG1(t1) rG1(a) rL1(a) rL1(b) rG2(t1) wG2(t1) wG2(a)
LH2: rL2(c) rG1(t2) wG1(t2) wG1(c) rG2(t2) wG2(t2) rG2(d) wL2(d)
Figure 4.7. Example of using tickets.

In our running example, additional operations to read and modify the ticket are required in the local schedules, as shown in Figure 4.7, where t1 and t2 are the tickets at LDBS1 and LDBS2, respectively. The history at LDBS2 now contains a cycle. Since the local schedules are serializable at their respective sites, the schedule at LDBS2 is not allowed by the local
concurrency control mechanism. As a consequence, to maintain local serializability, one of the transactions will be aborted or blocked. Furthermore, the ticket values read are used to determine the relative serialization order among global subtransactions. It could happen that at LDBS1, GST1 -> GST2 whereas at LDBS2, GST2 -> GST1; both of these execution orders are valid at their respective sites. At LDBS1, GST1 executed before GST2 and the ticket operations created the conflict GST1 -> GST2, whereas at LDBS2, GST2 executed before GST1, giving GST2 -> GST1. Globally, however, the relative serialization orders indicated by the ticket values read at the respective sites are not consistent, and hence one of the global transactions is aborted.

The autonomy of the local DBMS is preserved, since no change is required in the local DBMS to support the ticket operation. Also, the local transactions are not affected by the ticket operations, since only subtransactions of global transactions have to read and manipulate the ticket values. The disadvantage of this method is that the ticket operation causes conflicts between global transactions even when they may not actually conflict. This can lead to numerous aborts, especially when optimistic scheduling is employed. Also, the ticket data item can become a hot spot [6] in the local database if several transactions try to access it simultaneously. A rare (but not too severe) problem concerns the autonomy of the local databases if the local DBMSs do not support the creation of the ticket data object. This can easily be resolved by adding additional operations to the transactions that induce direct conflicts among them; however, this can result in more aborts than the pure ticket method, since the newly added operations reference a data item accessible by both global and local transactions. The forced conflict method is an important algorithm for multidatabase concurrency control since it demonstrates that global serializability can be achieved by exploiting the fact that the LDBS concurrency control method serializes transactions at their respective sites. However, strict serializability is very restrictive and may even be inappropriate under certain circumstances [47]. Therefore, we will look at solutions that maintain autonomy of the component databases under relaxed (serializability) correctness criteria.
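To make the ticket mechanics concrete, here is a small sketch (not from the original text; the classes, method names and data values are hypothetical) of a global transaction manager that wraps every subtransaction with a read and increment of the site's ticket and then checks that the ticket values imply the same relative order of two globals at every site they share.

class LocalSite:
    """Toy stand-in for a local DBMS; the ticket is stored as ordinary data."""
    def __init__(self, name):
        self.name, self.data = name, {"ticket": 0}

    def read(self, key):
        return self.data[key]

    def write(self, key, value):
        self.data[key] = value

class TicketedGTM:
    def __init__(self):
        self.tickets_seen = {}                 # (txn, site name) -> ticket read

    def run_subtransaction(self, txn, site, operations=()):
        t = site.read("ticket")                # forced direct conflict point
        site.write("ticket", t + 1)
        self.tickets_seen[(txn, site.name)] = t
        for op in operations:                  # the subtransaction's real work
            op(site)

    def consistent(self, txn_a, txn_b, site_names):
        """True if every shared site ordered txn_a and txn_b the same way."""
        orders = set()
        for name in site_names:
            ta = self.tickets_seen.get((txn_a, name))
            tb = self.tickets_seen.get((txn_b, name))
            if ta is not None and tb is not None:
                orders.add(ta < tb)
        return len(orders) <= 1                # disagreement means abort one

s1, s2 = LocalSite("LDBS1"), LocalSite("LDBS2")
gtm = TicketedGTM()
gtm.run_subtransaction("G1", s1); gtm.run_subtransaction("G1", s2)
gtm.run_subtransaction("G2", s1); gtm.run_subtransaction("G2", s2)
print(gtm.consistent("G1", "G2", ["LDBS1", "LDBS2"]))   # True: same order at both sites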
4.4.2 Solutions using Weaker Notions of Consistency
Serializability can be too strong a criterion, especially in a multidatabase environment where the underlying databases are autonomous. The low degree of concurrency obtained can significantly increase transaction response time and the delays among transactions. As a consequence, weaker notions of consistency such as quasi serializability [16], multidatabase serializability [4], and two-level serializability [41] have been proposed. Generalizations of serializability, such as Epsilon serializability [49], which allow a bounded amount of inconsistency, can yield an even higher degree of concurrency. Before proceeding further, we will
Table 3. Concurrency Control Under Full Autonomy.

Site Graph Method [8]: Maintains a site graph to serialize global transactions; if more than one global transaction spans the same sites, only one of them is allowed to execute at a time.

Forced Conflict Method [30]: Global transactions are made to conflict directly at all local sites; each local database maintains a special data item, the ticket, that is used to serialize the global transactions.
briefly explain the notions of quasi-serial histories, two-level serializable histories, and Epsilon serializability.

Two-Level Serializability (2LSR): Definition 2: A schedule is two-level serializable if:

• The schedule restricted to each local site is serializable.

• The projection of the global schedule onto the set of global transactions is serializable, i.e., the schedule excluding the local transactions is serializable.

Two-level serializable schedules are a superset of conflict serializable schedules (2LSR ⊃ CSR). It has been shown in [41] that 2LSR schedules preserve database consistency only for certain applications. The basic idea is to exploit knowledge of the inter-site data dependencies to relax restrictions on the global transactions and find schedules that are 2LSR and maintain database consistency. Consistency is preserved by capturing inter-site integrity constraints and by partitioning the data into global and local data items: a data item is a global data item if there is an inter-site integrity constraint between it and a data item at a different site. It should be noted that partitioning the data imposes restrictions on the data items read and written by local and global transactions. Quasi-serializable schedules are a subset of 2LSR schedules and, like 2LSR, rely on relaxing the serializability criterion.

Quasi Serializability:
Definition 3: A global history is quasi serial if all the local histories are conflict serializable and there exists a total order of all global transactions such that, for any two global transactions Gi and Gj, if Gi precedes Gj in the global order, then all of Gi's operations precede all of Gj's operations in every local history in which both appear. A global history is quasi serializable if it is equivalent to a quasi-serial history.

Quasi serializability preserves database consistency if i) there are no value dependencies between data items stored at different databases (i.e., if X = Y + Z then X and {Y, Z} are not allowed to reside at different databases), and ii) integrity and consistency constraints are defined locally for each local database. These conditions are reasonable if we consider the manner in which a MDBS is typically formed - as a collection of independent and autonomous databases. In fact, it has been argued in [28] that having a global constraint could itself be a violation of local autonomy: by agreeing to enforce a global constraint, a local database is at the mercy of the global transaction manager or other nodes in the MDBS for control over its own data items. Thus, in a MDBS, it is highly unlikely that two databases joining the system will have any inter-site integrity constraints or value dependencies (such as referential integrity constraints). Multidatabase serializability (introduced in [4]) is equivalent to quasi serializability and hence is not described here.

Comparing the different correctness criteria (conflict serializability, quasi serializability and two-level serializability): quasi serializability is a subset of two-level serializability and is more restrictive, while conflict serializability is a subset of both and is the most restrictive. The relationship between the correctness criteria is illustrated in Figure 4.8.

Using our running example, it is worth examining quasi serializability as the correctness criterion. In the schedule at LDBS2, if the subtransaction of G2 is submitted after the subtransaction of G1 completes, then the total order (G1 preceding G2) among the global transactions is maintained. The indirect conflict does not affect global database consistency because there are no value dependencies. Thus, if we were using the site graph method at the global level, as soon as G1 completes, the edge corresponding to G1 can be removed and G2 can be submitted - the weaker notion of consistency enables early removal of edges from the site graph. Concurrency control methods based on quasi serializability are advantageous if the local databases are truly independent of each other since, under such circumstances, no restrictions are imposed on the underlying local databases and local autonomy is preserved.
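The total-order requirement in Definition 3 lends itself to a mechanical check. The following sketch (not from the original text; the function name and data layout are hypothetical) searches for a global order that every local history respects, leaving the separate check of local conflict serializability aside.

from itertools import permutations

def find_quasi_serial_order(local_histories, global_txns):
    """local_histories: site -> list of (txn, operation) in execution order.
    Returns a total order of the global transactions such that, at every
    site, all operations of an earlier global precede those of a later one;
    returns None if no such order exists."""
    def respected(order, history):
        rank = {g: i for i, g in enumerate(order)}
        highest = -1
        for txn, _ in history:
            if txn in rank:
                if rank[txn] < highest:
                    return False          # a later global appeared too early
                highest = max(highest, rank[txn])
        return True
    for order in permutations(global_txns):
        if all(respected(order, h) for h in local_histories.values()):
            return order
    return None

histories = {
    "LDBS1": [("G1", "w(a)"), ("L1", "r(a)"), ("L1", "r(b)"), ("G2", "w(a)")],
    "LDBS2": [("L2", "r(c)"), ("G1", "w(c)"), ("G2", "r(d)"), ("L2", "w(d)")],
}
print(find_quasi_serial_order(histories, ["G1", "G2"]))   # ('G1', 'G2')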
4.4.3 Solutions Compromising Local Autonomy
Until now, we have looked at solutions that try to preserve local autonomy either by using strict correctness criteria or by using weaker consistency notions. Weaker notions of consistency do require a few assumptions, but in general can be applied without imposing any design changes on the local databases. The schemes that
compromise local autonomy impose some modifications on the local databases when they join a multidatabase system, in order to aid global concurrency control.

One of the early efforts that compromised the design autonomy of local databases is given in [46]. This approach requires strict serializability as the correctness criterion for local as well as global concurrency control. The scheme uses a data structure, the order element (O-element), to represent the serialization order of subtransactions in the component databases. The order element is determined by the concurrency control method used by the underlying local database; for example, if transactions are serialized by timestamps at a local site, then the timestamps can serve as the O-elements. If the local serialization order is Ti -> Tj, then according to the local concurrency control mechanism O-element(Ti) -> O-element(Tj). A data structure called an order vector is formed by concatenating the order elements from the component databases; the order vector can then be analyzed to maintain serializability. This method, implemented in the Harmony multidatabase project at Columbia University [45], is an example of a bottom-up approach to concurrency control, since the global transaction manager is responsible for enforcing global concurrency control.

A top-down approach to concurrency control that also modifies the local DBMSs is described in [19]. A subprocess that serves as an interface between the multidatabase and the underlying local DBMSs is used to serialize transactions. The order of global transactions is maintained by controlled submission of global subtransactions to the LDBSs; the approach utilizes the local concurrency control schemes in order to enforce serializability. Alternatively, a top-down concurrency control scheme can be employed by modifying the local schedulers, which then serialize both global and local transactions. The top-down approach does not achieve maximum concurrency, since the serialization order is predetermined at the global level: because runtime ordering of global transactions is disallowed, only local histories compatible with the predetermined global order are allowed.
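As a rough illustration of the order-vector idea (not from the original text; the function and data layout are hypothetical), the analysis amounts to verifying that the per-site O-elements rank every pair of global transactions the same way:

def order_vectors_consistent(o_elements):
    """o_elements: site -> {global txn: O-element (e.g., local timestamp)}.
    True if every pair of globals is ordered identically at every site
    where both executed, i.e., the order vectors admit one serialization."""
    agreed = {}                              # (Ti, Tj) -> "Ti before Tj"?
    for site, elems in o_elements.items():
        txns = sorted(elems)
        for i, a in enumerate(txns):
            for b in txns[i + 1:]:
                before = elems[a] < elems[b]
                if agreed.setdefault((a, b), before) != before:
                    return False             # two sites serialize them differently
    return True

print(order_vectors_consistent({"LDBS1": {"G1": 3, "G2": 7},
                                "LDBS2": {"G1": 12, "G2": 5}}))   # False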
[Figure 4.8. Relationship between Different Serializability Schemes: CSR (conflict serializability) is contained in QSR (quasi serializability), which is contained in 2LSR (two-level serializability).]
4.4.4 Using Knowledge of Component Databases
To this point, we have discussed concurrency control schemes in which no knowledge of the local concurrency control mechanisms was available. If a local database joining the MDBS can make its local concurrency control scheme known to the global manager, this information can be exploited for global concurrency control with few or no additional compromises to local autonomy. In the remainder of this subsection, we look at how the local concurrency control scheme can be exploited to maintain global serializability; timestamp ordering and strict two-phase locking protocols are used to motivate the discussion.

Timestamp Ordering: A timestamp ordering (TO) scheduler orders conflicting operations according to their timestamps. A TO scheduler assigns a unique timestamp, ts(Ti), to each transaction Ti and serializes transactions using the TO rule: if Ti and Tj have conflicting operations, then the operation of Ti is processed earlier than that of Tj if, and only if, ts(Ti) < ts(Tj). In our running example, the problem is caused by the schedule at LDBS2 because of the indirect conflict caused by the local transaction L2. If timestamp ordering were followed at LDBS2, transactions would be assigned timestamps as they were submitted; L2 was the first transaction at LDBS2 and hence would be assigned the lowest timestamp, followed by G1 and G2. The problem caused by the last operation of L2 would not arise at LDBS2, since that operation would have been disallowed by the TO scheduler, the conflicting operations of G1 and G2 having ts(G2) > ts(G1) > ts(L2)

>= k Then { Send (i-k, mypart); Exit }
Else { Receive (i, hispart);
  mypart = mypart + hispart;
  k = k/2;
  If k == 0 Then { Sum = mypart; Send (1, Sum);
    k = 1; Exit } }
If i > 0 Then Receive (i, Sum);
For k = 2*k Do
  If k == 2**n Then Exit
  Else { Send (i+k, Sum); k = k*2 }

Note that all the sends and receives are over one link only. This is because i+k or i-k always differs from i in one bit only, as k is a power of 2 and is subtracted from i = k, k+1, ..., 2k-1 but added to i smaller than k; i.e., adding k turns a 0 bit into a 1, and subtracting k turns a 1 bit into a 0. The above is an example of a logN algorithm: to compute the sum of 2**n numbers, up to n iterations are performed; as with the complexity measures of a hypercube, in such an algorithm the total number of execution cycles for a problem of size N increases with logN. Generally, problems below logN cycles are rather trivial, such as adding two arrays in corresponding positions, which can be done in one cycle if we have enough arithmetic units, while problems much more complex than logN, say with execution cycles increasing with powers of N or even exp(N), are difficult to solve when N is large. Finding logN and similar complexity algorithms is therefore a highly desired goal. We see, however, that though the above problem is conceptually simple (adding up N numbers), the parallel program is, relatively speaking, already not simple. That is, finding algorithms fast enough to execute on parallel machines imposes demanding programming requirements. There are few parallel programming problems that are simple in an all-around sense and still useful.
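For readers who want to run the pattern, the following is a rough, runnable counterpart (a sketch, not from the original text) written with the mpi4py library; it assumes a power-of-two number of ranks and uses hypothetical local values.

from mpi4py import MPI

comm = MPI.COMM_WORLD
i, p = comm.Get_rank(), comm.Get_size()        # p must be a power of two
mypart = float(i + 1)                          # hypothetical local value

# reduction: the upper half of the remaining ranks folds onto the lower half
k = p // 2
while k >= 1 and i < k:
    mypart += comm.recv(source=i + k)
    k //= 2

if i > 0:
    comm.send(mypart, dest=i - k)              # hand the partial sum "down"
    total = comm.recv(source=i - k)            # the result comes back later
else:
    total = mypart                             # rank 0 now holds the full sum

# broadcast: forward the result to ranks j, 2j, 4j, ... positions away
j = 2 * k if i > 0 else 1
while j < p:
    comm.send(total, dest=i + j)
    j *= 2
print(i, total)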
5.3.3 PRAM Algorithms
To produce parallel algorithms, a problem is analyzed in order to break it into independent or near-independent parts that execute separately with limited communication between them. This design process usually makes some assumptions about what data sharing mechanisms are available, and even about the relative cost of different mechanisms. For example, a large set of common algorithms were designed according to the Parallel Random Access Machine (PRAM) model, which assumes that multiple processors execute in step and share a common memory, though each has some local storage like registers or explicitly addressed local memory (i.e., not just a cache, since programs do not access data in a cache using cache addresses but using main memory addresses, so there is no distinct cache address for the program to use). Certain cost figures are assumed for the common classes of instructions, in addition to rules concerning memory read/write operations on the same location from different processors, which require timing constraints to be defined and their costs assumed (more on this topic in the following section). This allows different versions of an algorithm, or the same algorithm under different execution conditions, to be compared in terms of a common cost model, both for the purpose of algorithm op-
timization, and for hardware design to support the execution of frequently needed algorithms. Despite the very specific architecture assumptions, PRAM algorithms are extremely useful for parallel programming in general because most of them can be slightly modified to produce versions for other architectures, as we will see later in a commonly used example. Further, the recursive divide-and-conquer technique frequently used in PRAM algorithm design bridges to some extent the gap between homogeneous and heterogeneous systems: if recursion results in the creation of a number of identical subprograms, these can be run on different processors of a homogeneous system; at the same time, the process of recursion itself can be started on a single processor of a heterogeneous system to spawn off multiple processes; in other words, what is "design" on one becomes "implementation" on the other. With the recursively spawned processes themselves spawning further child processes, logN behaviour is achieved for the initial spawning and for the eventual closing down of processes and returning of results, so that the parallel program's start and end overheads grow only slowly with problem size N. The cost of the algorithm steps themselves does not always behave so nicely, however, because of the replicated execution of overlapping parts on the different processors.

In the previous section we showed a logN program for summing N values on a hypercube machine. To illustrate the design process, we note that

sum(x[1..N]) = sum(x[1..N/2]) + sum(x[N/2+1..N])

which can be spawned to two different processors to perform; each can further divide the work between two processes, and so on, eventually resulting in a program requiring N/2 processes and 2logN clock cycles, with logN add steps and logN send/receive steps. Here we have a simple recursive process that may be actually executed, or may merely be used to design the algorithm.

Let us take a somewhat more complex problem: computing the prefix sums of an array,

s[1] = x[1]
s[i] = s[i-1] + x[i], i = 2..N.

That is, each s contains successively more elements of x. Applying recursion, let us assume that we have already produced the prefix sums of x[1..N/2] and x[N/2+1..N] separately; that is, we already have s[1..N/2] as well as s'[N/2+1..N] with

s'[N/2+1] = x[N/2+1]
s'[i] = s'[i-1] + x[i], i = N/2+2..N.

Clearly, s is produced from s' by adding s[N/2] to every element of s':
s[i] = s'[i] + s[N/2], i = N/2+1..N

This is a simple recursive process, and it also produces a logN algorithm. However, in most cases it is not a good algorithm. On a shared memory system, we would have N/2 processors all trying to read s[N/2] from the same memory location in order to add it to N/2 separate locations; in a hardware implementation, the arithmetic unit producing s[N/2] has to send it to N/2 other arithmetic units, which requires the signal to be amplified to supply a strong enough input to so many recipients (the "large fan-out" problem). On a distributed machine, processor N/2 has to broadcast s[N/2] to N/2 other processors, which can usually be done efficiently by recursively distributing the work in a hypercube structure. In all these situations, however, distributed communication would be preferable to start with: instead of a single source sending to multiple destinations, it would be better to have multiple sources send to multiple neighbouring destinations, with senders and recipients close to each other. Instead of passing information from one section to another all at once, we use a different design in which information is passed a little bit at a time: first add x[i] to x[i+1] for every i (x[i] for i < 1 is assumed to be 0), producing an array containing sums of two; then add each value to the value two steps away, producing sums of four; then add each value to the value four steps away, producing sums of eight; and so on. That is, in each clock cycle we go twice as far as before, to find twice as much information. In a shared memory system with 2**n = N processors we have the following program running on each processor, with index i:

For k = 1 To N/2 Do
  { If i > k Then x[i] = x[i] + x[i-k]
    Else Exit;
    k = 2*k }

while in a distributed machine
For k = 1 To N/2 Do
  { If i+k <= N Then Send (i+k, x);
    If i > k Then { Receive (i, y); x = x + y };
    k = 2*k }

Note that though the send/receive nodes have indexes differing by 1, 2, 4, ..., they are not necessarily separated by one edge in a hypercube, since it depends on whether the 1 digit in k is added to a 0 digit in i.
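For reference, a short sequential simulation (a sketch, not from the original text) of the doubling scheme on ordinary Python lists:

def prefix_sums(x):
    """Simulates the doubling scheme: in step k, every position i >= k
    (0-based) adds in the value k places to its left; on a PRAM each such
    step would be one parallel cycle."""
    s, n, k = list(x), len(x), 1
    while k < n:
        s = [s[i] + s[i - k] if i >= k else s[i] for i in range(n)]
        k *= 2
    return s

print(prefix_sums([1, 2, 3, 4, 5, 6, 7, 8]))   # [1, 3, 6, 10, 15, 21, 28, 36]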
We see that while the recursive program itself may not be practical, it provides us with ideas to produce a better PRAM version, as well as other versions suited to homogeneous, distributed systems, including one that could be used on a hypercube machine. A number of further algorithms can be "derived" from the prefix algorithm. For example, suppose we have a list in which each element has a pointer to the next element only, but we want to know how far each element is from the end of the list, i.e., a list element enumeration. The following is the program:

If Null (next) Then i = 0
Else i = 1;
For k = 1 Do
  { If Null (next) Then Exit
    Else { i = i + next.i;
           next = next.next } }

The further away an element is from the end, the longer it executes before getting the null next pointer, and the larger the accumulated value i becomes. With each iteration, one goes to an element twice as far away as the one reached last time, and obtains its accumulated value i, showing how far that element is from the end; by accumulating this information, each element discovers how far it itself is from the end.

Another example: remove all blanks from a string of text, moving the non-blanks leftward to take up the room. The program is

If x[i] == " " Then v[i] = 0
Else v[i] = 1;
For k = 1 Do
  { If i > k Then v[i] = v[i] + v[i-k]
    Else Exit;
    k = 2*k }
If x[i] != " " Then x[v[i]] = x[i]

That is, since blanks are given the value 0, v[i] will be the sum of all the 1's, i.e., the number of non-blanks, up to and including x[i], and indicates where x[i] will go when all the blanks are removed. Somewhat more elaborate problems like "move all lower case letters to the right end and upper case letters to the left end maintaining the order within each group" or "remove all but one blank from each string of consecutive blanks" are left as an
exercise for the reader. Distributed memory versions of these algorithms are also possible but rather more complex. In fact the overheads of data sharing become so high that one would not consider parallelism worthwhile for such cases at all. Even with such small examples, we already see the divergence between shared memory and distributed systems, just as the gap between homogeneous and heterogeneous parallelism is wide.
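A sequential simulation (a sketch, not from the original text) of the pointer-jumping enumeration above, using the convention that the last element starts at 0 and every other element at 1:

def list_ranks(nxt):
    """nxt[i] is the index of the element after i, or None at the end of the
    list; returns, for each element, its number of hops to the end."""
    n = len(nxt)
    rank = [0 if nxt[i] is None else 1 for i in range(n)]
    nxt = list(nxt)
    while any(j is not None for j in nxt):
        # one synchronous pointer-jumping step: read old values, write new ones
        new_rank = [rank[i] + (rank[nxt[i]] if nxt[i] is not None else 0)
                    for i in range(n)]
        new_nxt = [nxt[nxt[i]] if nxt[i] is not None else None for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank

print(list_ranks([1, 2, 3, None]))   # [3, 2, 1, 0]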
5.4 Memory Consistency

5.4.1 Shared Memory System
In a shared memory system, two processors can attempt to read/write the same memory location concurrently. This raises the question: what are the timing constraints between the concurrent accesses, and how are these constraints enforced? It might seem simple to say "first come, first served", but things are not so simple. First, processors have caches, and an item of shared data would have multiple copies being accessed separately at any time; what timing constraints does the system enforce between these copies? Second, a piece of data may need to be processed in a sequence of steps before the final result is used by others, and it would not be useful, and indeed would be wasteful or harmful, to let others see the intermediate results. Finally, the processing may involve several related items of data, which must either be seen together or not seen at all, i.e., they must be put into a consistent state before they are shared. Therefore, the write results of a processor may be hidden from other processors until a convenient moment when the new data are allowed to become visible; in the mean time, the other processors see only old data unaffected by the new writes. But there must be some clearly defined rules about when the results become available; in other words, we need system-enforced timing constraints between the concurrent read/write operations. These rules constitute the memory consistency model.

Virtually all shared memory machines make it possible to provide the sequential consistency model at the hardware level: a read by any processor sees the results of the writes of all the processors up to that point in time, regardless of where the reads and the writes occur. This is achieved by enforcing a cache coherence protocol between the processors, e.g.,

1. Write Broadcast: Whenever a processor writes into its own cache, the same information is sent to all other processors so that, if they have a copy of the same data, their copy is modified.

2. Writethrough Invalidate: When a processor writes into its own cache, it asks all other processors sharing the data to delete their copy, so that if they access the data in the future, the cache hardware will obtain the new copy. Note that this avoids the wastage of updating copies that may never be used, or of updating repeatedly when only the last value is read, but if a processor does use the new result, it pays the cost of a cache miss. Also, the method assumes a
write-through cache, i.e., new cache values are immediately transmitted to memory, so that the processors with invalidated copies can get new copies from the memory.

3. Writeback Invalidate: A cache write invalidates other copies without immediately generating a memory write, but when another processor has a cache miss on the same data, the processor holding the modified copy prevents a memory access on that address until the copy has been written back, so that the new copy will be retrieved by the other processor. This requires each processor's cache coherence hardware to watch out for reads by other processors and to determine from the address whether a read is on a memory location it has modified but not yet written back. In contrast, in method 2 each processor only has to watch out for writes by other processors and determine whether these invalidate something in its own cache.

We see that in the first method, new information is "eagerly" sent to others, while the last method "lazily" keeps the information until it is actually needed, and the second method tries something in between. Many other minor variations are possible, all resulting in the same behaviour of sequential consistency but with different cost behaviours, each of which might be best for a particular program behaviour. Overall, the Writethrough Invalidate protocol is probably the most common.

Sequential consistency is sometimes described as a "strong" memory model, though actually the "strength" is quite limited: the system cannot guarantee that the write results seen by your read are from the "right" writes. The particular write you want may not have occurred yet because another processor was slow, or a write you do not want to see may already have occurred because another processor happens to be fast and has overwritten the content you wanted to read. Also, past writes may or may not have occurred in the right order because of unpredictable timing differences between the processors. The cache coherence hardware has no control over how fast or slowly the processors execute; the guarantee is merely that you do see all the writes that have occurred so far, i.e., that the other processors' changes are not hidden from you in their caches. However, once we have cache coherence and sequential consistency, we can provide a stronger guarantee by implementing locks and atomic regions. For example, if reads and writes take place in a critical region, then if you attempt to read something being written at around the same time, but just before the write has occurred, you will see the new value, because your read will be delayed until the writer exits to allow you to enter your critical region - though this is still only a weak guarantee.

Another possibility is to adopt synchronous processing: all the processors use a common clock to drive their execution in a series of individual time periods, and at the end of each period the writes performed within the period are shared, so that they are guaranteed to be seen by other processors from the next period onward. In other words, the consistency model only guarantees that a read will see all the writes up to the end of the previous step, not the writes within the current step. In a cache system, this requires all writes to be retained within the individual caches until the end of each synchronized step, at which point memory updates take place system wide. This is called the weak consistency model, designed for synchronous systems working in so-called "supersteps"
of compute-communicate-compute-communicate... The reason sequential consistency is "not strong" lies in the fact that data are identified by their memory addresses, which contain no information about the state of the content; a stronger memory consistency requires read operations to describe the data they want in a more specific fashion, but this implies some form of search and verification process, with read and write operations matching data supplies to the corresponding data demands. This makes read/write operations more costly; hence, a compromise must be struck whereby search and verification is done for blocks of data rather than individual locations. Pages of memory are natural units for such memory sharing, but it is also possible to share and move around whole objects. By defining a set of rules on how memory content is shared and when copying over takes place, we define a new consistency model. Given a particular memory consistency model, we write our parallel programs in a particular way to ensure correct results on a system offering that model; moving to another system offering a different model would normally require reprogramming. Multiple consistency models can be offered on the same hardware platform, in order to try different versions of the same algorithm and discover the one with the best performance. Until a single consistency model is widely accepted as the standard, there is no real program portability.
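As a toy illustration of the invalidate idea (a sketch, not from the original text; the classes are hypothetical and ignore real timing and bus traffic), two cache objects can share one memory dictionary:

class Cache:
    """A toy write-through, invalidate-on-write cache sharing one memory."""
    def __init__(self, memory):
        self.memory, self.lines, self.peers = memory, {}, []

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch the current value
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value             # write through to memory
        for peer in self.peers:               # ask sharers to drop their copy
            peer.lines.pop(addr, None)

memory = {0: 10}
a, b = Cache(memory), Cache(memory)
a.peers, b.peers = [b], [a]
print(b.read(0))      # 10, now cached in b
a.write(0, 42)        # invalidates b's copy and updates memory
print(b.read(0))      # 42, re-fetched after the invalidation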
5.4.2 Tuplespace
The Linda tuplespace scheme provides a stronger memory model than sequential consistency. Tuples are multi-field records emitted by executing processes into a shared memory pool, and may be retrieved by presenting a matching search key. To produce a tuple, a process executes

Out (expr1, expr2, ..., exprN)

where each exprX is any expression producing a value, and the N values expr1 to exprN are stored together as a record in the shared space, while

In (expr1, ..., exprM, ?var1, ?var2, ..., ?varN-M)
Rd (expr1, ..., exprM, ?var1, ?var2, ..., ?varN-M)

retrieves a tuple containing the M values on the left and then stores the remaining N-M values into the variables var1 to varN-M. An In removes the tuple, while a Rd reads its content without removing it, so that other processes can access the same tuple subsequently. Tuplespaces are easily implemented on shared memory systems by building a monitor that manages the tuple pool. A process calls the In/Rd/Out functions in the monitor to search or modify the pool, with In/Rd causing a search for an existing tuple containing a given set of values, which make up the search key; if no matching
tuple is found in the pool, the tuple request is added to the pool and the process is suspended. For an Out operation, if a matching request is found in the pool, then the waiting process is resumed after the data on the right of the new tuple have been copied to the variables listed on the right of the In/Rd request; otherwise the new tuple is simply added to the pool to wait for future requests. After either alternative the process exits from the Out and returns to its previous execution sequence. An In/Rd specifies the key of a tuple it requires, and waits for the tuple when the Out comes later than the corresponding In/Rd. The state of the data can be included as part of the search key, so that an In/Rd specifying a later state succeeds after an In/Rd that specifies an earlier state, regardless of the execution time of each request. In other words, the tuple scheme provides stronger timing constraints on related memory accesses. This simplifies application programming, but requires more costly system support to manage the tuples.

In previous implementations, the number of tuples in a pool is usually variable; hence, the pool has to be organized as a hash table, with searching carried out in hash buckets. If the number of tuples is fixed instead, each can be given a specific location and the address used as part of the search key, with data state information given as additional key fields to be matched, so that a tuple retrieval is successful only if the tuple at the addressed location has the particular wanted content. That is, an In/Rd operation now requires a content match rather than a hash bucket search and is generally faster. However, a number of unsatisfied requests may be queued at a tuple location waiting for its content to change, so an Out operation might require a search of the queued requests, and it is necessary to avoid "overusing" a single tuple by sharing it among many concurrent users, as this might produce inefficiency. More desirably, a tuple should provide a "bilateral" form of sharing with a single producer and a single consumer, with the tuple's buffering smoothing out unpredictable timing between the two parties. Fixed-location tuples can be made to behave like locks: modifying the content of a tuple causes other processes to suspend or resume depending on their key match success, and normally memory is partitioned into separate areas for the different processes plus the tuple pool, perhaps with a data exchange area controlled by the relevant tuple locks. If each tuple operation is followed by a number of data accesses on the exchange area, then the tuple access overhead is less significant relative to the total volume of access. In the following two sections we shall see a number of examples using tuples in this way. But first, some more discussion of memory consistency.
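As a concrete illustration of the monitor-based implementation just described, here is a minimal Python sketch (not from the original text; the class and method names are hypothetical) in which the pool is protected by a condition variable and keys are matched against tuple prefixes.

import threading

class TupleSpace:
    """Out adds a tuple; In removes a matching tuple, blocking until one
    exists; Rd reads a matching tuple without removing it."""
    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *values):
        with self._cond:
            self._tuples.append(tuple(values))
            self._cond.notify_all()

    def _find(self, key):
        for t in self._tuples:
            if len(t) >= len(key) and t[:len(key)] == key:
                return t
        return None

    def in_(self, *key):                    # 'in' is a Python keyword
        with self._cond:
            t = self._find(key)
            while t is None:
                self._cond.wait()
                t = self._find(key)
            self._tuples.remove(t)
            return t[len(key):]             # the ?var fields

    def rd(self, *key):
        with self._cond:
            t = self._find(key)
            while t is None:
                self._cond.wait()
                t = self._find(key)
            return t[len(key):]

ts = TupleSpace()
ts.out("counter", 0)
(value,) = ts.in_("counter")                # removes the tuple; blocks if absent
ts.out("counter", value + 1)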
5.4.3 Distributed Processing
As explained in subsection 1.2, a distributed system may allow a processor to copy over a block of remote memory through the system interconnect, or to invoke a message transfer to achieve the same effect. The two methods cause multiple copies of the same information to be created in different nodes and need to meet the same memory consistency requirements. However, there are some significant differences. As explained earlier, by acquiring a lock and entering a critical region or moni-
tor, a concurrent process can verify that a shared data block is in the correct state before accessing its content. A shared memory machine with cache coherence support can implement such locks correctly, usually through Test and Set instructions executed by the concurrent processes on a shared lock variable. But if any processor can obtain a copy of the lock variable without going through a coherence maintenance procedure, then critical regions cannot be implemented correctly. That is, while cache coherence hardware enforces sequential consistency between multiple copies of data in caches, multiple copies in distributed memories require other mechanisms to maintain consistency. To illustrate: while processor A copies a block of memory containing the lock variable x and finds it "Open", another processor B may have done the same operation; after making its own copy, it too would find the lock variable Open, but it would be incorrect for both of them to set the value to "Closed" and then copy the blocks back, since this would cause both to be in their critical regions at the same time. Unlike the Test and Set instruction, which copies out the previous value and stores a new value into the lock variable in one step, in a distributed system the copying over, testing of the previous value and returning of the new value are separate steps, allowing an inconsistent operation from another processor to occur in between.

The solution lies in some form of system support, similar to a tuplespace-style In operation which copies over a block of memory while at the same time removing it from access by others, and turning accessibility back on only after the lock variable has been reset to "Closed". More commonly, locking/unlocking is done by performing a remote supervisor call on the processor whose memory holds the lock variable. If a number of such calls arrive at a processor, they are responded to one at a time, each resulting in an atomic sequence of steps, namely "read the lock value; if Open then set it to Closed; else...", resulting in consistent operations between the different processors. The equivalent solution is message passing: a processor requiring data on another processor, including operations on a shared lock, sends a request message and waits for a reply; if a number of competing messages arrive from different processors, they are picked up one at a time, and the complete, consistent processing steps needed by each request are carried out before the next message is handled. A processor may not respond to a request immediately, since it does not usually simply stay idle while waiting for messages from others; instead, it has work waiting in its task queue and tries to keep busy all the time. To make sure that messages get quick attention, the processor has to be interrupted, either by a communication interface interrupt mechanism, or by setting the interval timer to make the processor regularly look around for other matters to attend to. In either case, interrupts are disabled during an earlier interrupt, so that the earlier message is processed to completion before the next interrupt is allowed. In this way, a locking operation is executed as an atomic sequence during which other requests are turned off.

There is much system cost arising from the need to maintain correct memory consistency between multiple copies of the same data, with programs inserting
the relevant locking and other synchronization steps before and after accesses on shared data, as well as in copying over the shared data and in returning them after modification. Normally, one would try to minimize overhead by careful planning so that remote copying is done in blocks which are neither too big nor too small, and the same block of local memory is reused for different remote data at different times. That is, increased programming effort is needed to maintain efficiency.
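A minimal sketch (not from the original text; the class, queue layout and message names are hypothetical) of the "remote supervisor call" idea: the node holding the lock variable processes one request message at a time, so reading and setting the lock is an atomic sequence.

import threading, queue

class LockHome(threading.Thread):
    """The node that owns the lock variable: requests arrive as messages and
    are handled strictly one at a time, making read-and-set atomic."""
    def __init__(self):
        super().__init__(daemon=True)
        self.mailbox = queue.Queue()          # stands in for the interconnect
        self.open = True
        self.waiters = []

    def run(self):
        while True:
            kind, reply = self.mailbox.get()  # one message at a time
            if kind == "lock":
                if self.open:
                    self.open = False
                    reply.put("granted")
                else:
                    self.waiters.append(reply)
            elif kind == "unlock":
                if self.waiters:
                    self.waiters.pop(0).put("granted")
                else:
                    self.open = True

home = LockHome(); home.start()

def remote_lock(home):                        # what a client processor does
    reply = queue.Queue()
    home.mailbox.put(("lock", reply))
    reply.get()                               # blocks until the lock is granted

def remote_unlock(home):
    home.mailbox.put(("unlock", None))

remote_lock(home)        # the critical region would go here
remote_unlock(home)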
5.4.4 Distributed Shared Memory
In a distributed shared memory (DSM) system, a large virtual space is defined and its pages may be mapped to physical pages on different machines. Multiple programs executing at the same time are simply allocated different parts of this virtual space, and multiple threads within the same program can execute on different processors, each mapping the virtual pages the thread uses into its local physical memory, possibly including multiple copies of the same virtual page. Explicit memory-to-memory copying and message passing are not required, and programming is simpler. A program may be dispatched to any processor to execute, and we can consider such possibilities as restarting a crashed program on another node using information saved at an earlier checkpoint, or stopping a program and moving it to another, less busy node to achieve better system throughput. With these advantages, it is understandable that distributed shared memory is a highly attractive idea.

But an efficient memory consistency system is critical, because multiple copies of the same information are now used more commonly: simply by referring to a variable that another process also refers to, we cause a second copy of the page containing that variable to be loaded into our processor's memory. Not only must consistency be achieved, it will also be invoked frequently. This is where the idea of lazy consistency comes in: we try to delay the work needed to achieve consistency as much as possible. When one processor changes its copy, others may not have to know, because they may not need the new value, or may need only the final result after a set of changes. Hence, new memory content need only be delivered to a processor when it actually declares its need for a piece of shared information by requesting a lock. A lock request is broadcast to all processors that share the same information and may have changed it. If no one is sharing the information or no one has changed it, then nothing in particular needs to be done, since the copy the processor already has is still correct. But if one or more processors have a modified copy, then the "lock successful" reply also tells the requesting processor to invalidate its own page containing the shared information. When the processor does access the information, a page fault occurs, causing the new content of the page to be delivered to it after all the changes made by the other processors have been incorporated. In turn, when some other processor tries to lock the information, this processor will have to invalidate its own copy if it has made a change during its locked processing.

The merits of this method are, first, that memory content transfers between proces-
sors are kept to the minimum necessary and, second, that it makes use of the normal virtual memory support mechanisms. There is, however, some debate about various details. For example, how much information should be broadcast with a lock request? By specifying which page, or even which particular address, a processor wishes to lock, one minimizes false sharing, where modified pages are forwarded even though the recipient uses something else in the page, but one increases communication and status-recording costs. Also, should each shared page have an "owner" that holds its complete new content, or should all the users have separate copies with separately made changes until someone locks the page, at which point all the changes are sent to the locker by the other users individually, so that it can work out the latest complete content, which will then be passed on to the next user that locks it? Because of the various possibilities, there are many slightly different DSM systems, including some that allow a program to choose a combination of features to optimize the performance of a particular run. Further, though programming on a DSM system largely proceeds like programming on a shared memory system, shared data have multiple copies and not all program behaviours are identical, as some examples in the next section will show. Furthermore, one needs some awareness of the way the system handles the shared data operations of a program in order to maximize efficiency. Distributed and shared memory systems remain different systems. DSM programming is also likely to be less efficient than the equivalent distributed memory algorithms, which minimize memory usage by re-using the same memory block and the send/receive buffers for different data needed at different times, and by grouping transmitted data into blocks of the right sizes, at the cost of greater programming complexity. While DSM "papers over" the differences, it does not provide the architecture independence we would like to enjoy. In later sections where tuple programming examples are shown, we shall see that each system architecture requires slightly different code.
5.5 Tuple Locks
5.5.1 The Need for Better Locks
In the previous chapter we discussed weak memory consistency models. With the relaxation of the consistency requirements, the implementation cost of distributed shared memory is reduced, and efficient DSM systems running on clusters can behave in ways that are close to shared memory systems. Concurrency constructs like monitors and semaphores can be built on top of simple locks to allow shared memory application algorithms to be ported over. However, weaker memory consistency imposes its own requirements on program sequentiality: programs need to carry out certain synchronization and locking steps before and after accesses on shared memory, producing effects that are not quite identical to those of true shared memory systems. Such steps affect the structures of the programs and may make them less efficient, compared to what ought to be possible
if we could take full advantage of the memory sharing behaviour. The memory sharing and locking requirements also link closely with the heterogeneous concurrency ideas, and thus make DSM systems less suited to homogeneous machines and applications. We shall then see that a stronger memory consistency model based on more content laden locks improves matters on both fronts. Locks may be very basic in form; for example, the system may provide just a simple lock command with no call parameters; that is, the lock provides no information on what is being locked nor the particular reason for locking it, in fact no information to choose any particular lock, so that the same lock is used for all critical regions, rather than different locks for different regions. Process X being inside region A would then exclude process Y from entering region B, even though there is no conflict. In other cases, parameters are provided in the lock request but the information available may be insufficient to determine if a conflict exists, such as two processes wishing to lock the same page, though they use different data in the page and it is in fact safe for them to execute at the same time, i.e., false sharing. In some memory models locks are required to do more and must specify the particular item of shared information involved, or the compiler might work this out for itself from the execution region and anticipated runtime conditions, resulting in new forms of consistency such as entry consistency, scope consistency and location consistency. Each of these techniques reduces false sharing, but generates overheads in requiring the system to maintain different lock indicators as well as their current lock holding conditions. The desirable goal is to get high benefit from the more sophisticated locks while minimizing the cost. To illustrate that the program sequences imposed by the consistency models through the locking requirements may introduce process inefficiencies, consider the following example: given a process P1 which repeatedly writes into a shared location X for process P2 to read, we have
P1                  P2
lock                lock
write X             read X
unlock              unlock
lock
write X
unlock

But we know that if we are running on a DSM system, then a second copy of X would have been given to P2 when it acquired the lock, while P1 still has its own copy, so that there is actually no need for P1 to wait for P2 to finish reading before starting a new write into X. Yet it is necessary for P1 to go through the
lock/write/unlock process to ensure that P2 does not start reading the next value until P1 has finished writing it. The program sequencing imposed by the lock hampers programmability. The use of more complex synchronization constructs, such as named locks, semaphores and barriers, can improve programmability. For example, we can have
P1                  P2
write X             read X
      2-process barrier
write X             read X

where, with each barrier, as in the case of a lock, the modifications made by P1 on its copy of the shared page are given to P2, so that P2 can read the previously written value on its copy while P1 proceeds to make another modification on its own copy, which will be seen by P2 after the next barrier. Hence, the use of a more complex synchronization operation has simplified programming. This is the arrangement for problems in which the two processes wish to share every value of X, which is achieved by the synchronized relation between P1 and P2: the process that finishes first must wait for the other process to finish before both can start a new step together. The synchronized processing would also be correct for multiple consumers in which every new data page must be seen by every consumer, while another new page is being produced in the meantime. However, if we are programming something like a web server, then the data consumers do not necessarily want to access every new value produced, but only the latest value at the time a read is done. Further, the number of consumers is variable as they come and go, and it would not be correct to require all the processes to pass a barrier after the supply of each new value. The system can provide a way of defining varying subgroups of processes as synchronization groups, so that a barrier need not include all processes but only those that are taking part in sharing a particular item of data. However, though the method could work for some problems, generally things are not so predictable and there is a risk of program errors such as a process forgetting to take part in a synchronization, or being late in doing so, preventing the barrier function from completing and stopping others from passing the barrier. That is, the above arrangement requires each participant in a synchronization group to refrain from starting the next read/write until the whole group is ready, and to "stick around" until the whole group is done; they cannot start/end any time they like. This complicates as well as delays individual programs. But a simple mutual exclusion is also not the right structure, as usually multiple readers can read concurrently as long as the data are not currently being written. To achieve the ability to come and go as one chooses in web-like behaviour, one requires different kinds of read and write locks, such that a write lock can exclude subsequent reads until a write unlock, but a read lock does not exclude other readers, who can join
in to read any time they like as long as the page is not being changed. This is the well known reader-writer problem from operating systems courses. It is therefore desirable that a lock specifies "why" as well as "what" in order to simplify programming and reduce inefficiency in the sharing processes, since a successful lock indicates that certain conditions for execution are met, but at the cost of increased lock complexity. Consider another example. Suppose X is a dynamic data structure such as a queue or stack. After a process acquires the lock guarding X, it might discover that X's condition (e.g., stack empty) does not permit the operation it wants to perform (in this case a pop). The process is therefore forced to release the lock in the hope that some other process will rectify the condition, but it, as well as any other processes wishing to carry out the same operation, will have to repeatedly enter the same critical region and obtain new copies of X while waiting for the rectification, just to see whether the situation has changed. This repeated entry/exit can, however, be avoided if we use a more complex lock arrangement such as a second lock or a semaphore. This will be further discussed below. What we really need is a kind of conditional lock, which is successfully acquired only if a specific requirement is met. So the direction we are moving in seems to be:

(a) a lock acquire specifies what data to lock;
(b) a lock acquire specifies what action is intended;
(c) a lock acquire specifies the condition for the lock to succeed;

and it would seem logical to go a step further and have

(d) a lock acquire also brings over the data being locked, or alternatively invalidates the existing copy so that new pages would be fetched as required,

which produces something akin to the idea of tuplespace. In other words, programmability can be improved by the adoption of more sophisticated locks, which however tends to increase implementation cost. In the present study we suggest that, by imposing some restrictive structures on the tuplespace, we can produce a system for sharing data with significantly enhanced programmability at a modest cost, contrary to previous experience that tuplespaces are expensive to implement in a distributed environment: by "tying down" tuples to specific virtual addresses and replacing a search process by a matching process, tuple operations become little more complex than simple lock - read/write - unlock operations.
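As an illustration of point (c), a conditional lock can be built from conventional shared memory primitives; the sketch below is ours, using POSIX threads, and the guarded_stack type and function names are assumptions for illustration only. It acquires the lock guarding a stack only when the stack is non-empty, so a popping process never has to loop in and out of the critical region just to re-test the condition:

    #include <pthread.h>

    struct guarded_stack {
        pthread_mutex_t m;
        pthread_cond_t  changed;
        int top;                   /* number of elements currently held */
        int data[128];
    };

    /* Acquire the lock only once the stack is non-empty, then pop. */
    int pop_when_available(struct guarded_stack *s)
    {
        pthread_mutex_lock(&s->m);
        while (s->top == 0)                    /* the condition for success */
            pthread_cond_wait(&s->changed, &s->m);
        int v = s->data[--s->top];
        pthread_mutex_unlock(&s->m);
        return v;
    }

    void push(struct guarded_stack *s, int v)
    {
        pthread_mutex_lock(&s->m);
        s->data[s->top++] = v;
        pthread_cond_signal(&s->changed);      /* wake one waiting popper */
        pthread_mutex_unlock(&s->m);
    }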
5.5.2 Tuple Locks
Two shortcomings of the tuplespace methods are that the search process is time consuming, especially in a large space, and that there are at present no simple rules governing the scope of the tuple operations, defining the visibility of one process's emitted tuples to another process, which would help to reduce the search space. The present proposal is to replace the search process by a matching process: tuples are fixed in number, location and field attributes, and are not dynamically created or destroyed, but merely change in content with tuple operations performed by processes. Further, they are mainly used to control access to blocks of data, which would otherwise require a large set of tuples and costly tuple operations. A process declares one or more "buckets", each guarded by one or more tuple locks. Locking and unlocking are both achieved by a RdTuple operation:

RdTuple (BucketID.TupleID: match | modify, ...)

with the alternative format

RdTuple (BucketID.TupleID: | modify | match, ...)

i.e., change the current tuple content first and then wait for or retrieve some expected value, which may be produced later by other processes. It is also possible to have modify without match, or match without modify. When match|modify appears in one field and |modify|match in another, the modify operations are synchronized, with the matches on the left occurring earlier and those on the right taking place later. To specify a match condition, the value currently in the tuple field is denoted by ?, and the match may be any Boolean expression involving ?, e.g.,

RdTuple (Bucket.Tuple: ?==x, ...)

succeeds if the value in the tuple equals the search key x. A simpler way to denote this is

RdTuple (Bucket.Tuple: x, ...)

and x may be an expression instead of a variable or constant. If multiple match conditions are given for several tuple fields, RdTuple succeeds only if all of them are true. If a successful match is followed by a |modify, then the same RdTuple operation, performed by the same or another process, would no longer succeed, until some other RdTuple operation changes the tuple fields being matched back to their original values. This corresponds to a lock/unlock. For example, a simple binary lock contains just a T/F value, and locking is achieved by

RdTuple (BucketID.TupleID: T | =F)
which succeeds if the tuple currently contains T (the match) and changes its content to F (the modify), so that another process trying to perform the same operation will suspend until some process executes

RdTuple (BucketID.TupleID: F | =T)

to change the value back to T. This is a simple lock and unlock. The modify part may contain any operator expression (e.g., +1) instead of an assignment (i.e., =?+1). A match specification ?var retrieves a value from a tuple field into a local variable as before. Note that ?var can have a modify on its left or right. Two operations involve copying the bucket guarded by a tuple:

InBucket (BucketID.TupleID: match | modify, ...)

causes the bucket to be copied to the local memory (which could be achieved on a DSM system by invalidating any current copy so that a new copy would be fetched upon access) if the tuple match fields equate. Similarly,

OutBucket (BucketID.TupleID: match | modify, ...)

makes the modified local copy visible to others, again provided that the tuple lock contains matching values, which indicates that the shared bucket is "write enabled". The Out could modify the tuple content to indicate that the new bucket content is now "read enabled". That is, by modifying the content of the tuple in a Rd/In, one prevents other processes from retrieving the bucket using the same match key, until an OutBucket or some other tuple operation restores the matching values, indicating that the condition to release the data to others is met. OutBucket may mean "overwrite existing content" or "incorporate my changes along with changes made by others", which is an implementation detail varying between systems. Details like this affect program behaviour, and we shall from time to time present different code versions for different assumptions of platform and implementation details. However, the ultimate objective is to standardize on a particular system behaviour regardless of whether the platform is distributed or shared memory, as will be shown in the final part of this work.
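On a single shared memory node, the match|modify pair of a one-field tuple can be realised with a mutex and a condition variable; the sketch below is ours (the rd_tuple name and the use of function pointers for the match and modify parts are assumptions, not the chapter's actual runtime interface):

    #include <pthread.h>

    struct tuple_field {
        pthread_mutex_t m;
        pthread_cond_t  changed;
        int value;
    };

    void rd_tuple(struct tuple_field *t,
                  int (*match)(int current),    /* e.g. current == T; may be NULL */
                  int (*modify)(int current))   /* e.g. return F;     may be NULL */
    {
        pthread_mutex_lock(&t->m);
        while (match && !match(t->value))        /* wait for the match part   */
            pthread_cond_wait(&t->changed, &t->m);
        if (modify) {
            t->value = modify(t->value);         /* apply the modify part     */
            pthread_cond_broadcast(&t->changed); /* let other waiters re-test */
        }
        pthread_mutex_unlock(&t->m);
    }

With suitable small helper functions, the binary lock above becomes rd_tuple(&t, is_T, set_F) to lock and rd_tuple(&t, NULL, set_T) to unlock; on a distributed system the same test-and-modify would instead be performed by the node holding the home tuple on behalf of the requester.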
5.5.3 Using Tuple Locks
For our first example, consider the multiple read/exclusive write problem in a distributed system. We require a tuple with two fields: a Boolean read-enable flag and a reader count. We assume a distributed system in which the writer can update its own copy at the same time as existing readers read their copies, but must prevent later readers/writers from starting until a new, consistent copy has been released; hence we have:
Readers:    InBucket (BucketID.TupleID: T, | +1)
            ..read..
            RdTuple (BucketID.TupleID: -, | -1)

Writer:     InBucket (BucketID.TupleID: T | =F, -)
            ..write..
            OutBucket (BucketID.TupleID: | =T, 0)
Note that each reader increments the reader count by 1 before reading, and decrements it after; the writer can start writing even when there are readers reading, but would not put back the new copy until the reader count goes to 0. (- indicates we do not care about the current value or already know it.) Multiple writers can also be accommodated by replacing the read enable flag by the number of writers:

Readers:
            InBucket (Bucket.Tuple: 0, | +1)
            RdTuple (Bucket.Tuple: -, | -1)

Writers:    InBucket (Bucket.Tuple: | +1, -)
            OutBucket (Bucket.Tuple: | -1, 0)
with the assumption that they write to different parts of the bucket so that no conflict arises. Note also that the arrangement has the problem that the readers will suffer starvation if new writers keep starting before earlier writers end. Tuple locks can also be implemented on shared memory systems, but the programs generally cannot be ported between shared memory and DSM systems without change; for example, concurrent read and write would not work, and in the reader/writer problem the writer must wait for the reader count to go to 0 before starting to write:

RdTuple (Bucket.Tuple: | +1, 0)
..write..
RdTuple (Bucket.Tuple: | -1, -)

Note that because the bucket in the shared memory is accessible to all processes, we can use RdTuple rather than InBucket and OutBucket, which would bring back/send out a local copy rather than use a common shared block, and this is not required for the shared memory reader/writer problem. Comparing a distributed system with DSM, the program behaviour is different, because the local copy is obtained at once in whole, rather than the actually used pages in a bucket being transferred upon demand. To keep things efficient, buckets are usually smaller, and may be used as buffer spaces for different purposes as execution proceeds. An example approximating the single reader multiple writer problem is Gaussian elimination with each process taking care of one row of the coefficient matrix: the
first row is not changed in the elimination, while the Nth row is modified N-1 times; in fact the process in charge of row 1 does no computation until the back substitution stage, completing last since x[1] cannot be computed until x[2..N] are available. The processes that manage the lower rows perform progressively more work, but finish earlier the lower the row. Hence, the N processes are not well synchronized, and constitute an asynchronous structure with each process i providing the ith pivot row as writer in iteration i, and processes i+1 to N reading the pivot row, during the elimination stage, and then in the reverse order during the back substitution stage, with each process i writing x[i] into the shared location so that processes 1 to i-1 can act as readers to subtract x[i] from their right hand sides. For a multiple writer example, consider the producer/consumer problem with the bucket containing a circular buffer guarded by two tuples, which contain the head and tail pointers respectively, in addition to the queue/dequeue enable flags and empty/filled buffer counts. To proceed:

Consumer -  InBucket (Queue.Head: T | =F, ?Filled, ?CIndex)
            remove Bucket[CIndex]
            OutBucket (Queue.Head: | =Filled>1, | -1, | +1 mod N)
            RdTuple (Queue.Tail: | =T, | +1, -)

Producer -  InBucket (Queue.Tail: T | =F, ?Empty, ?PIndex)
            add Bucket[PIndex]
            OutBucket (Queue.Tail: | =Empty>1, | -1, | +1 mod N)
            RdTuple (Queue.Head: | =T, | +1, -)

Assuming that there are both filled and empty buffers, one producer and one consumer can both be modifying the bucket, but while they are doing so, the enable flags are F and so other producers/consumers must wait. Afterwards, the consumer decrements the filled buffer count and increments the empty buffer count, while the producer does the reverse. Each forwards its pointer by 1. If the buffer used is the last empty/filled buffer, then the queue/dequeue enable flag in the relevant tuple remains F until a new empty/filled buffer is created by another process. The program assumes that concurrent OutBucket operations are processed by incorporating the changes each makes into the home bucket, rather than by a simple overwrite. In a shared memory system RdTuple would be used in place of InBucket/OutBucket. The use of tuple locks and buckets produces a number of advantages:

1. Absence of false sharing: even when two small buckets share a page, the tuple operations indicate which buckets are required, so that if the wanted bucket has not been modified, the current copy of the page would not be invalidated even if other buckets have changed.

2. Absence of false exclusion: concurrent reads/writes on the same bucket are possible with appropriate locking, and reads/writes on different buckets would not
compete for the same lock.

3. Prefetching: by appropriately structuring bucket contents, one can cause related items to be fetched together, so that access on one part prefetches others. A bucket can contain one or more objects, or in reverse, an object can enclose several tuples guarding related buckets.

4. Scoping rules: buckets and their guard tuples are declared like other data structures, and normal scoping rules apply. Where multiple readers and writers share a bucket, they go through the locking operations, so that the process managing the home bucket need only broadcast invalidate signals lazily to the actual users of the bucket. Hence, there are simple rules on both the scope of visibility and the scope of sharing.

5. Limited memory sharing support: only tuples and buckets are shared between processes, limiting the system resources devoted to the purpose. However, in order to produce identical program behaviour on shared memory and distributed systems, bucket duplication may be needed in certain programs to run on shared memory systems; this too will be discussed in the final part of the article.
5.5.4 Bucket Location
A bucket and its tuples may be defined globally, where it is visible to all the processes which will share it according to the scope rules, e.g.,

Bucket Queue:
    Tuple Head : Boolean, Integer, Integer;
    Tuple Tail : Boolean, Integer, Integer;
    Array[0..N-1] of QElement;
End;

Procedure Consumer (Integer)
    ...
End;

Procedure Producer (Integer)
    ...
End;

For i = 1 To n Exec Consumer (i);
For i = 1 To m Exec Producer (i);
On a shared memory system, the producers and consumers can directly refer to the bucket Queue and the tuples Queue.Head and Queue.Tail, as well as its content, a simple array which does not require a separate name; each element is referred to as Queue[Index]. On a distributed machine, each consumer/producer must define a local duplicate of Queue
Duplicate Queue;
to allocate local storage for the bucket, and to copy the home bucket content over with InBucket and return modified content with OutBucket. On a DSM system both are achieved simply by RdTuple, which causes any changes made to the local copy (diffs) to be incorporated into the home bucket, and, if the local copy is not identical to the home bucket, the invalidation of the local copy so that future accesses would cause the home bucket content to be transferred over. Note that on a shared memory system the Duplicate declaration causes no local space allocation, since the home bucket is already accessible. (In subsection 8.4 we will discuss situations where a separate copy is needed.) Buckets may also be locally defined as a group, with each process holding a separate one identified by the process ID:

Process Bucket A[];
    Tuple ..
    Array [1..N] ..

A process may refer to another process's bucket/tuple by

RdTuple (A[processID].tupleID: ...)

and use InBucket/OutBucket to transfer bucket contents. It refers to its own bucket by the bucket name alone; the meaning of A[i] is therefore context dependent: within the bucket identification part of bucket/tuple operations it means bucket A of process i, while elsewhere it means the ith element of array bucket A.
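On a distributed platform the Duplicate/InBucket/OutBucket pattern amounts to keeping a private copy and copying whole bucket contents to and from the home node under the protection of the guard tuple. A rough shared memory analogue is sketched below (ours; the struct and function names are assumptions, and the guard tuple is reduced to a plain mutex):

    #include <string.h>
    #include <pthread.h>

    #define BUCKET_BYTES 4096

    struct bucket {
        pthread_mutex_t guard;       /* stands in for the guard tuple   */
        char data[BUCKET_BYTES];     /* home copy of the bucket content */
    };

    void in_bucket(struct bucket *home, char *local)        /* like InBucket  */
    {
        pthread_mutex_lock(&home->guard);
        memcpy(local, home->data, BUCKET_BYTES);
        pthread_mutex_unlock(&home->guard);
    }

    void out_bucket(struct bucket *home, const char *local) /* like OutBucket */
    {
        pthread_mutex_lock(&home->guard);
        memcpy(home->data, local, BUCKET_BYTES);  /* simple overwrite, not a diff merge */
        pthread_mutex_unlock(&home->guard);
    }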
5.5.5 Homogeneous Systems
For homogeneous parallel systems that go through a set of compute-communicate supersteps, tuple locks can be used in the communicate step. For example

...compute...
OutBucket (Bucket.Tuple: | +1)
InBucket (Bucket.Tuple: N)
...compute...
with OutBucket causing incorporation of the process's changes (diffs) into the home bucket. That is, each process contributes its results to the shared bucket and increments the arrive count, and when all N processes have arrived, the InBucket succeeds to give everyone the collected new information. (In a repetitive process there is the question of how to reset the arrive count to 0 for the next communicate step. This will arise in a number of programming examples and will also be discussed in section 8.) However, often we need not invoke bucket operations, but cause inter process communication using tuples only. A homogeneous parallel system usually requires a number of system wide functions in which all processors participate, such as Reduce to compute the total of N contributed values, Broadcast, Scatter, Gather, PrefixSum, etc. The RdTuple operation can be used to specify such functions, e.g.

RdTuple (Bucket.Tuple: ?x)

broadcasts the same value to local locations x in each process, the value having been put into the tuple by one process via RdTuple(.., | =x).

RdTuple (Bucket.Tuple: | +1 | N, | +x | ?result)

sums x from N processes and assigns the total to each process,

RdTuple (Bucket.Tuple: | +1 | N, | = If ?<x => x | ?result)

extracts the maximum of an array distributed across nodes, and

RdTuple (Bucket: [i] | =x)

places x received from the ith process into the ith position of the bucket. Note that the absence of a tuple ID indicates an operation on the bucket content, and the index field [i] matches a particular element of the bucket. Like other tuples, the bucket content tuple can have additional fields as locking or content status signals, though none are used in this operation. Then

RdTuple (Bucket: [i] | ?x)

does the reverse. Finally, the more obscure

RdTuple (Bucket.Tuple: i | +1, | +x | ?result)

successively adds each x contributed by process i, and hands over the sum up to that point, thus performing the prefix sum operation. Note that +1 is synchronized with +x: waiting for i occurs earlier, and retrieving the current prefix sum value occurs later. To perform a matrix multiplication C = AB with each node holding a row of each array, we need a shared Array bucket with an array access control Tuple, and an additional Result tuple; in each outer iteration j, node j initiates the scatter of the j-th row of matrix A, A[j, 1..N], which is stored at that node as a vector A, across the N nodes: node j executes
Array = A

to put its row of A into the local copy of the Array bucket, and then executes

OutBucket (Array: -, | =j)

sending it to the home bucket so that everyone can share its content, and each node i executes

RdTuple (Array: [i] | ?x, j)

to receive a particular element. In the inner iteration k each node i performs

RdTuple (Array.Result: | +1, | +x*B[k])

with node j performing in addition

RdTuple (Array.Result: N, ?C[k])

to receive the result, and

RdTuple (Array.Result: N | =0, | =0.0)

to reset the result tuple for the next iteration. So the whole program is

If i==1 Then RdTuple (Array.Result: | =0, | =0.0, | =N+1);
For j = 1 To N Do
{ If i==j Then
  { Array = A;
    RdTuple (Array.Result: 0, 0.0, N+1 | =1);
    OutBucket (Array: -, | =j) }
  RdTuple (Array: [i] | ?x, j);
  For k = 1 To N Do
  { RdTuple (Array.Result: | +1, | +x*B[k], k);
    If i==j Then RdTuple (Array.Result: N | =0, ?C[k] | =0.0, | =k+1) } }
Note that there is an extra field in the Result tuple to control the inner iteration over k and to trigger the outer iteration when k exceeds N. In this program reinitialization of the Result tuple is simple because only one process uses the result; when it sees that all N processes have made their contributions, it retrieves the sum of products and re-initializes. When the result is shared by others, reinitialization must wait until everyone has read the result, and more care needs to be taken. We are now ready to look at some larger examples using buckets and tuples.
5.6 Using Tuple Locks in Parallel Algorithms
5.6.1 Gaussian Elimination
First a small but complete algorithm: Gaussian elimination on a multiprocessor system with each node holding one row of the matrix plus one element of the right hand side. This is named vector A, with A[i] for i = 1 to N+1, and each node has a bucket Pivot to receive the pivot row for each step of elimination, with the pivot chosen by the size of the leftmost element. The bucket has tuples Value and Index to broadcast pivot and root information to other nodes. Each node j executes:

If j==1 Then RdTuple (Pivot.Value: | =1, | =0, | =0.0);
For i = 1 To N-1 Do
{ x = Abs (A[i]);
  RdTuple (Pivot.Value: i, | +1 | N, | = If ?<x => x | ?result);
  If i==j Then RdTuple (Pivot.Index: | =i, | =i-1, -);
  RdTuple (Pivot.Index: i, | +1 | N, | = If x==result => j | ?p);
  If j==p Then
  { j = i; Pivot = A;
    OutBucket (Pivot.Value: | =i+1, | =i, | =0.0);
    Exit }
  Else
  { If j==i Then j = p;
    InBucket (Pivot.Value: i+1, -, -);
    x = A[i]/Pivot[i];
    For k = i+1 To N+1 Do A[k] = A[k] - Pivot[k]*x;
    Continue }
}
r = A[N+1];
For k = N Downto j+1 Do
{ RdTuple (Pivot.Value: k, | -1, ?x);
  r = r - A[k]*x }
x = r/A[j];
RdTuple (Pivot.Value: | =j, 0 | =j-1, x)
The program can be readily extended to perform a matrix inversion; instead of Ax = a, where x and a are N-element column vectors, we solve AX = U, where U is an NxN unit matrix. Thus, the A vector at each node contains 2N elements: N for a row of the A matrix, and a row of the unit matrix; and the back substitution requires each process to OutBucket a column of roots for the other nodes to use, instead of a single root in a tuple.
Each node j executes:

If j==1 Then RdTuple (Pivot.Value: | =1, | =0, | =0.0);
For i = 1 To N-1 Do
{ x = Abs (A[i]);
  RdTuple (Pivot.Value: i, | +1 | N, | = If ?<x => x | ?result);
  If j==i Then RdTuple (Pivot.Index: | =i, | =i-1, -);
  RdTuple (Pivot.Index: i, | +1 | N, | = If result==x => j | ?p);
  If j==p Then
  { j = i; Pivot = A;
    OutBucket (Pivot.Value: | =i+1, N | =i, | =0.0);
    Exit }
  Else
  { If j==i Then j = p;
    InBucket (Pivot.Value: i+1, -, -);
    x = A[i]/Pivot[i];
    For k = i+1 To N*2 Do A[k] = A[k] - Pivot[k]*x;
    Continue }
}
r = A[N+1..N*2];
For k = N Downto j+1 Do
{ InBucket (Pivot.Value: k, | -1);
  r = r - A[k]*Pivot }
r = r/A[j]; Pivot = r;
OutBucket (Pivot.Value: | =j, 0 | =j-1)
5.6.2 Prime Numbers
It is simple to write a sequential program to find prime numbers using the sieve:

For i = 2 To N Do Prime[i] = T;
For i = 2 To Sqrt(N) Do
  If Not Prime[i] Then Continue
  Else For k = i*i To N By i Do Prime[k] = F;

but now suppose we have up to N processors, each performing the sieve on one section of N integers. On processor j, Prime[i] indicates whether jN+i is a prime number. Processor 0 has the responsibility of supplying the primes up to N for the others to use in their sieve, sending the numbers through a shared bucket; that is, it acts as the writer while the others are readers. The bucket size n needs to be large enough to contain all the primes up to N, plus an extra 0; it increases with N at a rate that is nearly, but slightly slower than, linear.
Processor 0:

For i = 1 To n Do Bucket[i] = 0;
RdTuple (Bucket.Tuple: | =0);
Ready = 0;
For i = 2 To N Do Prime[i] = T;
For i = 2 To Sqrt(N) Do
  If Not Prime[i] Then Continue
  Else
  { Ready = Ready+1; Bucket[Ready] = i;
    RdTuple (Bucket.Tuple: | =Ready);
    For k = i*i To N By i Do Prime[k] = F }
For i = Sqrt(N)+1 To N Do
  If Not Prime[i] Then Continue
  Else { Ready = Ready+1; Bucket[Ready] = i; }
RdTuple (Bucket.Tuple: | =Ready)

Processor j, j = 1..N-1:

For i = 1 To N Do Prime[i] = T;
Done = 0;
For i = 1 Do
{ InBucket (Bucket.Tuple: ?>Done);
  For ii = 1 Do
  { Done = Done + 1; d = Bucket[Done];
    If d==0 Then Exit
    Else For k = If d*d > j*N Then d*d - j*N Else d - Mod(j*N,d) To N By d Do
           Prime[k] = F } }
In other words, the bucket is retrieved if the number of available primes shown in the tuple exceeds the number already processed by the node, and the node keeps obtaining new primes from the bucket until it runs out of small primes to sieve with, indicated by a 0 in the bucket, at which point it tries to get a new bucket. While Processor 0 is still doing its own sieving, it supplies each new prime it finds to the others without delay, but because the others are busy they may not access the bucket every time it changes. After its sieving is finished, Processor 0 puts all the primes between Sqrt(N)+1 and N into the bucket in one go. Note that in this problem the data structure in the bucket is simple and the usual reader-writer exclusion need not be enforced.
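For reference, the sequential sieve that Processor 0 runs over the first N integers can be written in a few lines of C; this rendering is ours and only restates the base algorithm that the parallel version distributes across processors:

    #include <stdbool.h>
    #include <math.h>

    /* On return, prime[i] is true exactly when i is prime, for 2 <= i <= N. */
    void sieve(bool prime[], int N)
    {
        for (int i = 2; i <= N; i++)
            prime[i] = true;
        for (int i = 2; i <= (int)sqrt((double)N); i++) {
            if (!prime[i])
                continue;                       /* i is composite: nothing to do */
            for (int k = i * i; k <= N; k += i)
                prime[k] = false;               /* strike out the multiples of i */
        }
    }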
5.6.3 Fast Fourier Transform
The FFT algorithm factorizes an NxN Fourier transform matrix into logN sparse matrices, each containing just two non-zero elements per row, thus reducing the number of multiplications to 2NlogN and additions to NlogN. However, to keep the structure of the computation simple, the algorithm produces the results in "bit reverse" order, requiring a re-ordering of the output vector by moving the element in position i = abc... to position i' = ...cba, where a, b, c, etc. are the binary digits of index i. Suppose the elements are stored in m buckets of size N; then computation proceeds with each bucket j combining its elements with bucket j+m/2, j+m/4, etc., followed by in-bucket computation, after which bucket j exchanges elements with bucket j+m/2, j+m/4, etc., in groups of size 1, 2, etc., with odd groups from the upper bucket changing position with even groups of the lower bucket, while group size < sqrt(m). There are altogether log m compute steps and (log m)/2 reorder steps. Note that each process owns one bucket but shares several buckets of other processes according to different index relations at various steps of computation and element exchanges.

% compute step between buckets:
gap = m/2; same = F; group = 0;
For k = 2*m Do
{ If k >= m Then
  { If Not (same) Then Bucket = A;
    OutBucket (Bucket[j-gap].tuple: | =gap);
    InBucket (Bucket[j-gap].tuple: -gap);
    k = k-m; same = T;
    group = group*2 + 1; }
  Else
  { RdTuple (Bucket[j].tuple: gap);
    minrow = j*N;
    E = e(group*gap/m);
    For row = 0 To N-1 Do
    { x = A[row] + E*Bucket[row];
      Bucket[row] = A[row] - E*Bucket[row];
      A[row] = x }
    same = F;
    RdTuple (Bucket[j].tuple: | =-gap);
    group = group*2 }
  k = 2*k;
  If gap==1 Then Exit;
  gap = gap/2 }

% compute steps within bucket:
For gap = N/2 Do
{ For l = 0 To N-1 By gap Do
  { E = e(2*(j+l/N)/m);
    For row = l To l+gap-1 Do
    { other = row+gap;
      x = A[row] + A[other]*E;
      A[other] = A[row] - A[other]*E;
      A[row] = x } }
  If gap==2 Then Exit;
  gap = gap/2 }

% reorder steps between buckets in groups:
gap = m/2; group = 1; same = F;
For k = 2*j Do
{ If k >= m Then
  { If Not (same) Then Bucket = A;
    OutBucket (Bucket[j-gap].tuple: | =gap);
    InBucket (Bucket[j-gap].tuple: -gap);
    A = Bucket; same = T;
    k = k-m }
  Else
  { RdTuple (Bucket[j].tuple: gap);
    For l = group To N-1 By 2*group Do
      For row = l To l+group-1 Do
      { x = A[row];
        A[row] = Bucket[row-group];
        Bucket[row-group] = x }
    RdTuple (Bucket[j].tuple: | =-gap);
    same = F } }
  group = group*2; gap = gap/2;

% reorder within bucket - this is for m < N;
  gap = gap/2; group = group*2;
  If group >= gap Then Exit; }

% reorder whole buckets - this is for m >= N;
If group >= m Then
{ For ij = mod (j, gap*2), same = F Do
  { If ij == gap Then
    { If Even (ij) Then
      { If Not (same) Then Bucket = A;
        OutBucket (Bucket[j-gap+1].tuple: | =gap);
        InBucket (Bucket[j].tuple: gap);
        A = Bucket; same = T }
      ij = (ij-gap)/2 }
    Else
    { If Odd (ij) Then
      { RdTuple (Bucket[j].tuple: gap);
        For l = 0 To N-1 Do
        { x = Bucket[l]; Bucket[l] = A[l]; A[l] = x }
        RdTuple (Bucket[j+gap-1].tuple: | =gap);
        same = F }
      ij = ij/2 } }
  gap = gap/2 } }
5.6.4 Heap Sort
A heap sort starts by organizing N elements into a binary tree, in which each element is larger than its left and right successors; hence, each element is larger than all elements below it, and the largest element is at the root. This is the heap making phase. Afterwards, the heap is successively reduced in size by removing the root and replacing it with the element at position k, k = N-1, N-2, ..., each time moving the new root down and larger elements up until the heap property is restored. Since the removed element is always the largest in the heap, the final result is a sorted array with the largest element at N-1 and the smallest at 0. When m buckets of size N each are used, bucket 1 contains the top part of the heap, with each leaf element linking to two sub heaps in two successor buckets; the interior elements have successors within bucket 1 itself. In buckets 2 to m, the leaf elements have no successors. In the Make Heap phase, each recursion calls the subroutine MH to make the left and right sub heaps, and compares the two sub heap roots with the parent element, putting the largest at the parent node and, if necessary, calling the Remake Heap function for the subheap whose root has been swapped with the parent element. Whenever we reach an element which is a leaf of bucket 1, the tuples for the left and right buckets are used to obtain the two subheap roots as well as to return the new subheap roots. In the heap reduction phase, elements are exchanged between bucket 1 and the tail bucket, and the new element at the bucket 1 root causes the Remake Heap function to be repeatedly invoked; when the tail bucket has performed N swaps with bucket 1, bucket Tail-1 becomes the next tail bucket for further exchanges, until all the non-root buckets contain sorted elements and exchanges occur within the root bucket.

MH (i):
x = A[i]; ii = i+i; ii1 = ii+1;
If i A[ii1] Then
{ If x < A[ii] Then { A[i] = A[ii]; A[ii] = x; RH (ii); } }
  x < A[ii1] Then { A[i] = A[ii1]; A[ii1] = x; RH (ii1) } }
bucket==0 Then
{ ii = ii-N; ii1 = ii1-N;
  RdTuple (A[ii].Bucket: T, ?y); RdTuple (A[ii1].Bucket: T, ?z);
  If y > z Then { If x < y Then { A[i] = y; RdTuple (A[ii].Bucket: | =F, | =x) } }
  x < z Then { A[i] = z; RdTuple (A[ii1].Bucket: | =F, | =x) } }

RH:
x = A[i]; ii = i+i; ii1 = ii+1;
If i > A[ii] > A[ii1] Then
{ If x < A[ii] Then { A[i] = A[ii]; A[ii] = x; RH (ii); } }
  x < A[ii1] Then { A[i] = A[ii1]; A[ii1] = x; RH (ii1) } }
bucket==1 Then
{ ii = ii-N; ii1 = ii1-N;
  If ii1 z Then { If x < y Then { A[i] = y; RdTuple (A[ii].Bucket: | =F, | =x); Return (ii) } }
  x ii==Tail Then { RdTuple (A[ii].Bucket: T, ?y);
    If x < y Then { A[i] = y; RdTuple (A[ii].Bucket: | =F, | =x); Return (ii) }
}
bucket > 0: interior = N/2; Tail = Nil; NR = Nil; MH (1); For loop = 0 Do
}
{ RdTuple (A[i].Bucket: | =T | F, | =A[i] | ?x);
  If Not (Null (x)) Then { A[1] = x; RH (1); }
  If Null (x) OR Not (Null (NR)) Then
  { If Null (NR) => NR = N; x = A[NR];
    If NR==1 Then
    { RdTuple (A[root].Bucket: T | =F, ?A[1] | =x, | =Tail-1); Exit }
    Else
    { RdTuple (A[root].Bucket: T | =F, ?A[NR] | =x, -);
      NR = NR-1; interior = NR/2 } } }

bucket==0: interior = N/2; Tail = N+1; MH (1);
For loop = 0 Do
{ RdTuple (A[Tail].Bucket: | =T | F, | =A[1] | ?A[1], | ?Tail);
  j = RH (1);
  If Tail==0 => Exit;
  If j t Then RdTuple (A[Tail].Bucket: | =T, | =Nil); t = Tail }
For NR = N Downto 2 Do
{ x = A[1]; A[1] = A[NR]; A[NR] = x;
  interior = NR/2; RH (1) }
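The in-bucket part of the Remake Heap step is the familiar sift-down operation; a compact C version (ours, for a 1-based heap held entirely in one array, without the cross-bucket tuple accesses of the program above) is:

    /* Move the element at position i down, swapping with the larger child,
     * until the heap property A[parent] >= A[child] holds again. */
    void remake_heap(int A[], int n, int i)
    {
        while (2 * i <= n) {
            int child = 2 * i;
            if (child + 1 <= n && A[child + 1] > A[child])
                child++;                        /* pick the larger child  */
            if (A[i] >= A[child])
                break;                          /* heap property restored */
            int t = A[i]; A[i] = A[child]; A[child] = t;
            i = child;                          /* continue moving down   */
        }
    }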
5.7 Tuples in Objects
5.7.1 Objects and Buckets
A bucket is simply a block of memory and can be used to store any data structures including objects, and a large object may spread over a number of buckets. Hence, the concepts of object and bucket may be treated as orthogonal, and for execution efficiency most systems would see page size buckets as most natural. However, there is also merit in imposing a connection between buckets and objects, because this
leads to simple rules of bucket and tuple visibility: if a bucket must be an object, then the tuple locks of a bucket are visible only to threads whose scopes include the object, provided the tuples have been declared as public; otherwise, only processes executing inside the object can see the tuple. In any case, two threads which can both see the same tuple can synchronize with each other using the tuple, and rules on what inter process communication is permitted can then be naturally derived from object scoping rules. Since objects can be defined in complex parent-child hierarchies, matching the stages of parallelism to the hierarchical object-bucket structure is likely to be an important part of the algorithm design, further complicated by the need to have shared storage and communication efficiency during runtime. By checking the way objects engage in information exchange via a particular bucket/tuple, the compiler can help the runtime system to distribute objects to the right nodes so that remote storage operations and message transfers are minimized. A number of related issues arise in using objects in parallel programming, such as object atomicity requiring a shared object to behave like a monitor, and the connection between object and process spawning (active versus passive objects). There are also issues on how the manner of using objects changes with objects in shared memory or distributed objects, and with homogeneity and heterogeneity. For example, a homogeneous algorithm is likely to require arrays of objects of the same class, so that processors execute the same or similar steps on similarly structured data content, whereas a heterogeneous algorithm is more likely to have a hierarchical object structure with more varied inter-object relations and objects created and deleted in a more dynamic fashion. Again, the basic thinking on this issue needs to be sorted out in light of further developmental work. To illustrate the use of tuple locks attached to objects, we present below an example, the medical reception room, which is one of the four so-called Salishan problems posed by a parallel programming group meeting held at a resort by that name. Each of these problems tests the capability of a parallel programming language to capture a particular aspect of parallelism. Problem 1, Hamming Numbers, requires three processes producing results at different rates but maintaining a loose synchronism with each other; problem 3 involves breaking up a large molecular bonding structure into smaller chunks and then recombining the results; while problem 4, Skyline Matrix, is a Gaussian elimination program that takes sparseness into consideration. Each of the other three can be programmed using tuples as the communication mechanism but without involving objects. Problem 2 is however clearly object oriented.
5.7.2 An example
In the Reception Room problem, a set of patient objects and doctor objects interact via a reception room queue: when a patient arrives, if all doctors are busy, the patient waits in the queue; when a doctor becomes free a patient is taken off the queue for treatment; if a free doctor finds no waiting patient, he/she waits in the queue until a patient arrives.
{ DefClass Reception;
  Public Tuple Queue : Boolean, Integer, Integer, Pointer;
  ... }

{ DefClass Patient;
  RdTuple (Lobby.Queue: T | =F, ?Doc, -, -);
  If Doc==0 Then
  { RdTuple (Lobby.Queue: | =T, -, | +1, -);
    RdTuple (Lobby.Queue: -, -1 | =-2, -, ?MyDoctor | =.) }
  Else
  { RdTuple (Lobby.Queue: -, -, | =-1, | =.);
    RdTuple (Lobby.Queue: | =T, | -1, -2 | =0, ?MyDoctor) }
  ..speak to MyDoctor..
  Exit }

{ DefClass Doctor;
  For i=0 Do
  { RdTuple (Lobby.Queue: T | =F, -, ?Pat, -);
    If Pat==0 Then
    { RdTuple (Lobby.Queue: | =T, | +1, -, -);
      RdTuple (Lobby.Queue: -, -, -1 | =-2, ?MyPatient | =.) }
    Else
    { RdTuple (Lobby.Queue: -, | =-1, -, | =.);
      RdTuple (Lobby.Queue: | =T, -2 | =0, | -1, ?MyPatient) }
    ..speak to MyPatient.. } }

Lobby = Reception;
For j = 1 To n Do Exec Room[j] = Doctor;
For j = 1 Do { Delay (Random); Exec Patient }
A call on a class returns a pointer to a new instance, i.e., after Lobby = Reception; Lobby points to a reception object. This contains a queue tuple made up of a read enable flag, a doctor count, a patient count, and a doctor or patient object pointer. A total of n doctor objects are generated at the start, pointed to from n Room pointers, while patient objects are generated at random intervals without
their pointers being retained as variables (i.e., only they themselves know where they are), but Exec attaches a process to each object so that it would not be garbage collected. If, upon checking the Lobby tuple, a doctor object finds that the patient number is 0, it waits for a patient arrival, which is signalled by the patient number being set to -1; this comes associated with a patient pointer. The dequeued doctor object then passes its own pointer, denoted by ".", through the tuple with the signal -2, after which the tuple access flag is returned to T. After a patient finishes talking to MyDoctor, the object exits so that its process terminates and its space may be garbage collected, while the freed doctor object goes back to Lobby to look for another patient. If it finds the patient number to be non-zero, it passes its own pointer "." to a waiting patient with the -1 signal, and waits for the -2 signal which comes with the patient pointer. The tuple operations of the patient are similar. The program makes use of the tuple request queue maintained by the tuple management routine, accepting its queuing policy. It is not difficult to provide one's own Enqueue and Dequeue functions so that there is greater control over queuing policies. The Pointer field of the Reception.Queue tuple would now contain the queue pointer. A patient object finding the Doctor number to be 0 would execute Enqueue (Pointer, .) before suspending via a RdTuple on its own tuple. Later a newly free doctor object would acquire a patient object pointer with a Dequeue and resume the patient via a RdTuple (MyPatient.Tuple: ...), passing over its own pointer to the patient object to enable it to take actions requiring the MyDoctor information. The Salishan problem specification does not prescribe how a patient object would speak with a doctor object. It is possible for a patient to take a passive stance:

RdTuple (.Tuple: | =F | T); Exit

while the doctor object takes various readings

a = MyPatient.Temperature;
b = MyPatient.Bloodpressure;
RdTuple (MyPatient.Tuple: | =T)
with the tuple operation waking up the patient object so that it can exit. It is also possible to envisage that the doctor object might delve into more complex patient information stored in a MyPatient.History object. Alternatively, the two objects could engage in a dynamic interaction via the MyDoctor.Tuple (which is .Tuple to the doctor object) and MyPatient.Tuple tuples. Note that the two tuples have different scopes: the doctor object can place information from within itself into .Tuple for the patient to retrieve, and it can retrieve patient information placed into MyPatient.Tuple by the patient object. Neither object can directly retrieve nonpublic information from the other, but can call a method in the other object to extract information.
We have seen that tuples can be used to control storage access on buckets and the execution of processes that use the buckets; we have also seen that tuples can be used to control the execution of objects. It is also meaningful to use tuples to control the access of objects with the appropriate RdTuple before and after access on object content, and on a distributed system, using InBucket and OutBucket to transfer object content between the home process, which originally created an object instance, and processes holding local duplicates. Note that since one refers to an object instance using a pointer, sharing processes must have declared duplicates of the same pointer. Where, as in the case of the patient object example, an active object without a pointer to it is created, then no duplicates are possible. What if a process modifies the structure, rather than just the content, of a shared object, even perhaps reassigning a shared pointer to an object of another class? Theoretically, this can be handled: when a process executes a RdTuple on the object whose home copy has been modified, it is told not just to invalidate its own copy, but also to re-allocate the space for the object to the new object template. Such operations are likely to be expensive, and also assumes that the process anticipates an object change and knows what tuples the new object contains. The issue is analogous to the need to make sure that all processes using a tuple have already accessed it before re-initializing, but technically more complex.
5.7.3 Reflective Objects
An object contains a packet of data together with methods that operate on them, and a method call may, in addition to returning a result to the caller, also modify the internal data, i.e., change the object state. This could change the object's response to future calls, and a call on the same method with identical arguments may return a different result. The behaviour of other methods can similarly be modified. We can therefore design objects to provide multiple public methods sharing information that evolves with calls from outside threads, resulting in reflective function families. Further, such objects may enclose sub objects, which may display context dependency, with sub object behaviour changing with its external environment. For a simple example of the kind of problem that requires such object design, we look at state space control theory: a control system has a vector of state variables x and an input vector y, which determine the next state variables and the output vector z via the linear equations

x(t) = Ax(t-1) + By(t)
z(t) = Cx(t-1) + Dy(t)

There are direct ways to deduce the differential equation equivalents of these matrix formalisms, including various special cases, but we would not go into this here. During execution, the reception of the input values and the production of the next state variables and output values must be synchronized, and this can be achieved using
tuple locks attached to the object. The methods that receive the input signals and modify the states are spawned to execute concurrently, using tuples to form iteration barriers. Another useful formalism arises from the Kalman filters, designed to produce an output that deviates from the desired values in accordance with a least square error criterion. This too involves the use of a set of matrix equations operating on the input and the state but again no details will be discussed here.
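For concreteness, one synchronized step of the state space update can be written out directly; the C sketch below is ours and, for simplicity, assumes the output vector has the same dimension as the state vector (in the object version described above, the per-row computations would be spawned as concurrent methods meeting at a tuple barrier between steps):

    /* One step of x(t) = A x(t-1) + B y(t), z(t) = C x(t-1) + D y(t)
     * for an n-state, m-input system. */
    void state_step(int n, int m,
                    double A[n][n], double B[n][m],
                    double C[n][n], double D[n][m],
                    double x[n], double y[m], double z[n])
    {
        double xnew[n];
        for (int i = 0; i < n; i++) {
            double s = 0.0, t = 0.0;
            for (int j = 0; j < n; j++) { s += A[i][j] * x[j]; t += C[i][j] * x[j]; }
            for (int j = 0; j < m; j++) { s += B[i][j] * y[j]; t += D[i][j] * y[j]; }
            xnew[i] = s;                 /* next state    */
            z[i]   = t;                  /* output vector */
        }
        for (int i = 0; i < n; i++)
            x[i] = xnew[i];              /* commit the new state after the whole step */
    }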
5.8 Towards Architecture Inclusive Parallel Programming
5.8.1 Parallel Tasking
One would think that putting parallel tasking features into a language is a simple, even trivial job: just annotate your program to indicate parts that can be executed independently, and get compilers to carry these intentions out. This is architecture inclusive, since if a number of identical or similar parts are so indicated, e.g., a loop body, then homogeneous parallelism results, while a more arbitrary structure would lead to heterogeneous parallelism. If the parts annotated as parallel share data globally, then a shared memory structure is needed; otherwise a distributed architecture would be suitable. It should be possible to agree on a standard, widely usable parallel tasking construct. But as things turned out, parallel tasking constructs have been as varied as inter process communication constructs, because people wished to combine parallel tasking with some related ideas; depending on which ideas were chosen and how they were added to parallel tasking, all kinds of constructs were invented, usually with some level of architecture dependence so that they are no longer inclusive. Take for example the widely used idea of the future: when a parallel thread is spawned, a pointer to its result is created, and another thread that attempts to retrieve the result via the future pointer suspends until the result is available. This appears to offer a synchronization mechanism in addition to a parallel tasking mechanism, since a number of threads can suspend on a commonly required future. However, futures can only be used to provide inter process communication in very restricted ways, and cannot replace IPC features generally. Hence, more conventional IPC mechanisms still have to be provided, and one gets embroiled in issues of how to combine their use with futures. In the meantime, the need to support future pointers efficiently restricts architectural and runtime system design choices, as futures are more natural on shared memory systems, while on distributed systems a future to message interface has to be constructed to support them. For another example, the traditional Linda tuplespace systems use the Eval command to spawn a number of parallel tasks, each returning a result that becomes one field of a tuple, which becomes ready when all these tasks end. By executing an In/Rd on this result tuple, other parallel threads achieve synchronization, like tasks suspending on a shared future. While this multiple tasking tool is more flexible than the single tasking future construct, it generates new questions like: are the
multiple tasks returning results to the same tuple homogeneous or heterogeneous? Since they start and stop together, there is homogeneity, but only a small number of parallel tasks are fired up with a tuple, and so the method is actually more suited to supporting heterogeneous parallelism. The same issue arises with the parbegin...parend construct used in CSP and OpenMP, since one can only specify a limited number of parallel blocks between the two boundaries; thus, though the blocks start and end together homogeneously, they cannot specify homogeneous parallelism on the scale it requires. Ideas of performance and optimization would often creep into parallel tasking, with the parallel tasking construct giving the programmer room to specify the exact mapping of threads of code onto a particularly linked set of processors, matching a similar set of data mapping constructs used in another, declarative part of the program. Such facilities are however rarely architecture inclusive, and indeed are usually difficult to recode for different machines. We have instead expressed preference for the more inclusive idea of specifying parallel parts of the code by simple annotations, declaring the architecture separately, and putting the optimization expertise into the particular compiler. Even ideas of making programs "architecture independent" could actually commit the programmer to particular architectures at the same time. For example, the summation program of subsection 1.3, designed to allow variable numbers of threads, and hence independent of the number of processors, can only work on shared memory machines, since each processor i must be able to obtain elements x[i+j*n] from the memory for any j and any n. So in trying to achieve one kind of architecture independence, one simultaneously commits to a particular architecture in another direction. In short, in considering parallel tasking, we need to resist the temptation to entangle it with other ideas, however nice. It simply remains to use a parallel annotation (Spawn, Par, or our own past choice, Exec) to indicate that a statement or block in a program may be executed as a separate thread:
Exec <statement> Sync
means that <statement> and the code following it may be executed in parallel. Hence

For i = 1 To N Do Exec { ... } Sync

spawns N threads to execute the body of the loop, but all threads synchronize at the end of the loop. Hence, both homogeneous and heterogeneous tasks can be accommodated.
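Under a conventional threads library, the compiler's job for this construct is essentially to package each annotated body as a thread and to join them all at the Sync point; a rough illustration (ours, using POSIX threads; it is not how the chapter's compiler necessarily works) is:

    #include <pthread.h>

    #define N 8

    static void *body(void *arg)
    {
        long i = (long)arg;
        /* ... the loop body for iteration i ... */
        (void)i;
        return 0;
    }

    int main(void)
    {
        pthread_t t[N];
        for (long i = 0; i < N; i++)          /* Exec: spawn one thread per iteration */
            pthread_create(&t[i], 0, body, (void *)i);
        for (long i = 0; i < N; i++)          /* Sync: wait for all of them           */
            pthread_join(t[i], 0);
        return 0;
    }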
To determine the data distribution, the normal scoping rules of the language would show what data the parallel threads are entitled to access, so that these may be packaged with the instructions of the thread into a dispatchable task. If the thread corresponds to an object or a subroutine defined at the outermost level of the program, then its data scope is self contained and is satisfiable locally. In this way, the shared memory/distributed memory requirements of the program can be determined. Note that no architecture information is invoked in the program directly. Such information may, however, be separately declared in another part of the program for each machine running the program, so that the compiler may work out the actual mapping of the threads onto processors and data onto the memory modules of the system.
5.8.2 Speculative Processing
Parallel tasks may be mandatory, meaning that we are certain their results are needed for the continuing execution of the program, or speculative, when we spawn off work that may or may not be necessary, in the hope that if a result is shown to be needed, we need not wait for it to be computed since it has already been produced speculatively. While wasted processing from unsuccessful speculation may arise, this is not a problem if a system has idle capacity, which may be used for computing results in advance and thus reduce the overall elapsed time of the program. A scheme for assigning decreasing priority to more speculative tasks is needed so that if the system is busy, only the mandatory tasks will be selected for execution. A simple way to do this is to use the If .. Then .. Else .. structure as the speculative construct:

If ... Exec Then ... Exec Else ...
which says the Then and Else parts will execute without waiting for the If part to finish first. Since we do not know whether the Then or Else part will be needed until the If part returns its result, spawning the Then and Else parts in parallel with the If part produces speculative processing; the Then and Else parts would have lower priority than the If part, while more deeply nested parts would have successively lower priorities still. In this way, the programmer can cause different levels of speculation that meet his priority requirements and adapt to runtime conditions. A question arises regarding the assignments and tuple operations of speculative tasks: they should not become visible to other threads until speculation is confirmed, but in order to determine whether to confirm a speculative task, we may want to first see what its results are; in other words, the If part may need to see the results of the Then and Else parts before deciding whether to choose one or the other by returning a True or False. This dilemma is resolved by requiring each speculative
task to correspond to an object, and allowing the If part to see the public portions of the Then and Else objects:

If ... Exec Then A (...) Exec Else B (...)
where A and B are defined classes, so that invoking them creates an instance of each type, which may be referred to from the If part using the object pointers Then and Else; therefore, the If part may use Then.x to inspect a public variable defined in class A, or Else.y to call a public method y defined in class B, in order to see what the Else part is doing before deciding whether to return a T or F. If we execute

z = If ... Exec Then A (...) Exec Else B (...)
we would have two objects executing in parallel with the Boolean expression in If, and z will point to one of them depending on whether the Boolean returns T or F, possibly after looking into the two objects before making the choice. Speculative processing is inherently heterogeneous: it assumes that some processors have less work to do than others and become idle, so that their capacity can be used speculatively. The need to make objects out of speculative tasks, with limited visibility into their contents, also makes them more amenable to distributed architectures. Thus, the class of speculative algorithms is not architecture inclusive; we have merely deployed the architecture inclusive annotation of Then and Else parts for parallel tasking in order to invoke architecture related speculative processing. In short, the same philosophy has been followed.
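A minimal sketch of the idea, assuming a Python thread pool and ignoring both the priority scheme and the object-based visibility rules described above; the function and argument names are illustrative only.

# Hypothetical helper (not the chapter's construct): evaluate the condition and
# both branches concurrently; once the condition is known, return the chosen
# branch's result and discard the other. Unlike a real speculative runtime,
# the pool still waits for the losing branch to finish rather than cancelling it.
from concurrent.futures import ThreadPoolExecutor

def speculative_if(cond_fn, then_fn, else_fn):
    with ThreadPoolExecutor(max_workers=3) as pool:
        cond = pool.submit(cond_fn)      # mandatory work
        then_f = pool.submit(then_fn)    # speculative
        else_f = pool.submit(else_fn)    # speculative
        return then_f.result() if cond.result() else else_f.result()

# Usage (all three arguments are placeholder callables):
# z = speculative_if(lambda: slow_test(), lambda: build_a(), lambda: build_b())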
5.8.3 Efficient Implementation of Tuple Operations
In both shared memory and distributed systems, the cost of acquiring and modifying a tuple is little more than that of a lock/unlock: either one enters a monitor in charge of a group of buckets in order to test and modify one of its tuples, or one sends a message to the node holding the home bucket, after figuring out the node ID from the bucket/tuple information supplied in the tuple operation, and waits for a reply. By distributing control of all the home buckets among a group of processors, we can minimize the overhead due to the sequentiality of tuple processing. Similarly, the cost of transferring buckets via In/Out operations is no greater than block transfers on a distributed system or page transfers on a DSM system. There is, however, a special kind of tuple operation overhead: it lies in the queuing up of tuple requests before they succeed in acquiring a tuple. Not only is it necessary to provide space with the bucket for the queued requests; whenever the tuple is modified, its new content must be compared with all the search keys in the
queued requests to determine if a previously unmet match is now satisfied. The tuple content must be delivered to all the successful requests, and if a successful request modifies the tuple, the new content must once again be matched against the remaining requests. This is why "overusing" a tuple is highly undesirable: a tuple receiving many different requests is bound to have a long wait queue, requiring an extended search with each tuple modification, and once a long queue arises, it takes considerable time and processing before all the requests can be satisfied and removed. For example, when a tuple is used for barrier synchronization, with each process executing

RdTuple (Bucket.Tuple: | +1 | N)

there will in the end be N-1 tuple requests waiting for the value N to appear. This is however not a serious problem, as all the requests have the same match key, and with each increment a single comparison is sufficient; when the key finally matches, all the requests can be removed one after another, as none of them modifies the tuple, so that they all match and can be satisfied with the same tuple. The more difficult problem occurs when using a tuple for the prefix sum

RdTuple (Bucket.Tuple: i | +1, | +x | ?sum)

with each request waiting for its own index to appear in the tuple, before incrementing the index, adding its own element to the sum in the tuple and copying back the new sum. Each waiting request would have a different match key and a long queue search would be needed, repeatedly. By organizing all the requests in a heap (priority queue), the overhead could be minimized, but it remains high in comparison with most tuple operations. One way to avoid this is to distribute the process, perhaps using recursive divide and conquer to find a logN algorithm, and in the particular case of a hypercube machine, with communication steps between neighbouring nodes only. For the above example, we could divide the N processes into m groups of size n, and have each member node execute

RdTuple (groupID.Tuple: i | +1, | +x | ?sum)

after which the leader of each group, which controls the group home bucket and holds the rightmost element of the group prefix sum, executes

RdTuple (Global.Tuple: Group | +1, | +sum | ?total)
RdTuple (groupID.Tuple: | =0, | =total)

to bring back the inter group prefix sum, while each member executes

RdTuple (groupID.Tuple: 0, ?total)
sum = sum + total
to add the prefix sum from all the groups to the left to its own intra group prefix sum. The logN algorithm for a hypercube is more complex, but can be produced in a fairly straightforward fashion. We claim that, while some reprogramming is needed to adapt the tuple and bucket code to different architectures, such as shared memory or distributed, as well as to different interconnections, the adaptation is not difficult to carry out. Further, the process could also be automated as a form of compiler optimization, with the application programmer providing architecture inclusive statements like the above RdTuple command, the user who runs the program on a particular machine providing architecture descriptions, and the compiler modifying the code to fit the system. On systems where system wide functions like barrier or prefix sum are provided as hardware operations, the compiler can recognize the tuple equivalent of such functions and compile the tuple operations given in the program accordingly. Discussing the details of such compiler optimization would take us rather far from the themes of this book and into such issues as architecture specification notations, system programming and compiler design; so we would only briefly point out that some relevant past work is likely to be found in the data mapping declarations used in High Performance Fortran and similar work, as the notation for such declarations reflects memory as well as processing capacity distributions in a system. There is also a need to design the right bucket configuration to optimize data distribution and runtime processing in tuple and bucket programming, and possibly part of this can also be automated with compilers utilizing architecture declarations and data mapping statements. Some developmental experience will be needed before more definitive ideas can be formulated for general cases.
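For concreteness, here is a small sequential Python sketch of the same two-level decomposition (in the scheme above each step is a tuple operation executed by the processes themselves; the function name and group size here are illustrative only).

# Step 1: each group computes its own prefix sums; step 2: the group totals are
# prefix-summed across groups, and each member adds the total of all groups to
# its left, mirroring the leaders' Global.Tuple exchange above.
def grouped_prefix_sum(x, m):
    groups = [x[i:i + m] for i in range(0, len(x), m)]
    intra = []
    for g in groups:                     # intra-group prefix sums
        s, pre = 0, []
        for v in g:
            s += v
            pre.append(s)
        intra.append(pre)
    offset, result = 0, []
    for pre in intra:                    # add the inter-group offsets
        result.extend(p + offset for p in pre)
        offset += pre[-1]                # the group leader's rightmost element
    return result

assert grouped_prefix_sum([1, 2, 3, 4, 5, 6], 2) == [1, 3, 6, 10, 15, 21]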
5.8.4 Tuple and Bucket Programming Styles
The provision of tuples and buckets permits information sharing in a standard way: tuples are accessed using RdTuple commands to verify that the shared bucket is in the right condition for processing, and InBucket/OutBucket commands are used to bring back private copies of a block and to return results produced in these copies; the same mechanism works whether the copies exist in a shared memory with the home bucket, or come across a communication channel of a distributed memory. By changing the content of the control tuple with the Rd/In/Out commands, one also controls whether the tuple/bucket commands of other processes will succeed in accessing the shared information, achieving exclusion, synchronization, monitors and other inter-process relations. However, though the same notations are used, the programming styles are not independent of architecture. In a shared memory system, private copies are not usually necessary; read-only processing can be performed directly on the home bucket, and even writes can be done this way with the appropriate locks. Only in rather special situations would it be necessary to obtain a private copy: either to work on an old copy while the home bucket is being modified by others, or to carry out processing that makes changes to the bucket content for private purposes only
so that the changes need not be given to others, or to reduce memory contention for a heavily shared home bucket. It would be rare for two processes each to hold a private copy, modify it, and then try to merge the changes, since it is not trivial to define what the final result ought to be when changes overlap; if all changes are meant to be shared, then both should simply work on the shared copy, setting the tuple values in such a way that non-conflicting changes (such as queuing and dequeuing operations on a queue neither filled nor empty) take place concurrently, while conflicting changes are done sequentially. In a distributed system, one has to obtain a private copy first before processing can occur. Hence, InBucket/OutBucket operations are performed, whereas in a shared memory system just RdTuple would be sufficient for an exclusive read and modify operation. Concurrent writes on a shared bucket are done only for quite specific algorithms with proven results. In a distributed shared memory system things are different yet again. Each processor would have a local copy of a shared bucket, but this is not a "private" copy, since the operating system would reflect all changes performed on the local copy in the home bucket whenever another process performs a successful RdTuple or InBucket operation. That is, all writes are shared, but they are not shared immediately as in a shared memory system. A process that has an old, unmodified copy can continue using its content without being affected by writes elsewhere, as long as it does not perform any Rd/In operations on the bucket which may cause invalidation of the local copy. It is not necessary to create another local copy in order to preserve old contents if such care is taken. To illustrate the differences, consider the simple case of exclusive bucket processing. On a shared memory system

RdTuple (Bucket.Lock: Open | = Closed)
.. critical region ..
RdTuple (Bucket.Lock: | = Open)

and on a distributed system

InBucket (Bucket.Lock: Open | = Closed)
.. critical region ..
OutBucket (Bucket.Lock: | = Open)

On a DSM system, one would also use Rd/Rd instead of In/Out, but things are different from the shared memory system in that another process can still access its own copy containing the previous content of the bucket, at the same time the process in the critical region is putting in the new content, or even after the bucket has been unlocked, as long as it does not perform a Rd/In on the bucket. On a shared memory system the new content is immediately visible to others if they do not attempt to lock the bucket for exclusive access. If they do acquire the lock, then they can only access the version after the unlock, and thus cannot see the intermediate changes. In any case, the old content is not guaranteed.
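A minimal shared-memory sketch of the exclusive-access pattern above, with an ordinary Python lock standing in for the Open/Closed control tuple and a plain dictionary standing in for the home bucket; both names are placeholders.

# No private copy is made: the critical region operates directly on the home
# bucket, exactly as the shared memory RdTuple version above does.
import threading

bucket = {"data": []}
bucket_lock = threading.Lock()        # plays the role of Bucket.Lock

def exclusive_update(value):
    with bucket_lock:                 # RdTuple (Bucket.Lock: Open | = Closed)
        bucket["data"].append(value)  # critical region on the home bucket
    # lock released                   # RdTuple (Bucket.Lock: | = Open)

On a distributed system the same pattern would instead bring over a private copy, modify it, and write it back, as the In/Out version above indicates.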
Now take the more complex example of multiple synchronized readers/writers. Whereas the asynchronous reader/writer problem of section 5 has readers which wish to read the latest data only, now the readers wish to receive every new block produced by the writers, and all the reads/writes must be synchronized, so that while the writers are producing the next block the readers are reading the previous block, and all switch together after they finish. For the shared memory case:

reader
Bucket = 0;
For j = 1 To ... Do
{ RdTuple (Bucket.Tuple: j, | +1 | N)
  read Bucket;
  complement (Bucket) }

except that for reader 1 the tuple operation is different:

RdTuple (0.Tuple: | =0, -, -)
RdTuple (1.Tuple: | =0, -, -)
For j = 1 ...
{ RdTuple (Bucket.Tuple: | =j, | =1 | N)
i.e., it has the job of initializing the tuple before waiting for all the processes to synchronize using it. Note that we need two tuples in two buckets for synchronization: if I have passed one barrier, then the tuple used for the previous barrier must be free, as everyone must have passed the previous barrier already, and it is safe to re-initialize it, while the tuple for the current barrier should not be re-initialized immediately after I pass it, as I cannot be sure whether others have passed it too. Note that two tuples, .Value and .Index, were used in Gaussian elimination, with successful access of one followed by re-initialization of the other within each iteration.

writer
Bucket = 0;
For j = 1 To ... Do
{ write to Bucket;
  RdTuple (Bucket.Tuple: j, | +1 | N)
  complement (Bucket) }

That is, there are two buffers alternately used for writing and reading, and at the end of each cycle readers switch to the block written by the writers, and writers switch to the block just read by the readers. N is the total number of processes,
whether reader or writer, and when a barrier is passed, it means the writers have finished writing and the readers are ready to read. Now consider the same problem on distributed systems; it is necessary to use only one bucket, as everyone must make a local copy to process. First suppose that each OutBucket overwrites the home bucket, so that writers must execute exclusively, bringing over the current home bucket first. This is achieved by each writer setting the index in the tuple to 0 before writing and putting it back to j afterwards. Before starting, they also need to know that all readers have taken a copy of the previous block, in which case the last reader, noting that the reader count is R, would have re-initialized the tuple. Before starting to read, readers need to wait for all the writers to have done their writing and incremented the write count, so that the tuple contains the number W. Hence:

Reader
For j = 1 ... Do
{ InBucket (Bucket.Tuple: j, W, | +1 | ?n);
  If n==R Then RdTuple (Bucket.Tuple: | +1, | =0, | =0);
  read (Bucket) }

Writer
For j = 1 ... Do
{ InBucket (Bucket.Tuple: j | =0, -, 0);
  write (Bucket);
  OutBucket (Bucket.Tuple: | =j, | +1, 0) }

Next we assume that the system processes OutBucket by merging all the changes a writer made on its own copy into the home bucket, and if several writers OutBucket at the same time, the home bucket incorporates all their diffs. Readers need to wait for all the writers to do this before executing InBucket; hence

Reader
For j = 1 ... Do
{ InBucket (Bucket.Tuple: j, W, | +1 | ?n);
  If n==R Then RdTuple (Bucket.Tuple: | +1, | =0, | =0);
  read (Bucket) }

Writer
For j = 1 ... Do
{ InBucket (Bucket.Tuple: j, -, 0);
  write (Bucket);
  OutBucket (Bucket.Tuple: j, | +1, 0) }

That is, simply by not modifying the tuple, the writers' InBucket operations all succeed on the same bucket content, and the changes made by each will be incorporated during the OutBucket. The DSM version is similar to the shared memory version, except that on a DSM system it is again only necessary to have one bucket, and after a RdTuple, a node would be able to access the latest bucket content by referring to its virtual address, while a writer can modify the copy at its node without affecting the others, as the new content is not passed to the home node until a tuple operation is performed by a node. However, it is still necessary to use two tuples to handle re-initialization correctly.
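A minimal Python sketch of the shared memory double-buffered scheme presented earlier in this subsection, with a Barrier standing in for the counting tuple; the counts and buffer contents are illustrative only.

# All N = READERS + WRITERS participants meet at the barrier once per cycle:
# after the barrier, readers read the buffer the writers have just filled while
# the writers move on to the other buffer, so the two groups never collide.
import threading

CYCLES, READERS, WRITERS = 3, 2, 2
buffers = [[], []]
barrier = threading.Barrier(READERS + WRITERS)   # plays the role of the tuple count

def writer(wid):
    buf = 0
    for j in range(CYCLES):
        buffers[buf].append((wid, j))   # write to the current bucket
        barrier.wait()                  # RdTuple (Bucket.Tuple: j, | +1 | N)
        buf ^= 1                        # complement (Bucket)

def reader(rid):
    buf = 0
    for j in range(CYCLES):
        barrier.wait()                  # wait until all writers have written
        _ = list(buffers[buf])          # read the bucket just written
        buf ^= 1                        # complement (Bucket)

threads = [threading.Thread(target=writer, args=(w,)) for w in range(WRITERS)] + \
          [threading.Thread(target=reader, args=(r,)) for r in range(READERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()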
5.8.5 Back Towards Architecture Independence
We have thus illustrated the need to program in an architecture specific way, taking into consideration not only hardware differences, but also differences in bucket operations. This occurs because we used the tuples and buckets as parallel programming tools and tried to program in the most efficient way for an architecture, rather than as the basis of a common model. Now that we know the differences, we should try to bring everything together again. One difference between distributed and DSM systems is that the former requires the use of InBucket/OutBucket to move data from or to the home node of a globally defined bucket, but the equivalent effect is achieved if a RdTuple causes changes made on the local duplicate to be incorporated into the home bucket (OutBucket), and invalidates the local copy if it is different from the home bucket (InBucket), with the difference that, if the home bucket is modified before its pages are brought to the local node by misses, the content received would differ from the content at the time of the RdTuple. To ensure the equivalence between Rd and In/Out, it is necessary to impose the programming convention that the owner process of the home bucket does not write into it, which is usually satisfied because the home copy of the globally defined bucket is with the parent process that generates child processes to do the work on their local copies while it waits for them to complete, so that the results can be incorporated into the home copy; alternatively, home buckets can be defined in separate, dummy processes that only act as data repositories. With locally defined buckets, every process has a distinct bucket and In/Out are used to transfer information between them, regardless of whether the system is shared memory or distributed. We also saw that a major difference between DSM and shared memory systems is that when several writers concurrently modify a common bucket, they do not see other processors' writes on a DSM system until there have been tuple operations that cause incorporation of changes into the home bucket and invalidation of local
copies. If we impose the programming convention that concurrent writers write to separate areas of the shared bucket, and readers only read old data, until tuple operations are used to share the new information among all the parties, then again the distinction is covered up. Note that while other processes can request the incorporation of their writes into the home bucket with a RdTuple, the change may be prevented by the tuple being set to write-disable, until the current readers finish acquiring the current content and reset the tuple. However, for some algorithms the above programming convention might negate the basic advantage of distributed shared memory, which allows readers and writers the freedom to access their own copies. An alternative solution is to provide for the same freedom in a shared memory system: a writer may declare

Own Duplicate ..name..;
which causes a separate copy of the shared bucket/object to be allocated in the process's local virtual space, and content transfer between this copy and the home copy is performed by the tuple system with each RdTuple. Each writer is then free to modify its own copy, until it causes its changes to be incorporated into the home copy with a RdTuple, provided the home copy is write-enabled. On a distributed system Own Duplicate and Duplicate have the same effect, since each process must have its own copy of shared content. Now the multiple reader-writer program may be coded as:

writer
For j = 1 To ... Do
{ write to Bucket;
  RdTuple (Bucket.Tuple: j, | +1 | N) }

reader
For j = 1 To ... Do
{ RdTuple (Bucket.Tuple: j, | +1 | N);
  read Bucket }

since on both shared memory and distributed systems, a RdTuple causes new content to be incorporated and distributed when its match succeeds, and each writer has its own copy. The same program now runs on both. Another useful convention is to program in a homogeneous manner as far as possible. For example, if we have

For i = 1 To N Do procedure;

then on an SPMD machine the given procedure is simply started on N processors, while on a hypercube machine or a linked set of workstations the program would recursively spawn N processes in logN steps, as sketched below.
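A rough Python sketch of that recursive spawning, assuming plain threads (a compiler-generated version would spawn processes across nodes instead; the function name is illustrative only).

# Each call splits its index range in half and hands one half to a new thread,
# so the fan-out doubles at every level and N leaf calls are reached after
# about log N levels of spawning.
import threading

def spawn_range(lo, hi, procedure):
    if hi - lo == 1:
        procedure(lo)                    # leaf: run the loop body for index lo
        return
    mid = (lo + hi) // 2
    t = threading.Thread(target=spawn_range, args=(mid, hi, procedure))
    t.start()                            # hand the upper half to a new thread
    spawn_range(lo, mid, procedure)      # keep the lower half
    t.join()

spawn_range(1, 9, lambda i: print("procedure on index", i))   # N = 8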
Similarly, a system wide tuple operation like prefix sum would be compiled into a hardware function on a massively parallel machine, or into a logN-step recursive operation on machines with some kind of uniform interconnect. If we have a system with n nodes each containing a subgroup of m processors, then a two level distribution, between groups and then within groups, would result. In each case, the compiler can be assisted by system declarations describing the hardware configuration as well as the memory model features. So given the tuple and bucket mechanisms, the compiler facilities and the programming conventions, we would have a common, architecture inclusive framework, with the same application program plus the architecture description running on different systems. The programming conventions do not always work out for a particular algorithm. For example, in some multiple reader/writer problems one cannot be sure that readers would confine themselves to reading old parts or that writes are all non overlapping, and the algorithm that is guaranteed to work correctly on a shared memory system (using exclusive writers or separate read and write buckets) would not be optimal for DSM systems. Possibly we have to settle for programs made up of architecture independent segments, interspersed with code that has to be re-written for each system, but even this would be a useful step towards the elusive goal of architecture independence. In short, by accommodating both system wide, homogeneous interprocess communication and bilateral, heterogeneous communication in the tuples, using them as locks controlling buckets as part of a stronger memory consistency model, we have a framework that includes the various divergences that have prevented progress in parallel programming previously, and a starting point for a new effort towards architecture independence.
Acknowledgement This article was written during the author's sabbatical leave at Rice University, Houston, Texas for most of 2000. The support provided by NUS and Rice for this work is gratefully acknowledged.